What is Ilum?
Modular Data Lakehouse Platform for Cloud-Native Apache Spark Workloads
Ilum is an Apache Spark management platform designed for Kubernetes and Apache Hadoop Yarn environments. It provides enterprise-grade cluster orchestration, interactive Spark sessions via REST API, and seamless integration with modern data engineering tools, including Jupyter, Apache Airflow, MLflow, and the Delta Lake/Iceberg/Hudi table formats.
Key Capabilities
- Kubernetes-native Spark operator with automatic pod orchestration and resource management
- Multi-cluster management across cloud providers (GKE, EKS, AKS) and on-premise deployments
- Interactive Spark sessions accessible through REST API endpoints for building Spark-based microservices
- Apache Hadoop Yarn integration for hybrid cluster architectures
- Built-in S3-compatible object storage for cloud-native data lake architectures
- Horizontal scalability from single-node development to production clusters with hundreds of executors
Get started with Ilum → | View architecture documentation →
Ilum - Apache Spark on Kubernetes Platform
Ilum transforms Apache Spark cluster management by providing a unified control plane for Kubernetes- and Yarn-based Spark deployments. Unlike traditional Spark management approaches, Ilum treats Spark applications as first-class microservices with REST API interfaces, enabling real-time data processing architectures.
Architecture
Ilum's architecture consists of two core components:
- Ilum Core: Backend service providing gRPC/Kafka-based job orchestration, cluster state management, and REST API endpoints
- Ilum UI : Web-based dashboard for cluster monitoring, job submission, and resource visualization
The platform supports both Python (PySpark) and Scala, with native integration for the Spark SQL, Spark Streaming, and MLlib frameworks.
Data Lakehouse Capabilities
Ilum provides comprehensive support for modern table formats:
- Delta Lake: ACID transactions with time travel and schema evolution
- Apache Iceberg : Partition evolution and hidden partitioning for large-scale analytics
- Apache Hudi : Record-level upserts and incremental data processing
- Ilum Tables: Unified API abstraction across multiple table formats
Integration with the Hive Metastore and the Nessie catalog enables SQL-based metadata management and Git-like data versioning.
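As an illustration, catalog integration of this kind is typically wired through standard Spark properties. The sketch below configures an Iceberg catalog backed by Nessie; the catalog name (`nessie`), the Nessie URI, and the warehouse path are placeholder assumptions, not Ilum defaults:

```properties
# Enable Iceberg SQL extensions (assumes the Iceberg Spark runtime jar is on the classpath)
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

# Register a Spark catalog named "nessie" backed by a Nessie server (URI and paths are illustrative)
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://nessie:19120/api/v2
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3a://warehouse/
```

With this in place, tables created as `nessie.db.table` are versioned on the `main` Nessie branch, giving the Git-like semantics described above.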
REST API for Spark Microservices
Ilum exposes Spark functionality through RESTful endpoints:
```text
# Submit Spark job
POST /api/v1/jobs

# Query interactive session
POST /api/v1/sessions/{id}/execute

# Monitor job status
GET /api/v1/jobs/{id}/status
```
This enables building responsive data applications where Spark computations are triggered by HTTP requests, supporting use cases like:
- Real-time feature engineering for ML models
- On-demand data transformations via API
- Streaming analytics with REST-based controls
- Jupyter notebook execution through HTTP interface
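As a sketch of how the job-submission endpoint above can be driven from any HTTP client, the snippet below builds a request with the Python standard library. The payload fields (`clusterName`, `jobClass`, `args`) and the service URL are illustrative assumptions; the exact schema is defined in the API reference:

```python
import json
import urllib.request

# Placeholder service address, not an Ilum default
ILUM_URL = "http://ilum-core:9888"

def build_job_request(cluster: str, job_class: str, args: list) -> urllib.request.Request:
    """Build (but do not send) a job-submission request.
    Payload field names are assumptions for illustration."""
    payload = json.dumps({
        "clusterName": cluster,
        "jobClass": job_class,
        "args": args,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{ILUM_URL}/api/v1/jobs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_job_request("dev", "org.example.WordCount", ["s3a://bucket/input"])
# urllib.request.urlopen(req)  # uncomment to submit against a live deployment
```

Because the request is plain HTTP, the same pattern works from any language or orchestration tool, which is what makes Spark-as-a-microservice architectures practical.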
Multi-Cluster Orchestration
Ilum manages heterogeneous Spark clusters from a single control plane:
- Cloud clusters: GKE, EKS, AKS with auto-scaling groups
- On-premise clusters: Bare metal Kubernetes or Hadoop Yarn deployments
- Hybrid architectures: Mixed cloud and on-premise for data sovereignty requirements
Each cluster maintains independent resource quotas, storage backends, and security policies while sharing centralized monitoring and job scheduling.
Comparison with Alternative Solutions
| Feature | Ilum | Databricks | Cloudera |
|---|---|---|---|
| Kubernetes native | ✓ | Partial | Partial |
| Multi-cluster management | ✓ | Limited | ✓ |
| Vendor lock-in | None | High | High |
| REST API for sessions | ✓ | ✓ | Limited |
| On-premise deployment | ✓ | Limited | ✓ |
| Cloud deployment | ✓ | ✓ | ✓ |
| Yarn integration | ✓ | ✗ | ✓ |
Video Overview
Prefer a guided path? Build your first data product on Ilum in hours. Official course →.
Features
Spark Cluster Management
- Kubernetes Operator Integration: Native CRD-based Spark application deployment with pod lifecycle management
- Multi-cluster Control Plane: Centralized management for GKE, EKS, AKS, and on-premise Kubernetes clusters
- Horizontal Pod Autoscaling: Dynamic executor scaling based on CPU/memory metrics and queue depth
- Resource Quotas: Namespace-level limits for CPU cores, memory, and persistent volume claims
Interactive Computing
- REST API Endpoints: HTTP interface for Spark session creation, code execution, and result retrieval
- Jupyter Integration: Spark Magic kernels with automatic session binding and DataFrame visualization
- Apache Zeppelin Notebooks: Multi-language interpreters (Scala, Python, SQL) with paragraph-level execution
- Code Groups: Reusable Spark contexts shared across multiple notebook sessions
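Outside a notebook, the interactive-session endpoint can be driven directly over HTTP. The sketch below prepares an execute request with the Python standard library; the `code` payload field and the service URL are assumptions for illustration, not the documented schema:

```python
import json
import urllib.request

ILUM_URL = "http://ilum-core:9888"  # placeholder service address

def build_execute_request(session_id: str, code: str) -> urllib.request.Request:
    """Build (but do not send) a request that runs a code snippet
    in an existing interactive Spark session."""
    payload = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        f"{ILUM_URL}/api/v1/sessions/{session_id}/execute",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_execute_request("42", "spark.range(10).count()")
# urllib.request.urlopen(req)  # send against a live deployment
```

The same long-lived session can serve many such requests, which is what makes shared Spark contexts across notebook sessions possible.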
Storage & Data Formats
- S3-Compatible Object Storage: MinIO-based distributed storage with S3 API compatibility
- Table Format Support: Delta Lake, Iceberg, Hudi with ACID guarantees and schema evolution
- Catalog Integration: Hive Metastore, AWS Glue, Nessie for metadata management
- Distributed File Systems: HDFS, GCS, Azure Blob Storage, and S3 connectivity
Orchestration & Scheduling
- Built-in Scheduler: Cron-based job scheduling with dependency management
- Apache Airflow Integration: DAG-based workflow orchestration with Spark operators
- Kestra Support: Event-driven pipelines with Spark task execution
- dbt Core: SQL transformations with Spark as execution engine
Monitoring & Observability
- Spark History Server: Job timeline, stage metrics, and executor resource utilization
- Prometheus Integration: Custom metrics for application-level monitoring
- Grafana Dashboards: Pre-configured visualizations for cluster health and job performance
- Loki Log Aggregation: Centralized logging with Promtail collectors
- OpenLineage: Data lineage tracking for table-level dependencies
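As a sketch, Prometheus scraping of Spark applications can be switched on through upstream Spark 3.x settings; the values below are illustrative, not Ilum defaults:

```properties
# Expose driver metrics at /metrics/prometheus on the Spark UI port
spark.ui.prometheus.enabled=true

# Route metrics through Spark's built-in PrometheusServlet sink
spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus
```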
Security & Access Control
- RBAC Policies: Kubernetes-native role-based access with fine-grained permissions
- OAuth2/OIDC: Integration with Keycloak, Okta, Azure AD for authentication
- TLS/mTLS: Certificate-based encryption for inter-service communication
- LDAP/Active Directory: Enterprise directory service integration
- Network Policies: Pod-to-pod traffic restrictions and egress controls
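Pod-to-pod restrictions like these are expressed as standard Kubernetes NetworkPolicy objects. In the sketch below, the namespace, label selectors (`app: ilum-core`), and port are assumptions for illustration; `spark-role: driver` is the label Spark on Kubernetes sets on driver pods:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-spark-drivers
  namespace: ilum            # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: ilum-core         # assumed label on the core service pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              spark-role: driver   # label applied by Spark on Kubernetes
      ports:
        - protocol: TCP
          port: 9888               # assumed service port
```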
Explore full feature documentation → | Request new features →
Advantages
Cloud-Native Architecture
Ilum is designed as a cloud-native-first platform with containerized services, declarative configuration, and GitOps-compatible deployment:
- Helm Charts: Parameterized Kubernetes manifests for reproducible deployments
- Container Images: Official images for Spark 3.x with pre-installed connectors (S3, GCS, Azure)
- Custom Resource Definitions: Kubernetes API extensions for Spark application management
- Service Mesh Ready: Compatible with Istio/Linkerd for advanced traffic management
No Vendor Lock-In
Unlike proprietary platforms, Ilum provides:
- Open APIs: REST and gRPC interfaces following OpenAPI specifications
- Standard Protocols: JDBC/ODBC connectivity, S3 API compatibility, Kafka integration
- Portable Workloads: Spark applications run on any Kubernetes cluster without modification
- Multi-Cloud Support: Deploy across AWS, GCP, Azure without platform-specific dependencies
Hadoop Migration Path
For organizations migrating from Hadoop/HDFS ecosystems:
- Yarn Compatibility: Run existing Yarn-based Spark jobs without code changes
- HDFS Connector: Direct access to HDFS clusters during migration phases
- Hive Metastore: Reuse existing table metadata and partitioning schemes
- Incremental Migration: Gradual transition with hybrid Yarn/Kubernetes deployments
Performance Optimization
Ilum includes performance enhancements:
- Dynamic Allocation: Automatic executor scaling based on shuffle data and pending tasks
- Adaptive Query Execution (AQE): Runtime optimization for join strategies and partition coalescing
- Columnar Caching: Parquet/ORC in-memory caching with LRU eviction policies
- Network-Aware Scheduling: Pod placement considering data locality and network topology
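The first two optimizations map onto standard Spark settings. A minimal sketch, with illustrative executor bounds (not Ilum defaults):

```properties
# Dynamic allocation; shuffle tracking replaces the external shuffle service on Kubernetes
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=20

# Adaptive Query Execution: runtime join selection and partition coalescing
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
```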
Enterprise Integration
Built for enterprise data platforms:
- Apache Kafka: Native Spark Structured Streaming integration with exactly-once semantics
- Apache Airflow: Managed Airflow instances with Spark operators pre-configured
- MLflow: Model registry and experiment tracking for machine learning pipelines
- Superset/Tableau: BI tool connectivity via JDBC drivers and load balancers
Read architecture documentation → | View use cases →
Project Roadmap
Explore planned features and integrations:
- Flink Operator: Stream processing workloads alongside Spark batch jobs
- GPU Scheduling: CUDA-enabled executors for deep learning workloads
- Cost Attribution: Resource usage tracking with cloud billing integration
View full roadmap → | See changelog →
Additional Resources
- API Reference: REST API documentation for programmatic access
- Security Guide: Authentication, authorization, and network policies
- Production Deployment: Best practices for production clusters
- User Guides: Step-by-step tutorials for common workflows