Data Lakehouse

A Data Lakehouse is a modern data management architecture that combines the best features of traditional Data Warehouses and Data Lakes. It aims to pair the low-cost, flexible, and scalable storage of data lakes (typically on Cloud Object Storage) with the data structure, governance, reliability (ACID transactions), and performance features traditionally associated with data warehouses.

The core idea is to enable both traditional Business Intelligence (BI) analytics and data science/machine learning workloads directly on the same data stored in the lake, eliminating the need for separate, often redundant, data warehouse and data lake systems.

The Evolution: Bridging the Gap

Data Warehouses and Data Lakes emerged to solve different problems but also created new ones:

  • Data Warehouses: Good for structured BI reporting, but expensive, rigid (schema-on-write), and poor at handling unstructured/semi-structured data or supporting advanced analytics/ML directly. Often led to data silos.
  • Data Lakes: Cost-effective for storing all types of data (schema-on-read), flexible, and scalable. However, often suffered from reliability issues ('data swamps'), lack of transactions, poor query performance, and weak governance.

Maintaining both systems led to complex ETL pipelines copying data back and forth, increased costs, data staleness, and governance challenges. The Data Lakehouse architecture was proposed to unify these capabilities.

Key Enabling Technology: Open Table Formats

The practical realization of the Data Lakehouse is heavily reliant on Open Table Formats like Apache Iceberg, Delta Lake, and Apache Hudi. These formats act as a metadata and transaction layer on top of the raw files stored in the data lake (e.g., Parquet or ORC files on S3). They provide critical features directly on lake storage:

  • ACID Transactions: Ensure atomicity, consistency, isolation, and durability for operations like inserts, updates, deletes, and merges, preventing data corruption during concurrent operations or failures.
  • Schema Enforcement & Evolution: Define and enforce schemas for tables, while also allowing schemas to evolve safely over time without rewriting entire datasets.
  • Time Travel: Allow querying data as it existed at previous points in time (based on snapshots or versions), enabling auditing, reproducibility, and rollback capabilities.
  • Performance Optimizations: Implement techniques like metadata-driven file pruning, data skipping (using statistics), efficient partitioning (including hidden partitioning and partition evolution), and integration with compaction/Z-ordering to improve query performance.
  • Unified Batch & Streaming Support: Designed to handle both batch updates and streaming data ingestion concurrently and reliably.

By implementing these features directly on inexpensive cloud object storage, these table formats allow data lake storage to function much more like a traditional data warehouse in terms of reliability and performance, while retaining the lake's flexibility and cost advantages.
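For illustration, here is a minimal sketch using PySpark with the delta-spark package: it applies an ACID upsert (MERGE) to a Delta table stored as Parquet files on object storage, then reads an earlier version of the same table via time travel. The bucket path, table, and column names are hypothetical, and the same ideas carry over to Apache Iceberg and Apache Hudi through their own APIs.

# Minimal sketch: ACID upsert and time travel with Delta Lake on PySpark.
# The S3 path and schema are hypothetical; the Delta table is assumed to exist.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "s3a://my-bucket/lakehouse/orders"   # hypothetical table location

# ACID MERGE: upsert new and changed rows without corrupting concurrent readers.
updates = spark.createDataFrame(
    [(1, "shipped"), (42, "created")], ["order_id", "status"]
)
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it existed at an earlier version (snapshot).
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)
previous.show()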

Architecture Components

A typical Data Lakehouse includes:

  1. Storage Layer: Cloud object storage (S3, GCS, ADLS) as the primary repository.
  2. Table Format Layer: An open table format (Iceberg, Delta Lake, or Hudi) managing the data layout, metadata, and transactions.
  3. Metadata Layer: Often includes a metastore (like Hive Metastore, AWS Glue Data Catalog, or Iceberg's own catalog implementations) to provide a central catalog of tables.
  4. API Layer: Interfaces provided by the table formats for reading and writing data.
  5. Query/Processing Engine Layer: Various engines can interact with the same lakehouse tables via the table format APIs, including:
    • Batch engines (Spark, Flink Batch, Presto, Trino)
    • SQL engines for BI
    • Streaming engines (Spark Streaming, Flink SQL, RisingWave)
    • Python libraries (Pandas, Polars) accessing data via table format libraries such as PyIceberg (see the sketch after this list)
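To make the layering concrete, the following sketch reads a lakehouse table directly from Python with PyIceberg: a catalog (metadata layer) resolves the table name, the table format layer plans a scan over snapshots and column statistics, and only the matching Parquet files on object storage are fetched. The catalog endpoint, connection properties, and table identifier are hypothetical placeholders.

# Minimal sketch: querying an Iceberg lakehouse table from Python via PyIceberg.
# Catalog properties and the table name are placeholders for illustration.
from pyiceberg.catalog import load_catalog

# Metadata layer: a catalog that maps table names to Iceberg metadata files.
catalog = load_catalog(
    "demo",
    **{
        "uri": "http://localhost:8181",          # hypothetical REST catalog
        "s3.endpoint": "http://localhost:9000",  # hypothetical object store
    },
)

# Table format layer: load table metadata (schema, snapshots, partitions).
table = catalog.load_table("analytics.page_views")

# Query layer: plan a scan with predicate and column pruning, then materialize
# only the matching Parquet files into a pandas DataFrame.
df = (
    table.scan(
        row_filter="event_date >= '2024-01-01'",
        selected_fields=("user_id", "event_date"),
    )
    .to_pandas()
)
print(df.head())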

Key Benefits

  • Simplified Architecture: Reduces complexity by potentially eliminating separate data lake and data warehouse silos and the ETL pipelines between them.
  • Reduced Costs: Leverages low-cost object storage and avoids data duplication.
  • Data Freshness: Enables faster data availability for analytics compared to traditional warehouse ETL cycles.
  • Flexibility: Supports diverse data types and workloads (SQL, ML, streaming) on the same data copy.
  • Reliability: Brings ACID transactions and robust governance features to data lake storage.
  • Openness: Often built using open-source formats and engines, reducing vendor lock-in.

Data Lakehouse and RisingWave

RisingWave fits naturally into the Streaming Lakehouse variant of this architecture. While a standard Lakehouse improves batch processing on the lake, a Streaming Lakehouse explicitly integrates real-time capabilities:

  • RisingWave as Streaming Engine: Ingests and continuously processes real-time data streams using SQL.
  • RisingWave Sinking to Lakehouse: Uses its sink connectors (especially the Apache Iceberg sink) to write processed real-time data reliably into the Lakehouse tables managed by the open table format.
  • Unified Access: Both real-time queries (potentially against RisingWave's Materialized Views) and batch/ad-hoc queries (against the Lakehouse tables using other engines) can operate on a consistent data foundation.

RisingWave acts as the component that brings low-latency stream processing and serving capabilities to the Data Lakehouse foundation.
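As a rough sketch of how this wiring can look, the snippet below connects to RisingWave over its PostgreSQL-compatible interface (using psycopg2) and creates an Iceberg sink from a materialized view. The materialized view, bucket, and credential values are placeholders, and the exact sink option names should be verified against the RisingWave Iceberg sink documentation for the version in use.

# Minimal sketch: creating a RisingWave Iceberg sink over the Postgres wire protocol.
# Connection defaults (port 4566, user root, db dev) match a local RisingWave setup;
# the MV name, bucket, and option values are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True

create_sink = """
CREATE SINK page_view_stats_sink FROM page_view_stats_mv
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'page_id',
    catalog.type = 'storage',
    warehouse.path = 's3a://my-bucket/lakehouse',
    database.name = 'analytics',
    table.name = 'page_view_stats',
    s3.endpoint = 'http://localhost:9000',
    s3.access.key = 'xxx',
    s3.secret.key = 'xxx',
    s3.region = 'us-east-1'
);
"""

with conn.cursor() as cur:
    # Once created, the sink continuously writes MV changes into the Iceberg table.
    cur.execute(create_sink)
conn.close()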

Related Glossary Terms

  • Data Lake
  • Streaming Lakehouse
  • Data Warehouse
  • Open Table Format
  • Apache Iceberg / Hudi / Delta Lake
  • ACID Transactions
  • Schema Evolution (in Lakehouse)
  • Time Travel (in Lakehouse)
  • Cloud Object Storage
  • ETL/ELT (Concepts)