Apache Hudi

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source, high-performance data lake storage framework that brings core database and data warehouse functionalities directly onto data lake storage like HDFS or cloud object storage (AWS S3, GCS, Azure Blob Storage). It provides efficient mechanisms for managing large analytical datasets by enabling record-level inserts, updates (upserts), and deletes, along with features for data versioning and incremental processing.
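
To make the record-level write path concrete, here is a minimal sketch using Hudi's Spark DataSource API from PySpark. The table path, field names, and sample row (`s3://my-bucket/lake/users`, `user_id`, `updated_at`, `country`) are illustrative assumptions, not details from this article.

```python
# Minimal PySpark sketch of a Hudi upsert (paths and field names are hypothetical).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # The matching Hudi Spark bundle must be on the classpath (e.g., via --packages).
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A batch of changed records; in practice this might come from CDC or a stream.
updates = spark.createDataFrame(
    [("u-1", "Alice", "2024-05-01 10:00:00", "US")],
    ["user_id", "name", "updated_at", "country"],
)

hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "user_id",      # identifies a record
    "hoodie.datasource.write.precombine.field": "updated_at",  # picks the latest version on key collisions
    "hoodie.datasource.write.partitionpath.field": "country",
    "hoodie.datasource.write.operation": "upsert",
}

# Hudi uses its index to locate the affected file groups and rewrites or logs
# only those records; each write adds a new commit to the table's timeline.
updates.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://my-bucket/lake/users"  # hypothetical base path
)
```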

Hudi is one of the key open table formats (alongside Apache Iceberg and Delta Lake) designed to address the limitations of traditional data lakes and enable reliable Data Lakehouse architectures.

The Need for Hudi: Challenges with Data Lakes

Traditional data lakes, while offering scalable and cost-effective storage, often suffered from several limitations, especially when dealing with changing data:

  1. Update/Delete Complexity: Modifying existing records in immutable file formats (like Parquet or ORC) on object storage was computationally expensive, often requiring rewriting entire partitions or tables.
  2. Data Quality Issues: Lack of transactional guarantees could lead to inconsistent data during concurrent writes or failures.
  3. Incremental Processing Difficulty: Efficiently identifying and processing only new or changed data since the last run was challenging.
  4. Lack of Schema Enforcement: Data lakes often struggled with schema changes and consistency over time.

Apache Hudi was developed to tackle these challenges by providing structured storage management on top of the raw data lake files.

Core Concepts and How Hudi Works

Hudi organizes data into tables residing on the data lake. Key concepts include:

  • Timeline: Hudi maintains a timeline of all actions performed on the table (commits, compactions, cleans, etc.), providing a detailed history and enabling features like time travel and incremental queries.
  • File Layout: Data is stored in base files (e.g., Parquet) that hold the main records, while changes can be appended to delta log files (e.g., Avro logs) associated with their base files.
  • Table Types:
    • Copy-on-Write (CoW): Updates create new versions of the files containing the changed records. Simpler reads, but higher write amplification. Suitable for read-heavy workloads.
    • Merge-on-Read (MoR): Updates are written to delta files. Reads merge the base files with the relevant delta files on the fly. Faster writes, but potentially slower reads until compaction. Suitable for write-heavy or near real-time workloads. Periodic compaction merges delta files back into base files; see the table-type and compaction configuration sketch after this list.
  • Indexing: Hudi employs indexing mechanisms (e.g., Bloom filters, hash indexes) to efficiently locate the file groups containing specific records during upsert operations, avoiding full table scans.
  • Write Operations: Provides APIs for 'upsert', 'insert', 'delete', 'bulk_insert', managing the complexities of applying these changes to the underlying file structure based on the chosen table type.
  • Incremental Queries: Allows querying only the data that changed between two points in time (commits) on the timeline, enabling efficient incremental data pipelines; a query sketch follows this list.
  • Concurrency Control: Offers optimistic concurrency control mechanisms to handle simultaneous writes to the same table; an example lock configuration follows this list.
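
The table type, index type, and compaction cadence are all chosen through writer configuration. The options below are a hedged sketch of commonly used settings (values are illustrative, and defaults vary across Hudi versions); they would be merged with the basic write options from the earlier upsert sketch.

```python
# Illustrative writer options for a Merge-on-Read table (values are assumptions).
mor_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # or "COPY_ON_WRITE"
    "hoodie.index.type": "BLOOM",                           # e.g., BLOOM, SIMPLE, BUCKET
    # Run compaction inline after every N delta commits so that delta logs
    # are periodically merged back into base Parquet files.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

# e.g.: updates.write.format("hudi").options(**hudi_options, **mor_options)...
```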
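
An incremental query then pulls only the records committed after a given instant on the timeline, reusing the SparkSession from the earlier sketch. The begin instant below is a made-up value; a real pipeline would record the last instant it processed.

```python
# Sketch of an incremental read: fetch only rows committed after `begin_instant`.
begin_instant = "20240501093000000"  # hypothetical instant time from the timeline

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_instant)
    .load("s3://my-bucket/lake/users")
)

# Downstream logic sees just the changed records, enabling incremental pipelines.
incremental_df.show()
```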
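
For multiple concurrent writers, the concurrency mode is switched to optimistic concurrency control and a lock provider is supplied. The ZooKeeper-based example below is a sketch; provider classes and property keys should be checked against the Hudi version in use.

```python
# Illustrative optimistic-concurrency settings for multi-writer scenarios.
occ_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",      # hypothetical endpoint
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "users",
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}
```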

Key Benefits

  • Record-Level Upserts & Deletes: Efficiently modifies individual records in the lake without rewriting entire partitions or tables.
  • Incremental Processing: Enables efficient streaming and incremental batch pipelines by querying only changed data.
  • Near Real-Time Ingestion: The Merge-on-Read table type enables fast ingestion, making new data queryable close to real time.
  • Transactionality (Write-Side): Provides atomicity and consistency for write operations.
  • Time Travel: Allows querying the table as it existed at specific points in time using the timeline (see the read sketch after this list).
  • Schema Evolution Support: Provides mechanisms to handle schema changes over time.
  • Pluggable Indexing: Optimizes write performance for updates and deletes.
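
A time-travel read pins a query to a past instant on the timeline, reusing the SparkSession and table path from the earlier sketches; the instant value below is a placeholder.

```python
# Sketch of a time-travel read: query the table as of a past instant (value is hypothetical).
as_of_df = (
    spark.read.format("hudi")
    .option("as.of.instant", "20240501093000000")  # date-style values such as "2024-05-01" are also accepted
    .load("s3://my-bucket/lake/users")
)
as_of_df.show()
```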

Common Use Cases

  • Building CDC (Change Data Capture) pipelines from databases into the data lake.
  • Streaming data ingestion into the lakehouse.
  • Managing GDPR/CCPA compliance requirements that call for record-level deletion (a deletion sketch follows this list).
  • Creating incrementally updated analytical tables.
  • Near real-time analytics on lake data.
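
For the compliance use case, a record-level delete can be issued by writing a DataFrame containing the keys of the rows to remove with the delete operation. The sketch below reuses the hypothetical table, field names, and SparkSession from the earlier examples.

```python
# Sketch of a record-level delete (e.g., for a GDPR/CCPA erasure request).
# The DataFrame carries the key, precombine, and partition fields of the rows to remove.
keys_to_delete = spark.createDataFrame(
    [("u-1", "2024-05-02 00:00:00", "US")],
    ["user_id", "updated_at", "country"],
)

(
    keys_to_delete.write.format("hudi")
    .option("hoodie.table.name", "users")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.partitionpath.field", "country")
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3://my-bucket/lake/users")
)
```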

Hudi in the Data Ecosystem

Hudi integrates with big data processing engines such as Apache Spark, Apache Flink, Presto, and Trino, allowing these tools to read from and write to Hudi tables. While RisingWave currently focuses more heavily on Apache Iceberg as the primary sink for its Streaming Lakehouse vision, owing to closer design alignment, Hudi remains a significant player in the open table format space, offering different trade-offs, particularly for write-heavy scenarios.

Related Glossary Terms

  • Data Lakehouse
  • Open Table Format
  • Apache Iceberg
  • Delta Lake
  • Change Data Capture (CDC)
  • Cloud Object Storage
  • Copy-on-Write (CoW)
  • Merge-on-Read (MoR)
  • Compaction
  • Incremental Processing