Cloud Object Storage
Cloud Object Storage refers to highly scalable and durable storage services offered by cloud providers, designed to store massive amounts of unstructured data as objects. Examples include Amazon Web Services (AWS) S3 (Simple Storage Service), Google Cloud Storage (GCS), and Azure Blob Storage.
Unlike traditional file systems (which organize data in hierarchical directories) or block storage (which manages data as fixed-size blocks, typically for disk volumes), object storage manages data as discrete units called objects. Each object typically consists of:
- Data: The actual content (e.g., a file, image, log, video).
- Metadata: A set of descriptive attributes about the object (e.g., content type, creation date, custom tags). Standard metadata is system-defined, while user-defined metadata can be added.
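- Unique Identifier (Key): A name that uniquely identifies the object within its bucket and is used to retrieve it. Keys often resemble file paths, but they live in a flat namespace; it is the combination of bucket and key that identifies an object.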
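The three parts of an object can be sketched in a few lines of Python. This is a minimal in-memory illustration, not any provider's API: the `StoredObject` and `Bucket` classes, the bucket name, and the metadata field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: payload bytes plus system-defined and user-defined metadata."""
    data: bytes
    system_metadata: dict = field(default_factory=dict)
    user_metadata: dict = field(default_factory=dict)


class Bucket:
    """Toy in-memory bucket (hypothetical): a flat mapping from key to object."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects = {}

    def put(self, key: str, data: bytes, content_type: str, **user_metadata: str) -> None:
        # In this toy model a PUT always replaces the whole object under the key.
        self._objects[key] = StoredObject(
            data=data,
            system_metadata={"content-type": content_type, "content-length": len(data)},
            user_metadata=dict(user_metadata),
        )

    def get(self, key: str) -> StoredObject:
        return self._objects[key]


bucket = Bucket("example-bucket")
bucket.put("logs/2024/app.log", b"started\n", "text/plain", team="platform")
obj = bucket.get("logs/2024/app.log")
```

Here the key "logs/2024/app.log" looks like a path, but the bucket stores it as a single string in one flat dictionary; no "logs" or "2024" directory exists anywhere.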
Key Characteristics
- Scalability: Designed to scale virtually infinitely in terms of the number of objects and total storage capacity.
- Durability & Availability: Cloud providers typically replicate objects across multiple devices and availability zones within a region, so services are designed for very high durability (e.g., 99.999999999%, often called '11 nines') and high availability.
- Cost-Effectiveness: Generally offers a much lower cost per gigabyte compared to block storage or traditional file storage, especially for infrequently accessed data (using different storage tiers).
- HTTP(S) Access: Objects are typically accessed through REST-style APIs over HTTP(S) using standard methods (GET, PUT, POST, DELETE), making them accessible from anywhere on the internet to clients with the proper permissions.
- Flat Namespace: Objects reside within containers called 'buckets' (AWS S3, GCS) or 'containers' (Azure Blob Storage). While keys can contain '/' characters to simulate directories, the underlying structure is flat.
- Eventual Consistency (Historically): Some object storage systems historically offered only eventual consistency for overwrite PUTs and DELETEs, meaning a read issued shortly after a change could still return the old version. Major providers now offer strong consistency: AWS S3, for example, has provided strong read-after-write consistency for new objects, overwrites, deletes, and list operations since December 2020.
- Unstructured Data Focus: Ideal for storing files, backups, logs, images, videos, large datasets, and other forms of unstructured or semi-structured data.
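The flat-namespace point is easier to see in code. The sketch below shows how a listing call can present "folders" purely by grouping keys on a prefix and a delimiter, which is roughly the idea behind the prefix and delimiter parameters that object storage list APIs expose. The `list_keys` function and the sample keys are illustrative assumptions, not a provider API.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) found directly under `prefix`."""
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Collapse everything up to the next delimiter into one simulated folder.
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# The bucket holds only these five flat keys; no directories are stored.
keys = [
    "images/cat.png",
    "images/raw/dog.png",
    "logs/2024/01/app.log",
    "logs/2024/02/app.log",
    "readme.txt",
]

top_level = list_keys(keys)
under_images = list_keys(keys, prefix="images/")
print(top_level)     # (['readme.txt'], ['images/', 'logs/'])
print(under_images)  # (['images/cat.png'], ['images/raw/'])
```

Because the hierarchy exists only in the key strings, "renaming a directory" in such a system generally means copying and deleting every object under the prefix rather than updating a single directory entry.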
Role in Data Architectures
Cloud object storage has become the foundational storage layer for modern data architectures, including:
- Data Lakes: Serves as the primary, cost-effective repository for raw and processed data in various formats.
- Data Lakehouses: Provides the storage for tables managed by open table formats like Apache Iceberg, Apache Hudi, and Delta Lake. These formats add transactional capabilities and table structure on top of object storage.
- Big Data Analytics: Stores input datasets and output results for processing frameworks like Spark and Flink.
- Backup and Recovery: Used for durable backups of databases and applications.
- Content Delivery: Stores static assets (images, videos) for web applications, often used in conjunction with Content Delivery Networks (CDNs).
Cloud Object Storage and RisingWave
Cloud object storage plays a critical role in RisingWave's architecture, particularly for its state management:
- Hummock State Store Backend: RisingWave's cloud-native state store, Hummock, is designed to use cloud object storage (like AWS S3, GCS, or S3-compatible systems like MinIO) as its durable backend.
- Checkpointing: When RisingWave performs checkpointing, the snapshots of operator state are persisted asynchronously to the configured object storage via Hummock.
- State Durability: This ensures that the critical state needed for stateful stream processing (joins, aggregations, materialized views) survives node failures, as it's stored reliably outside the compute nodes.
- Separation of Storage and Compute: Using object storage allows RisingWave to decouple its compute resources (Compute Nodes) from its state storage, enabling independent scaling and potentially better cost-efficiency. Compute nodes can be added or removed without affecting the durable state stored in Hummock on object storage.
- (Future/Potential) Sinking Data: Although RisingWave uses object storage primarily for its state store, object storage could also serve directly as a sink target for certain data formats. For analytical use cases, sinking to a structured table format such as Iceberg (which itself stores its data on object storage) is more common.
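The checkpointing and storage/compute separation described above can be illustrated with a toy model. This is a sketch only and does not reflect how Hummock or RisingWave are implemented: the `ObjectStore` and `CountingNode` classes, the "checkpoints/" key layout, and the JSON snapshot format are all hypothetical, and the checkpoint here is a synchronous full snapshot for simplicity.

```python
import json


class ObjectStore:
    """Hypothetical in-memory stand-in for a bucket (not a real client)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


class CountingNode:
    """Toy stateful operator that counts events per key."""

    def __init__(self, store):
        self.store = store
        self.counts = {}
        self.epoch = 0

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def checkpoint(self):
        # Persist a snapshot of the state outside the node, under a sortable key.
        self.epoch += 1
        body = json.dumps(self.counts).encode()
        self.store.put(f"checkpoints/{self.epoch:08d}.json", body)

    @classmethod
    def recover(cls, store):
        # A fresh node rebuilds its state from the latest snapshot, if any.
        node = cls(store)
        snapshots = store.list("checkpoints/")
        if snapshots:
            latest = snapshots[-1]
            node.counts = json.loads(store.get(latest))
            node.epoch = int(latest.split("/")[1].split(".")[0])
        return node


store = ObjectStore()
node = CountingNode(store)
for event_key in ["a", "b", "a"]:
    node.process(event_key)
node.checkpoint()
node.process("a")  # arrives after the checkpoint, so it is not in the snapshot
del node           # simulate losing the compute node

recovered = CountingNode.recover(store)
print(recovered.counts, recovered.epoch)  # {'a': 2, 'b': 1} 1
```

The replacement node needs nothing from the failed one: everything it restores comes from the store. In a real streaming system, the event processed after the last checkpoint would typically be replayed from the upstream source rather than lost.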
In essence, cloud object storage provides the scalable, durable, and cost-effective foundation that enables RisingWave's robust, cloud-native state management and fault tolerance capabilities via the Hummock state store.
Related Glossary Terms
- Data Lake
- Data Lakehouse
- Streaming Lakehouse
- Checkpointing
- State Store (RisingWave Specific)
- Hummock (RisingWave Specific)
- Separation of Storage and Compute
- Fault Tolerance
- Durability (State Management)
- Apache Iceberg / Hudi / Delta Lake (Use object storage)