Apache Pulsar

Apache Pulsar is an open-source, distributed, cloud-native messaging and event streaming platform designed for high performance, scalability, and reliability. Originally developed at Yahoo! and now a top-level Apache Software Foundation project, Pulsar offers a flexible alternative to systems like Apache Kafka, particularly noted for its multi-layered architecture and built-in features like multi-tenancy and geo-replication.

Like Kafka, Pulsar functions as an Event Streaming Platform (ESP), enabling applications to publish and subscribe to streams of data records (events) asynchronously.

Design Philosophy and Architecture

Pulsar's architecture distinguishes it from Kafka and contributes to some of its unique features:

Multi-Layered Architecture: Pulsar explicitly separates the serving layer (stateless Brokers) from the storage layer (stateful Bookies from Apache BookKeeper).
- Brokers: Handle client connections, authentication, topic lookups, and message dispatching. They are stateless, making them easy to scale horizontally and replace quickly.
- Bookies (Apache BookKeeper): Provide persistent message storage. BookKeeper is a distributed write-ahead log (WAL) system that stores data in segments called 'ledgers.' This separation allows independent scaling of compute (brokers) and storage (bookies).
Segment-Based Storage: Data for topic partitions is stored as a sequence of BookKeeper ledgers (segments). Because a partition is not tied to one node's disk, new segments can be placed on any available bookies, and older segments can be moved to cheaper storage (tiered storage).
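Unified Messaging Model: Pulsar supports both traditional message queuing semantics (shared subscriptions, where messages are distributed across multiple consumers) and event streaming semantics (exclusive or failover subscriptions, where a single active consumer reads messages in order) on the same topics. Key_shared subscriptions sit in between: messages are distributed across consumers, but all messages with the same key go to the same consumer, preserving per-key order.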
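
The delivery behaviour of the four subscription types can be sketched with a small pure-Python model. This is a simplified illustration, not Pulsar's client API or its actual dispatcher: the function name dispatch, the consumer names, and the use of CRC32 for key hashing are all assumptions made for the example (Pulsar itself assigns key hash ranges to consumers). Failover is modelled only in its steady state, where the first consumer is active.

```python
from collections import defaultdict
from itertools import cycle
from zlib import crc32


def dispatch(messages, consumers, subscription_type):
    """Toy model: assign (key, payload) messages to consumer names.

    Returns {consumer_name: [payload, ...]}. Hypothetical helper, not a
    Pulsar API.
    """
    assigned = defaultdict(list)
    if subscription_type in ("exclusive", "failover"):
        # A single active consumer receives every message, in order.
        # (With failover, a standby would take over if the active one left.)
        active = consumers[0]
        for _key, payload in messages:
            assigned[active].append(payload)
    elif subscription_type == "shared":
        # Messages are spread round-robin; no ordering guarantee per consumer.
        next_consumer = cycle(consumers)
        for _key, payload in messages:
            assigned[next(next_consumer)].append(payload)
    elif subscription_type == "key_shared":
        # The same key always maps to the same consumer (per-key ordering).
        # CRC32 is a stand-in for the broker's real key-hashing scheme.
        for key, payload in messages:
            index = crc32(key.encode("utf-8")) % len(consumers)
            assigned[consumers[index]].append(payload)
    else:
        raise ValueError(f"unknown subscription type: {subscription_type}")
    return dict(assigned)


msgs = [("user-a", 1), ("user-b", 2), ("user-a", 3), ("user-b", 4)]
print(dispatch(msgs, ["c1", "c2"], "exclusive"))  # {'c1': [1, 2, 3, 4]}
print(dispatch(msgs, ["c1", "c2"], "shared"))     # {'c1': [1, 3], 'c2': [2, 4]}
print(dispatch(msgs, ["c1", "c2"], "key_shared"))
```

In the key_shared case the exact consumer chosen for each key depends on the hash, but payloads 1 and 3 (key user-a) always land on the same consumer, as do 2 and 4 (key user-b).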

Core Concepts

Many concepts in Pulsar are analogous to Kafka, but with some key differences:

Event/Message: The unit of data sent through Pulsar.
Topic: A named channel for messages. As in Kafka, topics can be partitioned for scalability; Pulsar also supports non-partitioned topics, which are served by a single broker.
Partition: An ordered log within a topic, stored as BookKeeper ledgers. Pulsar manages partition distribution across brokers.
Producer: Application that publishes messages to a topic.
Consumer: Application that subscribes to a topic and processes messages.
Subscription: A named configuration rule for how messages from a topic are delivered to one or more consumers. The subscription type dictates message delivery semantics (exclusive, failover, shared, or key_shared).
Cursor: Managed by the broker for each subscription, tracking which messages in a partition's log the subscription's consumers have acknowledged. Unlike Kafka, where a consumer group commits a single offset per partition, Pulsar consumers can acknowledge messages either individually or cumulatively (everything up to a given message).
Broker: Stateless serving component.
Bookie: Stateful storage component (from Apache BookKeeper).
ZooKeeper / Configuration Store: Used for cluster metadata management, coordination, service discovery, and configuration storage. Pulsar supports pluggable configuration stores.
Namespace: A logical grouping of related topics within a tenant. Policies (like replication, retention, access control) can be set at the namespace level.
Tenant: The highest administrative unit, allowing for multi-tenancy where different teams or applications can securely share the same Pulsar cluster with isolation guarantees.
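
The Cursor entry above can be made concrete with a toy model of acknowledgment tracking. This is a minimal sketch under stated assumptions, not the broker's implementation: positions are plain integers (real Pulsar positions are ledger ID and entry ID pairs), and the class and attribute names are hypothetical. It shows how individual acknowledgments can leave gaps, while the durable "everything up to here is done" position only advances over a contiguous run.

```python
class Cursor:
    """Toy subscription cursor over one partition (illustrative only)."""

    def __init__(self):
        # Highest position such that every message at or below it is acked.
        self.mark_delete = -1
        # Acked positions above mark_delete (acks with gaps before them).
        self.individually_acked = set()

    def ack(self, position):
        """Individual ack: mark only this message as processed."""
        if position > self.mark_delete:
            self.individually_acked.add(position)
            self._advance()

    def ack_cumulative(self, position):
        """Cumulative ack: mark everything up to and including position."""
        if position > self.mark_delete:
            self.mark_delete = position
            self.individually_acked = {
                p for p in self.individually_acked if p > position
            }
            self._advance()

    def _advance(self):
        # Slide mark_delete forward over any contiguous run of acked messages.
        while self.mark_delete + 1 in self.individually_acked:
            self.mark_delete += 1
            self.individually_acked.remove(self.mark_delete)

    def is_acked(self, position):
        return (
            position <= self.mark_delete
            or position in self.individually_acked
        )


cursor = Cursor()
cursor.ack(2)            # message 2 done; 0 and 1 still outstanding
cursor.ack(0)            # mark_delete moves to 0; 2 remains a separate ack
print(cursor.mark_delete, sorted(cursor.individually_acked))  # 0 [2]
cursor.ack(1)            # gap closed, so mark_delete slides to 2
print(cursor.mark_delete, sorted(cursor.individually_acked))  # 2 []
```

A single committed offset, by contrast, corresponds to using only ack_cumulative in this model: there is no way to record that message 2 is done while message 1 is still pending.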

Key Features & Benefits

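Scalability & Elasticity: Independent scaling of stateless brokers and stateful bookies provides flexibility in resource allocation. Because brokers hold no message data, adding a broker requires no data copying, and newly added bookies start receiving new ledgers without existing data having to be rebalanced.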
High Performance & Low Latency: Designed for high throughput and low end-to-end message latency.
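Durability & Consistency: Achieved through persistent storage in BookKeeper. Each ledger is written to a set of bookies with a configurable ensemble size, write quorum (number of copies of each entry), and ack quorum (number of bookie acknowledgments required before a write is confirmed).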
Multi-Tenancy: Built-in support for isolating users and applications within a shared cluster using tenants and namespaces.
Geo-Replication: Built-in mechanisms for replicating data across multiple geographically distributed data centers, configurable at the namespace level.
Unified Queuing and Streaming: Supports various subscription modes catering to different use cases on the same topic data.
Tiered Storage: Allows offloading older data segments to cheaper, long-term storage (like S3) transparently.
Schema Registry (Built-in): Includes schema management capabilities to enforce data consistency for topics.
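
How BookKeeper-style replication spreads entries across bookies can be sketched as follows. This is a simplified model of round-robin striping, assuming three parameters: an ensemble (the bookies chosen for a ledger), a write quorum (copies per entry), and an ack quorum (acknowledgments needed before the write counts as durable). The function names and bookie names are hypothetical and do not correspond to the BookKeeper client API.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that receive a copy of entry_id under round-robin striping."""
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write_quorum must be between 1 and the ensemble size")
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


def is_confirmed(acks_received, write_quorum, ack_quorum):
    """A write is confirmed once ack_quorum of the written copies are acked."""
    if not 1 <= ack_quorum <= write_quorum:
        raise ValueError("ack_quorum must be between 1 and write_quorum")
    return acks_received >= ack_quorum


ensemble = ["bookie-1", "bookie-2", "bookie-3"]
for entry_id in range(3):
    print(entry_id, write_set(entry_id, ensemble, write_quorum=2))
# 0 ['bookie-1', 'bookie-2']
# 1 ['bookie-2', 'bookie-3']
# 2 ['bookie-3', 'bookie-1']

print(is_confirmed(acks_received=1, write_quorum=2, ack_quorum=2))  # False
print(is_confirmed(acks_received=2, write_quorum=2, ack_quorum=2))  # True
```

With an ensemble larger than the write quorum, consecutive entries land on different subsets of bookies, which spreads write load; setting the ack quorum below the write quorum trades some durability margin for lower write latency.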

Common Use Cases

Pulsar is used for many of the same use cases as Kafka, including:

Messaging between microservices.
Real-time data pipelines and ETL.
Event-driven architectures.
Log aggregation and activity tracking.
Serving as input/output for stream processing systems.
Applications requiring strong multi-tenancy or built-in geo-replication.

Pulsar and RisingWave

Apache Pulsar serves as another key integration point for RisingWave, acting as an alternative to Kafka:

Source: RisingWave can ingest data streams directly from Pulsar topics using its 'CREATE SOURCE' command, which connects to a Pulsar cluster and consumes messages from the specified topic.
Sink: RisingWave can also publish processed results back into Pulsar topics using 'CREATE SINK', enabling downstream applications to consume RisingWave's output via Pulsar.