Data Stream / Event Stream

A Data Stream, a term often used interchangeably with Event Stream, refers to an unbounded, potentially infinite sequence of data records (events) ordered in time. Unlike traditional batch datasets, which are finite and processed at rest, data streams represent data that is continuously generated and must be processed 'in motion' as it arrives.

Streams are the fundamental input and output for Stream Processing systems. Examples of data streams include:

  • Sensor readings from IoT devices.
  • User activity logs from websites or mobile apps (clickstreams).
  • Financial market data (stock tickers, trades).
  • Database change events (Change Data Capture - CDC).
  • Application logs and metrics.
  • Social media feeds.

Key Characteristics of Data Streams

  • Unbounded: Streams typically have no predefined end. New events are continuously generated and arrive over time. Processing logic must handle this potentially infinite nature.
  • Ordered (Typically by Time): Events within a stream usually have an implicit or explicit temporal order, often based on when the event occurred (Event Time) or when it was processed (Processing Time). Maintaining order, especially within relevant contexts (like events for the same user), can be crucial.
  • Immutable Records: Once an event is published to a stream, it typically cannot be changed. Updates are usually represented as new events in the stream (e.g., a CDC update event).
  • Real-time / Low Latency: Data often needs to be processed shortly after it's generated to enable timely insights or actions.
  • Potentially High Volume: Streams can involve very high rates of incoming data.

Representing Data Streams

Data streams are often materialized or managed by intermediary systems:

  • Event Streaming Platforms (ESP): Systems like Apache Kafka, Apache Pulsar, or AWS Kinesis are explicitly designed to ingest, store (durably), partition, and serve data streams to consumers. They act as buffers and brokers for streams.
  • Message Queues (MQ): Simpler queues might be used, though they often lack the strong ordering, persistence, and replay capabilities of ESPs.
  • Direct Sources: Sometimes streams originate directly from sources like CDC feeds without an intermediary broker.
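As a concrete sketch of the first case, a stream held in an ESP can be exposed to a consumer declaratively. The example below registers a Kafka topic as a stream using RisingWave-style SQL; the column schema, topic name, and broker address are illustrative placeholders, not part of the original text:

```sql
-- Register a Kafka topic as an unbounded stream of JSON events.
-- Topic name and broker address are placeholders.
CREATE SOURCE user_clicks (
    user_id   INT,
    url       VARCHAR,
    event_ts  TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'user-clicks',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;
```

Once registered, the topic behaves like an ever-growing relation that downstream continuous queries can read from.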

Processing Data Streams

Stream processing engines like RisingWave, Apache Flink, or Spark Streaming are designed to consume, transform, and analyze data streams:

  • Continuous Queries: Computations are expressed as continuous queries that run indefinitely, incrementally updating their results as new events arrive.
  • Stateful Operations: They manage internal state to perform operations like joins, aggregations, and windowing across events in the stream.
  • Time Handling: They incorporate mechanisms to handle event time, processing time, and associated challenges like out-of-order data and watermarking.
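The three mechanisms above often appear together in a single continuous query. The sketch below is a minimal, illustrative example in RisingWave-style SQL, assuming a hypothetical `user_clicks` stream with an `event_ts` event-time column: it counts clicks per one-minute event-time window, and the engine maintains the per-window counts as internal state, updating them as events arrive:

```sql
-- A continuous query: count clicks per 1-minute event-time window.
-- TUMBLE assigns each event to a fixed-size window; the engine keeps
-- the running counts as state and revises them as new events arrive.
-- The user_clicks stream and its event_ts column are assumed to exist.
CREATE MATERIALIZED VIEW clicks_per_minute AS
SELECT
    window_start,
    COUNT(*) AS click_count
FROM TUMBLE(user_clicks, event_ts, INTERVAL '1 minute')
GROUP BY window_start;
```

Because windows are keyed by event time rather than arrival time, a watermark (typically declared on the source) tells the engine how long to wait for out-of-order events before treating a window as complete.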

Data Streams in RisingWave

Data streams are the primary input and a potential output for RisingWave:

  • Sources: Defined using 'CREATE SOURCE', these represent connections to external data streams (e.g., Kafka topics, Pulsar topics, Kinesis streams, CDC streams). RisingWave continuously ingests data from these sources.
  • Internal Streams: Within RisingWave's Dataflow Graph, data flows between operators as internal streams.
  • Materialized Views (Conceptual Stream): While a materialized view stores a result, the stream of changes (inserts, updates, deletes) to that view can be considered an output data stream.
  • Sinks: Defined using 'CREATE SINK', these allow RisingWave to publish an output stream (often the change stream from a materialized view or table) to an external system like Kafka or an Iceberg table.
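A minimal sink sketch, again with illustrative names and connector properties, and assuming a materialized view called `clicks_per_minute` already exists: it publishes that view's change stream to a Kafka topic, using an upsert format so that updates and deletes to the view are conveyed correctly:

```sql
-- Publish the change stream of a materialized view to a Kafka topic.
-- Sink name, topic, broker address, and key column are placeholders.
CREATE SINK clicks_per_minute_sink
FROM clicks_per_minute
WITH (
    connector = 'kafka',
    topic = 'clicks-per-minute',
    properties.bootstrap.server = 'broker:9092',
    primary_key = 'window_start'
) FORMAT UPSERT ENCODE JSON;
```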

RisingWave effectively allows users to define transformations and computations on input data streams using SQL, often materializing the results for low-latency querying or sinking the output streams for downstream consumption.

Related Glossary Terms

  • Stream Processing
  • Continuous Query
  • Event Streaming Platform (ESP)
  • Apache Kafka / Apache Pulsar
  • Change Data Capture (CDC)
  • Unbounded Data (Concept)
  • Event Time / Processing Time
  • Source / Sink
  • Materialized View
  • Dataflow Graph