Filtering (Streaming)

Filtering, in the context of stream processing, is a fundamental operation that selectively passes or discards events from a Data Stream based on whether they meet certain criteria or conditions. It's a stateless transformation applied independently to each incoming event.

The purpose of filtering is often to:

  • Reduce data volume by removing irrelevant or uninteresting events.
  • Isolate specific types of events for targeted downstream processing (like aggregation, joining, or alerting).
  • Cleanse data by removing malformed or invalid records.
  • Route data to different processing paths based on event characteristics.

How Filtering Works

A filtering operation typically involves evaluating a boolean condition (a predicate) for each event in the stream:

  1. Event Arrival: An event arrives at the filtering operator.
  2. Predicate Evaluation: The operator evaluates a predefined condition using the event's data fields. This condition might involve:
    • Checking field values (e.g., status = 'ERROR', temperature > 100, country = 'USA').
    • Checking for null values (e.g., user_id IS NOT NULL).
    • Pattern matching (e.g., url LIKE '%/login%').
    • Combinations of conditions using logical operators (AND, OR, NOT).
  3. Decision:
    • If the predicate evaluates to true, the event is passed downstream.
    • If the predicate evaluates to false, the event is discarded and not processed further by that path.

Because filtering operates on each event independently without needing to remember information from previous events, it is considered a Stateless Stream Processing operation. This generally makes it computationally lightweight and easy to scale.

Filtering in RisingWave (SQL WHERE Clause)

In RisingWave, filtering is primarily achieved using the standard SQL WHERE clause within continuous queries, typically when defining Materialized Views or selecting data to be sent to a Sink.

Example:

Consider a source stream sensor_readings with columns sensor_id, timestamp, temperature, and humidity.

To create a materialized view containing only high-temperature readings from a specific sensor:

CREATE MATERIALIZED VIEW high_temp_alerts AS
SELECT
    sensor_id,
    timestamp,
    temperature
FROM sensor_readings
WHERE
    temperature > 100 AND sensor_id = 'sensor-A1';

In this continuous query:

  1. Events flow from the sensor_readings source.
  2. The WHERE temperature > 100 AND sensor_id = 'sensor-A1' clause acts as the filter predicate.
  3. Only events that satisfy both conditions (temperature is over 100 and the sensor ID is 'sensor-A1') are passed through.
  4. These filtered events are then used to update the high_temp_alerts materialized view. Events not matching the WHERE clause are discarded and do not affect this view.

The WHERE clause can be used in various parts of RisingWave SQL queries, including subqueries, joins, and window functions, to filter data at different stages of the processing pipeline defined by the Dataflow Graph.

Use Cases for Streaming Filters

  • Alerting: Isolating critical error events or readings exceeding thresholds.
  • Data Cleansing: Removing events with missing or invalid fields before further processing.
  • Targeted Aggregation: Counting only specific types of user actions or sales from particular regions.
  • Routing: Sending different event types to different downstream sinks or processing logic (often implemented by having multiple materialized views or sinks with different WHERE clauses on the same source).
  • Reducing Load: Discarding irrelevant data early in the pipeline to reduce processing load on downstream stateful operators like joins or aggregations.

Related Glossary Terms

  • Stateless Stream Processing
  • Data Stream
  • Continuous Query
  • Streaming SQL
  • WHERE Clause (SQL Concept)
  • Materialized View
  • Source / Sink
  • Dataflow Graph
  • Transformation
The Modern Backbone for Your
Event-Driven Infrastructure
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.