Dataflow Graph
A Dataflow Graph (or Streaming Job Graph) is a directed acyclic graph (DAG) commonly used by distributed stream processing systems to represent the logical structure and execution plan of a streaming computation or Continuous Query. It defines how data flows between different processing steps (operators) from sources to sinks.
Nodes in the graph represent operators performing specific tasks (e.g., reading from a source, filtering, aggregating, joining, writing to a sink), and edges represent the data streams flowing between these operators.
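For example, a simple continuously maintained aggregation maps to a short chain of operators. The SQL below is a hedged sketch (the 'orders' relation and its columns are assumed for illustration), with the derived graph shown in comments:

```sql
-- Assumed input: an append-only stream/table orders(customer_id, amount).
-- A typical engine derives a dataflow graph roughly like:
--
--   Source(orders) -> Filter(amount > 0) -> Aggregate(SUM per customer_id) -> Materialize/Sink
--
CREATE MATERIALIZED VIEW customer_totals AS
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE amount > 0
GROUP BY customer_id;
```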
Purpose: Defining and Executing Streaming Jobs
Dataflow graphs serve several crucial purposes:
- Logical Representation: They provide a clear, abstract model of the streaming application's logic, independent of the physical execution details. Users often define their logic using higher-level APIs (like Streaming SQL or DataStream APIs), which the system then translates into a logical dataflow graph.
- Optimization: The stream processing engine analyzes the logical graph and optimizes it for efficient execution. This might involve fusing operators, reordering operations, choosing optimal join strategies, or determining appropriate state backends (the join sketch after this list illustrates these decisions).
- Physical Execution Plan: The optimized logical graph is transformed into a physical execution plan. This plan details how the operators will be instantiated as parallel tasks and deployed across the cluster's physical resources (nodes/cores). It includes details about data partitioning, network shuffling (exchanges), and state management.
- Scheduling and Execution: The cluster manager or job scheduler uses the physical plan to deploy and run the tasks on the available worker nodes.
- Monitoring and Debugging: The graph structure provides a basis for monitoring data flow, throughput, latency, and backpressure between different stages of the pipeline, aiding in performance tuning and debugging.
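As a concrete case for the optimization and physical-planning points above, consider a streaming join. The statement is a sketch over assumed 'orders' and 'customers' relations; the comments list decisions a planner typically makes, not the output of any specific engine:

```sql
-- Logically this is just Join(orders, customers) plus a projection.
-- Physically, a planner typically has to decide, among other things:
--   * partitioning: hash-shuffle both inputs on customer_id so that
--     matching rows arrive at the same parallel task;
--   * state: retain per-key join state for both sides in a state backend;
--   * fusion: chain the projection into the join task to avoid an extra
--     network hop.
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT o.order_id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```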
Components of a Dataflow Graph
- Nodes (Operators/Vertices): Represent units of computation. Common types include:
  - Source Operators: Read data from external systems (e.g., Kafka, Pulsar).
  - Transformation Operators: Perform stateless or stateful operations like 'map', 'filter', 'project', 'flatMap', 'aggregate', 'join', 'window'.
  - Sink Operators: Write results to external systems (e.g., Kafka, Iceberg, databases).
- Edges (Streams/Channels): Represent the flow of data records between operators. Edges define the partitioning strategy (e.g., hash, broadcast, round-robin) used when data is shuffled between parallel instances of operators running on different nodes. A minimal source-to-sink example follows this list.
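Here is a minimal end-to-end pipeline covering all three node types, written in RisingWave-flavored SQL; the connector properties (broker address, topics) are placeholders to adapt to your environment:

```sql
-- Source operator: ingest click events from Kafka (placeholder settings).
CREATE SOURCE clicks (
    user_id BIGINT,
    url     VARCHAR,
    ts      TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'clicks',
    properties.bootstrap.server = 'localhost:9092'
) FORMAT PLAIN ENCODE JSON;

-- Transformation operators (filter + project) feeding a sink operator.
CREATE SINK https_clicks AS
SELECT user_id, url
FROM clicks
WHERE url LIKE 'https://%'
WITH (
    connector = 'kafka',
    topic = 'https_clicks',
    properties.bootstrap.server = 'localhost:9092',
    type = 'append-only'
);
```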
Logical vs. Physical Graph
- Logical Graph: Represents the high-level structure defined by the user's code or query. Operators may still be abstract at this stage (e.g., a generic join with no physical algorithm chosen yet).
- Physical Graph: The detailed execution plan generated by the system's optimizer and planner. It specifies the exact operator implementations, the degree of Parallelism for each operator, how operators are chained together into tasks, and how data is physically exchanged (shuffled) between tasks across the network. A short example of setting parallelism follows this list.
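Parallelism illustrates the distinction well: it is a property of the physical graph, not the logical one. In RisingWave, for instance, it can be set per session before a streaming job is created (the 'orders' relation is assumed for illustration):

```sql
-- Physical-plan knob: fragments of the next streaming job are instantiated
-- with 4 parallel tasks each, subject to available cluster resources.
SET streaming_parallelism = 4;

CREATE MATERIALIZED VIEW totals_parallel AS
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;
```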
Dataflow Graphs in RisingWave
RisingWave internally uses the concept of a dataflow graph (often referred to as the 'streaming plan' or 'fragment graph') to execute Streaming SQL queries, particularly those defined within a 'CREATE MATERIALIZED VIEW' or 'CREATE SINK' statement:
- SQL Parsing & Planning: When a streaming SQL query is submitted, RisingWave parses it, performs semantic analysis, and generates a logical query plan.
- Optimization: The logical plan undergoes optimization based on cost models, available indexes, and stream-specific optimization rules (e.g., optimizing incremental computations).
- Streaming Plan (Physical Graph): The optimizer generates a distributed physical execution plan – the dataflow graph tailored for RisingWave's execution engine. This plan details:
  - Fragments: Contiguous subgraphs of the dataflow graph whose operators run together on a compute node; each fragment is instantiated as one or more parallel actors.
  - Operators: Specific implementations for SQL operations (Hash Join, Hash Aggregation, Filter, Project, Materialize, etc.).
  - Exchanges: Nodes representing data shuffling between fragments, specifying the partitioning strategy (Hash, Broadcast, etc.).
  - State Management: Specifies where and how state is managed for stateful operators (using Hummock).
- Scheduling & Execution: The Meta node distributes these graph fragments to the Compute Nodes for execution, which instantiate the operators and process data streams according to the plan.
- Materialization: For 'CREATE MATERIALIZED VIEW', a dedicated 'Materialize' operator exists in the graph, responsible for maintaining the view's results in the State Store (Hummock) based on the incoming stream of changes from upstream operators. An illustrative plan for such a view is shown after this list.
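To inspect the streaming plan RisingWave generates, prefix the statement with 'EXPLAIN'. The plan shown in comments below is abbreviated and illustrative of the general shape only; exact operator names and fields vary by version:

```sql
EXPLAIN CREATE MATERIALIZED VIEW customer_totals AS
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;

-- Illustrative shape of the resulting streaming plan (abbreviated):
--
--   StreamMaterialize { columns: [customer_id, total_spent], ... }
--   └─StreamHashAgg { group_key: [customer_id], aggs: [sum(amount)] }
--     └─StreamExchange { dist: HashShard(customer_id) }  -- shuffle between fragments
--       └─StreamTableScan { table: orders, ... }
```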
Users typically don't interact directly with the low-level dataflow graph in RisingWave: they define the desired computation using SQL, and RisingWave handles the translation into an efficient distributed execution graph. Understanding the concept still helps when analyzing query performance (e.g., via 'EXPLAIN', which shows the generated plan) and when reasoning about how distributed stream processing works internally.
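For a job that is already running, the deployed fragments can also be inspected through system views. The query below assumes the 'rw_catalog.rw_fragments' catalog available in recent RisingWave versions; treat the exact column set as an assumption to verify against your release:

```sql
-- List deployed fragments and their parallelism (columns are version-dependent).
SELECT fragment_id, table_id, parallelism
FROM rw_catalog.rw_fragments;
```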
Related Glossary Terms
- Continuous Query
- Stream Processing
- Streaming SQL
- Materialized View
- Operator (Conceptual)
- Source / Sink
- Parallelism
- Data Partitioning (Streaming)
- Stateful Stream Processing
- Incremental Computation
- Optimization (Conceptual)