On the Way to Democratized Stream Processing: RisingWave’s Roadmap
Embark on the journey toward democratized stream processing with RisingWave’s roadmap. This blog lays out our plans and vision, showing how RisingWave aims to empower organizations by making stream processing accessible to all, driving innovation and efficiency in data-driven applications.
Two months ago, we open-sourced RisingWave, a cloud-native streaming database. RisingWave was built with the mission of democratizing stream processing — making stream processing simple, affordable, and accessible. You can check out our recent blog posts, documentation, and source code for more information about RisingWave.
Rome was not built in a day, and neither are database systems. We started developing RisingWave (from scratch!) in early 2021; we did a major code refactor in the summer of 2021; in April 2022, we open-sourced RisingWave under Apache License 2.0 to make it accessible to everyone. So, why and how did we build the system from scratch? Where are we now? And what’s next?
Building the system from scratch
Many successful database products are forked from existing battle-tested systems. While several fully-fledged streaming systems already exist, such as Apache Storm, Apache Flink, and Apache Spark Streaming, we decided to build RisingWave from scratch. The key reason is that most existing streaming systems are not designed as databases in which storage is a primary concern, nor are they built for cloud infrastructure. Building a cloud database out of a big-data system would therefore incur heavy technical debt.
However, building RisingWave from scratch doesn’t mean that we built every single component on our own. In fact, we adopted several building blocks from the open-source community, including etcd, Apache Calcite (we recently deprecated the Calcite-based SQL frontend), and sqlparser-rs. We also use Grafana and Prometheus to implement RisingWave’s observability framework.
The current status
RisingWave is under rapid development. We have over 70 contributors from across the globe, and we review and merge 100+ pull requests weekly. However, RisingWave is still far from being a “fully-fledged” system. So, what is the current status of RisingWave? In this section, we answer this question from several perspectives.
RisingWave aims to support the complete set of PostgreSQL commands. So far, it supports the core SQL commands that revolve around the following categories of operations:
- Creating tables and sources. The fundamental functionality of a DBMS is to store data. RisingWave supports a PostgreSQL-compatible relational model and includes most basic PostgreSQL data types. In addition, RisingWave can connect to a wide range of external sources to consume streaming data. Currently, RisingWave supports popular data formats such as JSON, Protobuf, and Avro for its sources. When creating a source, users can manually specify the mapping from the nested structure to a plain table.
- Creating materialized views. To deliver continuous analytics, RisingWave allows users to define materialized views over both tables and sources. Materialized views capture accumulated analytical results, and RisingWave continuously and incrementally maintains them with a strong snapshot-consistency guarantee. Currently, RisingWave supports a large set of PostgreSQL-compatible SQL queries in materialized views, covering a large portion of the TPC-H queries. Unlike traditional databases, as a streaming database RisingWave also (1) allows joins between tables and streaming sources; (2) natively supports streaming window functions, including tumbling, hopping, and sliding windows; and (3) allows materialized views to be defined on top of existing views, with potential joins against other streams.
- Querying the database. RisingWave allows users to issue queries directly over both tables and materialized views, without requiring additional systems as data sinks. Currently, RisingWave supports a large set of standard PostgreSQL queries, including 20 out of 22 TPC-H queries.
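To make the three categories above concrete, the snippet below sketches a minimal end-to-end flow: ingesting a JSON stream from Kafka, maintaining a tumbling-window materialized view over it, and querying the view directly. All object names, the topic, and the broker address are hypothetical, the connector options are abbreviated, and the exact syntax may evolve as RisingWave develops.

```sql
-- Ingest a JSON-encoded Kafka topic as a streaming source.
-- (Names, topic, and broker address are illustrative.)
CREATE SOURCE ad_clicks (
    ad_id BIGINT,
    clicked_at TIMESTAMP
) WITH (
    connector = 'kafka',
    kafka.topic = 'ad_clicks',
    kafka.brokers = 'localhost:9092'
) ROW FORMAT JSON;

-- Maintain a per-minute tumbling-window count,
-- updated continuously and incrementally.
CREATE MATERIALIZED VIEW ad_clicks_1min AS
SELECT
    ad_id,
    window_start,
    COUNT(*) AS click_count
FROM TUMBLE(ad_clicks, clicked_at, INTERVAL '1' MINUTE)
GROUP BY ad_id, window_start;

-- Query the always-fresh result directly, with no external sink.
SELECT * FROM ad_clicks_1min WHERE ad_id = 42;
```

Because the view is maintained incrementally, the final query reads precomputed results rather than re-scanning the stream.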
We have implemented several of the key designs that make RisingWave a cloud-native streaming database.
- Disaggregated architecture. RisingWave is decoupled into different components, including the SQL frontend service, compute nodes, the storage service, and the meta service. Under this design, each component can scale out (or in) independently according to its own usage, enabling fine-grained control over resource usage without the cost overhead of wasted resources.
- Distributed engine. At the core of RisingWave is its distributed streaming engine. We built this high-performance engine completely from scratch so that we could take advantage of recent technical advances, such as the Rust async runtime. A decoupled storage layer on top of cloud services guarantees the durability and virtually unlimited capacity of state storage. We built a cloud-native storage service, Hummock, which is based on an LSM-tree to optimize write performance and extensively uses tiered caches to bridge the latency gap between memory and cloud storage.
- High availability. RisingWave maintains high availability via a periodic checkpointing procedure. Whenever a failure is detected, RisingWave automatically recovers from the latest consistent checkpoint and resumes stream processing.
Observability is a vital part of building a reliable and debuggable system, and we have invested considerable effort in RisingWave’s observability. In particular, we added tracing to the whole pipeline inside the computing engine. We use Prometheus to collect various metrics on system status, and we provide a Grafana dashboard to visualize both real-time system metrics and the streaming plans that maintain materialized views.
We have open-sourced the Kubernetes operator for RisingWave, and the project is under heavy development. With the Kubernetes operator, RisingWave can be deployed either as a cloud-hosted service or on-premise in self-hosted Kubernetes clusters.
We also allow users and developers to try RisingWave on their local machines. We developed RiseDev, a developer tool that automatically configures all services and manages dependencies. Any developer can deploy a local testing playground of RisingWave with a single command in the terminal.
RisingWave is part of the modern real-time data stack, and we have spent a great deal of effort integrating it with other systems. Currently, RisingWave supports connecting to Apache Kafka, Redpanda, Apache Pulsar, and more, and it can ingest change data from Debezium CDC. We plan to collaborate with many other open-source systems, including Redis, Elasticsearch, Apache Hudi, and Apache Iceberg, to build a thriving community together.
Building a robust database system can take more than a decade, and we will keep focusing on improving the system’s performance and robustness. While the design and implementation will evolve over time to catch up with the latest technology trends, we do have a clear roadmap of the key features for the next two years.
Short-term objectives (within the next 3 months)
Our target for the upcoming three months is to deliver a fully functional streaming database. Hence, we will focus on the following objectives:
- Richer SQL support. Our primary goal is to support more of the SQL standard, for example, more general nested subqueries, more expressions, and more functions.
- Creating indexes. Indexes are a major feature for speeding up query processing in all databases. Chances are, these indexes can also be shared with the streaming engine to reduce the cost of state maintenance. RisingWave will allow users to issue create index commands, and the index implementations will be further optimized to benefit both the batch query engine and the streaming engine.
- Access control. Our next focus is access control. We will fully support fine-grained access control by adding basic functionality, including the system catalog, user authentication, and more.
- Connecting with downstream systems. While RisingWave can serve users’ queries over streaming data directly, it also lets users export streaming data into existing infrastructure further down the data pipeline. We are working on adding a create sink command to RisingWave, so that we keep the simplicity of the SQL interface without losing the functionality of traditional stream processing.
- Highly concurrent serving. So far, we have mainly focused on stream processing. In the mid-term, we will also turn our attention back to the database’s query engine. Our plan includes a series of optimizations tailored to highly concurrent serving workloads, and a major refactor of the query engine architecture is on the schedule. In particular, we will introduce a local execution mode to avoid expensive coordination among compute nodes at query runtime.
- Benchmarks. People constantly ask about our performance benchmarking results. We haven’t published any yet, because RisingWave is still under rapid development and any numbers would quickly become outdated. Nevertheless, benchmarking is definitely on our roadmap! We will also carry out extensive tests on cost and the cost-performance ratio, which will drive the technical design of our future optimizations.
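The index and sink objectives can be sketched as the commands below. Both are planned features at the time of writing, so every name and connector option here is an illustrative assumption rather than final syntax.

```sql
-- A secondary index to speed up point lookups on a materialized view
-- (the view name ad_clicks_1min is hypothetical).
CREATE INDEX idx_ad_id ON ad_clicks_1min (ad_id);

-- Export a materialized view's change stream to a downstream system.
-- The create sink command is still under development; the options
-- shown are assumptions, not final syntax.
CREATE SINK ad_clicks_sink FROM ad_clicks_1min WITH (
    connector = 'kafka',
    kafka.topic = 'ad_clicks_out',
    kafka.brokers = 'localhost:9092'
);
```

The intent is symmetry with create source: the same declarative SQL surface covers both ingesting streams and exporting results.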
Mid-term objectives (within the next 6 months)
Aside from performance enhancements and bug fixes, we will add more functionality to the system within the next six months.
- Elasticity. In contrast to traditional approaches to scaling out (or in), RisingWave stands on the shoulders of the state of the art in both academia and industry. We have designed a novel, lightweight online scaling mechanism to cope with spikes in streaming traffic. The idea is that, instead of shutting down the cluster entirely, we take a fine-grained approach inside the database kernel. This feature is under heavy development, and we expect to finish it within six months.
- UDF support. We plan to support user-defined functions (UDF) to enable our users to express a larger spectrum of applications.
- Schemaless sources. One key observation is that users still need to define how source data is mapped from a nested format to flat tables upon ingestion, which requires knowledge of the source data in advance. To simplify this further, we will support schemaless sources in RisingWave: users can ingest raw JSON data, and RisingWave will automatically infer the source schema, so that users can inspect the data schema and define views after the data arrives.
- Web UI 2.0. We will upgrade our web UI to the next generation. This will cover: (1) a comprehensive observability platform monitoring all detailed metrics in a user-friendly interface; (2) an administrative web console for DBAs to easily manage the configuration of both the cluster and database instances.
Long-term objectives (within the next 18 months)
It is always hard to predict the future; one thing we know for sure is that there are many unknowns. Nevertheless, our mission is clear: to make stream processing simple, affordable, and accessible, and to bring stream processing to the next level. To achieve that goal, we will keep exploring the following topics:
- Multi-tenancy. We are witnessing multi-tenant architecture come to dominate cloud infrastructure, as it significantly reduces costs for both cloud vendors and end users. The idea is to share resources among multiple users within a single database instance. To enable such sharing, much work remains in tenant isolation, security control, schema management, and many other areas.
- Machine learning support. A major application of all data engineering pipelines is supporting machine learning. Current machine learning tools are often decoupled from databases. Although this architecture offers the flexibility of using different machine learning tools, it often comes at the price of higher cost and latency: resources are wasted and data freshness suffers in online machine learning. One of our future objectives is to explore how the modern real-time data stack can be co-designed and seamlessly integrated with existing machine learning tools, so that both data freshness and the efficiency of model training and inference are preserved.
- Adaptive stream processing. All database systems support query optimization: the query optimizer collects statistics over the data and searches for the best query plan when users issue their queries. For streaming databases, however, the story is different. A streaming pipeline is pre-defined when materialized views are created, before the data is actually ingested into the system. The technical challenge is how to automatically adapt the streaming pipeline on the fly, letting the streaming engine learn directly from the data how to fit fluctuating streaming workloads without human intervention.
- Smart cloud scheduler. RisingWave is designed to take advantage of diverse and virtually unlimited resources on the cloud, or even across multiple clouds! Chances are, we can build a smart cloud scheduler that monitors the status of streaming tasks and automatically scales RisingWave to meet users’ SLAs at the best configuration. Our long-term objective is to use state-of-the-art machine learning models to learn the best and most cost-efficient scaling policy from the workload, instead of relying on fixed manual policies. This would provide a serverless experience for stream processing.
RisingWave is a next-generation cloud-native streaming database designed to democratize stream processing. In this blog, we discussed how we plan to approach our objectives. This is not just an internal plan: we are making the project’s objectives transparent so we can hear feedback from the community.
The next wave of stream processing is coming. To learn more, check out the RisingWave GitHub repository, join our Slack community, follow us on Twitter, and stay tuned!