Stream Processing Systems

Stream processing research has come a long way and streaming systems have matured considerably since their invention, almost 3 decades ago. What will the next generation of stream processing systems look like? We envision next-generation of stream processing systems that are not only scalable and reliable, but also capable of self-management and automatic reconfiguration without downtime.

Self-Managed Stream Processing Systems

Many cloud providers, including Amazon Web Services, Google Cloud, and Azure provide “Data Stream Analytics as a Service” products, to allow non-expert users to easily and efficiently extract knowledge from data streams. These services offer automatic scaling capabilities for applications with dynamic workloads, relieving the users from the burden to manage and configure stream processing pipelines. However, this automation comes at a cost for the non-expert user since the Cloud provider is incentivized to over-provision computations. We are conducting a study to uncover the automatic scaling policy models of various Cloud providers and quantify the extra cost they impose on users. Our goal is to develop novel self-managed stream processing systems, capable of automatic operation, optimization, and reconfiguration, that can maintain cost-efficiency without compromising performance.

Workload-Aware State Management
Modern streaming systems rely on persistent KV stores to perform stateful processing on data streams. Although the choice of the state store is crucial for the system’s performance, there has been little research in designing state stores tailored for streaming workloads. Streaming systems use general-purpose KV stores, such as RocksDB, to manage state. Being oblivious to workload characteristics of streaming applications, such stores incur unnecessary overheads. We have been conducting a thorough study of streaming state workloads to further the understanding of their characteristics and differences from traditional workloads. We are developing a new benchmark that can faithfully mimic streaming state workloads and enables researchers to easily evaluate alternative store designs. Our long-term goal is to design and develop workload-aware streaming state management to improve the latency and throughput of streaming analytics.

Graph Streaming

Graph streams can represent continuous interactions and evolving relationships in a variety of datasets. The structure of social networks gradually changes as new friendship relations are formed and others disappear, online discussion networks grow as users interact with each other, and financial transaction networks expand with every new purchase. The streaming nature of real-world graphs poses a challenge to emerging Machine Learning (ML) applications to train adaptive models, capable of resilient continuous learning.

Can we learn representations of streaming graphs efficiently and accurately?
How can we limit retraining to a fixed, representative subset of the graph stream to avoid catastrophic forgetting while ensuring fast and scalable retraining?

Systems for large-scale graph analytics and streaming graph ML

Nowadays, graph ML makes up for the lion’s share of use cases for ML applications. For instance, handling social networks or building recommendation systems brings the need of manipulating complex data structures like graphs. Both academia and industry strive to find effective ways to use graphs as inputs to their ML models for applications such as node classification or link prediction. To address this problem, Graph Neural Networks (GNNs) have been proposed as a method to turn the cumbersome structure of graphs into a concise form known as node embeddings based on both the structural similarity of the local neighborhood of each inspected node. Although GNNs have been a viable answer to the above issue, numerous research questions have arisen. What if the initial graph does not fit the memory or even worse the graph is given in a form of a real-time stream? How should we transform the training phase of GNNs to handle the limited memory space and reduce the total I/Os to the disk? Does the architecture of an SSD disk help us to store and access efficiently the required graph data in each training step?

SECRECY: Secure collaborative analytics in untrusted clouds

SECRECY is a system for privacy-preserving collaborative analytics as a service. SECRECY allows multiple data holders to contribute their data towards a joint analysis in the cloud, while keeping the data siloed even from the cloud providers. At the same time, it enables cloud providers to offer their services to clients who would have otherwise refused to perform a computation altogether or insisted that it be done on private infrastructure. SECRECY ensures no information leakage and provides provable security guarantees by employing cryptographically secure Multi-Party Computation (MPC).

In SECRECY we take a novel approach to optimizing MPC execution by co-designing multiple layers of the system stack and exposing the MPC costs to the query engine. To achieve practical performance, SECRECY applies physical optimizations that amortize the inherent MPC overheads along with logical optimizations that dramatically reduce the computation, communication, and space requirements during query execution. Our multi-cloud experiments demonstrate that SECRECY improves query performance by over 1000× compared to existing approaches and computes complex analytics on millions of data records with modest use of resources.

Read our upcoming NSDI’23 paper for more details.