Stream Processing Systems

Stream processing research has come a long way and streaming systems have matured considerably since their invention, almost 3 decades ago. What will the next generation of stream processing systems look like? We envision next-generation of stream processing systems that are not only scalable and reliable, but also capable of self-management and automatic reconfiguration without downtime.

Self-Managed Stream Processing Systems

Many cloud providers, including Amazon Web Services, Google Cloud, and Azure provide “Data Stream Analytics as a Service” products, to allow non-expert users to easily and efficiently extract knowledge from data streams. These services offer automatic scaling capabilities for applications with dynamic workloads, relieving the users from the burden to manage and configure stream processing pipelines. However, this automation comes at a cost for the non-expert user since the Cloud provider is incentivized to over-provision computations. We are conducting a study to uncover the automatic scaling policy models of various Cloud providers and quantify the extra cost they impose on users. Our goal is to develop novel self-managed stream processing systems, capable of automatic operation, optimization, and reconfiguration, that can maintain cost-efficiency without compromising performance.

Workload-Aware State Management
Modern streaming systems rely on persistent KV stores to perform stateful processing on data streams. Although the choice of the state store is crucial for the system’s performance, there has been little research in designing state stores tailored for streaming workloads. Streaming systems use general-purpose KV stores, such as RocksDB, to manage state. Being oblivious to workload characteristics of streaming applications, such stores incur unnecessary overheads. We have been conducting a thorough study of streaming state workloads to further the understanding of their characteristics and differences from traditional workloads. We are developing a new benchmark that can faithfully mimic streaming state workloads and enables researchers to easily evaluate alternative store designs. Our long-term goal is to design and develop workload-aware streaming state management to improve the latency and throughput of streaming analytics.

Graph Streaming

Graph streams can represent continuous interactions and evolving relationships in a variety of datasets. The structure of social networks gradually changes as new friendship relations are formed and others disappear, online discussion networks grow as users interact with each other, and financial transaction networks expand with every new purchase. The streaming nature of real-world graphs poses a challenge to emerging Machine Learning (ML) applications to train adaptive models, capable of resilient continuous learning.

Can we learn representations of streaming graphs efficiently and accurately?
How can we limit retraining to a fixed, representative subset of the graph stream to avoid catastrophic forgetting while ensuring fast and scalable retraining?

Systems for large-scale graph analytics and streaming graph ML

Nowadays, graph ML makes up for the lion’s share of use cases for ML applications. For instance, handling social networks or building recommendation systems brings the need of manipulating complex data structures like graphs. Both academia and industry strive to find effective ways to use graphs as inputs to their ML models for applications such as node classification or link prediction. To address this problem, Graph Neural Networks (GNNs) have been proposed as a method to turn the cumbersome structure of graphs into a concise form known as node embeddings based on both the structural similarity of the local neighborhood of each inspected node. Although GNNs have been a viable answer to the above issue, numerous research questions have arisen. What if the initial graph does not fit the memory or even worse the graph is given in a form of a real-time stream? How should we transform the training phase of GNNs to handle the limited memory space and reduce the total I/Os to the disk? Does the architecture of an SSD disk help us to store and access efficiently the required graph data in each training step?

Secrecy: Secure collaborative analytics on secret-shared data

The current digital era is data centric and while recent advances in computational power has lead to feasible processing of this data, there are some scenarios in which we can not process this data due to maybe conflicting interests or maybe the need to add to more guarantees against security breaches. Multi-party computing (MPC) based systems can help in such situations. MPC is useful in scenarios when some parties have common goal to compute something together but still share the same concern about possible data leakage. We aim to build privacy conserving systems that is computationally practical for real use cases. For example, we have built Secrecy which is a framework to plan and execute SQL queries using MPC protocols. As shown in the figure, Secrecy has three computing parties deployed on different cloud providers, where each party alone can not know the data they are operating on but can only share with the other two parties in a protocol to compute the needed function securely and get the final result.