Systems for Secure Computation

Groundbreaking research in the cryptography community has showed that any function can be securely computed by a group of participants in a distributed fashion such that each party learns the function’s output and nothing more. This fascinating idea started as a theoretical curiosity in the 1980s but since then it has evolved into a useful tool to realize use cases like the gender wage gap study in the Boston area. A series of recent theoretical advances in the field have further supported the belief that provably secure computation can perhaps become as practical and ubiquitous as public-key cryptography. Some of these advances include efficient multi-party computation protocols, fully homomorphic encryption schemes, zero-knowledge proofs, and oblivious RAM. Supporting end-to-end secure computation with practical performance requires rethinking the entire systems stack: from the programming abstractions all the way down to the hardware. Below are two recent results of our research in this area.

Secure relational analytics 

SECRECY (NSDI’23) is a novel system for privacy-preserving collaborative analytics as a service. SECRECY allows multiple data holders to contribute their data towards a joint analysis in the cloud, while keeping the data siloed even from the cloud providers. At the same time, it enables cloud providers to offer their services to clients who would have otherwise refused to perform a computation altogether or insisted that it be done on private infrastructure. SECRECY ensures no information leakage and provides provable security guarantees by employing cryptographically secure Multi-Party Computation (MPC). SECRECY’s core novelty is a generic cost-based optimization framework for relational MPC that does not rely on trusted infrastructure.

Read our paper for more details.

Secure time series analytics

TVA (Security’23) is a novel system for secure and expressive time series analysis under MPC. Similarly to SECRECY, TVA protects data by distributing trust and generalizes the functionality of prior systems without compromising security or performance. To mitigate the tension between these two, TVA employs new protocols, cross-layer optimizations, and vectorized primitives that enable it to scale to large datasets with modest use of resources. TVA is the first system to support arbitrary composition of oblivious time window operators, keyed aggregations, and multiple filter predicates, while keeping all data attributes private, including record timestamps and user-defined values in query predicates.

Read our paper for more details.

Stream Processing Systems

Stream processing research has come a long way and streaming systems have matured considerably since their invention, almost 3 decades ago. What will the next generation of stream processing systems look like? We envision next-generation of stream processing systems that are not only scalable and reliable, but also capable of self-management and automatic reconfiguration without downtime.

Self-Managed Stream Processing Systems

Many cloud providers, including Amazon Web Services, Google Cloud, and Azure provide “Data Stream Analytics as a Service” products, to allow non-expert users to easily and efficiently extract knowledge from data streams. These services offer automatic scaling capabilities for applications with dynamic workloads, relieving the users from the burden to manage and configure stream processing pipelines. However, this automation comes at a cost for the non-expert user since the Cloud provider is incentivized to over-provision computations. We are conducting a study to uncover the automatic scaling policy models of various Cloud providers and quantify the extra cost they impose on users. Our goal is to develop novel self-managed stream processing systems, capable of automatic operation, optimization, and reconfiguration, that can maintain cost-efficiency without compromising performance.

Workload-Aware State Management

Modern streaming systems rely on persistent KV stores to perform stateful processing on data streams. Although the choice of the state store is crucial for the system’s performance, there has been little research in designing state stores tailored for streaming workloads. Streaming systems use general-purpose KV stores, such as RocksDB, to manage state. Being oblivious to workload characteristics of streaming applications, such stores incur unnecessary overheads. We have been conducting a thorough study of streaming state workloads to further the understanding of their characteristics and differences from traditional workloads. We are developing a new benchmark that can faithfully mimic streaming state workloads and enables researchers to easily evaluate alternative store designs. Our long-term goal is to design and develop workload-aware streaming state management to improve the latency and throughput of streaming analytics.

Graph Streaming

Graph streams can represent continuous interactions and evolving relationships in a variety of datasets. The structure of social networks gradually changes as new friendship relations are formed and others disappear, online discussion networks grow as users interact with each other, and financial transaction networks expand with every new purchase. The streaming nature of real-world graphs poses a challenge to emerging Machine Learning (ML) applications to train adaptive models, capable of resilient continuous learning.

Can we learn representations of streaming graphs efficiently and accurately?
How can we limit retraining to a fixed, representative subset of the graph stream to avoid catastrophic forgetting while ensuring fast and scalable retraining?

Systems for large-scale graph analytics and streaming graph ML

Nowadays, graph ML makes up for the lion’s share of use cases for ML applications. For instance, handling social networks or building recommendation systems brings the need of manipulating complex data structures like graphs. Both academia and industry strive to find effective ways to use graphs as inputs to their ML models for applications such as node classification or link prediction. To address this problem, Graph Neural Networks (GNNs) have been proposed as a method to turn the cumbersome structure of graphs into a concise form known as node embeddings based on both the structural similarity of the local neighborhood of each inspected node. Although GNNs have been a viable answer to the above issue, numerous research questions have arisen. What if the initial graph does not fit the memory or even worse the graph is given in a form of a real-time stream? How should we transform the training phase of GNNs to handle the limited memory space and reduce the total I/Os to the disk? Does the architecture of an SSD disk help us to store and access efficiently the required graph data in each training step?