Research

Subarna Chatterjee

Software Engineer (R&D), DataStax

In my research career so far, I have worked on the following topics.

Self-Designing Storage Engines

We show that existing storage systems fail to scale in the face of a combined challenge: growing application diversity and growing data sizes, which in turn result in growing cloud budgets. The source of the problem lies in the inherent complexity of data system design, the opacity of cloud infrastructures, and the numerous metrics and factors that affect performance and cloud cost. As a result, organizations, systems administrators, and even expert data system designers cannot predict how a specific combination of a key-value store design, a cloud provider (its pricing policies and hardware), and a specific workload (data and queries) will ultimately behave in terms of end-to-end performance and cloud-cost requirements.

We present a self-designing key-value storage engine, Cosine, which can always take the shape of a close-to-“perfect” engine architecture given an input workload, a cloud budget, a target performance, and required cloud SLAs. By identifying and formalizing the first principles of storage engine layouts and core key-value algorithms, Cosine constructs a massive design space comprising sextillions (10^36) of possible storage engine designs over a diverse space of hardware and cloud pricing policies for three cloud providers – AWS, GCP, and Azure. Cosine spans diverse designs such as Log-Structured Merge-trees, B-trees, Log-Structured Hash-tables, and in-memory accelerators for filters and indexes, as well as trillions of hybrid designs that do not appear in the literature or industry but emerge as valid combinations of the above. Cosine includes a unified distribution-aware I/O model and a learned concurrency-aware CPU model that can calculate, with high accuracy, the performance and cloud cost of any possible design on any workload and virtual machine. Cosine can then search through that space in a matter of seconds to find the best design, and it materializes the actual code of the resulting storage engine design using a templated Rust implementation. We demonstrate that, on average, Cosine outperforms state-of-the-art storage engines such as write-optimized RocksDB, read-optimized WiredTiger, and very write-optimized FASTER by 53x, 25x, and 20x, respectively, for diverse workloads, data sizes, and cloud budgets across all YCSB core workloads and many variants.

Exploring range filters in key-value stores

Key-value stores are everywhere today. Although these stores are fundamentally designed to support point reads and writes, the growing diversity of applications means that contemporary key-value stores need to support range queries as well. This project focuses on Log-Structured Merge (LSM)-based key-value stores, which are widely used across diverse industries to serve numerous applications. While LSM-based key-value stores support efficient writes and point queries, the core design of LSM-trees makes range queries expensive: we cannot rule out reading the data blocks that cover a target key range at any level of the tree. This leads to sub-optimal performance caused by significantly high I/O cost. In this work, we introduce Rosetta, a probabilistic range filter designed specifically for LSM-tree-based key-value stores. The core intuition is that we can sacrifice filter probe time because it is not visible in end-to-end key-value store performance, which in turn allows us to significantly reduce the filter false positive rate for every level of the tree.
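To make the idea of a probabilistic range filter concrete, here is a minimal Python sketch of one generic construction: a Bloom filter that stores every binary prefix of each key, probed top-down for a range. It is an illustration under stated assumptions and should not be read as Rosetta's actual implementation. The class names, the 16-bit key width (KEY_BITS), and the filter sizing are all hypothetical choices made for this example.

```python
import hashlib

KEY_BITS = 16  # assumed fixed key width for this sketch


class BloomFilter:
    """Plain Bloom filter over byte strings (sizes are illustrative)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive independent hash positions by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class PrefixRangeFilter:
    """Range filter that indexes every binary prefix of each inserted key."""

    def __init__(self):
        self.bloom = BloomFilter()

    @staticmethod
    def _encode(prefix, length):
        # Tag the prefix with its length so equal values at different
        # depths do not collide.
        return length.to_bytes(1, "big") + prefix.to_bytes(4, "big")

    def add(self, key):
        for length in range(1, KEY_BITS + 1):
            self.bloom.add(self._encode(key >> (KEY_BITS - length), length))

    def might_contain_range(self, lo, hi):
        """False means no inserted key lies in [lo, hi]; True means maybe."""
        return self._probe(0, 0, lo, hi)

    def _probe(self, prefix, length, lo, hi):
        span = KEY_BITS - length
        first = prefix << span            # smallest key under this prefix
        last = first + (1 << span) - 1    # largest key under this prefix
        if last < lo or first > hi:
            return False                  # prefix does not overlap the range
        if length > 0 and not self.bloom.might_contain(
            self._encode(prefix, length)
        ):
            return False                  # definitely no key under this prefix
        if length == KEY_BITS:
            return True                   # a full key in range may exist
        # Spend extra probes on both children to weed out false positives.
        return self._probe(prefix << 1, length + 1, lo, hi) or self._probe(
            (prefix << 1) | 1, length + 1, lo, hi
        )
```

The sketch trades probe time for accuracy in the spirit described above: a positive at a short prefix is not trusted until the longer prefixes beneath it also test positive. A store would consult such a filter per level before reading data blocks, skipping any level for which might_contain_range returns False.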

Data stream processing for IoT applications

A streaming platform cannot be categorized as “good” or “bad”, as every platform is designed differently to process specific stream types. Each platform is unique in the nature of the data sets it handles, its message processing semantics, and its hardware expectations. Therefore, it is extremely difficult, yet crucial, to judiciously select a streaming platform that not only suits the application requirements but also fits within the resource offerings of the physical infrastructure of the host organization (or an end-user). To address this issue, I focused on a thorough comparative analysis of three data stream processing platforms, chosen for their potential to process both streams and batches in real time.

Sensor-cloud infrastructure

This research focused on the conceptualization and implementation of a new Internet of Things (IoT)-cloud infrastructure, called sensor-cloud infrastructure, that renders a novel service called Sensors-as-a-Service (Se-aaS). Sensor-cloud platforms emerged in response to the limitations of conventional Wireless Sensor Networks (WSNs): WSN-based applications are generally single-user centric, in that a user-organization owns and deploys its own sensor network and typically does not share the accessed data with another party (user/organization). Thus, generally, only user-organizations that own a sensor network have satisfactory access to sensor data. To resolve these limitations and make sensor technology widely available to the general public, a newly proposed architecture, namely the sensor-cloud architecture, was being considered as a potential substitute for traditional WSNs. When I started my research back in 2013, several works were being initiated in this direction, focusing primarily on the principles, the dogma, and the challenges involved in this paradigm shift from WSNs to sensor-cloud platforms. However, the technicalities required from an implementation perspective, including the theoretical modeling, experimental analysis, architectural designs, and development of this platform, were yet to be explored. Therefore, I focused on resolving the principal technical challenges associated with the complete conceptualization of sensor-cloud and eventually built a fully functional prototype of the sensor-cloud infrastructure.

Cloud-assisted smart healthcare

This research focused on the domain of cloud-assisted smart healthcare, as it is extremely relevant from an Indian (developing country) perspective. In order to ensure an optimal routing strategy, the work establishes an architecture in which health data from patients are transmitted from wearable body sensor nodes (which sense and report physiological parameters of the human body) to cloud servers for remote analysis and diagnosis. Later, this architecture was revisited from an ambulatory healthcare perspective in collaboration with the All India Institute of Medical Sciences (AIIMS), Bhubaneswar, and B. C. Roy Hospital, India. The work matured further as we proposed a privacy-aware ambulatory healthcare system using wireless body area networks (WBANs).