A curated portfolio of data systems projects focused on performance-aware design,
scalability, and real-world data processing.
These projects explore how modern data platforms are built, from relational query
optimization to distributed analytics and storage-engine internals.
The repository demonstrates hands-on experience with database design, query execution, distributed computation, spatial analytics, and NoSQL storage systems, using PostgreSQL, Apache Spark, and RocksDB.
Design and optimization of a relational database using PostgreSQL, emphasizing schema modeling, relationships, constraints, and efficient query execution. The project highlights practical aspects of SQL performance tuning and data organization.
Implementation of distributed spatial queries using Apache Spark and SparkSQL. Includes range queries, distance queries, and spatial joins implemented through custom UDFs, showcasing scalable geospatial data processing.
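For readers unfamiliar with these query types, the core predicates can be sketched in a few lines of framework-independent C++. This is an illustration only, not the project's Spark UDF code: the names `Point`, `Rect`, `contains`, `within_distance`, and `join_points_to_rects`, the inclusive boundaries, and the Euclidean metric are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not taken from the project.
struct Point { double x, y; };
struct Rect  { double x1, y1, x2, y2; };  // two opposite corners, in any order

// Range-query predicate: does r contain p? Boundary points count as inside
// (an assumption; the project may define this differently).
inline bool contains(const Rect& r, const Point& p) {
    const double min_x = std::fmin(r.x1, r.x2), max_x = std::fmax(r.x1, r.x2);
    const double min_y = std::fmin(r.y1, r.y2), max_y = std::fmax(r.y1, r.y2);
    return p.x >= min_x && p.x <= max_x && p.y >= min_y && p.y <= max_y;
}

// Distance-query predicate: are a and b at most d apart (Euclidean)?
inline bool within_distance(const Point& a, const Point& b, double d) {
    assert(d >= 0.0);
    return std::hypot(a.x - b.x, a.y - b.y) <= d;
}

// Naive nested-loop spatial join: every (rect index, point index) pair where
// the rectangle contains the point. A distributed engine would evaluate the
// same predicate across partitions instead of in a single loop.
inline std::vector<std::pair<std::size_t, std::size_t>>
join_points_to_rects(const std::vector<Rect>& rects, const std::vector<Point>& points) {
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::size_t i = 0; i < rects.size(); ++i)
        for (std::size_t j = 0; j < points.size(); ++j)
            if (contains(rects[i], points[j])) matches.emplace_back(i, j);
    return matches;
}
```

In a Spark setting, predicates like these would typically be registered as UDFs and referenced from a SQL `WHERE` or join condition.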
A distributed spatiotemporal hotspot detection pipeline using Apache Spark. The project applies the Getis-Ord Gi* statistic over space-time grids to identify statistically significant activity clusters at scale.
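As a rough sketch of the statistic itself, the following single-machine C++ function computes the Gi* z-score for one cell under binary weights (weight 1 for the cell and its neighbors, 0 elsewhere). It is not the project's Spark pipeline. The function name `gi_star`, the flat-vector grid representation, and the binary weighting are assumptions; in a space-time grid the neighborhood would commonly be the cell plus its adjacent cells in x, y, and time, which is also an assumption here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Getis-Ord Gi* z-score for a single cell, assuming binary weights.
//   x            : attribute value (e.g. event count) of every cell in the grid
//   neighborhood : indices of the target cell and its neighbors (weight 1);
//                  every other cell has weight 0
// The result is not finite when all values are equal (zero variance) or when
// the neighborhood covers the whole grid, because the denominator is zero.
double gi_star(const std::vector<double>& x,
               const std::vector<std::size_t>& neighborhood) {
    assert(x.size() > 1 && !neighborhood.empty());
    const double n = static_cast<double>(x.size());

    double sum = 0.0, sum_sq = 0.0;
    for (double v : x) { sum += v; sum_sq += v * v; }
    const double mean = sum / n;
    const double s = std::sqrt(sum_sq / n - mean * mean);  // global std. deviation

    double local_sum = 0.0;                                // sum_j w_ij * x_j
    for (std::size_t j : neighborhood) local_sum += x.at(j);

    // With binary weights, sum_j w_ij and sum_j w_ij^2 both equal the
    // neighborhood size.
    const double w = static_cast<double>(neighborhood.size());
    const double denom = s * std::sqrt((n * w - w * w) / (n - 1.0));
    return (local_sum - mean * w) / denom;
}
```

A large positive score marks a candidate hot spot and a large negative score a cold spot; significance is then judged against the standard normal distribution.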
A C++-based embedded key-value storage layer built on RocksDB. Demonstrates core NoSQL storage concepts such as batch ingestion, multi-key retrieval, range scans, and persistent data management using an LSM-tree architecture.
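The three access patterns named above can be illustrated with a small in-memory stand-in (C++17). This sketch is not RocksDB and not the project's code: `ToyKvStore` and its methods are hypothetical, it uses a sorted `std::map` in place of an LSM-tree, and it has no persistence. In RocksDB the corresponding operations are roughly a `WriteBatch` applied with `DB::Write`, `DB::MultiGet`, and an `Iterator` positioned with `Seek`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a key-value layer; illustration only.
class ToyKvStore {
public:
    using Entry = std::pair<std::string, std::string>;

    // Batch ingestion: apply a group of writes in one call.
    // If a key appears more than once, the last write wins.
    void put_batch(const std::vector<Entry>& entries) {
        for (const auto& entry : entries) data_[entry.first] = entry.second;
    }

    // Multi-key retrieval: one result slot per requested key, in request
    // order; a missing key yields std::nullopt.
    std::vector<std::optional<std::string>>
    multi_get(const std::vector<std::string>& keys) const {
        std::vector<std::optional<std::string>> out;
        out.reserve(keys.size());
        for (const auto& key : keys) {
            const auto it = data_.find(key);
            if (it == data_.end()) out.push_back(std::nullopt);
            else out.push_back(it->second);
        }
        return out;
    }

    // Range scan over the half-open key interval [start, end), in key order.
    std::vector<Entry> scan(const std::string& start, const std::string& end) const {
        assert(!(end < start));
        std::vector<Entry> out;
        for (auto it = data_.lower_bound(start);
             it != data_.end() && it->first < end; ++it) {
            out.emplace_back(it->first, it->second);
        }
        return out;
    }

private:
    std::map<std::string, std::string> data_;  // keys kept in sorted order
};
```

Keeping keys sorted is what makes the range scan cheap: the scan seeks to the first key at or after `start` and walks forward, which mirrors how an ordered iterator is used in an LSM-tree store.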
PostgreSQL • Apache Spark • SparkSQL • Scala • C++ • RocksDB • SQL • Distributed Systems
Datasets are not included because of size and licensing constraints. Each project's README describes the expected input formats and how to run the code.