Personal repository of ==insights and annotations from research, academic, and white papers== that I have read, as well as those I plan to explore in the future

Papers primarily focus on ==Distributed Systems==, Database Systems, and ==Operating Systems==, which are analogous to each other

An interesting observation:

Turing Awardees from 2013 to 2016 worked in domains similar to the ones I’m interested in (and in the same order, too!)

Lamport in Distributed Systems

Known for Paxos Consensus, Lamport Timestamps

Stonebraker in Database Systems

Known for Ingres, PostgreSQL

Diffie & Hellman in Cryptography

Known for the Diffie-Hellman Key Exchange Protocol

(whereas RSA authors won in 2002, the year I was born!)

Berners-Lee in Web Development

Known for World Wide Web
Papershelf
To Be Read
My Reading Setup
How I Read [WIP]
Inception Of Sources

Papershelf

This is the papershelf, where anyone can find all the papers I’ve read so far, along with my notes.

Best thing? Anyone can search, sort, or filter by any property. I use:

search for titles

sort for year (to get an idea on timeline of papers)

filter for Authors or Organizations

Square brackets [ ] in a title mark text that is not part of the original title (added explicitly for easier search)

The view below loads only the first 10 records to keep things tidy. To see the complete papershelf as a full page, go to gowthamkalla.com/papershelf or (on desktop) click the current view header (Shelf / Kanban / Timeline) and then “Open as full page”

Title	PDF	Annotated PDF	Year	Authors	Organizations	Categories
MapReduce- Simplified Data Processing on Large Clusters	https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf	http://gowthamkalla.com/drive/papers/mapreduce	2004	Jeffrey Dean, Sanjay Ghemawat	Google	Big Data, Distributed Computing
The Google File System GFS	https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf	http://gowthamkalla.com/drive/papers/gfs	2003	Howard Gobioff, Sanjay Ghemawat, Shun-Tak Leung	Google	Data Storage, Distributed Computing
In Search of an Understandable Consensus Algorithm Raft	https://raft.github.io/raft.pdf	http://gowthamkalla.com/drive/papers/raft	2013	Diego Ongaro, John Ousterhout	Stanford University	Consensus, Distributed Computing
The Chubby lock service for loosely-coupled distributed systems	https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf	http://gowthamkalla.com/drive/papers/chubby	2006	Mike Burrows	Google	Consensus, Distributed Computing
ZooKeeper- Wait-free coordination for Internet-scale systems	https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf	http://gowthamkalla.com/drive/papers/zookeeper	2010	Benjamin Reed, Flavio P. Junqueira, Mahadev Konar, Patrick Hunt	Yahoo	Consensus, Distributed Computing
Bigtable- A Distributed Storage System for Structured Data	https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf	http://gowthamkalla.com/drive/papers/bigtable	2006	Andrew Fikes, Deborah A. Wallach, Fay Chang, Jeffrey Dean, Mike Burrows, Robert E. Gruber, Sanjay Ghemawat, Tushar Chandra, Wilson C. Hsieh	Google	Database System, Distributed Computing
Spanner- Google’s Globally-Distributed Database	https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf	http://gowthamkalla.com/drive/papers/spanner	2012	Alexander Lloyd, Andrew Fikes, Andrey Gubarev, Christopher Frost, Christopher Heiser, Christopher Taylor, Dale Woodford, David Mwaura, David Nagle, Eugene Kogan, Hongyi Li, JJ Furman, James C. Corbett, Jeffrey Dean, Lindsay Rolig, Michael Epstein, Michal Szymaniak, Peter Hochschild, Rajesh Rao, Ruth Wang, Sanjay Ghemawat, Sean Quinlan, Sebastian Kanthak, Sergey Melnik, Wilson C. Hsieh, Yasushi Saito	Google	Database System, Distributed Computing
Dynamo- Amazon’s Highly Available Key-value Store	https://assets.amazon.science/ac/1d/eb50c4064c538c8ac440ce6a1d91/dynamo-amazons-highly-available-key-value-store.pdf	http://gowthamkalla.com/drive/papers/dynamo	2007	Alex Pilchin, Avinash Lakshman, Deniz Hastorun, Giuseppe DeCandia, Gunavardhan Kakulapati, Madan Jampani, Peter Vosshall, Swaminathan Sivasubramanian, Werner Vogels	AWS, Amazon	Data Storage, Distributed Computing
Scaling Memcache at Facebook	https://research.facebook.com/file/839620310074473/scaling-memcache-at-facebook.pdf	http://gowthamkalla.com/drive/papers/memcached-fb	2013	Daniel Peek, David Stafford, Hans Fugal, Harry C. Li, Herman Lee, Marc Kwiatkowski, Mike Paleczny, Paul Saab, Rajesh Nishtala, Ryan McElroy, Steven Grimm, Tony Tung, Venkateshwaran Venkataramani	Facebook	Data Storage, Distributed Computing
On-demand Container Loading in AWS Lambda	https://www.usenix.org/system/files/atc23-brooker.pdf	http://gowthamkalla.com/drive/papers/aws-lambda	2023	Chris Greenwood, Marc Brooker, Mike Danilov, Phil Piwonka	AWS, Amazon	Distributed Computing, Serverless
Firecracker- Lightweight Virtualization for Serverless Applications	https://www.usenix.org/system/files/nsdi20-paper-agache.pdf	http://gowthamkalla.com/drive/papers/firecracker	2020	Alexandra Iordache, Alexandru Agache, Andreea Florescu, Anthony Liguori, Diana-Maria Popa, Marc Brooker, Phil Piwonka, Rolf Neugebauer	AWS, Amazon	Distributed Computing, Serverless
Kafka- a Distributed Messaging System for Log Processing	https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf	http://gowthamkalla.com/drive/papers/kafka	2011	Jay Kreps, Jun Rao, Neha Narkhede	LinkedIn	Data Streaming, Distributed Computing
Cassandra - A Decentralized Structured Storage System	https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf	http://gowthamkalla.com/drive/papers/cassandra	2009	Avinash Lakshman, Prashant Malik	Facebook	Database System, Distributed Computing
Amazon DynamoDB- A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service	https://www.usenix.org/system/files/atc22-elhemali.pdf	http://gowthamkalla.com/drive/papers/dynamodb	2022	Akhilesh Mritunjai, Akshat Vig, Colin Lazier, Doug Terry, Erben Mo, James Christopher Sorenson III, Joseph Idziorek, Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Richard Krog, Somu Perianayagam, Sroaj Sosothikul, Swaminathan Sivasubramanian, Tim Rath	AWS, Amazon	Database System, Distributed Computing
Secure Untrusted Data Repository (SUNDR)	https://www.usenix.org/legacy/event/osdi04/tech/full_papers/li_j/li_j.pdf	http://gowthamkalla.com/drive/papers/sundr	2004	David Mazières, Dennis Shasha, Jinyuan Li, Maxwell Krohn	[NYU] New York University	Cryptography, Data Security, Data Storage, Distributed Computing, Networking
Bitcoin- A Peer-to-Peer Electronic Cash System	https://bitcoin.org/bitcoin.pdf	http://gowthamkalla.com/drive/papers/bitcoin	2008	Satoshi Nakamoto	Bitcoin	Distributed Computing, Peer-to-peer
Apache Spark- A Unified Engine for Big Data Processing	https://www.databricks.com/sites/default/files/2018/12/Apache-Spark-A-Unified-Engine-for-Big-Data-Processing.pdf	http://gowthamkalla.com/drive/papers/spark-article	2016		Databricks	Big Data, Data Processing, Distributed Computing
Spark- Cluster Computing with Working Sets	https://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf	http://gowthamkalla.com/drive/papers/spark	2010	Ion Stoica, Matei Zaharia, Michael J. Franklin, Mosharaf Chowdhury, Scott Shenker	[UCB] University of California Berkeley	Big Data, Data Processing, Distributed Computing
Resilient Distributed Datasets- A Fault-Tolerant Abstraction for In-Memory Cluster Computing Spark	https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf	http://gowthamkalla.com/drive/papers/spark-rdd	2012	Ankur Dave, Ion Stoica, Justin Ma, Matei Zaharia, Michael J. Franklin, Mosharaf Chowdhury, Murphy McCauley, Scott Shenker, Tathagata Das	[UCB] University of California Berkeley	Big Data, Data Processing, Distributed Computing
Consistent Hashing and Random Trees- Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web	https://www.cs.princeton.edu/courses/archive/fall09/cos518/papers/chash.pdf		1997	Daniel Lewin, David Karger, Eric Lehman, Matthew Levine, Rina Panigrahy, Tom Leighton	[MIT] Massachusetts Institute of Technology	Algorithms, Distributed Computing
Ray- A Distributed Framework for Emerging AI Applications	https://www.usenix.org/system/files/osdi18-moritz.pdf		2018		[UCB] University of California Berkeley	Data Processing, Distributed Computing, Machine Learning
Ownership- A Distributed Futures System for Fine-Grained Tasks Ray	https://www.usenix.org/system/files/nsdi21-wang.pdf		2021		Anyscale, [UCB] University of California Berkeley	Distributed Computing

To Be Read

(I prioritize and queue only around 5 papers as To Be Read in the shelf, and keep the rest listed below)

Distributed Systems

(Inspired by 6.824 Distributed Systems)

Bluesky and the AT Protocol- Usable Decentralized Social Media

Swarm- Cost-Efficient Video Content Distribution with a Peer-to-Peer System

Apache Flink- Stream and Batch Processing in a Single Engine

Naiad- A Timely Dataflow System

Samza- Stateful Scalable Stream Processing at LinkedIn

Storm @Twitter

Dryad- Distributed Data-Parallel Programs from Sequential Building Blocks

The Hadoop Distributed File System HDFS

Boki- Stateful Serverless Computing with Shared Logs

Grove- a Separation-Logic Library for Verifying Distributed Systems

Chardonnay- Fast and General Datacenter Transactions for On-Disk Databases

Chord- A Scalable Peer-to-peer Lookup Service for Internet Applications

Mesos- A Platform for Fine-Grained Resource Sharing in the Data Center

Large-scale cluster management at Google with Borg

MillWheel- Fault-Tolerant Stream Processing at Internet Scale

No compromises- distributed transactions with consistency, availability, and performance FaRM

Ethereum- A Next-Generation Smart Contract and Decentralized Application Platform

Tango- Distributed Data Structures over a Shared Log

Chain Replication for Supporting High Throughput and Availability

Photon- Fault-tolerant and Scalable Joining of Continuous Data Streams

Paxos Made Live - An Engineering Perspective

CORFU- A Shared Log Design for Flash Clusters

Wormhole- Reliable Pub-Sub to Support Geo-replicated Internet Services

A simple totally ordered broadcast protocol ZAB Zookeeper Atomic Broadcast

Academic

(Inspired by Lamport’s Publications)

Time, Clocks, and the Ordering of Events in a Distributed System

Paxos

The Part-Time Parliament Paxos

Paxos Made Simple

Cheap Paxos

Fast Paxos

Vertical Paxos and Primary-Backup Replication

The Byzantine Generals Problem

The Temporal Logic of Actions

Practical Byzantine Fault Tolerance pBFT

Impossibility of Distributed Consensus with One Faulty Process FLP

Viewstamped Replication- A New Primary Copy Method to Support Highly-Available Distributed Systems

Conflict-free Replicated Data Types CRDTs

Zab- High-performance broadcast for primary-backup systems Zookeeper Atomic Broadcast

Timely Dataflow- A Model

Database Systems

(Inspired by 15-721 Advanced Database Systems)

What Goes Around Comes Around

What Goes Around Comes Around… And Around…

The Design Of Postgres

The Snowflake Elastic Data Warehouse

Building An Elastic Query Engine on Disaggregated Storage Snowflake

Photon- A Fast Query Engine for Lakehouse Systems

Lakehouse- A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics

Dremel- Interactive Analysis of Web-Scale Datasets BigQuery

Amazon Redshift Re-invented

Citus- Distributed PostgreSQL for Data-Intensive Applications

CockroachDB- The Resilient Geo-Distributed SQL Database

ClickHouse - Lightning Fast Analytics for Everyone

DuckDB- an Embeddable Analytical Database

MotherDuck- DuckDB in the cloud and in the client

TiDB- A Raft-based HTAP Database

FoundationDB- A Distributed Unbundled Transactional Key Value Store

F1- A Distributed SQL Database That Scales

Mesa- Geo-Replicated, Near Real-Time, Scalable Data Warehousing

Megastore- Providing Scalable, Highly Available Storage for Interactive Services

Large-scale Incremental Processing Using Distributed Transactions and Notifications Percolator

Yellowbrick- An Elastic Data Warehouse on Kubernetes

Aerospike- Architecture of a Real-Time Operational DBMS

Magma- A High Data Density Storage Engine Used in Couchbase

Book

Architecture of a Database System

Machine Learning

TensorFlow- A System for Large-Scale Machine Learning

TensorFlow- Large-Scale Machine Learning on Heterogeneous Distributed Systems

Attention Is All You Need

My Reading Setup

Once I have access to a PDF file, I upload it to two folders in my Google Drive:

“Untouched” folder, for raw untouched PDFs (just in case)
“Papers” folder, which is public, for highlighted PDFs

Adobe Acrobat on the web is free and connected to my Google Drive, so I highlight and underline the Drive documents in Acrobat and write my notes here in Notion

I did try Zotero for a while instead of Acrobat, but I wasn’t pleased with the styling of its annotations and comments. Maybe some other time, with a better config!

How I Read [WIP]

https://github.com/papers-we-love/papers-we-love?tab=readme-ov-file#how-to-read-a-paper

http://ccr.sigcomm.org/online/files/p83-keshavA.pdf

I find papers mostly from the sources listed in the section below, and once I’ve chosen a paper, I queue it up by priority in the TBR section

Before I actually start reading a paper, I glance through its sections and subheadings

With this, I can guesstimate the number of pages I’ll need to concentrate on and the time I’ll need to invest

Papers usually close with sections that begin at the benchmarks or performance numbers; from there on, it’s a relaxed read where I just highlight points

I highlight and annotate with just one color, whereas people in academia generally use 3-4 colors to signify different levels

All my annotated PDFs are publicly available together in my drive, and the notes can be opened from the papershelf on this very page

As in the setup above, I open three windows in parallel: the first for Acrobat, the second for Notion + Excalidraw, and the third for general Google searches and GPTs from OpenAI or some other provider

When I start reading, I open a conversation with ChatGPT, tell it which paper I’m reading, and ask it questions about the paper as I go

Sometimes a paragraph can be tough to grasp: the brain reads it super smoothly but won’t be braining. Since these papers (and their related work) have been in the literature for a good while, you can argue with LLMs about them; it’s actually one of the best use cases. So I simply keep asking until I’m satisfied with the answer (it’s been good with answers so far, but do validate the responses)

Inception Of Sources

Distributed Systems

dancres.github.io/pages U awesome-distributed-systems U Papers We Love [Begins: Perfect Start]
macintux/6227368 [TDK: Peak]

Database Systems

Mixed

USENIX(OSDI, NSDI) U Arxiv U Google Scholar [TDK Rises: Likely uninteresting for most]
Maybe ACM?
