Spanner: Google’s Globally-Distributed Database

Pre-Read Thoughts

Heard about Spanner because cockroachdb is inspired by it. (yet another google paper)

Introduction

Spanner → multi-version, globally distributed, and synchronously replicated (replication happens as part of the write itself, not batched asynchronously)

it automatically reshards data as the amount of data or the number of servers changes

replication config can be controlled by clients/apps at a fine-grained level, e.g., which DCs should hold the data

read & writes are externally consistent → linearizable

globally consistent reads across the database at a given timestamp

time is provided by the TrueTime service, which exposes an API that makes clock uncertainty explicit

(TrueTime uses a combination of GPS clocks and atomic clocks across the data centers that Spanner runs in)

spanner will slow down (wait) when the clock uncertainty is large

Implementation

A spanner deployment is called a universe (see the figure below)

it’ll have multiple zones, which are the units of physical isolation

a zone will have a zonemaster, which assigns data to between one hundred and several thousand spanservers

location proxies know which data lives on which spanserver, so they redirect clients to the correct spanserver

[figure: Spanner server organization]

the universemaster is a console that displays status info about all zones; the placementdriver polls spanservers and handles automated rebalancing of data when replication configs change

Spanserver Software Stack

[figure: spanserver software stack]

1 spanserver → 100 to 1000 tablets

a tablet holds a bag of key-value mappings of the form

(key, timestamp) → value

the timestamp dimension ⇒ multi-versioning
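To make the multi-versioning concrete, here's a toy sketch I wrote (not Spanner code, just the idea): a read at timestamp ts returns the newest version at or before ts.

```python
import bisect
from collections import defaultdict

class MultiVersionMap:
    """Toy (key, timestamp) -> value store, for intuition only."""

    def __init__(self):
        self.ts = defaultdict(list)   # key -> sorted list of timestamps
        self.val = defaultdict(list)  # key -> values, parallel to ts

    def write(self, key, ts, value):
        # assumes per-key timestamps only ever increase
        self.ts[key].append(ts)
        self.val[key].append(value)

    def read_at(self, key, ts):
        # snapshot read: newest version with timestamp <= ts
        i = bisect.bisect_right(self.ts[key], ts)
        return self.val[key][i - 1] if i else None

# m = MultiVersionMap()
# m.write("a", 10, "x"); m.write("a", 20, "y")
# m.read_at("a", 15)  # -> "x"
```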

a tablet’s state is stored as B-tree-like files plus a write-ahead log (WAL), all on Colossus (the GFS successor)

state is replicated using the Paxos algorithm (see the stack figure above); the replicas of a tablet together form a Paxos group

writes go through the Paxos leader; reads can be served by any replica that is sufficiently up to date

the leader maintains a lock table for two-phase locking (2PL): it maps ranges of keys to lock states, for concurrency control
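A rough sketch of how I picture such a range-based lock table (all internals invented; the paper only says it maps key ranges to lock states):

```python
class RangeLockTable:
    """Toy lock table: key ranges -> lock state. Details are mine."""

    def __init__(self):
        self.locks = []  # list of (start, end, mode, tx_id); [start, end)

    def _conflicts(self, start, end, mode, tx_id):
        for s, e, m, owner in self.locks:
            overlap = start < e and s < end
            # S/S is compatible; anything involving X conflicts
            if overlap and owner != tx_id and (mode == "X" or m == "X"):
                return True
        return False

    def acquire(self, start, end, mode, tx_id):
        # mode is "S" (shared) or "X" (exclusive)
        if self._conflicts(start, end, mode, tx_id):
            return False  # caller blocks, or wound-wait kicks in
        self.locks.append((start, end, mode, tx_id))
        return True

    def release_all(self, tx_id):
        self.locks = [l for l in self.locks if l[3] != tx_id]
```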

when a transaction spans multiple Paxos groups, each leader’s transaction manager participates in two-phase commit (2PC), with one group chosen as the coordinator
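Roughly how I picture 2PC layered on Paxos groups (pseudocode-level Python; paxos_log() and all names here are mine, not the paper's):

```python
def two_phase_commit(coordinator, participants, tx):
    """Sketch: each participant is a whole Paxos group, so every
    prepare/commit record is itself replicated via Paxos rather than
    sitting on a single node (hypothetical paxos_log() method)."""
    # Phase 1: non-coordinator groups replicate a prepare record
    prepared = all(g.paxos_log("prepare", tx) for g in participants)

    # Phase 2: the coordinator group makes the outcome durable via
    # Paxos first, then tells everyone else
    outcome = "commit" if prepared else "abort"
    coordinator.paxos_log(outcome, tx)
    for g in participants:
        g.paxos_log(outcome, tx)
    return outcome
```

This is why the paper argues the combination is simpler than classic 2PC: a "participant" can't just die and block everyone, because it's a replicated group.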

Directories and Placement

keys are bucketed into directories: sets of contiguous keys sharing a common prefix; a directory is the unit of data placement

all contents of a directory share the same replication config

a paxos group can contain multiple directories

directories are moved between groups to balance load, or to speed up access by placing data close to its accessors

[figure: a directory being moved between Paxos groups]

Data Model

[figure: example schema with interleaved Users and Albums tables]

semi-relational data model, SQL-like query language, general-purpose transactions

layering 2PC on top of Paxos mitigates 2PC’s availability problems, since the participants themselves are fault-tolerant (FT) replicated groups

the schema in the figure uses INTERLEAVE IN with ON DELETE CASCADE, so deleting a User also deletes that user’s Albums
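From memory, the paper's example schema looks roughly like this (syntax approximate, quoted from the paper's figure rather than anything I've run):

```sql
CREATE TABLE Users {
  uid INT64 NOT NULL, email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid INT64 NOT NULL, aid INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;
```

The DIRECTORY annotation makes each Users row plus its interleaved Albums rows a directory, i.e., a unit of placement.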

TrueTime

TT.now() returns a bounded interval [earliest, latest]; the absolute time of the invocation is guaranteed to lie within that interval
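A Python caricature of the API (real TrueTime computes the error bound ε from the time masters; I just hardcode a constant):

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now(epsilon=0.004):
    # Caricature of TT.now(): ~4 ms was the reported average epsilon,
    # used here as a fixed guess instead of a measured bound.
    t = time.time()
    return TTInterval(t - epsilon, t + epsilon)

def tt_after(t):
    return tt_now().earliest > t   # t has definitely passed

def tt_before(t):
    return tt_now().latest < t     # t has definitely not arrived yet
```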

time masters are maintained per DC, and every machine runs a timeslave daemon

[figure: TrueTime time masters and timeslave daemons]

the majority of masters get their time from GPS receivers; the remaining “Armageddon” masters are equipped with atomic clocks

daemons poll a variety of masters and apply a variant of Marzullo’s algorithm to reject liars and converge on an accurate time
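The simplest honest-sources intuition behind that is interval intersection; a naive stand-in I wrote for the real algorithm:

```python
def intersect_intervals(intervals):
    """Naive stand-in for the Marzullo variant: if every source is
    honest, the true time lies in the intersection of their
    [earliest, latest] intervals."""
    earliest = max(lo for lo, hi in intervals)
    latest = min(hi for lo, hi in intervals)
    if earliest > latest:
        raise ValueError("no common intersection: some source is lying")
    return earliest, latest

# intersect_intervals([(1.0, 1.5), (1.2, 1.8)])  # -> (1.2, 1.5)
```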

Concurrency Control

[table: operations supported by Spanner]

a Paxos leader holds a timed lease (10 seconds by default), which is extended on a successful write

disjointness invariant ⇒ lease intervals of successive leaders are disjoint, so at any point in time there is at most one leader per group

a leader can give up its position early (abdicate), but it must preserve disjointness: if s_max is the maximum timestamp the leader has used, it must complete all transactions up to s_max and wait until TT.after(s_max) is true before abdicating
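That waiting pattern in code, reusing the toy tt_after() from the TrueTime sketch above (the polling interval is my choice):

```python
import time

def wait_until_after(s_max, poll_s=0.001):
    # Block until TT.after(s_max) holds, i.e. s_max is guaranteed to
    # be in the past on every clock. Spanner uses the same pattern for
    # commit wait on read-write transactions.
    while not tt_after(s_max):
        time.sleep(poll_s)
```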

RW txs

  • reads in the tx don’t see the effects of its own writes: the writes aren’t committed yet and are buffered at the client (isolated)
  • wound-wait for deadlock avoidance: the younger tx (later timestamp, lower priority) gets aborted when an older tx wants its lock (see the sketch below)
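Wound-wait in a nutshell, as I understand it (sketch; priority = age, smaller timestamp = older):

```python
def on_lock_conflict(requester_ts, holder_ts):
    """Wound-wait with age as priority. Waits-for edges only ever go
    from younger to older transactions, so no cycle (deadlock) can
    form."""
    if requester_ts < holder_ts:
        return "wound"  # older requester aborts the younger holder
    return "wait"       # younger requester waits for the older holder
```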

RO txs

  • execute lock-free at a system-chosen timestamp; reads can be served by any replica that is up to date as of that timestamp

Evaluation

see the benchmark tables in the reference pdf


Post-Read Thoughts

Spanner is a landmark of the NewSQL movement

Some concepts are actually a bit tough to grasp. I went to ChatGPT with a lot of questions for clarification; I might come back to this paper again sometime in the future

Depicts a real-world implementation of distributed txs

Really went big with TrueTime

wrt cockroachdb: Raft is used for consensus instead of Paxos, which makes sense because Raft didn’t exist at the time of Spanner

Further Reading

https://vimeo.com/43759726