Notes: https://pdos.csail.mit.edu/6.824/notes/l-spanner.txt

Video: https://youtu.be/ZulDvY429B8

Prep:


About

wide-area txs: txs that span different machines and DCs

rw txs → 2PC + 2PL + Paxos groups

ro txs → snapshot isolation, synchronized clocks

High Level Organization

data is split into shards; each shard is replicated across diff DCs, and the replicas of one shard form a Paxos group (one Paxos instance per shard)

sharding helps with parallelism; a close replica is used for lower latency

with Paxos ⇒ only a quorum (majority) has to respond, which gives FT and hides slow or distant replicas

image 26.png

https://excalidraw.com/#json=GR7cmZ0yQsFbXA26BdQOT,eyV922jcSDUM3RPcPvE1E

Challenges

  • local reads should give latest write
  • Txs across shards
  • txs must be serializable

R/W Txs

image 1 18.png

https://excalidraw.com/#json=E_4r6FgGr4K90Msqrmk6w,QR2bnnICbv9RG4UI3_KXRg

R/O Txs

most txs are ro ⇒ they must be fast

reads from local replica, no locks, no 2PC

challenge → correctness / consistency

Correctness

should be serializable

they also support external consistency ⇒ if T1 commits before T2 starts, T2 must see T1’s writes

Bad Plan

just read the latest committed value of each key

T3 may then observe x from one point in time and y from another ⇒ not serializable

image 2 16.png

https://excalidraw.com/#json=Hfop5zL2NcQgyWsWouocP,adh36WHa7EvS8vPayqZZGQ

Good Plan: Snapshot Isolation

assign a ts to each tx; Spanner keeps multiple versions of each value, each tagged with its write ts (like timestamped cell versions in Bigtable)

ts for rw tx = commit time; for ro tx = start time

when Rx runs @15, it reads the version with the largest ts ≤ 15, i.e., the one written @10

image 3 10.png

https://excalidraw.com/#json=MM5Z09jeMcDOSYbrU1hHB,MSxCZsx1ScEaTmXqecBTxA
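A minimal sketch of the versioned read described above, assuming a toy in-memory store. `MultiVersionStore`, `write`, `read_at`, and the stored values are hypothetical names for illustration, not Spanner's API.

```python
import bisect
from collections import defaultdict


class MultiVersionStore:
    """Toy multi-version store: each key keeps a list of (ts, value) sorted by ts."""

    def __init__(self):
        self._versions = defaultdict(list)

    def write(self, key, ts, value):
        # Keep versions ordered by commit ts so reads can scan in ts order.
        bisect.insort(self._versions[key], (ts, value))

    def read_at(self, key, read_ts):
        """Return the newest value whose version ts is <= read_ts, or None."""
        best = None
        for version_ts, value in self._versions[key]:
            if version_ts > read_ts:
                break
            best = value
        return best


store = MultiVersionStore()
# T1 commits @10 and T2 commits @20 (rw txs use commit time as ts).
store.write("x", 10, "x-from-T1")
store.write("y", 10, "y-from-T1")
store.write("x", 20, "x-from-T2")
store.write("y", 20, "y-from-T2")

# T3 is ro and starts @15, so it reads both keys as of ts 15.
snapshot = (store.read_at("x", 15), store.read_at("y", 15))
```

Both reads resolve to T1's versions, so T3 sees x and y from the same point in time, unlike the bad plan.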

problem: the local replica may not have seen Wx @10 yet (stale / version doesn’t exist)

solved with safe time

Paxos sends writes in ts order

before serving Rx @15, the replica needs to wait until it sees a write with ts > 15; then it’s sure that it has every write with ts ≤ 15

(also wait for txs that are prepared but not yet committed)
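A rough sketch of the safe-time check, following the simplified rule in these notes rather than Spanner's actual implementation. `Replica`, `apply_write`, `prepare`, `resolve`, and `can_serve` are hypothetical names.

```python
class Replica:
    """Toy replica that decides whether a read at read_ts is safe to serve locally."""

    def __init__(self):
        self.last_applied_ts = 0  # ts of the newest write applied from the Paxos log
        self.prepared = {}        # tx id -> prepare ts, for prepared but unresolved txs

    def apply_write(self, ts):
        # Assumption from the notes: Paxos delivers writes in increasing ts order.
        assert ts > self.last_applied_ts
        self.last_applied_ts = ts

    def prepare(self, tx_id, prepare_ts):
        self.prepared[tx_id] = prepare_ts

    def resolve(self, tx_id):
        # Called once the prepared tx has committed or aborted.
        self.prepared.pop(tx_id, None)

    def can_serve(self, read_ts):
        # Must have seen a write newer than read_ts, so nothing <= read_ts is missing.
        if self.last_applied_ts <= read_ts:
            return False
        # A prepared tx at or below read_ts might still commit into the snapshot.
        return all(ts > read_ts for ts in self.prepared.values())


replica = Replica()
replica.apply_write(10)
before_newer_write = replica.can_serve(15)   # only seen up to @10: must wait
replica.apply_write(20)
after_newer_write = replica.can_serve(15)    # seen @20 > 15: safe
replica.prepare("T4", 12)
while_prepared = replica.can_serve(15)       # T4 is prepared @12: must wait
replica.resolve("T4")                        # e.g., T4 aborted
after_resolve = replica.can_serve(15)
```

A real replica would block the read until `can_serve` becomes true; the sketch only exposes the check.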

Clocks

matters only for ro txs

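ts too large (clock runs fast) ⇒ waiting (for safe time) is longer, but results stay correct

ts too small (clock runs slow) ⇒ e.g., T3 @9 (wrong time from clock) misses the write @10, which breaks external consistency (linearizability)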

Clock Sync

tough → clocks drift → atomic clocks

clocks are re-synced periodically against GPS time

the remaining error, epsilon (ε), can range from a few microseconds to a few milliseconds

solution: don’t treat a ts as the exact true time

TrueTime returns an interval that the actual time is guaranteed to fall in

⇒ [earliest, latest]

start rule: when starting (ro) or committing (rw) a tx, .latest is used as ts, which guarantees that ts is not earlier than real time

commit wait: the commit is delayed until ts < .earliest, which guarantees that ts has already passed in real time

image 4 9.png

T2 picks .latest as its ts and waits to commit until .earliest of the current interval is > that ts

T3 starts after T2 commits and uses .latest as its ts, so its ts is > T2’s ts and it will definitely see T2’s write

https://excalidraw.com/#json=9DrMOXUYe_FCRAW_Qfhv_,8ZTTEDpskLSMjPRmGezp-w
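A small simulation of the two TrueTime rules (ts = .latest, then commit wait), assuming a manually advanced clock in place of real time. `SimulatedTrueTime`, `pick_timestamp`, `commit_wait`, and the epsilon value are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Stand-in for TrueTime: true time is a counter, epsilon is the error bound."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.true_time = 0.0

    def advance(self, dt):
        self.true_time += dt

    def now(self):
        # The interval is guaranteed to contain the true time.
        return TTInterval(self.true_time - self.epsilon,
                          self.true_time + self.epsilon)


def pick_timestamp(tt):
    # Start rule: use .latest, so ts is never earlier than true time.
    return tt.now().latest


def commit_wait(tt, ts, step=1.0):
    """Advance simulated time until ts < now().earliest; return the time waited."""
    waited = 0.0
    while not ts < tt.now().earliest:
        tt.advance(step)
        waited += step
    return waited


tt = SimulatedTrueTime(epsilon=3.0)
tt.advance(10.0)                    # true time is 10
t2_ts = pick_timestamp(tt)          # T2 picks ts = 10 + 3
waited = commit_wait(tt, t2_ts)     # T2 delays its commit until ts has passed
t3_ts = pick_timestamp(tt)          # T3 starts after T2's commit
```

After the wait, true time is past `t2_ts`, so any tx that starts later picks a larger ts and reads T2's write. The wait is roughly 2ε, which is why a small ε matters.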