Data Architecture

Data is the glue that holds distributed systems together. Data has its own lifecycle, from origination through flow, transformation, and modification to eventual deletion. At a high level, these are some of the key data design considerations:

Database Type 

  • Relational, key-value store, column family, document, graph, etc.

Indexing and Querying

  • LSM-trees, B-trees, B+ trees (see the LSM sketch after this list)
  • R/W Access patterns
  • Range queries
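
A toy sketch of the LSM idea (names are hypothetical, not any real engine's API): writes go into an in-memory table that is periodically flushed as an immutable sorted run, reads check the memtable first and then the runs newest-first, and range queries exploit the sorted order.

import bisect

class TinyLSM:
    """Toy LSM tree: in-memory memtable, flushed to immutable
    sorted 'SSTable' runs when it grows past a threshold."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}             # recent writes, key -> value
        self.sstables = []             # sorted [(key, value)] runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: sort the memtable into an immutable run on "disk".
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # newest data wins
            return self.memtable[key]
        for run in reversed(self.sstables):    # newest SSTable first
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)  # binary search: runs are sorted
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def range(self, lo, hi):
        # Sorted runs make range scans cheap: merge all runs, newest wins.
        merged = {}
        for run in self.sstables:
            merged.update((k, v) for k, v in run if lo <= k <= hi)
        merged.update((k, v) for k, v in self.memtable.items() if lo <= k <= hi)
        return sorted(merged.items())

db = TinyLSM()
for i in range(10):
    db.put(f"key{i:02d}", i)
print(db.get("key03"))               # 3
print(db.range("key02", "key05"))    # keys 02..05 in sorted order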

Schema, Metadata, Encoding and Evolution

  • Versioning
    • Forward and backward compatibility (sketch below)
  • Format
    • Avro, Thrift, Protobuf
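
A minimal sketch of what backward/forward compatibility means in practice, assuming Avro/Protobuf-style rules where new fields carry defaults and unknown fields are ignored (the schemas and records here are made up):

# Hypothetical "v1" and "v2" record schemas: v2 adds a field with a
# default, which is what keeps Avro/Protobuf-style evolution safe.
SCHEMA_V1 = {"name": None, "email": None}                  # field -> default
SCHEMA_V2 = {"name": None, "email": None, "phone": ""}     # new field has a default

def decode(record: dict, schema: dict) -> dict:
    """Reader-side decoding: unknown fields are ignored (forward compat),
    missing fields take the schema default (backward compat)."""
    return {field: record.get(field, default) for field, default in schema.items()}

old_record = {"name": "Ada", "email": "ada@example.com"}                  # written by v1
new_record = {"name": "Ada", "email": "ada@example.com", "phone": "555"}  # written by v2

print(decode(old_record, SCHEMA_V2))  # new reader, old data: phone defaults to ""
print(decode(new_record, SCHEMA_V1))  # old reader, new data: phone is ignored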


Transactions

  • Atomicity
  • Isolation level
    • Lost update
    • Read committed
    • Repeatable read (snapshot isolation: readers never block writers and vice versa)
    • Preventing lost updates (the read-modify-write cycle)
      • Compare-and-set (CAS), atomic writes, conflict resolution (see the CAS sketch after this list)
    • Write skew and phantoms (generalisations of the lost update problem)
      • Materialising conflicts
    • Serialisability
      • Actual serial execution
      • 2 phase locking
        • Writers don’t just block other writers; they also block readers and vice versa. 
        • Deadlocks
        • Predicate lock, index range lock
      • SSI (serialisable snapshot isolation)
        • All reads within a transaction are made from a consistent snapshot of the database
        • Pessimistic concurrency control: 2PL, actual serial execution
        • Optimistic concurrency control: SSI, built on MVCC snapshots
  • Data durability 
    • Write-ahead log (WAL)
    • Backup and Restore
    • Error handling, RRD, Save points
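
A concrete sketch of the compare-and-set approach to the read-modify-write cycle, using SQLite purely for illustration: the UPDATE only applies if the value is unchanged since it was read, and a zero row count tells the caller to retry.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES (1, 10)")
conn.commit()

def increment_with_cas(conn, counter_id, retries=5):
    """Read-modify-write guarded by compare-and-set: the UPDATE only
    applies if nobody changed the row between our read and our write."""
    for _ in range(retries):
        (old,) = conn.execute(
            "SELECT value FROM counters WHERE id = ?", (counter_id,)
        ).fetchone()
        cur = conn.execute(
            "UPDATE counters SET value = ? WHERE id = ? AND value = ?",
            (old + 1, counter_id, old),      # "AND value = old" is the CAS guard
        )
        conn.commit()
        if cur.rowcount == 1:                # CAS succeeded
            return old + 1
        # rowcount == 0: a concurrent writer got there first -> retry
    raise RuntimeError("too much contention")

print(increment_with_cas(conn, 1))  # 11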


Replication

  • Single leader (master-slave)
    • Read your writes, monotonic read, consistent prefix reads
  • Multi-leader (e.g., CouchDB)
    • Conflict resolution / avoidance
  • Leaderless (Dynamo style)
    • Quorum (w + r > n; worked example below), sloppy quorum, LWW, concurrent writes, version vectors
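
A worked example of the strict quorum condition: with n replicas, writes acknowledged by w nodes and reads from r nodes are guaranteed to overlap in at least one up-to-date replica whenever w + r > n.

def quorum_ok(n: int, w: int, r: int) -> bool:
    """Strict quorum condition for Dynamo-style replication:
    every read set must intersect every successful write set."""
    return w + r > n

# Common configuration: n=3, w=2, r=2 -> any read overlaps any write.
print(quorum_ok(3, 2, 2))  # True
# n=3, w=1, r=1 is faster but can return stale values.
print(quorum_ok(3, 1, 1))  # False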

Data partitioning

  • Hotspot
  • Partition by key range, consistent hashing (sketch below), or compound primary key
  • Partitioning secondary indexes - local index (scatter/gather reads), global index
  • Rebalancing - Fixed, Dynamic
  • Request routing 
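
A minimal consistent hashing sketch (virtual nodes omitted, names hypothetical): keys and nodes are hashed onto the same ring, and a key is owned by the first node clockwise from it, so rebalancing only moves the keys in the affected arc.

import bisect
import hashlib

class HashRing:
    """Toy consistent hashing: nodes and keys share one hash space;
    a key belongs to the first node at or after its hash position."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h,))        # first node clockwise of the key
        return self.ring[i % len(self.ring)][1]   # wrap around the ring

ring = HashRing(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.node_for(key))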

Trouble with Distributed systems

  • Faults and partial failures
  • Unreliable networks - synchronous vs asynchronous networks
  • Unreliable clocks
    • Don’t rely on the accuracy of the clock
    • NTP sync is not accurate
    • Confidence interval in time - Spanner
  • Process pause
  • Knowledge, truth, and lies
    • Truth is defined by the majority
    • Fencing tokens are required for distributed locks (sketch below)
    • Safety and liveness
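
A sketch of the fencing token idea: the lock service issues a monotonically increasing token with each grant, and the storage service rejects writes carrying an older token than it has already seen, so a client that paused while holding an expired lease cannot clobber newer writes. All class names are hypothetical.

class LockService:
    """Hands out monotonically increasing fencing tokens with each grant."""
    def __init__(self):
        self.token = 0

    def acquire(self) -> int:
        self.token += 1
        return self.token

class Storage:
    """Rejects writes whose fencing token is older than one already seen."""
    def __init__(self):
        self.max_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self.max_token:
            return False             # stale holder (e.g. woke up after a GC pause)
        self.max_token = token
        self.value = value
        return True

lock, store = LockService(), Storage()
t1 = lock.acquire()                      # client 1 gets token 1, then pauses
t2 = lock.acquire()                      # lease expires; client 2 gets token 2
print(store.write(t2, "from client 2"))  # True
print(store.write(t1, "from client 1"))  # False: token 1 < 2, write fenced off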


Linearisability

  • Serialisability is an isolation property of transactions
  • Linearisability is a recency guarantee
  • Linearisability makes a system appear as if there were only a single copy of the data
  • Single leader replication is potentially linearisable. Multi-leader and leaderless are not linearisable
  • A network interruption forces a choice between linearisability and availability. 
  • CAP - Linearisable / Available when partitioned
  • The reason for dropping linearisability is performance, not fault tolerance.
  • Most DB are neither CAP-linearisable nor CAP-available 
    • MVCC is intentionally non-linearisable
    • Single leader replication (async) is non-linearisable
    • Partition makes it non-available
  • ZooKeeper by default is neither CAP-consistent (CP) nor CAP-available (AP) - it's really just "P"
    • You can optionally make it CP by calling sync before a read, if you want
    • For reads (but not for writes) it is actually AP, if you turn on the right option

Error handling , Recovery

  • Replay logs (WAL replay sketch below)
  • Discover partitions
  • Replication catchup
  • Restores
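
A minimal write-ahead log sketch illustrating replay-based recovery (file name and record format are made up): every mutation is appended and fsynced before it is applied, so state can be rebuilt after a crash by re-reading the log from the last checkpoint (here, from the beginning).

import json
import os

LOG_PATH = "wal.log"  # hypothetical log file
if os.path.exists(LOG_PATH):
    os.remove(LOG_PATH)  # start fresh for the demo

def append(op: dict):
    """Append the operation to the WAL and fsync before applying it."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(op) + "\n")
        f.flush()
        os.fsync(f.fileno())

def replay() -> dict:
    """Rebuild state after a crash by re-applying every logged operation."""
    state = {}
    with open(LOG_PATH) as f:
        for line in f:
            op = json.loads(line)
            if op["type"] == "set":
                state[op["key"]] = op["value"]
            elif op["type"] == "del":
                state.pop(op["key"], None)
    return state

append({"type": "set", "key": "a", "value": 1})
append({"type": "set", "key": "b", "value": 2})
append({"type": "del", "key": "a"})
print(replay())  # {'b': 2} -- the state is recovered purely from the log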

Source of truth 

  • Primary DB
  • Derived data
    • Caches
    • Materialised Views
    • CQRS

Batch processing

  • Putting the computation near the data
  • Mapping and Chaining
  • Sort merge joins
  • GroupBy
  • Handling Skew
  • Map-side joins
    • Broadcast hash joins - joining a large dataset with a small in-memory dataset (sketch after this list)
    • Partitioned hash joins - both inputs partitioned the same way, so each partition can be joined independently
    • Map-side merge joins - if inputs are partitioned and sorted appropriately
  • Use cases - Search index, Key Value stores
  • The output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partitioned and sorted in the same way as the large input 
  • Inputs are immutable and processing has no side effects
  • In environments where tasks are rarely terminated, the design decisions of MapReduce make less sense
  • Materialising intermediate state
    • Makes fault tolerance easy
    • Sometimes they are not needed
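
A sketch of the broadcast hash join mentioned above: the small dataset is held in an in-memory hash table and each record of the large dataset is streamed past it, with no shuffle or sort needed. Field names are hypothetical.

# Small dataset: fits in memory, "broadcast" to every mapper.
users = {1: "alice", 2: "bob"}  # user_id -> name

# Large dataset: streamed record by record, never held in memory at once.
events = [
    {"user_id": 1, "action": "click"},
    {"user_id": 2, "action": "view"},
    {"user_id": 1, "action": "buy"},
]

def broadcast_hash_join(large, small):
    """Map-side join: probe each large-side record against the
    in-memory hash table built from the small side."""
    for record in large:
        name = small.get(record["user_id"])
        if name is not None:
            yield {**record, "name": name}

for row in broadcast_hash_join(events, users):
    print(row)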

Stream processing

  • Messaging
    • Brokers
      • Routing: Fanout, load balancing
      • Acknowledgement and redelivery
      • Message ordering 
  • Partitioned logs - Kafka
    • Unified log
    • Partitioning to scale
    • Replayable
    • Consumer offsets
    • Durability
    • Immutable input and idempotent operations
  • CDC - Change Data Capture
    • Implemented using log based broker
    • Connectors - Debezium etc
  • Event sourcing
    • Immutable events written to an event log (as in accounting, where every delta is recorded rather than overwriting state)
    • CQRS - separate forms for read and write and allowing several different read views
  • Stream processing
    • Outputs: produce a new stream, drive a real-time dashboard, or write to a database
    • Operators are similar to batch, but sorting doesn't make sense on an unbounded stream
    • Uses: CEP, Stream analytics, Maintaining materialised views
    • Event time vs Processing time
    • Stream Joins
      • Stream-Stream joins (click through rate)
      • Stream-Table join (enrichment)
        • Similar to map side hash join
        • Table can be kept up to date using CDC (sketch after this list)
      • Table-Table join (materialised view maintenance)
        • e.g., maintaining Twitter home timelines
      • Time dependence of joins
        • Slowly changing dimension
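
A sketch of the stream-table join (enrichment) pattern referenced above: a local table is kept in sync from a CDC-style changelog, and each event is enriched by a lookup against the table's current state. All field names are hypothetical.

profiles = {}  # local table, maintained from the changelog

def apply_change(change: dict):
    """CDC consumer: keep the local table in sync with the source DB."""
    if change["op"] == "upsert":
        profiles[change["key"]] = change["row"]
    elif change["op"] == "delete":
        profiles.pop(change["key"], None)

def enrich(event: dict) -> dict:
    """Stream-table join: look up the current profile for each event."""
    profile = profiles.get(event["user_id"], {})
    return {**event, "country": profile.get("country", "unknown")}

# Changelog arrives first (or concurrently, in a real pipeline)...
apply_change({"op": "upsert", "key": 7, "row": {"country": "NZ"}})
# ...then events are enriched against the current table state.
print(enrich({"user_id": 7, "action": "login"}))
print(enrich({"user_id": 9, "action": "login"}))  # unknown user -> default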

Big Data

  • Data Analytics
  • ETL
  • Data warehouse 

Data Security

  • Encryption at Rest (sketch after this list)
  • Data Retention
  • Data classification
  • (Customer) Data isolation
  • Data Integrity
  • Data Availability
  • Data anonymisation
  • Backup and recovery
  • Database Security
  • Access Control
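
As one illustration of encryption at rest, a sketch using the cryptography package's Fernet recipe (symmetric, authenticated encryption); in practice the key would live in a KMS, and key rotation is out of scope here.

# pip install cryptography
from cryptography.fernet import Fernet

# Demo only: in production the key lives in a KMS/HSM,
# never alongside the data it protects.
key = Fernet.generate_key()
fernet = Fernet(key)

plaintext = b"customer record: ..."
ciphertext = fernet.encrypt(plaintext)         # this is what hits the disk
print(fernet.decrypt(ciphertext) == plaintext) # True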
