Data Architecture

Data is the glue that holds distributed systems together. Data has its own lifecycle, from origination through flow, transformation, and modification to eventual deletion. At a high level, these are some of the key data design considerations:

Database Type 

  • Relational, key-value store, column family, document, graph, etc.

Indexing and Querying

  • LSM-trees, B-trees, B+ trees (see the LSM sketch after this list)
  • R/W Access patterns
  • Range queries
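
A toy sketch of the LSM idea (names are hypothetical, not any real engine's API): writes go into an in-memory table that is periodically flushed as an immutable sorted run, reads check the memtable first and then the runs newest-first, and range queries exploit the sorted order.

import bisect

class TinyLSM:
    """Toy LSM tree: in-memory memtable, flushed to immutable
    sorted 'SSTable' runs when it grows past a threshold."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}             # recent writes, key -> value
        self.sstables = []             # sorted [(key, value)] runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: sort the memtable into an immutable run on "disk".
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # newest data wins
            return self.memtable[key]
        for run in reversed(self.sstables):    # newest SSTable first
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)  # binary search: runs are sorted
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def range(self, lo, hi):
        # Sorted runs make range scans cheap: merge all runs, newest wins.
        merged = {}
        for run in self.sstables:
            merged.update((k, v) for k, v in run if lo <= k <= hi)
        merged.update((k, v) for k, v in self.memtable.items() if lo <= k <= hi)
        return sorted(merged.items())

db = TinyLSM()
for i in range(10):
    db.put(f"key{i:02d}", i)
print(db.get("key03"))               # 3
print(db.range("key02", "key05"))    # keys 02..05 in sorted order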

Schema, Metadata, Encoding and Evolution

  • Versioning
    • Forward and backward compatibility (sketch below)
  • Format
    • Avro, Thrift, Protobuf
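
A minimal sketch of what backward/forward compatibility means in practice, assuming Avro/Protobuf-style rules where new fields carry defaults and unknown fields are ignored (the schemas and records here are made up):

# Hypothetical "v1" and "v2" record schemas: v2 adds a field with a
# default, which is what keeps Avro/Protobuf-style evolution safe.
SCHEMA_V1 = {"name": None, "email": None}                  # field -> default
SCHEMA_V2 = {"name": None, "email": None, "phone": ""}     # new field has a default

def decode(record: dict, schema: dict) -> dict:
    """Reader-side decoding: unknown fields are ignored (forward compat),
    missing fields take the schema default (backward compat)."""
    return {field: record.get(field, default) for field, default in schema.items()}

old_record = {"name": "Ada", "email": "ada@example.com"}                  # written by v1
new_record = {"name": "Ada", "email": "ada@example.com", "phone": "555"}  # written by v2

print(decode(old_record, SCHEMA_V2))  # new reader, old data: phone defaults to ""
print(decode(new_record, SCHEMA_V1))  # old reader, new data: phone is ignored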


Transactions

  • Atomicity
  • Isolation level
    • Lost update
    • Read committed
    • Repeatable read (snapshot isolation: readers never block writers and vice versa)
    • Preventing lost updates (the read-modify-write cycle)
      • Compare-and-set (CAS), atomic writes, conflict resolution (see the CAS sketch after this list)
    • Write skew and phantoms (generalisations of the lost update problem)
      • Materialising conflicts
    • Serialisability
      • Actual serial execution
      • 2 phase locking
        • Writers don’t just block other writers; they also block readers and vice versa. 
        • Deadlocks
        • Predicate lock, index range lock
      • SSI (serialisable snapshot isolation)
        • All reads within a transaction are made from a consistent snapshot of the database
        • Pessimistic concurrency control: 2PL, actual serial execution
        • Optimistic concurrency control: SSI, built on MVCC snapshots
  • Data durability 
    • Write-ahead log (WAL)
    • Backup and Restore
    • Error handling, RRD, Save points
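
A concrete sketch of the compare-and-set approach to the read-modify-write cycle, using SQLite purely for illustration: the UPDATE only applies if the value is unchanged since it was read, and a zero row count tells the caller to retry.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES (1, 10)")
conn.commit()

def increment_with_cas(conn, counter_id, retries=5):
    """Read-modify-write guarded by compare-and-set: the UPDATE only
    applies if nobody changed the row between our read and our write."""
    for _ in range(retries):
        (old,) = conn.execute(
            "SELECT value FROM counters WHERE id = ?", (counter_id,)
        ).fetchone()
        cur = conn.execute(
            "UPDATE counters SET value = ? WHERE id = ? AND value = ?",
            (old + 1, counter_id, old),      # "AND value = old" is the CAS guard
        )
        conn.commit()
        if cur.rowcount == 1:                # CAS succeeded
            return old + 1
        # rowcount == 0: a concurrent writer got there first -> retry
    raise RuntimeError("too much contention")

print(increment_with_cas(conn, 1))  # 11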


Replication

  • Single leader (master-slave)
    • Read your writes, monotonic read, consistent prefix reads
  • Multi-leader (e.g., CouchDB)
    • Conflict resolution / avoidance
  • Leaderless (Dynamo style)
    • Quorum (w + r > n; worked example below), sloppy quorum, LWW, concurrent writes, version vectors
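
A worked example of the strict quorum condition: with n replicas, writes acknowledged by w nodes and reads from r nodes are guaranteed to overlap in at least one up-to-date replica whenever w + r > n.

def quorum_ok(n: int, w: int, r: int) -> bool:
    """Strict quorum condition for Dynamo-style replication:
    every read set must intersect every successful write set."""
    return w + r > n

# Common configuration: n=3, w=2, r=2 -> any read overlaps any write.
print(quorum_ok(3, 2, 2))  # True
# n=3, w=1, r=1 is faster but can return stale values.
print(quorum_ok(3, 1, 1))  # False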

Data partitioning

  • Hotspot
  • Partition by key range, consistent hashing (sketch below), or compound primary key
  • Partitioning secondary indexes - local index (scatter/gather reads), global index
  • Rebalancing - Fixed, Dynamic
  • Request routing 
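
A minimal consistent hashing sketch (virtual nodes omitted, names hypothetical): keys and nodes are hashed onto the same ring, and a key is owned by the first node clockwise from it, so rebalancing only moves the keys in the affected arc.

import bisect
import hashlib

class HashRing:
    """Toy consistent hashing: nodes and keys share one hash space;
    a key belongs to the first node at or after its hash position."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h,))        # first node clockwise of the key
        return self.ring[i % len(self.ring)][1]   # wrap around the ring

ring = HashRing(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.node_for(key))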

Trouble with Distributed systems

  • Faults and partial failures
  • Unreliable networks - synchronous vs asynchronous networks
  • Unreliable clocks
    • Don’t rely on the accuracy of the clock
    • NTP sync is not accurate
    • Confidence interval in time - Spanner
  • Process pause
  • Knowledge, truth, and lies
    • Truth is defined by the majority
    • Fencing tokens are required for distributed locks (sketch below)
    • Safety and liveness
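
A sketch of the fencing token idea: the lock service issues a monotonically increasing token with each grant, and the storage service rejects writes carrying an older token than it has already seen, so a client that paused while holding an expired lease cannot clobber newer writes. All class names are hypothetical.

class LockService:
    """Hands out monotonically increasing fencing tokens with each grant."""
    def __init__(self):
        self.token = 0

    def acquire(self) -> int:
        self.token += 1
        return self.token

class Storage:
    """Rejects writes whose fencing token is older than one already seen."""
    def __init__(self):
        self.max_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self.max_token:
            return False             # stale holder (e.g. woke up after a GC pause)
        self.max_token = token
        self.value = value
        return True

lock, store = LockService(), Storage()
t1 = lock.acquire()                      # client 1 gets token 1, then pauses
t2 = lock.acquire()                      # lease expires; client 2 gets token 2
print(store.write(t2, "from client 2"))  # True
print(store.write(t1, "from client 1"))  # False: token 1 < 2, write fenced off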


Linearisability

  • Serialisability is an isolation property of transactions
  • Linearisability is a recency guarantee
  • Linearisability makes a system appear as if there were only a single copy of the data
  • Single leader replication is potentially linearisable. Multi-leader and leaderless are not linearisable
  • A network interruption forces a choice between linearisability and availability. 
  • CAP - Linearisable / Available when partitioned
  • The reason for dropping linearisability is performance, not fault tolerance.
  • Most DB are neither CAP-linearisable nor CAP-available 
    • MVCC is intentionally non-linearisable
    • Single leader replication (async) is non-linearisable
    • Partition makes it non-available
  • ZooKeeper by default is neither CAP-consistent (CP) nor CAP-available (AP) - it's really just "P"
    • You can optionally make it CP by calling sync before a read, if you want
    • For reads (but not for writes) it is actually AP, if you turn on the right option

Error handling , Recovery

  • Replay logs (WAL replay sketch below)
  • Discover partitions
  • Replication catchup
  • Restores
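
A minimal write-ahead log sketch illustrating replay-based recovery (file name and record format are made up): every mutation is appended and fsynced before it is applied, so state can be rebuilt after a crash by re-reading the log from the last checkpoint (here, from the beginning).

import json
import os

LOG_PATH = "wal.log"  # hypothetical log file
if os.path.exists(LOG_PATH):
    os.remove(LOG_PATH)  # start fresh for the demo

def append(op: dict):
    """Append the operation to the WAL and fsync before applying it."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(op) + "\n")
        f.flush()
        os.fsync(f.fileno())

def replay() -> dict:
    """Rebuild state after a crash by re-applying every logged operation."""
    state = {}
    with open(LOG_PATH) as f:
        for line in f:
            op = json.loads(line)
            if op["type"] == "set":
                state[op["key"]] = op["value"]
            elif op["type"] == "del":
                state.pop(op["key"], None)
    return state

append({"type": "set", "key": "a", "value": 1})
append({"type": "set", "key": "b", "value": 2})
append({"type": "del", "key": "a"})
print(replay())  # {'b': 2} -- the state is recovered purely from the log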

Source of truth 

  • Primary DB
  • Derived data
    • Caches
    • Materialised Views
    • CQRS

Batch processing

  • Putting the computation near the data
  • Mapping and Chaining
  • Sort merge joins
  • GroupBy
  • Handling Skew
  • Map-side joins
    • Broadcast hash joins - joining a large dataset with a small in-memory dataset (sketch after this list)
    • Partitioned hash joins - both inputs partitioned the same way, so each partition can be joined independently
    • Map-side merge joins - if inputs are partitioned and sorted appropriately
  • Use cases - Search index, Key Value stores
  • The output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partitioned and sorted in the same way as the large input 
  • Inputs are immutable and processing has no side effects
  • In environments where tasks are rarely terminated, the design decisions of MapReduce make less sense
  • Materialising intermediate state
    • Makes fault tolerance easy
    • Sometimes they are not needed
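
A sketch of the broadcast hash join mentioned above: the small dataset is held in an in-memory hash table and each record of the large dataset is streamed past it, with no shuffle or sort needed. Field names are hypothetical.

# Small dataset: fits in memory, "broadcast" to every mapper.
users = {1: "alice", 2: "bob"}  # user_id -> name

# Large dataset: streamed record by record, never held in memory at once.
events = [
    {"user_id": 1, "action": "click"},
    {"user_id": 2, "action": "view"},
    {"user_id": 1, "action": "buy"},
]

def broadcast_hash_join(large, small):
    """Map-side join: probe each large-side record against the
    in-memory hash table built from the small side."""
    for record in large:
        name = small.get(record["user_id"])
        if name is not None:
            yield {**record, "name": name}

for row in broadcast_hash_join(events, users):
    print(row)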

Stream processing

  • Messaging
    • Brokers
      • Routing: Fanout, load balancing
      • Acknowledgement and redelivery
      • Message ordering 
  • Partitioned logs - Kafka
    • Unified log
    • Partitioning to scale
    • Replayable
    • Consumer offsets
    • Durability
    • Immutable input and idempotent operations
  • CDC - Change Data Capture
    • Implemented using log based broker
    • Connectors - Debezium etc
  • Event sourcing
    • Immutable events written to an event log (as in accounting, where every delta is recorded rather than overwriting state)
    • CQRS - separate forms for read and write and allowing several different read views
  • Stream processing
    • Outputs: produce a new stream, drive a real-time dashboard, or write to a database
    • Operators are similar to batch, but sorting doesn't make sense on an unbounded stream
    • Uses: CEP, Stream analytics, Maintaining materialised views
    • Event time vs Processing time
    • Stream Joins
      • Stream-Stream joins (click through rate)
      • Stream-Table join (enrichment)
        • Similar to map side hash join
        • Table can be kept up to date using CDC (sketch after this list)
      • Table-Table join (materialised view maintenance)
        • e.g., maintaining Twitter home timelines
      • Time dependence of joins
        • Slowly changing dimension
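
A sketch of the stream-table join (enrichment) pattern referenced above: a local table is kept in sync from a CDC-style changelog, and each event is enriched by a lookup against the table's current state. All field names are hypothetical.

profiles = {}  # local table, maintained from the changelog

def apply_change(change: dict):
    """CDC consumer: keep the local table in sync with the source DB."""
    if change["op"] == "upsert":
        profiles[change["key"]] = change["row"]
    elif change["op"] == "delete":
        profiles.pop(change["key"], None)

def enrich(event: dict) -> dict:
    """Stream-table join: look up the current profile for each event."""
    profile = profiles.get(event["user_id"], {})
    return {**event, "country": profile.get("country", "unknown")}

# Changelog arrives first (or concurrently, in a real pipeline)...
apply_change({"op": "upsert", "key": 7, "row": {"country": "NZ"}})
# ...then events are enriched against the current table state.
print(enrich({"user_id": 7, "action": "login"}))
print(enrich({"user_id": 9, "action": "login"}))  # unknown user -> default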

Big Data

  • Data Analytics
  • ETL
  • Data warehouse 

Data Security

  • Encryption at Rest (sketch after this list)
  • Data Retention
  • Data classification
  • (Customer) Data isolation
  • Data Integrity
  • Data Availability
  • Data anonymisation
  • Backup and recovery
  • Database Security
  • Access Control
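
As one illustration of encryption at rest, a sketch using the cryptography package's Fernet recipe (symmetric, authenticated encryption); in practice the key would live in a KMS, and key rotation is out of scope here.

# pip install cryptography
from cryptography.fernet import Fernet

# Demo only: in production the key lives in a KMS/HSM,
# never alongside the data it protects.
key = Fernet.generate_key()
fernet = Fernet(key)

plaintext = b"customer record: ..."
ciphertext = fernet.encrypt(plaintext)         # this is what hits the disk
print(fernet.decrypt(ciphertext) == plaintext) # True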
