[DDIA] Chapter 3. Storage and Retrieval

Posted on 2023-09-15 | In DDIA |

Stoarge optimized for transactional workloads vs optimized for analytics
Log: an append-only sequence of records

Index

An index is an additional structure that is derived from the primary data.

Speed up the reads
Slows down the writes, index needs to be updated when data is written

Hash indexes

Keep an in-memory hash map where every key is mapped to a byte offset (location) in the data file - the location at which the value can be found

Example: Bitcask (the default storage enginein Riak) requires all keys fit in the available RAM
Good for: situations where the value for each key is updated frequently
- i.e. key: URL of a cat video, value: number of times it has been played

Compaction: throwing away duplicate keys in the log, keeping only the most recent update for each key

Why this design

Appending and segment mergeing are sequential write operations, which are generally much faster than random writes
Concurrency and crash recovery are much simpler if segment files are appendonly or immutable
Merging avoids data files getting fragmented over time

Limitations

The hash table must fit in memory, couldn’t handle large amount of keys
Range queries are not efficient

SSTables and LSM-Trees