[DDIA] Chapter 3. Storage and Retrieval

  • Stoarge optimized for transactional workloads vs optimized for analytics
  • Log: an append-only sequence of records

Index

An index is an additional structure that is derived from the primary data.

  • Speed up the reads
  • Slows down the writes, index needs to be updated when data is written

Hash indexes

Keep an in-memory hash map where every key is mapped to a byte offset (location) in the data file - the location at which the value can be found

  • Example: Bitcask (the default storage enginein Riak) requires all keys fit in the available RAM
  • Good for: situations where the value for each key is updated frequently
    • i.e. key: URL of a cat video, value: number of times it has been played

Compaction: throwing away duplicate keys in the log, keeping only the most recent update for each key

Why this design

  • Appending and segment mergeing are sequential write operations, which are generally much faster than random writes
  • Concurrency and crash recovery are much simpler if segment files are appendonly or immutable
  • Merging avoids data files getting fragmented over time

Limitations

  • The hash table must fit in memory, couldn’t handle large amount of keys
  • Range queries are not efficient

SSTables and LSM-Trees