Good Short Video: https://www.youtube.com/watch?v=W2Z7fbCLSTw&ab_channel=Fireship
Questions to think about:
- How to scale writes?
- How to scale reads?
- How to make both writes and reads fast?
- How not to lose data in case of hardware faults and network partitions?
- How to achieve strong consistency? What are the tradeoffs?
- How to recover data in case of an outage?
- How to ensure data security?
- How to make it extensible for data model changes in the future?
- Where to run (cloud vs on-premises data centers)?
SQL vs NoSQL
NoSQL
- Key-Value Stores: Data is stored as key-value pairs. Examples: Redis, Voldemort, and Dynamo.
- Document Databases: Data is stored in documents, and each document can have an entirely different structure. Examples: CouchDB and MongoDB.
- Wide-Column Databases: Instead of ‘tables,’ columnar databases have column families, which are containers for rows. Unlike relational databases, we don’t need to know all the columns up front, and each row doesn’t have to have the same number of columns. Examples: Cassandra and HBase. Good for high write rates and relatively low read rates.
- Graph Databases: These databases store data whose relations are best represented as a graph. Examples: Neo4J and InfiniteGraph. Use cases: knowledge graphs, recommendation engines.
Most of the NoSQL solutions sacrifice ACID compliance for performance and scalability.
Reasons to choose NoSQL:
- Storing large volumes of data that often have little to no structure.
- Making the most of cloud computing and storage.
- Rapid development.
Example:
DynamoDB
- High speed: 100M requests per day with single-digit millisecond latency
- Partition: each row stores a partition key, which determines the partition the row is assigned to
- Availability: persists data on arrays of high-performance SSDs, spread across multiple partitions of a single table
Firestore
- Good for real-time applications; when integrated with Firebase, it can use WebSocket connections to push updates to clients
SQL
Reasons to choose SQL:
- We need to ensure ACID compliance (Atomicity, Consistency, Isolation, Durability).
- Our data is structured and its schema rarely changes.
Indexes
The goal of creating an index on a particular table in a database is to make it faster to search through the table and find the row or rows that we want. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
Simply put, an index is a data structure that can be thought of as a table of contents that points us to the location where the actual data lives. So when we create an index on a column of a table, we store that column and a pointer to the whole row in the index.
Trade-offs: An index can dramatically speed up data retrieval, but the index itself takes additional space for the extra keys, and it slows down data insertion and updates because the index has to be updated on every write.
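As a rough illustration (a toy in-memory “table”, not how a real database implements indexes), the sketch below builds an index on one column and shows why writes also have to update the index:

```python
# Toy in-memory "table": each element is a row; its position is the row id.
users = [
    {"id": 1, "email": "a@example.com", "name": "Ann"},
    {"id": 2, "email": "b@example.com", "name": "Bob"},
    {"id": 3, "email": "a@example.com", "name": "Another Ann"},
]

# Build an index on the "email" column: value -> list of row ids (pointers).
email_index = {}
for row_id, row in enumerate(users):
    email_index.setdefault(row["email"], []).append(row_id)

def find_by_email(email):
    # Follow the index pointers instead of scanning every row.
    return [users[row_id] for row_id in email_index.get(email, [])]

def insert(row):
    # The trade-off: every write must update both the table AND the index.
    users.append(row)
    email_index.setdefault(row["email"], []).append(len(users) - 1)

print(find_by_email("a@example.com"))
```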
Data Partitioning
Partitioning Method
Horizontal Partitioning (Data Sharding): In this scheme, we put different rows into different tables (on different servers). The key problem with this approach is that if the value whose range is used for partitioning isn’t chosen carefully, the scheme will lead to unbalanced servers.
Vertical Partitioning: In this scheme, we divide our data so that tables related to a specific feature live on their own server. Vertical partitioning is straightforward to implement and has a low impact on the application. The main problem with this approach is that if our application experiences additional growth, it may be necessary to further partition a feature-specific DB across multiple servers.
Directory-Based Partitioning: A loosely coupled approach to work around the issues in the above schemes is to create a lookup service that knows the current partitioning scheme and abstracts it away from the DB access code. To find out where a particular data entity resides, we query the directory server that holds the mapping between each tuple key and its DB server. This loosely coupled approach means we can perform tasks like adding servers to the DB pool or changing our partitioning scheme without impacting the application.
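A minimal sketch of the directory idea, assuming an in-memory lookup table and hypothetical `assign`/`lookup` helpers (a real deployment would put this behind a directory service):

```python
# Hypothetical in-memory "directory": maps each key to the DB server that
# holds it; the application asks the directory before touching any DB.
servers = ["db-1", "db-2", "db-3"]
directory = {}                     # key -> server name

def assign(key):
    # Placement policy is up to us; here we pick the currently least-loaded server.
    load = {s: 0 for s in servers}
    for s in directory.values():
        load[s] += 1
    server = min(servers, key=lambda s: load[s])
    directory[key] = server
    return server

def lookup(key):
    # Adding servers or changing the placement policy only touches this
    # directory, not the application code that calls lookup().
    return directory.get(key) or assign(key)

print(lookup("user:42"))
print(lookup("user:42"))   # same answer: the mapping is remembered
```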
Partition Criteria
Key or Hash-based Partitioning: Under this scheme, we apply a hash function to some key attribute of the entity we are storing; that yields the partition number (see the sketch after this list). This approach should ensure a uniform allocation of data among servers. The fundamental problem with this approach is that it effectively fixes the total number of DB servers, since adding new servers means changing the hash function, which would require redistributing the data and downtime for the service. A workaround for this problem is to use Consistent Hashing.
List partitioning: In this scheme, each partition is assigned a list of values, so whenever we want to insert a new record, we will see which partition contains our key and then store it there.
Round-robin partitioning: With n partitions, the i-th tuple is assigned to partition (i mod n). This is a very simple strategy that ensures uniform data distribution.
Composite Partitioning: Under this scheme, we combine any of the above partitioning schemes to devise a new scheme.
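A quick sketch of key/hash-based partitioning with `hash(key) % num_servers` (MD5 here is just an illustrative choice), showing how adding a server forces most keys to move, which is the problem Consistent Hashing addresses:

```python
import hashlib

def partition(key, num_servers):
    # Hash the key, then take it modulo the number of DB servers.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_servers

keys = [f"user:{i}" for i in range(10_000)]
before = {k: partition(k, 4) for k in keys}   # 4 servers
after = {k: partition(k, 5) for k in keys}    # add a 5th server

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys change servers")   # typically ~80%
```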
Problems of Data Partition
- Joins and Denormalization: Performing joins on a database that is running on one server is straightforward, but once a database is partitioned and spread across multiple machines it is often not feasible to perform joins that span database partitions. A common workaround for this problem is to denormalize the database so that queries that previously required joins can be performed from a single table.
- Referential integrity: trying to enforce data integrity constraints such as foreign keys in a partitioned database can be extremely difficult. Applications that require referential integrity on partitioned databases often have to enforce it in application code. Often in such cases, applications have to run regular SQL jobs to clean up dangling references.
- Rebalancing: over time the data distribution can become uneven (e.g., one partition receives a disproportionate share of data or load), which forces us to change the partitioning scheme and move data between servers.
Consistent Hashing
There are two challenges when we try to distribute data:
- How do we know on which node a particular piece of data will be stored?
- When we add or remove nodes, how do we know what data will be moved from existing nodes to the new nodes? Additionally, how can we minimize data movement when nodes join or leave?
Consistent Hashing maps data to physical nodes and ensures that only a small set of keys move when servers are added or removed.
Consistent Hashing stores the data managed by a distributed system in a ring. Each node in the ring is assigned a range of data.
With consistent hashing, the ring is divided into smaller, predefined ranges. Each node is assigned one of these ranges. The start of the range is called a token.
Virtual Nodes
Consistent Hashing introduces a new scheme of distributing the tokens to physical nodes. Instead of assigning a single token to a node, the hash range is divided into multiple smaller ranges, and each physical node is assigned several of these smaller ranges. Each of these subranges is considered a Vnode. With Vnodes, instead of a node being responsible for just one token, it is responsible for many tokens (or subranges).
Practically, Vnodes are randomly distributed across the cluster and are generally non-contiguous, so that no two neighboring Vnodes are assigned to the same physical node or rack. Additionally, nodes carry replicas of other nodes’ data for fault tolerance. Also, since there can be heterogeneous machines in the cluster, some servers might hold more Vnodes than others.
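A minimal sketch of a consistent-hash ring with virtual nodes. The hash function (MD5) and the `ConsistentHashRing` class are illustrative choices, not how any particular system implements it:

```python
import bisect, hashlib

def h(value):
    # Map any string to a position on the hash ring (a 128-bit space here).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node gets several tokens (Vnodes) scattered on the ring.
        self.ring = sorted(
            (h(f"{node}#vnode{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.tokens = [token for token, _ in self.ring]

    def get_node(self, key):
        # Walk clockwise: the first token >= hash(key) owns the key (wrap around).
        idx = bisect.bisect_left(self.tokens, h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
print(ring.get_node("user:42"))
print(ring.get_node("user:1001"))
```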
Advantages of Virtual Nodes
Vnodes give the following advantages:
- As Vnodes help spread the load more evenly across the physical nodes on the cluster by dividing the hash ranges into smaller subranges, this speeds up the rebalancing process after adding or removing nodes. When a new node is added, it receives many Vnodes from the existing nodes to maintain a balanced cluster. Similarly, when a node needs to be rebuilt, instead of getting data from a fixed number of replicas, many nodes participate in the rebuild process.
- Vnodes make it easier to maintain a cluster containing heterogeneous machines. This means, with Vnodes, we can assign a high number of sub-ranges to a powerful server and a lower number of sub-ranges to a less powerful server.
- Since Vnodes assign many smaller ranges to each physical node instead of one big range, the probability of hotspots decreases.
Data replication with Consistent Hashing
Each key is assigned to a coordinator node (generally the first node that falls in the hash range), which first stores the data locally and then replicates it to N-1 clockwise successor nodes on the ring. This results in each node owning the region on the ring between it and its Nth predecessor. In an eventually consistent system, this replication is done asynchronously (in the background).
In eventually consistent systems, copies of data don’t always have to be identical as long as they are designed to eventually become consistent. In distributed systems, eventual consistency is used to achieve high availability.
Amazon’s Dynamo and Apache Cassandra use Consistent Hashing to distribute and replicate data across nodes.
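Extending the same ring idea, a hedged sketch of picking the coordinator plus its N-1 distinct clockwise successors (a simplified “preference list”; the name and details are illustrative, loosely inspired by Dynamo’s design, not a faithful implementation):

```python
import bisect, hashlib

def h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Ring of (token, physical node) pairs, built with Vnodes as in the sketch above.
nodes = ["node-A", "node-B", "node-C", "node-D"]
ring = sorted((h(f"{n}#vnode{i}"), n) for n in nodes for i in range(8))
tokens = [token for token, _ in ring]

def preference_list(key, n=3):
    # Coordinator = first node clockwise from hash(key); then keep walking
    # clockwise, skipping Vnodes that map to an already-chosen physical node,
    # until we have N distinct replicas.
    start = bisect.bisect_left(tokens, h(key)) % len(ring)
    replicas = []
    i = start
    while len(replicas) < min(n, len(nodes)):
        node = ring[i % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas

print(preference_list("user:42"))   # coordinator + 2 successor replicas
```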
Concepts
Dead Letter Queue: a queue that holds jobs that could not be processed successfully in the normal queue, so they can be inspected and retried later without blocking other work. Video explanation: https://www.youtube.com/watch?v=XNXbjWNsKAE&ab_channel=anthonywritescode
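A toy sketch of the pattern, using Python’s in-process queue module in place of a real message broker; the job format and MAX_ATTEMPTS policy are made up for illustration:

```python
import queue

work_q = queue.Queue()         # normal job queue
dead_letter_q = queue.Queue()  # parking lot for jobs that keep failing
MAX_ATTEMPTS = 3

def process(job):
    # Stand-in for real work; jobs flagged as "poison" always fail.
    if job.get("poison"):
        raise RuntimeError("cannot process this job")

def worker():
    while not work_q.empty():
        job = work_q.get()
        try:
            process(job)
        except Exception:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] >= MAX_ATTEMPTS:
                dead_letter_q.put(job)   # park it for inspection; don't block the queue
            else:
                work_q.put(job)          # retry later

work_q.put({"id": 1})
work_q.put({"id": 2, "poison": True})
worker()
print(dead_letter_q.qsize())   # 1: the poison job ended up in the DLQ
```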
Data Structure
Log segments with Hash Indexes
- In-memory hashmap where every key is mapped to a byte offset in the data file
- Example: Bitcask (all keys need to fit in RAM); see the sketch after this list
- Good for write-heavy workloads with a small number of distinct keys
- Saves space: compaction and merging of segments
- Cons:
- Range queries are not efficient
- The hash table must fit in memory
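A toy sketch of this design (my own simplified code, not the real Bitcask): an append-only log file plus an in-memory hashmap from each key to the byte offset of its latest record.

```python
import os

class LogKVStore:
    """Toy Bitcask-style store: append-only log file + in-memory hashmap
    mapping each key to the byte offset of its most recent record."""

    def __init__(self, path="kv.log"):
        self.path = path
        open(self.path, "ab").close()        # make sure the log file exists
        self.index = {}                      # key -> byte offset of latest record

    def put(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())   # append only, never overwrite in place
        self.index[key] = offset                   # newest offset wins

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                   # one seek + one read, thanks to the index
            line = f.readline().decode().rstrip("\n")
            _, value = line.split(",", 1)
            return value

db = LogKVStore()
db.put("name", "alice")
db.put("name", "bob")                        # stale record stays until compaction
print(db.get("name"))                        # -> bob
```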
SSTables and LSM-Trees
- The sequence of key-value pairs is sorted by key
- Pros:
- Merging segments is efficient: it is just a mergesort-style merge
- Fast reads: because keys are sorted, binary search (over a small in-memory index) plus a short scan can serve read requests
- Saves memory: we don’t need to keep all keys in memory; a sparse index of keys is enough (see the sketch after this list)
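A toy sketch of the SSTable ideas above, assuming in-memory segments (lists of (key, value) pairs sorted by key; real stores keep them on disk): a sparse index, binary search plus a short scan for reads, and a mergesort-style merge where the newer segment wins on duplicate keys.

```python
import bisect

def build_sparse_index(segment, every=100):
    # Keep only every `every`-th key in memory, together with its position.
    return [(segment[i][0], i) for i in range(0, len(segment), every)]

def sstable_get(segment, sparse_index, key, every=100):
    # Binary-search the sparse index for the last indexed key <= key,
    # then scan forward within that small range.
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1
    start = sparse_index[i][1] if i >= 0 else 0
    for k, v in segment[start:start + every]:
        if k == key:
            return v
    return None

def merge_segments(older, newer):
    # Mergesort-style merge of two sorted segments; on duplicate keys the
    # newer segment's value wins, so stale values are compacted away.
    out, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            out.append(older[i]); i += 1
        elif older[i][0] > newer[j][0]:
            out.append(newer[j]); j += 1
        else:
            out.append(newer[j]); i += 1; j += 1
    return out + older[i:] + newer[j:]

segment = [(f"key{i:05d}", i) for i in range(1000)]
index = build_sparse_index(segment)
print(sstable_get(segment, index, "key00512"))   # -> 512
```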