[System Design] Stream Processing

References: https://lethain.com/introduction-to-architecting-systems-for-scale/

Message Queues

Message queues allow your web applications to quickly publish messages to the queue, and have other consumers processes perform the processing outside the scope and timeline of the client request.

Generally you’ll either:

  1. perform almost no work in the consumer (merely scheduling a task) and inform your user that the task will occur offline, usually with a polling mechanism to update the interface once the task is complete
  2. perform enough work in-line to make it appear to the user that the task has completed, and tie up hanging ends afterwards

Delivery type:

  • at-most-once
  • at-least-once
  • exactly-once

Fake exactly-once deliver

  • Decoupe strategy: abandon seen events
  • Messages should be idempotent

RabbitMQ

https://www.datadoghq.com/blog/rabbitmq-monitoring/

Messaging Systems

  • Unix pipe or TCP connection allows one sender with one recipient
  • Messaging system allows multiple producer nodes to send messages to the same topic and allows multiple consumer nodes to receive messages
  • Message queue: special DB:
    • Automatically delete message
    • working set is small: the queues are short
    • Not support arbitraryo queries
  • Patterns to consume messages:
    • Load balancing: each message delivered to one consumer
    • Fan out: each message delivered to all consumers
  • Acknowledgement: a client explicitly tell broker when it has finished processing the message. If not received, the broker will retry sending. This will result in reordering of the messages. To avoid this, we could use one queue per consuemr.

Log-based message brokers

  • A producer sends a message by appending it to the end of a log, and a consumer receives messages by reading the log sequentially.
  • The log can be Partitioned. Different partitions can be hosted on diffferent machines.
  • A Topic can be defined as a group of partitions that all carry messages of the same type
  • Within each partition, the broker assigns a monotonically increasing sequence number, or offset, to every message
  • To achieve load balancing, the broker can assign each partition to a group of consumer nodes
    • Cons: If a single message is slow to process, it could block the whole partition to be processed
    • Alternative: If message ordering not important and message processing is expensive: choose JMS/AMQP style of message broker
  • Disk Space: the extending log will take up space. We can divide the logs into segments and delete the old segments. Circular buffer: the log implements a bounded-size buffer that discards old messages when it gets full. The log can typically keep a buffer of several days’ or weeks’ worth of messages

Bring event stream idea to DB

  • CDC(Change Data Capture) The process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other stytems.
  • Initial Snapshot If you don’t have the entire log history, you need to start with a consistent snapshot that’s linked to a log offset
  • Kafka Connect is an effort to integrate CDC in to s a wide range of db systems with Kafka
  • Kafka supports Log Compaction to only keep most recent entries to save disk space

Event Sourcing

  • Event sourcing Involves storing all changes to the application state as a log of change events

CAP Theorem

  • Consistency: All nodes see the same data at the same time.
  • Availability: Availability means every request received by a non-failing node in the system must result in a response.
  • Partition tolerance: A partition is a communication break (or a network failure) between any two nodes in the system, i.e., both nodes are up but cannot communicate with each other. Ensure all changes made to the system of record is also reflacted in the derived data systems