Fundamentals of Distributed Theory

Classical Byzantine Generals Problem

Story overview: several armies must agree on a common battle plan, and the final voting result must remain consistent across all loyal generals even when some messengers deliver false messages.

Oral Message Solution: with m traitors, the Byzantine Generals Problem can be solved only if the number of generals is at least 3m + 1.
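The 3m + 1 bound reduces to a one-line check; a minimal sketch (the function name is illustrative):

```python
def can_tolerate_byzantine(n: int, m: int) -> bool:
    """Oral-message bound: n generals can reach agreement
    despite m traitors only if n >= 3m + 1."""
    return n >= 3 * m + 1

# With one traitor, 3 generals are not enough, but 4 are.
print(can_tolerate_byzantine(3, 1))  # False
print(can_tolerate_byzantine(4, 1))  # True
```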

Signed Message Solution: sign battle plans before transmission so that tampering can be detected; traitors can then be identified without increasing the required number of loyal generals.

The Byzantine problem is the hardest and most complex fault class in distributed systems, since it covers not only crash faults but also malicious behavior. Digital cryptocurrencies therefore require Byzantine Fault Tolerance (BFT) algorithms. Common BFT algorithms include PBFT and PoW.

Distributed problems in ordinary computer systems are more often non-Byzantine, handled by Crash Fault Tolerance (CFT) algorithms. Common examples are the Paxos algorithm, the Raft algorithm, and the ZAB protocol.

How to choose the appropriate algorithm type in practice? If you can ensure that the nodes in the environment are trustworthy and there is no malicious behavior such as message tampering or forgery (for example, a distributed routing and addressing system inside a DevOps environment), use a non-Byzantine (CFT) algorithm; otherwise use a BFT algorithm, such as PoW in a blockchain.

CAP Theorem

  • Consistency: no matter which node is queried, the returned result is always the latest, consistent data.
  • Availability: no matter which non-faulty node is accessed, a response can be obtained, but it is not guaranteed to be the latest data (emphasizes that the service is usable, not that the data is correct); often measured by service latency.
  • Partition Tolerance: no matter what data-synchronization problems occur within the cluster, the cluster continues to operate.

CAP Impossible Triangle: the three properties cannot all hold simultaneously; at most two of the three can be chosen.

How to make selections?

  • In cases where network partitions do not exist, CA can be guaranteed simultaneously.
  • When network partitions exist, C and A must be mutually exclusive. If reading stale data would have adverse effects on business, choose C; otherwise choose A.
  • Generally speaking, the closer the cluster is to the final storage cluster, the stronger the demand for A becomes.

CP model examples: Etcd, Consul, and HBase. AP model examples: Cassandra and DynamoDB.

  1. If the business requires strong consistency, availability must be sacrificed and CP model should be chosen.
  2. If the business only needs eventual consistency, priority should be given to satisfying availability and choose AP model.

ACID Theory

It is recommended that when developing distributed systems, unless absolutely necessary, try not to implement transactions and consider adopting eventual consistency instead.

Implementing strong consistency inevitably hurts availability. For example, in a cluster using the two-phase commit protocol, a commit cannot complete until every node has voted and confirmed, so a single slow or failed node blocks the whole transaction.

Two-Phase Commit Protocol

To complete a distributed transaction (meaning the specific actions on the different nodes either all occur or none occur), the two-phase commit protocol can be used.

  1. Prepare (voting) phase: the client requests a node; that node acts as coordinator and contacts the other nodes to vote on whether they can execute the transaction.
  2. Commit execution phase: if every node votes yes, the coordinator, starting from itself, executes the action and after success notifies the next node, until all nodes have completed the action and reported back.

Before a participant votes to commit a transaction, it must guarantee that it can carry out its part of the commit protocol even if it fails or is replaced midway. This property is something we must ensure in the code implementation.
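The two phases can be sketched with an in-memory coordinator and participants. This is a minimal illustration (all names are invented for the example); a real implementation must persist votes and decisions so participants survive crashes:

```python
class Participant:
    """A transaction participant; `can_commit` simulates its vote."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):          # phase 1: vote yes/no
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):           # phase 2: apply the action
        self.state = "committed"

    def abort(self):            # phase 2: roll back
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect votes; a single "no" aborts the whole transaction.
    if all(p.prepare() for p in participants):
        for p in participants:  # Phase 2: everyone commits
            p.commit()
        return "committed"
    for p in participants:      # Phase 2: everyone rolls back
        p.abort()
    return "aborted"
```

Usage: `two_phase_commit([Participant("a"), Participant("b")])` returns `"committed"`, while including a participant with `can_commit=False` drives every node to the aborted state, illustrating the all-or-nothing property.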

The XA protocol, used by MySQL, is a distributed transaction protocol based on two-phase commit.

Disadvantage: Requires resource locking, affecting concurrent performance.

TCC

TCC is an acronym for the three operations Try (reserve), Confirm, and Cancel (roll back). It is essentially a compensating transaction: each operation registers a corresponding confirmation operation and a compensation (rollback) operation. Confirm and Cancel must be idempotent, since either may fail and be retried.

TCC does not rely on database transactions; it implements the distributed transaction in business logic. The client first sends Try requests to the nodes to reserve resources. If everything succeeds, it proceeds to the Confirm stage and sends execution requests; if any step fails, it executes Cancel to roll back, otherwise the transaction completes successfully.
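A minimal sketch of the Try/Confirm/Cancel flow for an account transfer, assuming in-memory accounts (all names are illustrative). Note how a set of finished transaction ids makes Confirm and Cancel idempotent, as the text requires:

```python
class Account:
    def __init__(self, balance):
        self.balance = balance
        self.frozen = 0
        self.done = set()  # finished txn ids: makes Confirm/Cancel idempotent

    def try_reserve(self, txn, amount):
        # Try: only reserve (freeze) the resource, do not consume it yet.
        if self.balance < amount:
            return False
        self.balance -= amount
        self.frozen += amount
        return True

    def confirm(self, txn, amount):
        # Confirm: consume the reservation; retries are no-ops.
        if txn in self.done:
            return
        self.frozen -= amount
        self.done.add(txn)

    def cancel(self, txn, amount):
        # Cancel: release the reservation back; retries are no-ops.
        if txn in self.done:
            return
        self.frozen -= amount
        self.balance += amount
        self.done.add(txn)


def transfer(txn, src, dst, amount):
    if not src.try_reserve(txn, amount):   # Try stage
        return False
    dst.balance += amount                   # Confirm stage: apply the action
    src.confirm(txn, amount)
    return True
```

If the destination write failed instead, the caller would invoke `src.cancel(txn, amount)` to compensate; calling it twice is harmless because of the `done` set.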

BASE Theory

BASE theory extends the AP side of the CAP theorem; it is a practical summary from large-scale Internet distributed systems, emphasizes availability, and is widely used in NoSQL. Core ideas: Basically Available and Eventually Consistent. Basic availability is achieved through service degradation, sacrificing part of the functionality to keep core system functionality available. Common measures: traffic peak shaving, delayed responses, degraded experience, and overload protection (circuit breaking). Eventual consistency means that after a period of synchronization, all data replicas in the system eventually reach a consistent state.

How to ensure eventual consistency? Read-time repair, write-time repair, and asynchronous repair (scheduled reconciliation jobs that detect and fix data inconsistencies).

Write-time repair implementation: when a remote write to another node fails, cache the data locally and periodically retransmit it until the inconsistency is fixed. This approach has low performance overhead because it does not require comparing replicas for consistency.
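Write-time repair can be sketched as hinted handoff: failed writes are cached as "hints" and replayed later. The `send` callback and class name are assumptions for the example:

```python
from collections import deque

class WriteRepairer:
    """Caches failed remote writes and retransmits them periodically."""

    def __init__(self, send):
        self.send = send      # send(node, key, value) -> bool (True on success)
        self.hints = deque()  # cached failed writes awaiting retry

    def write(self, node, key, value):
        if not self.send(node, key, value):
            self.hints.append((node, key, value))  # cache for later retry

    def retry_hints(self):
        # Called periodically: replay cached writes to repair the replica.
        for _ in range(len(self.hints)):
            node, key, value = self.hints.popleft()
            if not self.send(node, key, value):
                self.hints.append((node, key, value))  # still down, keep hint
```

A failed write to a temporarily down node is queued; once the node comes back, the next `retry_hints` pass delivers it, with no replica comparison needed.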

Read-time repair and asynchronous repair are more expensive because they must compare replicas for consistency. When building a real system, optimize the consistency-comparison algorithm as much as possible to reduce overhead and avoid impacting normal operation.

When implementing eventual consistency, it is also recommended to implement configurable write consistency levels, allowing users to choose the trade-off themselves. For any cluster, the ultimate consequence of unpredictable failures is system overload.
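Configurable write consistency levels can be sketched in the Cassandra style, letting the caller pick how many replica acknowledgements a write needs. The level names and functions here are assumptions for illustration:

```python
def required_acks(level, n):
    """Map a user-chosen consistency level to the number of acks needed."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]

def write(replicas, value, level):
    # Each replica is a callable returning True if its local write succeeded.
    acks = sum(1 for replica in replicas if replica(value))
    return acks >= required_acks(level, len(replicas))
```

With three replicas and one of them down, a `QUORUM` write (2 of 3 acks) still succeeds while an `ALL` write fails: the user, not the system, decides how much availability to trade for consistency.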

Paxos Algorithm

Understanding roles is key to mastering Paxos. Three roles:

  • Proposer: after receiving a client request, initiates the two-phase (prepare/accept) process and conducts consensus negotiation.
  • Acceptor: Every node is an acceptor, votes on values proposed by proposers, accepts and stores agreed-upon values.
  • Learner: Passively receives agreed-upon values, typically backup machines for data replication.

Basic Paxos: the proposal number represents priority. You can understand it this way: based on proposal numbers, an acceptor guarantees three promises.

  • If the proposal number in a prepare request is less than or equal to that of a prepare request the acceptor has already responded to, the acceptor promises not to respond to it.
  • If the proposal number in an accept request is less than that of a prepare request the acceptor has already responded to, the acceptor promises not to approve the proposal.
  • If the acceptor has previously approved any proposals, it promises to include the highest-numbered approved proposal in its responses to prepare requests.
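The acceptor's three promises can be sketched as a minimal in-memory class (all names are illustrative; a real acceptor must persist this state to disk before replying):

```python
class Acceptor:
    def __init__(self):
        self.promised = 0       # highest prepare number responded to
        self.accepted_n = 0     # number of the accepted proposal, if any
        self.accepted_v = None  # value of the accepted proposal, if any

    def on_prepare(self, n):
        # Promise 1: ignore prepare requests numbered <= the promised number.
        if n <= self.promised:
            return None
        self.promised = n
        # Promise 3: report the highest-numbered proposal already accepted.
        return (self.accepted_n, self.accepted_v)

    def on_accept(self, n, value):
        # Promise 2: reject accept requests numbered below the promised number.
        if n < self.promised:
            return False
        self.promised = n
        self.accepted_n, self.accepted_v = n, value
        return True
```

A later prepare with a higher number learns the already-accepted value, which is exactly why an approved value can no longer change.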

Basic Paxos only gets a majority of nodes to reach consensus on the value of a single key. Once consensus is reached (i.e., the proposal is approved), the value will not change; a later proposal with a larger number only updates the proposal number, not the value.

Multi-Paxos Algorithm

Basic Paxos reaches consensus on a single value; reaching consensus on a series of values is what Multi-Paxos does, and Multi-Paxos is the form of the algorithm that is actually deployable in practice.