Availability = Uptime / (Uptime + Downtime)
Usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability.
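To make the "nines" concrete, here is a minimal sketch that turns an availability target into a downtime budget using the formula above. The function name `max_downtime_minutes` and the 365-day year are assumptions made for illustration.

```python
def max_downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Downtime allowed in a period, from Availability = Uptime / (Uptime + Downtime)."""
    return (1 - availability_pct / 100) * period_minutes


MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year

for target in (99.9, 99.99, 99.999):
    budget = max_downtime_minutes(target, MINUTES_PER_YEAR)
    print(f"{target}% availability -> {budget:.1f} minutes of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten.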
To achieve High Availability:
The ability of a system to continue to work as user load and data grow.
Common metrics:
1. Latency
requests 1 through 9: 100ms each; request 10: 3s => mean latency = (9 × 100ms + 3000ms) / 10 = 390ms
p95 = 100ms => 95 out of 100 requests take at most 100ms; the other 5 requests take longer than 100ms.
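A short sketch of why percentiles are reported alongside the mean, using the ten-request example above. The `percentile` helper uses the nearest-rank definition, which is one common convention and an assumption here; monitoring tools may interpolate differently.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = [100] * 9 + [3000]  # nine fast requests, one slow request

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(mean_ms, p50_ms, p99_ms)
```

The mean (390ms) describes none of the requests well: the median shows what a typical request sees, and the high percentile exposes the slow outlier.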
Causes of high latency:
reduce Latency
2. Throughput
3. Throughput and Latency
If operation B started after operation A successfully completed, then operation B must see the system in the same state as it was on completion of operation A or a newer state.
Read-After-Write: Any read operation that begins after a write operation completes must return that value or the result of a later write operation
Example of inconsistency:
- Replication is asynchronous, so a read served by a Slave DB right after a write to the Master DB may still return the old value.
1 Master DB and multiple Slave DBs. Master DB serves all update operations (insert, update, delete).
Data is synced from Master DB to Slave DBs asynchronously. Slave DBs are read-only (select).
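The effect of asynchronous syncing can be sketched with a toy in-memory model. `MasterDB`, `SlaveDB`, and the explicit `replicate()` step are hypothetical names for illustration, not a real database API; in a real system replication runs in the background.

```python
class SlaveDB:
    """Read-only copy that only changes when the master ships updates."""

    def __init__(self):
        self.data = {}

    def select(self, key):
        return self.data.get(key)


class MasterDB:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves
        self.pending = []  # updates acknowledged but not yet shipped to slaves

    def update(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # the write returns before slaves see it

    def replicate(self):
        # Stand-in for the background sync to every slave.
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()


slave = SlaveDB()
master = MasterDB([slave])

master.update("k", 1)
stale_read = slave.select("k")   # read-after-write is violated here
master.replicate()
fresh_read = slave.select("k")   # the slave has caught up
```

The window between `update` and `replicate` is the replication lag during which a slave can serve stale data.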
Benefits:
Drawback
In practice, Sharding and Replication often go together: each Shard is a cluster of 1 master + n slaves.
storing a large database across multiple machines.
Why:
Benefits:
Challenges
Some shards become unbalanced due to the uneven distribution of data.
E.g: a single physical shard that contains customer names starting with A receives more data than others.
Application complexity: Most database management systems do not have built-in sharding features. Database designers and software developers must manually split, distribute, and manage the Database.
Infra & operational cost
How
shard = hash(key) % N, where key is the row's shard key => if N changes, we need to reshard.
name = “quang” => hash(“quang”) = 12312834 => 12312834 % 3 = 0
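A minimal sketch of hash-based shard selection, and of how many keys move when N changes. The function `shard_for` and the `user-<i>` keys are made up for illustration. MD5 is used only as a stable hash, because Python's built-in `hash()` for strings is randomized per process.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index with a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


keys = [f"user-{i}" for i in range(1000)]  # hypothetical shard keys

# Count the keys that land on a different shard when N goes from 3 to 4.
moved = sum(1 for k in keys if shard_for(k, 3) != shard_for(k, 4))
print(f"{moved} of {len(keys)} keys change shard when N goes from 3 to 4")
```

With plain modulo hashing most keys move, which is why changing N forces a reshard.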
When the number of users increases, traffic reaches the web server’s load limit => users experience slow responses or timeouts. Adding more servers to handle user requests => we need a Load Balancer (LB) to distribute the traffic among the web servers.
Benefits:
LB Algorithms
e.g.: with three servers, the 1st req goes to the 1st server, the 2nd req to the 2nd server, the 3rd req to the 3rd server, and the 4th req back to the 1st server.
e.g: #server = 5, req_id = 7 is routed to server_id = 7 % 5 = 2
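The two examples above can be sketched in a few lines. The server names and the helper `route_by_id` are placeholders for illustration.

```python
from itertools import cycle

# Round robin: hand out servers in order and wrap around after the last one.
servers = ["server-1", "server-2", "server-3"]
round_robin = cycle(servers)
picks = [next(round_robin) for _ in range(4)]
# picks == ["server-1", "server-2", "server-3", "server-1"]


def route_by_id(req_id: int, num_servers: int) -> int:
    """Modulo routing: the same req_id always maps to the same server_id."""
    return req_id % num_servers


server_id = route_by_id(7, 5)  # 7 % 5 == 2
```

Round robin spreads requests evenly but ignores who sent them. Modulo routing is sticky per id, but the mapping shifts whenever the number of servers changes.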
map each user’s request to a slot on the ring
r = hash(req_id) % n
map each server_id to a slot on the ring as well
s = hash(server_id) % n
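use a balanced BST to maintain the ring (TreeMap in Java, std::map in C++; Python has no built-in balanced BST, so a sorted list with the bisect module is a common substitute, since OrderedDict keeps insertion order rather than sorted order)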
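A minimal sketch of the ring, assuming the usual rule that a request is served by the first server slot at or after its own slot, wrapping around at the end. The class `HashRing`, the ring size, and the MD5-based `slot` function are illustrative choices. A sorted list with `bisect` stands in for the balanced BST (lookups are O(log n), insertions O(n)).

```python
import bisect
import hashlib

RING_SIZE = 2**32  # n, the number of slots on the ring (assumed value)


def slot(key: str) -> int:
    """hash(key) % n, using a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class HashRing:
    def __init__(self, server_ids=()):
        self._slots = []   # sorted slots occupied by servers
        self._owner = {}   # slot -> server_id
        for server_id in server_ids:
            self.add(server_id)

    def add(self, server_id: str) -> None:
        s = slot(server_id)
        if s in self._owner:
            raise ValueError(f"slot collision for {server_id!r}")
        bisect.insort(self._slots, s)
        self._owner[s] = server_id

    def remove(self, server_id: str) -> None:
        s = slot(server_id)
        self._slots.remove(s)
        del self._owner[s]

    def route(self, req_id: str) -> str:
        if not self._slots:
            raise LookupError("no servers on the ring")
        r = slot(req_id)
        # First server slot >= r; index len(slots) wraps around to slot 0.
        i = bisect.bisect_left(self._slots, r) % len(self._slots)
        return self._owner[self._slots[i]]
```

Removing a server only reassigns the requests that server owned; every other request keeps its server, unlike with `req_id % #server`.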
A temporary storage area stores the result of expensive responses or frequently accessed data in primary memory so that subsequent requests are served more quickly.
Benefits:
pitfalls
Stale Set: a stale set occurs when a client sets a stale value in Cache
- Client C1 get(k), k doesn’t exist in Cache (cache miss)
- Client C1 query(k) on the Database, which returns 1. C1 then stalls at this point (e.g. slow network)
- Client C2 update(k,2)
- Client C2 del(k) in cache
- Client C1 set(k,1) (C1 resumes) => Cache now holds the stale value 1 while the Database holds 2
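The interleaving above can be replayed step by step with two plain dictionaries standing in for the Database and the Cache (an illustrative model, not a real cache client):

```python
db = {"k": 1}
cache = {}

# C1: get(k) is a cache miss, so C1 falls back to the Database.
c1_cached = cache.get("k")   # None -> cache miss
c1_value = db["k"]           # C1 reads 1, then stalls

# C2: update(k, 2) in the Database, then del(k) in the Cache.
db["k"] = 2
cache.pop("k", None)

# C1 resumes and sets the value it read before C2's update.
cache["k"] = c1_value

print(cache["k"], db["k"])   # the Cache and the Database now disagree
```

Because C2's delete ran before C1's set, nothing is left to invalidate the stale entry; it stays until it expires or the key is updated again.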
Thundering herd: occurs when a specific key undergoes heavy read-and-write activity.
- Key k is trendy; lots of clients call get(k)
- C1 get(k), Cache returns v1 very fast (cache hit)
- C2 updates k, which invalidates it in Cache: del(k) in Cache
- Now every other client that calls get(k) gets a cache miss and falls back to query(k), which overloads the Database
for heavy-read workload
Needs support from the cache library to read from and write to the DB, whereas in cache-aside this is handled in the Application.
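A rough side-by-side sketch of where the database call lives in each approach. `get_cache_aside`, `ReadThroughCache`, and the `loader` callback are hypothetical names; real cache libraries expose this differently.

```python
def get_cache_aside(key, cache, db):
    """Cache-aside: the application checks the cache and loads from the DB itself."""
    if key in cache:
        return cache[key]
    value = db[key]        # the application talks to the DB on a miss
    cache[key] = value
    return value


class ReadThroughCache:
    """Read-through: the cache owns the loader, so callers only talk to the cache."""

    def __init__(self, loader):
        self._store = {}
        self._loader = loader  # e.g. a function that queries the DB

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._loader(key)
        return self._store[key]


db = {"a": 1}

app_cache = {}
first = get_cache_aside("a", app_cache, db)

read_through = ReadThroughCache(loader=db.__getitem__)
second = read_through.get("a")
```

The read path is the same in both; the difference is which component holds the DB access logic.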
A message queue is a durable component that supports asynchronous communication.
Message queues are often used when we want to delegate work from a service while ensuring that each piece of work is executed only once.
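A minimal in-process sketch of delegating work through a queue. Python's `queue.Queue` is not durable and only stands in for a real message broker here. The point it shows is that each message is taken off the queue by exactly one worker.

```python
import queue
import threading

tasks = queue.Queue()
results = []


def worker():
    while True:
        job = tasks.get()          # each job is handed to exactly one consumer
        if job is None:            # sentinel: no more work
            tasks.task_done()
            break
        results.append(job * 2)    # stand-in for the delegated work
        tasks.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

# The producing service enqueues work and moves on without waiting for it.
for job in (1, 2, 3):
    tasks.put(job)
tasks.put(None)

tasks.join()       # wait until every enqueued item has been processed
consumer.join()
```

With a real broker the producer and consumer are separate services, and durability means queued work survives a restart.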
A durable component that supports asynchronous communication.
Used where we need a guarantee that each subscriber gets a copy of the message.
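In contrast to a work queue, every subscriber gets its own copy of each message. A toy in-memory sketch follows; the `Topic` class is hypothetical and not durable.

```python
from collections import deque


class Topic:
    def __init__(self):
        self._inboxes = {}  # subscriber name -> that subscriber's own inbox

    def subscribe(self, name):
        inbox = deque()
        self._inboxes[name] = inbox
        return inbox

    def publish(self, message):
        # Fan out: append a copy of the message to every subscriber's inbox.
        for inbox in self._inboxes.values():
            inbox.append(message)


topic = Topic()
billing = topic.subscribe("billing")
email = topic.subscribe("email")

topic.publish("order-created")
```

After the publish, both `billing` and `email` hold `"order-created"`, and consuming it from one inbox does not remove it from the other.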
What is the maximum downtime per week for a system that has 99.99% availability?
0.01% × 7 × 24 × 60 min = 0.0001 × 10,080 min ≈ 1 min
What if there is only 1 Slave and it crashes, while the Master is still alive?
Read operations will be directed to the Master DB temporarily.
What if the Master crashes while the Slaves are still alive?
- A Slave DB will be promoted to be the new Master.
- Data in the promoted Slave may not be up to date, because replication is asynchronous.
Cache Aside pitfalls