Akash's Blog


Tuesday, April 15, 2025

System Design Series: DNS

Introduction

DNS (Domain Name System) translates domain names like www.abc.com to IP addresses.

Architecture

Once the IP is received from the central DNS server, caching takes place at the ISP as well as at the browser end. The TTL (Time to Live) configuration determines how long the DNS record can be cached; after that it must be fetched again.

There are different types of DNS records (a quick lookup sketch follows the list):
  • NS Record: Specifies the DNS server for a domain/subdomain.
  • MX Record: Specifies the mail server for accepting messages.
  • A Record: Points a name to an IP.
  • CNAME: Points a name to another canonical name.
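As a quick illustration, this sketch resolves an A record using Python's standard library; www.abc.com is the placeholder domain used in this post and may not actually resolve.

```python
# Resolve an A record (name -> IPv4 address) via the system resolver.
import socket

try:
    ip = socket.gethostbyname("www.abc.com")  # placeholder domain
    print(ip)
except socket.gaierror as e:
    print("resolution failed:", e)
```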

DNS services can also be used to balance load by routing traffic. There are different methods for balancing load through DNS:
  • Round Robin (preferably Weighted Round Robin)
  • Latency Based
  • Geolocation Based
For example, for a user located in the US, DNS can return the IP of a US server instead of an African one for www.abc.com. A weighted selection sketch follows.
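A minimal sketch of how a DNS service might implement weighted selection (a common way to approximate weighted round robin); the IPs, regions, and weights are illustrative.

```python
# Weighted selection: higher weight -> that IP is returned more often,
# so more traffic lands on that region's servers.
import random

RECORDS = [
    ("203.0.113.10", 3),  # US region, weight 3
    ("198.51.100.7", 1),  # Africa region, weight 1
]

def pick_ip():
    ips, weights = zip(*RECORDS)
    return random.choices(ips, weights=weights, k=1)[0]

print(pick_ip())
```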

DDoS Attack
A Distributed Denial-of-Service (DDoS) attack is a cyberattack that disrupts a network service, such as a website or server. The goal of a DDoS attack is to make the target inaccessible by overwhelming it with traffic. 

Thursday, April 10, 2025

System Design Series: Load Balancer & Reverse Proxy

Introduction

Load balancers and reverse proxies are components that are exposed to the client and abstract the internal system, which brings various benefits including scaling, easier maintenance, flexibility to change, security, etc.

Load Balancer

A load balancer distributes incoming requests across computing resources. It can be implemented in hardware or software; hardware solutions are expensive compared to software ones.

Benefits

  • Distributes the load.
  • Prevents requests to unhealthy instances.
  • Overcomes single point of failure.
  • SSL termination.
  • Session persistence.

Distribution Methods

  • Random
  • Round Robin
  • Based on session/cookie
  • Least loaded instance (round robin and least-loaded are sketched after this list)
  • Layer 4 (Transport Layer) - using source/destination IPs and ports in the header.
  • Layer 7 (Application Layer) - using header content, message, and cookies.
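A minimal sketch of the round robin and least-loaded methods; the backend IPs and connection counts are illustrative.

```python
import itertools

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Round robin: hand out backends in a fixed rotation.
rotation = itertools.cycle(BACKENDS)
def round_robin():
    return next(rotation)

# Least loaded: pick the backend with the fewest active connections.
active_connections = {b: 0 for b in BACKENDS}
def least_loaded():
    return min(active_connections, key=active_connections.get)

for _ in range(4):
    print(round_robin())  # 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.1
```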

Reverse Proxy

A reverse proxy is a server that acts as an interface to internal services. Client requests are forwarded through the reverse proxy.

Benefits

  • SSL termination
  • Caching static content
  • Compressing/encrypting server responses
  • Flexibility to update underlying server configurations

Note: A tool like nginx can be used as both a load balancer and a reverse proxy at layer 7 (application layer).
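A minimal reverse-proxy sketch using only Python's standard library, assuming a single hypothetical internal service on localhost:9000 (addresses and ports are illustrative); a production setup would use a dedicated tool like nginx as noted above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

BACKEND = "http://localhost:9000"  # hypothetical internal service

class ReverseProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the client's request to the internal backend and relay
        # the response; a real proxy would also cache static content,
        # compress responses, and terminate SSL here.
        with urlopen(BACKEND + self.path) as upstream:
            status = upstream.status
            body = upstream.read()
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 8080), ReverseProxyHandler).serve_forever()
```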

Tuesday, March 25, 2025

System Design Series: Asynchronism

Introduction

To achieve high performance, a system needs to execute multiple tasks in parallel. This increases overall throughput and makes the system highly responsive. Not all operations should be asynchronous, but there are many opportunities where asynchronism is the better choice. For example, after sending an email you can immediately send another without waiting for the first one to be delivered; this is an asynchronous operation.

Message Queue

A message queue is used to achieve asynchronous communication between the components of a system. The producer sends a message to the queue, and at the other end the consumer receives it and processes it at its own capacity.

We can consider it a buffer where sent messages are kept until the receiver processes them. Messages are processed in FIFO (first in, first out) order by default but can be configured to behave differently, for example priority-based processing (a minimal sketch appears after the examples below).

Components

  • Producer (Sends the message)
  • Queue (Stores the message)
  • Consumer (Receives the message)

Architectures

  • Point to Point (One producer one consumer)
  • Pub/Sub (One producer (publisher) multiple consumers (subscribers))

Examples

  • Kafka
  • SQS
  • RabbitMQ
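A minimal in-process sketch of the producer/queue/consumer flow using Python's standard library; a real system would use a broker like the ones listed above, and the message payloads here are illustrative.

```python
import queue
import threading

q = queue.Queue()  # the buffer between producer and consumer (FIFO by default)

def producer():
    for i in range(3):
        q.put(f"email-{i}")   # send without waiting for delivery

def consumer():
    while True:
        msg = q.get()          # receive and process at the consumer's own pace
        print("processing", msg)
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
q.join()  # wait until every message has been processed
```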

Streaming

Streaming is the continuous transfer of bytes over the network in real time. A streaming pipeline is generally set up with a message broker and a stream-processing component. Audio, video, data, and events can all be streamed; a small producer-to-consumer sketch follows the examples below.

Examples
  • Producer: Application logs, IoT sensors, financial transactions
  • Message Broker: Apache Kafka, RabbitMQ, AWS Kinesis
  • Stream Processing: Apache Kafka Streams, AWS Kinesis, Spark Streaming
  • Consumer: Databases, data lakes
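A small producer-to-consumer sketch, assuming the kafka-python client and a broker running on localhost:9092; the topic name and payload are illustrative.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: e.g. an IoT sensor publishing readings as they arrive.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"sensor_id": 7, "temp_c": 21.4})
producer.flush()

# Consumer side: a stream processor reading the same topic continuously.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)  # e.g. feed into a database or data lake
```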

Tuesday, March 11, 2025

System Design Series: Caching

Introduction

The idea of caching is to store frequently used data in high-speed temporary storage to reduce load on the database and improve overall performance.

When we read something from the database, it is read from disk, and disk reads are slow compared to reads from primary memory. Caching drastically reduces the work of the underlying database by avoiding frequent disk reads.

Thus caching loads frequently used data into memory and serves it from there when requested instead of going to disk, which makes things much faster and improves overall latency and throughput.

Latency is the time taken to finish a task.
Throughput is the number of tasks finished in a given time.

If you take 1 minute to read a page, your latency is 1 minute and your throughput is 60 PPH (pages per hour).



Aspects of Caching

Caching can be done at many layers: client side, server side, or at a dedicated (separate) cache layer.

There are options available for caching at the following layers:

  • Database
  • Web server
  • Application server
  • Browser (or client side)
  • Distributed caching
  • CDN (Content Delivery Network)

We need to consider different aspects while choosing the right caching mechanism: for example, the quantity of data to be loaded, which data to put in the cache, the refresh frequency of the data, expiry of unused data (eviction), and ensuring the cache does not face the cold-start or celebrity problem (if distributed).

Following are different strategies we can use to update (refresh) the data stored in the cache; a cache-aside sketch follows the list.

  • Cache Aside
  • Write Through
  • Write Behind
  • Refresh Ahead
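A minimal cache-aside sketch: the application looks in the cache first and falls back to the database on a miss. get_user_from_db and the TTL value are hypothetical stand-ins for a real database call and cache policy.

```python
import time

cache = {}          # key -> (value, expiry_timestamp)
TTL_SECONDS = 60    # assumed time-to-live

def get_user_from_db(user_id):
    # Placeholder for a real (slow) database read.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                    # cache hit: serve from memory
    value = get_user_from_db(user_id)      # cache miss: read from the database
    cache[user_id] = (value, time.time() + TTL_SECONDS)
    return value
```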

Also, cache entries should be evicted if not used for a certain period of time, for optimal storage use. One of the following eviction strategies can be used (an LRU sketch follows the list).
  • LRU (Least Recently Used)
  • LFU (Least Frequently Used)
  • FIFO (First In First Out)
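A minimal LRU eviction sketch using Python's OrderedDict; the capacity of 2 is chosen only to make the eviction visible.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")          # "a" is now most recently used
cache.put("c", 3)       # evicts "b"
print(cache.get("b"))   # None
```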

Tuesday, March 4, 2025

System Design Series: Database

Introduction

Databases are essential to a system; we have to choose the type and setup of the database as per our requirements. Databases are majorly characterised into two categories: SQL (Structured Query Language) and NoSQL (Not Only SQL). We will understand the difference between the two and the options for making a database highly available and scalable.


Failover Strategies

Failover strategies can be applied as per the requirement to overcome data loss and interruptions. 

Cold Standby


Take periodic backups of the database and use the backup to restore it in case of failure. This may lead to significant data loss and downtime, depending on the backup frequency and the time required to restore the database.



Warm Standby


Replicate the database and switch to the replica in case the actual database fails. In this strategy, the delta between the last replication and the current database state will be lost if the database fails. Downtime is less compared to cold standby.

Hot Standby


The application server (or database client) makes changes in both the actual and the standby instance, so that if the actual database server fails, it can simply switch to the standby instance quickly. There is no data loss, only minor downtime during the switch.

Scaling Strategies 

Master-Slave Replication 


In this setup the master database allows both reads and writes, while the slave allows only reads. The master replicates data to the slave on write; the idea is to keep the slave as up to date with the master as possible. Also, the majority of systems are read heavy; such reads can be served by the slave, reducing the overall load on the master. A read/write routing sketch follows.
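A minimal read/write-splitting sketch for this setup; the connection objects and their execute() method are hypothetical placeholders for a real database driver.

```python
import random

class RoutingDB:
    def __init__(self, master, replicas):
        self.master = master          # accepts reads and writes
        self.replicas = replicas      # read-only copies of the master

    def execute(self, sql, params=()):
        # Route writes to the master, spread reads across the replicas.
        if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master.execute(sql, params)
        return random.choice(self.replicas).execute(sql, params)
```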

Master-Master Replication 


Both database instances are masters; the client can read and write on either instance, and ideally the load is distributed. Replication takes place between the masters, which is a bit tricky compared to master-slave since both instances allow writes. However, if one instance fails, the other can quickly take its place.

Federation 


Database instances are created based on functionality: customer-related data is stored in a separate customer schema, product-related data in a product schema, and so forth. This allows individual databases to scale independently as per need. If you have complex queries and joins between such entities are frequent, this setup can make them more complex.

Sharding 


Data is distributed across nodes, and each node maintains a subset of the data partitioned on some criterion. For example, data for usernames starting with A-M is on one node and N-Z on another. This setup can struggle with the "celebrity problem", where one node gets most of the traffic while the others sit idle. A hash-based sharding sketch follows.
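A minimal hash-based sharding sketch; the node names are illustrative. Hashing spreads keys more evenly than alphabetical ranges, though a single hot key can still overload its shard.

```python
import hashlib

SHARDS = ["db-node-0", "db-node-1", "db-node-2"]

def shard_for(key: str) -> str:
    # Stable hash so the same key always maps to the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice"), shard_for("bob"), shard_for("carol"))
```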


SQL vs. NoSQL

SQL and NoSQL databases each have their place; based on the needs of the system we can choose either. SQL databases are relational databases; NoSQL databases can be key-value stores, document stores, wide-column stores, or graph databases. Depending on the requirement we can use SQL or NoSQL, or even a combination of both. Following are the fundamental differences between the two that we should be aware of.

SQL                                             NoSQL
Structured Query Language                       Not Only SQL
Relational database                             Non-relational database
Schema is predefined and strict                 Schema is dynamic and flexible
Vertically scalable                             Horizontally scalable
MySQL, Postgres, Oracle etc.                    Redis, MongoDB, Cassandra etc.
Better for structured data where complex        Better for large-scale unstructured
queries with joins and aggregation are needed   or semi-structured data
ACID compliance                                 BASE (Basically Available, Soft state,
                                                Eventual consistency)


Tuesday, February 25, 2025

System Design Series: Resilience

Introduction

Resilience can be defined as the ability of a system to recover from failures, disruptions, or any kind of event that impacts its proper functioning.

In the real world, any system is likely to fail once in a while due to various known or unknown reasons. We need to see what we can do to help our system get through such situations, how we can avoid harm to the system, and how we can get back to business quickly.

Approach

Resilience can be achieved using the different approaches listed below.

Fault Tolerance
The system continues to work even if any software or hardware fails (fully or partially).
e.g., in load balancing, if one server fails, another server is created or the load is distributed among the remaining instances.

Redundancy
Redundancy or duplication ensures backup and also helps in recovering from failure quickly.
e.g., create another instance using database replication and use it if the original instance fails.

Monitoring
Continuous tracking and monitoring helps in early detection of problems.
e.g., continuously track the system to ensure expected health and take automatic actions or alert if the given health criteria are not met.

Disaster Recovery
Restore the system after any disaster.
e.g., regular backups to avoid data loss or keep it to a minimum.

Self Healing
The system can automatically correct itself when issues occur.

e.g., AWS auto scaling: if one instance fails, a new instance is automatically created and traffic from the unhealthy one is transferred to it.

Tuesday, February 18, 2025

System Design Series: Availability and Consistency

Introduction

We wish to build a system that is always working and giving accurate responses. However, in the world of distributed systems this is not achievable; we have to make trade-offs based on the needs of the user and the business.

For a banking system you must prioritise data consistency over availability: it's okay if it's unavailable for a few minutes, but it's not at all acceptable for it to give inaccurate results. On the contrary, TikTok must be highly available, otherwise people may lose interest, but it's okay if users don't see a newly uploaded video immediately.

Using the CAP theorem, we will understand why we have to trade off and why we can't have both.

CAP Theorem

CAP stands for,
  • Consistency: Every read receives the latest write.
  • Availability: Every request receives a non-error response, not necessarily with the latest data.
  • Partition Tolerance: The system continues to work even if communication between nodes fails.
The CAP theorem states that we can achieve only two of these in a distributed system. Based on our requirements, we need to trade off one of them to achieve the desired results.

We can have any of these systems according to the CAP theorem:
  • CP: Consistent and Partition Tolerant.
  • AP: Available and Partition Tolerant.
  • CA: Consistent and Available (not possible in a distributed environment).
In a distributed environment, it's technically impossible to be available all the time and return the latest data on every read: if a network partition (communication failure) happens, the system has two choices, either fail (return an error) or return stale data, which breaks consistency.
There are different patterns for consistency and availability, which are listed below.

Consistency Patterns

  • Weak Consistency: A write may or may not be seen by subsequent reads. (e.g. video calls)
  • Eventual Consistency: A write will soon be visible to reads. (e.g. email)
  • Strong Consistency: A write is immediately visible to reads. (e.g. file systems)

Availability Patterns

  • Replication: Replicate data to an additional component using a master-master or master-slave setup.
  • Fail-over: A standby instance takes over if the original instance fails, using an active-active or active-passive setup.

Availability of a system is expressed as a percentage. For example, a system that is 99.9% available is said to have three 9's of availability. If a system is 90% available, that roughly means it will be unavailable for ~36.5 days in a year, ~3 days in a month, and ~2.4 hours in a day. The sketch below computes these downtime budgets.
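A small calculation of the downtime budget implied by each availability level; the list of levels is illustrative.

```python
# Downtime per year = (1 - availability) * hours in a year.
LEVELS = [0.90, 0.99, 0.999, 0.9999]
HOURS_PER_YEAR = 365 * 24

for a in LEVELS:
    downtime_hours = (1 - a) * HOURS_PER_YEAR
    print(f"{a:.2%} available -> ~{downtime_hours:.1f} hours of downtime/year")

# 90.00% available -> ~876.0 hours of downtime/year  (~36.5 days)
# 99.90% available -> ~8.8 hours of downtime/year    (three 9's)
```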


Tuesday, February 11, 2025

System Design Series: Scalability

Introduction

Scalability means a system's capability to handle more work.

Consider an example of a website where one server handles 1,000 users. For some reason more and more users are opening the website, and suddenly the number of users increases to 50,000. The system is not capable of handling more than 50,000 users, so if the load increases further, the system will fail to serve users and may crash.

We can solve this problem primarily in two ways: we can scale the system either vertically or horizontally.

Vertical Scaling

Increasing the capacity of the server to handle more work is considered vertical scaling.

Pros

  • Easy to maintain as there will be fewer components in the system.

Cons

  • The capacity increase comes with additional cost, which grows rapidly for large-scale systems.
  • There is an upper limit up to which you can scale.
  • Single point of failure.

Horizontal Scaling

Increasing the number of servers to handle increasing work is considered horizontal scaling. With more servers we can (evenly) distribute the load amongst them using a load balancer.

Pros

  • Can solve single point of failure problem.
  • Highly scalable and available.
  • Comparatively cheaper, as a few small servers cost less than one high-end server.

Cons

  • Added complexity in deployment and maintenance.

