Akash's Blog


Tuesday, April 15, 2025

System Design Series: DNS

Introduction

DNS (Domain Name System) translates domain names like www.abc.com to IP addresses.

Architecture

Once the IP is received from the central DNS server, caching takes place at the ISP as well as at the browser end. The TTL (Time to Live) configuration determines how long the DNS record can be cached; after that it must be fetched again.

There are different types of DNS records (a quick lookup sketch follows the list):
  • NS Record: Specifies the DNS server for a domain/subdomain.
  • MX Record: Specifies the mail server for accepting messages.
  • A Record: Points a name to an IP.
  • CNAME: Points a name to another canonical name.
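As a quick illustration, this sketch resolves an A record using Python's standard library; www.abc.com is the placeholder domain used in this post and may not actually resolve.

```python
# Resolve an A record (name -> IPv4 address) via the system resolver.
import socket

try:
    ip = socket.gethostbyname("www.abc.com")  # placeholder domain
    print(ip)
except socket.gaierror as e:
    print("resolution failed:", e)
```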

DNS services can also be used to balance load by routing traffic. There are different methods for balancing load through DNS:
  • Round Robin (preferably Weighted Round Robin)
  • Latency Based
  • Geolocation Based
For example, for a user located in the US, DNS can return the IP of a US server instead of an African one for www.abc.com. A weighted selection sketch follows.
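A minimal sketch of how a DNS service might implement weighted selection (a common way to approximate weighted round robin); the IPs, regions, and weights are illustrative.

```python
# Weighted selection: higher weight -> that IP is returned more often,
# so more traffic lands on that region's servers.
import random

RECORDS = [
    ("203.0.113.10", 3),  # US region, weight 3
    ("198.51.100.7", 1),  # Africa region, weight 1
]

def pick_ip():
    ips, weights = zip(*RECORDS)
    return random.choices(ips, weights=weights, k=1)[0]

print(pick_ip())
```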

DDoS Attack
A Distributed Denial-of-Service (DDoS) attack is a cyberattack that disrupts a network service, such as a website or server. The goal of a DDoS attack is to make the target inaccessible by overwhelming it with traffic. 

Thursday, April 10, 2025

System Design Series: Load Balancer & Reverse Proxy

Introduction

Load balancers and reverse proxies are components that are exposed to the client and abstract the internal system, which brings various benefits including scaling, easier maintenance, flexibility to change, security, etc.

Load Balancer

A load balancer distributes incoming requests across computing resources. It can be implemented in hardware or software; hardware solutions are expensive compared to software ones.

Benefits

  • Distributes the load.
  • Prevents requests to unhealthy instances.
  • Overcomes single point of failure.
  • SSL termination.
  • Session persistence.

Distribution Methods

  • Random
  • Round Robin
  • Based on session/cookie
  • Least loaded instance (round robin and least-loaded are sketched after this list)
  • Layer 4 (Transport Layer) - using source/destination IPs and ports in the header.
  • Layer 7 (Application Layer) - using header content, message, and cookies.
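A minimal sketch of the round robin and least-loaded methods; the backend IPs and connection counts are illustrative.

```python
import itertools

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Round robin: hand out backends in a fixed rotation.
rotation = itertools.cycle(BACKENDS)
def round_robin():
    return next(rotation)

# Least loaded: pick the backend with the fewest active connections.
active_connections = {b: 0 for b in BACKENDS}
def least_loaded():
    return min(active_connections, key=active_connections.get)

for _ in range(4):
    print(round_robin())  # 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.1
```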

Reverse Proxy

A reverse proxy is a server that acts as an interface to internal services. Client requests are forwarded through the reverse proxy.

Benefits

  • SSL termination
  • Caching static content
  • Compressing/encrypting server responses
  • Flexibility to update underlying server configurations

Note: A tool like nginx can be used as both a load balancer and a reverse proxy at layer 7 (application layer).
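A minimal reverse-proxy sketch using only Python's standard library, assuming a single hypothetical internal service on localhost:9000 (addresses and ports are illustrative); a production setup would use a dedicated tool like nginx as noted above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

BACKEND = "http://localhost:9000"  # hypothetical internal service

class ReverseProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the client's request to the internal backend and relay
        # the response; a real proxy would also cache static content,
        # compress responses, and terminate SSL here.
        with urlopen(BACKEND + self.path) as upstream:
            status = upstream.status
            body = upstream.read()
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 8080), ReverseProxyHandler).serve_forever()
```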

Tuesday, March 25, 2025

System Design Series: Asynchronism

Introduction

To achieve high performance, a system needs to execute multiple tasks in parallel. This increases overall throughput and makes the system highly responsive. Not all operations should be asynchronous, but there are many opportunities where asynchronism is the better choice. For example, after sending an email you can immediately send another without waiting for the first one to be delivered; this is an asynchronous operation.

Message Queue

A message queue is used to achieve asynchronous communication between the components of a system. The producer sends a message to the queue, and at the other end the consumer receives it and processes it at its own capacity.

We can consider it a buffer where sent messages are kept until the receiver processes them. Messages are processed in FIFO (first in, first out) order by default but can be configured to behave differently, for example priority-based processing (a minimal sketch appears after the examples below).

Components

  • Producer (Sends the message)
  • Queue (Stores the message)
  • Consumer (Receives the message)

Architectures

  • Point to Point (One producer one consumer)
  • Pub/Sub (One producer (publisher) multiple consumers (subscribers))

Examples

  • Kafka
  • SQS
  • RabbitMQ
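A minimal in-process sketch of the producer/queue/consumer flow using Python's standard library; a real system would use a broker like the ones listed above, and the message payloads here are illustrative.

```python
import queue
import threading

q = queue.Queue()  # the buffer between producer and consumer (FIFO by default)

def producer():
    for i in range(3):
        q.put(f"email-{i}")   # send without waiting for delivery

def consumer():
    while True:
        msg = q.get()          # receive and process at the consumer's own pace
        print("processing", msg)
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
q.join()  # wait until every message has been processed
```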

Streaming

Streaming is the continuous transfer of bytes over the network in real time. A streaming pipeline is generally set up with a message broker and a stream-processing component. Audio, video, data, and events can all be streamed; a small producer-to-consumer sketch follows the examples below.

Examples
  • Producer: Application logs, IoT sensors, financial transactions
  • Message Broker: Apache Kafka, RabbitMQ, AWS Kinesis
  • Stream Processing: Apache Kafka Streams, AWS Kinesis, Spark Streaming
  • Consumer: Databases, data lakes
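A small producer-to-consumer sketch, assuming the kafka-python client and a broker running on localhost:9092; the topic name and payload are illustrative.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: e.g. an IoT sensor publishing readings as they arrive.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"sensor_id": 7, "temp_c": 21.4})
producer.flush()

# Consumer side: a stream processor reading the same topic continuously.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)  # e.g. feed into a database or data lake
```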

Tuesday, March 11, 2025

System Design Series: Caching

Introduction

The idea of caching is to store frequently used data in high-speed temporary storage to reduce load on the database and improve overall performance.

When we read something from the database, it is read from disk, and disk reads are slow compared to reads from primary memory. Caching drastically reduces the work of the underlying database by avoiding frequent disk reads.

Thus caching loads frequently used data into memory and serves it from there when requested instead of going to disk, which makes things much faster and improves overall latency and throughput.

Latency is the time taken to finish a task.
Throughput is the number of tasks finished in a given time.

If you take 1 minute to read a page, your latency is 1 minute and your throughput is 60 PPH (pages per hour).



Aspects of Caching

Caching can be done at many layers: client side, server side, or at a dedicated (separate) cache layer.

There are options available for caching at the following layers:

  • Database
  • Web server
  • Application server
  • Browser (or client side)
  • Distributed caching
  • CDN (Content Delivery Network)

We need to consider different aspects while choosing the right caching mechanism: for example, the quantity of data to be loaded, which data to put in the cache, the refresh frequency of the data, expiry of unused data (eviction), and ensuring the cache does not face the cold-start or celebrity problem (if distributed).

Following are different strategies we can use to update (refresh) the data stored in the cache; a cache-aside sketch follows the list.

  • Cache Aside
  • Write Through
  • Write Behind
  • Refresh Ahead
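A minimal cache-aside sketch: the application looks in the cache first and falls back to the database on a miss. get_user_from_db and the TTL value are hypothetical stand-ins for a real database call and cache policy.

```python
import time

cache = {}          # key -> (value, expiry_timestamp)
TTL_SECONDS = 60    # assumed time-to-live

def get_user_from_db(user_id):
    # Placeholder for a real (slow) database read.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                    # cache hit: serve from memory
    value = get_user_from_db(user_id)      # cache miss: read from the database
    cache[user_id] = (value, time.time() + TTL_SECONDS)
    return value
```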

Also, cache entries should be evicted if not used for a certain period of time, for optimal storage use. One of the following eviction strategies can be used (an LRU sketch follows the list).
  • LRU (Least Recently Used)
  • LFU (Least Frequently Used)
  • FIFO (First In First Out)
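A minimal LRU eviction sketch using Python's OrderedDict; the capacity of 2 is chosen only to make the eviction visible.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")          # "a" is now most recently used
cache.put("c", 3)       # evicts "b"
print(cache.get("b"))   # None
```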

Tuesday, March 4, 2025

System Design Series: Database

Introduction

Databases are essential to a system; we have to choose the type and setup of the database as per our requirements. Databases are majorly characterised into two categories: SQL (Structured Query Language) and NoSQL (Not Only SQL). We will understand the difference between the two and the options for making a database highly available and scalable.


Failover Strategies

Failover strategies can be applied as per the requirement to overcome data loss and interruptions. 

Cold Standby


Take periodic backups of the database and use the backup to restore it in case of failure. This may lead to significant data loss and downtime, depending on the backup frequency and the time required to restore the database.



Warm Standby


Replicate the database and switch to the replica in case the actual database fails. In this strategy, the delta between the last replication and the current database state will be lost if the database fails. Downtime is less compared to cold standby.

Hot Standby


The application server (or database client) makes changes in both the actual and the standby instance, so that if the actual database server fails, it can simply switch to the standby instance quickly. There is no data loss, only minor downtime during the switch.

Scaling Strategies 

Master-Slave Replication 


In this setup the master database allows both reads and writes, while the slave allows only reads. The master replicates data to the slave on write; the idea is to keep the slave as up to date with the master as possible. Also, the majority of systems are read heavy; such reads can be served by the slave, reducing the overall load on the master. A read/write routing sketch follows.
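A minimal read/write-splitting sketch for this setup; the connection objects and their execute() method are hypothetical placeholders for a real database driver.

```python
import random

class RoutingDB:
    def __init__(self, master, replicas):
        self.master = master          # accepts reads and writes
        self.replicas = replicas      # read-only copies of the master

    def execute(self, sql, params=()):
        # Route writes to the master, spread reads across the replicas.
        if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master.execute(sql, params)
        return random.choice(self.replicas).execute(sql, params)
```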

Master-Master Replication 


Both database instances are masters; the client can read and write on either instance, and ideally the load is distributed. Replication takes place between the masters, which is a bit tricky compared to master-slave since both instances allow writes. However, if one instance fails, the other can quickly take its place.

Federation 


Database instances are created based on functionality: customer-related data is stored in a separate customer schema, product-related data in a product schema, and so forth. This allows individual databases to scale independently as per need. If you have complex queries and joins between such entities are frequent, this setup can make them more complex.

Sharding 


Data is distributed across nodes, and each node maintains a subset of the data partitioned on some criterion. For example, data for usernames starting with A-M is on one node and N-Z on another. This setup can struggle with the "celebrity problem", where one node gets most of the traffic while the others sit idle. A hash-based sharding sketch follows.
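A minimal hash-based sharding sketch; the node names are illustrative. Hashing spreads keys more evenly than alphabetical ranges, though a single hot key can still overload its shard.

```python
import hashlib

SHARDS = ["db-node-0", "db-node-1", "db-node-2"]

def shard_for(key: str) -> str:
    # Stable hash so the same key always maps to the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice"), shard_for("bob"), shard_for("carol"))
```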


SQL vs. NoSQL

SQL and NoSQL databases each have their place; based on the needs of the system we can choose either. SQL databases are relational databases; NoSQL databases can be key-value stores, document stores, wide-column stores, or graph databases. Depending on the requirement we can use SQL or NoSQL, or even a combination of both. Following are the fundamental differences between the two that we should be aware of.

SQL                                             NoSQL
Structured Query Language                       Not Only SQL
Relational database                             Non-relational database
Schema is predefined and strict                 Schema is dynamic and flexible
Vertically scalable                             Horizontally scalable
MySQL, Postgres, Oracle etc.                    Redis, MongoDB, Cassandra etc.
Better for structured data where complex        Better for large-scale unstructured
queries with joins and aggregation are needed   or semi-structured data
ACID compliance                                 BASE (Basically Available, Soft state,
                                                Eventual consistency)


Tuesday, February 25, 2025

System Design Series: Resilience

Introduction

Resilience can be defined as the ability of a system to recover from failures, disruptions, or any kind of event that impacts its proper functioning.

In the real world, any system is likely to fail once in a while due to various known or unknown reasons. We need to see what we can do to help our system get through such situations, how we can avoid harm to the system, and how we can get back to business quickly.

Approach

Resilience can be achieved using the different approaches listed below.

Fault Tolerance
The system continues to work even if any software or hardware fails (fully or partially).
e.g., in load balancing, if one server fails, another server is created or the load is distributed among the remaining instances.

Redundancy
Redundancy or duplication ensures backup and also helps in recovering from failure quickly.
e.g., create another instance using database replication and use it if the original instance fails.

Monitoring
Continuous tracking and monitoring helps in early detection of problems.
e.g., continuously track the system to ensure expected health and take automatic actions or alert if the given health criteria are not met.

Disaster Recovery
Restore the system after any disaster.
e.g., regular backups to avoid data loss or keep it to a minimum.

Self Healing
The system can automatically correct itself when issues occur.

e.g., AWS auto scaling: if one instance fails, a new instance is automatically created and traffic from the unhealthy one is transferred to it.

Tuesday, February 18, 2025

System Design Series: Availability and Consistency

Introduction

We wish to build a system that is always working and giving accurate responses. However, in the world of distributed systems this is not achievable; we have to make trade-offs based on the needs of the user and the business.

For a banking system you must prioritise data consistency over availability: it's okay if it's unavailable for a few minutes, but it's not at all acceptable for it to give inaccurate results. On the contrary, TikTok must be highly available, otherwise people may lose interest, but it's okay if users don't see a newly uploaded video immediately.

Using the CAP theorem, we will understand why we have to trade off and why we can't have both.

CAP Theorem

CAP stands for,
  • Consistency: Every read receives the latest write.
  • Availability: Every request receives a non-error response, not necessarily with the latest data.
  • Partition Tolerance: The system continues to work even if communication between nodes fails.
The CAP theorem states that we can achieve only two of these in a distributed system. Based on our requirements, we need to trade off one of them to achieve the desired results.

We can have any of these systems according to the CAP theorem:
  • CP: Consistent and Partition Tolerant.
  • AP: Available and Partition Tolerant.
  • CA: Consistent and Available (not possible in a distributed environment).
In a distributed environment, it's technically impossible to be available all the time and return the latest data on every read: if a network partition (communication failure) happens, the system has two choices, either fail (return an error) or return stale data, which breaks consistency.
There are different patterns for consistency and availability, which are listed below.

Consistency Patterns

  • Weak Consistency: A write may or may not be seen by subsequent reads. (e.g. video calls)
  • Eventual Consistency: A write will soon be visible to reads. (e.g. email)
  • Strong Consistency: A write is immediately visible to reads. (e.g. file systems)

Availability Patterns

  • Replication: Replicate data to an additional component using a master-master or master-slave setup.
  • Fail-over: A standby instance takes over if the original instance fails, using an active-active or active-passive setup.

Availability of a system is expressed as a percentage. For example, a system that is 99.9% available is said to have three 9's of availability. If a system is 90% available, that roughly means it will be unavailable for ~36.5 days in a year, ~3 days in a month, and ~2.4 hours in a day. The sketch below computes these downtime budgets.
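A small calculation of the downtime budget implied by each availability level; the list of levels is illustrative.

```python
# Downtime per year = (1 - availability) * hours in a year.
LEVELS = [0.90, 0.99, 0.999, 0.9999]
HOURS_PER_YEAR = 365 * 24

for a in LEVELS:
    downtime_hours = (1 - a) * HOURS_PER_YEAR
    print(f"{a:.2%} available -> ~{downtime_hours:.1f} hours of downtime/year")

# 90.00% available -> ~876.0 hours of downtime/year  (~36.5 days)
# 99.90% available -> ~8.8 hours of downtime/year    (three 9's)
```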


Tuesday, February 11, 2025

System Design Series: Scalability

Introduction

Scalability means a system's capability to handle more work.

Consider an example of a website where one server handles 1,000 users. For some reason more and more users are opening the website, and suddenly the number of users increases to 50,000. The system is not capable of handling more than 50,000 users, so if the load increases further, the system will fail to serve users and may crash.

We can solve this problem primarily in two ways: we can scale the system either vertically or horizontally.

Vertical Scaling

Increasing the capacity of the server to handle more work is considered vertical scaling.

Pros

  • Easy to maintain as there will be fewer components in the system.

Cons

  • The capacity increase comes with additional cost, which grows rapidly for large-scale systems.
  • There is an upper limit up to which you can scale.
  • Single point of failure.

Horizontal Scaling

Increasing the number of servers to handle increasing work is considered horizontal scaling. With more servers we can (evenly) distribute the load amongst them using a load balancer.

Pros

  • Can solve single point of failure problem.
  • Highly scalable and available.
  • Comparatively cheaper, as a few small servers cost less than one high-end server.

Cons

  • Added complexity in deployment and maintenance.

