Akash's Blog


Tuesday, March 11, 2025

System Design Series: Caching

Introduction


The idea of caching is to store frequently used data in high-speed temporary storage to reduce load on the database and improve overall performance.

Caching drastically reduces the work of the underlying database. When we write to or read from a database, the data is read from disk, and frequent disk reads are slow compared to reading from primary memory.

Caching therefore loads frequently used data into memory and serves it from there when requested, instead of looking at the disk. This makes things much faster, and overall latency and throughput improve.

Latency is time taken to finish task.
Throughput is number of tasks finished in given time.

If you take 1 minute to read a page, your latency is 1 minute and your throughput is 60 PPH (pages per hour).
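The page-reading example can be written as a tiny calculation, just to make the relationship between the two metrics explicit:

```python
# Latency vs. throughput for the page-reading example:
# finishing one page takes 1 minute.
latency_minutes = 1                      # time to finish one task (a page)
throughput_pph = 60 / latency_minutes    # tasks finished per hour
```

Halve the latency and the throughput doubles; the two always move together for a single sequential worker.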



Aspects of Caching

Caching can be done at many layers: client side, server side, or at a dedicated (separate) cache layer.

Caching options are available at the following layers:

  • Database
  • Web server
  • Application server
  • Browser (or Client Side)
  • Distributed Caching
  • CDN (Content Delivery Network)

We need to consider different aspects while choosing the right caching mechanism: how much data to load, which data to cache, how frequently the data should be refreshed, when unused data should expire (eviction), and how to avoid a cold start or the celebrity problem (in a distributed cache).

Following are different strategies we can use to update (refresh) the data stored in the cache. The image shows the placement of the cache, and looking closely at the arrows gives an idea of the data flow.

  • Cache Aside
  • Write Through
  • Write Behind
  • Refresh Ahead
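As an illustration, cache aside (the first strategy above) can be sketched in a few lines. The in-memory dict and `fake_db` below are stand-ins for a real cache (e.g. Redis) and a real database; they are illustrative only:

```python
# Cache-aside (lazy loading) sketch: the application manages the cache
# itself. On a read, check the cache first; on a miss, load from the
# database and populate the cache for future reads.

cache = {}
fake_db = {"user:1": {"name": "Akash"}}  # stand-in for the database

def db_read(key):
    # Slow path: in a real system this would hit the disk.
    return fake_db.get(key)

def get(key):
    if key in cache:            # cache hit: served from memory
        return cache[key]
    value = db_read(key)        # cache miss: fall back to the database
    if value is not None:
        cache[key] = value      # populate the cache for future reads
    return value

get("user:1")   # miss: reads from the database and fills the cache
get("user:1")   # hit: served from the cache, no database read
```

Write through and write behind differ only in the write path: write through updates the cache and database together, while write behind updates the cache first and flushes to the database asynchronously.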


Also, cached entries should be evicted if not used for a certain period of time, for optimal storage use. One of the following eviction strategies can be used:
  • LRU (Least Recently Used)
  • LFU (Least Frequently Used)
  • FIFO (First In First Out)
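A minimal LRU eviction sketch, using Python's `OrderedDict` to track recency (a toy model of what real caches do internally):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used key once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # insertion order tracks recency

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

lru = LRUCache(2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")        # "a" becomes most recently used
lru.put("c", 3)     # capacity exceeded: "b" is evicted
```

LFU would evict by access count instead of recency, and FIFO would always evict the oldest inserted key regardless of use.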

Sunday, March 9, 2025

I passed AWS Certified Data Engineer – Associate Exam! 🎉


Introduction

I recently passed the AWS Certified Data Engineer – Associate exam and wanted to share my experience, hoping it might help others in the future.  

Although this is an associate-level exam, it thoroughly tests your understanding of data engineering concepts and the relevant AWS services through realistic scenarios.


Preparation

I used the Udemy course AWS Certified Data Engineer Associate 2025 - Hands On! by Stéphane Maarek & Frank Kane, which helped me understand the concepts in detail. It includes hands-on sessions for multiple services, which you can follow along with. For some services, like Redshift, Glue, EMR, and Kinesis, I had to explore things practically on my own.

However, going through the course wasn't enough, so I decided to take some mock exams.


It is more about understanding the nature and application of the services. Multiple services can be used for the same solution, but the questions check what is best in terms of cost, operational overhead, time, and efficiency.

For revision I used the slides from the same Udemy course.

Exam Day

The day of the exam started off frustrating. Due to a slow internet connection, I was unable to check in, and I feared I might have to reschedule and repay for the exam. Even after my internet speed improved, I was still not allowed to check in, leaving me completely unsure of what was going on.


I tried calling Pearson support but had no luck. I attempted to reschedule, but initially, I didn’t see that option. After 10–15 minutes, it finally appeared, and I was able to reschedule. Phew...  

The exam was quite challenging, packed with tricky and confusing scenarios. I struggled to stay focused on the lengthy questions. The exam consisted of 65 questions with a total duration of 130 minutes—just 2 minutes per question. (I later realized that non-native English speakers get an additional 30 minutes, but it must be claimed before scheduling the exam.)  

Some questions were straightforward, and I felt confident answering them, but the majority required a solid understanding of the relevant AWS services.

Conclusion

The exam evaluates your clarity on AWS services and their use cases. It took me approximately 3.5 weeks to prepare, including completing the course and taking mock tests. That said, my prior experience with AWS, along with having previously passed the AWS Certified Solutions Architect – Associate exam, significantly helped streamline my preparation and reduce the required study time.

Suggestion

If you're planning to take this exam, prioritize understanding how services apply in terms of time, cost, and operational overhead, as these were the key themes for me. Rather than just memorizing facts, focus on grasping real-world use cases.

Tuesday, March 4, 2025

System Design Series: Database

Introduction

Databases are essential in any system, and we have to choose the type and setup of the database as per our requirements. Databases are broadly characterised in two categories: SQL (Structured Query Language) and NoSQL (Not Only SQL). We will understand the difference between these two and the options for making a database highly available and scalable.


Failover Strategies

Failover strategies can be applied as per the requirement, to minimise data loss and interruptions.

Cold Standby


Take periodic backups of the database and use the backup to restore the database in case of failure. This may lead to significant data loss and downtime, depending on the frequency of backups and the time required to restore the database.



Warm Standby


Replicate the database and switch to the replica in case the actual database fails. In this strategy, the delta between the last replication and the current database state will be lost if the database fails. This gives less downtime compared to cold standby.

Hot Standby


The application server (or database client) makes changes in both the actual and the standby instance, so that if the actual database server fails, the server can quickly switch to the standby instance. There is no data loss, but there is minor downtime while switching databases.
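A toy sketch of the hot standby write path, with plain dicts standing in for the two database instances (illustrative only):

```python
# Hot standby sketch: the client writes to both the primary and the
# standby, so the standby is always current if the primary fails.
primary, standby = {}, {}

def write(key, value):
    primary[key] = value
    standby[key] = value    # duplicate every write to the standby

def read(key, primary_alive=True):
    # On primary failure the client simply switches to the standby.
    return (primary if primary_alive else standby).get(key)

write("balance", 100)
read("balance", primary_alive=False)   # served from the standby
```

The cost of this approach is that every write pays for two round trips, and the client must handle the case where one of the two writes fails mid-way.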


Scaling Strategies 

Master-Slave Replication 


In this setup the master database allows both reads and writes, while the slave allows only reads. The master replicates data to the slave on write; the idea is to keep the slave as up to date as possible with the master. Also, the majority of systems are read-heavy, so reads can be served by the slave and the overall load on the master is reduced.
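A minimal sketch of how an application might split traffic in a master-slave setup. The connection names are placeholder strings; real proxies such as ProxySQL do this routing transparently:

```python
# Read/write splitting: writes go to the master, reads are spread
# across the replicas round-robin.
import itertools

class ReadWriteRouter:
    def __init__(self, master, slaves):
        self.master = master
        self._slave_cycle = itertools.cycle(slaves)  # round-robin reads

    def route(self, query):
        # Naive classification: only SELECTs go to a replica.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._slave_cycle)
        return self.master

router = ReadWriteRouter("master-db", ["slave-1", "slave-2"])
router.route("SELECT * FROM users")    # -> a slave
router.route("INSERT INTO users ...")  # -> master-db
```

One caveat this sketch ignores: replication lag means a read routed to a slave immediately after a write may return stale data.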

Master-Master Replication 


Both database instances are masters; the client can read and write to either instance, and ideally the load is distributed. Replication takes place between the masters, which is a bit trickier compared to master-slave because both instances allow write operations. However, if one instance fails, the other can quickly take over.

Federation 


Database instances are created based on functionality: customer-related data is stored in a separate customer schema, product-related data in a product schema, and so on. This allows individual databases to scale independently as per need. However, if you have complex queries and joins between such entities are frequent, this setup may add complexity.

Sharding 


Data is distributed across nodes, and each node maintains a subset of the data, partitioned based on some criteria. For example, data for usernames starting with A-M is on one node and N-Z is on another. This setup can struggle with the "celebrity problem", where one node gets most of the traffic while the others sit idle.
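The two common ways of picking a shard can be sketched side by side. The shard names and the two-node A-M/N-Z split below are illustrative assumptions:

```python
# Range-based sharding (first letter decides the node) is simple but
# prone to the celebrity problem; hash-based sharding spreads keys
# more evenly across all shards.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]

def range_shard(username):
    # Range-based: usernames A-M on one node, N-Z on the other.
    return "shard-0" if username[0].upper() <= "M" else "shard-1"

def hash_shard(key):
    # Hash-based: a stable hash of the key picks the shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

range_shard("alice")   # -> shard-0
hash_shard("alice")    # deterministic, evenly distributed
```

Hash-based sharding trades away efficient range queries (adjacent keys land on different nodes) in exchange for a more even distribution.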


SQL vs. NoSQL

SQL and NoSQL databases each have their own place; based on the needs of the system, we can choose either of them. SQL databases are relational databases, while NoSQL databases can be key-value stores, document stores, wide-column stores, or graph databases. Depending on the requirement we can use SQL, NoSQL, or even a combination of both. Following are the fundamental differences between the two that we should be aware of.

SQL                                          NoSQL
Structured Query Language                    Not Only SQL
Relational database                          Non-relational database
Schema is predefined and strict              Schema is dynamic and flexible
Vertically scalable                          Horizontally scalable
MySQL, Postgres, Oracle, etc.                Redis, MongoDB, Cassandra, etc.
Better for structured data where             Better for large-scale unstructured
complex queries with joins and               or semi-structured data
aggregations are required
ACID compliance                              BASE (Basically Available, Soft
                                             state, Eventual consistency)


Tuesday, February 25, 2025

System Design Series: Resilience

Introduction

Resilience can be defined as the ability of a system to recover from failures, disruptions, or any kind of event that impacts its proper functioning.

In the real world, any system is likely to fail once in a while due to various known or unknown reasons. We need to see what we can do to help our system get over such situations, how we can avoid any harm to the system, and how we can get back to business quickly.

Approach

Resilience can be achieved using the different approaches listed below.

Fault Tolerance
The system continues to work even if software or hardware fails (fully or partially).
e.g. With load balancing, if one server fails, another server is created or the load is distributed among the remaining instances.

Redundancy
Redundancy or duplication ensures backup and also helps in recovering from failure quickly.
e.g. Create another instance using database replication and use it if original instance fails.

Monitoring
Continuous tracking and monitoring helps in early detection of problems.
e.g. Continuously track the system to ensure expected health and take automatic actions or alert if given health criteria is not met.

Disaster Recovery
Restore the system after a disaster.
e.g. Take regular backups to avoid data loss or keep it to a minimum.

Self Healing
The system can automatically correct itself if any kind of issue occurs.

e.g. AWS auto scaling: if one instance fails, a new instance is created automatically and the traffic of the unhealthy instance is transferred to the new one.
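A common building block behind fault tolerance and self-healing is retrying a flaky operation with exponential backoff instead of failing on the first error. A minimal sketch (the `flaky` function simulates a transient failure):

```python
# Retry with exponential backoff: each failed attempt waits twice as
# long as the previous one before trying again.
import time

def with_retries(operation, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                                # give up on the last attempt
            time.sleep(base_delay * (2 ** attempt))  # back off, then retry

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

with_retries(flaky)   # fails twice, succeeds on the third attempt
```

The backoff matters: retrying immediately in a tight loop can amplify the very outage you are recovering from.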






Tuesday, February 18, 2025

System Design Series: Availability and Consistency

Introduction

We wish to build a system that is always working and giving accurate responses. However, in the world of distributed systems this isn't achievable; we have to trade off based on the needs of the user and the business.

For a banking system you must prioritise data consistency over availability: it's okay if it's unavailable for a few minutes, but it's not at all acceptable if it gives inaccurate results. In contrast, TikTok must be highly available, otherwise people may lose interest, but it's okay if a user doesn't see a newly uploaded video immediately.

We will understand why we have to trade off, and why we can't have both, using the CAP theorem.

CAP Theorem

CAP stands for,
  • Consistency : Every read receives the latest write.
  • Availability : Every request receives a non-error response, though not necessarily the latest data.
  • Partition Tolerance : The system continues to work even if communication between nodes fails.
The CAP theorem states that we can guarantee only two of these in a distributed system. Based on our requirements, we need to trade off one of them to achieve the desired results.

We can have any of these systems according to the CAP theorem:
  • CP : Consistent and Partition Tolerant.
  • AP : Available and Partition Tolerant.
  • CA : Consistent and Available (not possible in a distributed environment).
In a distributed environment, it's technically impossible to be available all the time and return the latest data on every read. If a network partition (communication failure) happens, the system has two choices: either fail (or return an error), giving up availability, or return stale data, which breaks consistency.
There are different patterns for consistency and availability, which are listed below.

Consistency Patterns

  • Weak Consistency: Write may or may not be seen by reads. (e.g. Video Call)
  • Eventual Consistency: Write will be soon visible to reads.  (e.g. Email)
  • Strong Consistency: Write is immediately visible. (e.g. File System)

Availability Patterns

  • Replication: Replicate data to additional instances using a master-master or master-slave setup.
  • Fail-over: A standby instance takes over if the original instance fails, using an active-active or active-passive setup.

The availability of a system is expressed as a percentage. For example, a system that is 99.9% available is said to have "three nines" of availability. If a system is 90% available, it will be unavailable for roughly 36.5 days a year, about 3 days a month, and about 2.4 hours a day.
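The downtime numbers above follow from a one-line calculation:

```python
# Allowed downtime per year for a given availability percentage.
HOURS_PER_YEAR = 365 * 24   # 8760

def downtime_per_year_hours(availability_pct):
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

downtime_per_year_hours(90)     # 876 hours, i.e. ~36.5 days/year
downtime_per_year_hours(99.9)   # ~8.76 hours/year ("three nines")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why "five nines" (99.999%, about 5 minutes a year) is so expensive to achieve.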

