Akash's Blog


Tuesday, March 25, 2025

System Design Series: Asynchronism

Introduction

To achieve high performance, a system needs to execute multiple tasks in parallel. This increases overall throughput and makes the system highly responsive. Not every operation should be asynchronous, but there are many places where asynchronism is the better choice. For example, after sending an email you can immediately send another without waiting for the first one to be delivered; that is an asynchronous operation.
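
As a rough sketch of the idea in Python (send_email here is a hypothetical stand-in for a real mail client), submitting work to a thread pool lets us hand over the next email without waiting for the previous one to be delivered:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def send_email(recipient: str) -> str:
        """Hypothetical stand-in for a real mail client; pretend delivery is slow."""
        time.sleep(1)  # simulate network latency
        return f"delivered to {recipient}"

    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately; we do not wait for one email
        # to be delivered before handing over the next one.
        futures = [pool.submit(send_email, r) for r in ["a@x.com", "b@x.com", "c@x.com"]]
        for f in futures:
            print(f.result())  # collect results once all sends are in flight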

Message Queue

A message queue is used to achieve asynchronous communication between the components of a system. The producer sends a message to the queue, and at the other end the consumer receives it and processes it as its capacity allows.

We can think of it as a buffer where sent messages are kept until the receiver processes them. Messages are processed in FIFO (first in, first out) order by default, but the queue can also be configured to behave differently, for example priority-based processing.
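
Here is a minimal in-process sketch using Python's standard queue.Queue; a real system would use a broker such as RabbitMQ or SQS, but the producer, queue, and consumer roles are the same:

    import queue
    import threading

    q = queue.Queue()  # the buffer; FIFO by default (use queue.PriorityQueue for priority order)

    def producer():
        for i in range(5):
            q.put(f"message {i}")  # send and move on; no waiting for processing
        q.put(None)                # sentinel: tell the consumer we are done

    def consumer():
        while (msg := q.get()) is not None:
            print("processed", msg)  # messages come out in the order they went in

    threading.Thread(target=producer).start()
    consumer()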

Components

  • Producer (Sends the message)
  • Queue (Stores the message)
  • Consumer (Receives the message)

Architectures

  • Point to Point (one producer, one consumer)
  • Pub/Sub (one producer (publisher), multiple consumers (subscribers); sketched below)
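
A toy sketch of the pub/sub fan-out (real brokers implement this far more efficiently, but the idea is that every subscriber receives every published message):

    import queue

    class Topic:
        """Toy pub/sub topic: every subscriber receives every message."""
        def __init__(self):
            self.subscribers = []

        def subscribe(self) -> queue.Queue:
            q = queue.Queue()
            self.subscribers.append(q)
            return q

        def publish(self, msg):
            for q in self.subscribers:  # fan out a copy to each subscriber
                q.put(msg)

    topic = Topic()
    inbox_a, inbox_b = topic.subscribe(), topic.subscribe()
    topic.publish("order created")
    print(inbox_a.get(), "|", inbox_b.get())  # both subscribers see the message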

Examples

  • Kafka
  • SQS
  • RabbitMQ

Streaming

Streaming is the continuous transfer of data over the network in real time. A streaming pipeline is generally set up with a message broker and a stream-processing component. Audio, video, data, and events can all be streamed.
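
As a rough sketch of stream processing in Python (the sensor_stream producer is hypothetical), a processing step can consume an unbounded stream of events and continuously emit an aggregate:

    import random
    from typing import Iterator

    def sensor_stream() -> Iterator[float]:
        """Hypothetical producer: an IoT sensor emitting readings forever."""
        while True:
            yield random.uniform(18.0, 25.0)

    def rolling_average(stream: Iterator[float], window: int = 10) -> Iterator[float]:
        """Stream-processing step: a running aggregate over a sliding window."""
        buffer = []
        for reading in stream:
            buffer.append(reading)
            if len(buffer) > window:
                buffer.pop(0)
            yield sum(buffer) / len(buffer)

    # Consume a few events from the (unbounded) stream, then stop the demo.
    for i, avg in enumerate(rolling_average(sensor_stream())):
        print(f"rolling average: {avg:.2f}")
        if i == 4:
            break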

Examples

  • Producer: Application Logs, IoT Sensors, Financial Transactions
  • Message Broker: Apache Kafka, RabbitMQ, AWS Kinesis
  • Stream Processing: Apache Kafka Streams, AWS Kinesis, Spark Streaming
  • Consumer: Database, Data Lakes

Tuesday, March 11, 2025

System Design Series: Caching

Introduction

The idea of caching is to store frequently used data in high-speed temporary storage, reducing load on the database and improving overall performance.

Caching drastically reduces the work of the underlying database. Whenever we write to or read from the database, data is read from disk, and disk reads are slow compared to reads from primary memory, so we want to avoid them as much as possible.

Thus caching keeps frequently used data in memory and serves it from there when requested, instead of looking at the disk. This makes things much faster, and overall latency and throughput improve.

Latency is the time taken to finish a task.
Throughput is the number of tasks finished in a given time.

If you take 1 minute to read a page, your latency is 1 minute and your throughput is 60 PPH (pages per hour).

Aspects of Caching

Caching can be done at many layers: client side, server side, or at a dedicated (separate) cache layer.

Caching options are available at the following layers:

  • Database
  • Web server
  • Application server
  • Browser (or Client Side)
  • Distributed Caching
  • CDN (Content Delivery Network)

We need to consider different aspects while choosing the right caching mechanism: for example, how much data to load, which data to keep, how frequently the data must be refreshed, when unused data expires (eviction), and making sure the cache does not suffer from a cold start or the celebrity problem (if distributed).

The following strategies can be used to update (refresh) the data stored in the cache. They differ in where the cache sits between the application and the database, and in which direction data flows (a cache-aside sketch follows the list).

  • Cache Aside
  • Write Through
  • Write Behind
  • Refresh Ahead
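
As a sketch, cache-aside, the most common of these strategies, looks roughly like this in application code (db_read is a hypothetical stand-in for a database call):

    cache = {}

    def db_read(key: str) -> str:
        """Hypothetical slow database lookup."""
        return f"value-for-{key}"

    def get(key: str) -> str:
        if key in cache:          # cache hit: serve from memory
            return cache[key]
        value = db_read(key)      # cache miss: go to the database...
        cache[key] = value        # ...and populate the cache on the way back
        return value

    print(get("user:42"))  # miss, loads from the database
    print(get("user:42"))  # hit, served from the cache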

Also, for optimal storage use, cached entries should be evicted if they have not been used for a certain period of time. One of the following eviction strategies can be used (a minimal LRU sketch follows the list):
  • LRU (Least Recently Used)
  • LFU (Least Frequently Used)
  • FIFO (First In First Out)
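
A minimal LRU sketch built on Python's OrderedDict, assuming get/put is the whole interface:

    from collections import OrderedDict

    class LRUCache:
        """Minimal LRU cache: evicts the least recently used entry when full."""
        def __init__(self, capacity: int):
            self.capacity = capacity
            self.data = OrderedDict()

        def get(self, key):
            if key not in self.data:
                return None
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]

        def put(self, key, value):
            if key in self.data:
                self.data.move_to_end(key)
            self.data[key] = value
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict the least recently used

    cache = LRUCache(2)
    cache.put("a", 1); cache.put("b", 2)
    cache.get("a")         # "a" is now the most recently used
    cache.put("c", 3)      # evicts "b"
    print(cache.get("b"))  # None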

Sunday, March 9, 2025

I passed AWS Certified Data Engineer – Associate Exam! 🎉


Introduction

I recently passed the AWS Certified Data Engineer – Associate exam and wanted to share my experience, hoping it might help others in the future.  

Although this is an associate-level exam, it thoroughly tests your understanding of data engineering concepts and the relevant AWS services through realistic scenarios.


Preparation

I used the Udemy course AWS Certified Data Engineer Associate 2025 - Hands On! by Stéphane Maarek and Frank Kane, which helped me understand the concepts in detail. It includes hands-on sessions for multiple services, which you can follow along with. For some services, like Redshift, Glue, EMR, and Kinesis, I had to explore things practically on my own.

However, going through the course wasn't enough, so I decided to take some mock exams.


The exam is more about understanding the nature and application of the services. Multiple services can be used for the same solution, but the questions check what is best in terms of cost, operational overhead, time, and efficiency.

For revision I used the slides from the same Udemy course.

Exam Day

The day of the exam started off frustrating. Due to a slow internet connection, I was unable to check in, and I feared I might have to reschedule and repay for the exam. Even after my internet speed improved, I was still not allowed to check in, leaving me completely unsure of what was going on.


I tried calling Pearson support but had no luck. I attempted to reschedule, but initially, I didn’t see that option. After 10–15 minutes, it finally appeared, and I was able to reschedule. Phew...  

The exam was quite challenging, packed with tricky and confusing scenarios. I struggled to stay focused on the lengthy questions. The exam consisted of 65 questions with a total duration of 130 minutes—just 2 minutes per question. (I later realized that non-native English speakers get an additional 30 minutes, but it must be claimed before scheduling the exam.)  

Some questions were straightforward, and I felt confident answering them, but the majority required a solid understanding of the relevant AWS services.

Conclusion

The exam evaluates your clarity on AWS services and their use cases. It took me approximately 3.5 weeks to prepare, including completing the course and taking mock tests. That said, my prior experience with AWS, along with having previously passed the AWS Certified Solutions Architect – Associate exam, significantly helped streamline my preparation and reduce the required study time.

Suggestion

If you're planning to take this exam, prioritize understanding how services apply in terms of time, cost, and operational overhead, as these were the key themes for me. Rather than just memorizing facts, focus on grasping real-world use cases.

Tuesday, March 4, 2025

System Design Series: Database

Introduction

Databases are essential in any system, and we have to choose the type and setup of the database as per our requirements. Databases are broadly characterized into two categories: SQL (Structured Query Language) and NoSQL (Not Only SQL). We will look at the differences between the two, and at options for making a database highly available and scalable.


Failover Strategies

Failover strategies can be applied as per the requirements to minimize data loss and interruptions.

Cold Standby


Take periodic backups of the database and use a backup to restore the database in case of failure. This may lead to significant data loss and downtime, depending on the backup frequency and the time required to restore the database.



Warm Standby


Replicate the database and switch to the replica in case the actual database fails. With this strategy, the delta between the last replication and the current database state will be lost if the database fails. Downtime is lower compared to cold standby.

Hot Standby


The application server (or whichever client of the database) makes changes in both the actual and the standby instance, so that if the actual database server fails, the application can simply switch to the standby instance quickly. There is no data loss, only minor downtime while switching databases.
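
A rough sketch of the hot standby idea, with plain dicts standing in for the two database connections:

    class DualWriter:
        """Sketch of hot standby: every write goes to both instances,
        reads go to the primary until it fails."""
        def __init__(self, primary, standby):
            self.primary, self.standby = primary, standby

        def write(self, key, value):
            self.primary[key] = value  # both copies stay in sync,
            self.standby[key] = value  # so a failover loses no data

        def read(self, key):
            try:
                return self.primary[key]
            except Exception:
                return self.standby[key]  # switch to the standby on failure

    store = DualWriter(primary={}, standby={})
    store.write("order:1", "paid")
    print(store.read("order:1"))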

Scaling Strategies 

Master-Slave Replication 


In this setup, the master database allows both reads and writes, while the slave allows only reads. The master replicates data to the slave on every write; the idea is to keep the slave as up to date with the master as possible. Since the majority of systems are read-heavy, such reads can be served by the slave, reducing the overall load on the master.
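
A rough sketch of read/write splitting on top of this setup, with dicts standing in for the database instances and replication simplified to a synchronous copy:

    import random

    class ReplicatedDB:
        """Sketch of read/write splitting: writes hit the master,
        reads are spread across read-only slaves."""
        def __init__(self, master, slaves):
            self.master, self.slaves = master, slaves

        def write(self, key, value):
            self.master[key] = value
            for slave in self.slaves:  # stand-in for (usually async) replication
                slave[key] = value

        def read(self, key):
            return random.choice(self.slaves)[key]  # offload reads from the master

    db = ReplicatedDB(master={}, slaves=[{}, {}])
    db.write("user:1", "Akash")
    print(db.read("user:1"))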

Master-Master Replication 


Both database instances are masters: the client can read and write on either instance, and ideally the load is distributed between them. Replication takes place between the masters, which is a bit trickier than master-slave since both instances allow write operations. However, if one instance fails, the other can quickly take over.

Federation 


Database instances are created based on functionality: customer-related data is stored in a separate customer schema, product-related data in a product schema, and so forth. This allows the individual databases to scale independently as needed. If you have complex queries and joins between such entities are frequent, this setup may make them more complex.
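
A sketch of how an application might route queries in a federated setup; the connection strings here are made up for illustration:

    # Hypothetical connection strings; one database per functional area.
    FEDERATED_DBS = {
        "customer": "postgres://customers-db/customers",
        "product":  "postgres://products-db/products",
        "order":    "postgres://orders-db/orders",
    }

    def db_for(entity: str) -> str:
        """Route each entity type to its own database."""
        return FEDERATED_DBS[entity]

    print(db_for("customer"))  # queries about customers go to the customer DB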

Sharding 


Data is distributed across nodes, with each node maintaining a subset of the data partitioned on some criterion. For example, data for usernames starting with A-M lives on one node and N-Z on another. This setup can struggle with the "celebrity problem", where one node gets most of the traffic while the others sit idle.
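
A sketch of range-based shard routing matching the example above; note that hashing the key instead (e.g. hash(key) % num_shards) distributes hot keys more evenly and is a common way to soften the celebrity problem:

    BUCKETS = [("A", "M"), ("N", "Z")]  # contiguous, non-overlapping key ranges

    def shard_for(username: str) -> int:
        """Range-based sharding: route by the first letter of the username."""
        first = username[0].upper()
        for i, (lo, hi) in enumerate(BUCKETS):
            if lo <= first <= hi:
                return i
        raise ValueError(f"no shard covers {username!r}")

    print(shard_for("alice"))  # shard 0 (A-M)
    print(shard_for("zed"))    # shard 1 (N-Z)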


SQL vs. NoSQL

SQL and NoSQL databases each have their place; based on the needs of the system we can choose either. SQL databases are relational databases, while NoSQL databases can be key-value stores, document stores, wide-column stores, or graph databases. Depending on the requirements we can use SQL, NoSQL, or even a combination of both. The following are the fundamental differences between the two that we should be aware of.

SQL                                           NoSQL
-------------------------------------------   -------------------------------------------
Structured Query Language                     Not Only SQL
Relational database                           Non-relational database
Schema is predefined and strict               Schema is dynamic and flexible
Vertically scalable                           Horizontally scalable
MySQL, Postgres, Oracle, etc.                 Redis, MongoDB, Cassandra, etc.
Better for structured data where complex      Better for large-scale unstructured
queries with joins and aggregation are        or semi-structured data
required
ACID compliance                               BASE (Basically Available, Soft state,
                                              Eventual consistency)

