1. Backend Scaling and Performance Engineering Part-1
Scaling and performance: two very widely used buzzwords, and at the same time equally important concepts when we talk about systems, backends, and infrastructure. It is also a topic that spills into a lot of territories and domains; scaling and performance mean one thing when we are talking about browsers and front ends, and something quite different when we are discussing infrastructure. In this video, of course, we are going to scope it down to the concepts that are relevant in the context of backend engineering. We will start from the very beginning, from the primary question of what we even mean when we talk about performance, and we will build up from there layer by layer, developing an intuition for how systems behave under load, where bottlenecks actually hide, and how to think clearly when everything seems to be falling apart because your system is under load. I am hoping that by the end of this video you will not just learn a few scaling techniques; you will understand how to think about scaling and performance in a way that applies whether we are talking about backends or any other system, so your knowledge should be applicable universally in any system you encounter in the future. So with that, let's start our discussion about performance. Let's ask a question first: what do we actually mean when we say a system is fast? In the context of modern systems, modern websites and web apps, we can consider a system to be
fast when a user comes to our web app and interacts with it. Let's say they click a button. The moment they click, a series of events starts executing. The browser they are interacting with, let's say Chrome, sends a request across the internet to our server, the server which caters to this particular front end. Our server receives that request and processes it. Most likely it interacts with our database, either to store some information or to fetch something the user needs. And if needed, it might also call some other API; say it needs to send an email, so it calls the API of a service like Resend. After doing all this, it eventually has to send a response, typically a JSON response, and that response travels back over the internet to your browser. Your browser takes that JSON response, does some JSON parsing and JavaScript processing, and shows it on the screen; let's say it renders a couple of cards. Now, some amount of time passed between the user clicking the button and the cards finally showing up on the screen. That whole interaction, the total time from click to render, is what we call latency. That is the first concept we are going to introduce, and latency is one of the most fundamental concepts in performance. When we talk about performance metrics, metric in the sense of how to measure performance, there should be some measurable unit where we can define performance mathematically, in numbers, instead of just talking in figurative terms like "fast." Latency is that unit, and latency is actually the phenomenon that users feel: when someone says your application is slow, they are talking about latency, whether they know the word or not. But the next thing to understand is that latency is not a single number. It is not as straightforward as saying something like "500 milliseconds." Latency
varies from request to request. One request might complete in, let's say, 50 milliseconds; another might take around 200 milliseconds. And why is there a difference between the two? Because in the real world, where requests travel through the internet, nothing is as predictable as it is on a whiteboard, where everything is measurable and predictable. It is never that simple. Maybe the 50-millisecond request was able to hit some kind of cache, either a CDN (we'll talk about what CDNs are) or an in-memory cache like Redis. A cache hit means the result of that request was already present in a store that is much faster, and often much closer to the user, than a traditional database. The 200-millisecond request, meanwhile, had to travel all the way to our server and run a database query; that is why it took around 200 milliseconds. Or maybe the cache is not the reason at all: maybe the 50-millisecond request reached our server when the server was idle, not really doing anything, while the 200-millisecond request arrived when the server was already concurrently processing 50 other requests at the same time. It can be any of these reasons, since we are talking about real-world traffic. So suppose we take the average latency of all these requests. In this example the first one was 50 and the second was 200, so the average would be 125 milliseconds. If you think about it, does this
number say anything meaningful to you? Now, averages are a very useful statistic when we are trying to understand systems in other contexts. But when we are talking about performance in particular, averages can be very misleading, and in my opinion pretty much useless, because the variations are what matter in this context, and an average does nothing to express the variation in request latencies. So let's consider a scenario. Say you measured a thousand requests and the average latency you got was around 100 milliseconds. Hiding inside that 100-millisecond average is a very misleading outlier: 99% of the requests completed in under 50 milliseconds, while the other 1% took around 5 seconds. And that 1% represents real users in your system who are having a terrible experience in your web app. Take the same math and scale it up. Say your backend serves 1 million requests per day. Out of those 1 million requests, 10,000 had to wait 5 seconds, which means that after clicking the button or landing on your screen, those users stared at a loader for 5 seconds before any data showed up. Now, if you calculate the average, it will tell you that your system is performing very well, a few hundred milliseconds of latency, but you will never understand the frustration. You'll never get to realize your system's
limits when it comes to the 1% of users who are experiencing an alarming amount of latency. And this is exactly why, when we talk about performance, instead of talking about averages, what we actually talk about is percentiles. This is something I have mentioned in my previous videos: before understanding the hidden mechanics of a system, before understanding how a concept works and how to build it, you have to understand the lingo, the jargon, so that you can follow the different opinions spread across that ecosystem. For the same reason, since we are talking about performance, you have to understand the terms people use when they measure performance and discuss scaling, and percentile is the most important one here. So, the 50th percentile, also known as P50: if your P50 latency is, say, 400 milliseconds, it means 50% of your requests complete within 400 milliseconds, and the other 50% take longer than that. The same way, the 90th percentile, P90: if your P90 is around 900 milliseconds, it means 90% of your requests complete within 900 milliseconds, and the slowest 10% of your users experience more than 900 milliseconds of latency. Another important term is P99: if your P99 is 2 seconds, it means 99% of your users get a latency below 2 seconds, while the slowest 1% experience 2 seconds or more. The intuitive way to read these numbers is to subtract the percentile from 100: for P99, 100 minus 99 is 1, so 1% of your users wait at least 2 seconds; for P90, 100 minus 90 is 10, so 10% of your users wait more than 900 milliseconds. These three numbers are very important. Every time you are discussing latency, performance, optimization, or scaling, P50, P90, and P99 will keep coming up. So if we represent this as
a graph, something like this, it's not a very good drawing, but let's say this is your P99, this is somewhere around P90, this is somewhere around P50, and so on. These requests experience the maximum amount of latency, these a little less, and the remaining users experience very little latency. In practice, backend engineers mostly focus heavily on P99 and P95 latencies, and the reason is not just caring about unhappy users, though that is of course part of it. The requests that end up here, the ones with the maximum latency, are usually where our most complex business logic lives, where our most complex database queries run, and where our most complex service-coordination logic sits: calling external services, sending emails, waiting for webhook calls, and so on. All these expensive operations most probably happen in the workflows that land at P99 and P95 on the latency chart. And because of that, the users experiencing P99 or P95 latencies often represent our most valuable customers, the customers who are paying us a lot and whose workflows should matter to us more. For example, a user who is
making a payment or making a purchase on your platform will generate more complex database queries and more complex business logic than a user who is just browsing through your platform. That is why caring about P99 and P95 latency is important. Okay. Now, that's all you need to know about latency. We still have not learned any new scaling technique, but you are getting more and more comfortable with the language of performance, with all the technical jargon of performance and scaling. Another metric that we often talk about when we talk about scaling is throughput. Latency tells us how long an individual request takes, from origin to response. Throughput tells us how many requests our system can handle in a given time period, and it is typically measured in requests per second or requests per minute. The reason this is an important metric is that your system can show a very impressive latency at low load. For example, when handling 10 requests per second, your latency might be very low, something like 150 milliseconds. But the moment the request rate goes up, to something like 1,000 requests per second, your latency can suddenly jump from 150 milliseconds to 2 seconds. And that is why understanding throughput along with latency is just as important. It helps you answer very practical, real-world questions. Let's say your system is an e-commerce platform, or a learning management system, or you sell courses, whatever system you have: can it handle the traffic spike of Black Friday, that one big day when your traffic jumps to hundreds of thousands or millions of requests? Can our systems survive the load if we start an email campaign, or if we get featured on a popular podcast, website, or blog? How many concurrent users can we support? How many requests per second before we start needing more resources? All of these questions can be answered only when you understand throughput together with latency. So one thing is obviously clear: latency and
throughput are connected to each other, but not in a way that is immediately intuitive, because as throughput increases, latency also starts increasing. It happens very slowly at first, but after a point it starts climbing dramatically. This relationship between throughput and latency is a very interesting concept in performance engineering, and of course we'll talk about it more in the next sections of this video. Now let us build some intuition about performance and scaling with an analogy: an ice cream shop. There is an ice cream shop, and someone at the counter wearing a cap is giving out ice creams. It is a Sunday evening, and when you walk into the shop, it is completely empty. You go to the counter, ask for an ice cream, pay the bill, and the ice cream is handed to you right away; the only wait is the couple of minutes it takes to prepare, and you leave very happy. The point of this example: you made a request, and since there was no one else at the shop, your request was served with no queueing at all. This is the situation of low utilization with low latency. Low utilization because the shop was empty; the worker, the owner, whoever was giving out the ice cream, was not really working very hard. And low latency because your request was served as fast as the work itself allows. Now imagine it is Tuesday lunchtime and you go into the same shop. The same worker in the same cap is giving out ice creams, but this time you are not alone. It is a busy lunch hour on a weekday, and everyone wants an ice cream.
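Before the rush plays out, the contrast between these two visits can be put into rough numbers. This is a toy Python sketch: the 2-minutes-per-ice-cream service time comes from the story, while the queue length of ten is a made-up illustration.

```python
# Toy model of the ice cream shop: the worker's service rate never changes;
# only the number of people already in line does.
SERVICE_TIME_MIN = 2  # minutes to prepare one ice cream, as in the story

def time_until_served(people_ahead: int) -> int:
    """Minutes until you walk out: everyone ahead of you is served, then you."""
    return (people_ahead + 1) * SERVICE_TIME_MIN

print(time_until_served(0))   # Sunday evening, empty shop -> 2
print(time_until_served(10))  # weekday lunch rush, 10 people ahead -> 22
```

With nobody ahead, you wait only the 2 minutes of preparation; with ten people ahead, the identical worker leaves you waiting 22 minutes. The service rate never changed, only the queue did.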
You are somewhere in the line, the one in blue with the grumpy face; that's you, standing there, waiting for your turn. In this case, even if the ice cream shop worker works at exactly the same speed, preparing each ice cream in the same 2 minutes as before, it takes a long time for you, because you are waiting for the others to get their ice creams first. The concept I want you to pay attention to is the worker's work rate: it has not dropped at all. Previously they took 2 minutes to make a single ice cream, and right now they are working at exactly the same pace. But for you, the perceived wait time, the actual wait time, has increased significantly, because you have to wait for the others first. Now, back to the actually relevant stuff, backend engineering: your backend systems, your servers, behave in exactly the same way. When the utilization of resources (CPU, memory, whatever) is low, a request comes in and the response is sent back almost immediately; the CPU grabs each request, processes it, and returns the response right away. But as more requests come in, each request waits for the requests ahead of it to complete before it gets its chance, and they all start forming a queue-like structure. The higher the utilization, the longer the queue, and the longer each request waits. So the metric I wanted to get to is utilization, which in formal technical terms is the percentage of your system's capacity that is currently in use. At 0% utilization your system is idle: it is not doing anything, not making use of any of its resources. At 100% utilization your system is maxed out: it cannot take on anything else, and it is at the brink of collapse. Now, there is an interesting relationship I want to show here between utilization and latency. Say utilization is on the x-axis and latency on the y-axis. Intuitively, we expect a linear relationship from our system: the more utilization we have, the more latency, growing in a straight line. But that is not what happens as we approach 100% utilization. The curve does not stay linear. As utilization grows from low values toward 100%, latency stops growing gently and starts shooting up steeply, far faster than linear. This is a very counterintuitive relationship between latency and utilization, and a very important one to wrap our heads around for performance and scaling. Keep it in mind; we'll come back to it and we'll
connect our concepts back to it eventually. So, again using a simplified analogy to make it more digestible: let's say we have a highway, our national highway, and these are the cars; I'll draw them in blue. When we say our highway is at 50% capacity, there is enough space for each car to overtake other cars and go on its way. At around 50% capacity the traffic flows pretty smoothly: the cars can maintain speed and everyone reaches their destination in a very predictable time. Now, at 80% capacity, we add more cars. What happens? You start noticing some slowdowns, because overtaking is not as straightforward anymore, at least not without compromising safety, and we do not want to compromise safety here. You cannot change lanes as often, because there are way too many cars; you have to think twice before every lane change and strategize how to get from here to there, and all these different considerations come into play. Now let's raise it to 90% and add even more cars. The traffic becomes very unpredictable. Sometimes it flows, when the cars happen to be perfectly aligned with the gaps so each one can make its turns and slip through. But sometimes, with impatient people on the road, the traffic jams, because everyone wants to pass and no one gives way. Whether our traffic flows or jams now depends on the mood of the drivers, a very unpredictable, out-of-control phenomenon. Every time someone brakes, changes lanes, or starts merging, you see a ripple effect: everyone behind that car starts honking and nobody can move anymore. And at 100% capacity there is a car at every spot. There is no space to move at all. At this point no one can move; the whole highway is blocked. So that is a simplified but very intuitive picture of utilization and latency that we can use to make sense of our systems. Of course, with these abstract analogies of highways, shops, and cafes, these very simplified examples, it all makes sense; the moment you put the same concepts into the context of computing, it becomes a little unintuitive, but the mapping here to the relationship between utilization and latency is one-to-one. Now, the whole point of this analogy, and of the graph showing latency growing steeply with utilization, is a realization you have to come to at some point in your backend engineering career: you cannot run your systems at 100% utilization and expect them to perform well, or really to perform at all. You need some amount of headroom, some spare capacity that is able to absorb traffic, in this case request traffic. So in the wild, in real-world production systems, you'll see that most systems run at around 60 to 80% utilization, reserving the remaining capacity as a buffer in case of traffic spikes. Another reason to keep this buffer, this headroom, is that traffic is not linear. It is not a metronome where every second you get a fixed number of requests; that never happens. The nature of real traffic is that it comes in bursts. This is another term you'll hear a lot when talking about performance and scaling: traffic always comes in bursts, meaning one minute a flood of requests shows up at your server, the next minute there are zero requests, and half an hour later another thousand arrive. That is how real-world traffic works at scale. So even if your average traffic sits well below capacity, at 50% or even 40%, these bursts, when they happen, can instantly spike you above 100%. And if you do not have that additional buffer, that additional headroom for traffic spikes, your system will obviously crash. So that is the reason for understanding utilization, for understanding the steep nonlinear relationship between latency and utilization, and for the realization that you should never, and can never, run your systems at 100% utilization. There should always be some amount of headroom, some buffer in your system's resources. That is the whole point of this first part of our discussion. We have not even started on scaling techniques yet; I am just intuitively building the mental models so that you can connect the dots on your own, instead of just memorizing vertical scaling, horizontal scaling, and all the other concepts that people just
throw at you whenever they talk about system design and performance. Okay. Next: when you say your system is slow, it means that something specific in your system is causing the slowness. Finding that particular something, whatever is causing the slowness, is called identifying the bottleneck. (Let me draw a couple of circles here.) This is another important piece of jargon that you have to remember, or at least be able to recognize when it shows up again. Of course this sounds obvious, but in practice, in backend engineering, we often skip this step, finding the specific part of our system that is causing the delay, the bottleneck, and we jump straight into solutions. We reach for the standard by-the-book practices that are supposed to be applied whenever we encounter slowness. The system is lagging? We just go ahead and add caching, because we have learned from all the different resources that whenever you encounter latency, whenever you hit a bottleneck, caching is the solution to pretty much every latency-related problem. Or if our database seems to be the problem, we are running Postgres 16, so let's upgrade to Postgres 18; let's just upgrade stuff, upgrade our packages. Or if our server looks like the bottleneck, let's add more servers, in other words, horizontal scaling. And sometimes we do get lucky: these by-the-book solutions do often help. But more often we spend days or weeks implementing a solution to a problem that we do not actually have, while the actual bottleneck, the actual slow part of our system, remains completely ignored, and even after implementing our solution the system is still slow, because the real cause of the slowness is still out there. To put this in perspective, let's say we have an API, GET /products with a product ID path parameter, which fetches the details of a specific product for our front end. (I first imagined it as a PATCH call that updates a product, but caching an update endpoint would not make any sense, so let's keep it a GET.) We notice that this particular API appears to be a little slow, and the first thing that comes to mind, mine included, is that the database is most likely the problem, because, well, the database is always the problem. So what do we do? We add caching: we bring Redis into our stack and put a caching function in front of this API to speed it up. We spend one week implementing the caching system and adding it in front of our database, and we deploy the change to our production servers. And then we observe: the API is still slow. After all that, after spending a week finding and implementing the solution, our API still appears to be slow. So this time we decide to dig a little deeper. Instead of just assuming the database was the problem, let's add actual logs. So we go ahead and add timing measurements in our code.
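As a sketch, here is one minimal way to add such measurements in Python. Everything in it is a hypothetical stand-in (the handler, the step names, and the `time.sleep` calls simulating real work); the point is the pattern of timing each step separately instead of the request as a whole.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, timings: dict):
    """Record how long the wrapped block took, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = (time.perf_counter() - start) * 1000

def handle_get_product(product_id: str) -> dict:
    timings: dict = {}
    with timed("cache_lookup", timings):
        time.sleep(0.005)   # stand-in for a Redis lookup
    with timed("db_query", timings):
        time.sleep(0.010)   # stand-in for the database query
    with timed("remote_logging", timings):
        time.sleep(0.040)   # stand-in for a remote logging call
    return {"id": product_id, "timings_ms": timings}

result = handle_get_product("prod_42")
print(result["timings_ms"])  # per-step numbers, not just one total
```

With per-step numbers like these in hand, the slow component identifies itself instead of being guessed at.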
When does the request arrive in our system? When do we actually run the database query? When do we hit the cache? And so on. We add very granular timing across the codebase, at least for this function. And what do we discover? This particular database query takes only around 10 milliseconds. The newly added Redis cache takes 5 milliseconds. But what we additionally discover is a logging function inside this particular service function, the one we are timing, and this logging function writes to a remote logging service, something like Elasticsearch or whatever logging service we use. Using a remote logging service is not really a problem in itself; we often have to. The problem is that it was happening synchronously. Instead of doing it concurrently or in parallel (in the next video of this playlist we are going to cover concurrency and parallelism very extensively, so I will not get too deep into the technical difference between the two here; at a very high, oversimplified level, just read it as "in the background"), we block the request, make the call or whatever transmission the remote logging service requires, wait for it, and only then resume our API. And that whole thing takes around 500 milliseconds. In the end, we realize the database was never the problem. Without actually measuring what was causing the slowness, we just went by the book and implemented caching, because that is what we have always learned: whenever there is latency, just add caching. That is why we were not able to catch the real issue, the actual culprit that was causing the latency, which was our logging service.
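Once measurement has pointed at a synchronous remote call like this, the usual fix is to move it off the request path. A minimal sketch with a background worker thread and an in-memory queue; `send_to_remote` is a placeholder (here it just appends to a list) for whatever your real logging client does.

```python
import queue
import threading

sent_entries = []  # stands in for the remote logging service's storage

def send_to_remote(entry: str) -> None:
    # Placeholder: a real implementation would make the slow (hundreds of
    # milliseconds) network call to Elasticsearch or similar here.
    sent_entries.append(entry)

log_queue: queue.Queue = queue.Queue()

def log_worker() -> None:
    # Runs forever in the background; the slow call happens here,
    # off the request path.
    while True:
        entry = log_queue.get()
        send_to_remote(entry)
        log_queue.task_done()

threading.Thread(target=log_worker, daemon=True).start()

def log_async(entry: str) -> None:
    # Called from the request handler: enqueue and return immediately.
    log_queue.put(entry)

log_async("GET /products/42 took 15ms")
log_queue.join()  # only for this demo; request handlers never wait here
print(sent_entries)
```

The handler now returns without waiting on the network; the slow call still happens, but the user no longer pays for it. A production version would also bound the queue, batch entries, and handle worker failures.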
And in fact it was the synchronous nature of our logging call that was causing the actual latency. In the same way, there are a lot of cases where some unintuitive component of the API workflow is the real source of latency, but we just blame our database, our indexing strategy, or our caching strategy, the obvious suspects, instead of measuring each component of the workflow and finding the root cause. That is why, whenever we encounter latency in an API, we start investigating the database query first, even though the actual culprit might be something like a JSON or XML serialization step; or maybe the whole API workflow is fast and the code is fast, but the response payload is huge and the network transmission itself is adding the latency. So the lesson in this part of the discussion is: without measuring, never guess which component of your workflow is actually adding the milliseconds. When you are talking about performance, never guess, always measure. And since we are talking about measurement, another important concept that comes up here is profiling. You might have already heard of it; it falls into a slightly more advanced category, but it is something you have access to and that should be in your arsenal whenever you are debugging performance problems. Profiling is the practice of measuring where your application actually spends its time. A profiler attaches to your running application and, during actual request serving, records samples of what is happening: which functions are executing, when they started, when they returned, all of these numbers end to end. The raw output of a profiler can be overwhelming, with thousands of functions each accounting for some fraction of the total time, so we have a tool called a flame graph which makes the profiler's output more digestible. A flame graph shows you the call stack over time: functions that take more time appear wider in the graph, so they draw your attention, and functions that are called by other functions appear stacked on top of their callers. So with a quick visual look, without reading the raw profiler output, you can make a good guess about where your application spends most of its time. But the first time you look at your profiler output, you will probably be surprised, because in most real-world applications the bottleneck is never where you expect. You assume the bottleneck is your complex, CPU-bound business logic, but the actual time is spent serializing JSON; you expect the time to go to database queries, but the culprit is an external API being called inside a loop. All of these things only start showing up when you measure your application end to end.
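Here is a minimal sketch of that "surprise" using Python's built-in cProfile. The payload sizes are arbitrary, and the heavy JSON step is deliberately constructed so the profile shows the serializer dominating rather than the arithmetic we were worried about:

```python
import cProfile
import io
import json
import pstats

def serialize_payload():
    # A deliberately heavy JSON serialization step: the unexpected bottleneck.
    data = [{"id": i, "name": f"user-{i}", "tags": list(range(20))}
            for i in range(2000)]
    return json.dumps(data)

def business_logic():
    # The code we *expect* to be slow is actually trivial.
    total = sum(range(1000))
    payload = serialize_payload()
    return total, payload

profiler = cProfile.Profile()
profiler.enable()
business_logic()
profiler.disable()

# Print functions sorted by cumulative time; serialize_payload will sit
# near the top, not the arithmetic.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf).sort_stats("cumulative")
stats.print_stats(10)
report = buf.getvalue()
print(report)
```

Tools like py-spy, async-profiler, or perf do the same job for live processes, and their output is what you would feed into a flame graph.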
But profilers are, in my experience, not very good with IO-bound tasks. IO-bound tasks are things like database queries, file handling, serialization to disk, or external API calls: tasks that deal with input/output and with waiting for the results of input/output. Profilers are more useful for measuring CPU-bound tasks, where your application is actually spending CPU cycles performing computation. Most of the time, though, the performance problems of a backend application are IO-bound. It is almost always some kind of input/output problem, whether at the database level, an external service, or queues, unless you are dealing with things like image processing or machine-learning workloads with heavy matrix manipulation. In a typical SaaS application it is the IO-bound tasks that cause the performance problems, and CPU profilers are not very good at catching or measuring them. That is why we have another tool at our disposal, called distributed tracing, which falls under the category of observability. Tracing means following a particular request as it flows through your system and recording timestamps at every step: at what time it entered a component, at what time it started the database query, at what time it entered and left a particular function, at what time it made an external API call. After you have set up distributed tracing in your application, you will see things like: this particular request, say GET /products/{id} with id 5, spent only 2 milliseconds in your actual business logic but around 800 milliseconds at the database query level. And now that you have that data, you can focus on the database query. To find out at which level, which component, which element we are spending the most time, we have to implement distributed tracing in our application.
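The idea can be sketched with a toy, in-process span recorder. Real distributed tracing would use something like OpenTelemetry with spans shipped to a collector; here a `time.sleep` stands in for the 80 ms database query, and all names and numbers are invented for illustration:

```python
import time
from contextlib import contextmanager

# Toy in-process stand-in for a tracer: each span records (name, duration_ms).
spans = []

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

def fetch_product(product_id):
    with span("db.query"):
        time.sleep(0.08)  # pretend the query takes ~80 ms
        product = {"id": product_id, "name": "widget"}
    with span("business.logic"):
        product["price_with_tax"] = 100 * 1.2  # trivial CPU work
    return product

with span("GET /products/5"):
    fetch_product(5)

for name, ms in spans:
    print(f"{name:18s} {ms:7.2f} ms")
```

Even this crude version makes the imbalance obvious: nearly all of the request's time is inside the `db.query` span, almost none in `business.logic`, which is exactly the signal you want from tracing.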
Distributed tracing is a very useful tool for quickly measuring the different kinds of latencies in an API. Now, credit where it is due: databases are typically the bottleneck in most backend applications, and when we are trying to optimize performance, the database is one of the first things we look at. There are good reasons for this, because databases do the hard work. They store whatever data we have durably on disk, so that every time we boot our system the data is still available, and the same goes for insertion, update, and deletion, all while surviving crashes and failures. They give us consistency guarantees by handling things like concurrent reads, concurrent writes, and locking; a lot of advanced research goes into databases every day. And they execute complex queries across millions or billions of rows. All of this genuinely takes time, so we cannot simply blame databases for all our performance problems. But there are some common low-hanging fruits among the typical database-related performance problems we face. The first one is a notorious one: the N+1 query problem. To understand it, let's say you are a front-end engineer working on a React blog application. To show a list of around 20 blog posts on the homepage you have to make an API call, but for each of those posts you also need to show the author's name, and the author's details are not available in the list API; the list response does not include the author of each post. So you make the first API call to fetch the whole list, and then, in a loop, one more API call per post to fetch its author. For 20 posts that is 20 author API calls, so 21 API calls in total for showing this single page. This pattern of fetching data is called the N+1 query problem: one query to get N items, and then N queries to fetch a detail for each of those items. For 20 items you make 21 calls; for 100 items, 101 calls; for 1,000 items, 1,001 calls. If you plot the number of API calls on the y-axis against the number of items to show on the x-axis, it grows linearly with the number of items. So what is the problem here? The obvious one is that the number of items to display and the number of network calls should not be related linearly; any engineer can feel that something is wrong if we are making a thousand requests to show a thousand items. But even if we ignore the obvious problem, the second problem is that each database query has some amount of overhead: there is a network hop to transmit the request and response between your application server and your database server, and before that there is TCP connection setup (unless you use connection pooling and reuse an existing connection). Even after the query reaches the database, the database has to parse it, plan its execution, actually execute it, and return the results back to your application. So even if a single query takes only 5 milliseconds, which is extremely fast, showing 1,000 items by making 1,000 queries ends up taking 5,000 milliseconds. Your users stare at a loader for 5 seconds straight, which is not a good user experience, at least by the standards of 2025. Now what is the solution? The solution is pretty obvious; we don't even have to go into the mechanics of it: fetch the data in bulk.
Instead of making a thousand API calls for a thousand items, we should be able to fetch all of them in a single request. In our case, instead of making 100 separate requests for author details, you collect all the author IDs into a single array (or whatever list structure your programming language uses) and fetch all of them in a single query. So now it is one query to fetch all the posts and another query to fetch all the authors of those posts: two queries in total, whether you have to show 20, 100, or 1,000 blog posts. And if you look at this from your server's perspective instead of the front end's: so far we have talked about your React app making API calls to your server, but the server does not really store the data itself; it uses the database. Every time you fetch a list of posts, the server is actually fetching them from the database, and every time you fetch an author's information, the server fetches that from the database too. The N+1 query problem actually lives at the server level, not the front-end level. Traditionally, one of the problems with using ORMs was that people used to write code like this: a fetchPosts function that runs something like db.select(posts).where(userId = ...) and gets back an array of posts, and then, in a for loop over those posts, runs a select author from users where user_id = author_id for each one. This was traditionally called the N+1 query problem because, in the context of the database, your backend is effectively the front end for your database, and the N+1 problem traditionally happened at the server level; the front-end example was just to make it more intuitive. But in a
lot of modern ORMs there are tools for this. In Django, for example, you can use select_related to fetch data across foreign-key relationships, or prefetch_related for many-to-many relationships. Ruby on Rails has includes, and TypeScript ORMs have things like leftJoin and select. Pretty much every ORM, whether Prisma or Drizzle, has primitives to bulk-fetch or to fetch related data using joins. If you are writing raw SQL, the solution is of course to use joins: LEFT JOIN, INNER JOIN, whatever fits. But if you are expressing your data-fetching logic purely through your ORM, there are primitives at the ORM level to fetch all the related data along the foreign keys. And the solution is exactly this: instead of fetching the data you need in a loop, instruct your ORM to fetch or prefetch everything you need in bulk, which is much more efficient, running one or two SQL queries instead of a thousand queries for a thousand items. The reason the N+1 problem is so easy to fall into with ORMs is that ORMs give you a way of accessing data through your programming language's own types and constructs. You write something like posts = await db.select(posts).where(...), and then, by intuition, for post in posts: await fetchAuthor(...). It looks like normal TypeScript code to you, but the ORM has its own way of fetching data, running raw SQL behind the scenes, and because the code looks so familiar we don't think about how the ORM actually executes those statements. That is why modern ORMs have an option to print the actual SQL queries they run behind the scenes, so that during development you can debug whether this is the SQL query pattern you actually want, or whether you should make modifications or drop down to raw SQL. So anyway, the N+1 query problem is something to be aware of and to look out for whenever you are using ORMs and databases: always reach for joins and your ORM's bulk-fetch constructs to avoid it.
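Here is a runnable sketch of the N+1 pattern versus the bulk fetch, using SQLite as a stand-in for Postgres and sqlite3's trace callback to count the queries actually issued. The schema mirrors the 20-posts example above; the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT,
                        author_id INTEGER REFERENCES authors(id));
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(i, f"author-{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(i, f"post-{i}", (i % 5) + 1) for i in range(1, 21)])

queries = []
conn.set_trace_callback(queries.append)  # record every SQL statement run

# N+1 pattern: 1 query for the posts, then 1 query per post for its author.
queries.clear()
posts = conn.execute("SELECT id, title, author_id FROM posts").fetchall()
for _, _, author_id in posts:
    conn.execute("SELECT name FROM authors WHERE id = ?",
                 (author_id,)).fetchone()
n_plus_1 = len(queries)

# Bulk pattern: 1 query for the posts, 1 query for all needed authors.
queries.clear()
posts = conn.execute("SELECT id, title, author_id FROM posts").fetchall()
author_ids = sorted({author_id for _, _, author_id in posts})
placeholders = ",".join("?" * len(author_ids))
authors = dict(conn.execute(
    f"SELECT id, name FROM authors WHERE id IN ({placeholders})",
    author_ids).fetchall())
bulk = len(queries)

print(f"N+1 pattern ran {n_plus_1} queries, bulk pattern ran {bulk}")
```

The counts are exactly the ones from the discussion: 21 queries for the loop, 2 for the bulk version, and the gap widens linearly with the number of items. An ORM's select_related or a SQL JOIN collapses it further to a single query.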
Now, moving on: the second and most relevant concept when it comes to databases and performance is indexes. Every single engineer knows about these. The most common source of database performance problems involves indexes, or, to be more specific, the lack of indexes, the absence of indexes. For people who are already familiar with indexes there is not much to explain: you have to understand your data-access patterns through your application's observability tooling and then add indexes to the right tables so that access is faster. But to give a brief explanation of what indexes are, there is a very common analogy people use: the library. Imagine a library with around a million books spread across different shelves, different sections, different rooms, different floors, but with no catalog, no way to track which kind of book lives on which floor, in which section, on which shelf. If an excited new reader comes to your library and asks for all the books by a specific author, say John Green, then, since you as the librarian maintain no catalog whatsoever, you have only one choice: walk through the entire library, examining every single book, and hand the reader a box containing all the John Green books you collected. There are two problems with that. First, it takes a long time, probably three days until you are done. Second, if the same request comes again, whether from the same reader or a second one asking for all the John Green books, your three-day adventure starts all over, because after the first reader is done with the books you put them back on the shelves in random order. Now that we have that intuition, what we are illustrating with this library example is this: the activity of going through your whole library, examining each book to check whether it is one you want, is called a sequential scan, also known as a full table scan, which means exactly what it sounds like: going through each row of a table, checking row by row whether it matches, collecting the matches one by one, and returning the results. Of course, since we are talking about computing and databases, it does not take three days; on a very large table it might take somewhere between one and two seconds. But by the standards of modern performance, that is still slow. So what is the solution? Going back to our library example: if we as librarians maintain a catalog, organized by author name, A, B, C, and so on, then under J we have John Green, and along with the name we store the exact locations of all of John Green's books in our library. Once we have that information, we go directly to the right shelves and sections, fetch all the books, and return them to our reader, which takes around two to three minutes depending on how big the library is. Mapping this back to our database context, that catalog is what an index is. It is typically a data structure, usually a B-tree index; there are other types, like hash indexes, but for the use case we discussed the database uses a B-tree, and the property of that data structure is that it maintains a sorted copy of all the values of a column. Let's say we indexed our books table on the author_id column (a foreign key): the index maintains the sorted list of author IDs, 1, 2, 3, 4, and so on, along with pointers to the matching rows. So when you search by that column, say all the books of author 3, the database follows those pointers and instantly knows exactly which rows it needs to fetch, instead of having to scan each and every row. Compared to the full table scan, or sequential scan, this is a much faster operation. Where a sequential scan over a million rows might take a few seconds to find all the books for an author, with an index in place it takes somewhere around 40 milliseconds, well under 100 milliseconds, which is a huge
performance improvement. But there is another important fact to keep in mind: indexes are not free. They are not a one-stop solution to all your database performance problems, because indexes have costs of their own. The first cost is space. Each index has to store its sorted list, along with the pointers and everything needed to fetch the rows, somewhere, and the larger your table grows, the larger the indexed data grows with it. The second, more problematic cost is that every index also has to be updated every time you do an insert, update, or delete on the underlying table, the table whose column you indexed. The index has to stay in sync with those operations, so that when you run a SELECT query it fetches data consistent with the current state of your table. So if, to improve performance, you went ahead and created an index on every single column of your table, then every insert, update, or delete on that table would have to update all of those index data structures one by one, and you would see huge latency numbers; your write operations especially would slow down significantly. So how do we decide? As I explained in the database video, at the start, when you write the migration creating a new table, you can add one or two indexes for columns that you already know will be part of queries. By default the primary key, the id column, is already indexed by the database, at least in Postgres, so you do not have to index it explicitly; searching by id will already use an index. But any other column has to be indexed manually. In our books table, the id column is indexed automatically, but the author_id column is not, and since we know that author_id is such a common column to search books by, we can add an index on author_id while defining the table, before we even write our APIs. For the non-obvious cases, where we are not sure whether a particular column is going to be part of many queries or many joins (indexes are used in joins too), we should wait. We should monitor our app, look at our logs and our traces, and only after we know for sure, after the data shows that certain API endpoints are hit most frequently and distributed tracing shows that a particular database query is slow because of the absence of an index on, say, the ISBN column, do we add the index on that column. This is how, in a real production system, we decide which columns to index, beyond the very obvious ones like author_id. Because indexes have real overhead, you should be cautious rather than generous when you are defining different indexes on your
table. Two other concepts come up while we are talking about indexes: composite indexes and covering indexes. It is important to understand the difference between these two, so I'll mention them briefly. A composite index covers multiple columns, column one and column two. For example, if you frequently query by both user_id and created_at, then a composite index defined as (user_id, created_at) will serve that query better than two separate single-column indexes. The order in which you declare the columns matters: user_id comes first, then created_at. A composite index in that order helps a query that filters on both user_id and created_at, and it also still applies if you query by user_id alone, because user_id is the leading column in the order you created it. But if you query only by created_at, this index will not be used. So the ordering you choose when defining a composite index matters. That is composite indexes; now, what is a covering index? Let's say we have a departments table, and our internal sales team frequently adds, updates, and deletes departments depending on client requirements. The departments table has, say, a hundred different columns, but the only columns we ever need are the department's name, which we show in our client app, and its id. Since we only need those two columns, we can use a covering index that includes the name column. With a covering index, whenever we make a query that only involves the department's name, the database can serve all the data we need from the index alone; it does not have to go back to the actual table contents to fetch the data. In the same way you can include multiple columns, but of course storing more data in the index makes the index even bigger, so you have to carefully analyze whether it is worth it.
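Both behaviors can be observed directly. Here is a small sketch using SQLite's EXPLAIN QUERY PLAN output (Postgres expresses covering columns differently, for example with CREATE INDEX ... INCLUDE, but the leftmost-prefix rule is the same; the table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, payload TEXT)""")
# Composite index: user_id is the leading column, created_at is second.
conn.execute("CREATE INDEX idx_user_created ON events (user_id, created_at)")

def plan(sql):
    # Return SQLite's human-readable query plan for a statement.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[-1] for r in rows)

# Leftmost-prefix rule for the composite index:
both = plan("SELECT created_at FROM events "
            "WHERE user_id = 1 AND created_at > '2025-01-01'")
leading = plan("SELECT created_at FROM events WHERE user_id = 1")
trailing = plan("SELECT payload FROM events WHERE created_at > '2025-01-01'")
print("both columns :", both)      # uses idx_user_created
print("leading only :", leading)   # still uses idx_user_created
print("trailing only:", trailing)  # full table scan, index not usable

# Covering vs non-covering: payload is not in the index, so the second
# query has to go back to the table rows.
covered = plan("SELECT user_id, created_at FROM events WHERE user_id = 2")
uncovered = plan("SELECT payload FROM events WHERE user_id = 2")
print("covered      :", covered)
print("not covered  :", uncovered)
```

SQLite literally prints "USING COVERING INDEX" when every referenced column lives in the index, which makes the distinction easy to see during development.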
But the question you can ask in the end is: how do you actually know which columns to index at the database level? Let's say using distributed tracing you found out that a particular complex query, a SELECT with joins, a GROUP BY, a HAVING clause, whatever it is, is slow. How exactly do you know which column in all of that you have to index? This is very simple. A lot of databases offer a very useful tool called EXPLAIN, or EXPLAIN ANALYZE. If you prepend EXPLAIN ANALYZE to your query, the database shows you exactly how it fetched the data: which table it started from, which tables it joined, which index it used for each piece of data, all the internal details of how the whole query plan was executed from start to end. You can look at that query plan and find the columns for which it is doing a sequential scan, a full table scan, with no index being used, and if the rows of that table number in the thousands or millions, you can guess that this might be the cause of the latency and add an index there. Even after adding the index, you can run the same EXPLAIN ANALYZE query a second time and check whether the database is actually able to use that index: instead of showing a full table scan or sequential scan, it should show an index scan, and your query should feel much faster. So by using these measuring techniques, distributed tracing to find the exact slow database query and then EXPLAIN ANALYZE to find the exact columns, you can track down the root of your database performance problems.
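Postgres's EXPLAIN ANALYZE output is verbose, so as a compact stand-in, here is the same before-and-after exercise with SQLite's EXPLAIN QUERY PLAN (the books table and index name are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, "
             "title TEXT, author_id INTEGER)")
conn.executemany("INSERT INTO books (title, author_id) VALUES (?, ?)",
                 [(f"book-{i}", i % 100) for i in range(10_000)])

def plan(sql):
    # Return SQLite's human-readable query plan for a statement.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[-1] for r in rows)

query = "SELECT title FROM books WHERE author_id = 42"

before = plan(query)
print("before:", before)   # a full table scan over books

conn.execute("CREATE INDEX idx_books_author ON books (author_id)")
after = plan(query)
print("after :", after)    # a search using idx_books_author
```

The workflow is exactly the one described above: run the plan, spot the sequential scan, add the index, re-run the plan, and confirm the index scan shows up.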
That is all about indexes when we talk about databases. The next overhead that comes up is the cost of connections. One of the realizations you will have once you start scaling your backend is that database connections are not as cheap as you thought. When we are working locally, they are so cheap that we never even have to think about database connections and their performance overhead, but once you start scaling you will see that database connections can get very expensive at scale. To understand that, let's talk a little about database connections. Say we have our server and our database: what exactly happens when your application or your backend connects to your database? To explain it simply, it establishes what we call a TCP connection. The TCP connection itself is a three-way handshake: the client sends a SYN, receives a SYN-ACK, and replies with an ACK, which goes deep into network-engineering territory, but just know that we establish a TCP connection using a three-way handshake. After the TCP connection is established, the database performs any necessary authentication, negotiates encryption (the TLS handshake, with all its public-key and private-key machinery), sets up session state for data access, and then actually allocates some amount of memory, a few megabytes to start with, for this connection. All of these things happen, one by one, each time your server establishes a new connection with your database, and all of them take time and resources. So the point we are trying to make is: if your backend establishes a new connection for every single database operation and instantly closes it once that operation is done, you keep paying this whole connection-setup cost repeatedly, again and again, and you pay it in latency, every time you get a new HTTP request, every time you have to make a new database query. The second thing is the database's connection limit, because the database also limits the number of connections you can establish with it. A typical Postgres database allows a few hundred connections, say 400 to 600, depending on how much memory and CPU you have configured for it. So if your application gets a traffic spike, say a busy sale day, and it is dealing with hundreds or thousands of requests per second concurrently while opening a connection per request, it can easily exhaust your database's connection limit, and your database will crash. To prevent both of these problems, exhausting the connection limit and paying the setup cost every time we want to make a new database query, people came up with the concept of connection pooling, a very important concept when we are talking about performance and scaling.
connection pooling solves both of these problems. the problem of having to establish a new connection with our database every time we want to uh make a new database and the second thing is the the danger the potential of exhausting our database connection limit and how does it help us. So instead of establishing a new connection with our database every time we want to make a new query what we do we interact with a middleman which we call as the database pool. This is our server. This is our
database. And instead of uh directly interacting with our database, we interact with the pool. And what does this pool do? These connection pools, they in turn send our request to the database in a more optimized because what a pool does, it maintains a set of connections. Instead of repeatedly opening a new connection and closing it, it maintains a set of open connections, set of idle connections or whatever we call in the terminology of pools. Basically a set of connections it has
readym made for us. So when our server when our back end needs to interact with our database what we do our server asks for a connection asks to borrow a single connection from our pool and it takes that connection and using that connection it makes the database query which is the result after it is done it gives the connection back to the pool. So after that instead of closing the connection after we are done with our query the pool keeps the connection open depending on your configuration uh it depends uh how long the pool will keep
the connection open but it will keep the connection open for some time that you can assume so that the next time we want to make a new database query it will reuse the same connection instead of creating a new connection to our database. By using this pool architecture we got rid of those two problems. Even when we are talking about uh database pooling, there are two types of pooling patterns. One is using internal what is say and the second one is external pooling. Internal pooling
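The borrow/return cycle described above can be sketched in a few lines. This is a toy illustration of the idea, not a production pool — real driver pools add thread safety, health checks, and timeouts — and `make_conn` here just stands in for whatever actually opens a database connection.

```python
import queue

class ConnectionPool:
    """A toy connection pool: keeps up to max_size open connections for reuse."""

    def __init__(self, make_conn, max_size=10):
        self._make_conn = make_conn          # factory that opens a real connection
        self._idle = queue.Queue(max_size)   # connections kept open, ready to borrow

    def borrow(self):
        try:
            return self._idle.get_nowait()   # reuse an idle connection if we have one
        except queue.Empty:
            return self._make_conn()         # otherwise pay the cost of opening one

    def give_back(self, conn):
        try:
            self._idle.put_nowait(conn)      # keep it open for the next query
        except queue.Full:
            conn.close()                     # pool is full; close the extra connection

# Usage: only the first borrow opens a connection; later borrows reuse it.
opened = []
def make_conn():
    conn = type("Conn", (), {"close": lambda self: None})()
    opened.append(conn)
    return conn

pool = ConnectionPool(make_conn, max_size=2)
c1 = pool.borrow()
pool.give_back(c1)
c2 = pool.borrow()
assert c1 is c2 and len(opened) == 1   # same connection reused, opened only once
```

The point to notice is in the usage at the bottom: the second borrow never pays the connection-establishment cost, which is exactly the first problem pooling was meant to solve.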
Internal pooling means the pool lives at the application level: these days virtually all modern database drivers provide it. The driver itself maintains a pool of connections, and your code borrows from that pool to make queries and gives connections back. It is called internal because it is specific to each server — and in a modern web application we rarely have a single server; when we want to scale, we add more servers. So each of these servers, since we are using internal pooling, maintains its own pool of connections. The problem shows up in the way we scale. Adding more instances of our server when traffic spikes is called horizontal scaling. Here is one problematic scenario: say our traffic spikes and our Kubernetes autoscaling infrastructure adds two more instances. Each server maintains a pool of, say, 150 connections, and with three instances that is 150 + 150 + 150 = 450 — the maximum number of connections we might push toward the database. The problem is that our database can only handle 300 connections, so the moment the traffic spikes again and
we push to, say, 400 connections across all three instances of our backend, our database is going to crash, because we exhausted its connection limit. That is the reason we have external poolers. For Postgres, a very popular external pooler is PgBouncer — a lightweight, open-source connection pooler for Postgres that is worth taking a look at. The whole idea of PgBouncer: let's explore the same scenario again. We have a traffic spike and our Kubernetes autoscaler adds two more server instances. But this time, instead of each server having its own internal pool of connections, we configure an external PgBouncer pooler that maintains a pool of, say, 250 or 300 connections, and all the instances of our server talk to this pooler over the network. The pooler, in turn, deals with our database. Now we will not exhaust our database's limit, because a single shared pool serves every instance of our server — there is never a scenario where a traffic spike makes us blow past the connection limit without realizing it. That is the difference between internal pooling and external pooling.
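With an external pooler, most of the work is configuration rather than code. Here is a rough sketch of what a `pgbouncer.ini` for this setup might look like — the host, ports, and limits are illustrative values, not recommendations:

```ini
; Illustrative pgbouncer.ini sketch -- names, addresses and limits are made up.
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; return each connection to the pool after a transaction
max_client_conn = 1000       ; app instances may open many cheap client connections...
default_pool_size = 250      ; ...but only this many real connections reach Postgres
```

The app servers then connect to PgBouncer's port (6432 here) instead of Postgres directly, and `default_pool_size` caps the real database connections no matter how many app instances the autoscaler adds or removes.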
When we go to production, especially if we are aware of our traffic-spike patterns, we usually prefer configuring an external pooler such as PgBouncer over relying on internal pooling alone. And with that, we have covered some of the important performance concerns around databases. The first was the N+1 query problem; the second was indexing strategy — what to do and what not to do; and the third was the overhead of establishing and destroying connections again and again, and the importance of pooling for performance. So by this point you have already optimized your queries, added all the appropriate indexes, and configured your connection pool — but your database is still the bottleneck; you are still seeing latency at the database level. At this point the obvious next step is caching. We have discussed caching at length in the caching video in this playlist, so we are not going to go too deep into the mechanics of
caching here. But since we are talking about performance and scaling, we cannot avoid mentioning it: caching is an integral part of a system's performance. The idea of caching is beautifully simple — store the result of any expensive operation. In our case the expensive operation is the database layer, so whatever complex queries we are running, we take the result and store it in Redis or whichever caching solution you use. The next time a request comes in, instead of going through all those complex queries, we take a detour and check whether the caching layer already has the result of that particular query. If it does, we return that response directly instead of going down the path that takes, say, 500 or 800 milliseconds; returning the same response from the caching layer takes around 50 milliseconds. So with just one move we reduced the perceived latency from 800 milliseconds to 50 milliseconds. That is the part of caching every engineer knows — the obvious use case. But while we are discussing caching, one of the most
problematic situations you will run into is cache invalidation. There is a famous quote: there are only two difficult problems in programming — naming things (your variables, your functions, your classes) and cache invalidation. That is not an exaggeration. If you have personally implemented multi-layer caching in a system, you know cache invalidation is genuinely hard, because it touches so many layers: your servers, CDNs, your clients' browsers, and whatever sits in between, such as reverse proxies. When the underlying data actually changes — in our case, when the result of that complex SQL query over our database changes — we cannot serve the same old cached data anymore. We have to invalidate that cache somehow, and keeping the cached data in sync with our
original data, which lives in our database — that is the difficult problem we are talking about. To deal with cache invalidation, there are primarily two techniques we regularly use: the first is time-based expiration and the second is event-based invalidation. Time-based invalidation is pretty straightforward. We decide beforehand how long we want to keep a cache entry valid — say 5 or 10 minutes. After that amount of time, the entry should expire, and the next time a request comes in, we consult the original source of data instead of taking the cached value. That is time-based invalidation. The problem with it, of course, is figuring out the optimal expiration time, which depends on our users'
access patterns and the frequency with which each API is hit — and that differs from endpoint to endpoint and from service to service. We have to take a lot of things into account before deciding on the optimal time for automatic cache invalidation. The second technique is event-based invalidation, which is much simpler: every time we make a change to the original data, we invalidate the cache then and there. For example, say we were caching the result of a user's profile. Every time an update operation happens on that profile — the user updates their name, their email, or some preference — in that same handler code, right after updating the user's data in the database, we go ahead and invalidate (or simply delete) the cache entry. The next time we hit the GET endpoint for the user profile, we check whether a cache entry exists for it; since we already deleted that entry when the update happened, we do not find one, so we consult the original source — the database — fetch the user's details, and return them in the response. This way we do not have to guess or predict an optimal expiration time; we know exactly when the cache needs to be invalidated. The small problem with this approach is that at the code level, at every place where you update the user's profile data, you have to remember to also invalidate the cache, since there is no automatic expiry. If you forget even one such place, you risk showing stale data to your users.
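Both techniques can be shown in one small sketch. An in-memory dict stands in for Redis, another dict stands in for the database, and the key and profile values are made up for illustration:

```python
import time

cache = {}                                # key -> (value, expires_at); stands in for Redis
db = {"user:2": {"name": "Ada"}}          # stands in for the database

def cache_set(key, value, ttl_seconds):
    # Time-based invalidation: every entry carries its own expiry timestamp.
    cache[key] = (value, time.monotonic() + ttl_seconds)

def cache_get(key):
    entry = cache.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.monotonic() >= expires_at:    # entry expired: behave like a miss
        del cache[key]
        return None
    return value

def update_user(key, new_profile):
    db[key] = new_profile
    # Event-based invalidation: delete the entry right after the write,
    # so the next read falls through to the database and sees fresh data.
    cache.pop(key, None)

cache_set("user:2", db["user:2"], ttl_seconds=300)
update_user("user:2", {"name": "Grace"})
assert cache_get("user:2") is None        # next read must go to the database
```

Notice the failure mode mentioned above: if `update_user` forgot the `cache.pop` line, readers would keep seeing "Ada" for up to the full 300-second TTL.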
Both methods have their pros and cons. That is the problem of cache invalidation. Now, still on caching: where does your cache actually live? There are two options — local caching and distributed caching. Local caching means that instead of storing your cached values in an external service, you create a data structure, something like a dictionary or a map, cache your values inside it, and keep it in your own server's memory. The problem is that if you have 10 instances of your server, each instance holds its own copy of that local cache, and now you have cache inconsistency across those 10 servers. Distributed caching means using an external service such as Redis or Memcached — there is also the modern option called Valkey. The advantage of a distributed (external) caching service is that even with 10 server instances, all of them talk to the same caching server, so the inconsistency problem disappears. The con is that getting a cached value now requires a network round trip, and the distance between server and cache adds latency. In-memory caching is almost instant — a couple of milliseconds at most, depending on how large your payload is and what serialization you use — but once a network round trip is involved, even on an internal network, the latency can easily grow by an order of magnitude. So many systems use what we call tiered caching: a small local cache for the hottest data, with the distributed cache — something like Redis — sitting upstream of it. What counts as "hottest" depends on your caching algorithm — whether you evict by least-recently-used or least-frequently-used values — but in essence it is the data in highest demand, and that is what we keep in the local cache. When some request needs a piece of data, it first checks the local cache, and only then goes to the global (external) cache. Depending on your configuration and algorithm, after fetching an item from the global cache it can decide whether to also store that value in the local cache, depending on how hot that data is.
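The tiered lookup just described — local first, then the shared cache, with promotion of hot items — can be sketched like this. A plain dict stands in for the remote Redis/Valkey tier, and the key is made up:

```python
local = {}                       # tier 1: small in-process cache (fast, per server)
remote = {"user:2": "profile"}   # tier 2: stands in for Redis/Valkey (shared, one network hop)

def tiered_get(key, promote=True):
    if key in local:             # tier 1 hit: no network round trip at all
        return local[key]
    value = remote.get(key)      # tier 2: shared cache, costs a network hop
    if value is not None and promote:
        local[key] = value       # promote hot data into the local tier
    return value

assert tiered_get("user:2") == "profile"   # served from the remote tier, then promoted
assert "user:2" in local                   # the next read skips the network entirely
```

A real implementation would bound the local tier (an LRU, for instance) and decide `promote` based on access frequency rather than promoting unconditionally.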
Another important thing when talking about caching is caching patterns. Primarily we use three of them: cache-aside, write-through, and write-behind. A caching pattern basically answers the question: at what point do we actually go ahead and cache the data? How do we decide which pattern to use? Over time, people have come up with different techniques, and depending on the access patterns of your application, of your users, and of each particular endpoint, you will end up using one of these three (or other) patterns. The first is the cache-aside pattern. This is the most widely used pattern — the one we all implement even when we don't know its name. Say we have our database, and there is a complex query that needs to run for our endpoint, /users/:id — in this case /users/2. The request comes in and first checks our caching layer. If the result for this user's profile is found in the cache, it is returned as the response immediately. If not, we consult the database: we run the complex query, get the result, save it in the cache, and then finally return the response. Similarly, when an update operation happens on this user, our application code explicitly deletes (invalidates) that cache entry, so that the next request to the GET endpoint misses the cache, goes to the database, and finds fresh data — we never end up serving stale data to our users. This whole pattern is called the cache-aside pattern, or the lazy-loading pattern.
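The read and update paths of cache-aside can be sketched as follows. Dicts stand in for Redis and the database, `fetch_profile_from_db` stands in for the expensive query, and the names and profile fields are made up:

```python
cache = {}                                              # stands in for Redis
db = {2: {"name": "Ada", "email": "ada@example.com"}}   # stands in for the database
db_reads = []                                           # track trips down the expensive path

def fetch_profile_from_db(user_id):
    db_reads.append(user_id)
    return db[user_id]

def get_user(user_id):
    key = f"user:{user_id}"
    profile = cache.get(key)                 # 1. check the cache first
    if profile is None:                      # 2. miss: run the expensive query...
        profile = fetch_profile_from_db(user_id)
        cache[key] = profile                 # 3. ...and populate the cache on the way out
    return profile

def update_user(user_id, profile):
    db[user_id] = profile                    # write to the source of truth...
    cache.pop(f"user:{user_id}", None)       # ...then invalidate (not update) the entry

assert get_user(2)["name"] == "Ada"          # miss: hits the database
assert get_user(2)["name"] == "Ada"          # hit: served straight from the cache
assert db_reads == [2]                       # only one database read so far
update_user(2, {"name": "Grace", "email": "grace@example.com"})
assert get_user(2)["name"] == "Grace"        # invalidation forces a fresh read
```

Note that the update path deletes the entry rather than rewriting it — that deletion is what distinguishes cache-aside from the write-through pattern discussed next.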
Cache-aside is the most used — and the most intuitive — pattern in backend engineering. Next is write-through, which means that instead of invalidating the cache entry every time the user's information changes, every write operation updates both the cache and the database at the same time. Instead of invalidating, it writes through to the cache entry — hence the name. The advantage is obvious: there is never a cache miss; we always find a cache entry for the user's profile details. The disadvantage is that every write takes a few milliseconds longer, because we must write to both the database and the caching layer before we can return a successful response. And finally we have write-behind, which takes a clever approach to exactly that latency problem. Whenever there is an update operation — say we are updating the user's name — instead of updating both the database and the caching layer synchronously, we update only the caching layer. Since operations on the caching layer are extremely fast, we finish that, return a successful response, and update the value in the database asynchronously. With this pattern we avoid the latency of the double write — but if the asynchronous database operation fails for some reason, the caching layer and the database are left in an inconsistent state. So those are a few caching patterns you should be aware of when thinking about scaling and performance.
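The difference between the two write patterns fits in a few lines. This is a deliberately simplified sketch: dicts stand in for the database and cache, and a plain deque stands in for what, in a real write-behind system, would be a durable queue drained by a background worker:

```python
import collections

db, cache = {}, {}
write_queue = collections.deque()   # stands in for a background flush queue

def write_through(key, value):
    # Slower writes: the response waits on BOTH the database and the cache.
    db[key] = value
    cache[key] = value

def write_behind(key, value):
    # Fast writes: update the cache now, queue the database write for later.
    cache[key] = value
    write_queue.append((key, value))

def flush():
    # Normally a background worker. If it crashes before flushing,
    # cache and database diverge -- the pattern's main risk.
    while write_queue:
        key, value = write_queue.popleft()
        db[key] = value

write_through("a", 1)
assert db["a"] == 1 and cache["a"] == 1    # both stores updated before returning
write_behind("b", 2)
assert cache["b"] == 2 and "b" not in db   # response returned; database not updated yet
flush()
assert db["b"] == 2                        # eventually consistent after the flush
```

The window between `write_behind` returning and `flush` running is precisely the inconsistency window described above.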
One last topic we should cover under the caching section is a number called the cache-hit rate. This metric says what percentage of your requests successfully get the information they want from the caching layer, versus how many miss and have to consult the original source. If you have a 90% cache-hit rate, it means 90% of your requests get their response from the caching layer, and only 10% have to go to the original source — the database — because they did not find what they were looking for in the cache. A 90% hit rate is very good for a system. A very low hit rate — something like 20% — means that even though you have implemented a caching layer, there is an issue somewhere in your caching algorithm or caching pattern that you need to find and fix.
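Measuring the hit rate is simple bookkeeping — count hits and misses at the cache-lookup site and divide. A minimal sketch (in production you would export this as a metric rather than compute it inline):

```python
class CacheStats:
    """Track cache-hit rate: hits / (hits + misses)."""

    def __init__(self):
        self.hits = self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
for hit in [True] * 90 + [False] * 10:   # 90 requests served from cache, 10 fell through
    stats.record(hit)
assert stats.hit_rate() == 0.9           # the "90% cache-hit rate" from the text
```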
So what are the factors that actually affect our cache-hit ratio? Some important ones: first, your TTL — the automatic-expiry property we set on cache entries. A longer TTL tends to raise your hit ratio, because items live in the cache longer and have a better chance of getting a hit sometime in the future. But with a large TTL comes the risk of staleness: set something like 7 days and you have largely defeated the purpose of setting a TTL in the first place; now you have to actively fight stale data. The second factor is cache size. The more memory you have, the larger your cache, the more data it can hold, and the more hits it can take — pretty obvious. Third is your data access patterns. If you don't understand which endpoints of your backend are frequently accessed and at what times — basically, if you don't understand your users' behavior — then your caching algorithm, your caching patterns, your whole caching strategy will not work, and you will not see a good hit ratio like 70% or 80%. The cache-hit ratio directly depends on how deeply you understand the behavior of your platform, or at least the behavior of your users and their access patterns. So that is another factor affecting your hit ratio, and that is all about caching. Now, moving on, instead of the factors that affect scaling, let's talk about scaling directly for a bit. Since this is a video about scaling and performance, we have to cover the
obvious parts — the standard textbook answers of scaling, which are vertical scaling and horizontal scaling. The question that leads us there: your application runs on servers, and as your traffic and user base grow, those servers and their resources start becoming insufficient; over time you need more and more capacity. So how do you actually get it? To answer that, we have come up with two broad categories — two fundamental approaches to scaling. The first is vertical scaling, the second is horizontal scaling. Vertical scaling is also called scaling up; you can imagine it as increasing the height of your server — making the server itself more and more powerful. You take your old server and, to make it more powerful, you literally replace it with a bigger machine. How? With a few obvious hardware upgrades. First, CPU: you add more cores — 2, 4, 8, and so on. Second, memory — your RAM, the primary and fastest kind of memory: if it was 2 GB, you make it 4 GB, then 8, 16, 32, whatever your server needs. Third, storage: if you previously had 30 GB, you make it 300 GB, then 1 TB, and so on; you can also move to SSDs, or NVMe SSDs, which offer much faster access times. And fourth, the network card: modern 10 Gbps cards give you far more capacity to handle internet traffic. The beauty of vertical scaling — of just adding more power to your server — is its simplicity and intuitiveness. Your system's architecture, your backend's architecture, does not change at all. You don't have to
think about statelessness or any of that code-level complexity; you just upgrade the machine. As simple as that — your code does not change. And a server with twice as many CPUs can handle roughly twice as many requests; that's just simple math. Same for memory: a server with twice as much RAM can cache locally twice as much data. Depending on how you upgrade, your capacity is literally multiplied by two — go from 16 GB of RAM to 32 GB and your local caching capacity has doubled. For this simplicity and intuitiveness, vertical scaling makes a lot of sense for many applications. Even economically, a server with twice as much power typically costs less than two servers of half its size. And on top of that, you avoid the operational overhead of managing multiple machines: security, backups and maintenance, configuring load balancers, and handling distributed state — there is a lot of complexity waiting for us when we get to horizontal scaling, to adding more instances of your server. Now the question: if vertical scaling is so simple and so intuitive, why isn't every company doing it? Why isn't every single engineer doing it? The answer is that vertical scaling
has hard limits. What kind of limits? No matter how much money you spend, there is always a ceiling on how powerful a single machine can be, and that ceiling is at the hardware level. Cloud providers — GCP, DigitalOcean, whichever you use — do offer very large instances, but at some point you reach the biggest machine they sell, with 32 cores or whatever the top of their range is. What do you do after that, if you need more capacity, if you are getting more traffic, if your requirements keep growing? You cannot scale vertically any further; there is a hard limit and you have already hit it. That is the first problem. The second is the single point of failure. You may have one very powerful server — 32 cores, 96 GB of RAM, a beast of a machine — but servers crash every day, for all sorts of reasons: maybe a dependency-related issue, maybe something else entirely. If that one server crashes and your application is down for, say, 2 hours, then for those 2 hours your platform and your service are completely unavailable. You can mitigate this risk with standby servers and automatic failover, but the fact remains: one single server, no matter how powerful, creates a single point of failure. And the last problem is no geographic distribution. Your server may have 32 cores and 96 GB of RAM, but if it resides in the USA and 40% of your user base is in India, that 40% experiences higher latency than the other 60%. Even though that is a major chunk of your user base, with a single vertically scaled server — however powerful — you cannot do anything to reduce the latency those users regularly experience. Because of all this, even though vertical scaling is intuitive, simple, and free of programming-level or maintenance-level complexity, it has major disadvantages. That is why the majority of the tech community and most companies have moved toward the other solution — horizontal scaling — which takes a completely
different approach. Instead of taking one machine and making it more and more powerful, horizontal scaling adds more instances — more copies of that same server, side by side — and all of them work together to serve your growing user traffic. Instead of one powerful machine serving your whole user base, you have multiple medium-sized, medium-powered machines serving it together. There are some very clear theoretical advantages here — and I say theoretical for a reason; we'll come back to that. The mathematical advantage: if one server can take 1,000 requests per second, then five servers give you the capability to take 5,000 requests per second. That is one clear win for horizontal scaling. In the same way, if a big campaign is coming, the only thing you have to do is add more servers, more instances. Need more? Keep adding. You can keep scaling horizontally, and you never hit a hard limit, because there is no limit on how many servers you can configure to work together to serve your traffic. The next advantage is redundancy: if one of your servers fails, it has no effect on your users. Your service does not stop, because the traffic at that moment gets distributed equally among all the other instances that are available; even when one instance goes down, the others take its place. Third — and this was one of the major disadvantages of vertical scaling — geographic distribution. Since you are just adding more servers to your stack horizontally, you can place those instances in different geographic locations, let them form a network, and use something like a load balancer (which we'll talk about in a bit) to route each request to the server nearest to the user. With that setup, you can serve your users with minimal latency. That is another
clear win for horizontal scaling. Now for the disadvantages. Of course there cannot be only pros — if there were, we wouldn't even have mentioned the existence of vertical scaling; everyone would simply scale horizontally. The disadvantages of horizontal scaling lie in its complexity — the very complexity vertical scaling avoids. One of the first questions the moment you add one more instance of your server, or two, or ten: how do you distribute requests among all those servers? You need some kind of load balancer. That is a completely new component, a new element in your network and infrastructure stack, and a new element means more complexity. With a load balancer you can distribute traffic among your servers — but now you also need to choose a load-balancing algorithm. Another decision to make, another choice to make.
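The simplest of those load-balancing algorithms, round-robin, hands requests to servers in turn. A minimal sketch (real balancers like NGINX or HAProxy layer health checks, weights, and connection counting on top of this idea):

```python
import itertools

class RoundRobinBalancer:
    """Simplest load-balancing algorithm: hand requests to servers in turn."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)   # endless rotation over the instances

    def pick(self):
        return next(self._cycle)                 # each request goes to the next server

lb = RoundRobinBalancer(["server-1", "server-2", "server-3"])
picks = [lb.pick() for _ in range(6)]
assert picks == ["server-1", "server-2", "server-3"] * 2   # even spread across instances
```

Other common choices — least-connections, weighted round-robin, consistent hashing — trade this simplicity for better behavior when servers differ in capacity or when requests differ in cost.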
The next question: how do you keep all your servers in sync? If a user updates their name on server one, how does server two know that update happened? Then: what happens when the network that connects all these servers — that enables them to communicate with each other — fails? How do the servers coordinate then, and what if they start making conflicting decisions? With the communication medium gone, they may make decisions that are redundant with, or in conflict with, one another. Or: say you have three server instances and one of them dies. How do you know that server is no longer available? At what point do you stop routing requests to it and route everything to the remaining two, and at what point do you start routing requests back to it once it comes online again? All of these questions have answers — we have figured them out as an industry — but the answers are complex and involve trade-offs, because we are now talking about distributed computing, about distributed systems. And distributed systems do not really get rid of problems for us; they transform one set of problems into a completely different set of problems by making certain trade-offs. Sometimes the new set of problems is more favorable to us than the original set, and sometimes the original set is more solvable than the new one — and it is this comparison, of which set of problems you would rather deal with and which set of questions you would rather answer, that guides the choice.
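One common answer to the "how do you know a server died" question is active health checks with a failure threshold: the balancer pings each server periodically, takes a server out of rotation after N consecutive failures, and puts it back once a check succeeds. This sketch models only that routing decision — the server names, counters, and threshold are illustrative:

```python
def route(servers, health, failure_threshold=3):
    """Return the servers currently considered alive.

    `health` maps server -> consecutive failed health checks. A server is
    taken out of rotation after `failure_threshold` failures and comes back
    as soon as a check succeeds again (its counter resets to 0).
    """
    return [s for s in servers if health.get(s, 0) < failure_threshold]

servers = ["server-1", "server-2", "server-3"]
health = {"server-1": 0, "server-2": 3, "server-3": 1}       # server-2 has failed 3 checks
assert route(servers, health) == ["server-1", "server-3"]    # ...so it is out of rotation
health["server-2"] = 0                                       # a check succeeded again
assert len(route(servers, health)) == 3                      # back in rotation
```

The threshold is itself a trade-off of the kind described above: set it too low and a single dropped packet ejects a healthy server; set it too high and you keep routing traffic to a dead one.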