Task queues and background jobs
In this video, we are going to talk about background jobs, or background tasks. If you are a backend developer, understanding background jobs is essential for building scalable and responsive applications. So what exactly is a background task, and why should you care? A background task is any piece of code that runs outside of the request-response life cycle. In the diagram, this is our request, this is our response, this is our client, and this is our server. Any piece of code, any logic, any workflow that runs outside of this whole client-server interaction, outside the request-response life cycle, is what we call a background job or a background task. Whatever you are doing outside the request-response life cycle does not need to happen immediately. It is not a mission-critical task that must complete before the response goes out. It is not synchronous, and that is exactly why we can offload it to a separate process and let it finish however we have programmed it to finish, however we have programmed it to respond, whether to the client or to some other
separate process.

Why exactly do we need them? Let's take an example. Say you have some kind of SaaS platform, and a user comes and signs up: they type their email, their name, their password, and so on, and the front end, whatever UI is presented to the user, makes a request, an API call, to your server. Your backend does whatever processing and validation it needs to do, on the email, on the password length and complexity, and so on, and after that succeeds, it needs to send an email to that user. If you have ever signed up to different platforms, you must have noticed this workflow: you give them your email and a password, and they send a verification email to the address you provided, so they can verify that the email actually belongs to you and is not some random address you don't own. That is one of the standard ways to verify email ownership. The verification email contains either a link you can click, which redirects you to the front end, or a code, a six-digit or eight-digit one-time password, that you type into the front-end interface. So that is what the workflow looks like when you sign up to a new platform.

Now, the point to focus on here is this: the user signs up, the front end makes an API call to your backend, the backend does its initial processing, and then it sends an email to the user. Sending the email is the task we want to focus on, and it is a classic example of a workflow that we usually offload to
background processes. The reason is that when you want to send an email, you are usually using a third-party platform, an email provider, something modern like Resend or Brevo; there are a lot of email providers. How it actually works is: you construct the content of your email, say an HTML template filled in with the verification link or code, you provide the recipient's email address, the subject, the sender address, and whatever other parameters the provider or SMTP service expects, and then your backend makes another API call, this time to the server of whatever email provider you are using, Resend, Mailgun, and so on. That server processes your email: it checks whatever it needs to check, whether you own the sender address, whether you have a correct API key, and so on. If everything is successful, it sends the email, however that server is configured to send emails, and it returns a response to your server saying whether the send succeeded or failed.

Now, the reason we want to offload this particular interaction, your backend calling another server to send the email, to a background task is that this server may not be
as responsive as our own server. We don't have control over other people's servers, and because of a traffic spike, a service outage, or any number of other reasons, the provider might be down. Assume we were not doing the email-sending workflow in a background job but synchronously. What happens? Your server does all the processing, makes the database calls, stores the user's initial data, creates a verification code, and then calls the provider's API, and that API call fails.

There are two possibilities here. First, if you are not properly handling the error from the failed email call, then the whole signup API call fails along with it, and a signup API that fails outright gives a very bad user experience. Second, say you do have proper error handling in place. Then only the email interaction fails; the signup API itself completes, and the front end tells the user "we have sent you a verification email, use it to verify your account". But the email was never actually sent, and you have told the user it was, which again is a very bad experience. Now the user has to come back to the platform and, if you offer the functionality, click "resend email". That is a different API call, and if the provider is back up by then, all good: the email goes out and they can verify their account. But if the service is still down, that call fails too, and you tell the user once more that you have sent them an email. You can imagine how that
experience goes. So that is how it plays out in a synchronous workflow, when you do all the processing in the same process your API runs in, inside the same request-response life cycle.

Now let's go to a different diagram. The user comes to your platform, types their name and email, and the front end makes the request to your server. The server does all the initial processing, generates the verification code, constructs the email, but then, instead of calling the email provider's API in the same function, the same request-response life cycle, it takes all the information needed to send the email and packages it into some format, say it serializes it into JSON. It then pushes that into some kind of queue. Maybe the queue already holds a lot of tasks, maybe it is empty; it does not matter. The point is that the API call to the provider does not happen at this moment. The backend just records that there is a new task, a new function call that needs to be executed at some point, somewhere, we don't know where yet, pushes it into the queue, and returns, meaning it returns a success status code, a 200 or a 201 depending on the semantics you are using. Because of this workflow, the user signs up, the front end makes the API call, and they immediately see the screen saying "we have sent a verification email to your address; use the link in it to verify your account". That takes care of the task-creation side.

Coming to the other side of the task queue: after we created the task and pushed it into the queue, there are consumers, or depending on which
framework or language you are using, workers. What they do is take tasks out of this queue. There can be a lot of different configurations here, depending on the scale you are running at, with separate nodes for producers and separate nodes for consumers, but we are talking at a high level to understand how the workflow works. Some consumer takes the task out of the queue, and when I say consumer, I mean a program running in a different process from our main backend application. That process picks up the task, and since we serialized all the task's information into some format, say JSON, it deserializes it back into whatever the native format of the language is: in Python it becomes a dictionary, in JavaScript an object, in Go a struct; you get the idea. It deserializes the data and now has everything it needs, in a native format, to perform the task.

Now, there can be different configurations here. Say you pushed the email into the email-sending queue. There
can be a different queue for sending in-app notifications, another for sending push notifications to mobile devices, and so on, and different types of consumers pick up tasks from different queues; all of this is configurable depending on the library or framework you are using. Since ours is an email-sending queue, our email consumer, or email worker, picked up this task, deserialized it from JSON into a native format, and now has all the information it needs to perform it.

What we do beforehand, on the consumer side, is configure which queue to pick tasks up from and what shape of data we expect each task to carry; that falls under the deserialization layer of the consumer. We also register a method, a handler, which is the code that actually runs when it is handed the data, and it is basically the same function we were running synchronously back when we did not have a background job. So at this point, we take all the information
that we need to send the email: the HTML template, the sender address, the recipient address, the subject, plus the user's ID and profile information for filling in their first name, their email, and so on. With all that data in hand, the handler calls the email-sending API of whichever provider you are using.
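To make this concrete, here is a minimal sketch of such a handler in Python. Everything in it is illustrative: `send_via_provider` is a stand-in for the real SDK or HTTP call of a provider like Resend or Mailgun, and the payload field names are invented.

```python
import json

def send_via_provider(sender, recipient, subject, html):
    """Hypothetical provider call; a real one would POST to the provider's API."""
    return {"status": "sent", "to": recipient}

def handle_send_email(raw_task):
    """Handler registered with the email consumer: deserialize, fill, send."""
    task = json.loads(raw_task)  # JSON -> native dict
    html = task["template"].format(first_name=task["first_name"],
                                   code=task["code"])
    return send_via_provider(sender=task["sender"],
                             recipient=task["recipient"],
                             subject=task["subject"],
                             html=html)

# What the producer would have enqueued:
raw = json.dumps({
    "sender": "noreply@example.com",
    "recipient": "user@example.com",
    "subject": "Verify your email",
    "template": "<p>Hi {first_name}, your code is {code}</p>",
    "first_name": "Asha",
    "code": "482913",
})
result = handle_send_email(raw)  # → {"status": "sent", "to": "user@example.com"}
```

The handler body is exactly the code you would otherwise have run inline in the signup request; only the entry point has moved.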
It can be Resend, it can be Mailgun, it can be Brevo. The consumer finally makes the API call, and the provider sends the user the email. Usually all of this processing happens within a few seconds, or even milliseconds if there are not many tasks in the queue, so latency is not a problem for sending an email: verification links typically expire after 15 or 20 minutes, and our email will be delivered in milliseconds, or at most 5 to 10 seconds if you have a lot of traffic and a long queue. The consumer calls the API, the service sends the email, the user receives it, and they can now verify their account.

Now consider the other scenario, the one we assumed in the synchronous example: the provider's service is down and the email-sending API call fails. Since that API call failed, the handler function we registered with the consumer also fails, which means this particular task has failed. If we were processing synchronously, this is the point where the request-response
life cycle would break and we would return some error code; since it is a server error, a 500 Internal Server Error. But since it is happening in the background, what happens when a task fails depends on the framework you are using: in Python you have something like Celery, in Node.js something like BullMQ, in Go something like Asynq. Whatever library you use for background task processing, when a task fails it is put back into the queue for retrying. For retries there are different algorithms, and one of the popular ones is exponential backoff, which means the delay grows after each failure: after the first failure, retry after, say, 1 minute; if it fails again, retry after 2 minutes; then 4 minutes, then 8 minutes, doubling every time, up to a maximum number of retries that we configure beforehand, say five.

Now, if you are using a big email provider like Resend or Mailgun, the service usually does not go down, and your API calls do not keep failing, for 4 or 8 minutes straight; the downtime is typically measured in seconds or milliseconds. So the task might fail once, maybe twice, but by the third attempt the external service is most likely back up, the email goes out, and the user receives it. Because we offloaded the whole email-sending workflow to a background task, we were able to deliver the email successfully even though the external service was down, thanks to the retry mechanism. That is another advantage of background task processing: very convenient functionality like retry mechanisms and failure detection comes built in.

So now we have a general idea of how background task processing works and why we use it. What are the advantages? One is faster response times, the responsiveness of our backend
application, because we do not block the actual API call on some external service or on heavy processing. The second is that we get retry mechanisms for tasks that are prone to failure. To summarize: background tasks let you offload time-consuming, non-critical operations, which keeps your actual backend API calls responsive, prevents timeouts caused by dependencies on external services, and significantly improves the overall user experience.

Now, what are some major kinds of tasks that a typical SaaS application offloads to the background? The first, which we already saw, is sending emails: since it depends on an external service, we offload it to a background task, and we have already discussed that whole workflow. The second is processing images or videos. Say you have a workflow where a user uploads an image and you have to resize it into different formats so it can be optimized for delivery depending on the user's network conditions or device; smartphones usually need smaller images, while desktop applications need larger sizes (XL, 2XL, and so on). Tasks like these, image or video processing, can be offloaded to the background.

Third, we have generating reports. If you are running an enterprise SaaS application, take a project management tool as an example, you have to send weekly, daily, or monthly reports depending on the user's configuration: reports that are basically PDF files or emails with all the stats, all the in-progress tasks, completed tasks, and pending tasks in a sprint. In a typical project management application, you construct an email with all that content, either directly as HTML or as a generated PDF report, and you send it daily at midnight, or weekly on Sunday, and so on. These kinds of tasks we also offload
them to the background, using cron jobs. And if you are using a framework or library like Celery or BullMQ, they also have features for scheduled tasks, so you can configure a particular date or time and those tasks will be executed again and again at those intervals; those features are provided by the different libraries.

Then there is sending push notifications. A push notification is what you receive on your smartphone's notification bar from different apps: if you have apps like Swiggy or Zomato, you get notifications to order food, or that your delivery partner has reached the restaurant, and so on. These are push notifications: the user receives them directly in the notification panel, not inside the app. Usually how it works is that the backend, say Swiggy's or Zomato's, registers your device with a push notification service when you install the app, and that service is provided by your operating system vendor: Google has its own push notification service and Apple has its own. Your backend has to store some kind of device token in its database, and whenever it wants to send a notification to a particular device, it makes a service call to Google or Apple, depending on which operating system the user is on. It is these external services that actually deliver the notification to your phone; the backend cannot send it to your phone directly, that is done by an operating system service. The backend just stores your device's token in the database and uses that to send
you a notification. Since this workflow also involves making a service call to an external service, we offload it to a background task as well. Those are some of the major cases where we use background tasks in a typical backend application.

Now let's get a little technical and understand what exactly a task queue is and how it actually works. By definition, a task queue is a system for managing and distributing background jobs or tasks. It is the mechanism, the behind-the-scenes technology, that lets you reliably hand off work you want done in the background to a separate process; the core engine behind this whole workflow.

The core idea is that you have a producer, which is your application code, Node.js, Python, Golang, whatever language and framework you are using, and it creates the task and pushes it into the queue, the task queue. The producer's responsibility is to create the task so that it carries all the information the consumer, the worker, will need to execute it. The producer takes that information, serializes it into JSON or any other serializable format your framework or library uses, creates the task, and pushes it into the queue. That is all the work of a producer.

On the other side you have the consumer, which runs in a different process; it can live in the same backend codebase or in a separate codebase. It picks those tasks out and actually runs them, executing them depending on what is the execution
code, the handler, you have registered for that particular task. That is how the whole interaction works at a very high level. Imagine a to-do list for your backend: your application code adds tasks to the list, and the workers, the consumers, pick them off one by one and execute them.

Technically speaking, the producer creates a task, and the task contains information like the user ID or profile data, the user's name and email, that will be needed to process it. If it is an image-processing task it will reference the image; if it is a report-sending task it will carry all the data needed to build the report, and so on. The act of pushing the task into the queue is called enqueuing; if you are familiar with data structures, enqueuing means adding a new item to a queue. The queue itself is also called a broker. It is responsible for storing tasks until a worker is ready to process them on the other side; think of it as a temporary holding area where tasks wait until consumers or workers can pick them up and start
processing. The consumer, the worker, runs in a separate process or thread; the technicalities differ depending on the language, framework, and library you are using, but the core fundamentals remain the same. It constantly monitors the queue for new items, and when it finds a new task, it dequeues it (taking an item out of a queue is called dequeuing) and starts executing it. That is the high-level technical picture of how a task queue works: a task is enqueued by the producer, held by the broker, and dequeued by the consumer.
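The enqueue/dequeue hand-off can be compressed into a toy Python sketch. This is not any particular library's API; an in-process deque stands in for a real broker like RabbitMQ or Redis, and the serialization step is plain JSON, as described above.

```python
import json
from collections import deque

queue = deque()  # the broker's holding area, in miniature

def enqueue(task: dict):
    """Producer side: serialize the task to JSON and push it onto the queue."""
    queue.append(json.dumps(task))

def dequeue() -> dict:
    """Consumer side: pop the oldest task and deserialize it back to a dict."""
    return json.loads(queue.popleft())

# The producer enqueues an email task during the signup request...
enqueue({"kind": "send_email", "recipient": "user@example.com", "code": "482913"})

# ...and a worker, in another process in real life, later dequeues and runs it.
task = dequeue()  # → {"kind": "send_email", ...}
```

In a real system the producer and consumer are separate processes, possibly separate machines, and the broker is the shared piece between them; the serialize-push / pop-deserialize shape stays the same.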
When a task is enqueued, it is serialized, typically into a JSON format; that is an important part of enqueuing. And the underlying queue, which so far we have discussed abstractly, is usually a concrete technology. Speaking of examples, it can be RabbitMQ, or Redis (whose publish/subscribe module is very often used to implement task queues), or SQS, the managed queuing service from AWS. If you are considering scaling your task processing system across nodes spread around the world, Amazon SQS is a good option: it is a managed service deployed in multiple regions, giving you a very scalable and responsive task processing system. So this queue, this broker that actually stores the tasks, is usually backed by a technology like RabbitMQ or Redis Pub/Sub.

When the worker, the consumer, completes a task, it sends an acknowledgement back to the queue. This tells the queue that the task was successfully processed and can be removed. If it does not send an acknowledgement, the queue can decide, depending on different parameters, whether the task was unresponsive or whether it failed, and apply the appropriate retry mechanism. Frameworks like Celery, BullMQ, and Asynq, all very popular
frameworks for task processing, have features in place to manage all these different edge cases. In this whole interaction we also have a concept called the visibility timeout, which is the period during which a task is considered in progress by a consumer or worker. If the worker does not send the acknowledgement signal back to the queue within that timeout, whatever timeout we have configured, perhaps because it crashed, or because the external service hung, there can be a lot of reasons, then the queue will take the task and make it available to other consumers or workers, so that the task does not get lost in the interaction. Remember how our task processing systems work: the producer pushes a task into the queue, a consumer takes it out and processes it, and the queue removes it. So when a consumer has taken a task out but was never able to acknowledge its success or failure, there is a real risk of losing that task.
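The acknowledgement and visibility-timeout behaviour can be sketched as a tiny in-memory broker. This is an illustration of the mechanism only, assuming invented names (`TinyQueue`, `pull`, `ack`, `requeue_expired`); real brokers like SQS implement the same idea internally.

```python
import time

class TinyQueue:
    """Toy broker illustrating acknowledgements and a visibility timeout."""
    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.ready = []       # tasks waiting to be picked up
        self.in_flight = {}   # task -> deadline by which it must be acked

    def push(self, task):
        self.ready.append(task)

    def pull(self):
        """A worker takes a task; it becomes invisible until the deadline."""
        task = self.ready.pop(0)
        self.in_flight[task] = time.monotonic() + self.visibility_timeout
        return task

    def ack(self, task):
        """Worker reports success: the task is removed for good."""
        del self.in_flight[task]

    def requeue_expired(self, now=None):
        """Tasks whose deadline passed without an ack become visible again."""
        now = time.monotonic() if now is None else now
        for task, deadline in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[task]
                self.ready.append(task)

q = TinyQueue(visibility_timeout=30.0)
q.push("send_email:42")
t = q.pull()  # a worker takes the task but then crashes, so no ack arrives
q.requeue_expired(now=time.monotonic() + 60)  # past the 30s deadline:
# the task is back in `ready`, available to other workers
```

A worker that finishes normally would call `q.ack(t)` instead, and the task would never reappear.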
We don't want the task to get lost in the whole interaction; that is why we have the visibility timeout. Because of it, the queue ensures our tasks don't get lost: when a worker is not able to send the acknowledgement signal, the queue makes that particular task available to other consumers or workers so they can start processing it. Someone has to send an acknowledgement signal back to the queue to mark a task as either success or failure.

Now let's talk about the different kinds of tasks we usually encounter in a backend application. The first is the one-off task, which we have already seen an example of: something happens in your request-response life cycle and you want to perform a single task, a trigger; given a particular scenario, you execute a particular function in the background. Sending an email is a good example of a one-off task.
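The fire-and-forget shape of a one-off task looks like this in a minimal Python sketch. The names (`enqueue_task`, `sign_up`) and the payload fields are invented for illustration; the point is that the request handler records the work and returns immediately.

```python
import json
from collections import deque

task_queue = deque()  # stand-in for a real broker

def enqueue_task(kind: str, **payload):
    """Fire-and-forget: record the work to be done later, return at once."""
    task_queue.append(json.dumps({"kind": kind, **payload}))

def sign_up(email: str, name: str):
    """Happy path of a signup handler: validate, persist, hand off, return."""
    user_id = 1  # pretend we inserted a row and got an id back
    enqueue_task("send_verification_email",
                 user_id=user_id, recipient=email, first_name=name)
    return {"status": 201, "user_id": user_id}  # responds before any email I/O

response = sign_up("asha@example.com", "Asha")
```

With Celery or BullMQ, `enqueue_task(...)` would be a library call, but the trigger pattern is the same: one event in the request, one task in the queue.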
A user registered? You want to send them a verification email. The verification succeeded? You want to send them a welcome email. Someone wants to reset their password? You want to send them an email with a reset link. Or, in a social media application, someone messaged the user and you want to send them a notification; again a one-off task. You will encounter one-off tasks more frequently than the other kinds, but the others have their place too.

Then we have recurring tasks: tasks that must be executed periodically, at specific intervals. Examples include sending daily, monthly, or annual reports to your users with the stats for that day, month, quarter, or year. Recurring tasks can also be cleanup or maintenance jobs. Say you have a stateful authentication system, where you store all the users' sessions in a sessions table in your database. As users log in and out over a long period, you accumulate a lot of abandoned, orphaned sessions that are no longer active but are still sitting in your database, since you never actually deleted them. So you run a recurring task, say at the start or end of every month, or every two or three months, that goes through all the orphaned, long-inactive sessions and deletes them from the database, so that you free up storage. The orphaned sessions
no longer take up space in your database. Those are some examples of recurring tasks.

Then we also have chained tasks, which have parent-child relationships between different kinds of tasks. Say you have an LMS platform; LMS means learning management system, and popular examples include Udemy. Now imagine you run an LMS platform of your own: instructors, whoever is creating courses, come to the platform, create a new course, and to do that they have to upload videos. A typical video-upload workflow might look like this: the user clicks "upload video", selects the video they want to upload, and the upload starts. Technically, the front end takes the video and the backend immediately sends an acknowledgement so the request is not blocked while the upload is in progress; for instance, the file can be uploaded to S3 using a pre-signed URL
right? The implementation can differ. After the video is uploaded, a number of tasks can be triggered. The first task: the video has to be processed so that it is available in different resolutions depending on the network conditions and the device of the user. In other words, the video has to be encoded into different formats to cater to a wide range of network conditions and a wide range of devices. So that is your first task. After the video is processed, a second task has to be triggered, and it depends on the first one, because the video has to be encoded first: the thumbnails for that video have to be generated, right after the video is encoded to a particular format, so that the video can be served efficiently from different CDNs (content delivery networks) or whatever platforms you're using. After the thumbnails have been generated, we trigger another task: those thumbnail images have to be processed so that we have different resolutions of the same thumbnail for different network conditions and different devices. This thumbnail-processing task depends on the thumbnail-generation task. But right after the video is encoded, at the same time we start the thumbnail-generation task, we can start another task in parallel: transcription generation, that is, generating the audio transcription to show to the user as subtitles while the video is playing. Since the transcription-generation task and the thumbnail-generation task do not depend on each other, they can run in parallel, but they both depend on the task that encodes the video first. This kind of interaction, this kind of task chaining, is what we mean by chained tasks: tasks that have a parent-child relationship, some kind of hierarchy. A particular task can only be triggered once its parent task has run and completed successfully. Those are examples of chained tasks.
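The chained workflow above can be sketched as a tiny dependency graph. This is a minimal illustration, not a real task queue: the task names and the run_chain helper are made up for this example, and real frameworks (Celery chains and chords, for instance) provide this orchestration for you.

```python
# Minimal sketch of chained tasks with parent-child dependencies,
# using the standard library's graphlib. Task names are hypothetical.
from graphlib import TopologicalSorter

# child task -> set of parent tasks that must finish first
dependencies = {
    "encode_video": set(),
    "generate_thumbnails": {"encode_video"},
    "generate_transcription": {"encode_video"},   # independent of thumbnails
    "resize_thumbnails": {"generate_thumbnails"},
}

def run_chain(deps):
    """Return an execution order that respects every parent-child link."""
    order = []
    for task in TopologicalSorter(deps).static_order():
        order.append(task)   # here a real system would enqueue/run the task
    return order

order = run_chain(dependencies)
assert order.index("encode_video") == 0   # the root parent always runs first
assert order.index("resize_thumbnails") > order.index("generate_thumbnails")
```

In practice the queue would run generate_thumbnails and generate_transcription on different workers at the same time, since neither is a parent of the other.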
we have batch tasks. One real-world example of a batch task: let's say you have a SaaS platform with a button where the user can click delete account; a lot of SaaS platforms offer this feature. Now, what happens when the user actually deletes their account? The front end cannot just make an API call and have the back end go through all of the user's data in the database and delete it within the same request, right? If it's a big platform, the user might have a lot of data, and that data might be spread across different shards or different regions. There can be many different types of data belonging to a user on the platform, and all of it cannot be deleted in a single request-response life cycle. Even if it could be, it might take, say, 40 or 50 seconds, or in the worst case more than a minute, and we cannot stall or block an API call for that long. For that reason, whenever the delete account API call is made, we immediately respond with a success response like 200 and say that the account deletion is in progress. The user is logged out of the platform and given some kind of grace period: you have 3 days, or 7 days, to log back into the account and cancel the deletion process, or the account will be permanently deleted at the end of that period. That is one use case. The second use case: let's say you don't offer a grace period; you're not a big platform like AWS. In that case, the moment the delete account API call is made, the front end immediately receives the success response that the back end sends and logs the user out, and as far as that user is concerned, the account is deleted from the platform. But what actually happens is this: when the back end receives the delete account API call, it creates a delete account task, pushes that task into the queue, and immediately responds with a success response like 200, and the front end does its job. On the back end, the consumer responsible for actually performing the account deletion picks that task off the queue and starts the deletion process. That might mean going through all the resources of the user (say, in a project management application, all the projects where the user is the owner), removing all the user's entities from those projects, then deleting the user's profile, deleting the user's assets like logos and cover images, then finally deleting the user account itself, and then sending an email to the user that their account has been deleted. All of that happens in the background, which means our API call is not blocked. That is an example of a batch task, because it involves triggering a lot of tasks from a single parent task.
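The enqueue-and-acknowledge flow just described can be sketched with Python's standard library queue and a worker thread. This is a toy sketch, not production code: handle_delete_account, the step names, and the in-memory queue are all stand-ins, and a real system would use a broker such as Redis or RabbitMQ.

```python
# Sketch of the delete-account flow: the API handler only enqueues a task
# and returns immediately; a separate worker performs the slow deletion.
import queue
import threading

task_queue = queue.Queue()
deleted = []   # stands in for the effect of the deletion

def handle_delete_account(user_id):
    """The request handler: enqueue the task and respond right away."""
    task_queue.put({"type": "delete_account", "user_id": user_id})
    return {"status": 200, "message": "account deletion in progress"}

def worker():
    """Consumer: picks tasks off the queue and runs the sub-steps."""
    while True:
        task = task_queue.get()
        if task is None:          # sentinel so the demo can shut down
            break
        # each sub-step could itself be a separate task in a real system
        for step in ("remove_project_memberships", "delete_assets", "delete_profile"):
            pass                  # ... perform the step against real storage ...
        deleted.append(task["user_id"])
        task_queue.task_done()

t = threading.Thread(target=worker)
t.start()
resp = handle_delete_account("user-42")   # returns immediately, not blocked
task_queue.join()                          # the demo waits only so it can assert
task_queue.put(None)
t.join()
assert resp["status"] == 200
assert deleted == ["user-42"]
```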
The delete account task itself can trigger a lot of tasks: deleting the user's entities, deleting the user's assets, and so on. That is what batch tasks are: triggering many tasks from a single task. Another example of a batch task: let's say at midnight, or at the end of the week, you have to send a report to each user. Since you are a big platform, you have a lot of users, so at the same time you trigger a lot of tasks, say thousands of tasks, which all do the same thing, generating a report and sending it to a user. Since it happens for a lot of users at once, we can also consider that a batch task: sending thousands of reports to thousands of users at a particular interval. So those are some of the different types of background tasks we usually deal with. Now, these are some of the design considerations that you
have to account for when working with task queues or background tasks, especially at scale: when you have thousands of users, when you're managing the back end of a big platform and you care about background task processing. The first one is idempotency. This basically means that whatever tasks you create and execute, you have to design them so that they can be safely executed multiple times without causing duplicate side effects, for example when a task fails partway through. Take our delete account task: in a typical delete account task, we go through all the resources, all the entities of the user in the database, and we start removing them or setting them to null, depending on our requirements. What we mean by idempotency in that context is that you should do all of that inside a single transaction, so that if you start deleting data and something fails down the road (it can be something database-related, or an external service call, an API call fetching some data about the user; it can be anything at that point), you can perform a rollback. If you execute everything in a single transaction and something fails, you roll it back, so when the task queue retries the task, you start again from scratch, from 0% completion, without leaving half-deleted data behind. That is what we mean by idempotent: you have to design your tasks so that they don't leave side effects behind when something fails down the line and the task has to be retried from scratch.
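A complementary technique to the single-transaction approach described above is to record which steps of a task have already completed, so a retry skips work that already succeeded instead of duplicating it. A minimal sketch, with a plain set standing in for durable storage and hypothetical step names:

```python
# Hedged sketch: idempotency via a completed-steps record, so retrying a
# task does not repeat side effects. In a real system the record would
# live in durable storage keyed by (task_id, step), not an in-memory set.
completed_steps = set()

def run_step(task_id, step, action):
    key = (task_id, step)
    if key in completed_steps:   # already done on a previous attempt: skip
        return "skipped"
    action()
    completed_steps.add(key)     # record success before moving on
    return "done"

calls = []
def delete_assets():
    calls.append("delete_assets")   # stands in for the real side effect

# First attempt performs the work; a retry after a crash skips it.
assert run_step("task-1", "delete_assets", delete_assets) == "done"
assert run_step("task-1", "delete_assets", delete_assets) == "skipped"
assert calls == ["delete_assets"]   # the side effect happened exactly once
```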
In the same way, error handling: you have to make sure you have very robust, very extensive error handling in place, especially in task processing systems, because everything is happening in a different process. You have to make sure you don't miss anything: no bottlenecks, no edge cases. So implement robust error handling to catch and log errors, and also to retry failed tasks. This is very important in a task management system. Then you have to set up proper monitoring. By monitoring we mean tracking the status of the system: how many tasks do you currently have in your queue? What is the number of successful tasks? The number of failed tasks? And what is the main reason tasks are failing: is it some external service, or some internal error? At any point, you should have a complete view of the status of your whole task processing system. You can use technology like Prometheus, together with Grafana and tools like that, so that every time a new task is created you insert a new metric. (We'll cover metrics, instrumentation, the Elastic/ELK stack, and everything else related to logs, traces, and monitoring in a different video.) The technique is: every time something happens, some kind of trigger, we record a metric and push it into some kind of stack; in this case, we have Grafana to visualize the different metrics our back-end application currently has, and we have ways to persist those metrics on disk, different replication techniques, and so on. For your task processing system you have to have all of these monitoring techniques in place, so that at all times you can see the status of your system.
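The metric-per-event idea can be sketched as follows. In production you would use a client library such as prometheus_client and have Prometheus scrape the counters for Grafana to display; here collections.Counter stands in just to show the shape, and the metric names are illustrative.

```python
# Sketch of task metrics: count every state change so the queue's health
# (started / succeeded / failed) is visible at all times.
from collections import Counter

metrics = Counter()

def record(event):
    """Called on every task state change; a real system exports this."""
    metrics[event] += 1

def run_task(fn):
    record("tasks_started")
    try:
        fn()
        record("tasks_succeeded")
    except Exception:
        record("tasks_failed")   # a real system would also log the error

run_task(lambda: None)     # a task that succeeds
run_task(lambda: 1 / 0)    # a task that fails
assert metrics["tasks_started"] == 2
assert metrics["tasks_succeeded"] == 1
assert metrics["tasks_failed"] == 1
```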
Similarly, you have to design your system so that it scales. If down the line you require more processing power, say your user base spikes and you now have to cater to double the number of users, what you do is add more consumers. So design the system so that you can scale your consumers horizontally, adding more nodes to your consumer pool so that processing remains as responsive as ever. In the same way, if you require ordering, you have to make sure that whatever library, framework, or task queue you are using supports ordered delivery: if tasks have to be executed in a particular order, your library or framework has to support that. Then we also have rate limiting. If your tasks interact with external services, you have to implement proper rate limiting to prevent overloading those services, because those services charge you per API call and they also enforce rate limits of their own, so you have to take that into consideration while designing your tasks. Those are some of the design considerations to keep in mind when you're working with a background task processing system.
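A common way to implement the rate limiting mentioned here is a token bucket: workers may burst up to the bucket's capacity, then are throttled to the refill rate. A minimal sketch; the rate and capacity numbers are illustrative.

```python
# Hedged sketch of client-side rate limiting with a token bucket, so task
# workers don't overload an external API.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Spend one token if available; otherwise the caller should wait."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False    # throttled: wait, or put the task back on the queue

bucket = TokenBucket(rate=10, capacity=2)   # ~10 calls/sec, bursts of 2
results = [bucket.allow() for _ in range(3)]
assert results[:2] == [True, True]   # the burst is allowed
assert results[2] is False           # a third immediate call is throttled
```

A throttled worker can either sleep and retry, or re-enqueue the task with a delay so other tasks are not held up.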
Now, some best practices I can recommend from my own experience of dealing with these background task systems. The first is: keep your tasks small and focused, which basically means a single task should only be concerned with a single unit of processing. You should not do a lot of things in one task. Design your tasks so that the responsibilities are divided between different tasks, so that if one task fails it does not affect the others. If one piece of processing depends on another, you can go for chained tasks, a parent-child relationship, so that even if the child fails the parent does not fail; the child failing is a separate task, and the retry mechanism can handle it, with exponential backoff or whatever policy you choose. If you do a lot of things in a single task, then even if the first steps succeed, when something fails further down the whole thing has to be repeated all over again, and cramming a lot of work into a single task wastes a lot of processing power on your consumers. If you keep your tasks focused and small, they are easier to scale, easier to keep free of bugs, and easier to monitor, and it's also easier for your queue to apply mechanisms like retries. In the same spirit, always try to avoid long-running tasks. If you have a task that takes a long time to complete, break it down into smaller, more manageable chunks. As I said, this is similar to the previous point: if a task is doing a lot of work, it's going to take a lot of time, which is a signal that it's about time to divide that task up.
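The retry-with-exponential-backoff mechanism mentioned above can be sketched like this. In a real system the broker re-delivers the failed task after a delay; here we retry in-process, and the delays are shortened so the example runs quickly.

```python
# Hedged sketch of retrying a failed task with exponential backoff:
# wait base_delay, then 2x, then 4x, ... before giving up.
import time

def retry_with_backoff(task, max_attempts=4, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise   # give up: a real system routes to a dead-letter queue
            time.sleep(base_delay * (2 ** attempt))   # 0.01s, 0.02s, 0.04s...

attempts = []
def flaky():
    """Simulates a transient failure that clears on the third try."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "ok"

assert retry_with_backoff(flaky) == "ok"
assert len(attempts) == 3   # failed twice, succeeded on the third try
```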
You can split it into separate tasks that are processed concurrently, or into tasks with a parent-child relationship, depending on your requirements. Similarly, use proper error handling and logging. As I said, this is one of the most important things when dealing with background task processing. You have to have proper error handling in place so that you give your queue a chance to retry, and so that you have all the information you need to debug whatever happened in your task that made it fail, whether it was an external service or an internal error. You need these error handling and logging techniques so that it's easier for you to debug things, easier to monitor and find out where the problem is, and easier for the queue to perform its various retry mechanisms. And lastly, constantly monitor queue length and worker health, which basically means having proper alerting systems in place: if the queue length exceeds a particular limit, you have to come up with different techniques to make the system more scalable, and if some of the workers are going down for some reason, you have to debug exactly what error is causing them to go down. You have to have proper alerting so that at all times you know your whole background task processing system is running smoothly. Now, to recap: background tasks are essential for building scalable, reliable, and responsive back-end applications. They allow you to offload time-consuming and non-critical operations, to improve the user experience, to prevent timeouts, and to enable retry mechanisms for dependencies on external services and for heavy processing tasks. That's pretty much all you need to know about task queues as a back-end engineer.