Monday, August 6, 2018

Serverless and (Micro)Batch Processing: AWS Lambda and the S3 data lake

At Warren Rogers we are actively working to modernize our data processing pipelines by leveraging AWS managed services. Although many processes that were once executed in batch now run in our new real-time complex event processing pipeline built on Amazon Kinesis, it is still practical (and more affordable) for some processes to run in batch at regular intervals (once a day or once a month, for example). In the legacy system, jobs against file-based data can run for hours, tying up the resources of a single large server. We needed an efficient, practical way to borrow compute power for a short period of time, and we soon realized that serverless functions with AWS Lambda are a great fit for many of these jobs, for a few reasons.

Firstly, our data lake, like those of many other companies, is stored in Amazon S3. This turned out to be a great decision. Not only is S3 scalable and affordable, but Amazon recently released an incredibly useful optimization feature called S3 Select. Using simple SQL queries, we can pull the exact fields from the exact rows we need for our batch processes. Because the filtering happens server-side, S3 Select lets us scan large amounts of data in a relatively short time, which matters when the work has to fit inside a Lambda invocation.
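
As a rough illustration, here is a minimal sketch of server-side filtering with S3 Select through boto3. The bucket, key, and column names are hypothetical placeholders, not our actual schema:

```python
import boto3

s3 = boto3.client("s3")

def query_site_records(bucket, key):
    """Pull only the needed fields and rows from a CSV object using S3 Select."""
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        # Server-side filter: only the columns and rows the batch job needs
        Expression="SELECT s.site_id, s.reading, s.ts FROM S3Object s "
                   "WHERE s.site_id = 'SITE-001'",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    # The result arrives as a stream of events; collect the Records payloads
    rows = []
    for event in response["Payload"]:
        if "Records" in event:
            rows.append(event["Records"]["Payload"].decode("utf-8"))
    return "".join(rows)
```

The point is that the SQL expression is evaluated inside S3, so only the matching bytes cross the network into the Lambda function.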

Secondly, running a serverless function against S3 is a natural fit: Lambda can run hundreds of concurrent executions, and S3 was built to handle massive scale. Running Lambda unfettered against something like a traditional database would be sure to take it down. In general, you need to pay close attention to the downstream resources a highly scalable, highly concurrent service like Lambda calls upon. In the best case your Lambdas run at full speed, with all the allowed concurrency, against a highly scalable backend like S3 or DynamoDB. Otherwise there are ways to reduce the concurrency and hamstring Lambda if you have to (see reserved concurrency, sketched below).
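
For reference, capping a function's concurrency is a one-call operation. This is a hedged sketch with a hypothetical function name; the same setting is also available in the Lambda console:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve (and thereby cap) concurrency so the function cannot overwhelm
# a less scalable downstream resource.
lambda_client.put_function_concurrency(
    FunctionName="my-batch-job",          # hypothetical function name
    ReservedConcurrentExecutions=25,      # hard ceiling on parallel invocations
)
```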

Thirdly, our data and its processing are naturally segmented by site. This means we can cut the processing into thousands of jobs in a very natural way. Running concurrent Lambdas against S3 raises some of the same concerns you would think about with thread safety, since S3 does not make strong consistency guarantees: you do not want to share data (in particular generated artifacts) across different instances of the job. Our data is naturally “single-threaded” per site, which fits nicely with this pattern.

As for the mechanism that runs these jobs, we discovered that we could use one Lambda function, triggered by CloudWatch at regular intervals, to grab all the sites (each representing a discrete job) from S3 and publish each site as a message to an SNS topic. Each message then triggers a Lambda to run that job. I have seen up to 200 invocations running concurrently with this model.
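
A minimal sketch of that fan-out is below, with hypothetical bucket, prefix, and topic ARN values. One handler (the dispatcher) runs on a CloudWatch schedule, lists site prefixes in S3, and publishes one SNS message per site; a second handler (the worker) is subscribed to the topic and processes exactly one site per invocation:

```python
import json
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:batch-jobs"  # hypothetical
BUCKET = "example-data-lake"                                  # hypothetical

def dispatcher_handler(event, context):
    """Scheduled by CloudWatch; publishes one SNS message per site prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="sites/", Delimiter="/"):
        for prefix in page.get("CommonPrefixes", []):
            site = prefix["Prefix"]
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({"site": site}))

def worker_handler(event, context):
    """Triggered by SNS; each invocation runs the batch job for one site."""
    for record in event["Records"]:
        site = json.loads(record["Sns"]["Message"])["site"]
        # ...run the batch job for this site against its S3 data...
        print(f"processing {site}")
```

Because SNS invokes the worker once per message, the concurrency scales with the number of sites (up to the account's Lambda concurrency limit) without any cluster to manage.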

To use this pattern, a few constraints apply. First, you must be able to break your job into discrete chunks. Second, each job must use no more than 3 GB of memory and run no longer than five minutes, the limits AWS Lambda imposes. At Warren Rogers, that fits many of our use cases.

We have found that this pattern uses fewer resources, costs less, and runs faster than doing the same work with, say, a Hadoop cluster, not to mention avoiding the headaches of setting up and maintaining virtual machines. Our jobs are cheaper, faster, and easier to develop and update with AWS Lambda.
