
Serverless, ML-powered clustering of thousands of images in AWS with millisecond latency.

Celebrate App is a closed-space photo-sharing mobile application to which users upload millions of photos each month, averaging more than 60 images per second during peak hours. At this scale, effective clustering of images became an important Machine Learning use case for our downstream tasks, one of them being the detection of similar images within a dynamically changing set of group images.


Celebrate App is a cloud-powered app, and all of our infrastructure lives in AWS. We took the serverless route to deploy the ML solution, which has significant cost and scalability implications. It was also critical to make this solution low latency, which means clustering 1,000–2,000 images in milliseconds.


The problem we wanted to solve is finding relative similarity among images. A relative similarity score between 0 and 1 lets us find clusters of images with low, medium, and high similarity, and cosine similarity is a good metric for such pairwise comparisons. To compute it, we first have to obtain image representations, sometimes also called embeddings, which are one-dimensional vectors derived from 3D (RGB) images. We obtain these representations by applying a machine learning model based on artificial neural networks. Our final solution calculates image representations and saves them in persistent storage (AWS EFS). This way we not only avoid re-calculating representations but also dramatically reduce latency and cost. We compute cosine similarity only on demand from the stored embeddings, with the user providing thresholds to filter for specific images.
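As a minimal sketch of this idea (the embedding values, dimensions, and threshold are made up for illustration), pairwise cosine similarity over a handful of embeddings can be computed like this:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative example: four images already encoded into 1-D embeddings.
# In our pipeline these come from the encoder described in Part 2.
embeddings = np.random.rand(4, 512)  # shape: (num_images, embedding_dim)

# Pairwise similarity matrix; values lie in [0, 1] for non-negative embeddings.
similarity = cosine_similarity(embeddings)

# A user-supplied threshold then filters the "similar enough" image pairs.
threshold = 0.8
similar_pairs = np.argwhere(np.triu(similarity, k=1) >= threshold)
print(similar_pairs)  # e.g. [[0 2], [1 3]]
```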


At a high level, our solution can be grouped into three parts:

  1. App users upload an image to S3 and S3 Notification adds the new image key to SQS. An AWS Lambda polls this SQS queue and invokes a SageMaker Endpoint, which outputs the image representations (embeddings).

  2. The output from SageMaker is sent to another Lambda which saves it to EFS.

  3. A Lambda, invoked via API Gateway, calculates similar image clusters in real time and returns the result to the end user.

Below, you can see a simplified version of our architecture:


The details of each part of our solution are described below:


Part 1 - Data Ingress: Images live in S3. Given the sheer number of newly uploaded and existing images, invoking our computation pipeline against all of them is impractical. Instead, image keys are put into SQS (via an API Gateway REST endpoint), which is later polled by a Lambda. The Lambda downloads the images, preprocesses them, calls the ML component (Part 2), and sends its output to Part 3.
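A simplified, hedged sketch of such an ingress Lambda (the endpoint name, message body format, and content type are illustrative and depend on the actual inference container):

```python
import json

import boto3

s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "image-encoder-serverless"  # illustrative endpoint name


def handler(event, context):
    """Triggered by SQS; each record carries the S3 key of a newly uploaded image."""
    embeddings = []
    for record in event["Records"]:
        body = json.loads(record["body"])  # illustrative message format
        bucket, key = body["bucket"], body["key"]

        # Download the image and send the raw bytes to the SageMaker endpoint,
        # which returns the 1-D embedding (see Part 2).
        image_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/x-image",
            Body=image_bytes,
        )
        embeddings.append({"key": key, "embedding": json.loads(response["Body"].read())})

    # In production, these embeddings are forwarded to the Lambda that
    # persists them on EFS (Part 3); here we simply return them.
    return embeddings
```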


Part 2 - Machine Learning: Our Machine Learning pipeline is mainly an encoder neural network that transforms a 3D (RGB image) array into a one-dimensional vector. We have experimented with different neural network architectures such as VGG, ResNet, and even Transformer-based encoders. In our experience these pre-trained networks already work very well, but in the future we also plan to train a CNN auto-encoder on our own data instead of relying on a pre-trained ImageNet encoder. After some internal experiments, we took the best-performing model and deployed it serverlessly via a SageMaker Serverless Endpoint in our region.
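We will cover the ML pipeline in detail in the next post; as a rough sketch of the general idea (not the exact model or framework we deployed), a pre-trained ImageNet encoder from torchvision can be turned into a feature extractor by dropping its classification head:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained ImageNet ResNet with the classification head removed,
# so the output is a feature vector instead of class scores.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
encoder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    image = Image.open("example.jpg").convert("RGB")     # 3-D RGB input
    features = encoder(preprocess(image).unsqueeze(0))   # shape: (1, 2048, 1, 1)
    embedding = features.flatten(1).squeeze(0).numpy()   # 1-D vector of length 2048
```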


Part 3 - Clustering: There are three main AWS resources that we use here. A Lambda function writes to the EFS file system. EFS holds one image folder per user, and we keep one file per user to which all of that user's ML embeddings are written. Even though this significantly increased read/write performance, we had to take care of file locking to prevent concurrent writes. We group images by degree of similarity using the cosine-similarity function from scikit-learn.
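The exact grouping logic we run in production differs in detail, but a minimal sketch of the idea, using scikit-learn's cosine_similarity and a user-supplied threshold, could look like this:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cluster_by_similarity(embeddings: np.ndarray, image_keys: list, threshold: float):
    """Greedy grouping sketch: put each image into the first existing cluster whose
    representative is at least `threshold` similar, otherwise start a new cluster."""
    similarity = cosine_similarity(embeddings)
    clusters = []  # list of lists of indices
    for i in range(len(image_keys)):
        for cluster in clusters:
            if similarity[i, cluster[0]] >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return [[image_keys[i] for i in cluster] for cluster in clusters]


# Illustrative usage: 100 fake embeddings, grouped at a "high similarity" threshold.
keys = [f"image_{i}.jpg" for i in range(100)]
groups = cluster_by_similarity(np.random.rand(100, 512), keys, threshold=0.9)
```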


Our Learnings and Optimizations:


1. There are several things we have done to decrease the invocation latency of the Lambda that calculates cosine similarity:


1.1 Having EFS as a file system for Lambda greatly improved read/write operations on the persistent data (image representations). EFS is also enabled in the high-performance mode (Max I/O), which further enhances this Lambda's read/write performance. We chose EFS as persistent storage rather than S3 because they are fundamentally different, and EFS has an advantage that S3 does not, as stated in the AWS docs: “The Lambda service mounts EFS file systems when the execution environment is prepared. This adds minimal latency when the function is invoked for the first time, often within hundreds of milliseconds. When the execution environment is already warm from previous invocations, the EFS mount is already available.”


1.2 Cold start in Lambda is the main factor behind second-level latency in import-heavy code such as ours. A cold start is caused by the time needed to prepare the execution environment. Eliminating Lambda cold starts using Provisioned Concurrency behind an Auto Scaling Group (ASG) significantly decreased latency to milliseconds. We also moved static code, such as heavy Machine Learning imports, to the top of the module, outside the Lambda handler, so it runs once per execution environment rather than on every invocation. Together, these changes brought a 10x improvement in latency.
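A minimal sketch of that import placement (mount path, file layout, and event fields are illustrative):

```python
# Heavy, static imports live at module level: they run once, when the execution
# environment is initialised (or kept warm by Provisioned Concurrency),
# not on every invocation.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

EFS_ROOT = "/mnt/efs"  # illustrative EFS mount path


def handler(event, context):
    # Only per-request work happens inside the handler.
    user_id = event["user_id"]
    threshold = float(event.get("threshold", 0.8))

    # Illustrative file layout; see 1.3 for the one-file-per-user approach.
    embeddings = np.load(f"{EFS_ROOT}/{user_id}/embeddings.npy")
    similarity = cosine_similarity(embeddings)
    return {"num_similar_pairs": int((np.triu(similarity, k=1) >= threshold).sum())}
```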


1.3 We keep one file per user, and all of that user's image embeddings are written to that single file. This eliminates reading the representations one by one from the file system in a loop. For users with more than 1,000 images, this change decreased latency by a factor of 5.
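A hedged sketch of the write and read paths (the flat-binary layout and the embedding dimension are illustrative, not our exact production format), using an exclusive fcntl lock to guard against concurrent appends:

```python
import fcntl

import numpy as np

EMBEDDING_DIM = 2048  # illustrative; must match the encoder output


def append_embedding(user_file: str, embedding: np.ndarray) -> None:
    """Append one embedding to the user's single file under an exclusive lock."""
    with open(user_file, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # block until no other writer holds the lock
        try:
            f.write(embedding.astype(np.float32).tobytes())
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def load_embeddings(user_file: str) -> np.ndarray:
    """Read all of a user's embeddings back as one (num_images, dim) matrix."""
    data = np.fromfile(user_file, dtype=np.float32)
    return data.reshape(-1, EMBEDDING_DIM)
```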


1.4 For matrix computations on NumPy arrays, we eliminated loops wherever possible and relied heavily on AVX2, a vectorization extension to the Intel x86 instruction set that performs single instruction, multiple data (SIMD) operations. Again, we saw a significant latency improvement when processing large numbers of image representations in memory.
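For example, computing the similarity of one query embedding against a whole user matrix as a single vectorized expression, which NumPy can dispatch to SIMD-capable kernels, instead of a Python loop (sizes are illustrative):

```python
import numpy as np

embeddings = np.random.rand(2000, 2048)  # illustrative: 2,000 images, 2,048-dim vectors
query = np.random.rand(2048)

# Slow: one dot product per iteration in a Python loop.
loop_scores = np.array([
    np.dot(e, query) / (np.linalg.norm(e) * np.linalg.norm(query))
    for e in embeddings
])

# Fast: a single vectorized matrix-vector product.
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
vectorized_scores = embeddings @ query / norms

assert np.allclose(loop_scores, vectorized_scores)
```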


1.5 According to some public blogs, there is a correlation between Lambda memory size and the execution time of the function. We experimented with different memory sizes and benchmarked latency; the improvement was incremental.


2. We run our SageMaker Inference pipeline, which generates the image representations (embeddings), serverlessly behind an ASG. Even though this should have run robustly, we noticed some throttling errors with the help of AWS X-Ray. We think this throttling happened because of the time needed for the ASG to scale the SageMaker Inference Endpoints. Adding a delivery delay to the queue, so that SQS 'waits' briefly before the Lambda (which in turn calls SageMaker) picks up a message, eliminated this issue.
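As a small illustrative sketch (the queue URL and delay value are made up; note that SQS expresses the delivery delay in whole seconds), the delay can be set on the queue via boto3:

```python
import boto3

sqs = boto3.client("sqs")

# Illustrative queue URL; DelaySeconds accepts whole seconds between 0 and 900.
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/image-keys-queue"

sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={"DelaySeconds": "1"},  # give the serverless endpoint time to scale
)
```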

To conclude, in this blog post we have described the general architecture of our solution to the problem of clustering thousands of images with minimal latency, and dived deep into issues such as persistent storage and Lambda cold starts. In the next blog post, we plan to dive deep into our Machine Learning pipeline and how we managed ML package dependencies in Lambdas.


Relevant Links:


Lambda — EFS: https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/

Lambda — AVX2 — SIMD: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-avx2.html

Python — fcntl: https://www.oreilly.com/library/view/python-standard-library/0596000960/ch12s02.html

SageMaker Serverless Inference: https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html

Computer Vision Neural Network Architectures: https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5

Lambda Memory and Duration: https://medium.com/geekculture/pick-the-right-memory-size-for-your-aws-lambda-functions-682394aa4b21
