
Accelerate Generative AI Inference with Amazon SageMaker’s Fast Auto Scaling

Introduction

Amazon SageMaker inference is a cloud-based service that enables developers to deploy machine learning (ML) models at scale. As demand for real-time AI applications grows, fast and reliable scaling becomes critical. In this article, we explore how SageMaker inference can reduce the time it takes for your generative AI models to scale automatically.

Faster Auto Scaling Metrics

Today, we are excited to announce a new capability in Amazon SageMaker inference that reduces the time it takes for your generative AI models to scale automatically. You can now use sub-minute metrics, ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy, to significantly reduce overall scaling latency for generative AI models. With this enhancement, your generative AI applications can stay responsive as demand fluctuates.

The rise of foundation models (FMs) and large language models (LLMs) has brought new challenges to generative AI inference deployment. These advanced models can take seconds to process a single request, and an instance can often handle only a limited number of concurrent requests, creating a critical need for rapid load detection and fast auto scaling to maintain business continuity. Organizations implementing generative AI seek comprehensive solutions that address multiple concerns: reducing infrastructure costs, minimizing latency, and maximizing throughput to meet the demands of these sophisticated models. However, they prefer to focus on solving business problems rather than doing the undifferentiated heavy lifting of building complex inference platforms from the ground up.

SageMaker provides industry-leading capabilities to address these inference challenges. It offers endpoints for generative AI inference that reduce FM deployment costs by 50% on average and latency by 20% on average by optimizing the use of accelerators. The SageMaker inference optimization toolkit, a fully managed model optimization feature, can deliver up to two times higher throughput while reducing costs by approximately 50% for generative AI workloads on SageMaker. Beyond optimization, SageMaker inference provides streaming support for LLMs, enabling you to stream tokens in real time rather than waiting for the entire response. This lowers perceived latency and makes generative AI experiences more responsive, which is crucial for use cases like conversational AI assistants. Lastly, SageMaker inference lets you deploy a single model or multiple models on the same endpoint using SageMaker inference components, with advanced routing strategies that load balance requests across the instances backing the endpoint.
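To make the streaming behavior concrete, here is a minimal sketch that calls a streaming endpoint with boto3's invoke_endpoint_with_response_stream API. The endpoint name and the JSON payload shape are illustrative assumptions; the exact request format depends on the model container you deploy.

```python
import json
import boto3

# Hypothetical endpoint name, used for illustration only
endpoint_name = "my-llm-endpoint"

smr = boto3.client("sagemaker-runtime")

# Request streamed inference; this payload follows a common
# text-generation container convention and may differ for yours
response = smr.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Explain auto scaling in one sentence.",
        "parameters": {"max_new_tokens": 128},
    }),
)

# The response body is an event stream; print token chunks as they arrive
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
```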

Components of Auto Scaling

The following figure illustrates a typical scenario of how a SageMaker real-time inference endpoint scales out to handle an increase in concurrent requests. In this example, we walk through the key steps that occur when inference traffic to the endpoint starts to increase and concurrency to the model deployed on each instance goes up. The system monitors the traffic, invokes an auto scaling action, provisions new instances, and load balances requests across the scaled-out resources; a sketch for observing this process follows below. Understanding this scaling process is crucial for making sure your generative AI models can handle fluctuations in demand and provide a seamless experience for your customers.
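As a rough way to watch this scale-out from the outside, the sketch below uses boto3 to check the current instance count behind an endpoint and to list recent auto scaling activities. The endpoint name is a hypothetical placeholder, and we assume the default AllTraffic variant name.

```python
import boto3

endpoint_name = "my-llm-endpoint"  # hypothetical placeholder

sm = boto3.client("sagemaker")
aas = boto3.client("application-autoscaling")

# Step 1: inspect how many instances currently back each variant
desc = sm.describe_endpoint(EndpointName=endpoint_name)
for variant in desc["ProductionVariants"]:
    print(variant["VariantName"], "instances:", variant["CurrentInstanceCount"])

# Step 2: list recent auto scaling actions taken for the variant
activities = aas.describe_scaling_activities(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
)
for activity in activities["ScalingActivities"]:
    print(activity["StartTime"], activity["StatusCode"], activity["Cause"])
```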

Conclusion

In this post, we detailed how the ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy metrics work, explained why you should use them, and walked you through the process of implementing them for your workloads. We encourage you to try out these new metrics and evaluate whether they improve your FM and LLM workloads on SageMaker endpoints. You can find the notebooks on GitHub.

Frequently Asked Questions

Question 1: What is Amazon SageMaker inference?

Amazon SageMaker inference is a cloud-based service that enables developers to deploy machine learning (ML) models at scale. It provides industry-leading capabilities to address inference challenges, including automatic scaling, model optimization, and streaming support.

Question 2: What are the benefits of using Amazon SageMaker inference for generative AI models?

The benefits of using Amazon SageMaker inference for generative AI models include reduced infrastructure costs, minimized latency, and maximized throughput. It also provides automatic scaling, model optimization, and streaming support, making it an ideal choice for real-time AI applications.

Question 3: How do the new metrics ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy work?

The new metrics ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy track actual concurrency: the number of simultaneous requests being handled by the containers (in-flight requests), including requests queued inside the containers. This provides a more direct and accurate representation of the load on the system.
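As a rough illustration, the following sketch queries one of these metrics from Amazon CloudWatch with boto3. The endpoint name, variant name, and dimension names are assumptions for illustration; check the dimensions your endpoint actually publishes for these metrics.

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

# Pull the last 15 minutes of the ConcurrentRequestsPerModel metric.
# Endpoint and variant names below are hypothetical placeholders.
stats = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ConcurrentRequestsPerModel",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-llm-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Maximum"],
)

# Print data points in time order to see concurrency trend over time
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```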

Question 4: How can I implement the new metrics in my workloads?

You can implement the new metrics by registering a scalable target and defining a target tracking scaling policy that uses the ConcurrentRequestsPerModel or ConcurrentRequestsPerCopy metric, as sketched below. This enables your generative AI models to scale automatically based on actual concurrency levels.
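Here is a minimal sketch of that flow using boto3 and Application Auto Scaling. The endpoint name, variant name, capacities, target value, cooldowns, and metric dimensions are all illustrative assumptions, not prescribed values.

```python
import boto3

aas = boto3.client("application-autoscaling")

endpoint_name = "my-llm-endpoint"  # hypothetical placeholder
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant as a scalable target with min/max capacity
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking policy on the concurrency metric: scale so that each
# model handles roughly TargetValue concurrent requests
aas.put_scaling_policy(
    PolicyName="concurrency-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # illustrative target concurrency per model
        "CustomizedMetricSpecification": {
            "MetricName": "ConcurrentRequestsPerModel",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            "Statistic": "Maximum",
        },
        "ScaleInCooldown": 300,  # illustrative cooldowns, in seconds
        "ScaleOutCooldown": 60,
    },
)
```

A scale-out cooldown shorter than the scale-in cooldown, as assumed here, is a common pattern: it lets the endpoint add capacity quickly under a traffic spike while releasing it more conservatively as load subsides.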

Question 5: Can I use Amazon SageMaker inference with other AWS services?

Yes, Amazon SageMaker inference can be used with other AWS services, such as Amazon Rekognition, Amazon Comprehend, and Amazon Translate. It provides a scalable and secure way to deploy ML models and can be integrated with other AWS services to create a comprehensive AI solution.

