September 16, 2024
Baking in Performance - Dynamic Batching with Batched
Adding dynamic batching to our inference API back in the day was a real pain. We struggled with complex implementations and performance bottlenecks that were annoying and seemed unnecessary. We figured that many developers likely face similar challenges when trying to optimize their machine learning pipelines. That's why we're happy to open-source Batched, so other developers' lives can be a bit easier when it comes to implementing dynamic batching.
Read on to learn more about dynamic batching and how Batched can help you optimize your ML workflows or access the repo here.
TLDR:
Batched is a lightweight library that seamlessly adds dynamic batching to any transformer model or function, automatically grouping multiple requests for efficient processing.
Why Dynamic Batching?
GPUs excel at handling massive parallel computations using SIMD (Single Instruction, Multiple Data), which allows one instruction to be applied to large sets of data simultaneously. This parallelism is what makes GPUs so powerful for tasks like deep learning, where operations are performed over large tensors.
However, in production environments, especially in online inference scenarios, requests usually arrive one at a time and are processed sequentially rather than in parallel. Each request may only use a fraction of the GPU's capability, leaving the hardware massively underutilized.
This is where dynamic batching comes into play. Instead of handling each request individually, we gather multiple incoming requests over a short time window and process them all at once as a batch. With this approach, we're able to use the GPU's ability to perform parallel computations more effectively, improving overall throughput and resource utilization.
The Toaster Analogy
Imagine a baker who toasts bread for customers. The toaster has four slots, but when a customer orders toast, the baker immediately fills a single slot and waits for the toast to be ready. If another customer comes in while the toaster is still running, the baker has to wait until the first batch is done before being able to toast a new piece of bread for the new customer. This is sequential processing, where resources are not fully utilized.
Now, imagine dynamic batching: the baker waits a short time after an order comes in to see if any other orders are arriving. By grouping multiple orders together, the baker fills all four slots of the toaster and toasts multiple pieces of bread simultaneously. While this approach might slightly delay the start of toasting for some customers, it significantly increases the overall throughput. More customers get their toast faster, and the toaster is used efficiently.
Similarly, in machine learning systems, dynamic batching might introduce a small latency (typically milliseconds) for individual requests, but it dramatically improves the GPU's efficiency and the system's overall performance. By processing multiple inputs at once, the GPU's computational resources are better utilized, allowing the system to handle a much larger volume of data in less time.
The Challenge
Implementing dynamic batching isn't always straightforward. It can involve complex synchronization mechanisms, queue management, and careful handling of different input shapes and types. Many developers find themselves writing boilerplate code or fighting with available frameworks that may not fit seamlessly into their existing infrastructure.
Batched
While there are frameworks like Triton that offer dynamic batching, we found that none provide this functionality without locking you into their entire ecosystem. As strong believers in developer experience and composability, we’ve open-sourced Batched to allow you to easily integrate dynamic batching into your projects without adding unnecessary complexity.
Batched is designed to be lightweight and easy to use. It wraps around your existing functions or models, automatically handling the batching of incoming requests. As a result, you can add dynamic batching to your codebase with minimal changes and without rearchitecting your entire system.
Installation
You can install Batched via pip:
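```bash
# The package name is assumed to match the library name on PyPI.
pip install batched
```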
Add Dynamic Batching to Your Model
Let's look at how you can add dynamic batching to your projects using the batched library. Suppose you're using an embedding model with sentence_transformers.
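The sketch below shows what that integration could look like; the batched.dynamically wrapper and the mixedbread-ai/mxbai-embed-large-v1 checkpoint are illustrative assumptions, so check the repository README for the exact API:

```python
from sentence_transformers import SentenceTransformer
import batched

# Load a sentence-transformers embedding model (the checkpoint is just an example).
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Wrap the encode method so calls arriving close together are grouped into
# a single batch. The wrapper name is an assumption; see the Batched README
# for the exact decorator or function to use.
model.encode = batched.dynamically(model.encode)
```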
That's it! Your model's encode method now supports dynamic batching. When you call model.encode, Batched will automatically collect incoming requests and process them in batches.
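For instance, the call below behaves like any other encode call, while concurrent calls that arrive within the same window are merged into one batch behind the scenes (the input text is, of course, just an example):

```python
# Looks like a regular, single call from the caller's point of view.
embeddings = model.encode(["What is dynamic batching?"])
print(embeddings.shape)  # e.g. (1, embedding_dim)
```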
Using Batched in an API
Dynamic batching is most valuable in inference APIs, where multiple clients may send requests simultaneously. By batching these requests, you can significantly improve throughput and reduce per-request latency.
Here's how you could use Batched in an API built with FastAPI.
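The following is a minimal sketch rather than a prescribed setup: the endpoint path, request schema, and the batched.dynamically wrapper are assumptions, so adapt them to your own service:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import batched

app = FastAPI()

# Wrap the model once at startup; concurrent requests hitting the endpoint
# are then grouped into batched encode calls. The wrapper name is an
# assumption, as above.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
model.encode = batched.dynamically(model.encode)

class EmbedRequest(BaseModel):
    text: str

@app.post("/embed")
def embed(request: EmbedRequest):
    # Each handler issues a normal-looking encode call; Batched merges
    # calls from concurrent requests into a single GPU batch.
    embedding = model.encode([request.text])[0]
    return {"embedding": embedding.tolist()}
```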
How Batched Works Under the Hood
Batched intercepts calls to the wrapped function and places them into a queue. It waits for a short, configurable amount of time to collect additional requests. Once the time elapses or a maximum batch size is reached, it processes all collected requests together. If used for the forward pass of a model, it will also handle padding the inputs to the correct size.
This ensures that requests are batched without introducing significant latency. The default settings are optimized for common use cases, but you can adjust parameters like the maximum wait time or batch size to better suit your needs.
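To make the queue-and-timer pattern concrete, here is a stripped-down sketch of the general idea. It is illustrative only and not Batched's actual implementation; the class name, parameters, and defaults are invented for the example:

```python
import queue
import threading
from concurrent.futures import Future

class SimpleBatcher:
    """Illustrative sketch of the queue-and-timer pattern, not Batched's code."""

    def __init__(self, batch_fn, max_batch_size=32, max_wait_s=0.005):
        self.batch_fn = batch_fn        # processes a list of inputs in one call
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.requests = queue.Queue()   # holds (input, Future) pairs
        threading.Thread(target=self._loop, daemon=True).start()

    def __call__(self, item):
        future = Future()
        self.requests.put((item, future))
        return future.result()          # block until the batch is processed

    def _loop(self):
        while True:
            # Wait for the first request, then keep collecting until the
            # batch is full or no new request arrives within the wait window.
            batch = [self.requests.get()]
            while len(batch) < self.max_batch_size:
                try:
                    batch.append(self.requests.get(timeout=self.max_wait_s))
                except queue.Empty:
                    break
            inputs = [item for item, _ in batch]
            outputs = self.batch_fn(inputs)
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)
```

A real implementation also needs error handling and, for model inputs, padding to a common shape, which is part of what Batched handles for you.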
Benchmarks and Performance Gains
While specific performance gains can vary depending on your use case and hardware, our benchmarks have shown promising results.
We set up a test environment where we simulated multiple clients sending requests to an inference API. Without dynamic batching, the GPU was underutilized, and the system struggled to handle a high volume of concurrent requests.
After integrating Batched, we observed:
- Up to 10x improvement in throughput: The system could process many more requests per second.
- Reduced per-request latency: Despite the slight delay introduced by batching, the overall time from request to response decreased due to more efficient processing.
- Better GPU utilization: The GPU's computational capacity was utilized more efficiently.
It's important to note that the exact improvements you'll see may differ based on your specific setup, model complexity, and request patterns. We strongly recommend benchmarking dynamic batching in your own workflows to get a clear picture of the benefits for your use case.
Feedback
As always, we greatly welcome any feedback! Whether you're implementing dynamic batching or exploring other optimization techniques, we'd love to hear about your experiences and challenges.
If you're interested in collaborating or have ideas on how to improve Batched, please reach out to us on our Discord or contribute to the GitHub repository.
Happy batching! 🍞