September 16, 2024
Baking in Performance - Dynamic Batching with Batched
Adding dynamic batching to our inference API back in the day was a real pain. We struggled with complex implementations and performance bottlenecks that were annoying and seemed unnecessary. We figured that many developers likely face similar challenges when trying to optimize their machine learning pipelines. That's why we're happy to open-source Batched, so other developers' lives can be a bit easier when it comes to implementing dynamic batching.
Read on to learn more about dynamic batching and how Batched can help you optimize your ML workflows or access the repo here.
TLDR:
Batched is a lightweight library that seamlessly adds dynamic batching to any transformer model or function, automatically grouping multiple requests for efficient processing.
Why Dynamic Batching?
GPUs excel at handling massive parallel computations using SIMD (Single Instruction, Multiple Data), which allows one instruction to be applied to large sets of data simultaneously. This parallelism is what makes GPUs so powerful for tasks like deep learning, where operations are performed over large tensors.
However, in production environments, especially in online inference scenarios, requests usually arrive one at a time and are processed sequentially rather than in parallel. Each request may only use a fraction of the GPU's capability, leaving the hardware massively underutilized.
This is where dynamic batching comes into play. Instead of handling each request individually, we gather multiple incoming requests over a short time window and process them all at once as a batch. With this approach, we're able to use the GPU's ability to perform parallel computations more effectively, improving overall throughput and resource utilization.
The Toaster Analogy
Imagine a baker who toasts bread for customers. The toaster has four slots, but when a customer orders toast, the baker immediately fills a single slot and waits for the toast to be ready. If another customer comes in while the toaster is still running, the baker has to wait until the first batch is done before being able to toast a new piece of bread for the new customer. This is sequential processing, where resources are not fully utilized.
Now, imagine dynamic batching: the baker waits a short time after an order comes in to see if any other orders are arriving. By grouping multiple orders together, the baker fills all four slots of the toaster and toasts multiple pieces of bread simultaneously. While this approach might slightly delay the start of toasting for some customers, it significantly increases the overall throughput. More customers get their toast faster, and the toaster is used efficiently.
Similarly, in machine learning systems, dynamic batching might introduce a small latency (typically milliseconds) for individual requests, but it dramatically improves the GPU's efficiency and the system's overall performance. By processing multiple inputs at once, the GPU's computational resources are better utilized, allowing the system to handle a much larger volume of data in less time.
The Challenge
Implementing dynamic batching isn't always straightforward. It can involve complex synchronization mechanisms, queue management, and careful handling of different input shapes and types. Many developers find themselves writing boilerplate code or fighting with available frameworks that may not fit seamlessly into their existing infrastructure.
Batched
While there are frameworks like Triton that offer dynamic batching, we found that none provide this functionality without locking you into their entire ecosystem. As strong believers in developer experience and composability, we’ve open-sourced Batched to allow you to easily integrate dynamic batching into your projects without adding unnecessary complexity.
Batched is designed to be lightweight and easy to use. It wraps around your existing functions or models, automatically handling the batching of incoming requests. As a result, you can add dynamic batching to your codebase with minimal changes and without rearchitecting your entire system.
Installation
You can install Batched via pip:
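```bash
# The package name is assumed to match the library name on PyPI.
pip install batched
```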
Add Dynamic Batching to Your Model
Let's look at how you can add dynamic batching to your projects using the batched library. Suppose you're using an embedding model with sentence_transformers.
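The sketch below shows what that integration could look like; the batched.dynamically wrapper and the mixedbread-ai/mxbai-embed-large-v1 checkpoint are illustrative assumptions, so check the repository README for the exact API:

```python
from sentence_transformers import SentenceTransformer
import batched

# Load a sentence-transformers embedding model (the checkpoint is just an example).
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Wrap the encode method so calls arriving close together are grouped into
# a single batch. The wrapper name is an assumption; see the Batched README
# for the exact decorator or function to use.
model.encode = batched.dynamically(model.encode)
```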
That's it! Your model's encode method now supports dynamic batching. When you call model.encode, Batched will automatically collect incoming requests and process them in batches.
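For instance, the call below behaves like any other encode call, while concurrent calls that arrive within the same window are merged into one batch behind the scenes (the input text is, of course, just an example):

```python
# Looks like a regular, single call from the caller's point of view.
embeddings = model.encode(["What is dynamic batching?"])
print(embeddings.shape)  # e.g. (1, embedding_dim)
```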
Using Batched in an API
Dynamic batching is most valuable in inference APIs, where multiple clients may send requests simultaneously. By batching these requests, you can significantly improve throughput and reduce per-request latency.
Here's how you could use Batched in an API built with FastAPI.
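The following is a minimal sketch rather than a prescribed setup: the endpoint path, request schema, and the batched.dynamically wrapper are assumptions, so adapt them to your own service:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import batched

app = FastAPI()

# Wrap the model once at startup; concurrent requests hitting the endpoint
# are then grouped into batched encode calls. The wrapper name is an
# assumption, as above.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
model.encode = batched.dynamically(model.encode)

class EmbedRequest(BaseModel):
    text: str

@app.post("/embed")
def embed(request: EmbedRequest):
    # Each handler issues a normal-looking encode call; Batched merges
    # calls from concurrent requests into a single GPU batch.
    embedding = model.encode([request.text])[0]
    return {"embedding": embedding.tolist()}
```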
How Batched Works Under the Hood
Batched intercepts calls to the wrapped function and places them into a queue. It waits for a short, configurable amount of time to collect additional requests. Once the time elapses or a maximum batch size is reached, it processes all collected requests together. If used for the forward pass of a model, it will also handle padding the inputs to the correct size.
This ensures that requests are batched without introducing significant latency. The default settings are optimized for common use cases, but you can adjust parameters like the maximum wait time or batch size to better suit your needs.
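To make the queue-and-timer pattern concrete, here is a stripped-down sketch of the general idea. It is illustrative only and not Batched's actual implementation; the class name, parameters, and defaults are invented for the example:

```python
import queue
import threading
from concurrent.futures import Future

class SimpleBatcher:
    """Illustrative sketch of the queue-and-timer pattern, not Batched's code."""

    def __init__(self, batch_fn, max_batch_size=32, max_wait_s=0.005):
        self.batch_fn = batch_fn        # processes a list of inputs in one call
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.requests = queue.Queue()   # holds (input, Future) pairs
        threading.Thread(target=self._loop, daemon=True).start()

    def __call__(self, item):
        future = Future()
        self.requests.put((item, future))
        return future.result()          # block until the batch is processed

    def _loop(self):
        while True:
            # Wait for the first request, then keep collecting until the
            # batch is full or no new request arrives within the wait window.
            batch = [self.requests.get()]
            while len(batch) < self.max_batch_size:
                try:
                    batch.append(self.requests.get(timeout=self.max_wait_s))
                except queue.Empty:
                    break
            inputs = [item for item, _ in batch]
            outputs = self.batch_fn(inputs)
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)
```

A real implementation also needs error handling and, for model inputs, padding to a common shape, which is part of what Batched handles for you.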
Benchmarks and Performance Gains
While specific performance gains can vary depending on your use case and hardware, our benchmarks have shown promising results.
We set up a test environment where we simulated multiple clients sending requests to an inference API. Without dynamic batching, the GPU was underutilized, and the system struggled to handle a high volume of concurrent requests.
After integrating Batched, we observed:
- Up to 10x improvement in throughput: The system could process many more requests per second.
- Reduced per-request latency: Despite the slight delay introduced by batching, the overall time from request to response decreased due to more efficient processing.
- Better GPU utilization: The GPU's computational capacity was utilized more efficiently.
It's important to note that the exact improvements you'll see may differ based on your specific setup, model complexity, and request patterns. We strongly recommend benchmarking dynamic batching in your own workflows to get a clear picture of the benefits for your use case.
Feedback
As always, we greatly welcome any feedback! Whether you're implementing dynamic batching or exploring other optimization techniques, we'd love to hear about your experiences and challenges.
If you're interested in collaborating or have ideas on how to improve Batched, please reach out to us on our Discord or contribute to the GitHub repository.
Happy batching! 🍞