Baking in Performance - Dynamic Batching with Batched

Reading Time

3 min read

Publish Date

September 16, 2024

Authors

  • Julius Lipp
  • Aamir Shakir

Adding dynamic batching to our inference API back in the day was a real pain. We struggled with complex implementations and performance bottlenecks that were annoying and seemed unnecessary. We figured that many developers likely face similar challenges when trying to optimize their machine learning pipelines. That's why we're happy to open-source Batched, so other developers' lives can be a bit easier when it comes to implementing dynamic batching.

Read on to learn more about dynamic batching and how Batched can help you optimize your ML workflows, or head straight to the repo.

Why Dynamic Batching?

GPUs excel at handling massive parallel computations using SIMD (Single Instruction, Multiple Data), which allows one instruction to be applied to large sets of data simultaneously. However, in production, requests usually come in one by one, processed sequentially rather than in parallel. This is where dynamic batching helps. Instead of handling each request individually, we gather multiple requests and process them all at once.

Think of a baker toasting bread for customers. The toaster has four slots. When a customer orders toast, the baker immediately fills a single slot and waits for the toast to be ready. If another customer comes in while the toaster is still running, the baker has to wait until the first batch is done before toasting bread for the next customer. This is sequential processing. Now imagine dynamic batching: the baker waits a short time after an order comes in to see if another order arrives, grouping multiple orders together so that more customers can be served at once and the toaster is used efficiently.

While this approach might slightly delay the start of toasting for some customers, it significantly increases the overall throughput. Similarly, in machine learning systems, dynamic batching might introduce a small latency (typically milliseconds) for individual requests, but it dramatically improves the GPU's efficiency and the system's overall performance, allowing it to handle a much larger volume of data in less time.
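
To make the toaster analogy concrete, here is a deliberately minimal sketch of the pattern in plain Python (a toy illustration, not how Batched is implemented): incoming requests go into a queue, a worker waits a few milliseconds to see if more arrive, and everything collected in that window is processed as a single batch. The TinyBatcher name and its parameters are made up for this example.

import queue
import threading
import time

class TinyBatcher:
    """Toy dynamic batcher: collect requests for a short window, process them together."""

    def __init__(self, process_batch, max_batch_size=32, max_wait_ms=5):
        self._process_batch = process_batch          # callable: list[item] -> list[result]
        self._max_batch_size = max_batch_size
        self._max_wait = max_wait_ms / 1000
        self._requests = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, item):
        # Each caller blocks on its own event until the batched result is ready.
        slot = {"item": item, "done": threading.Event(), "result": None}
        self._requests.put(slot)
        slot["done"].wait()
        return slot["result"]

    def _worker(self):
        while True:
            batch = [self._requests.get()]           # block until the first request arrives
            deadline = time.monotonic() + self._max_wait
            while len(batch) < self._max_batch_size: # then wait briefly for more
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break
            results = self._process_batch([s["item"] for s in batch])
            for slot, result in zip(batch, results):
                slot["result"] = result
                slot["done"].set()

With process_batch set to a model's batch inference function, concurrent submit calls from different threads end up sharing a single forward pass.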

Batched

While some existing frameworks offer dynamic batching, we found that none provide this functionality without locking you into their entire ecosystem. As strong believers in developer experience and composability, we've open-sourced Batched so you can easily integrate dynamic batching into your projects without adding unnecessary complexity.

Installation

pip install batched

Add Dynamic Batching to Your Model

Let's look at how you can add dynamic batching to your projects using the batched library:

from sentence_transformers import SentenceTransformer
import batched
 
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
model.encode = batched.dynamically(model.encode)
 
model.encode(["Hello, world!"]) # Now dynamically batched.

So how can I use it in an API?

Dynamic batching is mostly used in inference APIs, where you want to process multiple inputs at once to improve throughput and efficiency. Here's a simple example of how you could use it with FastAPI:

from fastapi import FastAPI
from fastapi.responses import ORJSONResponse
from sentence_transformers import SentenceTransformer
from pydantic import BaseModel
import uvicorn
import batched
 
app = FastAPI()
 
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
model.encode = batched.dynamically(model.encode)  # wrap encode so concurrent requests share batches
 
class EmbeddingsRequest(BaseModel):
    input: str | list[str]
 
@app.post("/embeddings")
def embeddings(request: EmbeddingsRequest):
    return ORJSONResponse({"embeddings": model.encode(request.input)})
 
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
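
Once the server is running, the endpoint behaves like any other HTTP API. Here's a quick client-side check using the requests library (illustrative only; host, port, and payload shape match the snippet above):

import requests

response = requests.post(
    "http://localhost:8000/embeddings",
    json={"input": ["Hello, world!", "Dynamic batching is neat."]},
)
response.raise_for_status()
embeddings = response.json()["embeddings"]
print(len(embeddings), len(embeddings[0]))  # number of vectors and their dimensionality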

Benchmarks and Performance Gains

While specific performance gains vary with your use case and hardware, our benchmarks have shown impressive results: throughput improvements of up to 10x. This boost comes from better utilization of GPU resources and reduced per-request overhead. We strongly recommend benchmarking dynamic batching in your own workflows to quantify the benefits for your particular setup.

Batched Benchmarks
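
A rough way to quantify the effect on your own hardware is to compare one-at-a-time calls against concurrent ones going through the same wrapped model. The sketch below is illustrative, not our benchmark code; the sentence count and worker number are arbitrary:

import time
from concurrent.futures import ThreadPoolExecutor

from sentence_transformers import SentenceTransformer
import batched

model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')
model.encode = batched.dynamically(model.encode)

texts = [f"Benchmark sentence {i}" for i in range(512)]

# Sequential: one call at a time, so every request is processed on its own.
start = time.perf_counter()
for text in texts:
    model.encode([text])
sequential_s = time.perf_counter() - start

# Concurrent: overlapping calls are merged into larger batches by the wrapper.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(lambda text: model.encode([text]), texts))
concurrent_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, concurrent: {concurrent_s:.2f}s")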

Feedback

As always, we greatly welcome any feedback! Whether you're implementing dynamic batching or exploring other optimization techniques, we'd love to hear about your experiences and challenges.

Please share your thoughts, questions, and insights with us.

Happy batching! 🍞