April 12, 2024
64 bytes per embedding, yee-haw 🤠
We are happy to introduce a novel embedding compression method: Binary MRL. It makes vector search much more scalable and enables a range of new embeddings-based applications that weren't economically feasible before our release. Below, we also show how the compression parameters influence search results.
Read on to learn more about our approach and to check out our benchmarks. If you want to skip right to the model instead, you can access it here:
- mxbai-embed-large-v1: Our recently released flagship embedding model supports Binary MRL out of the box. How cool is that?!
- Wikipedia Demo: You can experience the speed and performance of our model in the demo (using binary quantization).
Why Embeddings?
Embeddings are one of the most versatile tools in natural language processing, supporting a wide variety of settings and use cases. In essence, embeddings are numerical representations of more complex objects like text, images, audio, etc. Specifically, the objects are represented as n-dimensional vectors.
After transforming objects using an embedding model, you can determine their inherent semantic similarity by calculating the similarity of the respective embeddings. Essentially, you determine how strongly related two objects are by measuring how close their embeddings are to each other in the n-dimensional vector space. This is crucial for many use cases: it serves as the backbone for recommendation systems, retrieval, one-shot or few-shot learning, outlier detection, similarity search, paraphrase detection, clustering, classification, and much more.
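To make this concrete, here is a minimal sketch of measuring that closeness with cosine similarity. The NumPy code and the toy four-dimensional vectors are purely illustrative, not output from a real embedding model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means very similar, close to 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; real models produce hundreds or thousands of dimensions.
bread = np.array([0.9, 0.1, 0.3, 0.0])
baguette = np.array([0.8, 0.2, 0.4, 0.1])
volcano = np.array([0.0, 0.9, 0.0, 0.8])

print(cosine_similarity(bread, baguette))  # high -> semantically related
print(cosine_similarity(bread, volcano))   # low  -> unrelated
```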
Embeddings are particularly important for Retrieval-Augmented Generation (RAG). The idea behind RAG is to let an LLM access custom documents that you provide (like analyst reports in your company) and improve its output based on that information. Transforming the documents, as well as the user's query, into embeddings allows the system to retrieve the most relevant information from your data and hand it to the LLM, which can then produce the most relevant output for the user.
Embeddings May Struggle to Scale
However, embeddings may be challenging to use at scale because of their memory usage, which leads to expensive solutions and high latencies. Currently, many state-of-the-art models produce embeddings with 1024 dimensions, each of which is encoded in float32, i.e., they require 4 bytes per dimension. To perform retrieval over 250 million vectors, you would therefore need around 1TB of memory! With costs estimated at $3.8 per GB/month using x2gd instances on AWS, this would incur monthly infrastructure costs of more than $3,500.
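As a quick sanity check of those numbers (a rough sketch; sizes are computed in GiB, matching the cost table further below):

```python
# Back-of-the-envelope check of the memory and cost estimate above.
vectors = 250_000_000
dims = 1024
bytes_per_dim = 4  # float32

total_bytes = vectors * dims * bytes_per_dim
gib = total_bytes / 1024**3      # ~953.67 GiB, i.e. roughly 1 TB
cost_per_month = gib * 3.8       # ~$3,624 at $3.8 per GB/month
print(f"{gib:.2f} GiB -> ${cost_per_month:,.2f} / month")
```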
Matryoshka Representation Learning & Vector Quantization to the Rescue
To solve the scaling issues of embeddings, two approaches have lately been gaining particular traction: Matryoshka Representation Learning (MRL) and Vector Quantization. Let's first take a look at both concepts.
MRL tries to make embeddings ready to scale by reducing the number of output dimensions of an embedding model without sacrificing much accuracy. This is achieved by storing the more important information in the earlier dimensions of the embedding, so that the less important later dimensions can be truncated, saving, for example, on storage cost and improving processing speed in downstream tasks. In essence, the loss function during model training needs to be calibrated in a way that not only accounts for the standard model performance on, say, 1024 output dimensions, but also tracks the performance using only the first 512, 256, 128, ... dimensions (see the sketch below). Training the model to minimize this loss function leads it to frontload the most important identifying information within its output vectors.
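To make the idea concrete, here is a minimal sketch of such a calibrated loss in PyTorch. It is purely illustrative: the `loss_fn`, the dimension schedule, and the normalization step are assumptions for the sketch, not our actual training code:

```python
import torch
import torch.nn.functional as F

def matryoshka_loss(query_emb, doc_emb, loss_fn, dims=(1024, 512, 256, 128, 64)):
    """Sum the task loss over progressively truncated (and re-normalized) embeddings,
    pushing the model to place the most important information in the earliest dimensions."""
    total = torch.tensor(0.0, device=query_emb.device)
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=-1)  # truncate, then re-normalize
        p = F.normalize(doc_emb[:, :d], dim=-1)
        total = total + loss_fn(q, p)
    return total
```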
On the other hand, vector quantization represents a very different approach to the problem. Here, instead of changing the number of output dimensions, the size of every dimension is reduced. Typically, each dimension of the embedding is stored as a float32 value, which requires 4 bytes (32 bits) of storage space. Especially when considering vectors with 1024 dimensions, potentially millions or billions of them, the benefits of reducing this size become obvious. A large gain in memory and disk space efficiency as well as retrieval speed under retention of 95% and more of performance can be realized by storing the embedding dimensions as binary values instead.
This is achieved by simply mapping each float32 value to 1 if it is greater than 0 and to 0 otherwise. To keep the resulting performance loss small, a rescoring step can be added when using the model for retrieval: both the query and the documents are first represented as binary embeddings and used to retrieve a candidate set of the most relevant results, and those candidates are then reranked against a float32 embedding of the query.
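Here is a rough sketch of both steps in NumPy; the brute-force Hamming search stands in for a real vector index, and the `rescore_factor` (how many candidates to re-rank) is an illustrative choice:

```python
import numpy as np

def to_binary(embeddings: np.ndarray) -> np.ndarray:
    """Binary quantization: 1 where the float32 value is > 0, else 0."""
    return (embeddings > 0).astype(np.uint8)

def retrieve_with_rescoring(query_f32, docs_f32, top_k=10, rescore_factor=4):
    query_bin = to_binary(query_f32)
    docs_bin = to_binary(docs_f32)

    # Step 1: cheap candidate retrieval on binary vectors via Hamming distance.
    hamming = (docs_bin != query_bin).sum(axis=1)
    candidates = np.argsort(hamming)[: top_k * rescore_factor]

    # Step 2: rescore the candidates with the float32 query embedding and re-rank.
    scores = docs_bin[candidates] @ query_f32
    return candidates[np.argsort(-scores)][:top_k]
```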
Taking It One Step Further with Binary MRL
Recognizing the potential of both of these approaches, we have already published some of our research on the subject. On MRL, we released our novel 2D-Matryoshka model; on binary quantization, we co-authored a post on the Hugging Face blog introducing curious members of the community to the subject.
Now, we aim to take things one step further by combining both approaches. We want to demonstrate that it is feasible to truncate embedding dimensions and reduce the size of each dimension simultaneously, while still retaining most of the original model performance using our very own embedding model.
The following table demonstrates that our model retains over 90% of its performance while reducing the output dimensions from 1024 to 512 and shrinking each dimension by a factor of 32, a combined 64x efficiency gain. Naturally, this decrease in memory usage also leads to a proportional (i.e., enormous) decrease in infrastructure cost when running on cloud compute or a vector database.
We evaluated model performance on the MTEB retrieval benchmark, which includes the 13 publicly available BEIR datasets. The table shows NDCG@10 scores, relative performance retention, and vector size in bytes for our model with float32 values and with binary quantization, combined with different output dimensions:
| | 1024 Dim. | 512 Dim. | 256 Dim. | 128 Dim. | 64 Dim. |
|---|---|---|---|---|---|
| **NDCG@10** | | | | | |
| float32 | 54.39 | 51.79 | 46.78 | 36.63 | 18.63 |
| binary | 52.46 | 49.37 | 43.25 | 32.80 | 17.61 |
| **Performance Retention** | | | | | |
| float32 | 100.00% | 95.22% | 86.01% | 67.34% | 34.25% |
| binary | 96.46% | 90.76% | 79.52% | 60.31% | 32.38% |
| **Vector Size [bytes]** | | | | | |
| float32 | 4,096 | 2,048 | 1,024 | 512 | 256 |
| binary | 128 | 64 | 32 | 16 | 8 |
As shown, Mixedbread's embedding model performs more than 90% as well using 64-byte vectors as it does using 4,096-byte vectors. In our view, these 512-dimensional binary embeddings also represent the sweet spot for the trade-off between performance and storage capacity.
We can also take a look at the following graph, which visualizes the relation between performance and output dimensions for both float32 and binary embeddings:

As we can see, the curves for float32 and binary embeddings exhibit strong similarities. In our view, the trade-off between size and performance is optimal in the less steep left part of the curve. Due to resource constraints, we did not evaluate the performance retention for int8 quantization, but we would expect that curve to look very similar and to lie in between the other two.
What Does All of This Mean in Practice?
The following text takes 64 bytes (ASCII) to store: "Bread the warm and yeasty comfort that feeds both body and soul."
Alternatively, in the same 64 bytes, you could now store a vector embedding of a complex object like a text or an image at extremely high quality. Which would you consider more useful?
The Economic Consequences of the Release
Saving space on embedding sizes is not merely a cosmetic exercise to excite a small group of experts on the subject - it makes using neural search with vector databases significantly cheaper. This can have wide-ranging consequences: we believe it will enable new and exciting embeddings-based applications that previously weren't economically feasible!
In the following table, we compiled an overview of the required storage space and the resulting monthly cost of performing retrieval over 100M, 250M, and 1B vectors, respectively. Again, we assume costs of $3.8 per GB/month using x2gd instances on AWS:
| Data type | Dim. | 100M embeddings | 250M embeddings | 1B embeddings |
|---|---|---|---|---|
| float32 | 1024 | 381.47 GB ($1,449.58/mo) | 953.67 GB ($3,623.96/mo) | 3.81 TB ($14,495.85/mo) |
| float32 | 512 | 190.73 GB ($724.79/mo) | 476.84 GB ($1,811.98/mo) | 1.91 TB ($7,247.92/mo) |
| float32 | 256 | 95.37 GB ($362.40/mo) | 238.42 GB ($905.99/mo) | 953.67 GB ($3,623.96/mo) |
| float32 | 128 | 47.68 GB ($181.20/mo) | 119.21 GB ($453.00/mo) | 476.84 GB ($1,811.98/mo) |
| binary | 1024 | 11.92 GB ($45.30/mo) | 29.80 GB ($113.25/mo) | 119.21 GB ($453.00/mo) |
| binary | 512 | 5.96 GB ($22.65/mo) | 14.90 GB ($56.62/mo) | 59.60 GB ($226.50/mo) |
| binary | 256 | 2.98 GB ($11.32/mo) | 7.45 GB ($28.31/mo) | 29.80 GB ($113.25/mo) |
| binary | 128 | 1.49 GB ($5.66/mo) | 3.73 GB ($14.16/mo) | 14.90 GB ($56.62/mo) |
Binary MRL in Action
We offer Binary MRL through our API, and it is also supported by Sentence Transformers. Here is an example of how you can use it:
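The snippet below is a minimal sketch using the Sentence Transformers library; it assumes a version that ships `quantize_embeddings` (roughly 2.6 or newer), and the documents and chosen dimensions are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

docs = [
    "Bread is a staple food prepared from a dough of flour and water.",
    "Sourdough is leavened with naturally occurring lactobacilli and yeast.",
]

# 1. Encode as float32 embeddings (1024 dimensions).
embeddings = model.encode(docs)

# 2. MRL: keep only the first 512 dimensions.
mrl_embeddings = embeddings[:, :512]

# 3. Binary quantization: pack each dimension into a single bit.
binary_embeddings = quantize_embeddings(mrl_embeddings, precision="binary")

print(binary_embeddings.shape)  # (2, 64) -> 64 bytes per document
```

Truncating first and quantizing second mirrors the table above: 512 dimensions at one bit each come out to 64 bytes per embedding.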
We also put a demo online where you can search the English Wikipedia using binary embeddings and explore how the parameters influence the results.
Practical Considerations
On a practical level, you will need a vector database that supports this approach to take full advantage of the benefits it can provide. We understand that many providers will be hesitant to offer it, as it directly cuts into their profits if the number of users and the number of embeddings they perform retrieval over stay constant. However, we believe that making vector search more economically accessible will increase demand, both for expanding existing applications and for building completely new ones, and will offset this effect for providers. Already, there are providers that recognize the potential of our findings and want to support their customers in using them in innovative and productive ways. Vespa has been particularly vocal about their excitement to support the wonderful things the community will be able to do with Binary MRL.