July 18, 2024
Open Source Gets DE-licious: Mixedbread x deepset German/English Embeddings
We are happy to introduce the much-requested result of a collaboration between deepset (the creators of Haystack) and Mixedbread on a project close to home: an open-source German/English embedding model. Our model sets a new performance standard among its open-source peers. It also supports binary quantization and Matryoshka representation learning (MRL), enabling order-of-magnitude reductions in storage and infrastructure costs.
Read on to learn more about our approach and to check out our benchmarks. If you want to skip right to the model instead, you can access it here:
- mixedbread-ai/deepset-mxbai-embed-de-large-v1: The new and powerful open-source German embedding model made by Mixedbread together with deepset.
Why use embeddings?
Embeddings are among the most adaptable tools in natural language processing, applicable to a diverse range of settings and use cases. Specifically, embeddings are numerical representations of complex objects like text, images, and audio, depicted as n-dimensional vectors.
By transforming objects with an embedding model, you can assess their inherent semantic similarity by calculating the similarity of their respective embeddings. This process involves measuring how closely related two objects are based on the proximity of their embeddings in the n-dimensional vector space. This technique is essential for numerous applications, forming the foundation for recommendation systems, retrieval, one-shot or few-shot learning, outlier detection, similarity search, paraphrase detection, clustering, classification, and more.
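To make this concrete, here is a minimal sketch of cosine similarity over embeddings, using tiny toy vectors in place of real model outputs:

```python
import numpy as np

# Toy 4-dimensional "embeddings"; real models produce hundreds or thousands of dimensions.
cat = np.array([0.9, 0.1, 0.3, 0.0])
kitten = np.array([0.85, 0.15, 0.25, 0.05])
car = np.array([0.1, 0.9, 0.0, 0.4])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, kitten))  # close to 1.0 -> semantically similar
print(cosine_similarity(cat, car))     # noticeably lower -> less similar
```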
Embeddings also play a pivotal role in Retrieval-Augmented Generation (RAG). RAG aims to enable a large language model (LLM) to access custom documents you provide, such as company analyst reports, and enhance its output based on this information. By converting documents and queries into embeddings, the LLM can retrieve the closest information to your data and use it to generate the most relevant output for the user.
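A minimal sketch of that retrieval step, using random normalized stand-in vectors instead of a real embedding model and hypothetical document titles:

```python
import numpy as np

# Stand-in for an embedding model: random but normalized vectors, purely to
# illustrate the retrieval step of a RAG pipeline.
rng = np.random.default_rng(42)
documents = ["Quartalsbericht Q1", "Reisekostenrichtlinie", "Analystenbericht Automobilsektor"]
doc_embeddings = rng.normal(size=(len(documents), 1024))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

query_embedding = rng.normal(size=1024)
query_embedding /= np.linalg.norm(query_embedding)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_embeddings @ query_embedding
top_k = np.argsort(-scores)[:2]
retrieved = [documents[i] for i in top_k]  # these passages would be handed to the LLM as context
print(retrieved)
```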
Embedding models and their language bias
Currently available embedding models are typically geared towards the English language due to the sheer volume of English-language digital content, the availability of extensive English datasets for training, and the focus of major research institutions and tech companies on English-speaking markets. This bias results in significant limitations for non-English applications, often leading to inaccuracies and misrepresentations of other languages. For the German language, as for many others, this has long meant a lack of robust, effective tools for natural language processing tasks. Our new German language embedding model aims to address this problem, providing a more accurate and culturally relevant tool for the German-speaking community, which also represents one of the world's largest economic areas.
A new winner in German open-source embeddings
Our model is mainly focused on retrieval tasks and is built on the proverbial shoulders of giants; in this case, the giant is the multilingual-e5-large model by Wang et al. Our model was initialized from multilingual-e5-large, fine-tuned on 30+ million pairs of high-quality German data, and optimized for compression. As a result, it can embed at large scale with low cost and high performance in terms of both speed and quality. The model is trained with the AnglE loss, using a mixture of full fine-tuning and LoRA during training for better generalization.
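As a purely illustrative sketch of the LoRA part (not our exact training recipe, and omitting the AnglE objective and data pipeline entirely), attaching a LoRA adapter to the multilingual-e5-large backbone with the peft library could look roughly like this; all hyperparameter values shown are hypothetical:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Illustrative only: shows how a LoRA adapter can be attached to the backbone.
backbone = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

lora_config = LoraConfig(
    r=8,                                # rank of the low-rank update matrices (hypothetical value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in the XLM-R encoder
)

model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights are trainable
```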
We made a significant effort to avoid any overlap of the training and test data. The model is benchmarked on a mix of private and public benchmarks, which we created in collaboration with some of deepset's clients. You can find an overview of the benchmarks in this spreadsheet.
Model | Avg. Performance (NDCG@10) | Binary Support | MRL Support |
---|---|---|---|
deepset-mxbai-embed-de-large-v1 | 51.7 | ✅ | ✅ |
multilingual-e5-large | 50.5 | ❌ | ❌ |
jina-embeddings-v2-base-de | 50.0 | ✅ | ❌ |
Commercial Models | | | |
Cohere Multilingual v3 | 52.4 | ✅ | - |
As the table shows, on the NDCG@10 metric, which compares the list of retrieval results against an ideally ordered list, our model sets a new standard for open-source German embedding models and even gets close to fully commercial, closed-source alternatives. While the improvement on the benchmark might not appear particularly significant at first glance, our case study demonstrates that it makes a big difference in real-world applications.
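For reference, here is a small sketch of how NDCG@10 can be computed in its linear-gain formulation (evaluation frameworks may use the exponential-gain variant instead):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    rels = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))  # positions 1..k -> log2(2..k+1)
    return float(np.sum(rels / discounts))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the retrieved order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of retrieved documents, in the order the model returned them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))  # 1.0 would mean a perfectly ordered list
```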
Case Study: A legal data client
For benchmarking with a focus on real-world applicability, we compared the performance of our model against a model that was specifically fine-tuned on the German legal domain. Unlike that model, ours had not seen any of this domain-specific data before benchmarking.
Model | Avg. Performance (MAP@10) |
---|---|
deepset-mxbai-embed-de-large-v1 | 90.25 |
voyage-law-2 | 84.80 |
As we can see, our general-purpose German embedding model outperforms a domain-specific alternative in the very area it was trained for. We view this result as a very promising signal for the future usefulness of our model across new domains in the German-speaking world.
Save 97%+ of infrastructure costs with binary MRL
Today, embeddings are still known to struggle with storage and processing speed when used for large-scale tasks. To address the scaling issues of embeddings, two promising approaches have emerged: Matryoshka representation learning (MRL) and binary quantization. MRL reduces the number of output dimensions in an embedding model without significant accuracy loss by prioritizing important information in the initial dimensions, allowing for truncation of less important later dimensions. This approach optimizes storage and processing speed by calibrating the model's loss function to prioritize performance across varying output dimensions. Conversely, binary quantization reduces the size of each dimension by converting float32 values to binary, significantly enhancing memory and disk space efficiency while retaining high performance.
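A minimal sketch of the two steps with numpy, using random stand-in vectors; with an MRL-trained model the leading dimensions carry the most information, so truncation is cheap:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for float32 embeddings from an MRL-trained model (10k vectors, 1024 dims).
embeddings = rng.normal(size=(10_000, 1024)).astype(np.float32)

# 1) MRL: keep only the leading 512 dimensions.
truncated = embeddings[:, :512]

# 2) Binary quantization: one bit per dimension (the sign), packed 8 dims per byte.
binary = np.packbits((truncated > 0).astype(np.uint8), axis=1)

print(embeddings.nbytes // binary.nbytes)  # 64: 2x from truncation * 32x from binarization

# Retrieval over binary codes uses Hamming distance (XOR + popcount) instead of cosine similarity.
query = binary[0]
hamming_distances = np.unpackbits(binary ^ query, axis=1).sum(axis=1)
print(hamming_distances[:5])
```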
By combining these methods, the aptly named Binary MRL approach demonstrates that it is possible to truncate dimensions and reduce per-dimension size simultaneously, achieving substantial efficiency gains without major performance sacrifices. Applied to the Mixedbread embedding model for which it was originally developed, this method retains 90% of performance at a 64x efficiency gain, dramatically lowering infrastructure costs in cloud computing and vector databases. During benchmarking with binary quantization, we confirmed that our German embedding model keeps 91.8% of its performance while becoming 32x more efficient through quantization alone. Using the MRL approach and tracking performance over different dimensionalities yields the following result:
As the graph shows, reducing the vector size by 25% still preserves 97.5% of model performance. In our view, a particularly interesting trade-off emerges at 512 dimensions, where over 93% of model performance remains while embedding sizes are cut in half, with the associated benefits for infrastructure needs and cost.
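A back-of-the-envelope calculation, with an illustrative corpus size, shows where the headline savings come from:

```python
# Hypothetical corpus of 1 million documents.
n_docs = 1_000_000
full_precision = n_docs * 1024 * 4   # float32, 1024 dims  -> ~4.1 GB
binary_mrl = n_docs * 512 // 8       # 1 bit/dim, 512 dims -> ~64 MB

print(full_precision / binary_mrl)   # 64x smaller
print(1 - binary_mrl / full_precision)  # ~0.984 -> roughly 98% of embedding storage saved
```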
Using it in action
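Below is a minimal usage sketch with the sentence-transformers library; please consult the model card for the recommended settings, since the model is initialized from multilingual-e5-large, whose family typically expects "query: "/"passage: " prefixes:

```python
from sentence_transformers import SentenceTransformer, util

# Minimal sketch; check the model card for the recommended prefixes and settings.
model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

query = "query: Wie hoch ist der reguläre Mehrwertsteuersatz in Deutschland?"
passages = [
    "passage: Der reguläre Mehrwertsteuersatz in Deutschland beträgt 19 Prozent.",
    "passage: Die Hauptstadt von Deutschland ist Berlin.",
]

query_embedding = model.encode(query, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)

scores = util.cos_sim(query_embedding, passage_embeddings)
print(scores)  # the first passage should score highest for this query
```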
Give us feedback
We hope you enjoy the new strongest open-source German embedding model as much as we do, and we welcome any feedback that helps us improve and refine our models' user-friendliness or capabilities. Please let us know if you're yearning for any new features, want to tell us about an interesting use case, or have encountered any issues. We value your feedback!
Please share your feedback and thoughts through the Mixedbread Discord or the Haystack Discord. We are here to help and also always happy to chat about the exciting field of machine learning!
Citation
Thank you
To NVIDIA for providing us with cutting-edge computational resources. All of our training and evaluation was done on an NVIDIA DGX with 8x A100 GPUs, which they generously sponsored. This kind of support is invaluable to us, and we're truly grateful to the NVIDIA team for helping to make this project possible.