deepset-mxbai-embed-de-large-v1
Related resources:
- API Reference: Embeddings
- Model Reference: deepset-mxbai-embed-de-large-v1
- Blog Post: German/English Embeddings with all the goodies
Model Description
deepset-mxbai-embed-de-large-v1 is a powerful German/English embedding model developed through collaboration between deepset and Mixedbread. It sets a new performance standard among open-source embedding models, outperforming domain-specific alternatives in real-world applications.
The model was initialized from the multilingual-e5-large model and fine-tuned on over 30 million pairs of high-quality German data using the AnglE loss function. This extensive training enables the model to adapt to a wide range of topics and domains, making it suitable for various real-world applications and Retrieval-Augmented Generation (RAG) use cases.
deepset-mxbai-embed-de-large-v1 supports both binary quantization and Matryoshka representation learning (MRL). This allows for significant reductions in storage and infrastructure costs, with the potential for 97%+ cost savings through binary MRL.
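To make the cost claim concrete: a full 1024-dimension float32 embedding occupies 4,096 bytes, while a 512-dimension binary embedding occupies 64 bytes, a reduction of about 98%. The following NumPy sketch illustrates the combination (truncate first per MRL, then binarize by sign); it is an illustration of the general technique, not the library's exact quantization routine:

```python
import numpy as np

def binary_mrl(embedding: np.ndarray, dim: int = 512) -> np.ndarray:
    """Truncate a Matryoshka embedding to `dim` dimensions, then binarize.

    Illustrative sketch: truncation works because MRL training concentrates
    information in the leading dimensions; sign-based thresholding is a
    common binary-quantization scheme.
    """
    truncated = embedding[:dim]              # MRL: keep the leading dimensions
    bits = (truncated > 0).astype(np.uint8)  # 1 bit per dimension instead of 32
    return np.packbits(bits)                 # dim / 8 bytes of storage

full = np.random.randn(1024).astype(np.float32)  # stand-in for a real embedding
compact = binary_mrl(full)
print(full.nbytes, "->", compact.nbytes, "bytes")  # 4096 -> 64 bytes
```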
The model achieves top performance on various benchmarks, including private and public datasets created in collaboration with deepset's clients. It demonstrates strong performance across diverse tasks, showcasing its versatility and robustness.
| Layers | Embedding Dimension | Recommended Sequence Length | Language |
|---|---|---|---|
| 24 | 1024 | 512 | German/English |
Using a Prompt
For retrieval tasks, the query should be preceded by the prompt `query: `, and passages by the prompt `passage: `. For other tasks, the text can be used as-is without any additional prompts.

The `prompt` parameter is available via our `/embeddings` endpoint, SDKs, and some third-party integrations to automatically prepend the prompt to the texts for you. By default, we calculate the embeddings using the provided text directly.
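A minimal sketch of calling the endpoint with the `prompt` parameter via `requests`; the base URL, auth header, payload, and response shape below are assumptions on my part, so check the API Reference for the exact format:

```python
import os
import requests

# Assumed endpoint URL and request/response shape; verify against the API Reference.
resp = requests.post(
    "https://api.mixedbread.com/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['MXBAI_API_KEY']}"},
    json={
        "model": "mixedbread-ai/deepset-mxbai-embed-de-large-v1",
        "input": ["Wie hoch ist die Zugspitze?"],
        "prompt": "query: ",  # prepended to every input text for you
    },
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]  # assumed response shape
print(len(embedding))  # expected: 1024
```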
Suitable Scoring Methods
- Cosine Similarity: Ideal for measuring the similarity between text vectors, commonly used in tasks like semantic textual similarity and information retrieval.
- Euclidean Distance: Useful for measuring dissimilarity between embeddings, especially effective in clustering and outlier detection.
- Dot Product: Appropriate when embeddings are normalized, in which case it coincides with cosine similarity; used in tasks where alignment of vector orientation is critical. All three scores are compared in the sketch below.
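A small self-contained NumPy comparison on toy vectors (illustrative values, not real embeddings):

```python
import numpy as np

a = np.array([0.2, 0.8, 0.4])
b = np.array([0.1, 0.9, 0.3])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # similarity in [-1, 1]
euclidean = np.linalg.norm(a - b)                         # dissimilarity, >= 0
dot = a @ b                                               # equals cosine for unit-normalized vectors

print(f"cosine={cosine:.4f}  euclidean={euclidean:.4f}  dot={dot:.4f}")
```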
Limitations
- Language: deepset-mxbai-embed-de-large-v1 is primarily designed for German and English.
- Sequence Length: The suggested maximum sequence length is 512 tokens. Longer inputs may be truncated, leading to a loss of information; a quick length check is sketched below.
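One way to check inputs against that limit, assuming the model's tokenizer is available on the Hugging Face Hub under the id used below (an assumption on my part):

```python
from transformers import AutoTokenizer

# Assumed Hugging Face model id; any tokenizer from the same model family
# (multilingual-e5-large) would give comparable counts.
tokenizer = AutoTokenizer.from_pretrained("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

text = "Ein sehr langer deutscher Text ..."
n_tokens = len(tokenizer(text)["input_ids"])
if n_tokens > 512:
    print(f"{n_tokens} tokens: the input exceeds the recommended length and may be truncated.")
```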
Examples
Calculate Sentence Similarities
The following code illustrates how to compute similarities between sentences using the cosine similarity score function:
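A minimal sketch with the sentence-transformers library, assuming the model is published on the Hugging Face Hub as `mixedbread-ai/deepset-mxbai-embed-de-large-v1`:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Assumed Hugging Face model id for this model.
model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

query = "query: Warum sollte man biologisches Brot kaufen?"
passages = [
    "passage: In Deutschland gibt es viele Bäckereien, die traditionelles Brot backen.",
    "passage: Biologisches Brot wird ohne synthetische Pestizide und Düngemittel hergestellt.",
]

query_embedding = model.encode(query)
passage_embeddings = model.encode(passages)

# Cosine similarity between the query and each passage.
scores = cos_sim(query_embedding, passage_embeddings)
print(scores)
```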
This example demonstrates how to use the model for calculating similarities between a query and multiple passages, which is a common task in information retrieval and semantic search applications.