mxbai-colbert-large-v1
A state-of-the-art ColBERT model for reranking and retrieval tasks. This model combines efficient vector search with nuanced token-level matching, making it ideal for advanced information retrieval applications.
API Reference: currently not available via API
Model Reference: mxbai-colbert-large-v1
Blog Post: ColBERTus Maximus - Introducing mxbai-colbert-large-v1
Model description
mxbai-colbert-large-v1 is a state-of-the-art ColBERT (Contextualized Late Interaction BERT) model for reranking and retrieval tasks. It is based on the mxbai-embed-large-v1 model and achieves state-of-the-art performance across 13 publicly available BEIR benchmark tasks.
ColBERT combines the benefits of vector search and cross-encoders. Queries and documents are encoded separately, but instead of creating a single embedding for the entire document, ColBERT generates contextualized embeddings for each token in the document. During search, the token-level query embeddings are compared with the token-level embeddings of the documents using the lightweight scoring function MaxSim. This allows ColBERT to capture nuanced matching signals while being computationally efficient.
mxbai-colbert-large-v1 is initialized from the mxbai-embed-large-v1 model, which was trained on over 700 million samples from various domains. The ColBERT model was then fine-tuned on around 96 million samples to adapt it to the late interaction mechanism. This extensive training enables the model to be used for a wide range of tasks and domains.
On the BEIR benchmark, mxbai-colbert-large-v1 outperforms other ColBERT models on average and in most individual tasks. Its reranking score even surpasses typical scores of cross-encoder-based reranker models on the benchmark, despite the ColBERT architecture's much lower resource requirements. The model also achieves state-of-the-art retrieval performance compared to other currently available ColBERT models.
Layers | Embedding Dimension | Recommended Sequence Length | Language
---|---|---|---
24 | 1024 | 512 | English
Suitable Scoring Methods
- MaxSim: The lightweight scoring function used in ColBERT to compare token-level query embeddings with token-level document embeddings.
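The MaxSim operation described above can be sketched in plain Python: for each query token embedding, take its maximum similarity against all document token embeddings, then sum over the query tokens. The 2-dimensional "embeddings" below are hypothetical toy values; real implementations use normalized, high-dimensional embeddings and batched matrix operations.

```python
def dot(u, v):
    # Dot-product similarity between two token embeddings.
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_embs, doc_embs):
    # For each query token, keep only its best-matching document token,
    # then sum those maxima to get the document's relevance score.
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

# Toy token embeddings (hypothetical values, 2 dimensions for readability):
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

score = maxsim(query, doc)  # 0.9 (first query token) + 0.8 (second) = 1.7
```

Because each query token only needs a max over precomputed document token embeddings, scoring stays cheap at search time, unlike a cross-encoder, which must re-process every query-document pair jointly.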
Limitations
- Language: mxbai-colbert-large-v1 was trained on English text and is designed for English-language use.
- Sequence Length: Any text longer than 512 tokens will be truncated.
Examples
We recommend using RAGatouille to work with our ColBERT model.
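A minimal reranking sketch with RAGatouille (assuming the `ragatouille` package is installed; the query and documents below are hypothetical):

```python
from ragatouille import RAGPretrainedModel

# Load the ColBERT model from the Hugging Face Hub.
RAG = RAGPretrainedModel.from_pretrained("mixedbread-ai/mxbai-colbert-large-v1")

query = "What is the capital of France?"
documents = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is located in Paris.",
]

# Rerank the candidate documents against the query without building an index.
results = RAG.rerank(query=query, documents=documents, k=3)
for result in results:
    # Each result is a dict containing the document text and its score.
    print(result)
```

RAGatouille also supports building a persistent index over a larger corpus and searching it, which is the more common setup for full retrieval rather than reranking a small candidate set.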