August 23, 2024
Getting Better with Baguetter - New Retrieval Testing Framework
We are excited to introduce Baguetter, our new open-source retrieval testing framework! Baguetter combines sparse, dense, and hybrid retrieval into a single, easy-to-use interface. It enables fast benchmarking, implementation, and testing of new search methods. Read on to learn more about our approach and to check out some usage examples. If you want to skip right to the framework instead, head over to the Baguetter GitHub repository.
Why Baguetter?
While traditional search remains the foundation of most information retrieval systems, semantic search is gaining more and more traction. We're committed to advancing retrieval systems, but we identified a gap in our toolkit: we lacked a testing framework that could handle both traditional (sparse) and semantic (dense) search methodologies.
Existing solutions often focus on a single retrieval method, are difficult to extend, or come with large codebases. We encountered these limitations firsthand during our work on BMX and while experimenting with various fusion algorithms for hybrid search. To address these challenges, we created Baguetter. It offers:
- Easy extensibility
- Support for lexical, dense, and hybrid search, as well as embeddings quantization and reranking
- Pure Python implementation with an easy-to-use API
- Good performance
Building on Solid Foundations
Baguetter is a fork of retriv, an open-source project created by Elias Bassani. We've adapted retriv to enhance its flexibility and added efficient implementations of keyword search algorithms, mainly BM25S and BMX. For dense search capabilities, we leverage USearch and Faiss. This combination allows us to explore a wide range of possibilities, from modifying BM25 tokenization to testing embedding model performance with quantization.
Using It
Getting started with Baguetter is straightforward:
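Here's a minimal sketch of indexing and searching a handful of documents. The class and method names (BMXSparseIndex, add_many, search) follow the examples in the Baguetter README at the time of writing; check the repository for the exact, up-to-date API.

```python
# pip install baguetter  (assumed PyPI package name; see the repository for install instructions)

from baguetter.indices import BMXSparseIndex

# Create a sparse (keyword) index.
idx = BMXSparseIndex()

# A few toy documents with string ids.
keys = ["doc_1", "doc_2", "doc_3"]
docs = [
    "Baguetter combines sparse, dense, and hybrid retrieval in one interface.",
    "BMX augments BM25 with entropy-weighted similarity.",
    "Benchmarks should be quick to set up and easy to reproduce.",
]

# Index the documents, then query the index.
idx.add_many(keys, docs)
results = idx.search("hybrid retrieval benchmarks")
print(results)
```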
Baguetter's strength lies in its unified interface, allowing easy evaluation of different search methods:
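For instance, the sketch below runs the same calls against two different sparse indices; dense and hybrid indices are designed to plug into the same interface. Class names again follow the repository's examples and may differ in the current release.

```python
from baguetter.indices import BM25SparseIndex, BMXSparseIndex

keys = ["doc_1", "doc_2", "doc_3"]
docs = [
    "Sparse retrieval ranks documents by lexical overlap.",
    "Dense retrieval compares embedding vectors.",
    "Hybrid search fuses both signals.",
]

# Because every index exposes the same interface, comparing methods
# is just a matter of swapping the class being instantiated.
for index_cls in (BM25SparseIndex, BMXSparseIndex):
    idx = index_cls()
    idx.add_many(keys, docs)
    print(index_cls.__name__, idx.search("comparing retrieval signals"))
```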
Evaluation
Understanding Evaluation Metrics
Baguetter uses ranx to compute a range of retrieval metrics. These metrics help assess various aspects of a search engine's effectiveness and can guide optimization efforts. The key metrics, illustrated with a short ranx example after the list, include:
- Normalized Discounted Cumulative Gain (nDCG@k): A metric that takes into account both the position and the relevance grade of retrieved documents. It's particularly useful for tasks with graded relevance judgments and for tasks where the order of results matters.
- Precision@k: Measures the proportion of relevant documents in the top k results. This is valuable for ensuring that a high percentage of top results are relevant.
- Mean Reciprocal Rank (MRR): Focuses on the position of the first relevant document in the ranked list. This metric is useful for tasks where finding a single correct answer quickly is important.
- Recall@k: Calculates the proportion of all relevant documents retrieved in the top k results. This metric is critical for tasks where finding as many relevant documents as possible is important.
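To make these concrete, here is a toy example that computes all four metrics with ranx directly (the same library Baguetter calls under the hood); the qrels and run below are made up for illustration.

```python
from ranx import Qrels, Run, evaluate

# Ground truth: query id -> {document id: relevance grade}
qrels = Qrels({
    "q_1": {"doc_1": 2, "doc_2": 1},
    "q_2": {"doc_3": 1},
})

# System output: query id -> {document id: retrieval score}
run = Run({
    "q_1": {"doc_1": 0.9, "doc_3": 0.8, "doc_2": 0.7},
    "q_2": {"doc_4": 0.9, "doc_3": 0.6},
})

# Compute the metrics discussed above over the two toy queries.
print(evaluate(qrels, run, ["ndcg@3", "precision@3", "mrr", "recall@3"]))
```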
When interpreting these results, consider the following:
- Align the metrics with your use case. Different applications may benefit from using different metrics.
- Be aware of trade-offs between metrics. Improvements in one metric may come at the cost of deteriorations in another. For example, tuning for high precision could potentially reduce recall.
Understanding these trade-offs is crucial when optimizing your system for real-world performance.
Setting Up Evaluation Datasets
Text information retrieval datasets typically consist of three key components, illustrated with a small example after the list:
- A collection of documents with associated ids (corpus)
- A set of queries with associated ids (queries)
- Query-document relevance judgments that serve as ground truth (qrels)
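In plain Python, a tiny made-up dataset with these three components could look like this:

```python
# Corpus: document id -> document text
corpus = {
    "doc_1": "Baguetter unifies sparse, dense, and hybrid retrieval.",
    "doc_2": "BM25 is a classic lexical ranking function.",
}

# Queries: query id -> query text
queries = {
    "q_1": "what is baguetter",
}

# Qrels: query id -> {relevant document id: relevance grade}
qrels = {
    "q_1": {"doc_1": 1},
}
```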
While you would typically build a dataset yourself to judge the performance of your search engine, in this case we're using existing datasets from the MTEB (Massive Text Embedding Benchmark).
Baguetter provides a wrapper for loading datasets from the Hugging Face Hub:
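For example, one of the MTEB retrieval datasets can be wrapped like this. The class name HFDataset and the dataset identifier are drawn from the repository's examples and should be treated as a sketch; see the README for the exact import path and the list of available datasets.

```python
from baguetter.evaluation import HFDataset  # assumed name of the Hugging Face Hub wrapper

# Wrap an MTEB-style retrieval dataset (corpus, queries, and qrels) hosted on the Hub.
# The dataset id below is illustrative; any MTEB retrieval dataset should work.
dataset = HFDataset("mteb/scifact")
```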
Running Evaluations
Once you have your datasets ready, you can evaluate different retrieval methods. In the following example, we're comparing BMX and BM25 sparse retrieval, but you can use any indices that extend the BaseIndex interface.
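A sketch of such a comparison is shown below. The helper evaluate_retrievers, the dict of index constructors, and results.save follow the repository's README example; treat the exact argument names and return type as assumptions and check the current documentation.

```python
from baguetter.evaluation import HFDataset, evaluate_retrievers  # names assumed from the repo examples
from baguetter.indices import BM25SparseIndex, BMXSparseIndex

# Compare BM25 and BMX on a single dataset. Each dict value acts as a factory
# that builds a fresh index for the evaluation run.
results = evaluate_retrievers(
    [HFDataset("mteb/scifact")],
    {"bm25": BM25SparseIndex, "bmx": BMXSparseIndex},
)

# Persist the metrics to a directory for later analysis.
results.save("eval_results")
```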
This will run the evaluation and save the results to the specified directory for you to analyze.
Feedback
We welcome your feedback and thoughts through our Discord community. We're here to assist and to discuss topics related to information retrieval.