August 23, 2024

Getting Better with Baguetter - New Retrieval Testing Framework

Authors

  • Julius Lipp
  • Aamir Shakir

We are excited to introduce Baguetter, our new open-source retrieval testing framework! Baguetter combines sparse, dense, and hybrid retrieval into a single, easy-to-use interface. It enables fast benchmarking, implementation, and testing of new search methods. Read on to learn more about our approach and to check out some usage examples. If you want to skip straight to the framework, head over to the Baguetter repository.

Why Baguetter?

While traditional search remains the foundation of most information retrieval systems, semantic search is gaining more and more traction. We're committed to advancing retrieval systems, but we identified a gap in our toolkit: we lacked a testing framework that could handle both traditional (sparse) and semantic (dense) search methodologies.

Existing solutions often focus on a single method, are hard to extend, or come with large codebases. We encountered these limitations firsthand in our own work, especially while experimenting with various fusion algorithms for hybrid search (a minimal sketch of one such algorithm follows the list below). To address these challenges, we created Baguetter. It offers:

  1. Easy extensibility
  2. Support for lexical, dense, and hybrid search, as well as embeddings quantization and reranking
  3. Pure Python implementation with an easy-to-use API
  4. Good performance
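
Since fusion for hybrid search was one of the motivations above, here is a minimal, framework-agnostic sketch of reciprocal rank fusion (RRF), a common way to combine a lexical and a dense result list. It illustrates the general idea only and is not Baguetter's built-in fusion implementation.

from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids: each document receives 1 / (k + rank)
    from every list it appears in, and documents are sorted by the summed score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example: fuse a lexical ranking with a dense ranking of the same documents
lexical_ranking = ["1", "2", "3", "4"]
dense_ranking = ["3", "1", "4", "2"]
print(reciprocal_rank_fusion([lexical_ranking, dense_ranking]))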

Building on Solid Foundations

Baguetter is a fork of retriv, an open-source project created by Elias Bassani. We've adapted retriv to make it more flexible and added efficient implementations of keyword search algorithms, mainly BM25 and BMX. For dense search, we build on established embedding and vector search libraries. This combination allows us to explore a wide range of possibilities, from modifying BM25 tokenization to testing embedding model performance with quantization.
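
To give a flavor of the first and last of those possibilities, here are two small, self-contained helpers: a custom tokenizer of the kind you might swap into a BM25-style index, and binary quantization of float embeddings. How these plug into Baguetter's index configuration is not shown here; the snippets only illustrate the underlying techniques.

import re

import numpy as np

def simple_tokenizer(text):
    """Lowercase and split on alphanumeric runs -- a candidate for custom BM25 tokenization."""
    return re.findall(r"[a-z0-9]+", text.lower())

def binary_quantize(embeddings):
    """Quantize float embeddings to packed binary vectors (1 bit per dimension, sign-based)."""
    return np.packbits(np.asarray(embeddings) > 0, axis=-1)

print(simple_tokenizer("We all love baguette and cheese!"))
print(binary_quantize(np.random.randn(2, 16)).shape)  # (2, 2): 16 dims packed into 2 bytes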

Using It

Getting started with Baguetter is straightforward:

pip install baguetter

Baguetter's strength lies in its unified interface, allowing easy evaluation of different search methods:

from baguetter.indices import BMXSparseIndex
 
# Create a simple sparse index
index = BMXSparseIndex()
 
# Sample documents
doc_ids = ["1", "2", "3", "4"]
docs = [
    "We all love baguette and cheese",
    "Baguette is a great bread",
    "Cheese is a great source of protein",
    "Baguette is a great source of carbs",
]
 
# Add documents to the index
index.add_many(doc_ids, docs)
 
# Perform a search
results = index.search("baguette and cheese")
print(results) # SearchResults(keys=['1', '3', '2', '4'], scores=array([1.888962  , 0.73008496, 0.60060966, 0.5797755 ], dtype=float32), normalized=False)
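
As a small follow-up, you can pair the returned keys with their scores to get back to the original documents. This assumes the keys and scores shown in the printed SearchResults above are accessible as attributes.

# Map ids back to documents and print the ranked results with their scores
doc_lookup = dict(zip(doc_ids, docs))
for key, score in zip(results.keys, results.scores):
    print(f"{score:.3f}  {doc_lookup[key]}")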

Evaluation

Understanding Evaluation Metrics

Baguetter reports several standard retrieval metrics during evaluation. These metrics capture different aspects of a search engine's effectiveness and can guide optimization efforts (a hand-computed toy example follows the list below). The key metrics include:

  1. Normalized Discounted Cumulative Gain (nDCG@k): A metric that takes into account both the position and the relevance grade of retrieved documents. It's particularly useful for tasks with graded relevance judgments and for tasks where the order of results matters.

  2. Precision@k: Measures the proportion of relevant documents in the top k results. This is valuable for ensuring that a high percentage of top results are relevant.

  3. Mean Reciprocal Rank (MRR): Focuses on the position of the first relevant document in the ranked list. This metric is useful for tasks where finding a single correct answer quickly is important.

  4. Recall@k: Calculates the proportion of all relevant documents retrieved in the top k results. This metric is critical for tasks where finding as many relevant documents as possible is important.
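
To ground these definitions, here is a self-contained toy computation of all four metrics for a single query, using the linear-gain form of DCG. It is meant purely as an illustration of what the numbers measure; Baguetter's evaluation reports the same kinds of values averaged over many queries.

import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document (0 if none is found)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """nDCG@k with linear gains: gain / log2(rank + 1), normalized by the ideal ordering."""
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1) for rank, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

ranked = ["2", "1", "3", "4"]          # system output for one query
gains = {"1": 2, "3": 1}               # graded relevance judgments
relevant = set(gains)

print(precision_at_k(ranked, relevant, 3))   # 0.666...
print(recall_at_k(ranked, relevant, 3))      # 1.0
print(reciprocal_rank(ranked, relevant))     # 0.5
print(round(ndcg_at_k(ranked, gains, 3), 3)) # ~0.67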

When interpreting these results, consider the following:

  • Align the metrics with your use case. Different applications may benefit from using different metrics.
  • Be aware of trade-offs between metrics: improvements in one metric may come at the cost of another. For example, tuning for high precision can reduce recall. Understanding these trade-offs is crucial when optimizing your system for real-world performance.

Setting Up Evaluation Datasets

Text information retrieval datasets typically consist of three key components:

  1. A collection of documents with associated ids (corpus)
  2. A set of queries with associated ids (queries)
  3. Query-document relevance judgments (qrels), which serve as the ground truth
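
To make the structure concrete, here is a minimal hand-built toy example of the three components as plain Python mappings. It only illustrates the shape of the data and is not Baguetter's internal dataset format.

# A tiny hand-built retrieval dataset
corpus = {
    "d1": "Baguette is a great bread",
    "d2": "Cheese is a great source of protein",
}
queries = {
    "q1": "what is a good source of protein?",
}
qrels = {
    "q1": {"d2": 1},   # for query q1, document d2 is relevant (grade 1)
}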

While you would typically build a dataset yourself to judge the performance of your search engine, in this case we're using existing, publicly available benchmark datasets.

Baguetter provides a wrapper for loading datasets from the Hugging Face Hub:

from baguetter.evaluation import HFDataset
 
datasets = [
    HFDataset("mteb/scifact"),
    HFDataset("mteb/scidocs")
]

Running Evaluations

Once your datasets are ready, you can evaluate different retrieval methods. In the following example, we're comparing BMX and BM25 sparse retrieval, but you can use any index that extends the BaseIndex interface.

from baguetter.evaluation import evaluate_retrievers
from baguetter.indices import BMXSparseIndex, BM25SparseIndex
 
# Factories that create the indices to compare
bmx_factory = lambda: BMXSparseIndex()
bm25_factory = lambda: BM25SparseIndex()
 
# Evaluate
result = evaluate_retrievers(
    datasets=datasets,
    retriever_factories={
        "bmx": bmx_factory,
        "bm25": bm25_factoryter
    },
    metrics=["ndcg@1", "ndcg@5", "ndcg@10", "precision@1", "precision@5", "precision@10", "mrr@1", "mrr@5", "mrr@10"],
)
result.save("eval_results")
#Report for mteb/scifact (rounded):
#---------------------------------------------------------------
#    Model      NDCG@1    NDCG@5    NDCG@10    P@1    P@5    P@10    MRR@1    MRR@5    MRR@10
#---  -------  --------  --------  ---------  -----  -----  ------  -------  -------  --------
#a    bmx          0.55     0.663      0.694   0.55  0.161   0.093     0.55    0.643     0.655
#b    bm25         0.54     0.66       0.687   0.54  0.161   0.091     0.54    0.638     0.649
#....

This will run the evaluation and save the results to the specified directory for you to analyze.

Feedback

We welcome your feedback and thoughts through our community channels. We're here to assist and discuss topics related to information retrieval.