August 12, 2024

BM𝒳: A Freshly Baked Take on BM25


Authors

  • Sean Lee
  • Aamir Shakir
  • Julius Lipp
  • Rui Huang

We are proud to announce that researchers from Mixedbread and the Hong Kong Polytechnic University have developed a new lexical search algorithm, BMX, that outperforms the current standard BM25 across the board and is easy to use via Mixedbread's open-source Baguetter library.

Read on to learn about BMX, how it works, our benchmarks, and how to use it in practice. If you want to jump right in, you can look into the paper and the library right here:

  • Paper: Our research paper covering the approach in more academic detail.
  • Baguetter: Our reference implementation of BMX and an easy-to-hack testing framework for information retrieval.

What is BMX?

Most text search engines are powered by lexical (keyword) search algorithms, the most prominent of which is BM25. It powers web search, e-commerce, legal research, and many other applications. A key strength of BM25 is that it performs very well in an out-of-distribution setting, meaning that it generalizes well to data it has not seen before. However, the keyword search approach comes with its own limitations:

  1. BM25 does not take the overall similarity between the query and a given document into account, even though doing so could enable a more accurate assessment of the document's relevance to the query.
  2. Lexical search algorithms lack semantic understanding and are therefore unable to deal with linguistic nuances like synonyms and homonyms. This limitation is a key factor in the underperformance of lexical search compared to domain-specific text embedding-based semantic search.
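The second limitation is easy to demonstrate: a purely lexical matcher scores documents by literal token overlap, so a synonym contributes nothing. A minimal illustration (plain set overlap, our own toy example rather than any particular BM25 implementation):

```python
def lexical_overlap(query: str, doc: str) -> int:
    """Count how many query tokens literally appear in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens)

doc = "freshly baked sourdough loaf with a crispy crust"
print(lexical_overlap("sourdough loaf", doc))   # 2 -- both tokens match
print(lexical_overlap("sourdough bread", doc))  # 1 -- "bread" is a synonym of "loaf", but matches nothing
```

A semantic model would treat "bread" and "loaf" as near-equivalent; a purely lexical matcher cannot.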

To tackle these challenges, we propose BMX, a lexical search algorithm that incorporates both similarity and semantics. It comes with these key innovations:

  1. Entropy-weighted similarity: We adjust the similarity scores between query tokens and related documents based on the entropy of each token. This approach reduces the influence of common tokens, allowing less frequent but more meaningful tokens to have a greater impact on the similarity score.

  2. Weighted query augmentation (WQA): To incorporate semantics into lexical search, we developed a method that processes both the original query and its augmented versions simultaneously within BMX. This approach removes the need for multiple retrievals and reranking steps, leading to increased efficiency.
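To build intuition for the first idea, consider how per-token entropy can serve as a weight. The sketch below is our own illustration of the principle (the exact entropy definition BMX uses is given in the paper): a token that appears in nearly every document carries almost no information, so its entropy, and therefore its weight, approaches zero.

```python
import math
from collections import Counter

corpus = [
    "crusty sourdough bread",
    "soft white bread",
    "seeded rye bread",
    "sweet banana bread",
]

# Document frequency: in how many documents each token appears.
df = Counter(tok for doc in corpus for tok in set(doc.split()))
n_docs = len(corpus)

def token_entropy(token: str) -> float:
    """Entropy of a token's presence/absence across the corpus."""
    p = df[token] / n_docs
    if p in (0.0, 1.0):
        return 0.0  # token is everywhere (or nowhere): zero information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(token_entropy("bread"))      # 0.0 -- occurs in every document
print(token_entropy("sourdough"))  # ~0.81 -- rare, hence discriminative
```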

How does BMX work?

Now, we will explore the core scoring algorithm in BMX as well as weighted query augmentation (WQA). This will be a math-heavy section, but please bear with us!
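Before diving into the details, it helps to recall the classic BM25 scoring function, which BMX builds on (the formula below is the standard one; the exact BMX extension is derived in the paper):

$$\text{score}(Q, D) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot \frac{f(q, D) \cdot (k_1 + 1)}{f(q, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Here, $f(q, D)$ is the frequency of query token $q$ in document $D$, $|D|$ is the document length, $\text{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are free parameters. Roughly speaking, BMX keeps this term-frequency backbone and augments each token's contribution with an entropy-weighted query-document similarity term; please see the paper for the exact formulation.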

Performance and Benchmarking

To validate our ideas and assess the performance of BMX, we evaluated it extensively on different benchmarks, including BEIR, BRIGHT, and several multilingual retrieval datasets. You can find a full overview of our benchmarking in the paper. Our results show that BMX outperforms BM25 across the board and can offer a significant improvement in retrieval quality.

BEIR

BEIR is a benchmark focused on out-of-domain information retrieval. It covers datasets spanning different domains, document and query lengths, and tasks. After evaluating on the 15 available datasets, we compare BMX to the BM25 implementations in BM25S and in the Baguetter library:

| Model | AA | SD | SF | NFC | TCV | TCH | FQA | HQA | MM | FVR | NQ | DBP | QRA | CF | CQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **BM25S (k1=1.2, b=0.75)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Robertson | 49.2 | 15.5 | 68.3 | 31.9 | 59.0 | 33.8 | 25.4 | 58.5 | 22.6 | 50.3 | 29.2 | 30.3 | 80.4 | 13.7 | 29.9 | 39.87 |
| ATIRE | 48.7 | 15.6 | 68.1 | 31.8 | 61.0 | 33.2 | 25.3 | 58.5 | 22.6 | 50.3 | 29.1 | 30.3 | 80.5 | 13.7 | 30.1 | 39.92 |
| BM25+ | 48.7 | 15.6 | 68.1 | 31.8 | 61.0 | 33.2 | 25.3 | 58.5 | 22.6 | 50.3 | 29.1 | 30.3 | 80.5 | 13.7 | 30.1 | 39.92 |
| BM25L | 49.6 | 15.8 | 68.7 | 32.2 | 62.9 | 33.0 | 25.0 | 55.9 | 21.4 | 46.6 | 28.1 | 29.4 | 80.3 | 13.5 | 29.8 | 39.48 |
| BM25 | 48.7 | 15.6 | 68.0 | 31.8 | 61.0 | 33.2 | 25.3 | 58.5 | 22.6 | 50.3 | 29.1 | 30.3 | 80.5 | 13.7 | 30.1 | 39.91 |
| **Baguetter (k1=1.2, b=0.75)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Robertson | 49.55 | 15.69 | 68.71 | 32.85 | 64.36 | 30.95 | 25.27 | 59.13 | 23.19 | 50.93 | 28.55 | 31.69 | 78.73 | 13.54 | 30.73 | 40.26 |
| ATIRE | 49.23 | 15.66 | 68.73 | 32.71 | 65.94 | 31.15 | 25.30 | 59.13 | 23.23 | 50.98 | 28.59 | 31.56 | 78.72 | 13.55 | 31.00 | 40.37 |
| BM25+ | 49.23 | 15.66 | 68.73 | 32.71 | 65.94 | 31.15 | 25.30 | 59.13 | 23.23 | 50.98 | 28.59 | 31.56 | 78.72 | 13.55 | 31.00 | 40.37 |
| BM25L | 50.32 | 15.88 | 69.37 | 33.00 | 67.05 | 31.52 | 24.79 | 56.35 | 22.01 | 47.37 | 27.35 | 30.87 | 78.21 | 13.29 | 30.87 | 39.88 |
| BM25 | 49.32 | 15.65 | 68.67 | 32.68 | 65.78 | 31.11 | 25.26 | 59.09 | 23.24 | 50.98 | 28.85 | 31.56 | 78.73 | 13.55 | 30.99 | 40.36 |
| **BMX** | 50.46 | 15.91 | 69.42 | 32.84 | 68.1 | 34.34 | 25.39 | 61.73 | 24.21 | 55.75 | 29.84 | 32.2 | 78.56 | 13.32 | 30.77 | **41.52** |

As the table shows, BMX generally outperforms all BM25 variants, achieving the best results on 11 out of 15 datasets. In our view, this underscores the importance of incorporating query-document similarity in our search algorithm.

Beyond that, we also observe that embedding-based models consistently outperform lexical models, which can be attributed to their powerful semantic understanding. However, lexical search algorithms remain far more widely used, because they require only a fraction of the computational resources.

BRIGHT

BRIGHT was introduced as the first text retrieval benchmark requiring intensive reasoning to retrieve the relevant documents. The benchmark includes 1,385 real-world queries across diverse domains such as StackExchange, LeetCode, and math competitions. These queries are paired with, for example, relevant web pages linked in StackExchange answers and theorems tagged in math Olympiad questions, all of which necessitate deliberate reasoning to identify their connections to the respective queries. For this evaluation, we also took a look at the performance of BMX with WQA.

| Model | BIO | ES | ECO | PSY | ROB | SO | SL | LC | PONY | AOPS | TQ | TT | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Embedding Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| E5-Mistral-7B | 18.80 | 26.00 | 15.50 | 15.80 | 16.40 | 9.80 | 18.50 | 28.70 | 4.80 | 7.10 | 23.90 | 25.10 | 17.50 |
| **Proprietary Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| OpenAI | 23.70 | 26.30 | 20.00 | 27.50 | 12.90 | 12.50 | 20.30 | 23.60 | 2.50 | 8.50 | 22.20 | 10.80 | 17.57 |
| Cohere | 19.00 | 27.50 | 20.20 | 21.80 | 16.20 | 16.50 | 17.70 | 26.80 | 1.80 | 6.50 | 15.10 | 7.10 | 16.35 |
| Voyage | 23.60 | 25.10 | 19.80 | 24.80 | 11.20 | 15.00 | 15.60 | 30.60 | 1.50 | 7.40 | 26.10 | 11.10 | 17.65 |
| **Lexical Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| BM25 (k1=0.9, b=0.4) | 19.25 | 29.67 | 16.89 | 15.77 | 14.19 | 15.22 | 15.30 | 25.05 | 4.62 | 8.28 | 6.57 | 2.34 | 14.43 |
| BM25 (k1=0.9, b=0.4) + WQA | 22.9 | 33.86 | 17.2 | 24.33 | 14.54 | 20.21 | 16.01 | 26.61 | 6.03 | 8.41 | 10.50 | 3.29 | 16.99 |
| BMX (α=0.05) | 26.4 | 33.55 | 13.29 | 16.06 | 14.68 | 13.80 | 15.47 | 25.35 | 10.91 | 9.18 | 8.17 | 2.38 | 15.77 |
| BMX (α=0.05) + WQA | 31.57 | 40.24 | 17.40 | 23.53 | 16.67 | 18.32 | 15.05 | 28.34 | 10.82 | 9.22 | 9.98 | 2.74 | **18.66** |

The table demonstrates that BMX with weighted query augmentation outperforms all baselines, including embedding models and other lexical models. We attribute this to BMX's ability to generalize effectively across various domains, while embedding models often struggle with out-of-domain data. The results highlight that the WQA mechanism successfully enhances BMX’s semantic understanding, enabling it to handle realistic retrieval scenarios more effectively than alternative solutions. Also, BMX achieves this without the need for costly training on massive high-quality datasets, as would be required for embedding models.
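To make the WQA mechanism concrete, here is a deliberately simplified sketch. It is our own illustration, not Baguetter's API: `overlap` is a toy stand-in for a real lexical scorer, and the augmented queries and weights are hypothetical.

```python
def wqa_score(scorer, doc: str, query: str, augmented, alpha: float = 0.5) -> float:
    """Score one document against the original query plus weighted,
    down-scaled augmented variants (e.g. paraphrases) in a single pass."""
    total = float(scorer(query, doc))
    for aug_query, weight in augmented:
        total += alpha * weight * scorer(aug_query, doc)
    return total

# Toy lexical scorer: number of shared tokens.
def overlap(q: str, d: str) -> int:
    return len(set(q.split()) & set(d.split()))

doc = "crusty sourdough bread"
augmented = [("sourdough loaf", 1.0)]  # hypothetical paraphrase of the query
print(wqa_score(overlap, doc, "fresh bread", augmented))  # 1.0 + 0.5 * 1.0 * 1 = 1.5
```

The augmented variants let semantically related phrasings contribute to the score without a second retrieval pass, which is where the efficiency gain over retrieve-then-rerank pipelines comes from.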

Multilingual Benchmark

To verify the performance of BMX on multilingual data, we also performed benchmarking on datasets for Chinese, Japanese, Korean, German, and French retrieval tasks.

| Model | MMarcoRetrieval (Chinese) | JaQuAD (Japanese) | StrategyQA (Korean) | GermanDPR (German) | FQuAD (French) | Avg. |
|---|---|---|---|---|---|---|
| BM25 | 47.41 | 54.26 | 33.89 | 52.27 | 91.80 | 55.93 |
| BMX | 47.94 | 54.63 | 35.72 | 53.58 | 91.92 | **56.76** |

The table presents the evaluation results, demonstrating that BMX consistently outperforms BM25 across these multilingual datasets. This suggests the increased effectiveness of BMX in handling diverse multilingual information retrieval tasks.

Try it out

Installation

Before you start, make sure to install the Baguetter library:

pip install baguetter

Usage

The following examples show how to use BMX in practice:

from baguetter.indices import BMXSparseIndex
 
# Initialize BM𝒳 index
bmx = BMXSparseIndex()
 
# Add bakery items to the index
docs = [
    "Freshly crusty baked sourdough bread with a crispy crust",
    "Flaky croissants made with French butter",
    "Chocolate chip cookies with chunks of dark chocolate",
    "Cinnamon rolls with cream cheese frosting",
    "Artisanal baguettes with a soft interior and crusty exterior"
]
keys = list(range(len(docs)))
 
bmx.add_many(keys=keys, values=docs)
 
# Search for bread
query = "crusty bread"
results = bmx.search(query, top_k=2)
 
print(results)
# SearchResults(keys=[0, 4], scores=array([2.5519667 , 0.97304875], dtype=float32), normalized=False)

Why does this matter?

Improving the quality of lexical search without a significant increase in complexity or computational resource requirements could lead to a range of beneficial outcomes. Using BMX could improve the user experience of any application that relies on search algorithms. We also expect performance benefits for natural language processing applications that include our algorithm as part of their pipeline.

We're excited to hear about the community's opinions on BMX and where this algorithm could be used in the future. If you want to give us feedback or discuss anything search- or machine learning-related, please come join our community.

Cite As:
@article{li2024bmx,
      title={BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search},
      author={Xianming Li and Julius Lipp and Aamir Shakir and Rui Huang and Jing Li},
      year={2024},
      eprint={2408.06643},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2408.06643},
}