By Arthur Câmara

Fine-tuning an LLM for State-of-the-Art retrieval: Zeta Alpha's top-10 submission to the MTEB benchmark

We are excited to introduce Zeta-Alpha-E5-Mistral, our first open model, as a showcase of how to fine-tune LLMs to produce state-of-the-art embeddings. We are proud that, at the moment of submission (5 September 2024), our model landed in the top 10 of this globally competitive benchmark.

MTEB Leaderboard - Retrieval tasks - 12 September 2024.

While extremely large open-source LLMs make the most headlines (like Llama 3.1's 405B-parameter model), their smaller siblings, with fewer than 10B parameters, are quickly becoming one of the most popular and powerful ways to use LLMs in applications.


One of the problems researchers and users are tackling with these LLMs is how to create high-quality embeddings for tasks such as semantic retrieval: making large collections of documents searchable, especially within RAG pipelines.


The most common benchmark for such models, the MTEB leaderboard, measures performance on tasks such as clustering, classification, and, most importantly, retrieval. It shows that pre-trained LLMs can be fine-tuned to produce high-quality embeddings, with many of the best-performing models being based on 7B-parameter LLMs. 

We hope that openly sharing our data and recipes is helpful for others working in a similar direction.


Model selection

Looking at the MTEB benchmark, it is clear that 7B models can produce high-quality embeddings, with Mistral-based models being the most popular choice. We did not want to train a model completely from scratch for this release, so we decided to further fine-tune e5-mistral-7b-instruct, one of the most successful and widely used open embedding models (itself based on Mistral-7B-v0.1), in order to improve its standing on MTEB.


While fine-tuning an already strong model is a good starting point, it also limits some of our choices. For instance, trying to fix the inconsistencies in the instructions used by the model is tricky, as the model already "knows" how to perform retrieval under these instructions (one example of such an inconsistency is that the instructions for STS datasets end with a period, unlike the others).
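To make this concrete, here is a minimal sketch of the E5-Mistral-style query formatting we are referring to; the task descriptions are illustrative, and only queries (not documents) receive an instruction prefix.

```python
def format_query(task_description: str, query: str) -> str:
    """E5-Mistral-style query formatting: only the query side gets an instruction prefix;
    documents are encoded as-is."""
    return f"Instruct: {task_description}\nQuery: {query}"

# Illustrative task descriptions; note the trailing period on the STS one,
# which is the kind of small inconsistency mentioned above.
retrieval_task = "Given a web search query, retrieve relevant passages that answer the query"
sts_task = "Retrieve semantically similar text."

print(format_query(retrieval_task, "how to fine-tune an embedding model for retrieval"))
print(format_query(sts_task, "A man is playing a guitar."))
```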


Another limitation is that some newer models, such as NV-Embed-V2, show that tweaks to the attention mechanism can significantly increase performance, and introducing architectural changes or completely different loss functions into an existing model would require a full retraining process.


Training Dataset

One of the most critical parts of the process was deciding which datasets to use when training the model. While the original E5-Mistral was trained, at least in the first stage, with mostly synthetic data, we iterated on our training data mixture, using only "real" data. In the end, we settled on the following datasets for our training set:


| Dataset | # of samples |
|---|---|
| ArguAna | 4,065 |
| FEVER | 50,000 |
| FIQA | 14,166 |
| HotPotQA | 85,000 |
| MS MARCO (passage) | 200,000 |
| NFCorpus | 4,000 |
| NQ | 100,231 |
| SciFact | 919 |
| NLI | 20,000 |
| SQuAD | 87,417 |
| StackExchange | 100,000 |
| TriviaQA | 20,000 |
| SciRep | 43,000 |
| arXiv-s2s | 34,929 |
| arXiv-p2p | 34,929 |
| BiorXiv-s2s | 4,070 |
| BiorXiv-p2p | 4,070 |
| medRxiv-s2s | 1,160 |
| medRxiv-p2p | 1,160 |
| AmazonCounterfactual | 4,018 |
| AmazonReview | 20,000 |
| Banking77 | 9,926 |
| Emotion | 15,989 |
| MTOPIntent | 9,942 |
| ToxicConversations | 39,999 |
| TweetSentiment | 27,481 |
| IMDb | 14,999 |
| STS12 | 1,850 |
| STS22 | 414 |
| STSBenchmark | 2,777 |

When sampling from non-retrieval datasets, we used a stratified sampling strategy so that the ratio of samples across classes remained consistent with the original dataset.
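As an illustration, here is a minimal sketch of this kind of stratified sampling using pandas; the data frame and column names are hypothetical stand-ins, not our actual pipeline.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, label_col: str, n_samples: int, seed: int = 42) -> pd.DataFrame:
    """Sample roughly n_samples rows while preserving the class ratios of label_col."""
    frac = min(n_samples / len(df), 1.0)
    return df.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)

# Tiny illustrative frame standing in for a classification dataset such as Emotion.
df = pd.DataFrame({
    "text": [f"example {i}" for i in range(1000)],
    "label": [i % 4 for i in range(1000)],
})
subset = stratified_sample(df, "label", n_samples=200)
print(subset["label"].value_counts())  # class ratios match the full frame
```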


The NV-Retriever training mix inspired our selection of training data (with some minor changes to filtering and selection). Of note, we did not use samples from BioASQ, PAQ, or GOOAQ. Instead, we included samples from the search subset of the SciRepEval collection, and removed any queries and documents that may overlap with MTEB's SciDocs test collection to avoid contamination.
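As an illustration of this filtering, here is a minimal sketch that matches on normalized document titles; the dataset id and field names are assumptions, and the actual matching criterion may differ.

```python
from datasets import load_dataset

def normalize(title: str) -> str:
    return " ".join(title.lower().split())

# Titles in the SciDocs test corpus (dataset id and field names are assumptions,
# shown only to illustrate the idea of overlap filtering).
scidocs_corpus = load_dataset("BeIR/scidocs", "corpus", split="corpus")
test_titles = {normalize(t) for t in scidocs_corpus["title"] if t}

def overlaps_scidocs(example: dict) -> bool:
    """True if a training document's title also appears in the SciDocs test corpus."""
    return normalize(example["title"]) in test_titles

# train_set = train_set.filter(lambda ex: not overlaps_scidocs(ex))
```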


Hard negatives

One of the most important decisions when building a training set is how to sample hard negatives. Again, we took inspiration from the work of the NV-Retriever team and used a similar TopK-PercPos strategy for sampling negatives for each query in the retrieval datasets. Instead of naively taking the top-K non-positive documents retrieved by a retriever as hard negatives, we only keep documents whose score is at most 95% of the score of the true positive document. This sampling strategy avoids the problems of too-hard negatives and false negatives.
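Below is a minimal sketch of this filtering step, assuming we already have the retriever's score for the positive document and for the top candidate documents of each query; the function and variable names are illustrative, not our exact implementation.

```python
def select_hard_negatives(
    pos_score: float,
    candidates: list[tuple[str, float]],  # (doc_id, retriever_score), sorted by score, descending
    max_rel_score: float = 0.95,          # negatives must score below 95% of the positive's score
    num_negatives: int = 5,
) -> list[str]:
    """TopK-PercPos-style filtering: keep the highest-scoring candidates that are
    not too close to the positive, to avoid likely false negatives."""
    ceiling = max_rel_score * pos_score
    negatives = [doc_id for doc_id, score in candidates if score <= ceiling]
    return negatives[:num_negatives]

# Illustrative usage: the positive scores 0.80, so anything above 0.76 is discarded.
cands = [("d1", 0.79), ("d2", 0.75), ("d3", 0.70), ("d4", 0.66), ("d5", 0.60), ("d6", 0.55)]
print(select_hard_negatives(0.80, cands))  # ['d2', 'd3', 'd4', 'd5', 'd6']
```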


Due to the expensive annotation step, most retrieval training datasets available today have only a single positive document per query. In reality, however, most queries can be answered by more than one document. Take the TREC-COVID dataset, for example: some queries have over 100 documents marked as "relevant". Avoiding false negatives in the training set is therefore critical.


To sample these negative documents, we relied on an existing powerful embedding model, Snowflake's Arctic-embed-m-v1.5. While a larger model could yield even better hard negatives, the medium size of Snowflake's model allowed us to gather hard negatives without a large GPU budget, which was better spent training our model.


Evaluation Datasets

A key consideration when training any machine learning model is how fast we can iterate over hyperparameters and models. However, the size of the MTEB collection's datasets, particularly the corpora of some of the retrieval datasets, can make this impractical: evaluating on the full MTEB collection (or even only on BEIR) can take days, especially with a large model like Zeta-Alpha-E5-Mistral. To address this, we created a smaller version of each BEIR dataset, the NanoBEIR collection. As part of this release, we are making the NanoBEIR datasets available on the HuggingFace Hub.
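As an example, here is a minimal sketch of loading one of the NanoBEIR datasets from the HuggingFace Hub; the repository id and subset names are assumptions about the released layout, so check the collection on the Hub for the exact details.

```python
from datasets import load_dataset

# Illustrative loading of one NanoBEIR dataset from the HuggingFace Hub.
# The repository id and subset names below are assumptions; check the
# released collection on the Hub for the exact layout.
corpus = load_dataset("zeta-alpha-ai/NanoMSMARCO", "corpus", split="train")
queries = load_dataset("zeta-alpha-ai/NanoMSMARCO", "queries", split="train")
qrels = load_dataset("zeta-alpha-ai/NanoMSMARCO", "qrels", split="train")

print(len(queries))  # 50 queries per NanoBEIR dataset
print(len(corpus))   # the sampled candidate documents for those queries
```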


For each dataset, the NanoBEIR collection consists of 50 queries randomly sampled from the full collection, with up to 200 negative documents per query. To sample these negative documents, we used both BM25, as implemented by Pyserini, and another embedding model, Alibaba's gte-large-en-v1.5. This is similar to the approach used by Snowflake's Arctic team in what they called an internal "Lite BEIR" dataset. We make our version publicly available in the hope that it will be a useful resource for reproducibility and faster experimentation. As an example, the table below shows the results of the base model (E5-Mistral) and Zeta-Alpha-E5-Mistral on the NanoBEIR datasets:


| Dataset | E5-Mistral | Zeta-Alpha-E5-Mistral |
|---|---|---|
| NanoArguAna | 59.9 | 65.8 (+5.8) |
| NanoClimateFever | 42.5 | 42.3 (-0.2) |
| NanoDBPedia | 71.8 | 72.8 (+1.0) |
| NanoFEVER | 94.9 | 96.2 (+1.35) |
| NanoFiQA | 60.3 | 61.0 (+0.7) |
| NanoHotPotQA | 85.6 | 89.9 (+4.3) |
| NanoMSMarco | 66.1 | 70.1 (+4.0) |
| NanoNFCorpus | 33.0 | 39.4 (+6.4) |
| NanoNQ | 75.4 | 83.1 (+7.7) |
| NanoQuora | 94.1 | 95.8 (+1.75) |
| NanoSCIDOCS | 35.4 | 41.3 (+5.92) |
| NanoSciFact | 78.0 | 79.8 (+1.8) |
| NanoTouche-2020 | 52.5 | 54.0 (+1.5) |
| Average | 65.3 | 68.6 (+3.3) |

Training recipe


We trained Zeta-Alpha-E5-Mistral on 4x A100 (80GB) GPUs; the training process took about 80 hours. To increase the effective batch size, we used GradCache. In our experiments, in-batch negatives did not seem to help when continuing training from an existing checkpoint. We also experimented with alternating tasks between batches, as proposed by the SFR-Embedding-Mistral team, but did not notice any significant change in the results, while it added complexity to the training script. We used the traditional InfoNCE loss with temperature scaling (a minimal sketch of this loss follows the table below). Finally, we used early stopping, halting training after ten evaluation steps without improvement on the evaluation set. In the end, we trained Zeta-Alpha-E5-Mistral with the following hyperparameters:

| Parameter | Value |
|---|---|
| GradCache chunk size | 8 |
| Effective batch size | 1024 |
| Epochs | 1 |
| Maximum query length | 192 |
| Maximum document length | 512 |
| Loss temperature | 0.2 |
| Negatives per query | 5 |
| LoRA r | 8 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
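As promised above, here is a minimal sketch of the InfoNCE loss with temperature scaling, using only per-query hard negatives (no in-batch negatives, which did not seem to help in our continued fine-tuning) and omitting the GradCache chunking; tensor names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE over each query and its own candidate documents.

    q: (batch, dim) L2-normalized query embeddings.
    d: (batch, 1 + num_negatives, dim) L2-normalized document embeddings,
       where d[:, 0] is the positive and the rest are hard negatives.
    """
    scores = torch.einsum("bd,bnd->bn", q, d) / temperature          # cosine similarity / T
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive sits at index 0
    return F.cross_entropy(scores, labels)

# Illustrative shapes: a batch of 4 queries, 5 hard negatives each, 4096-dim embeddings.
q = F.normalize(torch.randn(4, 4096), dim=-1)
d = F.normalize(torch.randn(4, 6, 4096), dim=-1)
print(info_nce_loss(q, d))
```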

We trained the model in BF16 with TF32 support enabled in PyTorch to speed up training, and used PyTorch's SDPA attention implementation. Flash-Attention offered only minor speed gains over SDPA, and SDPA was more stable in a few small-scale experiments. We also tried speeding up training with PyTorch's compiler. However, given the nature of text data, where input shapes constantly change (e.g., between queries and documents), it was not trivial to keep the compiler from frequently re-compiling the graph. It should be possible to work around this with dynamic shapes, but we leave that investigation for future work.
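For reference, a minimal sketch of the corresponding PyTorch and Transformers settings; this reflects common usage of these flags rather than our exact training script.

```python
import torch
from transformers import AutoModel

# Allow TF32 matmuls/convolutions on Ampere GPUs (such as the A100) to speed up training.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Load the base model in BF16 with PyTorch's scaled dot-product attention (SDPA).
# "flash_attention_2" is the alternative; SDPA was more stable in our small-scale tests.
model = AutoModel.from_pretrained(
    "intfloat/e5-mistral-7b-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
```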


Next steps

For now, we hope that Zeta-Alpha-E5-Mistral and NanoBEIR can be useful, and we look forward to releasing more high-quality embedding models soon.
