[Paper Review] 20. NV-Retriever: Improving text embedding models with effective hard-negative mining

Finding good hard negatives for better contrastive learning

NV-Retriever: Improving text embedding models with effective hard-negative mining

Text embedding models have been popular for information retrieval applications such as semantic search and Question-Answering systems based on Retrieval-Augmented Generation (RAG). Those models are typically Transformer models that are fine-tuned with cont

arxiv.org

Abstract

The model currently (12.18) ranked first on the MTEB benchmark

https://huggingface.co/nvidia/NV-Embed-v2

nvidia/NV-Embed-v2 · Hugging Face

Introduction We present NV-Embed-v2, a generalist embedding model that ranks No. 1 on the Massive Text Embedding Benchmark (MTEB benchmark)(as of Aug 30, 2024) with a score of 72.31 across 56 text embedding tasks. It also holds the No. 1 in the retrieval s

huggingface.co

This paper discusses how to improve the construction of contrastive learning datasets, so I recommend first reading SimCSE, an earlier paper on contrastive learning for training text embedding models.
https://ll2ll.tistory.com/75

Most text embedding models today are fine-tuned with contrastive learning. Many papers have introduced new embedding model architectures and training methods, but negative passage mining has rarely been explored or explained properly.

Selecting high-quality hard negative passages is one of the most important and most difficult parts of training an embedding model with contrastive learning.

So this paper proposes positive-aware mining methods, which use the positive relevance score to filter out false negatives effectively.

1. Introduction

Embedding models are trained with Contrastive Learning as follows:

Positive : query embedding - passage relevant to the answer for that query -> maximize similarity
Negative : query embedding - passage unrelated to the query -> minimize similarity
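As a reference point, the objective usually used for this is the InfoNCE loss. The notation here is my own assumption and not copied from the paper: s(·,·) is a similarity function such as cosine similarity, τ is a temperature, p⁺ is the positive passage, and p₁⁻ … p_N⁻ are the negatives for query q.

```latex
\mathcal{L}(q) = -\log
  \frac{\exp\big(s(q, p^{+}) / \tau\big)}
       {\exp\big(s(q, p^{+}) / \tau\big) + \sum_{i=1}^{N} \exp\big(s(q, p_i^{-}) / \tau\big)}
```

Minimizing this pushes s(q, p⁺) up and every s(q, pᵢ⁻) down, which is exactly the maximize / minimize pairing listed above.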

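A hard negative is

a passage that is somewhat similar to the query but not relevant to the answer, used as a negative.

In other words, the model gets a harder task than plain in-batch or random negatives, which improves training.

But as the description says, you need passages that are fairly similar to the query and still not correct answers, so finding them is hard in itself.

Hand-crafting them would of course give the best quality, but that costs money too,

so the idea behind hard negative mining is to automate this with the best quality possible.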

2. Background

2.1 Text embedding models

A text embedding model represents variable-length text as a fixed-dimension vector that downstream tasks can use.

Sentence embedding
- Sentence-BERT (SBERT) : modifies the BERT network using siamese (query, positive) and triplet (query, positive, negative) structures
Contrastive Learning
- SimCLR & SimCSE : contrastive learning and its application to text embeddings
- E5 : text embedding model developed by Microsoft
  - Builds CCPairs, a large-scale weakly-supervised dataset
    - Weak supervision: training a model on data with no explicit labels or only limited label information
    - Instead of direct labels, it learns from implicit signals such as text pairs, click logs, or web structure
    - e.g. question - answer, title - body
  - Uses a consistency-based filter to raise data quality
  - InfoNCE Loss + in-batch negative

2.2 Hard-negative mining for fine-tuning embedding models

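Contrastive Learning needs {query, positive passage, negative passage(s)} triplets.

There are several ways to choose negative passages:

Labeled by hand by humans
Randomly selected from the corpus
(in-batch negative) Use the positive passages of the other queries in the batch as negatives

In-batch negatives are efficient because the embeddings for those passages were already computed in the forward pass during training, but since the negatives come from the same batch, the batch size has to be large to get enough negative diversity.

Also, the negative passages are effectively random with respect to the query, so it is hard to do contrastive learning properly with this method alone.

For example, a pair like {query : "What is the lifespan of a cat?", negative : "Netflix was founded in 1997"} does little for training.

That is why providing hard negatives matters.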
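To make the in-batch negative idea concrete, here is a minimal pure-Python sketch. It assumes dot-product similarity and a temperature of 0.05; the function names and the toy 2-dimensional embeddings are made up for illustration and are not from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_infonce(query_embs, passage_embs, temperature=0.05):
    """Mean InfoNCE loss where passage i is the positive for query i and
    every other passage in the batch is treated as a negative."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [dot(q, p) / temperature for p in passage_embs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical values): two queries, two passages.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # passage i matches query i
swapped = [[0.0, 1.0], [1.0, 0.0]]  # passage i matches the other query

loss_aligned = in_batch_infonce(queries, aligned)
loss_swapped = in_batch_infonce(queries, swapped)
```

No extra passages are encoded here: the negatives are just the other rows of `passage_embs`, which is why the approach is cheap but depends on batch size for diversity.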

DPR : 1-2 hard negatives mined with BM25 + in-batch negatives
ANCE : asynchronously updates an ANN index during training and queries it, continuously mining hard negatives from the corpus embeddings. (expensive)
RocketQA : showed that mining hard negatives naively produces many false negatives.
- In experiments on the MS-MARCO set, they found that about 70% of the passages most similar to the query should actually be classified as positive.
- That paper removes potential false negatives by filtering out candidates with high similarity scores to the query.
Some studies refine the hard negatives found by an embedding model with a cross-encoder ranking model or a powerful decoder model.
Recent top-ranking models on MTEB used hard negative mining for fine-tuning, but they do not explore or explain in detail which models and methods they used. There are a few exceptions worth referencing, though.
- snowflake-arctic-embed-l : experiments with various maximum score thresholds for hard negative mining
- SFR-embedding-mistral : ablation study on three hard negative sampling settings and top-k candidates
3. Investigation on hard-negative mining for fine-tuning text embedding models

3.1 Hard-negative mining methods

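Naive top-k : select the top k candidates most similar to the query, excluding the positive
- As mentioned earlier, this is likely to pick up false negatives.

Top-K shifted by N : sort by similarity to the query in descending order, skip the first N ranks, then select the top-k
- e.g. Top-10 shifted by 5 : discard the first 5 and select the 6th - 15th passages
- Ignores the negative's similarity score -> may throw away useful negatives or keep false negatives
Top-k abs : exclude negatives whose similarity score is above an absolute threshold
- Sensitive to the threshold
- Cuts at a fixed absolute value without considering the similarity score of the positive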
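The three baseline methods above can be sketched roughly as follows. This is only an illustration: `scored` is assumed to be a list of (passage_id, similarity) pairs with the positive already removed, and the function names and toy scores are hypothetical.

```python
def rank(scored):
    """Sort (passage_id, score) candidates by score, highest first."""
    return sorted(scored, key=lambda c: c[1], reverse=True)


def naive_top_k(scored, k):
    """Naive top-k: take the k candidates most similar to the query."""
    return rank(scored)[:k]


def top_k_shifted(scored, k, n):
    """Top-K shifted by N: skip the first n ranks, then take the next k."""
    return rank(scored)[n:n + k]


def top_k_abs(scored, k, max_score):
    """Top-k abs: drop candidates scoring above an absolute threshold,
    then take the k most similar of what is left."""
    return [c for c in rank(scored) if c[1] <= max_score][:k]


# Toy candidates (hypothetical scores), positive passage already excluded.
candidates = [("p1", 0.95), ("p2", 0.90), ("p3", 0.70), ("p4", 0.60), ("p5", 0.40)]

naive = [pid for pid, _ in naive_top_k(candidates, k=2)]
shifted = [pid for pid, _ in top_k_shifted(candidates, k=2, n=1)]
absolute = [pid for pid, _ in top_k_abs(candidates, k=2, max_score=0.8)]
```

Note that none of these functions ever sees the positive's score, which is the limitation pointed out in the list above.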

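To overcome these limitations, the paper proposes the following two positive-aware mining methods.

Top-k MarginPos : the maximum threshold for the negative score is the positive score minus an absolute margin.
max_neg_score_threshold = pos_score - absolute_margin

Top-k PercPos : the maximum threshold for the negative score is a percentage of the positive score.
max_neg_score_threshold = pos_score * percentage_margin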
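A small sketch of the two positive-aware filters, assuming the thresholds are `pos_score - margin` for MarginPos and `pos_score * percentage` for PercPos. The function names, the `<=` comparison, and the toy scores are my own choices for illustration.

```python
def rank(scored):
    """Sort (passage_id, score) candidates by score, highest first."""
    return sorted(scored, key=lambda c: c[1], reverse=True)


def top_k_margin_pos(scored, pos_score, k, margin):
    """Keep negatives scoring at most pos_score - margin, then take top-k."""
    threshold = pos_score - margin
    return [c for c in rank(scored) if c[1] <= threshold][:k]


def top_k_perc_pos(scored, pos_score, k, percentage):
    """Keep negatives scoring at most pos_score * percentage, then take top-k."""
    threshold = pos_score * percentage
    return [c for c in rank(scored) if c[1] <= threshold][:k]


# Toy example (hypothetical scores): the positive scores 0.80 for this query.
candidates = [("p1", 0.85), ("p2", 0.78), ("p3", 0.70), ("p4", 0.60), ("p5", 0.40)]

margin_pos = [pid for pid, _ in top_k_margin_pos(candidates, 0.80, k=2, margin=0.01)]
perc_pos = [pid for pid, _ in top_k_perc_pos(candidates, 0.80, k=2, percentage=0.95)]
```

In both cases `p1` is dropped because it scores higher than the positive itself, which is the kind of likely false negative these methods are meant to catch. With the 95% setting, `p2` is dropped as well because it sits too close to the positive.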

After the mining methods above remove false negatives, multiple candidates are extracted to build the hard negative set.

Usually the top-k are selected, but sampling from the top-k to add diversity was also tried.

Sampled Top-k : sample n from the top-k
Top-1 + Sampled Top-k : pick the top-1 hard negative, then sample the remaining n-1
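The two sampling variants might look like this. It is a sketch under the assumption that `ranked` is the already-filtered candidate list sorted by similarity (most similar first) and that n >= 1; the names and the seeded `random.Random` are illustrative.

```python
import random


def sampled_top_k(ranked, k, n, rng):
    """Sampled Top-k: draw n negatives uniformly from the top-k pool."""
    pool = ranked[:k]
    return rng.sample(pool, min(n, len(pool)))


def top1_plus_sampled_top_k(ranked, k, n, rng):
    """Top-1 + Sampled Top-k: always keep the hardest negative, then draw
    the remaining n-1 from the rest of the top-k pool (assumes n >= 1)."""
    pool = ranked[:k]
    if not pool:
        return []
    rest = pool[1:]
    return [pool[0]] + rng.sample(rest, min(n - 1, len(rest)))


# Hypothetical ranked candidate ids, hardest first.
ranked_ids = [f"p{i}" for i in range(1, 11)]
rng = random.Random(0)  # seeded only so the example is repeatable

sampled = sampled_top_k(ranked_ids, k=5, n=3, rng=rng)
top1_sampled = top1_plus_sampled_top_k(ranked_ids, k=5, n=3, rng=rng)
```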

3.2 Selected embedding models

The following text embedding models were used as teacher models for hard negative mining.

e5-large-unsupervised (334M): E5 model pre-trained on unsupervised data with CL
e5-large-v2 (334M): E5 model fine-tuned on top of e5-large-unsupervised with supervised data
snowflake-arctic-embed-l (334M): like E5, trained in two rounds (unsupervised and supervised), with improvements in data and training method
e5-mistral-7b-instruct (7.1B): decoder-only Mistral model trained with CL to generate embeddings
NV-Embed-v1 (7.8B): a Mistral-based embedding model with some modifications including bi-directional and latent attention

3.3 Training and Evaluation

3.3.1 Training

Uses 287k samples mixed from the NQ, Stack Exchange, and SQuAD datasets

3.3.2 Evaluation

Running the full MTEB takes a lot of compute and time, so evaluation uses NQ, HotpotQA, and FiQA-2018 from the BEIR benchmark, which suit Q&A RAG systems
Retrieval accuracy is measured with NDCG@10
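For readers who have not used the metric, here is a minimal NDCG@k sketch. It assumes linear gain (`rel / log2(rank + 1)` with 1-based ranks) and that the input list contains the relevance of every judged document in ranked order, so the ideal ordering can be derived by sorting it. Real evaluation tools compute the ideal DCG from the full set of relevance judgments, so treat this as a simplification.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy rankings (hypothetical): 1 = relevant, 0 = not relevant.
perfect = ndcg_at_k([1, 0, 0])      # relevant document ranked first
second = ndcg_at_k([0, 1, 0])       # relevant document ranked second
no_relevant = ndcg_at_k([0, 0, 0])  # nothing relevant retrieved
```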

3.4 Ablation Study Results

3.4.1 Different teacher embedding models for mining

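Four hard negatives were generated for every query:

Use each teacher (e.g. BM25, NV-Embed-v1) to retrieve the documents most similar to the query as hard negative candidates
Remove false negatives with TopK-PercPos (described in Section 3.1)
Select the top 4 documents from the filtered candidates as hard negatives

The baseline model (E5-large-unsupervised) is then trained on the resulting training set

BM25 and random negatives actually performed worse than the baseline
e5-large-v2 and snowflake (E5-based, 334M) performed better than the baseline
NV-Embed-v1 and e5-mistral-7b (Mistral-based, 7B) reached the highest accuracy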

3.4.2 Ensembling hard-negatives from different embedding model

The top-4 hard negatives mined by the four teacher models (e5-large-v2, snowflake, NV-Embed-v1, e5-mistral-7b) were compared for overlap.

Measured on the NQ, SQuAD, and StackExchange datasets, the Jaccard similarity was below 30% in every case.

Jaccard Similarity = similarity between two sets = |A∩B| / |A∪B|
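As a quick worked example of the formula, here is a tiny sketch comparing the negatives mined by two teachers for the same query. The passage ids are made up, and returning 0.0 for two empty sets is just a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention assumed for two empty sets
    return len(a & b) / len(union)


# Hypothetical top-4 hard negatives from two teacher models for one query.
teacher_a = ["p1", "p2", "p3", "p4"]
teacher_b = ["p3", "p4", "p5", "p6"]

overlap = jaccard(teacher_a, teacher_b)  # 2 shared ids out of 6 distinct ids
```

Even with half of each top-4 list shared, the similarity is only 2/6, so a value below 30% means the teachers agree on at most one or two negatives per query.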

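Since the teachers mostly mine different negatives, ensembling was tried to improve hard negative quality.

Each ensembling method returns 4 hard negatives for one (query, positive) example pair.

Cross-sample ensembling : sample one teacher model per example and take all of that example's negatives from it
Intra-sample ensembling : for each example, select the top-1 negative mined by each teacher model

Cross-sample ensembling performed worse than the baseline
Intra-sample
- The selected top-1 hard negatives can overlap, so two variants were tried (with and without de-duplication)
- Keeping the duplicated hard negatives gave better quality
  - A negative picked by several models is likely to be genuinely important, and without de-duplication it carries more weight in the CE loss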
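The two ensembling strategies, including the with / without de-duplication variants, can be sketched as below. `mined` is assumed to map each teacher name to its ranked list of mined negatives for a single example; the function names and passage ids are hypothetical.

```python
import random


def cross_sample_ensemble(mined, rng):
    """Cross-sample: pick one teacher at random for this example and
    use all of the negatives that teacher mined."""
    teacher = rng.choice(list(mined))
    return list(mined[teacher])


def intra_sample_ensemble(mined, deduplicate=False):
    """Intra-sample: take the top-1 negative from every teacher.
    With deduplicate=False, a negative chosen by several teachers appears
    several times and so contributes more terms to the loss."""
    top1 = [negatives[0] for negatives in mined.values() if negatives]
    if deduplicate:
        top1 = list(dict.fromkeys(top1))  # drops repeats, keeps first-seen order
    return top1


# Hypothetical mined negatives for one (query, positive) example.
mined = {
    "e5-large-v2": ["p1", "p2", "p3", "p4"],
    "snowflake-arctic-embed-l": ["p1", "p5", "p6", "p7"],
    "NV-Embed-v1": ["p8", "p1", "p2", "p9"],
    "e5-mistral-7b-instruct": ["p1", "p8", "p3", "p5"],
}

cross = cross_sample_ensemble(mined, random.Random(0))
intra_keep_dups = intra_sample_ensemble(mined)
intra_dedup = intra_sample_ensemble(mined, deduplicate=True)
```

Here `p1` is the top-1 pick of three teachers, so the no-dedup variant returns it three times, while the de-duplicated variant ends up with only two distinct negatives.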

3.4.3 Comparing methods for mining hard-negatives

base model : E5-Large-Unsupervised
teacher model : e5-mistral-7b-instruct
For TopK-Abs, TopK-MarginPos, and TopK-PercPos, the margin is searched over [0, 1] in steps of 0.05

Each method has a different optimal configuration

The positive-aware hard negative mining methods, TopK-MarginPos and TopK-PercPos, performed best
The last block of the results tests whether selecting the top k as negatives is better than sampling from a wider range of candidates
- TopK-PercPos (95%)
- top-k vs top-1 + sampling (n-1)
- Almost no difference in performance
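The paper visualizes whether the method actually helps remove false negatives.

(a) Helps separate the score distributions of positives and negatives a little
(b) Prevents negatives from scoring higher than positives
(c) Excludes negatives that are too similar to the positive, which keeps the CE loss from getting excessively large -> indicates less noise in the training data

To check whether the mining comparison generalizes to a model of a different size than e5-large-unsupervised (334M), the same experiment was run with Mistral-7B-v0.1 as the base model
- Because Mistral is much larger, only 1 hard negative per example was used due to memory limits

Performance is much better than with e5-large-unsupervised
Top-k Abs performs even worse than Naive
Sampling performs better than top-1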

4. NV-Retriever

4.1 Model Architecture

Uses Mistral 7B as the base model
To train a causal LM as an embedding model, uses bi-directional attention and mean pooling
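Mean pooling here means averaging the per-token hidden states into one vector while skipping padding. A minimal pure-Python sketch follows, assuming the usual 0/1 attention mask. In practice this runs on batched tensors, and the names and values below are illustrative only.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1; padding (0) is ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    if count == 0:
        return total  # all-padding input: return the zero vector
    return [value / count for value in total]


# Hypothetical sequence: two real tokens followed by one padding token.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(hidden_states, mask)
```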

4.2 Train set and instruction prefixes

MTEB covers a variety of tasks such as retrieval, reranking, classification, and clustering, so a variety of training sets is needed to improve overall performance

Hard negative mining
- Uses E5-mistral-7B, the best-performing one, as the teacher model
- TopK-PercPos(95%)
- A specific prefix is designed for each training set
  - Added only to the query, not the passage, so no re-indexing is needed
  - Uses the {task_definition}: {query} format
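The prefix scheme is simple enough to show in a few lines. The sketch below only illustrates the `{task_definition}: {query}` format. The task definition string is a made-up example and not one of the paper's actual prefixes.

```python
def format_query(task_definition, query):
    """Prepend the task-specific instruction to the query text."""
    return f"{task_definition}: {query}"


def format_passage(passage):
    """Passages are embedded as-is, so an existing index stays valid
    when the instruction for a task changes."""
    return passage


# Hypothetical task definition, for illustration only.
task = "Given a question, retrieve passages that answer the question"

query_text = format_query(task, "What is the lifespan of a cat?")
passage_text = format_passage("Domestic cats typically live 12 to 18 years.")
```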

4.3 Training (Appendix D)

Step 1. Uses retrieval supervised data (in-batch negatives + hard-negative data)

Step 2. Mixes the retrieval data with datasets from other kinds of tasks (e.g. classification, regression)

Uses Hugging Face Trainer + PEFT
Model hyperparameters

Training hyperparameters
4.4 Results

Currently (24.12.22), NV-Embed-v2, which applies this hard negative method on top of NV-Embed-v1, ranks first
