Why Your RAG Almost Works (And How to Fix It)
tl;dr
Most RAG systems underperform because the embedding layer treats queries and documents as equivalent text, which they are not. Switching to an asymmetric embedding model that encodes each differently produces meaningfully better retrieval. This guide explains why, with a practical example and a concrete first step. Estimated read time: 8 minutes.
There is a specific kind of frustration that comes with a RAG system that almost works. The answers are in the right neighborhood. The citations are plausible. But something is off: the retrieved passages are adjacent to the right answer rather than the right answer itself. Users notice before the metrics do.
The instinct is to reach for the usual suspects: a different model, a tighter prompt, smaller chunk sizes, more context. These are reasonable guesses. They are usually wrong. The problem is almost always earlier in the pipeline, at the step where text gets converted into numbers and compared against other numbers. That step is the embedding layer, and most setups get it subtly, consequentially wrong.
This article was inspired by a production agent rebuild where switching one component, the embedding model, produced a shift in retrieval quality that was noticeable before it was measurable.
Is This Actually Your Problem?
Before changing anything, confirm that retrieval is actually the failure point rather than generation. These are different problems with different fixes.
Signs the embedding layer is underperforming:
- Answers cite the right document but the wrong section within it
- Queries about specific topics retrieve passages that mention the topic but do not answer the specific question
- Adding more chunks to the context window improves answers (this suggests the right chunk exists but is not ranking first)
- Semantic rewording of a query produces noticeably different results for what should be the same question
A simple diagnostic test:
Take five user queries that produced unsatisfactory answers. Manually search your document corpus for the passage that actually contains the right answer. Then check what your embedding model returned as the top-3 results for each query. If the correct passage is consistently in positions 4-10 rather than positions 1-3, the problem is retrieval, not generation. Your embedding model is finding the right neighborhood but not the right address.
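If you prefer to script that audit rather than eyeball it, here is a minimal sketch using the sentence-transformers library. The model name is a stand-in for whatever you currently run, and the query/chunk pairs are placeholders for the ones you identified by hand.

```python
from sentence_transformers import SentenceTransformer, util

# Stand-in for your current embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]  # your indexed corpus
audit = [
    # (query that produced a bad answer, index of the chunk you found by hand)
    ("what causes a PDF to render slowly in the browser?", 2),
]

chunk_vecs = model.encode(chunks, convert_to_tensor=True)
for query, correct_idx in audit:
    query_vec = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, chunk_vecs)[0]
    ranking = scores.argsort(descending=True).tolist()
    rank = ranking.index(correct_idx) + 1  # 1-based rank of the known-good chunk
    print(f"{query!r}: correct chunk ranked #{rank} of {len(chunks)}")
```

If the printed ranks cluster in the 4-10 range, retrieval is your problem.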
If that test confirms your suspicion, the next three sections explain why it happens and how asymmetric embeddings fix it. If it points somewhere else (the prompt, the generation model, the chunking strategy), this article is not your fix.
What Embeddings Actually Are
Read this if the diagnostic above pointed at your embedding layer and you want to understand why. Skip ahead to How Asymmetric Embeddings Handle the Mismatch if you already know what embeddings, vectors, and cosine similarity are.
An embedding is what your RAG system uses to decide which document chunks are "about" a given question. Technically, it is a list of numbers (called a vector), often 768 or 1536 of them, that the model produces to represent the meaning of a piece of text. The useful property is this: texts with similar meanings produce vectors that sit close together in this numerical space, and texts with different meanings produce vectors that sit far apart.
"Closeness" in this context is usually measured with cosine similarity, a calculation that treats each vector as a direction in space and measures the angle between two directions. An angle of zero means identical direction (very similar meaning). An angle of 90 degrees means no relationship.
RAG (Retrieval-Augmented Generation) is an architecture where a language model answers questions by first searching a knowledge base. Instead of relying only on what the model learned during training, the system retrieves relevant passages from a document library and feeds them into the model as context. The retrieval step depends entirely on embeddings: the user's question gets embedded, every document chunk gets embedded, and the system returns whichever chunks are numerically closest to the question.
The quality of everything downstream (the answer, the citations, the user's trust) depends on whether that retrieval step returns the right chunks.
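In code, that retrieval step is small enough to fit on one screen. The sketch below assumes every chunk was embedded ahead of time at index time; the random vectors stand in for real embeddings.

```python
import numpy as np

chunk_vecs = np.random.rand(1000, 768)  # placeholder: one row per indexed chunk
query_vec = np.random.rand(768)         # placeholder: the embedded user question

# Normalize so that a plain dot product equals cosine similarity
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)

scores = chunk_vecs @ query_vec       # one similarity score per chunk
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 closest chunks
print(top_k)  # these chunks get fed to the language model as context
```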
Why Symmetric Embeddings Are the Wrong Default for Search
This section is for anyone whose RAG runs on a general-purpose embedding model (OpenAI's text-embedding-3, Cohere's default, most open-source options) and wonders why retrieval underperforms.
Most embedding models are trained symmetrically. They learn to produce similar vectors for texts that are semantically equivalent: two sentences that say the same thing in different words, two paragraphs with the same argument. This is useful for tasks like duplicate detection, clustering, and semantic textual similarity benchmarks.
Search is a different task entirely.
When you search for something, the query and the document that answers it do not look similar as text. They serve different purposes:
| Text type | Typical characteristics |
|---|---|
| Query | Short, conversational, intent-driven, often a question |
| Document | Dense, formal, information-packed, often declarative |
A user asks: "what causes a PDF to render slowly in the browser?" A document that answers this contains paragraphs about PDF.js rendering pipelines, font parsing overhead, and canvas element limitations. These two texts share topic but not style, structure, or vocabulary density. A symmetric model trained to find equivalence between similar texts will struggle here. It was not designed for this mismatch.
The standard benchmark for this is called BEIR (Benchmarking Information Retrieval). It tests retrieval models across diverse real-world search tasks and consistently shows that models optimized for symmetric similarity underperform on asymmetric search scenarios, sometimes significantly.
How Asymmetric Embeddings Handle the Mismatch
This is the fix. Read this if the previous section described your setup.
An asymmetric embedding model encodes queries and documents through different learned pathways, or is trained specifically on query-document pairs rather than on equivalent-sentence pairs. The model learns that a short, intent-driven question should produce a vector that is close to the dense, informative passage that actually answers it, even if those two texts would score low on raw semantic similarity.
This distinction matters in practice for two reasons.
First, the training signal is different. Symmetric models learn from pairs of texts with similar meaning. Asymmetric models learn from real search data: a query that a real person asked, paired with the document passage that satisfied it. This trains the model to bridge the style gap rather than measure it.
Second, some asymmetric architectures (like bi-encoders trained with hard negatives) are explicitly penalized for returning passages that are topically adjacent but not actually useful. This is exactly the failure mode that makes RAG feel "almost right." The model learns that "close but wrong" is a retrieval failure, not a partial success.
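For the curious, the shape of that objective fits in a dozen lines. This is a generic InfoNCE-style contrastive loss with in-batch and hard negatives, a common pattern for training retrieval models, not any particular vendor's training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_vecs, pos_doc_vecs, hard_neg_vecs, temperature=0.05):
    """Each query must score its own answering passage above every other
    passage in the batch (in-batch negatives) and above a mined hard
    negative: a passage that is topically close but not actually useful."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(torch.cat([pos_doc_vecs, hard_neg_vecs]), dim=-1)
    logits = q @ d.T / temperature                     # (batch, 2 * batch)
    labels = torch.arange(q.size(0), device=q.device)  # doc i answers query i
    return F.cross_entropy(logits, labels)
```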
Practically, asymmetric models come in two forms: models with separate query and document encoders (you call a different function for each), and models with a unified encoder trained on asymmetric data (you pass a prefix like "query:" or "passage:" to signal context). Both produce meaningfully better retrieval than a symmetric model used for search.
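Here is what the unified-encoder form looks like in practice. The prefix convention shown is the one used by the open-source E5 family; other models define different prefixes or separate encode calls, so check your model's documentation.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")  # an E5-family retrieval model

# The literal prefix tells the model which side of the search it is encoding
query = "query: what causes a PDF to render slowly in the browser?"
passage = "passage: PDF.js renders each page onto a canvas element. Font parsing ..."

q_vec, p_vec = model.encode([query, passage], normalize_embeddings=True)
print(util.cos_sim(q_vec, p_vec))  # trained to be high despite the style gap
```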
A Production Example: What Better Retrieval Feels Like
During a rebuild of a production agent, the system was using a general-purpose symmetric embedding model. Retrieval was functional. Answers were usually in the right area. But there was a consistent pattern of "close but wrong" results: passages that were about the right topic but not the right level of specificity, or from the right document but the wrong section.
After switching to Jina's asymmetric embedding model (specifically jina-embeddings-v3, which handles the query/document distinction through task prefixes), the shift was apparent before any formal evaluation. Queries that previously returned three plausible-but-wrong passages began returning the directly relevant passage in the top result. The "almost right" character of the answers changed. They became more precisely sourced.
This is not a Jina product recommendation. Other asymmetric models, including Cohere's embed-v3, Voyage AI's retrieval models, and several open-source options from the MTEB leaderboard, produce similar improvements. The point is that the model type matters more than the specific vendor.
The improvement was not about chunk size, overlap, or prompt engineering. The retrieval pipeline, the prompts, and the chunking strategy were unchanged. Only the embedding model changed. That single swap produced the most significant quality improvement of the entire rebuild.
How to Choose an Asymmetric Embedding Model
Once you know the model type matters, the question is which specific model to pick. The MTEB leaderboard (Hugging Face maintains a current version) under the "retrieval" task category is the most reliable public benchmark. Look for models that score well on BEIR specifically. The "Retrieval Average" column is the right metric for RAG use cases; the overall average includes clustering and classification tasks that have no bearing on your problem.
Also check whether the model requires you to signal query versus document context. Models that use prefixes ("query:", "passage:") or separate encoders give you explicit control over the asymmetric encoding. Models that do not distinguish between the two are symmetric by default, regardless of what the marketing materials say.
Common Questions
Does chunk size still matter if I switch to asymmetric embeddings?
Yes, but it becomes less critical as a first-order problem. Asymmetric embeddings improve the model's ability to find the right passage; chunk size determines how much context surrounds it. A good embedding model can still retrieve an unhelpfully large chunk. Start with 512-768 tokens as a baseline and adjust based on the average length of a complete answer in your corpus.
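A fixed-size chunker with modest overlap is enough to establish that baseline. The sketch below counts whitespace-separated words as a rough proxy for tokens; in practice you would count with your embedding model's own tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 640, overlap: int = 64) -> list[str]:
    # chunk_size of 640 sits in the suggested 512-768 baseline range;
    # the overlap keeps answers that straddle a boundary intact
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks
```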
Can I use asymmetric embeddings with any vector database?
Asymmetric embeddings produce standard vectors and are compatible with all major vector databases (Pinecone, Weaviate, Qdrant, pgvector, etc.). The only requirement is discipline: use the document encoder for chunks at index time and the query encoder for questions at search time. If you index documents with the query encoder by mistake, retrieval will degrade.
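A thin wrapper makes that mistake structurally hard to commit. The prefix strings here are placeholders for whatever your chosen model expects.

```python
class AsymmetricEmbedder:
    """Wraps any prefix-style asymmetric model so callers cannot confuse
    the two pathways. `model` is assumed to expose an encode() method."""

    def __init__(self, model, query_prefix="query: ", passage_prefix="passage: "):
        self.model = model
        self.query_prefix = query_prefix
        self.passage_prefix = passage_prefix

    def embed_query(self, text: str):
        # Search time: always the query pathway
        return self.model.encode(self.query_prefix + text)

    def embed_passage(self, text: str):
        # Index time: always the document pathway
        return self.model.encode(self.passage_prefix + text)
```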
What about reranking? Is that the same as asymmetric embeddings?
Reranking is a separate, complementary technique. A reranker takes the top-N results from initial retrieval and re-scores them using a cross-encoder that compares query and document together rather than separately. Reranking is more accurate than bi-encoder retrieval but too slow to run across an entire corpus. The typical pattern is: asymmetric embeddings for initial retrieval (fast, good), reranker on top-20 results (slower, better). You want both if quality is the priority.
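A sketch of that two-stage pattern, using a commonly published open-source cross-encoder checkpoint as the reranker; swap in whichever model fits your latency budget.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Score each (query, passage) pair jointly: slower than a bi-encoder,
    # but it sees both texts at once and catches "close but wrong" passages
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:keep]]

# Usage: candidates = top-20 chunks from asymmetric embedding retrieval,
# and the few survivors become the context handed to the language model.
```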
My RAG system uses OpenAI's embeddings. Are those symmetric?
OpenAI's text-embedding-3 models are general-purpose and not specifically optimized for asymmetric retrieval. They perform reasonably but consistently score below dedicated retrieval models on BEIR benchmarks. If you are using them and experiencing the "almost right" failure mode, switching to a retrieval-specific model is worth testing before making any other changes.
How do I handle multilingual documents with asymmetric embeddings?
Several asymmetric models support multilingual retrieval, including jina-embeddings-v3 and Cohere's multilingual embed models. For multilingual corpora, check the MTEB leaderboard's multilingual retrieval category rather than the English-only retrieval scores. Embedding queries and documents in their native languages typically outperforms translating everything to English first.
Key Takeaways
- RAG systems that feel "almost right" usually have a retrieval problem, not a generation problem. Confirm this with a manual top-3 audit before changing prompts or models.
- Standard embedding models are trained for semantic similarity between equivalent texts. Search requires a different property: bridging the style gap between a short conversational query and a dense informative document.
- Asymmetric embedding models are trained on actual query-document pairs and explicitly learn to return relevant passages for intent-driven queries, even when the surface text looks nothing alike.
- The MTEB leaderboard "Retrieval Average" column on Hugging Face is the most reliable public benchmark for comparing embedding models on this specific task.
- Reranking is complementary to asymmetric embeddings, not a replacement. Use both if quality matters.
Your concrete first step: Open the MTEB leaderboard at huggingface.co/spaces/mteb/leaderboard, filter by task type "Retrieval," sort by the retrieval average column, and identify the top three models that fit your size and latency budget. Pick the smallest one that outscores your current model and run the five-query diagnostic test described above. If retrieval rank improves, you have found the source of your "almost right" problem.
This article was inspired by content originally written by Mario Ottmann. The long-form version was drafted with the assistance of Claude Code AI and subsequently reviewed and edited by the author for clarity and style.