Evaluating vector-based “Traditional” RAG Systems: Why Knowledge Base Quality Matters
26/03/2026
6 MIN READ
When organizations deploy vector-based “Traditional” Retrieval-Augmented Generation (RAG) systems, performance issues are often attributed to model choice or hyper-parameter tuning.
However, in practice, our experience testing GenAI systems shows that many failures originate elsewhere: in the structure and quality of the underlying knowledge base.
During a recent RAG system quality assessment, we observed that a client’s system was underperforming not because of the LLM or retrieval configuration, but due to duplicated and overlapping documents within the knowledge base.
This highlights an important but often overlooked reality:
RAG systems do not fail only because of models—they fail because of how knowledge is structured, retrieved, and used.
In this article, we explore how document selection and knowledge base design directly impact system performance, and how these factors can be evaluated more effectively as part of a broader GenAI testing and assessment approach.
Retrieval-Augmented Generation (RAG) systems used by companies usually rely on an internal curated knowledge base to generate accurate and relevant answers. A common quality question when building these systems is: how many documents, and which ones, should be added to the knowledge base so that the RAG system performs optimally? While it might seem intuitive to keep the database as small as possible, research and practice show that the real goal is not minimizing document count per se, but maximizing useful knowledge coverage while avoiding excessive redundancy.
Why Minimal Document Sets Can Be Risky
To understand how knowledge base design affects system behavior, we first need to look at how document quantity and diversity influence retrieval performance.
From a system evaluation perspective, the key question becomes: what does the ideal amount of information per topic look like?
You might think that a single instance or document per piece of information would be ideal, correct? In practice, that is often too little.
RAG systems retrieve relevant information from a document corpus using a vector search before generating responses. If the corpus is too small and unvaried (a single piece of information per topic), a classical vector search system may fail to retrieve relevant information even if it conceptually exists in the knowledge base.
There are two main reasons for this:
First, coverage limitations. With fewer documents per topic, the system may lack information about edge cases, related topics, or variations of a concept.
Second, retrieval sensitivity. Many RAG systems rely on embedding similarity for retrieval. These methods are approximate: if a user asks a question using phrasing that differs from the document’s wording, the retriever might miss the correct chunk.
Because of this, RAG databases benefit from semantic diversity. The same core fact expressed in multiple ways increases the probability that the retrieval mechanism will match the user’s query.
From a testing standpoint, this means that retrieval performance should not be evaluated in isolation from knowledge base design.
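To make the retrieval-sensitivity point concrete, here is a toy sketch in which a lowercased bag-of-words vector stands in for a real embedding model; the corpus, query, and scoring are invented for illustration. The paraphrased chunk matches the user's wording where the original phrasing does not:

```python
from collections import Counter
import math

def bow_vector(text: str) -> Counter:
    # Lowercased bag-of-words "embedding": a crude stand-in for a real model.
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [
    "Refunds are issued within 14 days of purchase.",             # original phrasing
    "You can get your money back up to two weeks after buying.",  # paraphrase of the same fact
]

query = "how do I get my money back"
scores = [cosine(bow_vector(query), bow_vector(doc)) for doc in corpus]

# The paraphrase shares the user's wording; the original phrasing shares none of it.
print(scores)
```

With only the first document in the corpus, this query would retrieve nothing useful; the paraphrase is what makes the fact reachable.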
Content Overlap: How Much Is Too Much?
Okay, so we start adding a healthy amount of information per topic, in various forms and with different phrasings. But when is it too much? Overlap in a RAG database refers to different documents containing similar or related knowledge. This is not about duplicating text but about ensuring that key information appears in multiple contexts.
A useful rule-of-thumb guideline is:
- Critical facts should appear in two or three sources.
- Secondary information can appear once.
For example, a company’s refund policy might appear in an FAQ page, a product guide, and a terms of service document. Each may phrase the information slightly differently, improving the likelihood that a semantic search system retrieves it.
However, excessive duplication should be avoided: it increases storage costs and retrieval time, and may bias retrieval results toward repeated content, producing answers that draw on less of the information actually available in the knowledge base.
This balance between coverage and redundancy is critical, and often becomes visible only when evaluating system outputs under real usage conditions.
Existing Tools for Document Selection
To manage this balance, several techniques are commonly used to assess whether new documents add value to a RAG system.
1. Embedding Similarity
One of the most widely used approaches is comparing document embeddings against the existing corpus. Tools such as FAISS, Pinecone, Weaviate, and Milvus allow developers to detect duplicates or near-duplicates by measuring semantic similarity.
If a document is extremely similar to existing content, it may be redundant.
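A minimal sketch of such a near-duplicate gate, using plain NumPy over made-up embedding vectors. In production, the vectors would come from an embedding model and the index would live in FAISS, Pinecone, Weaviate, or Milvus; the 0.95 threshold is an illustrative choice:

```python
import numpy as np

def is_redundant(new_vec: np.ndarray, corpus_vecs: np.ndarray, threshold: float = 0.95) -> bool:
    """Flag a document if any existing embedding exceeds the cosine-similarity threshold."""
    new_vec = new_vec / np.linalg.norm(new_vec)
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return bool(np.max(corpus_norm @ new_vec) >= threshold)

# Invented 3-dimensional "embeddings" for illustration only.
corpus = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
near_dup = np.array([0.88, 0.12, 0.01])   # almost identical to the first row
novel = np.array([0.0, 0.1, 0.99])        # points in a new direction

print(is_redundant(near_dup, corpus), is_redundant(novel, corpus))  # True False
```

The same logic scales to millions of vectors once the brute-force matrix product is replaced by an approximate nearest-neighbor index.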
2. Document Quality Filtering
Before documents enter a knowledge base, pipelines often check quality metrics such as completeness, language, formatting, and noise. Tools like Apache Tika help extract text content, while frameworks such as LangChain and LlamaIndex provide ingestion pipelines that include filtering and preprocessing.
3. Topic Coverage Analysis
Another approach evaluates whether a document introduces new topics or concepts into the corpus. Topic modeling and clustering methods can reveal gaps in the knowledge base.
Libraries such as BERTopic, scikit-learn, and Gensim are commonly used to analyze topic coverage.
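As a toy illustration of the idea (not how BERTopic actually works), each existing topic can be caricatured as a keyword set: a new document that overlaps none of them signals a coverage gap. Topics and documents below are invented:

```python
# Each topic is caricatured as a keyword set; real pipelines would use
# BERTopic or clustering over embeddings instead of hand-picked keywords.
topics = {
    "billing": {"invoice", "payment", "refund", "charge"},
    "shipping": {"delivery", "courier", "tracking", "shipping"},
}

def coverage(doc: str):
    """Return the best-matching topic and how many of its keywords the doc hits."""
    words = set(doc.lower().split())
    scores = {t: len(words & kws) for t, kws in topics.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

gap_doc = "How do I reset my account password after being locked out"
known_doc = "My invoice shows a duplicate payment charge"

print(coverage(gap_doc))    # zero overlap with every topic: a coverage gap
print(coverage(known_doc))  # clearly covered by the billing topic
```

A zero score against every topic is the interesting signal here: it suggests the document brings a subject the knowledge base does not yet cover.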
4. Retrieval Evaluation
Some teams simulate queries to see whether a new document improves system performance. Evaluation frameworks such as Ragas, DeepEval, and TruLens help measure retrieval quality and answer accuracy.
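A toy sketch of this before/after evaluation loop, with simple keyword overlap standing in for a real retriever and a hand-made query set; frameworks like Ragas or DeepEval replace both with embedding-based retrieval and LLM-based scoring:

```python
def hits(queries_with_answers, corpus):
    """Count queries whose best-overlapping chunk contains the expected answer."""
    count = 0
    for query, answer in queries_with_answers:
        best = max(corpus, key=lambda d: len(set(query.lower().split()) & set(d.lower().split())))
        count += answer.lower() in best.lower()
    return count

# Invented corpus and evaluation set for illustration.
corpus = ["Shipping takes 3-5 business days."]
eval_set = [
    ("how long does shipping take", "3-5 business days"),
    ("can I return an item", "30 days"),
]

before = hits(eval_set, corpus)
candidate = "Items can be returned within 30 days of delivery."
after = hits(eval_set, corpus + [candidate])

print(before, after)  # the candidate document answers a previously failing query
```

The point of the pattern is the comparison: a candidate document earns its place by moving a system-level metric, not by merely being dissimilar to existing content.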
While these approaches provide useful signals, they often focus on similarity rather than actual knowledge contribution.
Graph-Based Approaches: A New Direction
To address these limitations, more advanced approaches have emerged, focusing not just on textual similarity but on the structure of knowledge itself. Researchers have begun using graph representations of knowledge bases to manage RAG corpora. Instead of treating documents as isolated text chunks, these approaches extract entities and relationships and store them in a knowledge graph.
When the RAG system leverages the Knowledge Graph of the knowledge base, it is often referred to as a GraphRAG system.
In a graph representation:
- Nodes represent entities or concepts.
- Edges represent relationships between them.
When a new document is evaluated, the system checks whether it introduces new nodes (entities) or new edges (relationships) in the knowledge base graph. If it adds little new knowledge, it may be redundant.
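As a minimal sketch, assuming entity-relation triples have already been extracted from each document (a real pipeline would use an LLM or an information-extraction model for that step), the novelty check reduces to set operations on nodes and edges. The triples below are invented:

```python
def novelty(candidate_triples, graph_triples):
    """Return the new nodes and new edges a candidate document would contribute."""
    known_nodes = {n for s, _, o in graph_triples for n in (s, o)}
    new_edges = set(candidate_triples) - set(graph_triples)
    new_nodes = {n for s, _, o in new_edges for n in (s, o)} - known_nodes
    return new_nodes, new_edges

# Existing knowledge graph (invented triples).
graph = {("Paris", "capital_of", "France"), ("France", "member_of", "EU")}

# Candidate restates one known fact and adds one genuinely new relationship.
candidate = [("Paris", "capital_of", "France"), ("Paris", "located_in", "Europe")]

nodes, edges = novelty(candidate, graph)
print(nodes, edges)  # one new entity ("Europe") and one new relationship
```

A candidate contributing zero new nodes and zero new edges is a strong redundancy signal, regardless of how differently its text is worded.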
Graph-based selection provides several advantages:
- It detects conceptual redundancy more reliably than embeddings.
- It identifies missing knowledge areas.
- It improves multi-hop reasoning by connecting related concepts across documents.
This approach shifts the key question from “Is this document similar to existing ones?” to “Does this document add new knowledge to the system?”
Examples
At code4thought we use this state-of-the-art method through a GraphRAG tool, which can automatically identify whether a new document provides extra value to your knowledge base and makes your RAG system more performant. You can therefore optimize the size of your knowledge base, or ask whether adding a particular document (or removing an existing one) would benefit your RAG system.
To illustrate, a very simple and short example:
Assume we have an extremely small RAG knowledge base, which contains the following information regarding the topic of how aspirin reduces inflammation:
- Aspirin inhibits cyclooxygenase (COX) enzymes.
- Cyclooxygenase enzymes produce prostaglandins.
- Prostaglandins cause inflammation and pain.
And we present this RAG system with the following three new short documents:
| # | Document text | Ground truth |
| --- | --- | --- |
| 1 | Aspirin inhibits cyclooxygenase (COX) enzymes. Cyclooxygenase enzymes produce prostaglandins. Prostaglandins cause inflammation and pain. Aspirin treats inflammation. Aspirin inhibits prostaglandins. | Contains all the detail in our knowledge base but also provides more information. |
| 2 | Aspirin inhibits cyclooxygenase (COX) enzymes. Cyclooxygenase enzymes produce prostaglandins. Prostaglandins cause inflammation and pain. | Provides exactly the same information as what currently exists in our knowledge base. |
| 3 | Indian tech stocks have experienced a sharp and unusual downturn in recent weeks, driven by concerns that artificial intelligence could disrupt the traditional outsourcing model underpinning the country’s back-office sector. | Irrelevant document; nothing to do with the topic. |
First document:
The first document contains every piece of information already present in our knowledge base, but it also adds two facts we did not have before:
- Aspirin treats inflammation
- Aspirin inhibits prostaglandins
Second document:
The second document is an exact copy of the knowledge which exists in our knowledge base. Therefore, it is redundant as it does not provide us with any new information at all.
Third document:
The third document is completely irrelevant to the topic and has to do with Indian tech stocks.
When using classical embeddings to decide whether these three documents should be added to the knowledge base, the first two would likely score as very similar (in vector, i.e. text-similarity, terms) to the existing database, because the exact same wording is used in both. Both would therefore be rejected as ‘similar’ or ‘too close’: vector similarity would fail to find the new knowledge hidden in the first document, precisely because it focuses on surface similarity. The third document, by contrast, would have a vector very different from the current database and would be correctly rejected.
This is where GraphRAG is different.
It goes beyond how similarly an idea is expressed and focuses on the information a document provides and the relationships within that information. It would therefore keep the first document (adding it to the database, since it contains new knowledge), reject the second as redundant, and reject the third as irrelevant to the knowledge base.
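The decision logic just described can be sketched in a few lines, again assuming triples are pre-extracted; the extraction step itself, which a real GraphRAG tool automates, is the hard part and is omitted here:

```python
def classify(doc_triples, graph_triples):
    """Keep a document only if it connects to the graph AND adds new edges."""
    graph_nodes = {n for s, _, o in graph_triples for n in (s, o)}
    doc_nodes = {n for s, _, o in doc_triples for n in (s, o)}
    if not doc_nodes & graph_nodes:
        return "irrelevant"     # no entity in common with the knowledge base
    if not set(doc_triples) - set(graph_triples):
        return "redundant"      # every relationship is already known
    return "add"                # connected, and brings new relationships

# The three-fact aspirin knowledge base from the example above, as triples.
graph = {("aspirin", "inhibits", "COX enzymes"),
         ("COX enzymes", "produce", "prostaglandins"),
         ("prostaglandins", "cause", "inflammation and pain")}

doc1 = list(graph) + [("aspirin", "treats", "inflammation"),
                      ("aspirin", "inhibits", "prostaglandins")]
doc2 = list(graph)                                          # exact restatement
doc3 = [("Indian tech stocks", "hit by", "AI disruption")]  # off-topic

print([classify(d, graph) for d in (doc1, doc2, doc3)])
# ['add', 'redundant', 'irrelevant']
```

Note that the first two documents are nearly identical as text, yet the graph view separates them cleanly: only the edge sets matter.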
From a system testing perspective, this distinction is critical.
The Key Takeaway
Effective RAG systems are not built by simply increasing or reducing document volume. They require a structured understanding of how knowledge is represented, retrieved, and used within the system.
Traditional approaches such as embedding similarity and topic modeling provide useful signals, but they do not fully capture whether new information improves system performance. Graph-based approaches move closer to this goal by focusing on how knowledge is connected and whether it meaningfully extends the system’s understanding.
From an AI quality perspective, this reinforces a broader principle:
GenAI systems must be evaluated as end-to-end systems, not solely at the level of the AI models they use.
Knowledge base design, retrieval logic, and data structure all directly influence performance, reliability, and trustworthiness. They cannot be overlooked; doing so is like focusing on a single tree and missing the forest.
At code4thought, this is exactly how we approach GenAI testing and assessment—by identifying weaknesses across the full system pipeline and translating them into actionable improvements.