RAG (retrieval-augmented generation) improves on plain LLM usage by pairing the model with an external knowledge store that acts as its memory, reducing hallucination and letting the model cite the sources behind its claims. Data is passed through an embedding model that maps each piece of text to a vector in a high-dimensional space, where it can be compared with other embedded texts.
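As a minimal sketch of the embedding step, here is what this looks like with the open-source sentence-transformers library; the model name and sample documents are illustrative choices, not requirements:

```python
from sentence_transformers import SentenceTransformer

# Illustrative corpus; any list of strings works.
documents = [
    "RAG pairs a retriever with a language model.",
    "Vector databases index embeddings for similarity search.",
]

# all-MiniLM-L6-v2 maps each text to a 384-dimensional vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents)

print(embeddings.shape)  # (2, 384): one vector per document
```

Any embedding model works here; what matters is that queries and documents are embedded with the same model so their vectors live in the same space.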
In RAG apps, embeddings are usually stored in a vector database, built specifically for similarity search. Vector databases provide an easy way to index and query texts: the query is embedded, then compared against the stored vectors to find the most similar entries. A common choice of similarity metric is cosine similarity.
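To make the comparison concrete, here is a brute-force version of that query in plain NumPy, with random vectors standing in for real embeddings; a vector database performs the same comparison, but at scale and typically with approximate-nearest-neighbor indexing:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for stored document embeddings and a query embedding.
index = np.random.rand(1000, 384)
query = np.random.rand(384)

# Score every stored vector against the query in one vectorized pass.
scores = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))

# Indices of the 5 most similar documents, best first.
top_k = np.argsort(scores)[::-1][:5]
print(top_k)
```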
In RAG, document chunking should be performed carefully to find the optimal chunk size, and adjacent chunks should overlap so that context is preserved across chunk boundaries. Prompts should include examples of effective citation and should instruct the model to say explicitly when the source material does not answer the question.
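A simple character-based chunker with overlap might look like the sketch below; the chunk size and overlap values are illustrative defaults to tune against your own documents, and production systems often split on sentence or token boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, where each chunk repeats the
    last `overlap` characters of the previous one to preserve context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

# Illustrative usage on a synthetic document.
doc = " ".join(f"This is sentence number {i}." for i in range(200))
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0]))
```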
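For the prompting side, one illustrative template combining both recommendations, a citation example and an explicit "no answer" instruction, could look like this (the wording and fallback phrase are assumptions to adapt, not a fixed format):

```python
PROMPT_TEMPLATE = """Answer the question using only the numbered sources below.
Cite each claim with its source number, e.g. [1].
If the sources do not contain the answer, reply exactly:
"The provided sources do not answer this question."

Example:
Question: When was the company founded?
Sources: [1] "Acme Corp was founded in 1987 in Ohio."
Answer: Acme Corp was founded in 1987 [1].

Sources:
{sources}

Question: {question}
Answer:"""

# Illustrative usage: retrieved chunks are numbered and substituted in.
prompt = PROMPT_TEMPLATE.format(
    sources='[1] "Retrieved chunk text goes here."',
    question="What does the source say?",
)
```

Showing a worked citation in the prompt gives the model a concrete pattern to imitate, and the fixed fallback phrase makes "no answer" cases easy to detect downstream.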