
Beginner Guide to RAG - the architecture behind LLM Applications

Updated: Sep 15, 2023

In a series of blog posts, we have explored the incredible potential of Large Language Models (LLMs) to revolutionize business applications, enabling meaningful impacts across various industries. If you haven't already, be sure to check out our previous posts.

What truly simplifies the development of these LLM-powered applications is a framework called Retrieval Augmented Generation (RAG) - an emerging architecture that has garnered significant attention lately. In this article, we will delve into the high-level architecture of RAG. Note that this blog post is just a gentle introduction to this architecture and will not provide deep technical details.




Why do we need RAG?

Imagine asking an LLM, whether it's GPT, Llama, Falcon, or any other, "Who is Elon Musk?" In most cases, you'll receive an accurate response because this information is readily available on the internet. However, if you inquire about a lesser-known entity, such as yourself or your organization's private data, the LLM often struggles. Here's why:

  1. Limited Training Data: LLMs are only aware of concepts within their training data, making them ill-equipped to answer questions about private or obscure data.

  2. Hallucinated Answers: To fill knowledge gaps, LLMs might fabricate responses based on assumptions, leading to inaccurate "hallucinated" answers.

  3. Difficulty in Learning New Concepts: Teaching LLMs new concepts can be an arduous task.

Retrieval-augmented generation (RAG) serves as the solution to these challenges. It's an AI framework designed to improve the quality of LLM-generated responses by anchoring the model to external knowledge sources. This process involves retrieving relevant external data and incorporating it into the LLM's generative process. In simpler terms, RAG provides LLMs with additional context, potentially including your private data, to guide them in generating more accurate responses. Here's how the process unfolds when a user submits a query:

  1. The user inputs a query.

  2. The application searches for relevant information in its knowledge base (usually a vector database).

  3. The application retrieves all relevant information and sends it to the LLM.

  4. The LLM uses its innate language understanding, your private data, the prompts provided by the application's owner, and the user's query to generate a response.

This retrieval of external information addresses the previously mentioned challenges, empowering LLMs to answer questions related to private data effectively. Now, let's explore how RAG applications are constructed and their key modules.
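For illustration, here is a minimal sketch of that four-step flow in Python. The retrieve() and call_llm() helpers are hypothetical placeholders (a real application would use semantic search and an actual LLM API); the point is simply how the retrieved context, the owner's prompt, and the user's query are combined before the model is called.

# Hypothetical sketch of the RAG request flow (not any specific library's API).

def retrieve(query: str, knowledge_base: list[str], top_k: int = 3) -> list[str]:
    """Placeholder retriever: rank documents by naive word overlap with the query.
    A real RAG application would use semantic search instead (covered below)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def call_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM provider you use."""
    return f"<model response to a {len(prompt)}-character prompt>"

def answer(query: str, knowledge_base: list[str]) -> str:
    # Steps 2-3: search the knowledge base and gather the relevant context.
    context = "\n".join(retrieve(query, knowledge_base))
    # Step 4: combine the owner's instructions, the context, and the user's query.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

knowledge_base = ["Acme Corp was founded in 2015.", "Acme Corp sells solar panels."]
print(answer("When was Acme Corp founded?", knowledge_base))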


Key Modules of RAG

Source: Retrieval | LangChain. (n.d.). https://python.langchain.com/docs/modules/data_connection/


Load Your Private Data

The first step in building a RAG application involves loading your unique, private data into the application's "knowledge base." This data, which may originate from various sources like PDF files, Excel sheets, code repositories, or public websites, enriches LLMs with information they haven't encountered during their training.
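As a rough sketch, data loading is usually handled by off-the-shelf document loaders. The example below assumes LangChain's loaders (exact import paths vary by version) and uses a placeholder file name and URL:

# Rough sketch using LangChain document loaders; "handbook.pdf" and the URL
# are placeholders, and import paths differ across LangChain versions.
from langchain.document_loaders import PyPDFLoader, WebBaseLoader

pdf_docs = PyPDFLoader("handbook.pdf").load()            # each PDF page becomes a Document
web_docs = WebBaseLoader("https://example.com/faq").load()

documents = pdf_docs + web_docs
print(f"Loaded {len(documents)} documents for the knowledge base.")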


Semantic Search

When users input their queries, RAG applications retrieve relevant information from their knowledge base. But how does the application know which information is relevant to the user's question? This is the job of Semantic Search - an advanced information retrieval technique that aims to improve the accuracy and relevance of search results. Semantic search goes beyond traditional keyword-based search methods by considering the meaning and intent behind a user's query. It leverages text embeddings, which are multi-dimensional numerical representations of "meaning" generated by language models. These embeddings capture the semantic relationships between words and phrases, allowing for more nuanced and context-aware document retrieval.
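To make this concrete, here is a small sketch of semantic similarity scoring with text embeddings. It assumes the sentence-transformers library and its all-MiniLM-L6-v2 model, which are just one convenient choice among many; the documents and query are illustrative placeholders:

# Small sketch of semantic search scoring, assuming sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The wolf is a wild animal that hunts in packs.",
    "Chickens and dogs are common on farms.",
    "Apple announced a new iPhone this fall.",
]
query = "Which texts talk about animals?"

doc_embeddings = model.encode(docs, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity: higher scores mean closer meaning, even with no shared keywords.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")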


Semantic search involves several transformation steps in order to best prepare the documents for retrieval:

1. Document Transform: One of the primary transformation steps is splitting (or chunking) a large document into smaller chunks. Once we have a list of smaller chunks of text, we can feed these chunks into an embedding model.

2. Text Embeddings: Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of text that are similar. After embedding, your information will look like this in the vector space.

Source: A gentle introduction to vector search. (2022, November 23). Open Data Science (ODSC), Medium.


So imagine that you search for animals: you will retrieve the dots for chicken, wolf, and dog, but not the ones for apple or Google.

3. Vector Stores: With the rise of embeddings, there has emerged a need for databases that support efficient storage and searching of these embeddings. Examples include FAISS, Pinecone, and Chroma. These databases store multi-dimensional numerical representations of your data (all the small dots you see in the picture above) and allow us to retrieve these 'dots' quickly and easily. This vector database is now our application's knowledge base.

Source: What is a Vector Database? (n.d.). Pinecone. https://www.pinecone.io/learn/vector-database/

4. Retrievers: When a user inputs a question, here is the process inside RAG:

  1. The application issues a query.

  2. RAG embeds the query with the same embedding model that was used in Step 2 to build the knowledge base.

  3. RAG uses those embeddings to query the database for similar vector embeddings. In other words, RAG retrieves all the information in its knowledge base with high similarity to the question.

After the retrieval step, we feed all the retrieved information to the LLM, equipping it with knowledge outside its training data and empowering it to answer new questions about your private data.
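Putting steps 2-4 together, here is a compact sketch of a knowledge base and retriever built with FAISS and sentence-transformers. Both libraries, the chunks, and the query are illustrative assumptions, not the only way to do this:

# Compact sketch of embedding chunks, storing them in a vector index, and
# retrieving by query similarity. FAISS and sentence-transformers are assumed;
# the chunks and the query are placeholders.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Acme Corp was founded in 2015 in Berlin.",
    "Acme Corp's main product is a solar inverter.",
    "The office cafeteria is open from 8am to 4pm.",
]

# Embed the chunks and store them in a vector index (the knowledge base).
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(chunk_embeddings.shape[1])   # inner product = cosine on normalized vectors
index.add(np.asarray(chunk_embeddings, dtype="float32"))

# Retriever: embed the query with the SAME model, then search the index.
query = "Where was Acme Corp founded?"
query_embedding = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_embedding, dtype="float32"), k=2)

retrieved = [chunks[i] for i in ids[0]]
print(retrieved)   # these chunks are then passed to the LLM together with the query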


RAG Challenges

There are also some considerations we need to keep in mind with RAG architecture:

  1. Time Sensitivity: Imagine your organization's data changes over time, making older data no longer relevant. How do you ensure that the application only retrieves the latest data? (One answer is to combine semantic similarity with a time decay; a small sketch of this idea appears below.)

  2. Strong Lexical Matching: This is actually a challenge that I ran into when building the Medical Chatbot. Imagine you search for "Changes to be expected in the 8th week of pregnancy" but get results for "Changes to be expected in the 10th week of pregnancy" or "Changes to be expected in the 4th week of pregnancy". This happens because the level of similarity between these "dots" is too high, resulting in cross-referenced documents and thus poor outcomes.

  3. Ambiguous or Complex Queries: Customer queries aren't always straightforward. They can be ambiguously worded, complex, or require knowledge the model either doesn't have or can't easily parse. These are the conditions in which LLMs are prone to making things up. This is the reason why some questions generate desirable responses while others do not. So how do we ensure that the application makes the most of the user's input to generate a good response?

There are multiple retrieval strategies to resolve these obstacles; you can read more at: https://python.langchain.com/docs/modules/data_connection/retrievers/
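As a minimal illustration of the time-decay idea from the first challenge above (this is a generic sketch, not any particular library's formula), you can down-weight the semantic similarity score by the age of a document:

# Illustrative sketch: combine a semantic similarity score with an exponential
# time decay so that stale documents rank lower even when they are very similar.
from datetime import datetime, timezone

def time_weighted_score(similarity: float, last_updated: datetime,
                        now: datetime, half_life_days: float = 30.0) -> float:
    """Multiply the similarity by a decay factor that halves every half_life_days."""
    age_days = (now - last_updated).total_seconds() / 86400
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

now = datetime(2023, 9, 15, tzinfo=timezone.utc)
fresh = time_weighted_score(0.80, datetime(2023, 9, 10, tzinfo=timezone.utc), now)
stale = time_weighted_score(0.85, datetime(2023, 3, 1, tzinfo=timezone.utc), now)
print(round(fresh, 3), round(stale, 3))   # the fresher document now outranks the older one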


Conclusion

In conclusion, Retrieval Augmented Generation (RAG) empowers LLMs to harness external knowledge sources, enhancing their ability to provide accurate responses and tackle private data inquiries. By understanding the key modules of RAG, businesses can build applications that offer a more refined, context-aware, and insightful user experience. RAG is currently the best-known tool for grounding LLMs on the latest, verifiable information, and lowering the costs of having to constantly retrain and update them. But RAG is imperfect, and many interesting challenges remain in getting RAG done right.


References

Superwise. (2023, August 9). Emerging architectures for LLM applications [Video]. YouTube. https://www.youtube.com/watch?v=Pft04KLw5Lk

Martineau, K. (2023). What is retrieval-augmented generation? IBM Research Blog. https://research.ibm.com/blog/retrieval-augmented-generation-RAG

Retrievers | LangChain. (n.d.). https://python.langchain.com/docs/modules/data_connection/retrievers/

Ahmed, Z. (2023, July 13). Semantic Search in the context of LLMs - Zul Ahmed - medium. Medium. https://medium.com/@zahmed333/semantic-search-in-the-context-of-llms-7961308cd6ad

What is a Vector Database? (n.d.). Pinecone. https://www.pinecone.io/learn/vector-database/




