While working on several small projects involving LLM applications, I ran into many issues with the quality of the model outputs. I experimented with several language models, including GPT-3.5, GPT-4, and Llama 2, and found that none of them were free from inconsistency and hallucination. So I decided to write about it. In this post, I describe the challenges I have personally faced when running LLM applications and suggest some available solutions for each.
Image Source: What causes LLMs to hallucinate | SabrePC Blog. (n.d.). https://www.sabrepc.com/blog/Deep-Learning-and-AI/what-causes-large-language-models-to-hallucinate
Problem 1: Inconsistency
Large language models are often inconsistent. Sometimes they answer a question accurately; at other times they simply repeat random information from their training data. When these models seem uncertain or incoherent, it is because they do lack true comprehension: they excel at recognizing statistical relationships among words but have no genuine semantic understanding.
I ran into this problem while working on this project [ [ChatGPT] Mastering the Art of Product Documentation: an AI-Powered Feature Analysis Prompt! - https://kimngannguyen1912.wixsite.com/my-site/post/mastering-the-art-of-product-documentation-an-ai-powered-feature-analysis-prompt ], in which I use a predefined prompt to generate Business Requirement Documents from a very simple feature description. Even with the same feature description, ChatGPT yields different outputs each time. Sometimes it produces very good results, but there is no guarantee that it will produce the same level of quality with a different input variable. So how do we overcome this challenge?
Solution 1: Adjust the temperature
One of the easiest solutions is to set the model temperature to 0. Temperature is a hyperparameter that controls the randomness of the model's generation. When it is set to 0, the model always chooses the most probable next token given what it learned in training, so it generates the same output for a given input. In practice, some hosted models can still show small variations at temperature 0, but most of the randomness that leads to inconsistent responses is removed.
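To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled sampling in plain Python. It is not how any particular model or API implements decoding; the candidate tokens, their scores, and the function names are all made up for illustration, and temperature 0 is simply treated as "take the highest-scoring token".

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(tokens, logits, temperature, rng):
    """Greedy pick at temperature 0, otherwise sample from the scaled distribution."""
    if temperature == 0:
        # Assumption: temperature 0 means "always take the highest-scoring token".
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidate tokens and scores, invented for this example.
tokens = ["Paris", "London", "Rome"]
logits = [2.0, 1.5, 0.5]

rng = random.Random(0)
greedy = {pick_next_token(tokens, logits, 0, rng) for _ in range(20)}
sampled = {pick_next_token(tokens, logits, 1.5, rng) for _ in range(200)}

print(greedy)   # a single distinct token across all runs
print(sampled)  # more than one distinct token across runs
```

At temperature 0 every run picks the same token, while a higher temperature spreads the picks across several candidates, which is the source of the run-to-run variation described above.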
However, this method does not ensure consistent results when we slightly alter the input prompt. So how do we replicate good answers when the input variable changes? Let's look at solution 2.
Solution 2: In-context Learning for LLMs
Few-shot in-context learning is a powerful approach that exploits the capabilities of LLMs to solve a task given a small set of input-output examples at test time. A few good question-answer examples are combined into the prompt, so we instruct the LLM how to produce a good output by showing it what good output looks like. Let's look at the example prompt below:
From the given system architecture, create mermaid code to visualize the architecture. Here is an example of good mermaid code for this task:
flowchart LR
    subgraph Azure
        direction LR
        s[fa:fa-code Server]
        db[(fa:fa-table Database)]
    end
    subgraph Netlify
        c[fa:fa-user Client]
    end
    c -- 1. HTTP GET --> s
    s -- 2. SQL Query --> db
    db -. 3. Result Set .-> s
    s -. 4. JSON .-> c
Because I gave GPT an example of good output, it tends to follow the same template even when I change the input variable. This helps replicate the same good and consistent results across many different queries.
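Assembling such a prompt can be automated. The sketch below shows one possible way to combine an instruction, a list of example pairs, and a new input into a single few-shot prompt. The function name, the "Example N input/output" layout, and the sample architecture descriptions are my own assumptions, not a required format.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Combine an instruction, worked examples, and the new input into one prompt."""
    parts = [instruction.strip(), ""]
    for number, (example_input, example_output) in enumerate(examples, start=1):
        parts.append(f"Example {number} input:\n{example_input.strip()}")
        parts.append(f"Example {number} output:\n{example_output.strip()}")
        parts.append("")
    # The prompt ends on an open "Output:" slot for the model to complete.
    parts.append(f"Input:\n{new_input.strip()}")
    parts.append("Output:")
    return "\n".join(parts)

# Hypothetical example pair and new query, invented for illustration.
examples = [
    (
        "A client hosted on Netlify calls a server on Azure, which queries a database.",
        "flowchart LR\n    c[Client] --> s[Server]\n    s --> db[(Database)]",
    ),
]
prompt = build_few_shot_prompt(
    "From the given system architecture, create mermaid code to visualize the architecture.",
    examples,
    "A mobile app calls an API gateway, which forwards requests to two microservices.",
)
print(prompt)
```

Keeping the examples in a list makes it easy to swap in better ones later without touching the instruction or the input variable.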
Solution 3: Multi Query Retriever
LLMs in general are very sensitive to the input prompt. Some questions generate very good responses, while many others don't. Meanwhile, customer queries aren’t always straightforward. They can be ambiguously worded, complex, or require knowledge the model either doesn’t have or can’t easily parse. These are the conditions in which LLMs are prone to making things up. So, how do we make the most out of the user's input queries?
Prompt engineering / tuning (as suggested in Solution 2) is sometimes done to address these problems manually, but it can be tedious. Another solution we can consider is the Multi Query Retriever. The MultiQueryRetriever automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the MultiQueryRetriever might be able to overcome some of the limitations of distance-based retrieval and get a richer set of results. (LangChain)
Source: MultiQueryRetriever | Langchain. (n.d.). https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever
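The idea of "several query variants, then a unique union of the results" can be sketched without any library. The code below is not the LangChain implementation: the query variants are hard-coded where a real pipeline would ask an LLM to write them, the retriever is a toy word-overlap ranker instead of a vector store, and every name and document is hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def generate_query_variants(question):
    """Stand-in for the LLM step: the question plus hand-written rephrasings."""
    rephrasings = {
        "How do I reduce hallucination?": [
            "ways to make LLM answers more factual",
            "grounding model output in retrieved documents",
        ],
    }
    return [question] + rephrasings.get(question, [])

def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]

def multi_query_retrieve(question, documents, top_k=2):
    """Retrieve for every query variant and keep the unique union, in first-seen order."""
    seen = []
    for query in generate_query_variants(question):
        for doc in retrieve(query, documents, top_k):
            if doc not in seen:
                seen.append(doc)
    return seen

# Hypothetical document collection, invented for illustration.
documents = [
    "Retrieved documents help with grounding model output.",
    "Lower temperature can reduce randomness in answers.",
    "Prompt instructions can make LLM answers more factual.",
    "Mermaid diagrams visualize system architecture.",
]
question = "How do I reduce hallucination?"
single = retrieve(question, documents)
multi = multi_query_retrieve(question, documents)
print(len(single), len(multi))
```

With only the original wording, the toy retriever misses documents that use different vocabulary; the rephrased queries pull them in, and the union step keeps each document only once.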
Problem 2: Hallucination
A hallucination refers to a situation where the model generates text that is not based on factual or accurate information, often producing fictional or misleading content. Addressing hallucinations is a crucial challenge in improving the reliability and trustworthiness of AI-generated content. Imagine building a medical chatbot that gives hallucinated content. Dangerous, isn't it?
To solve the problem of hallucination, let's first understand why LLMs hallucinate. In short, LLMs are only aware of concepts within their training data, making them ill-equipped to answer questions about private or updated data. To fill knowledge gaps, LLMs might fabricate responses based on assumptions, leading to inaccurate "hallucinated" answers. The following solutions address this issue:
Solution 1: Prompt Engineering
The simplest solution to the hallucination problem is to add an instruction to the prompt, for example:
Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know"
However, we also want the LLM to provide an accurate and correct answer, not only to admit when it doesn't know. This can be achieved by giving the model additional context within the prompt. In an enterprise context, this can be done effectively with RAG.
Source: Greyling, C. (2023, January 10). Preventing LLM Hallucination With Contextual Prompt Engineering — An Example From OpenAI. Medium. https://cobusgreyling.medium.com/preventing-llm-hallucination-with-contextual-prompt-engineering-an-example-from-openai-7e7d58736162
Solution 2: Retrieval Augmented Generation (RAG)
RAG is an AI framework designed to improve the quality of LLM-generated responses by anchoring the model to external knowledge sources. This process involves retrieving relevant external data and incorporating it into the LLM's generative process. In simpler terms, RAG provides LLMs with additional context, potentially including your private data, to guide them in generating more accurate responses. The additional context covers the model's knowledge gap and thus makes it less likely to hallucinate. I have dedicated a whole blog post to RAG (https://kimngannguyen1912.wixsite.com/my-site/post/beginner-guide-to-rag-the-architecture-behind-llm-applications). Please make sure to check it out.
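As a rough illustration of the "retrieve, then add to the prompt" flow, here is a minimal sketch in plain Python. It assumes a tiny in-memory knowledge base and a word-overlap retriever; a real system would typically use embeddings and a vector store. The knowledge base entries, the question, and the function names are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_context(question, knowledge_base, top_k=1):
    """Toy retriever: return the passages sharing the most words with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(p)), p) for p in knowledge_base]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in ranked[:top_k] if score > 0]

def build_rag_prompt(question, knowledge_base, top_k=1):
    """Put the retrieved passages in front of the question, with a fallback instruction."""
    passages = retrieve_context(question, knowledge_base, top_k)
    context = "\n".join(f"- {p}" for p in passages) or "(no relevant context found)"
    return (
        "Answer the question as truthfully as possible using only the context below. "
        'If the answer is not in the context, say "Sorry, I don\'t know".\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical private knowledge base, invented for illustration.
knowledge_base = [
    "The refund window for annual plans is 30 days.",
    "Support is available on weekdays from 9am to 5pm.",
    "The office cafeteria serves lunch at noon.",
]
rag_prompt = build_rag_prompt("How long is the refund window for annual plans?", knowledge_base)
print(rag_prompt)
```

The resulting prompt is what gets sent to the LLM, so the model answers from the retrieved passage rather than from whatever it remembers from training.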
Solution 3: FLARE
Forward-Looking Active REtrieval augmented generation (FLARE) is a generic retrieval-augmented generation method that iteratively uses a prediction of the upcoming sentence to anticipate future content. That prediction is then used as a query to retrieve relevant documents, and the sentence is regenerated if it contains low-confidence tokens.
Here is the process behind FLARE:
Input question: Ex: Generate a summary about Joe Biden.
Look up an initial set of relevant documents (illustrated with the black line in the above image)
Use this initial set and query to predict the upcoming sentence: The first sentence it predicts is: "Joe Biden (born November 20, 1942) ..."
Check the confidence level of the predicted tokens.
If the confidence level is high, continue predicting the next sentence: For the first sentence, the confidence level is high, so it moves on to the next sentence.
Else, use the predicted sentence as a query to look up more relevant documents: In the second sentence, there are 2 spans with a low confidence level: "the University of Pennsylvania" and "a law degree". Thus, FLARE will automatically generate a question like "What university did Joe Biden attend?" or "What degree did Joe Biden earn?". After asking itself such questions, FLARE can correct the previously generated response to "He graduated from the University of Delaware in 1965 with a Bachelor of Arts ...". Then, it uses the corrected sentence to predict the next sentence. This process continues iteratively to generate the full response.
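The steps above can be sketched as a small loop. This is a simplified illustration, not the authors' implementation: the language model is replaced by a scripted stand-in that returns text spans with made-up confidence scores, the retriever is a toy word-overlap ranker, the whole draft sentence is used as the retrieval query, and the threshold of 0.6 is an arbitrary assumption.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def fake_retrieve(query, corpus, top_k=1):
    """Toy retriever: return the documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return [doc for doc in ranked[:top_k] if query_words & tokenize(doc)]

def fake_generate(sentences_so_far, documents):
    """Scripted stand-in for the LLM: returns (span, confidence) pairs, or None when done."""
    if len(sentences_so_far) == 0:
        return [("Joe Biden was born on November 20, 1942.", 0.95)]
    if len(sentences_so_far) == 1:
        if any("University of Delaware" in doc for doc in documents):
            return [("He graduated from", 0.9), ("the University of Delaware", 0.9), ("in 1965.", 0.85)]
        # Without supporting evidence the stand-in guesses, with low confidence.
        return [("He attended", 0.9), ("the University of Pennsylvania,", 0.3),
                ("where he earned", 0.8), ("a law degree.", 0.35)]
    return None

def flare_answer(question, corpus, generate, retrieve, threshold=0.6, max_sentences=10):
    """Draft each sentence; if any span is low-confidence, retrieve and regenerate it."""
    documents = retrieve(question, corpus)  # initial retrieval from the question
    sentences, retrieval_queries = [], []
    for _ in range(max_sentences):
        draft = generate(sentences, documents)
        if draft is None:
            break
        if min(confidence for _, confidence in draft) < threshold:
            query = " ".join(span for span, _ in draft)  # the draft itself is the query
            retrieval_queries.append(query)
            documents = documents + [d for d in retrieve(query, corpus) if d not in documents]
            draft = generate(sentences, documents) or draft  # regenerate with more evidence
        sentences.append(" ".join(span for span, _ in draft))
    return " ".join(sentences), retrieval_queries

# Hypothetical two-document corpus, invented for illustration.
corpus = [
    "Joe Biden was born on November 20, 1942.",
    "He graduated from the University of Delaware in 1965 with a Bachelor of Arts.",
]
answer, queries = flare_answer("Generate a summary about Joe Biden.", corpus, fake_generate, fake_retrieve)
print(answer)
print(queries)
```

The first sentence is accepted as drafted, while the low-confidence second draft triggers one extra retrieval and is replaced by a version grounded in the retrieved document.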
FLARE was evaluated against baselines on 4 long-form, knowledge-intensive generation tasks/datasets and achieved superior or competitive performance on all of them, demonstrating the effectiveness of the method.
Source: Jiang, Z. (2023, May 11). Active Retrieval Augmented generation. arXiv.org. https://arxiv.org/abs/2305.06983
Conclusion
Addressing inconsistency and hallucination in LLM applications is crucial for ensuring the reliability and trustworthiness of AI-generated content. By employing strategies such as adjusting the temperature, in-context learning, multi-query retrieval, prompt engineering, retrieval augmented generation (RAG), and FLARE, we can significantly enhance the reliability and quality of LLM outputs. These solutions empower developers and researchers to harness the full potential of LLMs while minimizing inaccuracies and ensuring trustworthy AI-generated content. As the field continues to evolve, tackling these challenges remains crucial for advancing the capabilities of LLM applications.
Reference
Jiang, Z. (2023, May 11). Active Retrieval Augmented generation. arXiv.org. https://arxiv.org/abs/2305.06983
Mallick, B. (2023, June 15). Enhancing Large Language Models with Retrieval Augmented Generation. Medium. https://tech.timesinternet.in/enhancing-large-language-models-with-retrieval-augmented-generation-e2625a50bd1d
Greyling, C. (2023, January 10). Preventing LLM Hallucination With Contextual Prompt Engineering — An Example From OpenAI. Medium. https://cobusgreyling.medium.com/preventing-llm-hallucination-with-contextual-prompt-engineering-an-example-from-openai-7e7d58736162
MLOps.community. (2023, June 29). Building LLM applications for production // Chip Huyen // LLMs in Prod Conference [Video]. YouTube. https://www.youtube.com/watch?v=spamOhG7BOA