
[Llama2 + Langchain + Pinecone] Unlock the Knowledge Within: Chat with Your Books, PDF files

Updated: Aug 13, 2023

In this article, we will explore the exciting world of natural language processing and build an advanced chatbot capable of answering questions from PDF files. Our chatbot's intelligence will be driven by the combined forces of three powerful technologies: Langchain, Llama 2, and Pinecone. We'll walk you through each step, from installing the required packages to utilizing the state-of-the-art Llama 2 model to extract and comprehend information from PDFs. Let's dive in and unleash the potential of these cutting-edge tools!


An Overview of the Chatbot Architecture

This architecture is taken from this video: https://youtu.be/ckb4DnHLBrU. I highly recommend checking out the author's YouTube channel to support his work.

The architecture can be summarized as follows:

  1. Extract the content from the PDF file

  2. Split the extracted text into smaller chunks

  3. Create embeddings that map each text chunk into vector space

  4. Build a semantic index over the text chunks

  5. Create a knowledge base from the text chunks

  6. When a user asks a question, the bot will:

  • Create embeddings for the user's question

  • Use semantic search to identify texts relevant to the question

  • Feed the question + relevant texts to the LLM (Llama 2) to get the final answer

In this project, we utilise three main technologies: Langchain, Pinecone, and Llama 2. Langchain and Llama 2 are open source, and Pinecone offers a free tier, so all three can be used at no cost.


Langchain

Langchain is an open-source framework that makes it easy to build applications using large language models (LLMs). It provides a variety of features for processing text, interacting with LLMs, and chaining together different components to create complex applications. Langchain is a powerful tool for developers who want to build applications that can understand and respond to natural language.

Here are some of the key features of Langchain:

  • Support for a variety of LLMs: Langchain can be used with many LLM providers and runtimes, including OpenAI's GPT models, models hosted on Hugging Face, and locally run models such as Llama 2 via llama.cpp.

  • Easy-to-use API: Langchain provides a simple API for processing text, interacting with LLMs, and chaining different components together.

  • Flexible architecture: Langchain's flexible architecture makes it easy to build complex applications that can understand and respond to natural language.
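To make the idea of "chaining" concrete, here is a minimal sketch that wires a prompt template to an LLM with Langchain's LLMChain. A fake stub LLM stands in so the snippet runs without downloading a model; you could swap in any supported LLM, such as the LlamaCpp model we load later in this article.

from langchain.chains import LLMChain
from langchain.llms.fake import FakeListLLM  # stub LLM, handy for demos
from langchain.prompts import PromptTemplate

# Chain a prompt template to an LLM: the chain fills in the template,
# sends the rendered prompt to the model, and returns the model's reply.
llm = FakeListLLM(responses=["Langchain chains prompts, models, and tools."])
prompt = PromptTemplate(
    template="Summarize the following text in one sentence:\n{text}",
    input_variables=["text"],
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(text="Langchain is a framework for building LLM applications."))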

Learn more about Langchain at: https://python.langchain.com


Vector Database and Pinecone

A vector database is a type of database that stores data as high-dimensional vectors. These vectors represent the features or attributes of the data and can be used to perform similarity search. Pinecone is a managed vector database service that is well suited to natural language processing (NLP) applications: it provides features for indexing and querying embedding vectors and can be used to build powerful NLP applications.
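At its core, similarity search just ranks stored vectors by how close they are to a query vector. Here is a toy NumPy sketch of that idea, using random vectors as stand-ins for real embeddings; a vector database does the same thing, but at scale and with approximate indexes:

import numpy as np

# Toy illustration of the core idea behind a vector database: store embedding
# vectors and rank them by cosine similarity to a query vector.
vectors = np.random.rand(1000, 384)   # stand-ins for 1,000 text-chunk embeddings
query = np.random.rand(384)           # stand-in for a question embedding

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine_similarity(v, query) for v in vectors])
top_5 = np.argsort(scores)[::-1][:5]  # indices of the 5 most similar chunks
print(top_5)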

Here are some of the key features of Pinecone:

  • High-performance indexing: Pinecone builds approximate nearest-neighbour (ANN) indexes over high-dimensional vectors, which allows it to perform similarity search over large collections of embeddings very efficiently.

  • Scalability: Pinecone is designed to be scalable, and can be used to index and query large datasets.

  • Ease of use: Pinecone provides a simple and easy-to-use API that makes it easy to index and query text data.

Learn more about Pinecone at: https://docs.pinecone.io/docs/overview


Llama 2

Llama 2 is a family of state-of-the-art open-access large language models released by Meta in July 2023. It includes models ranging from 7 billion to 70 billion parameters, engineered to excel at a variety of language processing tasks.

Building upon its predecessor, LLaMA, Llama 2 introduces several improvements. The pretraining corpus size has been increased by 40%, allowing the model to learn from a broader and more diverse range of publicly available data. Furthermore, Llama 2 doubles the context length from 2,048 to 4,096 tokens, enabling the model to consider a larger context when generating responses, which improves output quality and accuracy.

Llama 2-Chat is a version of Llama 2 that has been fine-tuned for dialogue-related applications. Through the fine-tuning process, the model has been optimized to deliver superior performance, ensuring it generates more contextually relevant responses during conversations.

Llama 2 was pretrained on openly accessible online data sources. The fine-tuned version, Llama 2-Chat, additionally leveraged publicly available instruction datasets and more than 1 million human annotations.

The best part? Llama 2 is available for free, both for research and commercial use.


Learn more about Llama 2 at: https://ai.meta.com/llama


Step-by-step guide to building the chatbot

0. Install all necessary packages and libraries

!pip install langchain
!pip install pypdf
!pip install unstructured
!pip install sentence_transformers
!pip install pinecone-client
!pip install llama-cpp-python
!pip install huggingface_hub

import os

import pinecone
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

from langchain.document_loaders import PyPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import LlamaCpp
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

1. Extracting Data from PDF using PyPDFLoader

The first step in our chatbot's architecture is extracting content from PDF files. For this purpose, we employ PyPDFLoader, a Langchain document loader built on the pypdf library. PyPDFLoader parses the PDF file, extracts the textual content, and prepares it for further processing.

loader = PyPDFLoader("path to your pdf file")
data = loader.load()  # each page of the PDF becomes one Document

In this project, I use a PDF case study on Ant Financial and financial inclusion in China as my data source.


2. Splitting PDF into Smaller Text Chunks using Recursive Character Text Splitter

Handling lengthy documents can be challenging, especially for complex PDF files. To overcome this, we employ Langchain's RecursiveCharacterTextSplitter, which breaks the extracted text down into smaller, more manageable chunks. The overlap between consecutive chunks ensures that sentences cut at a chunk boundary still appear intact in at least one chunk. This step ensures that our chatbot can process and analyze the content efficiently.

# ~500-character chunks with 100 characters of overlap between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = text_splitter.split_documents(data)
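A quick sanity check after splitting (the printed values depend on your PDF):

# inspect how many chunks were produced and peek at the first one
print(f"Split into {len(docs)} chunks")
print(docs[0].page_content[:200])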

3. Creating Embeddings for Text Chunks using HuggingFaceEmbeddings

Once we have our text chunks, the next crucial step is to convert them into meaningful numerical representations called embeddings. In this architecture, we use HuggingFaceEmbeddings with the model "sentence-transformers/all-MiniLM-L6-v2". HuggingFaceEmbeddings is a Langchain wrapper around the Sentence Transformers library, which specializes in transforming textual data into vector-space embeddings. These embeddings capture the semantic meaning of the text, allowing the chatbot to understand the context and nuances of the content.

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
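You can verify the embedding dimension up front; all-MiniLM-L6-v2 produces 384-dimensional vectors, which is why we create the Pinecone index with dimension 384 in the next step:

# embed a sample string and confirm the vector is 384-dimensional
sample_vector = embeddings.embed_query("What is Financial Inclusion?")
print(len(sample_vector))  # 384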

4. Storing Embedding Vectors in Pinecone

Pinecone comes into play as the cloud-native vector database and similarity search service. After generating the embeddings for each text chunk, we store these vectors in Pinecone's high-performance database. Pinecone's efficient indexing and retrieval mechanisms enable fast and accurate similarity searches, making it ideal for building our knowledge base.


First, you need an account at https://app.pinecone.io/

Next, create a new index in Pinecone. We specify the metric type as "cosine" and the dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimensional vectors.

Next, we set up the environment with the Hugging Face Access Token and Pinecone API Key:

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', 'your pinecone key')
PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV', 'us-west1-gcp-free')  

Finally, we initialize pinecone:

# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
index_name = "langchainllama2"  # put the name of your Pinecone index here
docsearch = Pinecone.from_texts([t.page_content for t in docs], embeddings, index_name=index_name)
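If you prefer to create the index from code rather than from the Pinecone console, the v2 client exposes list_indexes and create_index; a minimal sketch (run it after pinecone.init and before the Pinecone.from_texts call above):

# create the index programmatically if it does not exist yet
if index_name not in pinecone.list_indexes():
    pinecone.create_index(name=index_name, dimension=384, metric="cosine")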

5. Building Semantic Index and Utilizing Pinecone Similarity Search

With our embeddings stored in Pinecone, we can now build a semantic index for our text chunks. The semantic index groups similar text chunks together, forming a knowledge base for our chatbot. When a user asks a question, we first create embeddings for their question. We then employ Pinecone's similarity search to identify relevant text chunks that may contain potential answers.

query="What is Financial Inclusion?"
docs=docsearch.similarity_search(query)
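You can control how many chunks are returned via the k parameter and inspect what the search found (the output depends on your document):

# fetch just the top 2 matches and peek at each one
top_docs = docsearch.similarity_search(query, k=2)
for doc in top_docs:
    print(doc.page_content[:100], "...")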

6. Feeding Relevant Texts and User's Question to LLM (Llama 2) Model

With relevant text chunks at hand, we take the user's question and combine it with these texts. We then pass this question-context combination to LLM (Llama 2), an advanced NLP model renowned for its contextual understanding capabilities. Llama 2 processes the information and delivers a final answer based on its comprehensive understanding of the context and the user's question.


The model I use in this project is TheBloke/Llama-2-13B-chat-GGML. Llama-2-Chat is optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks tested, and in human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM.

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"  # 5-bit quantized model in GGML (.bin) format

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 256  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Load the quantized Llama 2 model through llama-cpp-python
llm = LlamaCpp(
    model_path=model_path,
    max_tokens=256,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    n_ctx=1024,
    verbose=False,
)
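Before wiring the model into the QA chain, you can give it a quick smoke test by calling it directly (a hypothetical prompt; the callback handler streams the generated tokens to stdout):

# call the model directly with a plain prompt
llm("Question: What is a vector database? Answer:")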

Now we can use Langchain to chain all the components together and generate a response. Note that aside from the built-in Llama 2 Chat prompt, you can also customize your own prompt to generate better answers. The model's prompt template is shown below:


Prompt template: Llama-2-Chat

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
USER: {prompt}
ASSISTANT:

You can add your USER prompt as below:

prompt_template = """Use the following pieces of context to answer the question at the end. Give the output in 500 words.

{context}

Question: {question}
Answer in English:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
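Before running the chain, you can preview the exact prompt the model will receive by rendering the template with placeholder values:

# render the template with placeholder values to preview the final prompt
print(PROMPT.format(context="<retrieved chunks go here>", question="What is Financial Inclusion?"))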

Now let the chain run and give you the answer:

chain = load_qa_chain(llm, chain_type="stuff", prompt=PROMPT)

query = "What is Financial Inclusion?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

Results

Let's look at some answers generated by the chatbot.


Question: What is Financial Inclusion?

Answer: Financial inclusion refers to the access of individuals and businesses to useful and affordable financial products and services, which can help achieve goals such as eliminating poverty, reducing hunger, and narrowing the gender and income inequality gap. It is considered a critical cornerstone of nation-building by China, which has announced a national strategy to develop financial inclusion by 2020. This includes making finance virtually accessible and affordable to everyone through the power of technology.


Question: How does QR merchant growth impact financial inclusion in China?

Answer: QR Merchant Growth Plan has significantly impacted financial inclusion in China by contributing to the country's economy, increasing credit visibility, and playing a part in the construction of digital cities. By the end of 2018, Ant Financial had served more than 12.27 million micro-enterprises, which collectively made up one-sixth of all micro-enterprises in China. The QR Merchant Growth Plan has provided financial services such as credit, loans, insurance, and wealth management to these micro-enterprises, thereby promoting financial inclusion in China.


Question: Is there a role reversal between Ant Financial and Chinese Commercial Banks?

Answer: Yes, there has been a role reversal between Ant Financial and Chinese commercial banks. In the past, Chinese banks provided limited services to small businesses, leaving a large portion of the market unserved. However, with the rise of Ant Financial and other fintech companies, the traditional banking model has been disrupted, and now Chinese commercial banks are competing for a small pool of good-credit small and micro businesses, while Ant Financial serves the majority of risky but in-need financing. This is a significant shift in the financial landscape of China, as fintech companies have taken the lead in serving the underbanked segment of the population.


Question: Why does Ant Financial select financial inclusion as its strategy?

1st Answer: I don't know the answer to this question based on the provided context. The text does not explicitly state why Ant Financial selects financial inclusion as its strategy. However, it mentions that the company focuses on developing cutting-edge technologies to enable inclusive finance and improves efficiency and lowers costs at approximately 200 financial institutions. It also attributes its success to its dedication to financial inclusion through digital technology and positioning itself as a techfin company. Additionally, it incorporates the vision of bringing small and beautiful changes to the world into its culture. However, without explicit information on why Ant Financial selects financial inclusion as its strategy, I cannot provide a definitive answer to this question.


2nd Answer: Based on the given context, it appears that Ant Financial selects financial inclusion as its strategy because of the company's mission to make finance virtually accessible and affordable to everyone. The company attributes its success to its dedication to financial inclusion through the use of digital technology, positioning itself as a techfin company. By implementing the Quick Response (QR) Merchant Growth Plan and providing technological digitization services to commercial banks, Ant Financial aims to bring small and beautiful changes to the world by promoting balanced development between urban and rural areas, and financial inclusion.


As you can see from the examples above, the chatbot provided pretty good answers to questions of varying difficulty. Interestingly, the model sometimes says it is not sure about the answer, yet it still surfaces relevant information from the document. Although the answers differ between query calls, they consistently stick to the content of the provided document.


Conclusion

The architecture of our advanced PDF chatbot involves a carefully orchestrated flow of technologies and processes. Starting with data extraction and text chunking, we progress through embedding generation, semantic indexing, and similarity search. The chatbot's ability to utilize Llama 2 for contextual understanding enables it to provide users with relevant and informative answers from PDF documents. As chatbots continue to evolve, this architecture serves as a powerful example of leveraging state-of-the-art technologies to build intelligent and efficient information retrieval systems.


References

AssemblyAI. (2023, May 6). Vector Databases simply explained! (Embeddings & Indexes) [Video]. YouTube. https://www.youtube.com/watch?v=dN0lsF2cvm4

Muhammad Moin. (2023, July 25). LangChain: Chat with Books and PDF Files with Llama 2 and Pinecone (Free LLMs & Embeddings) [Video]. YouTube. https://www.youtube.com/watch?v=ckb4DnHLBrU

QA over in-memory documents | 🦜️🔗 Langchain. (n.d.). https://python.langchain.com/docs/use_cases/question_answering/how_to/question_answering

Llama-CPP | 🦜️🔗 Langchain. (n.d.). https://python.langchain.com/docs/integrations/llms/llamacpp

TheBloke/Llama-2-13B-chat-GGML · Hugging Face. (n.d.). https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML

MuhammadMoinFaisal. (n.d.). LargeLanguageModelsProjects/QA Book PDF LangChain Llama 2/Final_Llama_CPP_Ask_Question_from_book_PDF_Llama.ipynb at main · MuhammadMoinFaisal/LargeLanguageModelsProjects. GitHub. https://github.com/MuhammadMoinFaisal/LargeLanguageModelsProjects/blob/main/QA%20Book%20PDF%20LangChain%20Llama%202/Final_Llama_CPP_Ask_Question_from_book_PDF_Llama.ipynb



