Chat with your PDF: Build a PDF Analyst with LlamaIndex and AgentLabs
In this tutorial, we'll learn how to use some basic features of LlamaIndex to create our own PDF document analyst.
We'll use the AgentLabs interface to interact with our analyst, uploading documents and asking questions about them.
The tools we'll use
LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models.
It makes it easy to build LLM-backed applications.
AgentLabs will allow us to get a frontend in no time using either Python or TypeScript in our backend (here we'll use Python).
What we are building
Getting started
As usual, let's install all the dependencies we'll need.
If you're using pip:
pip install pypdf langchain llama-index agentlabs-sdk
If you're using poetry:
poetry add pypdf langchain llama-index agentlabs-sdk
And now import them all:
from langchain.llms import OpenAI
from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine
from agentlabs.chat import IncomingChatMessage, MessageAttachment
from agentlabs.project import Project
from agentlabs.chat import MessageFormat
import asyncio
import os
Preparing our model
Before we get started, we need to instantiate the model we'll use throughout this tutorial to handle user requests. Our documents' embeddings are computed separately, by the embedding model configured in LlamaIndex (we'll talk more about this later).
Here, we'll use OpenAI's text-davinci-003 model. We pass it a max_tokens value of -1, which tells LangChain to let the model generate as many tokens as its context window allows.
llm = OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=-1)
Now we construct a ServiceContext, passing it our llm as an argument, and register it globally so that every time the framework needs to call a model, it uses this llm instance.
service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)
What's next?
Okay, so now our users will be able to upload some files.
To give our application the ability to retrieve information from these large files, we need to transform them a bit and store them in dedicated storage.
Note: If you're not familiar with embeddings, here's an article that explains embeddings in detail.
Long story short, a conventional database cannot perform semantic search on its own. To allow our application to retrieve data by semantic proximity, we'll proceed in two steps:
- we'll ask a model to transform the data into a mathematical representation, a vector of numbers (this process is called embedding)
- we'll store this representation in a database that can measure how close two embeddings are to each other.
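To make "semantic proximity" a bit more concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three vectors are made-up toy values, not the output of a real embedding model, and the function name is only illustrative; a vector store may use this metric or another one.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (assumed values for illustration only).
invoice = [0.9, 0.1, 0.0]
receipt = [0.8, 0.2, 0.1]
poem = [0.0, 0.2, 0.9]

print(cosine_similarity(invoice, receipt))  # close to 1: similar direction
print(cosine_similarity(invoice, poem))     # close to 0: unrelated direction
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.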
Files handling and indexing
Let's assume we know where the files are stored in the filesystem and we have the absolute paths of the location of every file.
First, we'll initialize a variable that will hold our vector store index. It starts as None; you will understand why very soon.
vs_index = None
Now, we'll use the SimpleDirectoryReader and its load_data() method to transform every PDF file into a plain text document, using pypdf under the hood.
def load_and_index_files(paths):
    docs = SimpleDirectoryReader(input_files=paths).load_data()
In that same function, we'll now use VectorStoreIndex.from_documents() to create an in-memory vector store index containing our embeddings.
Under the hood, this method computes all the vector embeddings for us, using the embedding model of the service context (by default an OpenAI embedding model rather than the completion LLM).
Let's update our function:
def load_and_index_files(paths):
    docs = SimpleDirectoryReader(input_files=paths).load_data()
    vs_index = VectorStoreIndex.from_documents(docs)
Final change: since our users will be able to upload multiple documents, we want to re-index and update our vector store index every time a user uploads a new document.
We can achieve this by using the SimpleNodeParser and inserting nodes directly into our index.
Here's our final function:
def load_and_index_files(paths):
    docs = SimpleDirectoryReader(input_files=paths).load_data()
    global vs_index
    if vs_index is None:
        vs_index = VectorStoreIndex.from_documents(docs)
    else:
        parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
        new_nodes = parser.get_nodes_from_documents(docs)
        vs_index.insert_nodes(new_nodes)
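The chunk_size and chunk_overlap arguments control how each document is cut into nodes. As a rough illustration of the idea only, here is a toy sliding-window splitter. It is not LlamaIndex's implementation: the function name chunk_words is hypothetical, and it counts list items where the real parser works on tokens and respects text boundaries.

```python
def chunk_words(words, chunk_size, chunk_overlap):
    """Split a list into windows of chunk_size items that share chunk_overlap items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the current window already reaches the end of the input
    return chunks

# Ten "words" cut into windows of 4 with an overlap of 1.
demo = chunk_words(list(range(10)), chunk_size=4, chunk_overlap=1)
print(demo)
```

The overlap means the last item of one chunk reappears at the start of the next, so a sentence that straddles a boundary is less likely to lose its context.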
Querying
Now, we know how to handle our files and create our indices. We can create what we'll need to query those indices.
To do so, we'll configure a QueryEngine as follows.
engine = vs_index.as_query_engine(similarity_top_k=3)
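The similarity_top_k=3 argument asks the engine to retrieve the three chunks most similar to the query before the LLM writes its answer. The sketch below mimics that retrieval step on toy data. The helper names and the two-dimensional unit vectors are assumptions made for illustration, not what LlamaIndex does internally.

```python
import heapq

def dot(a, b):
    """Dot product; for unit-length vectors this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def top_k_chunks(query_vector, indexed_chunks, k=3):
    """Return the k (text, vector) pairs whose vectors score highest against the query."""
    return heapq.nlargest(k, indexed_chunks, key=lambda item: dot(query_vector, item[1]))

# Toy index: each entry pairs a chunk of text with a made-up unit vector.
indexed_chunks = [
    ("Invoice total: 1,200 EUR", [1.0, 0.0]),
    ("Payment is due in 30 days", [0.8, 0.6]),
    ("The author thanks her cat", [0.0, 1.0]),
    ("Late fees apply after the due date", [0.6, 0.8]),
]
query_vector = [1.0, 0.0]  # pretend embedding of "How much do I owe?"

best = [text for text, _ in top_k_chunks(query_vector, indexed_chunks, k=3)]
print(best)
```

A larger k gives the LLM more context to answer from, at the cost of a longer prompt.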
Now that we have our query engine, we can use it to send some queries:
# or use "await engine.aquery(...)" directly if you're in an async function
response = asyncio.run(engine.aquery("your query about your document"))
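If the difference between asyncio.run() and await is unclear, this small sketch shows both ways of calling a coroutine. The function fake_aquery is a hypothetical stand-in for an async query method such as engine.aquery(), so the example runs without any API key.

```python
import asyncio

async def fake_aquery(question):
    """Hypothetical stand-in for an async query call; it returns a canned answer."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return f"answer to: {question}"

def ask_from_sync_code(question):
    # asyncio.run() creates an event loop, runs the coroutine to completion,
    # then closes the loop. Use it from regular (non-async) functions.
    return asyncio.run(fake_aquery(question))

async def ask_from_async_code(question):
    # Inside an async function a loop is already running, so we simply await.
    return await fake_aquery(question)

print(ask_from_sync_code("What is the invoice total?"))
```

Calling asyncio.run() while an event loop is already running in the same thread raises a RuntimeError, which is why the await form is needed inside async functions.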
But obviously, to get it working, we now need to wrap everything up and set up the UI for our users.
No worries, this is probably the most straightforward part.
Setting up the UI
We'll start by setting up the user interface with AgentLabs.
It's fairly easy to do:
- sign in to https://agentlabs.dev
- create a project
- create an agent and name it ChatGPT
- create a secret key for this agent
Init the AgentLabs project
Now, we will init AgentLabs with the info they provide to us in our dashboard.
from agentlabs.agent import Agent
from agentlabs.chat import IncomingChatMessage, MessageFormat
from agentlabs.project import Project
import os
alabs = Project(
    project_id="df3e3beb-49c4-4bd7-9193-e7755e4e1578",
    agentlabs_url="https://llamaindex-analyst.app.agentlabs.dev",
    secret=os.environ['AGENTLABS_SECRET'],
)
agent = alabs.agent(id="5fb3e7af-5cb3-4095-bca1-47db49774730")
alabs.connect()
alabs.wait()
Here, we read our secret from an environment variable for safety reasons, so it never appears in the source code. All the above values can be found in your AgentLabs console.
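Indexing os.environ directly raises a bare KeyError when the variable is missing. If you prefer a more explicit failure, a small helper along these lines can be used. The name read_required_env is hypothetical and is not part of the AgentLabs SDK.

```python
import os

def read_required_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:  # covers both "not set" and "set to an empty string"
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

You would then pass secret=read_required_env("AGENTLABS_SECRET") when creating the Project.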
Handling user uploads and messages
We'll use the on_chat_message method provided by AgentLabs to handle every message (including files) sent by the user.
We'll define a handler with simple logic:
- if the message contains one or more attachments, we download them and pass them to the load_and_index_files function we previously created
- if the message contains no attachment and we have not indexed any file yet, we send a friendly message to our user inviting them to upload some files
- otherwise, we run the query over our engine and return the result to the user.
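The three cases above can be sketched as a small pure function, which makes the branching easy to test without AgentLabs or an index. The names Action and choose_action are illustrative assumptions; the actual handler performs the side effects directly instead of returning a value.

```python
from enum import Enum

class Action(Enum):
    INDEX_FILES = "index_files"      # the message has attachments
    ASK_FOR_FILES = "ask_for_files"  # nothing has been indexed yet
    ANSWER_QUERY = "answer_query"    # an index exists and the message is a question

def choose_action(attachment_count, index_ready):
    """Decide what to do with an incoming message, in the same order as the handler."""
    if attachment_count > 0:
        return Action.INDEX_FILES
    if not index_ready:
        return Action.ASK_FOR_FILES
    return Action.ANSWER_QUERY

print(choose_action(attachment_count=2, index_ready=False))
print(choose_action(attachment_count=0, index_ready=False))
print(choose_action(attachment_count=0, index_ready=True))
```

Attachments are checked first, so a message that carries files is always indexed, even if it also contains text.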
Here's our handler code:
def handle_message(msg: IncomingChatMessage):
    # Case 1: the message carries attachments, so we index them.
    if len(msg.attachments) > 0:
        agent.typewrite(
            conversation_id=msg.conversation_id,
            text="Ok, I am indexing your files"
        )
        st = agent.create_stream(conversation_id=msg.conversation_id, format=MessageFormat.MARKDOWN)
        # download_attachments is a helper that is not shown in this section; it is
        # expected to save the attachments locally and return their file paths.
        paths = download_attachments(msg.attachments)
        load_and_index_files(paths)
        st.typewrite("All files have been indexed. You can ask me questions now.")
        st.end()
        return

    # Case 2: no attachment and nothing indexed yet, so we ask for files.
    if vs_index is None:
        return agent.typewrite(
            conversation_id=msg.conversation_id,
            text="No files have been indexed yet. Please upload some files."
        )

    # Case 3: an index exists, so we answer the question from it.
    engine = vs_index.as_query_engine(similarity_top_k=3)
    response = asyncio.run(engine.aquery(msg.text))
    agent.typewrite(
        conversation_id=msg.conversation_id,
        text=response.response,
    )

alabs.on_chat_message(handle_message)
You probably noticed that AgentLabs provides some practical built-in methods to interact with our users in real time, such as agent.typewrite() and agent.create_stream().
You can find more information about these methods in the official documentation.
Et voilà!
Congrats, your project is ready!
You can also retrieve the entire source code here.
Here's how it looks again:
Conclusion
In this tutorial, we only saw some basic querying and storing mechanisms available with LlamaIndex.
However, it gives you an idea of how you can get started and easily prototype powerful LLM apps with AgentLabs and LlamaIndex.
If you liked this tutorial, feel free to leave a comment below and to smash the like buttons :)