Talk with your PDF documents in SharePoint
A dreadful Teams/Slack message popped up! “Hey, could you help find out which documents [information] is in?” You opened the SharePoint folder, only to realize that you have no idea which documents contain this information.
Fear not! In this article, we will build a RAG (retrieval-augmented generation) application to search through the mountain of PDF documents in your SharePoint.
RAG app: https://finance-chatbot-vincent-cheng.streamlit.app/
Tech Stack
- Database: ChromaDB
- LLM and model: OpenAI’s gpt-4o-mini, Google’s Gemini 1.5 Flash-8B
- Text embeddings: OpenAI’s text-embedding-3-large, Google’s embedding-001
- Frontend: Streamlit
- Cloud: Streamlit Community Cloud
- Tools: LangChain
- Storage: Microsoft SharePoint
Architecture Overview
GitHub: https://github.com/cyshen11/finance-chatbot/tree/main
Index
For indexing, we convert the PDF documents into vector embeddings and store them in a vector database.
Given that your documents are in SharePoint, we can load them using LangChain's SharePointLoader. Before using the SharePointLoader, we need to obtain a few parameters: O365_CLIENT_ID, O365_CLIENT_SECRET, O365_TOKEN, DOCUMENT_LIBRARY_ID and FOLDER_ID. You can follow this guide on how to obtain these parameters. For the O365_TOKEN, convert the content of o365_token.txt into TOML format. Copy the output and paste it into your Streamlit secrets in this format:
[O365_TOKEN]
token_type = ...
scope = ...
expires_in = ...
...
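The JSON-to-TOML conversion mentioned above can be done by hand, but a small helper makes it less error-prone. The sketch below is only an illustration: the function name token_to_toml is hypothetical, and it assumes the token file holds a flat JSON object whose keys are valid bare TOML keys (letters, digits, underscores) and whose values are strings, numbers, booleans or lists.

```python
import json


def token_to_toml(token: dict, table: str = "O365_TOKEN") -> str:
    """Render a flat token dict as a TOML table for Streamlit secrets."""
    lines = [f"[{table}]"]
    for key, value in token.items():
        if value is None:
            # TOML has no null value, so missing entries are skipped.
            continue
        if isinstance(value, dict):
            raise ValueError(f"nested value under {key!r} is not handled by this sketch")
        # JSON literals for strings, numbers, booleans and lists are also
        # valid TOML literals, so json.dumps is reused for the right-hand side.
        lines.append(f"{key} = {json.dumps(value)}")
    return "\n".join(lines)


# Placeholder values for illustration only, not a real token.
example = {"token_type": "Bearer", "scope": ["a", "b"], "expires_in": 3600}
toml_text = token_to_toml(example)
print(toml_text)
```

In practice you would load the real token with json.load from o365_token.txt and paste the printed text into the Streamlit secrets editor.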
In the Python code, read this secret, convert it into JSON, and write the JSON into the directory Path.home() / ".credentials". Then, you can initialize the SharePointLoader with the token and load the documents.
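import json
from pathlib import Path

from langchain_community.document_loaders.sharepoint import SharePointLoader

# O365_TOKEN, document_library_id and folder_id are assumed to have been
# read from the Streamlit secrets beforehand, with O365_TOKEN as a plain dict.
directory_path = Path.home() / ".credentials"
# Create the directory if it does not exist
directory_path.mkdir(parents=True, exist_ok=True)
# Write O365 token into text file
with open(directory_path / "o365_token.txt", "w") as f:
    json.dump(O365_TOKEN, f)
# Initialize document loader
loader = SharePointLoader(
    document_library_id=document_library_id,
    auth_with_token=True,
    folder_id=folder_id,
)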
Load the documents using the SharePointLoader. Before initializing the vector database, obtain the API key for the LLM provider that you are going to use. Initialize the vector database (ChromaDB), specifying the collection name and the embeddings that match the user-selected model. Pass a directory to the persist_directory parameter to save the vector database on disk. Add the loaded documents to the vector database with generated ids.
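The article does not say how the ids are generated. One option, assumed here purely for illustration, is to derive a deterministic id from each chunk's source, page and text, so that re-indexing the same document produces the same ids instead of duplicates. The dict layout and the make_ids name are hypothetical stand-ins for the loaded documents.

```python
import uuid


def make_ids(docs: list[dict]) -> list[str]:
    """Return one stable id per chunk, derived from its source, page and text."""
    ids = []
    for doc in docs:
        key = f"{doc['source']}|{doc['page']}|{doc['text']}"
        # uuid5 is a name-based UUID: the same key always yields the same id.
        ids.append(str(uuid.uuid5(uuid.NAMESPACE_URL, key)))
    return ids


# Toy chunks standing in for documents loaded from SharePoint.
docs = [
    {"source": "report.pdf", "page": 1, "text": "Revenue grew."},
    {"source": "report.pdf", "page": 2, "text": "Costs fell."},
]
ids = make_ids(docs)
```

A list like this would then be supplied alongside the documents when they are added to the vector store.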
Retrieval
When we submit a question in the app, the RAG pipeline converts the question into an embedding and performs a vector search to return the top K documents (the K nearest neighbors) based on vector similarity.
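To make the retrieval step concrete, here is a minimal pure-Python sketch of a top-K search. It assumes cosine similarity as the metric and uses made-up three-dimensional vectors; in the real app the vector database performs this search, and the metric it uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy embeddings; real ones have hundreds or thousands of dimensions.
doc_vecs = {
    "budget.pdf": [0.9, 0.1, 0.0],
    "travel.pdf": [0.0, 1.0, 0.0],
    "forecast.pdf": [0.8, 0.0, 0.2],
}
query_vec = [1.0, 0.0, 0.0]
result = top_k(query_vec, doc_vecs, k=2)
print(result)  # ['budget.pdf', 'forecast.pdf']
```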
Generation
The RAG then passes the documents as context and the user question to the LLM to generate a response. We also retrieve the source and page from the documents and de-duplicate them. Finally, the response, source and page are passed back to the front end.
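The generation step can be sketched without calling any LLM. The snippet below shows one plausible way to assemble the prompt and de-duplicate the (source, page) pairs. The prompt wording, the dict layout and both function names are assumptions made for illustration, not the code from the repository.

```python
def build_prompt(question, docs):
    """Combine the retrieved chunks and the user question into one prompt."""
    context = "\n\n".join(doc["text"] for doc in docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def unique_sources(docs):
    """Return (source, page) pairs in first-seen order, without duplicates."""
    seen = set()
    refs = []
    for doc in docs:
        ref = (doc["source"], doc["page"])
        if ref not in seen:
            seen.add(ref)
            refs.append(ref)
    return refs


# Toy retrieved chunks; two of them come from the same page.
retrieved = [
    {"source": "report.pdf", "page": 3, "text": "Revenue grew."},
    {"source": "report.pdf", "page": 3, "text": "Margins were stable."},
    {"source": "memo.pdf", "page": 1, "text": "Hiring is paused."},
]
prompt = build_prompt("How did revenue change?", retrieved)
sources = unique_sources(retrieved)
print(sources)  # [('report.pdf', 3), ('memo.pdf', 1)]
```

The prompt would be sent to the chosen LLM, and the response would be returned to the front end together with the de-duplicated sources.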
Result
Tada! We found the documents!