Use LangChain, OpenAI GPT, & Deep Lake to Chat with CSVs, PDFs, JSONs, GitHub Repos, URLs, & More
We’ve previously explored chatting with PDFs and understanding GitHub repos with LangChain. Many apps inspired by those use cases are popping up, but DataChad, created by our community member Gustav von Zitzewitz, takes it several steps further: it works both locally and in the cloud, and it allows chatting with multiple data sources of various types (PDFs, Excel sheets, etc.) at the same time.
DataChad is an open-source project that lets users ask questions about any data source by leveraging embeddings, Deep Lake as a vector database, large language models like GPT-3.5-turbo or GPT-4, and LangChain. The data source can be anything from a local file like a PDF or CSV to a website URL, a GitHub repository, or even the path to a directory, scanned recursively if the app is deployed locally. The app now supports Local Mode, where all data is processed locally and no API calls are made. This is made possible by leveraging pre-trained open-source LLMs like GPT4all and by creating Deep Lake-powered embedding storage on the local disk instead of in the Deep Lake cloud.
The app works by uploading any file or entering any path or URL (or pointing to the location of your files when using Local Mode). The app then detects and loads the data source into text documents, embeds the text documents using OpenAI embeddings, and stores the embeddings as a vector dataset in Activeloop’s Deep Lake Cloud. A LangChain chain is built, comprising an LLM model and the embedding database index as a retriever. This chain serves as the context for answering user queries over any data they upload.
Why Do You Need a Chat With Any Data App?
DataChad is designed to serve as an indispensable tool for individuals who require swift and precise data querying from any source.
Whether you’re seeking a comprehensive understanding of a complete project or quick answers from a single data source without manually sifting through the material (say, a Wikipedia article, a codebase, or an academic paper you’re cramming), DataChad lets you ask natural language questions and get relevant answers in seconds, without writing complex SQL queries or using other data querying tools.
Finally, the app can be hosted and used from anywhere, like in the demo, or deployed locally to enable querying local directories. For sensitive data, it is essential to be able to run this type of solution locally, without sending your data to companies like OpenAI (in that case, you need to use an open-source large language model).
Editorial Note on OpenAI API Costs
Costs can become a factor with extensive OpenAI API usage. For full transparency and control over this critical factor, DataChad displays the app’s token usage and total costs in $. To get a feeling for the scale: even prompts using the maximum of 4096 tokens cost well under a single cent in total.
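As a back-of-the-envelope check (assuming gpt-3.5-turbo’s list price of $0.002 per 1K tokens at the time of writing):

# Rough cost estimate for a maximally long prompt, assuming the
# $0.002 per 1K tokens list price for gpt-3.5-turbo at the time of writing.
price_per_1k_tokens = 0.002
max_prompt_tokens = 4096
print(f"${max_prompt_tokens / 1000 * price_per_1k_tokens:.4f}")  # $0.0082 -- under a cent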
How DataChad Works: Architectural Blueprint for AI-powered Chat with Data App
OpenAI Embeddings
DataChad uses OpenAI embeddings to convert text documents into vectors that can be indexed and searched efficiently. Embeddings are instrumental in evaluating the semantic similarity between two or more text fragments, or the relevance of long documents to a concise query, which makes them especially useful for tasks like search and classification. DataChad uses cosine similarity to measure how close a document’s embedding is to the question’s embedding.
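To make this concrete, here is a minimal sketch (not part of DataChad itself) that embeds two texts with the same LangChain wrapper the app uses and compares them with cosine similarity; it assumes an OPENAI_API_KEY environment variable is set:

import numpy as np
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment

# Embed a short query and a related text fragment
a = np.array(embeddings.embed_query("How do I load a CSV file?"))
b = np.array(embeddings.embed_query("Reading CSV files in Python"))

# Cosine similarity: dot product of the vectors divided by the product of their norms
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.3f}")  # close to 1.0 for related texts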
Vector Database
DataChad uses Deep Lake, the vector database for all AI data, to store the embeddings generated from the text documents. Vector databases are designed to store and search vectors efficiently and are optimized for large-scale datasets. Deep Lake stands out among vector databases for its multi-modality, i.e., its ability to support multiple data types and store embedding metadata, which is highly relevant if you’re looking to build an all-in-one chat-with-data app like DataChad.
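As a minimal sketch of what this looks like in code (“my-org” is a placeholder Activeloop organisation name, not something from DataChad):

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

texts = ["Deep Lake stores embeddings.", "LangChain builds retrieval chains."]
# Embed the texts and upload them as a new Deep Lake dataset
vector_store = DeepLake.from_texts(
    texts,
    OpenAIEmbeddings(),
    dataset_path="hub://my-org/quickstart",  # or a local path like "./deeplake"
)
# Query the store for the most similar stored text
docs = vector_store.similarity_search("What stores embeddings?", k=1)
print(docs[0].page_content)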
Large Language Models (LLMs)
DataChad uses large language models like GPT-3.5 Turbo to generate responses to user questions. LLMs are powerful models trained on massive amounts of text data that can generate natural language responses to a wide range of questions.
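For illustration, a minimal call through the same LangChain wrapper DataChad uses (again assuming an OPENAI_API_KEY is set):

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# Send a single user message and print the model's reply
reply = llm([HumanMessage(content="Explain vector databases in one sentence.")])
print(reply.content)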
LangChain
DataChad uses LangChain to combine the embeddings and LLMs into a single retrieval chain that can be used to answer user questions. LangChain is a powerful framework for composing LLMs, retrievers, and other natural language processing tools into a single pipeline. Read this ultimate LangChain guide if you want to understand the power of LangChain.
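A stripped-down sketch of such a retrieval chain, reusing vector_store from the Deep Lake example above (the full version DataChad actually uses is build_chain() in the utils.py listing below):

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

# Combine the LLM with the vector store's retriever into a single chain
chain = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=vector_store.as_retriever(),
)
result = chain({"question": "What does Deep Lake store?", "chat_history": []})
print(result["answer"])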
Streamlit
DataChad is implemented as a Streamlit app. Streamlit is a quick way to build demo apps in Python: it takes away the pain of implementing and hosting a UI and lets you focus on the backend work.
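The whole UI boils down to calls like these (a toy sketch, run with streamlit run app.py):

import streamlit as st

st.title("Hello DataChad")
prompt = st.text_input("You:")
if prompt:
    st.write(f"You said: {prompt}")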
Factors to Consider as You Build a LangChain & Large Language Model-based App (the k Argument, Chunks, etc.)
The DataChad project is built on the fusion of two critical natural language processing (NLP) technologies: the attention mechanism of large language models (LLMs) like GPT-4, accessed through the OpenAI API, and vector similarity for efficient embedding comparison when querying the vector database. This combination allows for robust analysis and retrieval of information from textual data. Let’s delve into the details, focusing on the parameters used for querying the vector database within DataChad.
The Attention Mechanism of LLMs
DataChad taps into the attention mechanism offered by LLMs, such as GPT-3.5-turbo, via the OpenAI API. This attention mechanism enables the model to weigh the importance of different words or tokens in a text sequence, capturing contextual relationships and semantic nuances. By leveraging LLMs, DataChad benefits from their ability to generate rich and accurate representations of textual data.
Vector Similarity for Embedding Comparison
When querying the vector database, DataChad employs vector similarity to compare document embeddings. This technique measures the geometric similarity between embeddings, allowing for the efficient retrieval of similar documents. Vector similarity provides a simple yet effective method for identifying related content in large-scale datasets.
Parameters for Querying the Vector Database and the LLM
DataChad’s querying process involves several important parameters that influence the retrieval and analysis of document embeddings. What are those parameters?
chunk_size
chunk_size in LangChain-based apps determines the size at which the text is divided into smaller chunks before being embedded. This parameter ensures the efficient processing of large documents and controls the granularity of the resulting embeddings. The DataChad default is 1000.
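For example, a minimal illustration with the same splitter DataChad uses:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split a long text into chunks of at most 1000 characters (the DataChad default)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = splitter.split_text("A very long document. " * 500)
print(f"{len(chunks)} chunks")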
fetch_k
fetch_k in LangChain-based apps specifies the number of documents to pull from the vector database. This parameter determines the scope of the search and influences the relevance of the retrieved documents. The DataChad default is 20.
k
The k in LangChain-based apps is the number of most similar embeddings selected to build the context for the LLM prompt in the chain. This parameter affects the contextual understanding and response generation of the LLM when querying the OpenAI API. The DataChad default is 10.
max_tokens
The max_tokens parameter limits the documents returned from the vector store, based on token count, before the context to query the LLM is built. This ensures that DataChad does not run into the LLM’s prompt limit (4096 tokens for gpt-3.5-turbo). The DataChad default is 3375.
temperature
LLM temperature controls the randomness of the LLM output. A temperature of 0 means the response is deterministic: the model always returns the same completion (making it significantly less prone to hallucination). A temperature greater than zero results in increasingly varied completions. The DataChad default is 0.7.
By carefully tuning these parameters, DataChad optimizes the trade-off between computational efficiency and the quality of results obtained from both the vector database and LLM-based querying. By ticking the Advanced Options checkbox in the app, experienced users can further modify these parameters.
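Taken together, here is a sketch of where each parameter plugs in, mirroring build_chain() in the utils.py listing below and reusing vector_store from the earlier Deep Lake example:

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

# chunk_size is applied earlier, when the documents are split and embedded
retriever = vector_store.as_retriever()
retriever.search_kwargs.update(
    {
        "maximal_marginal_relevance": True,
        "distance_metric": "cos",
        "fetch_k": 20,  # candidate documents pulled from the vector database
        "k": 10,  # most similar chunks kept for the prompt context
    }
)
chain = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7),
    retriever=retriever,
    chain_type="stuff",
    max_tokens_limit=3375,  # keep retrieved context below the 4096-token prompt limit
)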
How to Solve Most Common Issues When Building With LangChain
The previous section discussed the importance of selecting appropriate parameters for querying the vector database and the language model within the DataChad project. However, even though the default values have been carefully chosen and tested, it is not uncommon to encounter challenges or to want to improve the overall query experience further. In this section, we address some common issues you may face as you build your app and suggest solutions that can help overcome them.
Issue 1: Running into errors related to the prompt length
Solution: Decrease one or more of k, chunk_size, and max_tokens.
Issue 2: The answers contain hallucinations or do not match the true data content
Solution: Decrease the temperature. Set it to 0 for the most conservative answers that are unlikely to deviate from the sources.
Issue 3: The answers are not relevant enough
Solution: Increase chunk_size; if this leads to running into issue 1, increase k and fetch_k while decreasing chunk_size.
Practical Guide: Building an All-In-One Chat with Anything App
The code is split into three parts. First, we build the Streamlit app, defined in app.py. The second part, utils.py, contains all the processing functionality and API calls. The final part, constants.py, defines all project-specific paths, names, and descriptions.
app.py
import streamlit as st
from streamlit_chat import message

from constants import (
    ACTIVELOOP_HELP,
    APP_NAME,
    AUTHENTICATION_HELP,
    CHUNK_SIZE,
    DEFAULT_DATA_SOURCE,
    ENABLE_ADVANCED_OPTIONS,
    FETCH_K,
    MAX_TOKENS,
    OPENAI_HELP,
    PAGE_ICON,
    REPO_URL,
    TEMPERATURE,
    USAGE_HELP,
    K,
)
from utils import (
    advanced_options_form,
    authenticate,
    delete_uploaded_file,
    generate_response,
    logger,
    save_uploaded_file,
    update_chain,
)

# Page options and header
st.set_option("client.showErrorDetails", True)
st.set_page_config(
    page_title=APP_NAME, page_icon=PAGE_ICON, initial_sidebar_state="expanded"
)
st.markdown(
    f"<h1 style='text-align: center;'>{APP_NAME} {PAGE_ICON} <br> I know all about your data!</h1>",
    unsafe_allow_html=True,
)

# Initialise session state variables
# Chat and Data Source
if "past" not in st.session_state:
    st.session_state["past"] = []
if "usage" not in st.session_state:
    st.session_state["usage"] = {}
if "chat_history" not in st.session_state:
    st.session_state["chat_history"] = []
if "generated" not in st.session_state:
    st.session_state["generated"] = []
if "data_source" not in st.session_state:
    st.session_state["data_source"] = DEFAULT_DATA_SOURCE
if "uploaded_file" not in st.session_state:
    st.session_state["uploaded_file"] = None
# Authentication and Credentials
if "auth_ok" not in st.session_state:
    st.session_state["auth_ok"] = False
if "openai_api_key" not in st.session_state:
    st.session_state["openai_api_key"] = None
if "activeloop_token" not in st.session_state:
    st.session_state["activeloop_token"] = None
if "activeloop_org_name" not in st.session_state:
    st.session_state["activeloop_org_name"] = None
# Advanced Options
if "k" not in st.session_state:
    st.session_state["k"] = K
if "fetch_k" not in st.session_state:
    st.session_state["fetch_k"] = FETCH_K
if "chunk_size" not in st.session_state:
    st.session_state["chunk_size"] = CHUNK_SIZE
if "temperature" not in st.session_state:
    st.session_state["temperature"] = TEMPERATURE
if "max_tokens" not in st.session_state:
    st.session_state["max_tokens"] = MAX_TOKENS

# Sidebar with Authentication
# Only start App if authentication is OK
with st.sidebar:
    st.title("Authentication", help=AUTHENTICATION_HELP)
    with st.form("authentication"):
        openai_api_key = st.text_input(
            "OpenAI API Key",
            type="password",
            help=OPENAI_HELP,
            placeholder="This field is mandatory",
        )
        activeloop_token = st.text_input(
            "ActiveLoop Token",
            type="password",
            help=ACTIVELOOP_HELP,
            placeholder="Optional, using ours if empty",
        )
        activeloop_org_name = st.text_input(
            "ActiveLoop Organisation Name",
            type="password",
            help=ACTIVELOOP_HELP,
            placeholder="Optional, using ours if empty",
        )
        submitted = st.form_submit_button("Submit")
        if submitted:
            authenticate(openai_api_key, activeloop_token, activeloop_org_name)

    st.info(f"Learn how it works [here]({REPO_URL})")
    if not st.session_state["auth_ok"]:
        st.stop()

    # Clear button to reset all chat communication
    clear_button = st.button("Clear Conversation", key="clear")

    # Advanced Options
    if ENABLE_ADVANCED_OPTIONS:
        advanced_options_form()

# the chain can only be initialized after authentication is OK
if "chain" not in st.session_state:
    update_chain()

if clear_button:
    # resets all chat history related caches
    st.session_state["past"] = []
    st.session_state["generated"] = []
    st.session_state["chat_history"] = []

# file upload and data source inputs
uploaded_file = st.file_uploader("Upload a file")
data_source = st.text_input(
    "Enter any data source",
    placeholder="Any path or url pointing to a file or directory of files",
)

# generate new chain for new data source / uploaded file
# make sure to do this only once per input / on change
if data_source and data_source != st.session_state["data_source"]:
    logger.info(f"Data source provided: '{data_source}'")
    st.session_state["data_source"] = data_source
    update_chain()

if uploaded_file and uploaded_file != st.session_state["uploaded_file"]:
    logger.info(f"Uploaded file: '{uploaded_file.name}'")
    st.session_state["uploaded_file"] = uploaded_file
    data_source = save_uploaded_file(uploaded_file)
    st.session_state["data_source"] = data_source
    update_chain()
    delete_uploaded_file(uploaded_file)

# container for chat history
response_container = st.container()
# container for text box
container = st.container()

# As streamlit reruns the whole script on each change
# it is necessary to repopulate the chat containers
with container:
    with st.form(key="prompt_input", clear_on_submit=True):
        user_input = st.text_area("You:", key="input", height=100)
        submit_button = st.form_submit_button(label="Send")

    if submit_button and user_input:
        output = generate_response(user_input)
        st.session_state["past"].append(user_input)
        st.session_state["generated"].append(output)

if st.session_state["generated"]:
    with response_container:
        for i in range(len(st.session_state["generated"])):
            message(st.session_state["past"][i], is_user=True, key=str(i) + "_user")
            message(st.session_state["generated"][i], key=str(i))

# Usage sidebar with total used tokens and costs
# We put this at the end to be able to show usage starting with the first response
with st.sidebar:
    if st.session_state["usage"]:
        st.divider()
        st.title("Usage", help=USAGE_HELP)
        col1, col2 = st.columns(2)
        col1.metric("Total Tokens", st.session_state["usage"]["total_tokens"])
        col2.metric("Total Costs in $", st.session_state["usage"]["total_cost"])
utils.py
import logging
import os
import re
import shutil
import sys
from typing import List

import deeplake
import openai
import streamlit as st
from dotenv import load_dotenv
from langchain.callbacks import OpenAICallbackHandler, get_openai_callback
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import (
    CSVLoader,
    DirectoryLoader,
    GitLoader,
    NotebookLoader,
    OnlinePDFLoader,
    PythonLoader,
    TextLoader,
    UnstructuredFileLoader,
    UnstructuredHTMLLoader,
    UnstructuredPDFLoader,
    UnstructuredWordDocumentLoader,
    WebBaseLoader,
)
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DeepLake, VectorStore
from streamlit.runtime.uploaded_file_manager import UploadedFile

from constants import (
    APP_NAME,
    CHUNK_SIZE,
    DATA_PATH,
    FETCH_K,
    MAX_TOKENS,
    MODEL,
    PAGE_ICON,
    REPO_URL,
    TEMPERATURE,
    K,
)

# loads environment variables
load_dotenv()

logger = logging.getLogger(APP_NAME)

def configure_logger(debug: int = 0) -> None:
    # boilerplate code to enable logging in the streamlit app console
    log_level = logging.DEBUG if debug == 1 else logging.INFO
    logger.setLevel(log_level)

    stream_handler = logging.StreamHandler(stream=sys.stdout)
    stream_handler.setLevel(log_level)

    formatter = logging.Formatter("%(message)s")

    stream_handler.setFormatter(formatter)

    logger.addHandler(stream_handler)
    logger.propagate = False

configure_logger(0)

def authenticate(
    openai_api_key: str, activeloop_token: str, activeloop_org_name: str
) -> None:
    # Validate all credentials are set and correct
    # Check for env variables to enable local dev and deployments with shared credentials
    openai_api_key = (
        openai_api_key
        or os.environ.get("OPENAI_API_KEY")
        or st.secrets.get("OPENAI_API_KEY")
    )
    activeloop_token = (
        activeloop_token
        or os.environ.get("ACTIVELOOP_TOKEN")
        or st.secrets.get("ACTIVELOOP_TOKEN")
    )
    activeloop_org_name = (
        activeloop_org_name
        or os.environ.get("ACTIVELOOP_ORG_NAME")
        or st.secrets.get("ACTIVELOOP_ORG_NAME")
    )
    if not (openai_api_key and activeloop_token and activeloop_org_name):
        st.session_state["auth_ok"] = False
        st.error("Credentials neither set nor stored", icon=PAGE_ICON)
        return
    try:
        # Try to access openai and deeplake
        with st.spinner("Authenticating..."):
            openai.api_key = openai_api_key
            openai.Model.list()
            deeplake.exists(
                f"hub://{activeloop_org_name}/DataChad-Authentication-Check",
                token=activeloop_token,
            )
    except Exception as e:
        logger.error(f"Authentication failed with {e}")
        st.session_state["auth_ok"] = False
        st.error("Authentication failed", icon=PAGE_ICON)
        return
    # store credentials in the session state
    st.session_state["auth_ok"] = True
    st.session_state["openai_api_key"] = openai_api_key
    st.session_state["activeloop_token"] = activeloop_token
    st.session_state["activeloop_org_name"] = activeloop_org_name
    logger.info("Authentication successful!")

def advanced_options_form() -> None:
    # Input Form that takes advanced options and rebuilds chain with them
    advanced_options = st.checkbox(
        "Advanced Options", help="Caution! This may break things!"
    )
    if advanced_options:
        with st.form("advanced_options"):
            temperature = st.slider(
                "temperature",
                min_value=0.0,
                max_value=1.0,
                value=TEMPERATURE,
                help="Controls the randomness of the language model output",
            )
            col1, col2 = st.columns(2)
            fetch_k = col1.number_input(
                "k_fetch",
                min_value=1,
                max_value=1000,
                value=FETCH_K,
                help="The number of documents to pull from the vector database",
            )
            k = col2.number_input(
                "k",
                min_value=1,
                max_value=100,
                value=K,
                help="The number of most similar documents to build the context from",
            )
            chunk_size = col1.number_input(
                "chunk_size",
                min_value=1,
                max_value=100000,
                value=CHUNK_SIZE,
                help=(
                    "The size at which the text is divided into smaller chunks "
                    "before being embedded.\n\nChanging this parameter makes re-embedding "
                    "and re-uploading the data to the database necessary "
                ),
            )
            max_tokens = col2.number_input(
                "max_tokens",
                min_value=1,
                max_value=4096,
                value=MAX_TOKENS,
                help="Limits the documents returned from database based on number of tokens",
            )
            applied = st.form_submit_button("Apply")
            if applied:
                st.session_state["k"] = k
                st.session_state["fetch_k"] = fetch_k
                st.session_state["chunk_size"] = chunk_size
                st.session_state["temperature"] = temperature
                st.session_state["max_tokens"] = max_tokens
                update_chain()

def save_uploaded_file(uploaded_file: UploadedFile) -> str:
    # streamlit uploaded files need to be stored locally
    # before being embedded and uploaded to the hub
    if not os.path.exists(DATA_PATH):
        os.makedirs(DATA_PATH)
    file_path = str(DATA_PATH / uploaded_file.name)
    uploaded_file.seek(0)
    file_bytes = uploaded_file.read()
    with open(file_path, "wb") as file:
        file.write(file_bytes)
    logger.info(f"Saved: {file_path}")
    return file_path

def delete_uploaded_file(uploaded_file: UploadedFile) -> None:
    # cleanup locally stored files
    file_path = DATA_PATH / uploaded_file.name
    if os.path.exists(file_path):
        os.remove(file_path)
        logger.info(f"Removed: {file_path}")

def handle_load_error(e: str = None) -> None:
    e = e or f"No Loader found for your data source. Consider contributing: {REPO_URL}!"
    error_msg = f"Failed to load {st.session_state['data_source']} with Error:\n{e}"
    st.error(error_msg, icon=PAGE_ICON)
    logger.info(error_msg)
    st.stop()

def load_git(data_source: str, chunk_size: int = CHUNK_SIZE) -> List[Document]:
    # We need to try both common main branches
    # Thank you github for the "master" to "main" switch
    repo_name = data_source.split("/")[-1].split(".")[0]
    repo_path = str(DATA_PATH / repo_name)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=0
    )
    branches = ["main", "master"]
    for branch in branches:
        if os.path.exists(repo_path):
            # repo was already cloned on a previous attempt -> load it from disk
            data_source = None
        try:
            docs = GitLoader(repo_path, data_source, branch).load_and_split(
                text_splitter
            )
            break
        except Exception as e:
            logger.info(f"Error loading git: {e}")
    if os.path.exists(repo_path):
        # cleanup repo afterwards
        shutil.rmtree(repo_path)
    try:
        return docs
    except Exception:
        # docs is unbound if neither branch could be loaded
        handle_load_error()

def load_any_data_source(
    data_source: str, chunk_size: int = CHUNK_SIZE
) -> List[Document]:
    # Ugly thing that decides how to load data
    # It ain't much, but it's honest work
    is_text = data_source.endswith(".txt")
    is_web = data_source.startswith("http")
    is_pdf = data_source.endswith(".pdf")
    is_csv = data_source.endswith(".csv")
    is_html = data_source.endswith(".html")
    is_git = data_source.endswith(".git")
    is_notebook = data_source.endswith(".ipynb")
    is_doc = data_source.endswith(".doc")
    is_py = data_source.endswith(".py")
    is_dir = os.path.isdir(data_source)
    is_file = os.path.isfile(data_source)

    loader = None
    if is_dir:
        loader = DirectoryLoader(data_source, recursive=True, silent_errors=True)
    elif is_git:
        return load_git(data_source, chunk_size)
    elif is_web:
        if is_pdf:
            loader = OnlinePDFLoader(data_source)
        else:
            loader = WebBaseLoader(data_source)
    elif is_file:
        if is_text:
            loader = TextLoader(data_source)
        elif is_notebook:
            loader = NotebookLoader(data_source)
        elif is_pdf:
            loader = UnstructuredPDFLoader(data_source)
        elif is_html:
            loader = UnstructuredHTMLLoader(data_source)
        elif is_doc:
            loader = UnstructuredWordDocumentLoader(data_source)
        elif is_csv:
            loader = CSVLoader(data_source, encoding="utf-8")
        elif is_py:
            loader = PythonLoader(data_source)
        else:
            loader = UnstructuredFileLoader(data_source)
    try:
        # Chunk size is a major trade-off parameter to control result accuracy over computation
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=0
        )
        docs = loader.load_and_split(text_splitter)
        logger.info(f"Loaded: {len(docs)} document chunks")
        return docs
    except Exception as e:
        handle_load_error(e if loader else None)

def clean_data_source_string(data_source_string: str) -> str:
    # replace all non-word characters with dashes
    # to get a string that can be used to create a new dataset
    dashed_string = re.sub(r"\W+", "-", data_source_string)
    cleaned_string = re.sub(r"--+", "-", dashed_string).strip("-")
    return cleaned_string

def setup_vector_store(data_source: str, chunk_size: int = CHUNK_SIZE) -> VectorStore:
    # either load existing vector store or upload a new one to the hub
    embeddings = OpenAIEmbeddings(
        disallowed_special=(), openai_api_key=st.session_state["openai_api_key"]
    )
    data_source_name = clean_data_source_string(data_source)
    dataset_path = f"hub://{st.session_state['activeloop_org_name']}/{data_source_name}-{chunk_size}"
    if deeplake.exists(dataset_path, token=st.session_state["activeloop_token"]):
        with st.spinner("Loading vector store..."):
            logger.info(f"Dataset '{dataset_path}' exists -> loading")
            vector_store = DeepLake(
                dataset_path=dataset_path,
                read_only=True,
                embedding_function=embeddings,
                token=st.session_state["activeloop_token"],
            )
    else:
        with st.spinner("Reading, embedding and uploading data to hub..."):
            logger.info(f"Dataset '{dataset_path}' does not exist -> uploading")
            docs = load_any_data_source(data_source, chunk_size)
            vector_store = DeepLake.from_documents(
                docs,
                embeddings,
                dataset_path=dataset_path,
                token=st.session_state["activeloop_token"],
            )
    return vector_store

def build_chain(
    data_source: str,
    k: int = K,
    fetch_k: int = FETCH_K,
    chunk_size: int = CHUNK_SIZE,
    temperature: float = TEMPERATURE,
    max_tokens: int = MAX_TOKENS,
) -> ConversationalRetrievalChain:
    # create the langchain that will be called to generate responses
    vector_store = setup_vector_store(data_source, chunk_size)
    retriever = vector_store.as_retriever()
    # Search params "fetch_k" and "k" define how many documents are pulled from the hub
    # and selected after the document matching to build the context
    # that is fed to the model together with your prompt
    search_kwargs = {
        "maximal_marginal_relevance": True,
        "distance_metric": "cos",
        "fetch_k": fetch_k,
        "k": k,
    }
    retriever.search_kwargs.update(search_kwargs)
    model = ChatOpenAI(
        model_name=MODEL,
        temperature=temperature,
        openai_api_key=st.session_state["openai_api_key"],
    )
    chain = ConversationalRetrievalChain.from_llm(
        model,
        retriever=retriever,
        chain_type="stuff",
        verbose=True,
        # we limit the maximum number of used tokens
        # to prevent running into the model's token limit of 4096
        max_tokens_limit=max_tokens,
    )
    logger.info(f"Data source '{data_source}' is ready to go!")
    return chain

def update_chain() -> None:
    # Build chain with parameters from session state and store it back
    # Also delete chat history to not confuse the bot with old context
    try:
        st.session_state["chain"] = build_chain(
            data_source=st.session_state["data_source"],
            k=st.session_state["k"],
            fetch_k=st.session_state["fetch_k"],
            chunk_size=st.session_state["chunk_size"],
            temperature=st.session_state["temperature"],
            max_tokens=st.session_state["max_tokens"],
        )
        st.session_state["chat_history"] = []
    except Exception as e:
        msg = f"Failed to build chain for data source {st.session_state['data_source']} with error: {e}"
        logger.error(msg)
        st.error(msg, icon=PAGE_ICON)

def update_usage(cb: OpenAICallbackHandler) -> None:
    # Accumulate API call usage via callbacks
    logger.info(f"Usage: {cb}")
    callback_properties = [
        "total_tokens",
        "prompt_tokens",
        "completion_tokens",
        "total_cost",
    ]
    for prop in callback_properties:
        value = getattr(cb, prop, 0)
        st.session_state["usage"].setdefault(prop, 0)
        st.session_state["usage"][prop] += value

def generate_response(prompt: str) -> str:
    # call the chain to generate responses and add them to the chat history
    with st.spinner("Generating response"), get_openai_callback() as cb:
        response = st.session_state["chain"](
            {"question": prompt, "chat_history": st.session_state["chat_history"]}
        )
        update_usage(cb)
    logger.info(f"Response: '{response}'")
    st.session_state["chat_history"].append((prompt, response["answer"]))
    return response["answer"]
constants.py
from pathlib import Path

APP_NAME = "DataChad"
MODEL = "gpt-3.5-turbo"
PAGE_ICON = "🤖"

K = 10
FETCH_K = 20
CHUNK_SIZE = 1000
TEMPERATURE = 0.7
MAX_TOKENS = 3375
ENABLE_ADVANCED_OPTIONS = True

DATA_PATH = Path.cwd() / "data"
DEFAULT_DATA_SOURCE = "git@github.com:gustavz/DataChad.git"

REPO_URL = "https://github.com/gustavz/DataChad"

AUTHENTICATION_HELP = f"""
Your credentials are only stored in your session state.\n
The keys are neither exposed nor made visible or stored permanently in any way.\n
Feel free to check out [the code base]({REPO_URL}) to validate how things work.
"""

USAGE_HELP = f"""
These are the accumulated OpenAI API usage metrics.\n
The app uses '{MODEL}' for chat and 'text-embedding-ada-002' for embeddings.\n
Learn more about OpenAI's pricing [here](https://openai.com/pricing#language-models)
"""

OPENAI_HELP = """
You can sign-up for OpenAI's API [here](https://openai.com/blog/openai-api).\n
Once you are logged in, you find the API keys [here](https://platform.openai.com/account/api-keys)
"""

ACTIVELOOP_HELP = """
You can create an Activeloop account (including 200GB of free database storage) [here](https://www.activeloop.ai/).\n
Once you are logged in, you find the API token [here](https://app.activeloop.ai/profile/gustavz/apitoken).\n
The organisation name is your username, or you can create new organisations [here](https://app.activeloop.ai/organization/new/create)
"""
Concluding Remarks: Build your Chat with Data Tool, or Use DataChad
DataChad elevates conversing with CSVs, PDFs, JSONs, GitHub repositories, local paths, or web URLs to a completely new level. If you’ve read this far, consider giving DataChad a try.
By harnessing the power of embeddings, Deep Lake’s vector database for all AI data, large language models (LLMs), and LangChain, DataChad enables users to query any data source with ease. DataChad seamlessly transforms any data into text documents, embeds them using OpenAI embeddings, stores the embeddings as a vector dataset in Activeloop’s Deep Lake Cloud, and creates a LangChain chain that serves as the context for generating precise responses to user queries. Whether the task at hand is understanding a complex project or seeking quick answers from a single data source, DataChad allows users to pose natural language questions and receive relevant answers in seconds.
DataChad - Chat with Any Data FAQs
How can I deploy a ChatGPT for my Data fully locally?
If your data is sensitive and you would like to keep it private, you can still use DataChad to chat with it: just select Local Mode in the settings, and all data will be processed on your machine with no API calls.
Can I deploy chat with data fully on-premise?
Yes, if your enterprise data needs to be fully secure and you’re looking to self-host a “ChatGPT” for your data without giving third-party access, you can deploy DataChad in Local Mode with the serverless Deep Lake vector database. With the help of open-source models like GPT4all, you can run the embedding computation fully locally, without sending your data to providers like Anthropic or OpenAI.
Can I use ChatGPT to chat with multiple files at the same time?
Yes, DataChad supports chatting with many files at the same time. You can chat with PDFs, text documents, Word documents, or CSV files all at the same time.