Merge pull request mayooear#71 from mayooear/feat/add-directory-loader

Add directory loader to load multiple pdf files
spacepirate0001 · Mar 28, 2023 · ef4046d
2 parents b4c88e1 + 90381f0

Showing 4 changed files with 23 additions and 29 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
-# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Docs
+# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Files
 
-Use the new GPT-4 api to build a chatGPT chatbot for Large PDF docs (56 pages used in this example).
+Use the new GPT-4 API to build a ChatGPT chatbot for multiple large PDF files.
 
 Tech stack used includes LangChain, Pinecone, Typescript, Openai, and Next.js. LangChain is a framework that makes it easier to build scalable AI/LLM apps and chatbots. Pinecone is a vectorstore for storing embeddings and your PDF in text to later retrieve similar docs.
 
@@ -48,15 +48,15 @@ PINECONE_INDEX_NAME=
 
 5. In `utils/makechain.ts` chain change the `QA_PROMPT` for your own usecase. Change `modelName` in `new OpenAIChat` to `gpt-3.5-turbo`, if you don't have access to `gpt-4`. Please verify outside this repo that you have access to `gpt-4`, otherwise the application will not work with it.
 
-## Convert your PDF to embeddings
+## Convert your PDF files to embeddings
 
-1. In `docs` folder replace the pdf with your own pdf doc.
+**This repo can load multiple PDF files**
 
-2. In `scripts/ingest-data.ts` replace `filePath` with `docs/{yourdocname}.pdf`
+1. Inside the `docs` folder, add your PDF files or folders that contain PDF files.
 
-3. Run the script `pnpm run ingest` to 'ingest' and embed your docs
+2. Run the script `npm run ingest` to 'ingest' and embed your docs. If you run into errors, see the troubleshooting section below.
 
-4. Check Pinecone dashboard to verify your namespace and vectors have been added.
+3. Check Pinecone dashboard to verify your namespace and vectors have been added.
 
 ## Run the app
 
@@ -73,16 +73,15 @@ In general, keep an eye out in the `issues` and `discussions` section of this re
 - Check that you've created an `.env` file that contains your valid (and working) API keys, environment and index name.
 - If you change `modelName` in `OpenAIChat` note that the correct name of the alternative model is `gpt-3.5-turbo`
 - Make sure you have access to `gpt-4` if you decide to use. Test your openAI keys outside the repo and make sure it works and that you have enough API credits.
+- Check that your PDF file is not corrupted; a corrupted file cannot be parsed.
 
 **Pinecone errors**
 
 - Make sure your pinecone dashboard `environment` and `index` matches the one in the `pinecone.ts` and `.env` files.
 - Check that you've set the vector dimensions to `1536`.
 - Make sure your pinecone namespace is in lowercase.
 - Pinecone indexes of users on the Starter(free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone to reset the counter.
-- Retry with a new Pinecone index.
-
-If you're stuck after trying all these steps, delete `node_modules`, restart your computer, then `pnpm install` again.
+- Retry from scratch with a new Pinecone index and a freshly cloned repo.
 
 ## Credit
 

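The updated README says PDF files can sit directly in `docs` or in nested folders such as `docs/law` and `docs/finance`. As a rough illustration of what "load every PDF under a directory" involves, here is a minimal sketch using only Node's standard library. The `collectFiles` helper is hypothetical; it is not part of this repo and is not how LangChain's `DirectoryLoader` is implemented.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical helper (an assumption for illustration, not repo code):
// walk `dir` recursively and return the paths of all files whose
// extension matches `ext` (case-insensitive), in sorted order.
function collectFiles(dir: string, ext: string): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Descend into sub-folders such as docs/law or docs/finance.
      found.push(...collectFiles(fullPath, ext));
    } else if (
      entry.isFile() &&
      path.extname(entry.name).toLowerCase() === ext.toLowerCase()
    ) {
      found.push(fullPath);
    }
  }
  return found.sort();
}
```

Calling `collectFiles('docs', '.pdf')` would list every PDF that a directory-based ingest is expected to pick up, which can help when checking why a file was skipped.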
diff --git a/docs/finance/turingfinance.pdf b/docs/finance/turingfinance.pdf
diff --git a/docs/MorseVsFrederick.pdf b/docs/law/MorseVsFrederick.pdf
diff --git a/scripts/ingest-data.ts b/scripts/ingest-data.ts
@@ -4,18 +4,20 @@ import { PineconeStore } from 'langchain/vectorstores';
 import { pinecone } from '@/utils/pinecone-client';
 import { CustomPDFLoader } from '@/utils/customPDFLoader';
 import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
+import { DirectoryLoader } from 'langchain/document_loaders';
 
-/* Name of directory to retrieve files from. You can change this as required */
-const filePath = 'docs/MorseVsFrederick.pdf';
+/* Name of directory to retrieve your files from */
+const filePath = 'docs';
 
 export const run = async () => {
   try {
-    /*load raw docs from the pdf file in the directory */
-    const loader = new CustomPDFLoader(filePath);
-    // const loader = new PDFLoader(filePath);
-    const rawDocs = await loader.load();
+    /* Load raw docs from all files in the directory */
+    const directoryLoader = new DirectoryLoader(filePath, {
+      '.pdf': (path) => new CustomPDFLoader(path),
+    });
 
-    console.log(rawDocs);
+    // const loader = new PDFLoader(filePath);
+    const rawDocs = await directoryLoader.load();
 
     /* Split text into chunks */
     const textSplitter = new RecursiveCharacterTextSplitter({
@@ -32,18 +34,11 @@ export const run = async () => {
     const index = pinecone.Index(PINECONE_INDEX_NAME); //change to your own index name
 
     //embed the PDF documents
-
-    /* Pinecone recommends a limit of 100 vectors per upsert request to avoid errors*/
-    const chunkSize = 50;
-    for (let i = 0; i < docs.length; i += chunkSize) {
-      const chunk = docs.slice(i, i + chunkSize);
-      console.log('chunk', i, chunk);
-      await PineconeStore.fromDocuments(chunk, embeddings, {
-        pineconeIndex: index,
-        namespace: PINECONE_NAME_SPACE,
-        textKey: 'text',
-      });
-    }
+    await PineconeStore.fromDocuments(docs, embeddings, {
+      pineconeIndex: index,
+      namespace: PINECONE_NAME_SPACE,
+      textKey: 'text',
+    });
   } catch (error) {
     console.log('error', error);
     throw new Error('Failed to ingest your data');
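
The removed loop upserted documents 50 at a time, citing Pinecone's recommended limit of 100 vectors per upsert request. This diff does not show whether `PineconeStore.fromDocuments` batches internally, so with many PDFs a single call may or may not stay under that limit. If batching is needed again, it could be reintroduced roughly as below. Both `toBatches` and `upsertInBatches` are hypothetical helpers written for this sketch, not part of the repo or of LangChain.

```typescript
// Hypothetical helper: split `items` into consecutive batches of at most `size`.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical driver: call `upsert` on each batch in order and return the
// number of batches sent. A rejected promise stops the loop and propagates.
async function upsertInBatches<T>(
  items: T[],
  size: number,
  upsert: (batch: T[]) => Promise<void>,
): Promise<number> {
  let sent = 0;
  for (const batch of toBatches(items, size)) {
    await upsert(batch);
    sent += 1;
  }
  return sent;
}
```

In `scripts/ingest-data.ts`, the `upsert` callback would wrap the same `PineconeStore.fromDocuments(batch, embeddings, { pineconeIndex: index, namespace: PINECONE_NAME_SPACE, textKey: 'text' })` call shown in the diff, with `docs` passed as `items`.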