Merge pull request mayooear#71 from mayooear/feat/add-directory-loader
Add directory loader to load multiple pdf files
mayooear authored Mar 28, 2023
2 parents b4c88e1 + 90381f0 commit ef4046d
Showing 4 changed files with 23 additions and 29 deletions.
19 changes: 9 additions & 10 deletions README.md
@@ -1,6 +1,6 @@
# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Docs
# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Files

Use the new GPT-4 api to build a chatGPT chatbot for Large PDF docs (56 pages used in this example).
Use the new GPT-4 api to build a chatGPT chatbot for multiple Large PDF files.

Tech stack used includes LangChain, Pinecone, Typescript, Openai, and Next.js. LangChain is a framework that makes it easier to build scalable AI/LLM apps and chatbots. Pinecone is a vectorstore for storing embeddings and your PDF in text to later retrieve similar docs.

@@ -48,15 +48,15 @@ PINECONE_INDEX_NAME=

5. In `utils/makechain.ts` chain change the `QA_PROMPT` for your own usecase. Change `modelName` in `new OpenAIChat` to `gpt-3.5-turbo`, if you don't have access to `gpt-4`. Please verify outside this repo that you have access to `gpt-4`, otherwise the application will not work with it.

## Convert your PDF to embeddings
## Convert your PDF files to embeddings

1. In `docs` folder replace the pdf with your own pdf doc.
**This repo can load multiple PDF files**

2. In `scripts/ingest-data.ts` replace `filePath` with `docs/{yourdocname}.pdf`
1. Inside `docs` folder, add your pdf files or folders that contain pdf files.

3. Run the script `pnpm run ingest` to 'ingest' and embed your docs
2. Run the script `npm run ingest` to 'ingest' and embed your docs. If you run into errors troubleshoot below.

4. Check Pinecone dashboard to verify your namespace and vectors have been added.
3. Check Pinecone dashboard to verify your namespace and vectors have been added.

## Run the app

@@ -73,16 +73,15 @@ In general, keep an eye out in the `issues` and `discussions` section of this re
- Check that you've created an `.env` file that contains your valid (and working) API keys, environment and index name.
- If you change `modelName` in `OpenAIChat` note that the correct name of the alternative model is `gpt-3.5-turbo`
- Make sure you have access to `gpt-4` if you decide to use. Test your openAI keys outside the repo and make sure it works and that you have enough API credits.
- Your pdf file is corrupted and cannot be parsed.

**Pinecone errors**

- Make sure your pinecone dashboard `environment` and `index` matches the one in the `pinecone.ts` and `.env` files.
- Check that you've set the vector dimensions to `1536`.
- Make sure your pinecone namespace is in lowercase.
- Pinecone indexes of users on the Starter(free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone to reset the counter.
- Retry with a new Pinecone index.

If you're stuck after trying all these steps, delete `node_modules`, restart your computer, then `pnpm install` again.
- Retry from scratch with a new Pinecone index and cloned repo.

## Credit

Binary file added docs/finance/turingfinance.pdf
Binary file not shown.
File renamed without changes.
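
This commit adds a PDF under `docs/finance/`, so ingestion now has to pick up PDFs in nested folders, not only at the top level of `docs`. The sketch below illustrates that kind of recursive walk using only Node's built-in modules. It is not LangChain's `DirectoryLoader`: `findPdfFiles` is a hypothetical helper, and the exact, case-sensitive `.pdf` match is an assumption made to mirror the `'.pdf'` key used in `scripts/ingest-data.ts`.

```typescript
import { mkdirSync, mkdtempSync, readdirSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { extname, join, relative } from 'path';

/* Hypothetical helper (not part of this repo or LangChain):
   recursively collect the path of every .pdf file under `dir`. */
function findPdfFiles(dir: string): string[] {
  const found: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const fullPath = join(dir, entry.name);
    if (entry.isDirectory()) {
      // Descend into subfolders such as docs/finance
      found.push(...findPdfFiles(fullPath));
    } else if (entry.isFile() && extname(entry.name) === '.pdf') {
      // Assumption: exact, case-sensitive extension match
      found.push(fullPath);
    }
  }
  return found.sort();
}

/* Usage: build a throwaway folder tree and list the PDFs inside it */
const root = mkdtempSync(join(tmpdir(), 'docs-'));
mkdirSync(join(root, 'finance'));
writeFileSync(join(root, 'a.pdf'), '');
writeFileSync(join(root, 'notes.txt'), '');
writeFileSync(join(root, 'finance', 'b.pdf'), '');

const pdfs = findPdfFiles(root).map((p) => relative(root, p));
console.log(pdfs);
```

Non-PDF files such as `notes.txt` are skipped here. How the real loader treats unknown extensions is not shown in this diff.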
33 changes: 14 additions & 19 deletions scripts/ingest-data.ts
@@ -4,18 +4,20 @@ import { PineconeStore } from 'langchain/vectorstores';
 import { pinecone } from '@/utils/pinecone-client';
 import { CustomPDFLoader } from '@/utils/customPDFLoader';
 import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
+import { DirectoryLoader } from 'langchain/document_loaders';

-/* Name of directory to retrieve files from. You can change this as required */
-const filePath = 'docs/MorseVsFrederick.pdf';
+/* Name of directory to retrieve your files from */
+const filePath = 'docs';

 export const run = async () => {
   try {
-    /*load raw docs from the pdf file in the directory */
-    const loader = new CustomPDFLoader(filePath);
-    // const loader = new PDFLoader(filePath);
-    const rawDocs = await loader.load();
+    /*load raw docs from the all files in the directory */
+    const directoryLoader = new DirectoryLoader(filePath, {
+      '.pdf': (path) => new CustomPDFLoader(path),
+    });

-    console.log(rawDocs);
+    // const loader = new PDFLoader(filePath);
+    const rawDocs = await directoryLoader.load();

     /* Split text into chunks */
     const textSplitter = new RecursiveCharacterTextSplitter({
@@ -32,18 +34,11 @@ export const run = async () => {
     const index = pinecone.Index(PINECONE_INDEX_NAME); //change to your own index name

     //embed the PDF documents
-
-    /* Pinecone recommends a limit of 100 vectors per upsert request to avoid errors*/
-    const chunkSize = 50;
-    for (let i = 0; i < docs.length; i += chunkSize) {
-      const chunk = docs.slice(i, i + chunkSize);
-      console.log('chunk', i, chunk);
-      await PineconeStore.fromDocuments(chunk, embeddings, {
-        pineconeIndex: index,
-        namespace: PINECONE_NAME_SPACE,
-        textKey: 'text',
-      });
-    }
+    await PineconeStore.fromDocuments(docs, embeddings, {
+      pineconeIndex: index,
+      namespace: PINECONE_NAME_SPACE,
+      textKey: 'text',
+    });
   } catch (error) {
     console.log('error', error);
     throw new Error('Failed to ingest your data');
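
The second hunk of `scripts/ingest-data.ts` drops the loop that upserted documents 50 at a time and sends all of `docs` in one `PineconeStore.fromDocuments` call. With many PDFs in `docs`, that single call may become large. If batching is needed again, it could be factored into a helper along the lines below. This is a minimal sketch, not code from the repo: `toBatches` and `upsertInBatches` are hypothetical names, and the `upsert` callback stands in for the `PineconeStore.fromDocuments` call shown in the diff.

```typescript
/* Hypothetical helper: split an array into consecutive batches of at most `size` items */
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/* Hypothetical driver: pass each batch to `upsert` sequentially and
   return how many calls were made. In the ingest script, `upsert`
   would wrap the PineconeStore.fromDocuments call. */
async function upsertInBatches<T>(
  items: T[],
  size: number,
  upsert: (batch: T[]) => Promise<void>,
): Promise<number> {
  let calls = 0;
  for (const batch of toBatches(items, size)) {
    await upsert(batch);
    calls += 1;
  }
  return calls;
}

/* Usage with a stand-in upsert that only logs the batch size */
const fakeDocs = Array.from({ length: 120 }, (_, i) => `doc-${i}`);
upsertInBatches(fakeDocs, 50, async (batch) => {
  console.log('upserting', batch.length, 'documents');
}).then((calls) => console.log('upsert calls:', calls));
```

Awaiting each batch before sending the next keeps the requests sequential, matching the behaviour of the removed loop.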
