This project provides an API for custom text search over large text documents, built on the LangChain library. It aims to deliver efficient and relevant text retrieval for specific use cases.
- Clone the repository.
- Install the required packages:

  ```
  pip install -r requirements.txt
  ```

- Place your text file in the `data` folder.
- Set your OpenAI API key by creating a `keys.py` file in the `src` folder and adding your key as a string variable: `key = "..."`.
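The `keys.py` file can be as small as a single assignment (the value shown is a placeholder, not a real key):

```python
# src/keys.py -- replace the placeholder with your own OpenAI API key
key = "..."
```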
The main logic is encapsulated in the `TextSearch` class, which provides methods to load documents, split text into chunks, store the chunks, and retrieve relevant chunks. A separate function, `run_query`, executes a query and prints the result.
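The real class delegates to LangChain and OpenAI, so its exact signatures depend on the code in `src`. As a rough, dependency-free illustration of the load → split → store → retrieve flow (all names below are hypothetical, and keyword overlap stands in for embedding similarity):

```python
class ToyTextSearch:
    """Toy sketch of the TextSearch flow; not the project's real class."""

    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size
        self.raw = ""
        self.chunks = []  # in the real code, chunks live in a vector store

    def load_document(self, text):
        self.raw = text

    def split_text(self):
        # Fixed-size character chunks; the real splitter is LangChain-based.
        self.chunks = [self.raw[i:i + self.chunk_size]
                       for i in range(0, len(self.raw), self.chunk_size)]

    def retrieve(self, query):
        # Real retrieval ranks chunks by embedding similarity; here we use
        # naive word overlap between query and chunk as a stand-in.
        query_words = set(query.lower().split())
        return max(self.chunks,
                   key=lambda c: len(query_words & set(c.lower().split())))


def run_query(search, query):
    """Mirror of the run_query helper: execute a query and print the result."""
    result = search.retrieve(query)
    print(result)
    return result
```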
- Set your OpenAI API key: make sure it is set in the environment, or modify the code accordingly to include it.
- Prepare the transcript: place your transcript file in the project directory and name it `1_transcript.txt`.
- Run the script from the command line:
```
uvicorn src.main:app --reload
```
- `uvicorn` is the ASGI server used to run FastAPI applications.
- `src.main:app` tells uvicorn where to find the FastAPI application instance (`app`) within your project (the `main` module under the `src` directory).
- `--reload` is an optional argument that enables auto-reloading of the server when code changes are detected. This is particularly useful during development, as it allows changes to take effect without manually restarting the server.
A large language model (LLM) is a type of machine learning model that can perform a variety of natural language processing (NLP) tasks, including generating and classifying text, answering questions in a conversational manner, and translating text from one language to another.
But an LLM only knows what it learned during training. So how do we get it to use private data?
The answer is to convert the private data into embeddings, store them in a vector database, and look up the relevant pieces at query time.
The project utilizes a state-of-the-art text search and retrieval mechanism. It is based on the following key components:
Tokenization is the process of dividing text into smaller units called tokens. In this implementation, the text is broken down into chunks before indexing.
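The exact splitter lives in the project code; a generic fixed-size chunker with overlap (a common default in LangChain-style pipelines, and an assumption here rather than the project's actual method) might look like:

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size character chunks with overlap, so a
    sentence cut at one boundary still appears intact in a neighbor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```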
We use OpenAI Embeddings to represent text in a numerical form, allowing for similarity comparisons and retrievals.
Based on the query, the most relevant chunks are retrieved and processed to form the final response.
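Putting the pieces together, the sketch below mimics the embed-and-retrieve step with a toy bag-of-words vector in place of OpenAI embeddings (the real system calls the embeddings API and a vector store instead; `toy_embed` and `top_k` are illustrative names, not project functions):

```python
import math
from collections import Counter


def toy_embed(text):
    # Stand-in for an embeddings API call: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, chunks, k=2):
    # Rank all stored chunks by similarity to the query; keep the best k.
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]
```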
For a more detailed explanation, refer to the LangChain question-answering (QA) documentation.