This repository provides a set of examples for performing document analysis using OpenAI's GPT-3 language model. They are:
- Data Exploration - Explore data used in this repo.
- Get Embeddings - Generate embeddings from documents using GPT-3.
- Visualise Embeddings - Visualise embeddings in a 3D plot.
- Classify Documents - Use GPT-3 to classify documents into different categories.
- Summarize Documents - Automatically generate summaries of documents using GPT-3.
- Extract Key Information - Identify key information in documents using GPT-3.
- Extract Key Words - Extract important words from documents using GPT-3.
- Semantic Search - Retrieve relevant documents.
- Retrieve Information based on Context - Answer a query based on given context.
- Unstructured Data to Structured Data - Extract specified entities and put them into a table.
This project is a data analysis of the BBC news dataset. The goal of this project is to explore the data, classify documents into categories, summarize documents, extract key information from documents and extract keywords from documents.
The dataset used in this project is the BBC News Archive, available from Kaggle. It contains 2225 articles from the BBC news website in 5 categories: business, entertainment, politics, sport and tech. Each article has a category, filename, title and text.
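As a rough illustration of the record layout described above, the sketch below reads a few rows with the four listed fields and computes a class distribution and per-document word counts. The sample rows are placeholders, and the CSV layout, column order and function names are assumptions for illustration, not the format of the actual Kaggle file.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the dataset file (assumption): the real rows,
# delimiter and column order may differ.
SAMPLE = """category,filename,title,text
business,001.txt,Placeholder title A,Placeholder text about markets and trade
sport,002.txt,Placeholder title B,Placeholder text about a football match
sport,003.txt,Placeholder title C,Short placeholder text
"""


def load_articles(handle):
    """Read articles as dictionaries keyed by column name."""
    return list(csv.DictReader(handle))


def class_distribution(articles):
    """Count how many articles fall into each category."""
    return Counter(article["category"] for article in articles)


def words_per_document(articles):
    """Whitespace-token count for each article's text."""
    return [len(article["text"].split()) for article in articles]


articles = load_articles(io.StringIO(SAMPLE))
print(class_distribution(articles))
print(words_per_document(articles))
```

To try this on the real data, pass an open file handle to the hypothetical `load_articles` helper instead of the in-memory sample, adjusting the delimiter if needed.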
Examples in this repo use text-davinci-003.
This project consists of notebooks that perform the following tasks:
- 00-explore-data.ipynb - This notebook explores the data by looking at the distribution of classes, number of words per document, etc.
- 01-get-embeddings.ipynb - This notebook uses pre-trained word embeddings to create vector representations for each document.
- 02-visualise-embeddings.ipynb - This notebook visualises word embeddings in a 3D plot.
- 03-classify-documents.ipynb - This notebook builds classification models using Random Forest and XGBoost to predict the class of each document.
- 04-summarize-documents.ipynb - This notebook uses GPT-3 to generate summaries for each document.
- 05-extract-key-information.ipynb - This notebook extracts key information from each document such as people, organizations, locations, etc.
- 06-extract-key-words.ipynb - This notebook extracts important keywords from each document.
- 07-semantic-search.ipynb - This notebook performs semantic search to retrieve the most relevant news from a specific corpus by comparing the similarity of the query embedding to the embeddings of the texts in the corpus.
- 08-retrieve-information.ipynb - This notebook retrieves information based on a given context. This is achieved by constructing the prompt with the context.
- 09-unstructure-data-to-structured-data.ipynb - This notebook extracts specified entities and arranges them into a table.
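The semantic search and context-based retrieval steps described above can be sketched as follows. This is a minimal illustration, not the notebooks' code: it assumes cosine similarity as the similarity measure, uses hand-written three-dimensional toy vectors in place of real embeddings, and the helper names (`cosine_similarity`, `top_k`, `build_prompt`) and prompt wording are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, contexts):
    """Place the retrieved texts ahead of the question in one prompt."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy corpus (assumption): real embeddings would come from an
# embedding model and have far more dimensions.
corpus = [
    ("placeholder article about football", [0.9, 0.1, 0.0]),
    ("placeholder article about markets", [0.1, 0.9, 0.0]),
    ("placeholder article about phones", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]

hits = top_k(query_vec, corpus, k=2)
prompt = build_prompt("Which article is about sport?", [text for _, text in hits])
print(prompt)
```

The resulting prompt string would then be sent to a completion model; that call is left out here because it depends on the client library and credentials in use.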
Note: This README.md is co-authored with text-davinci-003.
- OpenAI repo: https://github.com/openai/openai-cookbook/
- Which embedding model to use? https://openai.com/blog/new-and-improved-embedding-model/