StudySetCreator is a Python tool that generates study sets (flashcards) from PDF files using OpenAI's language models. It processes the content of a PDF file, extracts text and images, and uses the OpenAI API to create question-answer pairs suitable for studying or revision purposes. The generated study set is saved as a CSV file, ready to be imported into flashcard applications or used directly.
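To give a sense of how the output can be consumed, here is a minimal Python sketch that reads the generated CSV with the standard `csv` module. The two-column question/answer layout assumed below is an illustration only; check the file the tool actually produces and adjust the indices accordingly.

```python
import csv

# Minimal sketch: iterate over a generated study set.
# Assumption: each row is a question-answer pair; the real column layout
# may differ, so adjust the indices to match the file the tool produces.
with open("study_set.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if len(row) >= 2:
            question, answer = row[0], row[1]
            print(f"Q: {question}\nA: {answer}\n")
```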
- PDF Processing: Extracts text and images from PDF files.
- OpenAI Integration: Utilizes OpenAI's GPT models to generate study cards from the extracted content.
- Batch Processing: Supports processing in chunks to handle large PDF files efficiently.
- Resume Capability: Can resume processing from where it left off in case of interruptions.
- Language Support: Generates study sets in the specified language.
- Customization: Allows customization of various parameters like model selection, output file name, chunk size, etc.
- Python: Version 3.7 or higher.
- OpenAI API Key: Required to access OpenAI's language models.
- Clone the Repository

  ```bash
  git clone https://github.com/jaylann/StudySetCreator.git
  cd StudySetCreator
  ```

- Create a Virtual Environment (optional but recommended)

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install Dependencies

  ```bash
  pip install -r requirements.txt
  ```
The application requires an OpenAI API key to function. This key should be stored in a `.env` file in the project's root directory.
- Create the `.env` File

  There is a `.env.template` file provided in the project. Copy this template to create your `.env` file:

  ```bash
  cp .env.template .env
  ```

- Edit the `.env` File

  Open the `.env` file in a text editor and add your OpenAI API key:

  ```
  OPENAI_API_KEY=your-openai-api-key-here
  ```

  Replace `your-openai-api-key-here` with your actual OpenAI API key.
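As an illustration of how the key can be picked up at runtime, here is a minimal sketch using the `python-dotenv` package. Whether StudySetCreator loads the key exactly this way is an assumption; the sketch only shows a common pattern that is compatible with the `.env` convention documented above.

```python
import os

from dotenv import load_dotenv  # requires the python-dotenv package

# Load variables from the .env file in the current directory into the environment.
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set - check your .env file.")
```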
Run the `main.py` script with the required arguments to generate a study set from a PDF file:

```bash
python main.py [options]
```
- `--model`: OpenAI model to use (default: `gpt-4o-mini`).
- `--output`: Output CSV file name (default: `study_set.csv`).
- `--input`: Input PDF file to process (required).
- `--in_dir`: Input directory containing PDF files to process.
- `--out_dir`: Output directory to save the study sets.
- `--chunk_size`: Number of pages to process at once (default: `10`).
- `--use_batch`: Use OpenAI Batch API for processing.
- `--text_only`: Extract text only, ignore images.
- `--language`: Language for the study set (default: `english`).
- `--no_resume`: Start fresh instead of resuming from the last checkpoint. WARNING: if set and a progress file exists, it will be overwritten.
- Basic Usage

  Generate a study set from `document.pdf` using the default settings:

  ```bash
  python main.py --input document.pdf --output document.csv
  ```

- Specify Output File and Model

  Generate a study set from `lecture_notes.pdf`, using the `gpt-4o` model, and save the output to `flashcards.csv`:

  ```bash
  python main.py --model gpt-4o --output flashcards.csv --input lecture_notes.pdf
  ```

- Process Only Text Content

  Generate a study set ignoring images in the PDF:

  ```bash
  python main.py --text_only --input textbook.pdf --output textbook.csv
  ```

- Use Batch Processing

  Use OpenAI's Batch API to process the PDF (suitable for large PDFs; reduces cost by ~50% but may take longer):

  ```bash
  python main.py --use_batch --input large_document.pdf --output large_document.csv
  ```

- Specify Language

  Generate a study set in Spanish:

  ```bash
  python main.py --language spanish --input notas_de_clase.pdf --output notas_de_clase.csv
  ```
The system prompt used by the OpenAI API can be customized to change how study cards are generated.
- Prompt File: `./storage/prompt.txt`

  Edit this file to modify the prompt. The placeholder `[LANGUAGE]` in the prompt will be replaced with the language specified via the `--language` argument (a sketch of this substitution follows).
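The sketch below shows how that substitution works conceptually. The helper name and the exact way the project performs the replacement are hypothetical; only the file path and the `[LANGUAGE]` placeholder come from the documentation above.

```python
from pathlib import Path

def build_system_prompt(language: str, prompt_path: str = "./storage/prompt.txt") -> str:
    """Hypothetical helper: load the prompt template and fill in the language.

    Mirrors the documented behaviour: the [LANGUAGE] placeholder is replaced
    with the value passed via --language.
    """
    template = Path(prompt_path).read_text(encoding="utf-8")
    return template.replace("[LANGUAGE]", language)

# Example: build_system_prompt("spanish")
```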
The JSON schema defines the expected structure of the API responses.
- Schema File: `./storage/schema.json`

  Edit this file to change the schema if you need the responses in a different format (see the sketch below).
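For orientation, here is a minimal sketch that loads the schema file and shows the kind of structured response such a schema might describe. The field names in the example dictionary are hypothetical, not the project's actual format; `./storage/schema.json` remains the authoritative definition.

```python
import json

# Load the response schema shipped with the project.
with open("./storage/schema.json", encoding="utf-8") as f:
    schema = json.load(f)
print(json.dumps(schema, indent=2))

# Hypothetical example of the kind of structured response such a schema
# might describe (field names are illustrative only):
example_response = {
    "cards": [
        {
            "question": "What does StudySetCreator generate from a PDF?",
            "answer": "A CSV study set of question-answer pairs.",
        }
    ]
}
```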
The application uses logging to provide information about its operation.
- Log Output: The application outputs logs to the console. You can modify the logging configuration in `src/utils/logging.py` if you need to change log levels or output formats (a generic example follows).
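If you only need to change verbosity, a standard-library `logging` tweak like the one below is usually enough. This is a generic sketch, not the actual contents of `src/utils/logging.py`.

```python
import logging

# Raise or lower console verbosity; DEBUG is noisy, WARNING is quiet.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```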
- Resume Processing: If processing is interrupted, the application can resume from where it left off using the progress saved in `progress.json`.
- Progress File: `progress.json` is used to keep track of progress. It can be deleted to start processing from the beginning (see the sketch below).
- Batch Processing Errors: If a batch job fails or is still in progress, an error message will be logged.
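As a small illustration of the documented reset behaviour, the following sketch removes the checkpoint before a fresh run. The file name comes from this section; how the tool itself reads and writes the checkpoint is not shown here.

```python
from pathlib import Path

progress = Path("progress.json")
if progress.exists():
    progress.unlink()  # next run will start processing from the beginning
    print("Checkpoint removed.")
```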
All required Python packages are listed in `requirements.txt`. Install them using:

```bash
pip install -r requirements.txt
```
- API Usage: Be mindful of your OpenAI API usage and billing.
- Supported Models: Ensure that the model you specify (e.g., `gpt-4`) is available to your OpenAI account.
Contributions are welcome! Please open an issue or submit a pull request for any bugs or feature requests.
This project is licensed under the MIT License.
Made with ❤️ by Justin Lanfermann