SynthLlama is a project for generating synthetic data with language models. You upload PDFs, select the desired data format, and specify how much data you need. The resulting dataset can then be downloaded in either JSON or CSV format.
Large models require substantial amounts of data, and collecting it manually is not always feasible. Synthetic data plays a critical role here by supplementing training data where it is lacking. Our goal in this project is to address this problem by augmenting training data with synthetic data, so that data scarcity is no longer a limiting factor.
Clone the repository:
git clone https://github.com/cows-cats/SynthLlama.git
cd SynthLlama
Install the dependencies:
pip install -r requirements.txt
Start the API:
python api.py
Launch the Streamlit interface:
streamlit run streamlit1.py
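Once a dataset has been exported, it can be read back for inspection or training. The following is a minimal sketch rather than part of SynthLlama itself: the `load_dataset` helper is hypothetical, and it assumes that a JSON export is a list of objects and that a CSV export has a header row. The actual layout of the exported files may differ.

```python
import csv
import json
from pathlib import Path


def load_dataset(path):
    """Load an exported dataset into a list of dicts.

    Hypothetical helper, not part of SynthLlama. Assumes a JSON export
    is a list of objects and a CSV export starts with a header row.
    """
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            return json.load(f)

    if suffix == ".csv":
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    raise ValueError(f"Unsupported dataset format: {suffix!r}")
```

For example, calling load_dataset with the path of a downloaded file returns its records as a list of dictionaries. Note that values read from CSV are always strings, while JSON preserves numbers and booleans.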