index out of range (between phase 1 and phase 2) #51
Comments
Hmm, it looks like a lot of your paragraphs are being "judged as unworthy for questions" by the paragraph judgement step: either it thinks they're all metadata, or the model is messing up a lot for some reason. One solution might be to turn off filtering entirely, using SKIP/FILTER_CHUNKS. Thanks for bringing this up, by the way; I've added a clearer error message for this case in a recent push.
Let me know if turning off filter chunks solves it. I'd also be curious to see some of the intermediate outputs in ./output/judge_paragraph_generations/intermediate_generations/, because if there is factual information in your files then it shouldn't be dropping all of them.
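For checking whether the judgement step wrote anything at all, a minimal sketch along these lines might help. It assumes only that the intermediate outputs are ordinary files somewhere under the directory named above; `summarize_outputs` is a hypothetical helper, not part of augmentoolkit.

```python
from collections import Counter
from pathlib import Path


def summarize_outputs(directory):
    """Count files under `directory` by extension.

    Hypothetical helper for illustration; not an augmentoolkit function.
    Returns None when the directory does not exist.
    """
    root = Path(directory)
    if not root.is_dir():
        # Report a missing directory to the caller instead of raising.
        return None
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.suffix or "(no extension)"] += 1
    return dict(counts)


if __name__ == "__main__":
    # Path taken from the comment above; adjust to match PATH.OUTPUT in config.yaml.
    target = "./output/judge_paragraph_generations/intermediate_generations/"
    summary = summarize_outputs(target)
    if summary is None:
        print(f"{target} does not exist; check PATH.OUTPUT in config.yaml")
    else:
        print(summary)
```

If the directory is reported as missing, the outputs may simply live under whatever PATH.OUTPUT points to rather than ./output.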
I am running the most recent release of augmentoolkit in a Docker container that is running Python 3.11. I turned off filter chunks in the config.yaml in the /original folder. augmentoolkit appears to complete phase 1 and then fails with a different "index out of range" error. I do not see the directory ./output/judge_paragraph_generations/intermediate_generations/.
I am using the latest augmentoolkit.
Phase 1 appears to complete without errors.
I am including the config.yaml and the output that was written to the screen. Are there parameters that are missing?
Intermediate files are present:
ls -l ../outFiles/
total 6684
drwxr-xr-x 4 root root 4096 Sep 18 14:50 judge_paragraph_generations
-rw-r--r-- 1 root root 4776753 Sep 18 15:38 judge_paragraph_generations_DATAGEN_OUTPUT.jsonl
-rw-r--r-- 1 root root 2058739 Sep 18 14:50 pretraining.json
Here is the config.yaml:
API:
  API_KEY: xxx
  BASE_URL: http:// xxx /
  LARGE_LOGICAL_MODEL: llama3.1
  LOGICAL_MODEL: llama3.1
HUGGINGFACE:
  HUB_PATH: Heralax/test-atk-dataset-do-not-use-3
  PRIVATE: False
  PUSH_TO_HUB: False
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ../../trainingFiles
  OUTPUT: ../../outFiles
  PROMPTS: ./prompts
PHASE:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: False
  QUESTION_CHECK: False
SYSTEM:
  CHUNK_SIZE: 1900
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 10
  CONVERSATION_INSTRUCTIONS: For this conversation, you are generating a chat between
    a generalist, generic AI assistant, and a human.
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPT_NO_RAG: 'You are a helpful AI assistant.
  FINAL_ASSISTANT_PROMPT_RAG: 'You are a helpful AI assistant.
  MODE: api
  STOP: True
  SUBSET_SIZE: 15
  USE_FILENAMES: False
  USE_SUBSET: False
Here is the output just before the error:
{'paragraph': None, 'metadata': '../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/priorartdatabasekb.glossary.WordPress.2024-07-27.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
Converting generations to training data
entering saving mode
Converting ../../outFiles/judge_paragraph_generations/intermediate_generations to a dataset
...Converted successfully (we think)
Traceback (most recent call last):
  File "/tmp/augmentoolkit-master/original/processing.py", line 374, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/tmp/augmentoolkit-master/original/processing.py", line 222, in main
    print(filtered_worthy_for_questions[0])
IndexError: list index out of range
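The traceback points at an unguarded `filtered_worthy_for_questions[0]`, which raises as soon as every paragraph has been dropped. A minimal sketch of a guard, assuming the entries are dicts shaped like the ones printed in the log above; `keep_worthy` is a hypothetical name, and this is not the actual code in processing.py.

```python
def keep_worthy(judged):
    """Drop entries whose 'paragraph' is None and fail loudly if none remain.

    Hypothetical helper: assumes dicts with 'paragraph' and 'metadata' keys,
    as in the log output; not the actual augmentoolkit implementation.
    """
    worthy = [item for item in judged if item.get("paragraph") is not None]
    if not worthy:
        # Raise a descriptive error instead of letting worthy[0] fail later.
        raise ValueError(
            "No paragraphs survived judgement: all "
            f"{len(judged)} entries have paragraph=None. "
            "Check the model output or the input files."
        )
    return worthy


if __name__ == "__main__":
    # Illustrative data only; the real entries come from the judgement step.
    judged = [
        {"paragraph": None, "metadata": "a.md"},
        {"paragraph": "Some factual text.", "metadata": "b.md"},
    ]
    print(keep_worthy(judged)[0])
```

With the log shown here, where every entry has 'paragraph': None, a guard like this would stop with the descriptive message rather than an IndexError.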