index out of range (between phase 1 and phase 2) #51

mjh624 · 2024-09-18T16:34:25Z

I am using the latest augmentoolkit.
Phase 1 appears to complete without errors.
I am including the config.yaml and the output that was written to the screen. Are there parameters that are missing?

Intermediate files are present:
ls -l ../outFiles/
total 6684
drwxr-xr-x 4 root root 4096 Sep 18 14:50 judge_paragraph_generations
-rw-r--r-- 1 root root 4776753 Sep 18 15:38 judge_paragraph_generations_DATAGEN_OUTPUT.jsonl
-rw-r--r-- 1 root root 2058739 Sep 18 14:50 pretraining.json

Here is the config.yaml:
API:
  API_KEY: xxx
  BASE_URL: http:// xxx /
  LARGE_LOGICAL_MODEL: llama3.1
  LOGICAL_MODEL: llama3.1
HUGGINGFACE:
  HUB_PATH: Heralax/test-atk-dataset-do-not-use-3
  PRIVATE: False
  PUSH_TO_HUB: False
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ../../trainingFiles
  OUTPUT: ../../outFiles
  PROMPTS: ./prompts
PHASE:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: False
  QUESTION_CHECK: False
SYSTEM:
  CHUNK_SIZE: 1900
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 10
  CONVERSATION_INSTRUCTIONS: For this conversation, you are generating a chat between
    a generalist, generic AI assistant, and a human.
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPT_NO_RAG: 'You are a helpful AI assistant.

    '
  FINAL_ASSISTANT_PROMPT_RAG: 'You are a helpful AI assistant.

    Context information is below:


    ----------------------

    {data}

    '

  MODE: api
  STOP: True
  SUBSET_SIZE: 15
  USE_FILENAMES: False
  USE_SUBSET: False

Here is the output just before the error:

{'paragraph': None, 'metadata': '../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/priorartdatabasekb.glossary.WordPress.2024-07-27.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
Converting generations to training data
entering saving mode
Converting ../../outFiles/judge_paragraph_generations/intermediate_generations to a dataset
...Converted successfully (we think)
Traceback (most recent call last):
  File "/tmp/augmentoolkit-master/original/processing.py", line 374, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/tmp/augmentoolkit-master/original/processing.py", line 222, in main
    print(filtered_worthy_for_questions[0])
IndexError: list index out of range
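For anyone hitting the same traceback, here is a minimal sketch of how the judged chunks could be summarized before the failing `print(...[0])` line. It assumes each judged chunk is a dict with `paragraph` and `metadata` keys, as in the printed output above; the function names `summarize_judgements` and `require_worthy` are hypothetical and not part of augmentoolkit.

```python
from collections import Counter


def summarize_judgements(judged_chunks):
    """Count kept and dropped chunks per source file.

    Assumes each chunk looks like {'paragraph': ..., 'metadata': path},
    where paragraph is None when the chunk was rejected (an assumption
    based on the printed output in this issue).
    """
    kept = Counter()
    dropped = Counter()
    for chunk in judged_chunks:
        target = dropped if chunk["paragraph"] is None else kept
        target[chunk["metadata"]] += 1
    return kept, dropped


def require_worthy(judged_chunks):
    """Return the chunks that were kept, or fail with a readable message."""
    worthy = [c for c in judged_chunks if c["paragraph"] is not None]
    if not worthy:
        raise ValueError(
            "No chunk passed the judgement step; check the input path, "
            "the model, and the input data."
        )
    return worthy


# Sample entries copied from the output above (all rejected).
sample = [
    {"paragraph": None, "metadata": "../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md"},
    {"paragraph": None, "metadata": "../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md"},
]
kept_counts, dropped_counts = summarize_judgements(sample)
for source, count in dropped_counts.items():
    print(f"{source}: {count} dropped, {kept_counts[source]} kept")
```

A per-file count like this makes it easier to tell whether every input file is being rejected or only some of them.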

e-p-armstrong · 2024-09-19T22:12:39Z

Hmm, it looks like a lot of your paragraphs are being "judged as unworthy for questions" by the paragraph judgement step. Either it thinks they're all metadata, or the model is messing up a lot for some reason. One solution might be to turn off filtering entirely, using the FILTER_CHUNKS setting under SKIP in config.yaml?

Thanks for bringing this up btw, I've added a clearer error message in a recent push:

if len(filtered_worthy_for_questions) == 0:
        print("No paragraphs were judged worthy for questions. Either the judgement step thinks everything you added is metadata or has no factual information, or your input path is wrong, or the model is being stupid. Check your input directory path, your model, and your input data. The intermediate outputs at the end of each file in ./output/judge_paragraph_generations/intermediate_generations/ may help you diagnose the problem.")
        sys.exit(1)

Let me know if turning off filter chunks solves it. Also, I'd maybe be curious to see some of the intermediate outputs in ./output/judge_paragraph_generations/intermediate_generations/, because if there is factual information in your files then it shouldn't be dropping all of them.
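If it helps with sharing those intermediate outputs, here is a rough sketch that prints the last few lines of every file under a directory. It makes no assumption about the file format and just treats each file as text; the helper name `tail_of_each_file` is hypothetical, and the directory you pass in should be whatever your OUTPUT path resolves to.

```python
import tempfile
from pathlib import Path


def tail_of_each_file(directory, max_lines=20):
    """Map each file under `directory` to its last `max_lines` lines.

    max_lines is expected to be a positive integer.
    """
    tails = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            tails[str(path)] = text.splitlines()[-max_lines:]
    return tails


# Small self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = Path(demo_dir) / "example.txt"
    demo_file.write_text("\n".join(f"line {i}" for i in range(1, 31)), encoding="utf-8")
    demo_tails = tail_of_each_file(demo_dir, max_lines=3)
    for name, lines in demo_tails.items():
        print(name)
        for line in lines:
            print("   ", line)
```

Pointing it at the intermediate_generations directory and pasting a few of the tails here should be enough to see what the judge is returning.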

mjh624 · 2024-09-20T18:37:29Z

I am running the most recent release of augmentoolkit in a Docker container that is running Python 3.11:
root@e6e8132f1ba5:/tmp/augmentoolkit# python --version
Python 3.11.10

I turned off filter chunks in the config.yaml in the /original folder:
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: True
  QUESTION_CHECK: False

augmentoolkit appears to complete phase 1 and fails with a different "index out of range" error:
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/6d4fd93e-3a9d-458b-bde1-5ff938bdd97b.yaml
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/7f40d689-700b-4400-b215-0295b077fd90.yaml
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/4a970d6c-4d1b-4569-9ceb-7f7da41645e0.yaml
COMPLETED PHASE 1
asyncio.run(main())
File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/tmp/augmentoolkit/original/processing.py", line 264, in main
print(generated_qa_dicts[0])
~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Augmentoolkit is starting to run! If this is your first time running this it might take a few moments to start due to imports and such.
root@e6e8132f1ba5:/tmp/augmentoolkit#

I do not see the directory ./output/judge_paragraph_generations/intermediate_generations/:
root@e6e8132f1ba5:/tmp/augmentoolkit# ls outFiles/
pretraining.json
root@e6e8132f1ba5:/tmp/augmentoolkit# ls ../outFiles/
pretraining.json qatuples_filtered/ question_generation_generations/
root@e6e8132f1ba5:/tmp/augmentoolkit#
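A quick sketch of where the relative OUTPUT path ends up, in case it helps explain the listings above. It assumes the path `../../outFiles` is resolved against the directory that contains processing.py, which is what the `/tmp/outFiles/...` paths in the log suggest; the helper name `resolve_output_dir` is hypothetical.

```python
import posixpath


def resolve_output_dir(base_dir, configured_output):
    """Join a configured relative path onto base_dir and collapse any '..' parts."""
    return posixpath.normpath(posixpath.join(base_dir, configured_output))


# Resolved against the folder holding processing.py (assumed base directory).
print(resolve_output_dir("/tmp/augmentoolkit/original", "../../outFiles"))

# Resolved against the repository root instead, for comparison.
print(resolve_output_dir("/tmp/augmentoolkit", "../../outFiles"))
```

Under that assumption the output directory is /tmp/outFiles, which is the same location as `../outFiles/` when listed from /tmp/augmentoolkit.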
