index out of range (between phase 1 and phase 2) #51

Open
mjh624 opened this issue Sep 18, 2024 · 2 comments

Comments

@mjh624

mjh624 commented Sep 18, 2024

I am using the latest augmentoolkit.
Phase 1 appears to complete without errors.
I am including the config.yaml and the output that was written to the screen. Are any parameters missing?

Intermediate files are present:
ls -l ../outFiles/
total 6684
drwxr-xr-x 4 root root 4096 Sep 18 14:50 judge_paragraph_generations
-rw-r--r-- 1 root root 4776753 Sep 18 15:38 judge_paragraph_generations_DATAGEN_OUTPUT.jsonl
-rw-r--r-- 1 root root 2058739 Sep 18 14:50 pretraining.json

Here is the config.yaml:
API:
  API_KEY: xxx
  BASE_URL: http:// xxx /
  LARGE_LOGICAL_MODEL: llama3.1
  LOGICAL_MODEL: llama3.1
HUGGINGFACE:
  HUB_PATH: Heralax/test-atk-dataset-do-not-use-3
  PRIVATE: False
  PUSH_TO_HUB: False
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ../../trainingFiles
  OUTPUT: ../../outFiles
  PROMPTS: ./prompts
PHASE:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: False
  QUESTION_CHECK: False
SYSTEM:
  CHUNK_SIZE: 1900
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 10
  CONVERSATION_INSTRUCTIONS: For this conversation, you are generating a chat between
    a generalist, generic AI assistant, and a human.
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPT_NO_RAG: 'You are a helpful AI assistant.

    '
  FINAL_ASSISTANT_PROMPT_RAG: 'You are a helpful AI assistant.

    Context information is below:


    ----------------------

    {data}

    '
  MODE: api
  STOP: True
  SUBSET_SIZE: 15
  USE_FILENAMES: False
  USE_SUBSET: False

Here is the output just before the error:

{'paragraph': None, 'metadata': '../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/priorartdatabasekb.glossary.WordPress.2024-07-27.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
Converting generations to training data
entering saving mode
Converting ../../outFiles/judge_paragraph_generations/intermediate_generations to a dataset
...Converted successfully (we think)
Traceback (most recent call last):
  File "/tmp/augmentoolkit-master/original/processing.py", line 374, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/tmp/augmentoolkit-master/original/processing.py", line 222, in main
    print(filtered_worthy_for_questions[0])
IndexError: list index out of range
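
The `{'paragraph': None, ...}` lines in the output above can be tallied per source file to check whether every chunk is being rejected. This is a minimal sketch, assuming each such line is a Python dict literal exactly as printed and that `None` in `paragraph` means the chunk was not kept. The helper name `tally_rejections` is hypothetical and not part of augmentoolkit.

```python
import ast
from collections import Counter


def tally_rejections(log_lines):
    """Count rejected and kept chunks per source file from printed dict lines.

    Assumption: a record whose 'paragraph' is None was rejected by the judge.
    """
    rejected, kept = Counter(), Counter()
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip status messages and traceback lines
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError, TypeError):
            continue  # not a well-formed literal
        if not isinstance(record, dict):
            continue
        source = record.get("metadata", "<unknown>")
        if record.get("paragraph") is None:
            rejected[source] += 1
        else:
            kept[source] += 1
    return rejected, kept
```

If `kept` comes back empty for every file, the list that `processing.py` indexes at line 222 would also be empty, which matches the `IndexError` above.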

@e-p-armstrong
Owner

e-p-armstrong commented Sep 19, 2024

Hmm, it looks like a lot of your paragraphs are being judged as "unworthy for questions" by the paragraph judgement step. Either it thinks they're all metadata, or the model is messing up a lot for some reason. One solution might be to turn off filtering entirely, using the FILTER_CHUNKS setting under SKIP.

Thanks for bringing this up, btw. I've added a clearer error message in a recent push:

if len(filtered_worthy_for_questions) == 0:
        print("No paragraphs were judged worthy for questions. Either the judgement step thinks everything you added is metadata or has no factual information, or your input path is wrong, or the model is being stupid. Check your input directory path, your model, and your input data. The intermediate outputs at the end of each file in ./output/judge_paragraph_generations/intermediate_generations/ may help you diagnose the problem.")
        sys.exit(1)

Let me know if turning off filter chunks solves it. I'd also be curious to see some of the intermediate outputs in ./output/judge_paragraph_generations/intermediate_generations/, because if there is factual information in your files then it shouldn't be dropping all of them.
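
To collect those intermediate outputs, something like the following could print the tail of a few files. It is only a sketch: the directory is supplied by the caller, the idea that the judgement sits near the end of each file comes from the error message above, and `show_tails` is a hypothetical helper rather than anything in augmentoolkit.

```python
from pathlib import Path


def show_tails(directory, max_files=3, tail_lines=15):
    """Return {filename: last few lines} for the first few files in a directory."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root}")
    tails = {}
    # Sort so that repeated runs look at the same files.
    files = sorted(p for p in root.iterdir() if p.is_file())
    for path in files[:max_files]:
        text = path.read_text(encoding="utf-8", errors="replace")
        tails[path.name] = "\n".join(text.splitlines()[-tail_lines:])
    return tails
```

With the paths from the first comment, the call would be `show_tails("../../outFiles/judge_paragraph_generations/intermediate_generations")`, run from the same folder as `processing.py`.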

@mjh624
Author

mjh624 commented Sep 20, 2024

I am running the most recent release of augmentoolkit in a Docker container running Python 3.11:
root@e6e8132f1ba5:/tmp/augmentoolkit# python --version
Python 3.11.10

I turned off chunk filtering by setting FILTER_CHUNKS to True under SKIP in the config.yaml in the /original folder:
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: True
  QUESTION_CHECK: False

augmentoolkit appears to complete phase 1 and fails with a different "index out of range" error:
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/6d4fd93e-3a9d-458b-bde1-5ff938bdd97b.yaml
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/7f40d689-700b-4400-b215-0295b077fd90.yaml
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/4a970d6c-4d1b-4569-9ceb-7f7da41645e0.yaml
COMPLETED PHASE 1
    asyncio.run(main())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/augmentoolkit/original/processing.py", line 264, in main
    print(generated_qa_dicts[0])
          ~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Augmentoolkit is starting to run! If this is your first time running this it might take a few moments to start due to imports and such.
root@e6e8132f1ba5:/tmp/augmentoolkit#
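
This second traceback has the same shape as the first: `processing.py` reads element 0 of an empty list, this time `generated_qa_dicts`, presumably because every question-generation attempt printed FAILED TO GENERATE QUESTIONS. A guard in the spirit of the check the maintainer added for `filtered_worthy_for_questions` might look like the sketch below. `require_non_empty` and its messages are hypothetical, not code from the repository.

```python
def require_non_empty(items, stage, hint=""):
    """Raise a descriptive error instead of letting items[0] fail with IndexError."""
    if len(items) == 0:
        message = f"No results were produced by the {stage} step."
        if hint:
            message += " " + hint
        raise RuntimeError(message)
    return items


# Usage sketch with placeholder data: validate before peeking at element 0.
generated_qa_dicts = [{"question": "example?", "answer": "example."}]
first = require_non_empty(
    generated_qa_dicts,
    "question generation",
    hint="Check the YAML files under question_generation_generations.",
)[0]
print(first)
```

The guard does not fix the underlying failure. It only makes the cause visible at the point where the empty list is first used.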

I do not see the directory ./output/judge_paragraph_generations/intermediate_generations/:
root@e6e8132f1ba5:/tmp/augmentoolkit# ls outFiles/
pretraining.json
root@e6e8132f1ba5:/tmp/augmentoolkit# ls ../outFiles/
pretraining.json qatuples_filtered/ question_generation_generations/
root@e6e8132f1ba5:/tmp/augmentoolkit#
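
The listings above suggest that the relative `OUTPUT` path from config.yaml and the `./output/...` path quoted in the error message point at different places. The sketch below shows where each would land, assuming the tool resolves relative paths against the current working directory and that it was started from the /original folder (neither is confirmed in this thread). `resolve_from` is a hypothetical helper.

```python
import posixpath


def resolve_from(working_dir, configured_path):
    """Resolve a config path lexically, as a POSIX shell would from working_dir.

    Note: normpath collapses '..' textually and does not follow symlinks.
    """
    if posixpath.isabs(configured_path):
        return posixpath.normpath(configured_path)
    return posixpath.normpath(posixpath.join(working_dir, configured_path))


# OUTPUT from the config in the first comment, run from the /original folder:
print(resolve_from("/tmp/augmentoolkit/original", "../../outFiles"))
# prints /tmp/outFiles

# The default-looking path quoted in the error message, from the same folder:
print(resolve_from("/tmp/augmentoolkit/original", "./output/judge_paragraph_generations"))
# prints /tmp/augmentoolkit/original/output/judge_paragraph_generations
```

The first result matches the `/tmp/outFiles/...` paths in the log output, so that is the place to look for intermediate files. With filtering skipped, it is possible that no `judge_paragraph_generations` folder is written at all, though that is a guess.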
