I am using the axolotl library to preprocess data, which is then saved in the .arrow format. The process combines several data sources into a single unified data structure.
Data Configuration:
datasets:
  - path: "/path/to"
    type: alpaca_chat.load_qa
    data_files: "/path/to/1.jsonl"
  - path: '/path/to'
    data_files: "/path/to/2.jsonl"
    type: sharegpt
  - path: '/path/to'
    data_files: "/path/to/2.jsonl"
    type: context_qa.load_v2
  - path: '/path/to'
    data_files: "/path/to/3.jsonl"
    type: alpaca_w_system.load_open_orca
  - path: '/path/to'
    data_files: "/path/to/4.jsonl"
    type: sharegpt.load_role
Library Versions:
axolotl: 0.4.1
datasets: 2.19.1
pyarrow: 16.1.0
torch: 2.3.0
Issue: After preprocessing completes, I see a message that the preprocessed data was saved successfully (Success! Preprocessed data path: /path/to/axolotl/examples/llama-3). However, when I try to read the file data-00000-of-00001.arrow, I get the error ArrowInvalid: Not an Arrow file. Inspecting data-00000-of-00001.arrow showed that it does not contain the magic bytes that Arrow files normally start with.
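One check that may help narrow this down: the Arrow IPC file format is framed by the magic string ARROW1 at both ends, while the Arrow IPC streaming format has no magic string and, in current versions, starts with the continuation marker 0xFFFFFFFF. A file that lacks the magic bytes could therefore still be a valid Arrow stream. The sketch below is a standard-library-only classifier for that distinction. The function names classify_arrow_bytes and classify_arrow_path are hypothetical helpers, not part of axolotl, datasets, or pyarrow, and the "ipc-stream" branch assumes the file was written by a recent Arrow version (older streams omit the continuation marker).

```python
from pathlib import Path

ARROW_FILE_MAGIC = b"ARROW1"               # framing of the IPC *file* format
STREAM_CONTINUATION = b"\xff\xff\xff\xff"  # first bytes of a modern IPC *stream*


def classify_arrow_bytes(head: bytes, tail: bytes) -> str:
    """Guess the Arrow IPC flavour from the first and last bytes of a file."""
    if head.startswith(ARROW_FILE_MAGIC) and tail.endswith(ARROW_FILE_MAGIC):
        return "ipc-file"
    if head.startswith(STREAM_CONTINUATION):
        return "ipc-stream"
    return "unknown"


def classify_arrow_path(path) -> str:
    """Read the first and last 8 bytes of `path` and classify them."""
    data_path = Path(path)
    size = data_path.stat().st_size
    with data_path.open("rb") as handle:
        head = handle.read(8)
        handle.seek(max(size - 8, 0))
        tail = handle.read(8)
    return classify_arrow_bytes(head, tail)
```

Calling classify_arrow_path("data-00000-of-00001.arrow") and getting "ipc-stream" would suggest the file is intact but needs a stream reader rather than a file reader; "unknown" would point to a genuinely damaged or non-Arrow file.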
Expectations: I expected the file to be saved in a valid Arrow format and that it could be successfully loaded and decoded for further use.
Questions for Developers
Can you confirm that the process described in the axolotl documentation should correctly save data in the Arrow format using the specified library versions?
What additional checks or configuration changes should I perform to ensure the correctness of the file format?
Are there any known compatibility issues between datasets, pyarrow, and axolotl that could have led to this error?
What steps would you recommend for diagnosing and solving this problem, given that the file does not contain the magic bytes typical for Arrow files?
Additional Information
I am ready to provide any additional logs, configurations, or information that may help in diagnosing and resolving this issue. I would appreciate any help or recommendations.