convert .parquet to the expected well-prepared data structure of a directory #2
Comments
Thanks for your interest! We host our data on Hugging Face for both efficiency and safety, and we haven't tried converting these .parquet files to local disk files. I'm not quite sure about your purpose, but 1) if you only want to finetune the diffusion model, there is no need to do this: just use the hosted data directly; 2) for other usages, you can use this line to download/load the processed data (into memory): SceneCraft/scenecraft/finetune/train_controlnet_sd.py, lines 684 to 685, at commit 6568a2c.
For the detailed structure of the dataset, refer to SceneCraft/scenecraft/data/dataset.py, line 342, at commit 6568a2c.
Please let me know if you have any further questions. I will also provide a script to do this when one is available.
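The snippet at those lines isn't reproduced in this thread, but per the description it boils down to a `datasets.load_dataset()` call; here is a minimal sketch, assuming the repo id `gzzyyxy/layout_diffusion_hypersim` (taken from the cache path quoted later in this thread) and the `train` split:

```python
# Minimal sketch: load the hosted, preprocessed data into memory.
# Repo id and split are assumptions; the authoritative call is at
# lines 684-685 of scenecraft/finetune/train_controlnet_sd.py.
from datasets import load_dataset

dataset = load_dataset("gzzyyxy/layout_diffusion_hypersim", split="train")
print(dataset.features)  # columns of the preprocessed samples
sample = dataset[0]      # one preprocessed example, as a dict
```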
Hello, thank you for your work. After downloading the preprocessed dataset, is the next step to train the SceneCraft model, or are other steps needed first, such as training the ControlNet module? I am also puzzled by the structure of the dataset: the code does not seem to be able to process the .parquet files, and the required json or jsonl files are not available. How should I deal with this so that I can carry out the follow-up process after downloading the preprocessed dataset?
@prologu Hi, please refer to the README. Once you have downloaded the preprocessed data, the next step is to train the ControlNet module. You only need to download the raw data and deal with the dataset structure and the follow-up steps when you are preprocessing the data yourself. The .parquet data is what we have already preprocessed, and it can be used directly for training without any further data processing steps.
When I was training the ControlNet, there was a problem: the .parquet files could not be processed, and under the dataset path there were no data files, which raised errors. Do I need to add code to handle the .parquet file format to make training work, or do I need to adjust some parameters?
Please try running the script for training the ControlNet; the processed data will be downloaded and cached automatically by Hugging Face. There is no need to add any other code. You could paste logs or errors here to provide more specific information.
1. This is a .parquet file from the preprocessed dataset I downloaded: /root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/snapshots/808d065cdaef61af0642795ceae1d020592f9c39/train/train-00000-of-00349.parquet
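As a side note, such a shard is a standard Parquet file, so if you just want to peek inside it, pandas can read it directly (a sketch, assuming pyarrow is installed); training itself should still go through the `datasets` package, as explained below:

```python
import pandas as pd

# Illustration only: inspect one cached shard directly.
df = pd.read_parquet(
    "/root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/"
    "snapshots/808d065cdaef61af0642795ceae1d020592f9c39/train/train-00000-of-00349.parquet"
)
print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # field names of the preprocessed samples
```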
Thanks for providing the detailed information. Here are the steps. First, there is no need to manage the path to the downloaded data yourself. I'm not sure which way you used to download it, but I recommend deleting /root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/ first, and just using SceneCraft/scenecraft/finetune/train_controlnet_sd.py, lines 684 to 685, at commit 6568a2c,
where you just need to set args.train_data_dir to the name of the dataset repo, e.g. gzzyyxy/xxxx, and args.dataset to one of hypersim and scannetpp; for cache_dir you can just use the default (which will be at root/.cache). Once you run this line, Hugging Face will automatically manage the dataset: each time you load the data, it checks whether the data exists in the cache dir and downloads it if not. (The format of the data is not something we can handle directly; it can only be handled by Hugging Face's datasets package.)
For more detailed and advanced information, please refer to datasets, which is used in our codebase.
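Until an official conversion script is provided, a hedged sketch of dumping the loaded dataset to a local directory might look like the following; the column handling is hypothetical, since the real schema is defined in scenecraft/data/dataset.py (line 342) and should be checked via `dataset.features` first:

```python
import json
import os
from datasets import load_dataset

# Hypothetical sketch: export the hosted dataset to plain disk files.
# The "image" column name is a placeholder; inspect dataset.features
# for the actual fields before adapting this.
dataset = load_dataset("gzzyyxy/layout_diffusion_hypersim", split="train")
out_dir = "data/hypersim_processed"
os.makedirs(out_dir, exist_ok=True)

for i, sample in enumerate(dataset):
    if "image" in sample:  # PIL image column, if the schema has one
        sample.pop("image").save(os.path.join(out_dir, f"{i:06d}.png"))
    # Write the remaining JSON-serializable fields alongside it.
    serializable = {k: v for k, v in sample.items()
                    if isinstance(v, (str, int, float, list, dict))}
    with open(os.path.join(out_dir, f"{i:06d}.json"), "w") as f:
        json.dump(serializable, f)
```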
Knowing that this is not a traditional technique most people are used to, I will give more detailed instructions later. Please keep tracking this issue.
Thanks for your help. I will show you the new error after setting train_data_dir to gzzyyxy/xxxx.
The previous issue has been resolved, but a new one has arisen. I have downloaded the preprocessed dataset; do I still need to download the original dataset? I do not have the expected dataset file format and content, and the following issues occur:
Guys, thanks a lot for your explanation. The parquet file is very confusing. It would be better if the authors could move this explanation to the README :)
Hi guys, sorry for the late reply. Yes, we use a new dataloader tool which is not so friendly for newcomers; I will provide additional instructions. @prologu For now, some issues exist with the online dataset. Please try the easiest way: download the raw hypersim dataset from here and process it locally, referring to the README.
Hi, thank you for your great work!
I want to know how to convert a series of .parquet files into the expected, well-prepared directory structure. Would it be possible to provide a conversion script here?
Thank you very much.
best,