
Convert .parquet files to the expected well-prepared directory structure #2

sugar-fly opened this issue Oct 25, 2024 · 12 comments

@sugar-fly

Hi, thank you for your great work!

I'd like to know how to convert a series of .parquet files into the expected, well-prepared directory structure. Would it be possible to provide a conversion script here?
Thank you very much.

Best,

@OrangeSodahub
Owner

OrangeSodahub commented Oct 25, 2024

Thanks for your interest!

We host our data on Hugging Face for both efficiency and safety, and we haven't tried converting these .parquet files to local disk files. I'm not quite sure about your purpose, but: 1) if you only want to finetune the diffusion model, there is no need to do this; just use the hosted data directly; 2) for other usages, you can try the following.

Use this line to download/load processed data (into memory):

# Here the dataset is defined in `scenecraft.dataset:FinetuneDataset`.
raw_dataset = load_dataset(args.train_data_dir, args.dataset, cache_dir=args.cache_dir, split="train")

where args.train_data_dir is the dataset repo name and args.dataset is one of [hypersim, scannetpp]. You can then save the loaded samples to local files. For how to use this loaded dataset, refer to

def make_train_dataset(args, tokenizer, accelerator):

For the detailed structure of dataset, refer to

class FinetuneDataset(datasets.GeneratorBasedBuilder):

Please let me know if you have any further questions. I will add a conversion script when available.
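
In the meantime, here is a minimal conversion sketch along the lines above. It is a sketch only: the "image" field name and the JSON sidecar layout are assumptions, so check scenecraft.dataset:FinetuneDataset for the actual schema before relying on it.

# Hypothetical conversion sketch: dump the hosted dataset to local files.
# The "image" field name and output layout are assumptions; see
# `scenecraft.dataset:FinetuneDataset` for the actual schema.
import json
import os

from datasets import load_dataset

raw_dataset = load_dataset("gzzyyxy/layout_diffusion_hypersim", "hypersim", split="train")

out_dir = "data/hypersim_converted"
os.makedirs(out_dir, exist_ok=True)
for i, sample in enumerate(raw_dataset):
    # `datasets` decodes Image features to PIL images, which save directly.
    image = sample.pop("image", None)
    if image is not None:
        image.save(os.path.join(out_dir, f"{i:06d}.png"))
    # Everything else goes into a JSON sidecar next to the image.
    with open(os.path.join(out_dir, f"{i:06d}.json"), "w") as f:
        json.dump(sample, f, default=str)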

@OrangeSodahub OrangeSodahub self-assigned this Oct 25, 2024
@prologu

prologu commented Nov 12, 2024

Hello, thank you for your work. After downloading the preprocessed dataset, is the next step to train the SceneCraft model, or are other steps required first, such as training the ControlNet module? I am also puzzled by the structure of the dataset: the code does not seem to be able to process the .parquet files, and the required json or jsonl files are not available. How should I handle this so that I can carry out the follow-up steps after downloading the preprocessed dataset?

@OrangeSodahub
Owner

@prologu Hi, please refer to the README. Once you have downloaded the preprocessed data, the next step is to train the ControlNet module.

You only need to download the raw data and deal with the dataset structure and follow-up processing when you are preprocessing the data yourself. The .parquet data is what we have already preprocessed; it can be used directly for training without any further data processing steps.

@prologu

prologu commented Nov 13, 2024

When I was training ControlNet, I ran into a problem where the .parquet files could not be processed: under the dataset path, the error says there are no data files. Do I need to add code that handles the .parquet file format to make training work, or do I need to adjust some parameters?

@OrangeSodahub
Owner

OrangeSodahub commented Nov 13, 2024

Please try running the ControlNet training script; the processed data will be downloaded and cached automatically by Hugging Face. There is no need to add any extra code.

You could paste logs or errors here to provide more specific information.

@prologu

prologu commented Nov 13, 2024

1. This is a .parquet file from the preprocessed dataset I downloaded: /root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/snapshots/808d065cdaef61af0642795ceae1d020592f9c39/train/train-00000-of-00349.parquet
2. This is the path configuration in train_controlnet_sd.py:

parser.add_argument(
    "--train_data_dir",
    type=str,
    default="/root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/snapshots/808d065cdaef61af0642795ceae1d020592f9c39/train",
    help=(
        "A folder containing the training data. Folder contents must follow the structure described in"
        " https://huggingface.co/docs/datasets/image_dataset#imagefolder. In particular, a metadata.jsonl file"
        " must exist to provide the captions for the images. Ignored if dataset_name is specified."
    ),
)

3. This is the error that occurred:
Traceback (most recent call last):
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 1460, in <module>
    main(args)
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 1167, in main
    train_dataset = make_train_dataset(args, tokenizer, accelerator)
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 700, in make_train_dataset
    raw_dataset = load_dataset(args.train_data_dir, args.dataset, cache_dir=args.cache_dir, split="train", trust_remote_code=True)
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 2132, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 1853, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 1582, in dataset_module_factory
    return LocalDatasetModuleFactoryWithoutScript(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 834, in get_module
    patterns = get_data_patterns(base_path)
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/data_files.py", line 503, in get_data_patterns
    raise EmptyDatasetError(f"The directory at {base_path} doesn't contain any data files") from None
datasets.data_files.EmptyDatasetError: The directory at /root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/snapshots/808d065cdaef61af0642795ceae1d020592f9c39/train doesn't contain any data files

@OrangeSodahub
Owner

OrangeSodahub commented Nov 13, 2024

Thanks for providing the detailed information. Here are the steps:

First, there is no need to manage the path to the downloaded data yourself. I'm not sure which way you used to download it, but I recommend deleting /root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/ first, and just using

# Here the dataset is defined in `scenecraft.dataset:FinetuneDataset`.
raw_dataset = load_dataset(args.train_data_dir, args.dataset, cache_dir=args.cache_dir, split="train")

where you just need to set args.train_data_dir to the name of the dataset repo, e.g. gzzyyxy/xxxx, and args.dataset to one of hypersim and scannetpp; for cache_dir you can just use the default one (which will be under ~/.cache). Once you run this line, Hugging Face will manage the dataset automatically: each time you load the data, it checks whether the data exists in the cache dir and downloads it if not. Note that this data format is not something we can handle directly; it can only be handled by Hugging Face's datasets package.

For more detailed and advanced information, please refer to datasets, which is used in our codebase.
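
As a quick sanity check, here is a minimal sketch (assuming the repo id from your traceback, and that the hypersim config is exposed by the dataset script) that confirms where the cached files end up:

# Minimal sketch: the first call downloads the parquet shards into the
# Hugging Face cache; subsequent calls reuse them. The repo id is taken
# from the traceback above; the "hypersim" config is assumed available.
from datasets import load_dataset

ds = load_dataset("gzzyyxy/layout_diffusion_hypersim", "hypersim", split="train")
print(ds.cache_files[:1])  # on-disk location of the cached Arrow files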

@OrangeSodahub
Owner

Knowing that this is not a traditional technique most people are used to, I will give more detailed instructions later. Please stay tuned.

@prologu

prologu commented Nov 13, 2024

Thanks for your help. Here is the new error after setting train_data_dir to gzzyyxy/xxxx:

Traceback (most recent call last):
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 1460, in <module>
    main(args)
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 1167, in main
    train_dataset = make_train_dataset(args, tokenizer, accelerator)
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 700, in make_train_dataset
    raw_dataset = load_dataset(args.train_data_dir, args.dataset, cache_dir=args.cache_dir, split="train", trust_remote_code=True)
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 2132, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 1890, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/builder.py", line 342, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/builder.py", line 572, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'hypersim' not found. Available: ['default']

I do not know why "hypersim" is being used as the config_name.

@prologu

prologu commented Nov 13, 2024

The previous issue has been resolved, but a new one has arisen. I have downloaded the preprocessed dataset; do I still need to download the original dataset? I do not have the expected dataset files and contents, and the following errors occur:
Steps: 0%| | 0/27432 [00:00<?, ?it/s]Skipped current batch data due to [Errno 2] No such file or directory: './data/hypersim/semantic_images/ai_009_009/frame.0091.npz'.
Skipped current batch data due to [Errno 2] No such file or directory: './data/hypersim/semantic_images/ai_028_001/frame.0020.npz'.

@GONGJIA0208

Guys, thanks a lot for your explanation. The parquet file setup is very confusing. It would be better if the authors could move this explanation to the README :)

@OrangeSodahub
Owner

Hi guys, sorry for the late reply. Yes, we use a new dataloader tool which is not so friendly for newcomers. I will provide additional instructions.

@prologu For now, some issues exist with the online dataset. Please try the easiest way: download the raw hypersim dataset from here and process it locally; please refer to the README.
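
In the meantime, a quick illustrative check (directory layout taken from the error messages in this thread) to see whether the locally processed semantic label files are in place:

# Illustrative check: count the semantic label files per scene. The
# directory layout follows the paths in the error messages above.
import glob
import os

root = "./data/hypersim/semantic_images"
scenes = sorted(os.listdir(root)) if os.path.isdir(root) else []
for scene in scenes:
    npz_files = glob.glob(os.path.join(root, scene, "*.npz"))
    print(f"{scene}: {len(npz_files)} semantic frames")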
