
Convert .parquet files to the expected well-prepared directory structure #2

sugar-fly opened this issue Oct 25, 2024 · 12 comments

@sugar-fly

Hi, thank you for your great work!

I'd like to know how to convert a series of .parquet files into the expected, well-prepared directory structure. Would it be possible to provide a conversion script here?
Thank you very much.

Best,

@OrangeSodahub
Owner

OrangeSodahub commented Oct 25, 2024

Thanks for your interest!

We host our data on Hugging Face for both efficiency and safety, and we haven't tried converting these .parquet files to local disk files. I'm not quite sure about your purpose, but: 1) if you only want to finetune the diffusion model, there is no need to do this; just use the hosted data directly; 2) for other usages, you can try the following.

Use this line to download/load processed data (into memory):

# Here the dataset is defined in `scenecraft.dataset:FinetuneDataset`.
raw_dataset = load_dataset(args.train_data_dir, args.dataset, cache_dir=args.cache_dir, split="train")

where args.train_data_dir is the dataset repo name and args.dataset is one of [hypersim, scannetpp]. You can then save the loaded samples to local files. For how to use this loaded dataset, refer to

def make_train_dataset(args, tokenizer, accelerator):

For the detailed structure of dataset, refer to

class FinetuneDataset(datasets.GeneratorBasedBuilder):

Please let me know if you have any further questions. I will add a conversion script when available.
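
In the meantime, here is a minimal conversion sketch along the lines above. It is a sketch only: the "image" field name and the JSON sidecar layout are assumptions, so check scenecraft.dataset:FinetuneDataset for the actual schema before relying on it.

# Hypothetical conversion sketch: dump the hosted dataset to local files.
# The "image" field name and output layout are assumptions; see
# `scenecraft.dataset:FinetuneDataset` for the actual schema.
import json
import os

from datasets import load_dataset

raw_dataset = load_dataset("gzzyyxy/layout_diffusion_hypersim", "hypersim", split="train")

out_dir = "data/hypersim_converted"
os.makedirs(out_dir, exist_ok=True)
for i, sample in enumerate(raw_dataset):
    # `datasets` decodes Image features to PIL images, which save directly.
    image = sample.pop("image", None)
    if image is not None:
        image.save(os.path.join(out_dir, f"{i:06d}.png"))
    # Everything else goes into a JSON sidecar next to the image.
    with open(os.path.join(out_dir, f"{i:06d}.json"), "w") as f:
        json.dump(sample, f, default=str)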

@OrangeSodahub OrangeSodahub self-assigned this Oct 25, 2024
@prologu

prologu commented Nov 12, 2024

Hello, thank you for your work. After downloading the preprocessed dataset, is the next step to train the SceneCraft model, or are other steps required first, such as training the ControlNet module? I am also puzzled by the structure of the dataset: the code does not seem to be able to process the .parquet files, and the required json or jsonl files are not available. How should I handle this so that I can carry out the follow-up steps after downloading the preprocessed dataset?

@OrangeSodahub
Owner

@prologu Hi, please refer to the README. Once you have downloaded the preprocessed data, the next step is to train the ControlNet module.

You only need to download the raw data and deal with the dataset structure and follow-up processing when you are preprocessing the data yourself. The .parquet data is what we have already preprocessed; it can be used directly for training without any further data processing steps.

@prologu

prologu commented Nov 13, 2024

When I was training ControlNet, I ran into a problem where the .parquet files could not be processed: under the dataset path, the error says there are no data files. Do I need to add code that handles the .parquet file format to make training work, or do I need to adjust some parameters?

@OrangeSodahub
Owner

OrangeSodahub commented Nov 13, 2024

Please try running the ControlNet training script; the processed data will be downloaded and cached automatically by Hugging Face. There is no need to add any extra code.

You could paste logs or errors here to provide more specific information.

@prologu

prologu commented Nov 13, 2024

1. This is a .parquet file from the preprocessed dataset I downloaded: /root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/snapshots/808d065cdaef61af0642795ceae1d020592f9c39/train/train-00000-of-00349.parquet
2. This is the path configuration in train_controlnet_sd.py:

parser.add_argument(
    "--train_data_dir",
    type=str,
    default="/root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/snapshots/808d065cdaef61af0642795ceae1d020592f9c39/train",
    help=(
        "A folder containing the training data. Folder contents must follow the structure described in"
        " https://huggingface.co/docs/datasets/image_dataset#imagefolder. In particular, a metadata.jsonl file"
        " must exist to provide the captions for the images. Ignored if dataset_name is specified."
    ),
)

3. This is the error that occurred:
Traceback (most recent call last):
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 1460, in <module>
    main(args)
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 1167, in main
    train_dataset = make_train_dataset(args, tokenizer, accelerator)
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 700, in make_train_dataset
    raw_dataset = load_dataset(args.train_data_dir, args.dataset, cache_dir=args.cache_dir, split="train", trust_remote_code=True)
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 2132, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 1853, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 1582, in dataset_module_factory
    return LocalDatasetModuleFactoryWithoutScript(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 834, in get_module
    patterns = get_data_patterns(base_path)
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/data_files.py", line 503, in get_data_patterns
    raise EmptyDatasetError(f"The directory at {base_path} doesn't contain any data files") from None
datasets.data_files.EmptyDatasetError: The directory at /root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/snapshots/808d065cdaef61af0642795ceae1d020592f9c39/train doesn't contain any data files

@OrangeSodahub
Owner

OrangeSodahub commented Nov 13, 2024

Thanks for providing the detailed information. Here are the steps:

First, there is no need to manage the path to the downloaded data yourself. I'm not sure which way you used to download it, but I recommend deleting /root/.cache/huggingface/hub/datasets--gzzyyxy--layout_diffusion_hypersim/ first, and just using

# Here the dataset is defined in `scenecraft.dataset:FinetuneDataset`.
raw_dataset = load_dataset(args.train_data_dir, args.dataset, cache_dir=args.cache_dir, split="train")

where you just need to set args.train_data_dir to the name of the dataset repo, e.g. gzzyyxy/xxxx, and args.dataset to one of hypersim and scannetpp; for cache_dir you can just use the default one (which will be under ~/.cache). Once you run this line, Hugging Face will manage the dataset automatically: each time you load the data, it checks whether the data exists in the cache dir and downloads it if not. Note that this data format is not something we can handle directly; it can only be handled by Hugging Face's datasets package.

For more detailed and advanced information, please refer to datasets, which is used in our codebase.
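
As a quick sanity check, here is a minimal sketch (assuming the repo id from your traceback, and that the hypersim config is exposed by the dataset script) that confirms where the cached files end up:

# Minimal sketch: the first call downloads the parquet shards into the
# Hugging Face cache; subsequent calls reuse them. The repo id is taken
# from the traceback above; the "hypersim" config is assumed available.
from datasets import load_dataset

ds = load_dataset("gzzyyxy/layout_diffusion_hypersim", "hypersim", split="train")
print(ds.cache_files[:1])  # on-disk location of the cached Arrow files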

@OrangeSodahub
Owner

Knowing that this is not a traditional technique most people are used to, I will give more detailed instructions later. Please stay tuned.

@prologu

prologu commented Nov 13, 2024

Thanks for your help. Here is the new error after setting train_data_dir to gzzyyxy/xxxx:

Traceback (most recent call last):
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 1460, in <module>
    main(args)
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 1167, in main
    train_dataset = make_train_dataset(args, tokenizer, accelerator)
  File "/root/ljq/SceneCraft/scripts/../scenecraft/finetune/train_controlnet_sd.py", line 700, in make_train_dataset
    raw_dataset = load_dataset(args.train_data_dir, args.dataset, cache_dir=args.cache_dir, split="train", trust_remote_code=True)
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 2132, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/load.py", line 1890, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/builder.py", line 342, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/root/anaconda3/envs/scenecraft/lib/python3.9/site-packages/datasets/builder.py", line 572, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'hypersim' not found. Available: ['default']

I do not know why "hypersim" is being used as the config_name.

@prologu

prologu commented Nov 13, 2024

The previous issue has been resolved, but a new one has arisen. I have downloaded the preprocessed dataset; do I still need to download the original dataset? I do not have the expected dataset files and contents, and the following errors occur:
Steps: 0%| | 0/27432 [00:00<?, ?it/s]Skipped current batch data due to [Errno 2] No such file or directory: './data/hypersim/semantic_images/ai_009_009/frame.0091.npz'.
Skipped current batch data due to [Errno 2] No such file or directory: './data/hypersim/semantic_images/ai_028_001/frame.0020.npz'.

@GONGJIA0208

Guys, thanks a lot for your explanation. The parquet file setup is very confusing. It would be better if the authors could move this explanation to the README :)

@OrangeSodahub
Owner

Hi guys, sorry for the late reply. Yes, we use a new dataloader tool which is not so friendly for newcomers. I will provide additional instructions.

@prologu For now, some issues exist with the online dataset. Please try the easiest way: download the raw hypersim dataset from here and process it locally; please refer to the README.
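
In the meantime, a quick illustrative check (directory layout taken from the error messages in this thread) to see whether the locally processed semantic label files are in place:

# Illustrative check: count the semantic label files per scene. The
# directory layout follows the paths in the error messages above.
import glob
import os

root = "./data/hypersim/semantic_images"
scenes = sorted(os.listdir(root)) if os.path.isdir(root) else []
for scene in scenes:
    npz_files = glob.glob(os.path.join(root, scene, "*.npz"))
    print(f"{scene}: {len(npz_files)} semantic frames")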
