Huggingface datasets integration? #55
Comments
Hi @ChenchaoZhao, thanks for your interest! I have not considered either of those. I have found that the pickle format works well enough for my needs. Is there something in particular that makes using this format difficult? Also, if you are interested in contributing by converting the datasets to other formats, I would be happy to host them!
Hi @jonathanking, thank you for the comment! Pickle is not considered secure in production, because loading a pickle file from an untrusted source can execute arbitrary code. How should I contribute if I generate the parquet files?
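As a minimal sketch of the security concern raised here: when `pickle.loads` rebuilds an object, it calls whatever callable that object's `__reduce__` method names. The `Payload` class below is hypothetical and deliberately harmless (it only evaluates an arithmetic expression); a malicious file could name a far more dangerous call.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs code when it is loaded."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # A harmless expression stands in for arbitrary attacker code.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not return a Payload; it returns the result of the call
# that the pickle stream asked for.
restored = pickle.loads(blob)
print(restored)  # 42
```

Formats such as parquet store only data, so reading them does not involve calling functions chosen by whoever wrote the file.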
I was thinking about how to proceed, and here are my thoughts. I'm going to release an updated version of SidechainNet in a little while. I think we can wait on creating parquet files until then. However, if you are really interested in contributing, you could perhaps write a function or describe how you might convert the current format (a dictionary with keys and values of various types) into a format compatible with parquet. Then we could use that code or general idea when we move forward and release the next version of the code and data. I'm just not familiar with the format myself, so I'd have to investigate how to reformat the existing data. I see something about formatting it into a DataFrame and then writing a parquet file, so maybe it's not so complicated. It would just need to be able to handle the different kinds of data stored in the dictionary currently (arrays, lists, strings). Let me know what you think!
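The DataFrame route mentioned above could look roughly like the sketch below. It assumes a hypothetical layout of `{protein_id: {field: value}}`; the field names `seq` and `ang` and the helper names `to_rows` and `write_parquet` are illustrative and are not taken from SidechainNet. Arrays are converted to nested Python lists so that each row holds only plain lists, strings, and numbers.

```python
def to_rows(dataset):
    """Flatten {protein_id: {field: value}} into a list of flat row dicts.

    Anything exposing .tolist() (for example a NumPy array) becomes nested
    Python lists; strings and lists pass through unchanged.
    """
    rows = []
    for protein_id, fields in dataset.items():
        row = {"id": protein_id}
        for name, value in fields.items():
            row[name] = value.tolist() if hasattr(value, "tolist") else value
        rows.append(row)
    return rows


def write_parquet(rows, path):
    """Write rows to a parquet file (needs pandas plus pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so to_rows works without pandas

    pd.DataFrame(rows).to_parquet(path)


# Hypothetical two-entry example; IDs and field names are illustrative only.
example = {
    "1ABC_1_A": {"seq": "MKV", "ang": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]},
    "2XYZ_1_B": {"seq": "GA", "ang": [[1.0, 1.1], [1.2, 1.3]]},
}
rows = to_rows(example)
```

Calling `write_parquet(rows, "example.parquet")` would then produce one row per protein, with variable-length fields stored as list columns.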
Will there be additional features in the next release? Based on my understanding, the current version can probably be converted using Huggingface Then you can upload to Huggingface Hub for more visibility or save them as
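The function named in the comment above is missing from this copy of the thread, so the following is only a guess at the kind of conversion meant. It assumes the `datasets` package and uses `Dataset.from_dict`, which accepts a mapping from column names to lists of values; the helper names and example rows are hypothetical.

```python
def to_columns(rows):
    """Pivot a list of row dicts into {column_name: [values, ...]}.

    Assumes every row has the same keys as the first row.
    """
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns


def to_hf_dataset(rows):
    """Build a Huggingface Dataset from the rows (needs the datasets package)."""
    from datasets import Dataset  # lazy import; only needed for this step

    return Dataset.from_dict(to_columns(rows))


# Hypothetical rows; IDs and field names are illustrative only.
rows = [
    {"id": "1ABC_1_A", "seq": "MKV"},
    {"id": "2XYZ_1_B", "seq": "GA"},
]
columns = to_columns(rows)
```

The object returned by `to_hf_dataset(rows)` offers `to_parquet(path)` for writing a local file and `push_to_hub(repo_id)` for uploading, with the repository name left to the maintainers.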
Yes, I have a handful of features and data standardizations/improvements that I’ve been working on in my research branches that I plan to add to the next release. Thanks so much for pointing out that function! I didn’t think it would be that easy, but that sounds like a great option. I’ll keep that in mind for when I regenerate the data. I appreciate the help!
Any plans for Huggingface datasets integration? Instead of using a pickled dictionary, it is probably better practice to use the arrow or parquet format. It should be pretty easy to convert to Huggingface format.