Suggest Hubs when large data is present #67

Open
lshep opened this issue May 8, 2020 · 12 comments
@lshep
Contributor

lshep commented May 8, 2020

If someone is trying to submit a package with large amounts of data, the package should ideally use existing data (many forms are already available in Bioconductor). If that isn't possible, the data should be reduced, or the large files should go in an accompanying ExperimentHub or AnnotationHub package.

@lshep lshep added Beginner Beginner level Hackathon project Hackathon labels May 8, 2020
@lcolladotor
Contributor

lcolladotor commented May 15, 2020

What is large here? 5 MB? 1 MB?

I think this is related to #75. I've downloaded data from GitHub (from a repo I control) when it is small enough and just a file here or there, which is simpler than going through the whole *Hub submission.

Anyway, I like this idea =) It's making me think about ways to make it easier to submit/test data on *Hub that are for another occasion ^^

@lcolladotor
Contributor

Some related ideas:

  • I might add a function in biocthis similar to usethis::use_data(), maybe a use_bioc_hub() one that structures a template to submit data to the *Hubs. This might happen after the BiocCheck-a-thon though. This would help me with any failing checks related to the *Hub and it could help others.

  • I need to double check the docs again and I know Kasper Daniel Hansen asked on Slack about testing *Hub interfaces before a package gets accepted, but well, I'm not sure how to do so. If this BiocCheck issue goes through, I imagine that others will have similar questions.

  • Currently, BiocCheck doesn't check if a package that generated *Hub entries actually uses them in any way. For example, two new packages I'm involved in, https://github.com/ComunidadBioInfo/regutools and https://github.com/LieberInstitute/spatialLIBD, have all the structure to use the *Hub but I have failed to actually follow through and get the data uploaded to the *Hubs. This could be its own separate issue.

Best,
Leo

@lshep
Contributor Author

lshep commented May 17, 2020

@Kayla-Morrell is working on create_hub_package functions and helper functions. They should be implemented shortly in the AnnotationHub/ExperimentHub packages

@lcolladotor
Contributor

Wow, that's awesome news =) Thanks for doing this Kayla! 🙌🏽🙌🏽🙌🏽

@lcolladotor
Contributor

For Constantin Ahlmann-Eltze's question on the 2020-05-18 intro discussion:

Maybe you can use usethis::create_package() and then add data to it (usethis::use_data() and related functions), that is, create the dummy test package that breaks the structure on the fly.
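
A minimal sketch of building such a throwaway package on the fly. It uses only base R as a stand-in for usethis::create_package() and usethis::use_data(), so it is not what those functions do internally. The helper name make_dummy_pkg(), the DESCRIPTION fields, and the object name big_object are all made up for illustration.

```r
# Hypothetical helper: build a throwaway package skeleton in a temp directory.
# Base R stand-in for usethis::create_package() + usethis::use_data().
make_dummy_pkg <- function(name = "dummyPkg", n_values = 1000) {
  pkg_dir <- file.path(tempfile("pkgtest_"), name)
  dir.create(file.path(pkg_dir, "data"), recursive = TRUE)

  # Minimal DESCRIPTION so the directory looks like a package source tree.
  writeLines(
    c(
      paste("Package:", name),
      "Version: 0.0.1",
      "Title: Dummy Package For Tests",
      "Description: Throwaway package created on the fly.",
      "License: Artistic-2.0"
    ),
    file.path(pkg_dir, "DESCRIPTION")
  )

  # Save one dataset into data/ (illustrative object name).
  big_object <- seq_len(n_values)
  save(big_object, file = file.path(pkg_dir, "data", "big_object.rda"))

  pkg_dir
}

pkg_dir <- make_dummy_pkg()
print(list.files(pkg_dir, recursive = TRUE))
unlink(dirname(pkg_dir), recursive = TRUE)  # remove the temp package again
```

Raising n_values, or saving a different object, is one way to make the dummy package trip a size check.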

@lcolladotor
Contributor

Something like this code, https://github.com/lcolladotor/biocthis/blob/master/R/biocthis_example_pkg.R, with more steps.

@lcolladotor
Contributor

lcolladotor commented May 18, 2020

What Martin Morgan was referring to I think is https://usethis.r-lib.org/reference/proj_utils.html though you might also be interested in lcolladotor/biocthis@dc38780#r39102094

PS The usethis functions are somewhat related to http://withr.r-lib.org/reference/index.html (like withr::with_dir() though r-lib/usethis#1108 (comment) points to the usethis functions)

@mtmorgan
Collaborator

Nope, more testthat::with_mock(), which allows you to 'pretend' that a particular function behaves in a particular way, e.g., that file.size() returns a very large size, even though there is no file at all! This means that the unit tests remain very lightweight, and focus on the problem at hand ('does my function appear to handle very large files') rather than taxing the build / test system or actually processing files that are terabyte-sized.
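
A rough sketch of the same "pretend" idea without any mocking library. Instead of testthat::with_mock(), it passes the size function in as an argument (plain dependency injection), which also works under RUnit. The function find_large_files(), its size_fun argument, and the 5e6-byte default are assumptions made for illustration, not BiocCheck code.

```r
# Hypothetical check: returns the paths whose size exceeds `limit` bytes.
# `size_fun` defaults to file.size() but can be swapped out in tests.
find_large_files <- function(paths, limit = 5e6, size_fun = file.size) {
  sizes <- size_fun(paths)
  paths[!is.na(sizes) & sizes > limit]
}

# In a test, pretend one file is terabyte-sized without creating any file.
fake_size <- function(paths) ifelse(grepl("huge", paths), 1e12, 10)

flagged <- find_large_files(
  c("data/huge.rda", "data/tiny.rda"),
  size_fun = fake_size
)
print(flagged)
```

The trade-off is an extra argument in the function signature, in exchange for tests that never touch the disk.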

@lcolladotor
Copy link
Contributor

Ahhh! Thanks for the clarification =)

@const-ae

I would give this one a try and explore whether RUnit has an ability similar to testthat::with_mock(). @const-ae

const-ae added a commit to const-ae/BiocCheck that referenced this issue May 18, 2020
* Add checkLargeFiles to checks.R that checks if files with more than
5 MB exist in 'data/' and if yes recommends ExperimentHub/AnnotationHub
* Add unit test: creates large file with for loop
* Add call to BiocCheck.R that calls checkLargeFiles() on real package.
Uses same flag as checkIndivFileSizes
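
A hedged sketch of what a check along the lines of this commit message could look like. It is not the code from the commit: the name check_large_files_sketch(), the message wording, and the reading of "5 MB" as 5 * 1024^2 bytes are all assumptions.

```r
# Hypothetical sketch, not the checkLargeFiles() from the commit.
# Lists files under <pkg_dir>/data and reports those above `limit` bytes.
check_large_files_sketch <- function(pkg_dir, limit = 5 * 1024^2) {
  data_dir <- file.path(pkg_dir, "data")
  files <- list.files(data_dir, full.names = TRUE, recursive = TRUE)
  sizes <- file.size(files)
  large <- files[!is.na(sizes) & sizes > limit]

  if (length(large) > 0) {
    message(
      "Found ", length(large), " file(s) above the size limit in 'data/'. ",
      "Consider an ExperimentHub or AnnotationHub package instead."
    )
  }
  invisible(large)
}

# Usage on a throwaway directory with a tiny limit, so no big file is needed.
pkg <- tempfile("pkg_")
dir.create(file.path(pkg, "data"), recursive = TRUE)
writeBin(raw(2048), file.path(pkg, "data", "x.bin"))
res <- check_large_files_sketch(pkg, limit = 1024)
print(basename(res))
unlink(pkg, recursive = TRUE)
```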
@const-ae

I created a first commit that partly addresses the issue. There are still several shortcomings:

  • RUnit does not seem to support mocking function calls, so I create a large file on the fly, which takes a few seconds
  • The check is somewhat redundant with the checkIndivFileSizes() check, so the two should probably be integrated
  • I am not sure that I fully understood the typical patterns in test_BiocCheck.R, so it is possible that this should be improved as well
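
On the first point, one possible way to speed up the test file creation, sketched under assumptions: write a single zero-filled raw vector in one call instead of looping. The helper make_big_file() and the 6e6-byte default (chosen only to sit above a 5 MB threshold) are hypothetical. The vector is allocated in memory first, so this suits files of a few megabytes, not terabytes.

```r
# Hypothetical helper: write `size_bytes` zero bytes to `path` in one call.
make_big_file <- function(path, size_bytes = 6e6) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(raw(size_bytes), con)
  invisible(path)
}

tmp <- tempfile(fileext = ".bin")
make_big_file(tmp)
print(file.size(tmp) > 5e6)
unlink(tmp)
```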

I thought, however, it might be a good idea to get some early feedback on this commit so that I know how to improve it. (If that is easier, I can also create a PR so that you can comment directly within the code.)

@lshep
Contributor Author

lshep commented May 19, 2020

I'll have a look later today and provide some feedback
