Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BiocCheck looks and complains about the size of object files (.o, .so files). It shouldn't #219

Open
hpages opened this issue Dec 19, 2024 · 10 comments

Comments

@hpages
Copy link
Contributor

hpages commented Dec 19, 2024

See Bioconductor/Contributions#3673 (comment)

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Dec 19, 2024

BiocCheck should have no business looking at the size of object files as they're not in git.

BiocCheck uses the .gitignore file to avoid checking files in the repository. There is no .gitignore in the repo; othewise, it will ignore those files for the size check.

@hpages
Copy link
Contributor Author

hpages commented Dec 19, 2024

Hmm.. not sure why would BiocCheck rely on .gitignore for this? Doesn't sound realiable at all.

BiocCheck should only care about files that belong to the repository and relying on .gitignore to identify these files cannot be trusted. There's no requirement to use a .gitignore file and even if there was, you cannot assume that developers are going to list their object files in it. I know a lot of people do (not sure where this trend is coming from) but I never understood why, and personally I don't like it (it prevents git clean -i from doing the right thing).

The file size check should be done on a clean git repo so maybe is a better job for BiocCheckGitClone?

At the very least the message of the warning should be modified. "WARNING: Package files exceed the 5MB size limit" is not true here because no file in the package actually exceeds the 5MB limit.

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Dec 19, 2024

.gitignore is a reliable way of excluding files that are not part of a git (package) repo. It also allows BiocCheck to exclude files from static checks (and avoids invoking git).

BiocCheck should only care about files that belong to the repository

I agree but how do you know which ones belong without any context?

relying on .gitignore to identify these files cannot be trusted

Why can't it be trusted?

There is no requirement to use a .gitignore file but BiocCheck uses it if available. It at least helps BiocCheck approximate what package files belong.

it prevents git clean -i from doing the right thing

Personally, relying on git clean is a bit dangerous. Would using a script to remove specific file patterns be safer?

Since BiocCheck assumes that anything inside the folder (except those excluded by .gitignore) is a package file , it triggers the warning. There's no straightforward (and static) way to tell what's a package file or not. Perhaps the wording should be "repository files" since we generally don't want large files in the git repository.

The alternative is to run git ls-files and check the sizes of the files in the list but this would not catch any large git history files IIUC.

@hpages
Copy link
Contributor Author

hpages commented Dec 20, 2024

how do you know which ones belong without any context?

You don't. But you can tell for sure that some files don't belong. Answer below...

Right now, we have this quite confusing situation where BiocCheck() is called on the source tarball:

 > BiocCheck("h5mread_0.99.0.tar.gz")

but then it complains about files that are not in it:

✔ Checking individual file sizes... [3ms]
! WARNING: Package files exceed the 5MB size limit.
Files over the limit:
• /tmp/RtmpGZuiOG/file185fa6fd48366/h5mread//src/h5mread.so
• /tmp/RtmpGZuiOG/file185fa6fd48366/h5mread/src/h5mread.so

I simply don't understand why it would bother checking files that do not belong to the source tarball in the first place. These files obviously don't originate from git. They are installation artefacts that the developer has no control on (what installation artefacts are going to be produced exactly, and their size, is platform-dependent).

Personally, relying on git clean is a bit dangerous.

Well, it seems that we use .gitignore and git clean for different purposes. Which is fine. For my use case (packages with native code), not doing git clean -i before running R CMD INSTALL on the local repo is dangerous, because of the risk of stale .o or .so files. So git pull + git status + git clean -i has become my standard routine when I resume work on a package. But of course that works only if .o and .so files are not excluded via .gitignore.

Anyways, I don't think that .gitignore should be involved in the context of "Checking individual file sizes" of a package source tarball. Many repos actually list .gitignore in their .Rbuildignore file, so, even if the developers came up with an exhaustive .gitignore file that lists every possible junk you can think of, the file won't end up in the source tarball so BiocCheck won't be able to use it.

Bottom line is that you wouldn't need to exclude anything if you were only checking the files from the source tarball in the first place. If, for whatever reason, you need to run the check on the package installation folder, then it would make more sense to exclude the files that don't belong to the source tarball rather than relying on .gitignore.

Thanks,
H.

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Dec 23, 2024

Hi Hervé, @hpages

Thanks for that bit of information. I have revised the check to exclude checking files in the source directory when a tarball is provided. 9e9b62e (version 1.43.3).
The check still looks at all the files in the tarball and flags any big files (without any involvement of .gitignore).

Best,
Marcel

@LiNk-NY LiNk-NY closed this as completed Dec 23, 2024
@hpages
Copy link
Contributor Author

hpages commented Dec 27, 2024

Great. Thanks Marcel!

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Jan 6, 2025

Hi Hervé, @hpages

I am actually using the untar-ed directory location as the sourceDir in BiocCheck(). This means that the contents of the tarball are being checked. There are some packages that have large .so and .o files in the tarballs.
Is it okay to ignore the entire src file within the tarball?

Best regards,
Marcel

@LiNk-NY LiNk-NY reopened this Jan 6, 2025
@hpages
Copy link
Contributor Author

hpages commented Jan 8, 2025

Hi Marcel,

There are some packages that have large .so and .o files in the tarballs.

Really? Binary packages maybe but not source packages. Do you have an example of a source package that contains .so or .o files? It seems wrong to me. Maybe something that BiocCheck should report with a WARNING.

@lshep
Copy link
Contributor

lshep commented Jan 8, 2025

We checked for .so and .o files and they should be removed https://github.com/Bioconductor/BiocCheck/blob/devel/R/BiocCheckGitClone.R#L3

@hpages
Copy link
Contributor Author

hpages commented Jan 22, 2025

We checked for .so and .o files and they should be removed

That's good 😃 Hopefully BiocCheckGitClone() emits a loud warning or even an error for this.

@LiNk-NY So to answer your latest question:

Is it okay to ignore the entire src file within the tarball?

No need to skip the src/ folder and subfolders. None of that should contain .so or .o files. If they do, it's a serious offense that will be reported by BiocCheckGitClone() so they won't get away with it 😉 Also if the src/ tree contains .so or .o files that are more than 5MB, then now we have a situation of double offense, so it won't hurt if the file size check performed by BiocCheck() reports the files again (in addition to BiocCheckGitClone() already reporting them).

Anyways, as mentioned earlier in this issue, it would make more sense to perform the file size check in BiocCheckGitClone(). We only want to perform this check on files that belong to git so this is the place to do it. Trying to do it on the source tarball is kind of a little bit too late...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants