Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Consistent image ID hashes across machines #711

Merged
merged 10 commits into from
Apr 30, 2024

Conversation

AdrianGonz97
Copy link
Contributor

@AdrianGonz97 AdrianGonz97 commented Apr 1, 2024

This PR fixes an issue that we've been experiencing in huntabyte/shadcn-svelte#978 where the hash for the generated image ID is different when it's computed on different machines. The goal of that PR was to cache the images so that they can be utilized across our GH runners.

The source of the issue is the mtime stat from the image file, which seemingly differs from machine to machine, causing the hash to generate a different id, resulting in a cache MISS. Instead, mtime has been replaced for the size stat, which (in conjunction with the image config and file url) should provide sufficient uniqueness for the hash.


  • Quick Checklist
  • I have read the contributing guidelines
  • I have written new tests, as applicable (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • I have added a changeset, if applicable
  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    Bug fix

  • What is the new behavior (if this is a feature change)?
    Implements consistent image ID hashing across machines

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this
    PR?)
    No

  • Other information:
    I wasn't sure the best way to express this change in a test, so I've hardcoded the expected id string. I'm sure there's a better way to do it, so please feel free to modify it!

Copy link

changeset-bot bot commented Apr 1, 2024

🦋 Changeset detected

Latest commit: e43d13f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
vite-imagetools Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@benmccann
Copy link
Collaborator

Hmm, that's really weird. I'm surprised the mtime would be different across machines. Are they different OSes? Or is the mtime set based upon when you check the repo out from git. I'd love to better understand the issue before we switch away from it

Copy link

codecov bot commented Apr 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.48%. Comparing base (8790098) to head (a4aec62).
Report is 2 commits behind head on main.

❗ Current head a4aec62 differs from pull request most recent head e43d13f. Consider uploading reports for the commit e43d13f to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #711      +/-   ##
==========================================
- Coverage   95.49%   95.48%   -0.01%     
==========================================
  Files          33       33              
  Lines        1288     1286       -2     
  Branches      226      226              
==========================================
- Hits         1230     1228       -2     
  Misses         58       58              
Flag Coverage Δ
imagetools-core 95.48% <100.00%> (-0.01%) ⬇️
vite-imagetools 95.48% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@AdrianGonz97
Copy link
Contributor Author

AdrianGonz97 commented Apr 9, 2024

Hmm, that's really weird. I'm surprised the mtime would be different across machines. Are they different OSes?

It's definitely strange 😄. The OSes are set to be the same as they're running from the same workflow, so I don't think it's that. Perhaps it's more related to timezones? 🤔

Or is the mtime set based upon when you check the repo out from git

I'd be happy to dig into this question when I get a chance!

@AdrianGonz97
Copy link
Contributor Author

AdrianGonz97 commented Apr 10, 2024

Or is the mtime set based upon when you check the repo out from git

It seems like this assumption was correct, @benmccann! After checking out the repo on 2 separate instances (on the same machine), I'm ending up with two different mtime stats.

File stats 1:
{
  stats: Stats {
    dev: 2080,
    mode: 33188,
    nlink: 1,
    uid: 1000,
    gid: 1000,
    rdev: 0,
    blksize: 4096,
    ino: 768440,
    size: 159643,
    blocks: 312,
    atimeMs: 1712710759640.3115,
    mtimeMs: 1712710733650.3157,
    ctimeMs: 1712710733650.3157,
    birthtimeMs: 1712710733650.3157,
    atime: 2024-04-10T00:59:19.640Z,
    mtime: 2024-04-10T00:58:53.650Z,
    ctime: 2024-04-10T00:58:53.650Z,
    birthtime: 2024-04-10T00:58:53.650Z
  }
}
File stats 2:
{
  stats: Stats {
    dev: 2080,
    mode: 33188,
    nlink: 1,
    uid: 1000,
    gid: 1000,
    rdev: 0,
    blksize: 4096,
    ino: 1598355,
    size: 159643,
    blocks: 312,
    atimeMs: 1712711152420.249,
    mtimeMs: 1712711150750.2493,
    ctimeMs: 1712711150750.2493,
    birthtimeMs: 1712711150750.2493,
    atime: 2024-04-10T01:05:52.420Z,
    mtime: 2024-04-10T01:05:50.750Z,
    ctime: 2024-04-10T01:05:50.750Z,
    birthtime: 2024-04-10T01:05:50.750Z
  }
}

@benmccann
Copy link
Collaborator

Sorry for the long delay. I'm afraid I'd forgotten about this PR.

I'm nervous about using the file size because you could hit false positives with it. I'm not sure how expensive it would be to compute a hash for every file, but it seems safer.

I had an idea that we could use mtime first and then fallback to the hash if it's different, but that probably wouldn't work because it would require us storing two hashes.

@AdrianGonz97
Copy link
Contributor Author

AdrianGonz97 commented Apr 29, 2024

I'm nervous about using the file size because you could hit false positives with it.

Fair enough!

I'm not sure how expensive it would be to compute a hash for every file, but it seems safer.

I agree. Testing locally, I modified it to use imageBuffer instead, which had no perceivable performance impact relative to the rest of build.

I also did some microbenchmarking for it (just in case), testing different image sizes to see their impact:

┌─────────┬───────────────────────────────────┬───────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name                         │ ops/sec   │ Average Time (ns)  │ Margin   │ Samples │
├─────────┼───────────────────────────────────┼───────────┼────────────────────┼──────────┼─────────┤
│ 0       │ 'hash with size - base (500kb)'   │ '231,842' │ 4313.26988522398   │ '±0.63%' │ 231843  │
│ 1       │ 'hash with buffer - base (500kb)' │ '5,979'   │ 167226.18963211094 │ '±0.23%' │ 5980    │
│ 2       │ 'hash with buffer - 11mb'         │ '165'     │ 6036291.090361429  │ '±0.43%' │ 166     │
│ 3       │ 'hash with buffer - 5mb'          │ '311'     │ 3215181.397435922  │ '±0.31%' │ 312     │
│ 4       │ 'hash with buffer - 2.5mb'        │ '1,099'   │ 909751.2799999929  │ '±0.20%' │ 1100    │
│ 5       │ 'hash with buffer - 1mb'          │ '2,599'   │ 384681.17346155015 │ '±0.40%' │ 2600    │
│ 6       │ 'hash with buffer - 500kb'        │ '5,875'   │ 170197.13359428805 │ '±0.28%' │ 5876    │
│ 7       │ 'hash with buffer - 350kb'        │ '6,425'   │ 155626.61889200658 │ '±0.25%' │ 6426    │
│ 8       │ 'hash with buffer - 100kb'        │ '16,379'  │ 61051.895238088415 │ '±0.44%' │ 16380   │
└─────────┴───────────────────────────────────┴───────────┴────────────────────┴──────────┴─────────┘

Here's the repo for it too: https://github.com/AdrianGonz97/imagetools-hashing-benchmarks

Hashing with the size (or mtime) stat is significantly faster, however, I would consider hashing with the image to be sufficiently fast enough for our needs. Especially since this operation doesn't execute hundreds of thousands of times per build.

So in practice, I think it's completely fine to hash with the image itself (even without the mtime optimization you described) as the performance impact is miniscule when compared to the rest of the build.


Side note: While writing the benchmark, I noticed that generateImageID is async when it doesn't need to be. Do you have any idea why that's the case? Scratch that, looking back at the history, it seems like it used to await a promise before the caching PR was introduced. Not a problem, I can fix that.

Copy link
Collaborator

@benmccann benmccann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for all the changes and the thorough job!

@benmccann benmccann merged commit f049b7f into JonasKruckenberg:main Apr 30, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants