Make grype-db download smaller by switching compression methods #367

Closed
3 tasks done
willmurphyscode opened this issue Aug 21, 2024 · 6 comments · Fixed by #437
@willmurphyscode
Contributor

willmurphyscode commented Aug 21, 2024

What would you like to be added:

Grype should download a smaller file during its database update, probably by using zstd compression on the current database schema.

Why is this needed:

The Grype database has grown over the years, to the point where the database is now 184 MB as a gzipped tar. This results in load on the CDN and a poor experience for many users.

Tasks:

  • update yardstick to be able to handle zstd files (Enable importing schema v6 grype archives, yardstick#441)
  • update grype-db-manager config shape to include a suffix per schema version, and pass that suffix as -e to grype-db package
  • update that configuration to pass -e tar.zstd for v4 and v5 schemas (the code responds to this; no config update necessary)
@willmurphyscode willmurphyscode added the enhancement New feature or request label Aug 21, 2024
@wagoodman
Contributor

wagoodman commented Sep 17, 2024

We may be able to use Xz in a performant way instead if we use https://github.com/xi2/xz. This appears to be an order of magnitude faster than https://github.com/ulikunitz/xz for decompression. This would mean we'd need to shell out to compress within grype-db, which seems like an alright tradeoff (I think the ulikunitz repo yields larger archives than the native xz utils).

Another consideration is on the compression side: I'm seeing that golang-only implementations are not achieving the best compression ratios compared to native tooling. That implies we might want to shell out to native tooling when creating archives.

@wagoodman wagoodman added this to the DB v6 milestone Sep 17, 2024
@wagoodman wagoodman changed the title Make grype-db download smaller by using zstd compression Make grype-db download smaller by switching compression methods Sep 17, 2024
@wagoodman wagoodman moved this to Ready in OSS Sep 17, 2024
@wagoodman
Contributor

wagoodman commented Sep 17, 2024

Prototype for grype is here: anchore/grype@main...fast-xz. This brings decompression down from 80 seconds with ulikunitz to 16 seconds. Before continuing: is this acceptable? With v6 the DB size will be much smaller than what was tested with; assuming the trend is linear, it looks like this will be ~10 seconds to decompress.

What's missing is removing some of the copied untar code from go-getter and leveraging the stereoscope tar utils (may require some refactoring in stereoscope).

@popey
Contributor

popey commented Sep 18, 2024

While busy doing other things, I ran a compression benchmark against today's grype vuln database. I don't know if it's valuable data to you, but I am posting here anyway. I ran it on my ThinkPad Z13, so it's 1-2-year-old commodity hardware.

Summary

Algorithm             Time(U+S)(s)  Time(E)(M:s)  ComprRatio  SpaceSave(%)
xz                    309.47        5:06.46       14.15       92.94
gzip                  20.79         0:19.93       7.79        87.18
bzip2                 119.72        1:58.66       11.26       91.13
lzip                  265.31        4:24.80       13.19       92.42
lzma                  312.10        5:09.29       14.13       92.93
lzop                  2.68          0:02.30       4.90        79.64
zstd                  6.22          0:03.89       9.01        88.92
lzip                  261.30        4:21.17       13.17       92.42
7z                    553.75        0:55.57       13.87       92.80
zip                   19.71         0:19.96       7.78        87.17
zstd -T0 -1           8.20          0:01.46       8.18        87.79
zstd -T0 -3 (def.)    13.97         0:01.85       9.01        88.92
zstd -T0 -5           33.25         0:04.06       9.58        89.57
zstd -T0 -10          68.36         0:08.60       10.94       90.87
zstd -T0 -15          236.82        0:32.13       11.23       91.10
zstd -T0 -19          1756.01       3:57.46       13.27       92.47
zstd -T0 --ultra -22  2193.86       13:03.73      17.52       94.30

Full results

(csv format)

Algorithm,Time(U+S)(s),Time(E)(M:s),ComprRatio,SpaceSave(%),T-Start(UT),T-End(UT),S-Start(b),S-End(b),Command
xz,309.47,5:06.46,14.15,92.94,1726672883,1726673189,1445834752,102163132,tar --absolute-names --xz -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.xz /home/alan/.cache/grype/db/5/vulnerability.db
gzip,20.79,0:19.93,7.79,87.18,1726673189,1726673209,1445834752,185461612,tar --absolute-names --gzip -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.gz /home/alan/.cache/grype/db/5/vulnerability.db
bzip2,119.72,1:58.66,11.26,91.13,1726673209,1726673328,1445834752,128338698,tar --absolute-names --bzip2 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.bz2 /home/alan/.cache/grype/db/5/vulnerability.db
lzip,265.31,4:24.80,13.19,92.42,1726673328,1726673593,1445834752,109612679,tar --absolute-names --lzip -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.lz /home/alan/.cache/grype/db/5/vulnerability.db
lzma,312.10,5:09.29,14.13,92.93,1726673593,1726673902,1445834752,102269096,tar --absolute-names --lzma -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.lzma /home/alan/.cache/grype/db/5/vulnerability.db
lzop,2.68,0:02.30,4.90,79.64,1726673902,1726673905,1445834752,294485866,tar --absolute-names --lzop -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.lzop /home/alan/.cache/grype/db/5/vulnerability.db
zstd,6.22,0:03.89,9.01,88.92,1726673905,1726673909,1445834752,160330902,tar --absolute-names --zstd -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
lzip,261.30,4:21.17,13.17,92.42,1726673909,1726674170,1445834752,109719253,tar --absolute-names --lzip -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.lz /home/alan/.cache/grype/db/5/vulnerability.db
7z,553.75,0:55.57,13.87,92.80,1726674170,1726674225,1445834752,104210534,7z a -bso0 -bsp0 /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.7z /home/alan/.cache/grype/db/5/vulnerability.db
zip,19.71,0:19.96,7.78,87.17,1726674225,1726674245,1445834752,185629104,zip -q -r /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.zip /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -1,8.20,0:01.46,8.18,87.79,1726674245,1726674247,1445834752,176662743,tar --absolute-names -I zstd -T0 -1 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -3 (def.),13.97,0:01.85,9.01,88.92,1726674247,1726674249,1445834752,160330902,tar --absolute-names -I zstd -T0 -3 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -5,33.25,0:04.06,9.58,89.57,1726674249,1726674253,1445834752,150815387,tar --absolute-names -I zstd -T0 -5 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -10,68.36,0:08.60,10.94,90.87,1726674253,1726674262,1445834752,132079524,tar --absolute-names -I zstd -T0 -10 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -15,236.82,0:32.13,11.23,91.10,1726674262,1726674294,1445834752,128697206,tar --absolute-names -I zstd -T0 -15 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -19,1756.01,3:57.46,13.27,92.47,1726674294,1726674531,1445834752,108899112,tar --absolute-names -I zstd -T0 -19 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 --ultra -22,2193.86,13:03.73,17.52,94.30,1726674531,1726675315,1445834752,82520379,tar --absolute-names -I zstd -T0 --ultra -22 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
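
For anyone who wants to sanity-check the ComprRatio and SpaceSave(%) columns, here is a minimal sketch that recomputes them from the S-Start(b) and S-End(b) byte counts in the CSV above. The function names are made up for illustration, and only a handful of rows are copied in. The results may differ from the table in the last digit, since the benchmark script appears to truncate rather than round.

```python
# Sizes in bytes, copied from the S-Start(b) and S-End(b) columns of the CSV.
ORIGINAL_BYTES = 1445834752

compressed_bytes = {
    "xz": 102163132,
    "gzip": 185461612,
    "zstd -T0 -3 (def.)": 160330902,
    "zstd -T0 -19": 108899112,
    "zstd -T0 --ultra -22": 82520379,
}


def compression_ratio(original, compressed):
    """How many times smaller the archive is than the input."""
    return original / compressed


def space_saving_percent(original, compressed):
    """Percentage of the original size that the archive saves."""
    return (1 - compressed / original) * 100


for name, size in compressed_bytes.items():
    ratio = compression_ratio(ORIGINAL_BYTES, size)
    saving = space_saving_percent(ORIGINAL_BYTES, size)
    print(f"{name:22s} ratio={ratio:6.2f} saving={saving:5.2f}%")
```

The two columns carry the same information (saving = 1 - 1/ratio), which is why the ranking is identical whichever one you sort by.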

@wagoodman wagoodman self-assigned this Sep 26, 2024
@wagoodman wagoodman moved this from Ready to In Progress in OSS Sep 26, 2024
@jonjohnsonjr

I would highly recommend using zstandard over xz!

@wagoodman
Contributor

wagoodman commented Nov 20, 2024

Me too! When evaluating, I've been trying to minimize file size without impacting decompression time in grype. Something that threw a wrench into this evaluation is deciding when to use golang implementations of these methods vs shelling out to native tooling. I've found that compressing with a golang implementation tends to give worse compression ratios and decompression times. The lesson learned here is: compress with native tooling (for the best archives), decompress with golang implementations (allowing us to easily keep grype as a portable static binary).

I also found that the compression ratio is pretty sensitive to what is being compressed, so while prototyping a new schema we ended up changing a lot of the details based on the ratios we were getting with those designs. For instance, a more normalized DB design tended to give a smaller DB file but a worse compression ratio when compressing for distribution, whereas relaxing normalization and leaning more towards a JSON blob store maximized the ratio.
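
To make the "compress with native tooling" half of that concrete, here is a rough sketch of shelling out to the `zstd` binary from a build script. This is not the code in grype-db or #437: the helper names `zstd_command` and `compress_with_native_zstd` are hypothetical, and the choice of flags (`-T0` for all cores, `-f` to overwrite, `--ultra` for levels above 19) is an assumption about what such a wrapper might want.

```python
import shutil
import subprocess


def zstd_command(src, dst, level=19):
    """Build the argument list for the zstd CLI (hypothetical helper)."""
    if not 1 <= level <= 22:
        raise ValueError("zstd levels run from 1 to 22")
    # -T0: use all available cores; -f: overwrite dst if it already exists.
    args = ["zstd", "-T0", "-f"]
    if level > 19:
        # Levels 20-22 are only unlocked when --ultra is passed.
        args.append("--ultra")
    args += [f"-{level}", src, "-o", dst]
    return args


def compress_with_native_zstd(src, dst, level=19):
    """Shell out to the zstd binary; raises if it is missing or exits non-zero."""
    if shutil.which("zstd") is None:
        raise RuntimeError("zstd binary not found on PATH")
    subprocess.run(zstd_command(src, dst, level), check=True)
```

Something like `compress_with_native_zstd("vulnerability-db.tar", "vulnerability-db.tar.zst", level=22)` would then produce the archive (file names here are placeholders). Keeping the argument building separate from the subprocess call makes the flag logic easy to test without the binary installed.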

So! Where are we at today with all of the feedback incorporated? In terms of distribution sizes:

original:
711M    build/vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774.tar

archives:
 81M    build/vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774-19.tar.zst    # zstd -19
 63M    build/vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774-22.tar.zst    # zstd -22 --ultra
 58M    build/vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774.tar.xz        # xz -9

Xz -9 and Zstd -22 are comparable enough in size to be candidates here.

And timing (after trying out / swapping some decompression libs... I'll spare folks the details here):

❯ time grype db import vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774.tar.xz
grype db import   6.32s user 0.34s system 89% cpu 7.488 total

❯ time grype db import vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774-19.tar.zst
grype db import   1.21s user 0.58s system 168% cpu 1.065 total

❯ time grype db import vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774-22.tar.zst
grype db import   2.89s user 0.46s system 130% cpu 2.574 total

From a timing perspective Zstd wins here.
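
For a sense of scale, a quick sketch that turns the wall-clock totals from the three `time grype db import` runs above into speedups relative to xz. The dictionary keys are just labels chosen for this example, and these are single runs on one machine, so treat the numbers as rough.

```python
# Wall-clock "total" seconds copied from the `time grype db import` runs above.
import_seconds = {
    "xz -9": 7.488,
    "zstd -19": 1.065,
    "zstd -22 --ultra": 2.574,
}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


baseline = import_seconds["xz -9"]
for name, secs in import_seconds.items():
    print(f"{name:18s} {secs:6.3f}s  {speedup(baseline, secs):4.1f}x vs xz")
```

Against that, the size cost is modest: 63M for zstd -22 versus 58M for xz -9, per the archive listing above.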

edit: --ultra impacts memory used in decompression; for archives around these sizes -22 uses ~130 MB of memory while -21 uses ~70 MB, so we might still be tweaking some of the final values here.

Overall, the final verdict is Zstd 🎉

@wagoodman wagoodman added the changelog-ignore Don't consider when generating the changelog label Nov 27, 2024
@wagoodman
Contributor

Adding changelog-ignore: though this is implemented in #437, it won't be usable until v6 is enabled as the default schema (probably in a couple of months). We don't want to pick this up in the next release notes.

@wagoodman wagoodman linked a pull request Nov 27, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from In Progress to Done in OSS Dec 4, 2024