build and process .conda artifacts #1586

Closed · 20 tasks done · beckermr opened this issue Jan 20, 2022 · 25 comments
Comments

@beckermr (Member) commented Jan 20, 2022

This issue is to track to-do items for building and handling .conda artifacts.

to dos:

@jakirkham (Member):

cc @conda-forge/core (as Anaconda.org & CDN can now handle .conda)

@beckermr (Member Author) commented Nov 16, 2022

edit: all of these items are done

notes from core call:

- [x] make sure on announcement you mention the minimum conda version (4.7)
- [x] check that ci services do not do duplicate uploads
- [x] set compression level for big packages
  - flag is [`--zstd-compression-level`](https://github.com/conda/conda-build/blob/3baa21e0af022b3f971068566831c812497545f1/conda_build/cli/main_build.py#L159-L165) (see the sketch below)
  - default is 22, set [here](https://github.com/conda/conda-build/blob/3baa21e0af022b3f971068566831c812497545f1/conda_build/config.py#L53)
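
For reference, a minimal sketch of how that flag might be passed on the command line (the recipe path and level value here are illustrative, not from this thread):

conda build my-recipe/ --zstd-compression-level=19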

@beckermr (Member Author):

use a value in the 16-19 range, e.g.

conda_build:
  zstd_compression_level: 16

in the condarc

@beckermr (Member Author):

@conda-forge/core I went with compression level 16. LMK if you have any issues with that.

cc @mbargull @mariusvniekerk

@jakirkham (Member) commented Nov 16, 2022

How long does it take to run 16 (or 19)? How much more compression does one see between the two? Understand we may not have benchmarks, but any info that can help guide us would be useful.

Should we allow this to be overridable? For example, if compression takes too long on a feedstock close to a CI time limit, we may want to dial it down.

Edit: Just realized PR ( #1852 ) shows this being configurable. So I think that answers the last question.
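
A rough sketch of what a per-feedstock override could look like, assuming the conda-forge.yml key mirrors the condarc spelling shown above (the key name and value here are illustrative, not confirmed in this thread):

# conda-forge.yml at the feedstock root (hypothetical)
conda_build:
  zstd_compression_level: 10  # dial down for very large packages close to the CI time limit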

@beckermr (Member Author):

How long does it take to run 16 (or 19)? How much more compression does one see between the two? Understand we may not have benchmarks, but any info that can help guide us would be useful.

I don't have any of this info. I think we ship this PR and then figure out how it's working as things run.

@jakirkham (Member):

Yeah being able to configure it is more important I think.

From past experience with compressors the last little bit tends to take a lot longer for minimal gain. So was just trying to get a sense of how "flat" the curve was getting to aid in decision making.

@beckermr (Member Author) commented Nov 17, 2022

Here is a benchmark for numpy using the following script on my (old intel) mac

#!/usr/bin/env bash
# transmute a .tar.bz2 package to .conda at several zstd levels and time each step

in_pkg=$1
out_pkg=${in_pkg/.tar.bz2/.conda}
bak_pkg=${in_pkg}.bak

# work from a backup copy so each iteration starts from the same input
cp ${in_pkg} ${bak_pkg}

for level in 1 4 10 16 17 18 19 20 21; do
    cp ${bak_pkg} ${in_pkg}
    rm -f ${out_pkg}
    rm -rf ${out_pkg/.conda//}

    # time the .tar.bz2 -> .conda transmute at this compression level
    start=`python -c "import time; print(time.time())"`
    cph transmute --zstd-compression-level=${level} ${in_pkg} .conda
    end=`python -c "import time; print(time.time())"`
    ttime=$( echo "$end - $start" | bc -l )

    # time extraction of the resulting .conda
    start=`python -c "import time; print(time.time())"`
    cph x ${out_pkg}
    end=`python -c "import time; print(time.time())"`
    runtime=$( echo "$end - $start" | bc -l )

    # human-readable .conda size (cut -w is a BSD/macOS extension)
    size=$(ls -lah ${out_pkg} | cut -w -f 5)

    echo "${level} ${size} ${runtime} ${ttime}"
done

# restore the original .tar.bz2 and clean up
cp ${bak_pkg} ${in_pkg}
rm -f ${bak_pkg}
rm -f ${out_pkg}
rm -rf ${out_pkg/.conda//}

results (columns: zstd level, .conda size, extraction time in seconds, transmute time in seconds)

$ ./bench.sh numpy-1.23.4-py39h9e3402d_1.tar.bz2 
1 8.2M 2.182568 7.1936909
4 7.2M 1.803828 7.8562648
10 6.4M 1.9773452 8.359201
16 5.9M 1.975351 16.997171
17 5.8M 3.171298 20.3572858
18 5.7M 2.3847492 23.3421962
19 5.7M 2.237947 36.101651
20 5.2M 3.756540 35.1239249
21 5.2M 3.2139912 40.8598119

Things flatten out for a package of this size around levels 10-16. This package is ~32M uncompressed.

@beckermr (Member Author) commented Nov 17, 2022

Here is the start of a benchmark for a much bigger file (compressed around 450 MB)

$ ./bench.sh stackvana-afw-0.2022.46-py310hff52083_0.tar.bz2 
1 464M 17.158145 161.082254
4 419M 14.839792 157.401964
10 375M 16.084401 199.193263
16 338M 13.825499 711.3774772

@beckermr (Member Author):

I think 16 will be fine for now. We can lower it as needed for big packages and we only take a small hit on small ones.

@beckermr (Member Author):

We really could use an adaptive option in conda build.

@jakirkham (Member) commented Nov 17, 2022

Thanks Matt! 🙏

Agreed 16 seems like plenty.

Also notably better than their .tar.bz2 equivalents:

| package name | .tar.bz2 (MB) | .conda @ 16 (MB) |
| --- | --- | --- |
| numpy-1.23.4-py39h9e3402d_1 | 6.6 | 5.9 |
| stackvana-afw-0.2022.46-py310hff52083_0 | 435.0 | 338.0 |

@jakirkham (Member) commented Nov 17, 2022

On a different note (kind of related to adaptive), we may in the future want to leverage Zstandard's dictionary support to pretrain on the content of many packages. We could then package this constructed dictionary and use it to improve overall compression and cut down compression/decompression time.

One question here is how compressible packages are in aggregate. There may be some things (like text) that compress really well and other things (like dynamic libraries) that do less well. A somewhat related question is whether it is worth creating per-file-format dictionaries (though this would be a modification to the format). Given other packagers, filesystems, etc. have already gone down the path of using Zstandard, we may be able to glean results from their efforts.
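
For illustration, the zstd CLI can already train such a dictionary from sample files; a rough sketch (file names and level are made up):

# train a shared dictionary on representative inner tarballs, then use it for (de)compression
zstd --train samples/*.tar -o conda-packages.dict
zstd -19 -D conda-packages.dict pkg-foo-1.0-0.tar
zstd -d -D conda-packages.dict pkg-foo-1.0-0.tar.zst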

@dhirschfeld (Member):

Great to see this happening - I'm excited about the performance improvements this might bring! ❤️

I've been building my own .conda packages for a bit now and one usability issue I've run into is not being able to see inside the package.

With the old .tar.bz2 packages I could open them in 7-Zip and browse the contents / folder-structure. That was often invaluable in debugging broken builds.

With the .conda format the contents appear as a pkg-*.tar.zst binary blob:

(screenshot: the .conda opened in 7-Zip, showing only a single pkg-*.tar.zst entry)

Is there an easy way to browse the contents of a .conda package?

@jakirkham (Member):

We have made the same observation ( conda/conda-package-handling#5 ) 🙂

@chrisburr (Member):

Is there an easy way to browse the contents of a .conda package?

All of this information is in https://github.com/regro/libcfgraph and I have a local clone which I regularly use with rg.

Longer term, I was already thinking this would be a great feature for https://prefix.dev/ if @wolfv is interested.

@dhirschfeld (Member) commented Nov 18, 2022

All of this information is in https://github.com/regro/libcfgraph

The idea is that this is useful for debugging broken builds - i.e. the build fails because of missing files in the package so the new package version never gets published outside of CI or my local desktop. As a dev I want to know what the internal file/folder structure of the newly built (broken) package was so I can compare with my expectations.

I don't know much about prefix.dev but doesn't that just report on dependencies between published packages?

beckermr unpinned this issue Nov 18, 2022
@jakirkham (Member):

Related to this, I have filed an issue to create a CEP spelling out the .conda spec more fully ( conda/ceps#42 ).

@jaimergp (Member):

@dhirschfeld - you can use conda_package_handling (via cph extract) for your .conda needs!
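
For example, something along these lines (the package filename is illustrative):

cph extract numpy-1.23.4-py39h9e3402d_1.conda

# or, since a .conda is a zip wrapping zstd-compressed tarballs, unpack it by hand
# and let bsdtar (built with zstd support) handle the inner pkg-*.tar.zst
unzip numpy-1.23.4-py39h9e3402d_1.conda -d numpy-extracted
bsdtar -xf numpy-extracted/pkg-numpy-1.23.4-py39h9e3402d_1.tar.zst -C numpy-extracted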

@dholth (Contributor) commented Nov 20, 2022

zstd has built-in benchmarking

@beckermr (Member Author):

What does this mean?

@dholth (Contributor) commented Nov 21, 2022

If you run `zstd -b1 -e19 somefile` it will tell you how long each level took.

@mbargull (Member):

" copy-pasta" of some comments of mine from internal chat:

Re: dictionary:
Those can make sense in some cases, but we're then also treading into "over-optimizing and leaving utilities behind" territory. Meaning, we get diminishing returns from those minor optimizations, and I also wouldn't be able to bsdtar -x all-the-things then ;).

Re: memory usage:
This is also why I recommend not using the --ultra settings (e.g., -22) but limiting ourselves to -19 at most. Yes, compressing would take an unreasonable amount of resources in some cases, but more importantly for me, decompression would also be affected! zstd by default can use much more memory on decompression (i.e., on our users' side) than ye olde gzip/bzip2 anyway. If you use --ultra settings, its window sizes can surpass the 128 MB mark and it may need as much memory on decompression as well.

A thing orthogonal to tweaking the compression level would be to give the compressor a better-arranged stream of data. Previously, Ray did some experimentation with binsort, but it was rarely used and slow as hell in the chosen configuration. In some cases, though, it could yield notably improved compression. I'm not sure how much of an impact it would have for zstd because it already uses much bigger window sizes (128MB vs the 900KB block size of bzip2 -9, IIRC). Nowadays, one would probably look at how, e.g., https://github.com/mhx/dwarfs/tree/v0.6.2#overview (see similarity hashing) arranges its input instead of trying the binsort approach.

@dholth (Contributor) commented Nov 23, 2022

Thankfully the window size is also limited by the uncompressed archive size.
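
If someone wants to check that for a given package, zstd can report the frame parameters of the inner tarball; a quick sketch (filenames illustrative, and assuming a zstd build whose verbose listing prints the window size):

unzip -o numpy-1.23.4-py39h9e3402d_1.conda 'pkg-*.tar.zst'
zstd -lv pkg-numpy-1.23.4-py39h9e3402d_1.tar.zst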

kou pushed a commit to apache/arrow that referenced this issue Dec 12, 2022
Syncing after conda-forge/arrow-cpp-feedstock#875, which does quite a lot of things; see this [summary](conda-forge/arrow-cpp-feedstock#875 (review)). I'm not keeping the commit history here, but it might be instructive to check the commits there to see why certain changes came about.

It also fixes the CI that was broken by a3ef64b (undoing the changes of #14102 in `tasks.yml`).

Finally, it adapts to conda making a long-planned [switch](conda-forge/conda-forge.github.io#1586) w.r.t. the format / extension of the artefacts it produces.

I'm very likely going to need some help (or at least pointers) for the R-stuff. CC @ xhochy
(for context, I never got a response to conda-forge/r-arrow-feedstock#55, but I'll open a PR to build against libarrow 10).

Once this is done, I can open issues to tackle the tests that shouldn't be failing, as well as the segfaults on PPC and in conjunction with `sparse`.
* Closes: #14828

Authored-by: H. Vetinari <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@michaelosthege:

Even though I have conda 4.12.0, my mamba 0.27.0 has trouble finding packages that are only available as .conda artifacts: mamba-org/mamba#2172
