Enable binary encoding and printing of multi-package wit files #1609

macovedj · 2024-06-11T04:18:09Z

Wit files that leverage multiple packages should be able to be encoded into binary and printed back out with the changes in this PR

alexcrichton

Thanks for this! On a skim this looks reasonable enough, but there's one part that will require careful review I haven't gotten to yet. Specifically this is affecting the binary format for wasm-encoded WIT packages and I need to review to see if this has an impact on how this interacts with older versions of wasm-tools. For example older versions need to be able to decode binaries produced by this PR and vice-versa, binaries produced here additionally need to be able to be decoded by older versions (assuming the new feature of more-than-one-package isn't used).

To perhaps help with this can you add some tests? There's no tests currently for this functionality and I think it would be good to have some in crates/wit-component/tests/interfaces/*.wit. Additionally would it be possible to avoid changing the v1 encoding? That in theory isn't being emitted by deafult any more and may even be able to be removed, it's been a bit since I reviewed the state of that.

alexcrichton · 2024-06-12T20:28:19Z

Ok reading over this some more I don't think that this is going to work as-is. It looks like this is basically adding a few loops in a few places, but I think that the changes required are going to be more significant than that. For example each interface/world in a package, when encoded today, is exported as the kebab-cased-name of the interface/world. Each kebab-cased export is a standalone component type which has full context for just that interface/world.

With multiple packages something will need to change about this format. Otherwise it's not possible to encode two packages which contains interface foo for example since that would generate two kebab-cased exports named "foo". I haven't thought much about what the format should be, so I don't have an answer off the top of my head.

alexcrichton · 2024-06-12T21:26:00Z

cc @lukewagner for possible brainstorming on the format here. The question being that today we have:

$ wasm-tools component wit -t <<< "package a:b; interface foo {}"
(component
  (type (;0;)
    (component
      (type (;0;)
        (instance)
      )
      (export (;0;) "a:b/foo" (instance (type 0)))
    )
  )
  (export (;1;) "foo" (type 0))
  (@custom "package-docs" "\00{}")
  (@producers
    (processed-by "wit-component" "0.210.0")
  )
)

What's the best way to represent this WIT package:

package a:b1 { interface foo {} }
package a:b2 { interface foo {} }

I'm tempted to say:

(component
  (component $a:b1 (; output of `wasm-tools component wit -t for just `package a:b1` ;) )
  (component $a:b2 (; output of `wasm-tools component wit -t for just `package a:b2` ;) )

  (export "a:b1" (component $a:b1))
  (export "a:b2" (component $a:b2))
)

This isn't currently valid though as "a:b1" isn't a valid export string, right now it needs to be fully qualified. This could be seen as motivation for relaxing those restrictions, but that's balooning the change a good deal so I'm curious if you have thoughts on this.

lukewagner · 2024-06-12T22:52:17Z

Ooh yeah, really good question. Agreed that the natural first step seems to be to put each nested WIT package into its own nested component. Then to your question: maybe this is weird but: what if we just didn't export those nested WIT packages at all? That would imply that when WIT tooling was slurping up a binary WIT package, it would need to recurse through all nested components and populate its internal namespace with all interfacename-d type exports of exported component types (rather than only doing so for the root component). Would that work?

What's neat about this is that it preserves the distinction between:

package ns:a;
package ns:b { interface i {} }
interface i {}

and

package ns:a { interface i {} }
package ns:b { interface i {} }

The former you could say is defining the package ns:a (which exports i) and bundling ns:b as an external dependency whereas the latter is defining an empty unnamed package that bundles external packages ns:a and ns:b. I think that distinction makes sense?

alexcrichton · 2024-06-13T15:13:44Z

I think that could work yeah and that seems reasonable to me. For the start I think it'd just walk the root component looking for subcomponents rather than spidering the whole thing, but either probably wouldn't be too hard to implement. I'm thinking though it'd be best to only accept a conservative shape of components to keep our options open in the future to different organizations if we want.

Otherwise though this WIT:

package ns:a;
package ns:b { interface i {} }
interface i {}

currently isn't valid where WIT text files are either package foo:bar; mode or exclusively package foo:bar { ... } mode. That being said being able to express the distinction there eventually I think is good regardless if we want to support it in the future.

lukewagner · 2024-06-13T15:33:22Z

Ah right, good point; unless and until we add package { ... } blocks inside package { ... } blocks, you shouldn't need to spider.

macovedj · 2024-06-13T19:25:42Z

So if I'm following correctly then for the scope of this PR, since mixed package declaration styles isn't yet supported, then in Alex's example where we have:

package a:b1 { interface foo {} }
package a:b2 { interface foo {} }

this would end up encoded with no exports:

(component
  (component $a:b1 (; output of `wasm-tools component wit -t for just `package a:b1` ;) )
  (component $a:b2 (; output of `wasm-tools component wit -t for just `package a:b2` ;) )
)

Is that right?

Then whenever we end up with mixed declaration styles being supported (maybe I"ll do that as a follow up or if it doesn't seem so bad I may try to include it in this PR) we could see both this and the current day encodings appear side by side in a single encoded wit file.

alexcrichton · 2024-06-13T19:42:31Z

Indeed! So the encoder will basically say if there's one package do what it does today but with multiple packages being encoded they're encoded in this new format. Similarly the decoding process would need to recognize the new multi-package format and decode that in addition to supporting the previous single-package format.

There's still sort of a question of defaults, but I think that's reasonable to tackle at a later date. Specifically I'm wondering for example what the output of wasm-tools component wit would be for wasi:cli. Whether that by default transitively includes all dependency packages I'm not sure.

macovedj · 2024-06-15T20:25:49Z

So I have implemented these changes locally for encoding multiple wit packages. When decoding though, I found that some modifications will be have to made here as the current tooling will try to parse the binary as a component instead of a wit package if no exports are present.

I think a convention needs to be decided on how to discern whether these are nested components in a component implementation vs being used to describe a wit package? I was thinking that maybe the component names can can provide a mechanism? If not then we could go back to expecting exports for these cases, and the convention could be established in how they're exported?

macovedj · 2024-06-17T14:28:07Z

After some more thought... I'll actually probably investigate refactoring ComponentInfo into ComponentsInfo, with multiple packages in it. Then we could perform the existing check on each package, and it could also provide a clearer route to mixing package declaration styles.

alexcrichton · 2024-06-17T17:34:38Z

That's a good point! I agree though that we can probably update how metadata is tracked and extend the "is empty" check to exclude the case of "no exports but has a bunch of components which all look like WIT packages"

macovedj · 2024-06-21T18:27:15Z

Cool. I think I've essentially accomplished the thrust of what's involved in order to do to that now (locally)... As I'm tidying it up before requesting additional review, I'll be evaluating whether or not to include support for mixing package declaration styles or have that be a follow up.

macovedj · 2024-06-26T21:04:38Z

Hmm seems I botched something up switching branches... should be able to look into what went wrong shortly...

macovedj · 2024-07-01T16:47:41Z

I think this should be ready for review now... just investigating why it's struggling to get gcc with yum in CI/CD.

alexcrichton · 2024-07-09T21:20:17Z

crates/wit-component/src/encoding/wit/v2.rs

+        let mut names = NameMap::new();
+        for pkg in self.packages {
+            let package = &self.resolve.packages[*pkg];
+            if let PackageKind::Explicit = package.kind {


I'm a bit confused as to what this new PackageKind field is and how it's processed here. I think ideally this would be removed entirely since it seems to be an artifact of the original text encoding and shouldn't affect this binary encoding here.

Instead what I might recommend is:

If there's only a single package, encode in the old "v2 style".

If there's multiple packages, encoded in this "explicit" style where a sub-component is created.

I'd also recommend extracing the shared code between these two branches of the if since right now it looks like the body of the loop was copy/pasted.

Additionally I think long-term we'll want to move towards the "new v2 style" where all packages, even if it's just one, are encoded with a sub-component. In that sense I think it would make sense to have a few more knobs here like other transitions we've made with the binary format:

An env var should be added which if specified and is 0 or 1 indicates which encoding branch to use.

Faililng that a const with a doc block explaining what it's controlling.

Eventually in the future we'll invert the value of the const to default to the new encoding. Further in the future we'll remove the env var. Further still we'll remove decoding support for the original encoding.

After some thought, and keeping in mind that we'll eventually want to support a wit file that has both an "implicit"/"unscoped" package declaration as well as an "explicit"/"scoped" package declaration, is the number of packages actually what we want to use as the criteria? If we do things this way instead of capturing the PackageKind... then I'm not sure how we can tell the difference between package a:b {} and package a:b if in the scoped case, it's the only scoped package. I think we'd end up transforming one package declaration style into the other when performing a roundtrip if this choice is made based on the number of packages that exist, which may be fine, but just wanted to make sure that we're considering this.

In the limit of time I would prefer to not spec/support multiple encoding formats for a WIT package in wasm. Right now we support v1/v2 and this PR is adding a third. I'd eventually like to sunset all the prior versions, however.

Given that my goal would be to use this new "v3" unconditionally. I'm proposing to distinguish based off 1-or-more packages purely for backwards compatibility. That way preexisting workflows don't break with older versions since the new format wouldn't be used by default. Once the new format is in enough supporting tooling it would then be enabled by default.

Personally I don't think it's too important to preserve package a:b { ... } vs package a:b; ... since they're semantically the same thing.

Yeah ok that makes sense to me

One last thought that struck me...

Is there a use case where the scoped packages are used similarly to how the deps folder is used today?
In this case where we're mixing package declaration styles, the implicit package may be the one being used by guest lang tooling for knowing which world to target, and the others are there as deps. If we needed to encode in this case, then I think this logic would treat the implicit package like the deps, rather than it being treated as "higher" than the others? I guess perhaps we could rely on guest lang tooling to keep track of which package is the one that matters for targeting?

alexcrichton · 2024-07-09T21:21:22Z

crates/wit-component/tests/interfaces.rs

-        }
+    // Finally convert the decoded package to wasm again and make sure it
+    // matches the prior wasm.
+    let wasm2 = wit_component::encode(Some(true), resolve, &[decoded_package])?;


I'm a bit surprised this test is passing because I would expect that the decoded_package argument should be the result of decoded.packages() now that multiple packages are supported. Could you investigate if a new test is needed to reflect that?

Looks like the assertion was wrong! accidentally setting wat2 to be wasm instead of wasm2 when checking the diff. You're right, when fixing the assertion, the test fails as is. Will push up changes to fix the assertion and handle decoded_package properly.

Turns out when I fixed it fixed the assertion, it came out that the docs are being stored in the wat in a different order depending on if a wit is being parsed vs a binary, in such a way that it doesn't impact how the docs get printed I believe. Can fix the tests by performing a sort on the metadata fields, but still investigating how to just get encode things in the same order regardless.

Is that a regression from this PR? Or a preexisting issue?

I think this is a preexisting issue.

if wasm != wasm2 { let wat2 = wasmprinter::print_bytes(&wasm)?; assert_eq!(wat, wat2, "document did not roundtrip correctly"); }

If I understand correctly, assert_eq is receiving two different wat samples created from wasm, neither represent wasm2... Was spending some time trying to dig into this, and it looks like modifying main also has this issue of doc metadata being reordered if I'm reading the code correctly...

Ah ok, in that case feel free to leave this as-is and I can investigate it after landing this

alexcrichton · 2024-07-09T21:24:31Z

crates/wit-parser/src/decoding.rs

+                        cur_package.package_metadata = _package_metadata.clone();
+                    } else if s.name() == "component-name" {
+                        if let KnownCustom::ComponentName(reader) = s.as_known() {
+                            let _packages = register_names(reader, &mut cur_package)?;


I'm a bit hesitant to rely on a custom section for something so significant to the structure of the final WIT, would it be possible to avoid this inference of infer via other means?

So I actually had similar thoughts when implementing... But I think that the way names of components are handled today is exclusively via custom sections? Will investigate some more, but if we want something similar to the wat outlined above where the name of the component is tied to the package name, I'm not sure of another way to get the information that informs a package name out of the binary format.

Documentation and @since/@unstable are definitely all handled with custom sections, but to me those are semi-ok to strip at any time where the name of a package is significant enough that you'd never want it stripped. I'm mostly wondering if this can be inferred from the exported items or something like that rather than having to be listed in the name section. An alternative could be to export the components themselves from the outer component where the export name has the full package name (of sorts at least)

I guess there's maybe an edge case here where you could have an empty scoped package? I think we could infer things from exports so long as there are exports, and as far as I know the only case where there wouldn't be is if the package is empty... Would you be comfortable erroring on empty scoped packages then? Or yeah, we could also do the alternative and export the components themselves...

@lukewagner do you have thoughts on this point specifically? The question being how is the package name recovered when multiple packages are encoded in a component and one package is possibly empty.

Or yeah... I think I like your idea of exporting the component, seems like the easiest way to deduce the name of the component without relying on the name section... How the PR handles it at the moment, even if we can tell it's an explicit/scoped package from the way the exports are structured, I think we still need the name section to figured out the actual name of the package that should be printed from a binary (unless there's another mechanism I'm missing). Also curious to hear @lukewagner's thoughts about if we exported the components.

crates/wit-parser/src/decoding.rs

crates/wit-component/src/encoding/wit/mod.rs

crates/wit-component/src/encoding/wit/v2.rs

alexcrichton · 2024-07-15T18:50:40Z

crates/wit-component/src/encoding/wit/v2.rs

+        let mut names = NameMap::new();
+        for pkg in self.packages {
+            let package = &self.resolve.packages[*pkg];
+            if let PackageKind::Explicit = package.kind {


In the limit of time I would prefer to not spec/support multiple encoding formats for a WIT package in wasm. Right now we support v1/v2 and this PR is adding a third. I'd eventually like to sunset all the prior versions, however.

Given that my goal would be to use this new "v3" unconditionally. I'm proposing to distinguish based off 1-or-more packages purely for backwards compatibility. That way preexisting workflows don't break with older versions since the new format wouldn't be used by default. Once the new format is in enough supporting tooling it would then be enabled by default.

Personally I don't think it's too important to preserve package a:b { ... } vs package a:b; ... since they're semantically the same thing.

alexcrichton · 2024-07-15T18:51:21Z

crates/wit-component/tests/interfaces.rs

-        }
+    // Finally convert the decoded package to wasm again and make sure it
+    // matches the prior wasm.
+    let wasm2 = wit_component::encode(Some(true), resolve, &[decoded_package])?;


Is that a regression from this PR? Or a preexisting issue?

alexcrichton · 2024-07-15T18:52:26Z

crates/wit-parser/src/decoding.rs

+                        cur_package.package_metadata = _package_metadata.clone();
+                    } else if s.name() == "component-name" {
+                        if let KnownCustom::ComponentName(reader) = s.as_known() {
+                            let _packages = register_names(reader, &mut cur_package)?;


@lukewagner do you have thoughts on this point specifically? The question being how is the package name recovered when multiple packages are encoded in a component and one package is possibly empty.

crates/wit-parser/src/decoding.rs

alexcrichton · 2024-07-15T18:56:33Z

crates/wit-parser/src/decoding.rs

+    name: Option<PackageName>,
+    /// Wasmparser-defined type information learned after a component is fully
+    /// validated.
+    types: Option<types::Types>,


It should be ok to only store the types at the "root" since I don't think a snapshot is needed along the way

I see, at this point though I think I've been able to combine ComponentInfo with ExplicitPackageInfo, so I think it'll just have a value of None when for nested packages.

crates/wit-parser/src/decoding.rs

…der around

alexcrichton · 2024-07-31T15:24:36Z

I'm going to close this as the conclusion of this PR manifested in #1700 which is to instead of supporting the feature being added here instead explicitly don't support it. The encoding/printing phases will continue to work on one "primary package".

macovedj added 2 commits June 10, 2024 23:17

enable binary encoding and printing of multi package wit files

3277e9c

fix fuzz

29da207

alexcrichton reviewed Jun 11, 2024

View reviewed changes

macovedj force-pushed the roundtrip-multi-package branch 2 times, most recently from 301c1a4 to fd677d1 Compare June 26, 2024 20:41

macovedj force-pushed the roundtrip-multi-package branch from fd677d1 to 27576a1 Compare June 27, 2024 00:09

explicit packages encoded as nested components

3e8b0f8

macovedj force-pushed the roundtrip-multi-package branch from 27576a1 to 3e8b0f8 Compare June 27, 2024 00:17

macovedj added 3 commits June 26, 2024 19:35

serde

4ba14d3

simplify ExplicitPackageInfo for decoding multi-package wit

20f083a

retry

500ae6f

macovedj added 3 commits July 1, 2024 12:06

retry

d88d268

update centos version and mirror

616721e

Merge branch 'main' into roundtrip-multi-package

4b84e91

alexcrichton reviewed Jul 9, 2024

View reviewed changes

macovedj added 2 commits July 11, 2024 11:34

merge resolve per explicit package

19301eb

remove package kind

5d98d8c

common function for encoding packages

a13b387

macovedj force-pushed the roundtrip-multi-package branch from df9a84d to 955bf5b Compare July 13, 2024 15:57

fix interfaces test assertion

d3d2d2c

macovedj force-pushed the roundtrip-multi-package branch from 955bf5b to d3d2d2c Compare July 13, 2024 16:02

Merge branch 'main' into roundtrip-multi-package

ed1a0ad

macovedj force-pushed the roundtrip-multi-package branch from 1a0cb52 to ed1a0ad Compare July 15, 2024 16:50

simplify decoding

476e3f5

alexcrichton reviewed Jul 15, 2024

View reviewed changes

macovedj added 3 commits July 15, 2024 15:39

combine ComponentInfo with ExplicitPackageInfo and dont pass sub_enco…

cde809e

…der around

encode metadata via helper

9ba171a

v3 env var and const, additional error

04dfc6a

alexcrichton closed this Jul 31, 2024

Enable binary encoding and printing of multi-package wit files #1609

Enable binary encoding and printing of multi-package wit files #1609

Conversation

macovedj commented Jun 11, 2024

alexcrichton left a comment

Choose a reason for hiding this comment

alexcrichton commented Jun 12, 2024

alexcrichton commented Jun 12, 2024

lukewagner commented Jun 12, 2024

alexcrichton commented Jun 13, 2024

lukewagner commented Jun 13, 2024

macovedj commented Jun 13, 2024

alexcrichton commented Jun 13, 2024

macovedj commented Jun 15, 2024

macovedj commented Jun 17, 2024

alexcrichton commented Jun 17, 2024

macovedj commented Jun 21, 2024 • edited Loading

macovedj commented Jun 26, 2024

macovedj commented Jul 1, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

macovedj Jul 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexcrichton commented Jul 31, 2024

macovedj commented Jun 21, 2024 •

edited

Loading

macovedj Jul 11, 2024 •

edited

Loading