Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The implementation of fd_filestat_get in wasi-preview1-component-adapter is unsound #8956

Open
whitequark opened this issue Jul 14, 2024 · 20 comments

Comments

@whitequark
Copy link
Contributor

whitequark commented Jul 14, 2024

The implementation uses the value of metadataHash as st_ino:

dev: 1,
ino: metadata_hash.lower,

This is unsound. To quote the documentation for metadataHash:

Return a hash of the metadata associated with a filesystem object referred to by a descriptor.

This returns a hash of the last-modification timestamp and file size, and may also include the inode number, device number, birth timestamp, and other metadata fields that may change when the file is modified or replaced. It may also include a secret value chosen by the implementation and not otherwise exposed.

Implementations are encourated to provide the following properties:

  • If the file is not modified or replaced, the computed hash value should usually not change.
  • If the object is modified or replaced, the computed hash value should usually change.
  • The inputs to the hash should not be easily computable from the computed hash.

However, none of these is required.

Applications will commonly use st1.st_ino == st2.st_ino in order to determine if two files are the same (for example, Clang does it). My implementation of metadataHash, which always returns 0 and is compliant with the definition above, caused the combination of Clang and the wasip1 component adapter, to always treat all read files as the same file (which in practice meant the first #include caused an infinite loop).

@whitequark
Copy link
Contributor Author

It's unclear to me how to provide the desired semantics (st1.st_ino == st2.st_ino iff isSameObject(f1, f2)), or, in fact, any reasonable semantics that doesn't make private assumptions about the exact value of metadataHash (i.e. any sound semantics). It's pretty late though so I may be missing something obvious.

@whitequark
Copy link
Contributor Author

whitequark commented Jul 14, 2024

On reflection, I think the intended behavior here is to maintain a mapping from metadataHash to "potential files" (i.e. both files corresponding to file descriptors, and files that could be opened from a directory descriptor) that have previously been stat'd or opened, and to assign each "potential file" a fresh inode value.

This would still be problematic, in a subtler way: nothing in the documentation for metadataHash implies that always returning 0 may lead to undesirable performance implications down the line. Normally, we treat quadratic behavior arising in hash tables as a security issue in cases like this! So clearly either this implementation also doesn't work, or the contract for metadataHash must be amended.

whitequark added a commit to YoWASP/runtime-js that referenced this issue Jul 14, 2024
@whitequark
Copy link
Contributor Author

On more reflection, since metadataHash changes if the file is modified, it can't be used to index into such a table, so we're back to fd_filestat_get can't be sound.

How was this supposed to be implemented?

@alexcrichton
Copy link
Member

I can link WebAssembly/wasi-filesystem#81 as more possible background on this, but I don't have any extra info beyond that.

@pchickey
Copy link
Contributor

Like Alex, the only understanding I have of this topic is from PR 81 on the spec. It is quite possible that both the spec and our implementation are questionable, and that you are the first person to encounter problems with it in the wild. I don't have the expertise or time to devote to really digging into what the best path forward here is right now. Dan is out on leave right now, so I'm standing in as a wasi-filesystem champion in his absence, and I will defer to your judgement for changes to the spec and/or implementation.

@whitequark
Copy link
Contributor Author

Thank you. I've been feeling not the best lately, and this has directly impacted my ability to reason on really intricate topics like this one. I'll keep this issue on my radar and try to work out a way forward that works for everyone involved. I do not yet have any firm conclusions and I'll need a deeper understanding of the topic and more consultation with others before I'll have them.

Something that I noticed seems to have been an omission is the question of polyfilling the APIs. In one direction it's straightforward: wasip2 can be done on top of wasip1 easily. But it seems like nobody has drilled in detail into how to emulate wasip1 on top of wasip2, and given the state of the ecosystem (LLVM only directly knowing how to emit wasip1 binaries with wasip2 requiring a post-processing step) I think that's actually the more important direction?

@pchickey Do you think there is possibility or appetite for a sort of "sidecar" specification that builds on top of wasip2 in order to provide a sound polyfill for fd_filestat_get where if it's not available then st_dev/st_ino end up having zeroes in them? Or should I not consider that as a possible direction?

@alexcrichton
Copy link
Member

Oh that's actually the most important direction for us, too, implementing p1 on top of p2. The crate you linked in the OP is the current canonical source for implementing p1 on top of p2 within wasm. There's a host-side implementation at https://github.com/bytecodealliance/wasmtime/blob/main/crates/wasi/src/preview1.rs as well where the p1 APIs are all defined primarily in terms of their p2 counterparts.

Modifying the adapter (wasi_snapshot_preview1.wasm) is quite difficult, however, as it's written in "Rust" but that's in quotes because you can't do a lot you can do in normal Rust (e.g. use a hash map). Allocations are very carefully managed and such, so doing something fancy with std_{dev,ino} when metadata_hash is zero is probably going to be difficult.

@pchickey
Copy link
Contributor

Agreed with Alex: implementing p1 on top of p2 is an important direction. There are some places we are choosing to do that with low fidelity (the whole rights system in p1 is a mess and we want to leave it behind as much as possible, but it ends up mattering for corner cases where wasi-libc ended up depending on it) but I think this case matters a whole lot more.

As for the incompleteness of the wasip2 support in wasi-libc, so that LLVM could emit a purely wasip2 binary, thats something thats "just" gated on someone who can work on wasi-libc having the time to do so, which is why it has stalled the last few months. We do consider that important to get done at some point, Alex has been filling in some related missing pieces recently, mostly focusing on the parts the Rust toolchain will need.

@whitequark
Copy link
Contributor Author

whitequark commented Jul 15, 2024

Allocations are very carefully managed and such, so doing something fancy with std_{dev,ino} when metadata_hash is zero is probably going to be difficult.

In this case, given how p1 and p2 are defined, there must be a spec change to maintain soundness. For example there could be an additional wit interface that returns an arbitrary 128-bit unique identifier for a descriptor or a descriptor based path that is then stuffed into st_dev/st_ino.

@whitequark
Copy link
Contributor Author

whitequark commented Jul 16, 2024

I think a backstop solution that prevents the most egregious consequences (e.g. Clang considering every input file as the same) is to use a monotonically increasing global counter for filling in st_dev/st_ino. I can't think of any application where this would fail as badly, though of course this is still not a reasonable behavior for stat to exhibit. It's possible though that the reason I can't think of them is because my experience is limited still.

@alexcrichton
Copy link
Member

Could you perhaps hash the filename in the meantime? And pretend that each filename is its own unique dev/ino?

@whitequark
Copy link
Contributor Author

whitequark commented Jul 16, 2024

What about fstat?

I guess st_dev would be the fd and st_ino would be the hash of a filename, if any, opened off that fd.

@alexcrichton
Copy link
Member

Ah sorry I'm assuming that the structure you're maintaining on the host corresponding to the file would contain the hashed filename or the full filename or something like that. If all you have is the fd and nothing else I don't know what would be done.

@whitequark
Copy link
Contributor Author

whitequark commented Jul 16, 2024

Oh there is a misunderstanding. What I was considering is the ability to solve this issue in the preview1 component adapter for an arbitrary host. In my particular host I have basically added fake but consistent inodes to everything, I return them as the metadata hash lower (in accordance with what the adapter currently does) and my code works fine now; it's not a blocker for me personally, I'm raising this because it is an ecosystem health issue.

@alexcrichton
Copy link
Member

Ah makes sense. I would personally come to the same conclusion as you then that there's no way to actually "fix" things without changing the standard and/or docs there. Given the weak guarantees of metadata-hash and the strong guarantees of st_{dev,ino} I don't know how the adapter can work.

One possibility would be to add something which indicates whether metadata-hash is "strong", e.g. is suitable as-is for st_{dev,ino}. In such a situation the adapter could set dev/ino based on metadata-hash. If the "strong" bit is false then the hash of the filename plus whatever metadata-hash is could perhaps be used for dev/ino. That's just a guess though and is still sort of situation solving this. In any case this is probably best left for discussion on wasi-filesystem itself at this point.

@whitequark
Copy link
Contributor Author

I agree, which one of us should kick off that discussion?

@alexcrichton
Copy link
Member

If you wouldn't mind I'd defer to you as I fear I wouldn't have enough context on this

@cfallin
Copy link
Member

cfallin commented Jul 16, 2024

Potentially bad idea, but I'm curious if it spawns other thinking: what if the inode were synthesized from an actual hash of file contents?

Basically, imagine the filesystem like a content-addressed store (e.g. git). The inode is then just the hash that addresses some content.

This certainly satisfies the "check the inode and timestamp first to see if it's the same file contents" use-case; it's no worse than what one would have to do manually to verify same file contents in the absence of an inode/timestamp, i.e., actually read the data.

It's also conceptually "minimal" in the sense that it doesn't require any properties or guarantees from the p2 API.

The main downside is that it's slow: it does not satisfy what the user expects to be an O(1)-ish metadata check. This is especially an issue since the inode number must be provided as the result of a stat operation (i.e., is not requested separately). Maybe that alone is enough to throw out this idea. But maybe not?

@bjorn3
Copy link
Contributor

bjorn3 commented Jul 16, 2024

what if the inode were synthesized from an actual hash of file contents?

That could confuse tar into thinking two files are hardlinks to each other and thus turn them into actual hard links when unpacking.

@whitequark
Copy link
Contributor Author

See WebAssembly/wasi-filesystem#153.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants