Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new ExitCode type and values, change ActorError to make bad exit codes less likely #191

Merged
merged 10 commits into from
Apr 11, 2022

Conversation

anorth
Copy link
Member

@anorth anorth commented Apr 6, 2022

Updates all uses of the FVM's ExitCode type to the new one that will be released in the next crates. Updates to use the codes defined in the spec.

Changes ActorError, the type used to propagate abort codes to the trampoline, to make it easier to construct common ones and a bit harder to accidentally attempt to exit with a system exit code. As part of this, introduces exit code 24 for "internal send failed", because the exit code of the internal send may be a system exit code and can't be propagated by the caller.

It's going to be painful, but please do review each change in exit code (or at least those that used to be SysErrXxx). There are many places where I've now had to replace a system exit code with a user exit code. The tests all pass locally, but I am somewhat surprised at that.

  • Depend on new fvm_shared crates before merging.

FYI @dignifiedquire

@anorth anorth requested review from vmx, Stebalien and raulk April 6, 2022 11:51
@@ -36,6 +35,8 @@ pub enum Method {
Collect = 4,
}

pub const ERR_CHANNEL_STATE_UPDATE_AFTER_SETTLED: ExitCode = ExitCode::new(32);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling attention to this actor-defined exit code, as an example.

@@ -61,6 +61,8 @@ pub enum Method {
CurrentTotalPower = 9,
}

pub const ERR_TOO_MANY_PROVE_COMMITS: ExitCode = ExitCode::new(32);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another actor-specific exit code.

}
pub fn internal_send_aborted(msg: String) -> Self {
// TODO use symbolic constant for USR_SEND_ABORTED when it's defined.
Self { exit_code: ExitCode::from(24), msg }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tada! I can define this in builtin-actors without waiting for an enum update in FVM.

@@ -60,7 +60,7 @@ impl<B> FvmRuntime<B> {
fn assert_not_validated(&mut self) -> Result<(), ActorError> {
if self.caller_validated {
return Err(actor_error!(
SysErrIllegalActor,
unspecified, // FIXME is this the right one?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: which standard exit code should we use for caller validation errors?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this would be "unspecified". But maybe we do need an extra USR_ASSERTION_FAILED error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -263,28 +261,40 @@ where
value: TokenAmount,
) -> Result<RawBytes, ActorError> {
if self.in_transaction {
return Err(actor_error!(SysErrIllegalActor; "runtime.send() is not allowed"));
return Err(actor_error!(unspecified; "runtime.send() is not allowed"));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: which standard exit code for state transaction violations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match fvm::send::send(&to, method, params, value) {
Ok(ret) => {
if ret.exit_code.is_success() {
Ok(ret.return_data)
} else {
Err(ActorError::from(ret.exit_code))
// The returned code can't be simply propagated as it may be a system exit code.
Err(ActorError::internal_send_aborted(format!(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's where failures of internal sends are caught.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this hide the error code from the caller?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. As noted elsewhere, returning ActorError from send isn't appropriate. I'm wary of breaking the consistency with other runtime methods, but maybe I'll try fixing this one now.

}
}
Err(err) => Err(match err {
// Some of these errors are from operations in the Runtime or SDK layer
// before or after the underlying VM send syscall.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my commentary here as I was trying to get to grips with things. I'm reaching the conclusion that error numbers are not good as error values, because context is lost and it's hard to know if the right message is being attached later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They seem to be only enough in the case we are okay with loosing that information, otherwise we must track additional information

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using "numbers" here (in the syscall interface) because they're consensus critical and we want to keep things as "simple" as possible. If we can't distinguish between two cases, we should add additional exit codes for that. But I think all cases are covered.

https://github.com/filecoin-project/fvm-specs/blob/main/07-errors.md#error-numbers-by-syscall should tell you exactly what each error number means with respect to each syscall.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my commentary here as I was trying to get to grips with things. I'm reaching the conclusion that error numbers are not good as error values, because context is lost and it's hard to know if the right message is being attached later.

Or are you referring to the exit code in ActorError? In that case, I agree. My attitude there is:

  1. At the point where we can definitively say "we should exit with some specific exit code", we should just exit with that exit code (abort).
  2. Before then, we should return descriptive (structured) errors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Here" is not the syscall interface, though. It's calling the SDK, which has more code and makes multiple syscalls. We need a descriptive structured error here. Propagating an error number is an antipattern.

value, *balance
return Err(ActorError::unchecked(
ExitCode::SYS_SENDER_STATE_INVALID,
format!("cannot send value: {:?} exceeds balance: {:?}", value, *balance),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock has to replicate what the FVM would do sometimes. I have a hunch that ActorError is not a good type to return from the Runtime API.

Actually this code might be wrong, and it should be an "internal send failed", masking the system code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah using an ActorError here seems weird at best, wrong at worst.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Steb points out elsewhere, we don't want to mask the system exit code either, we just can't propagate it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #197 for a possible fix

@@ -965,51 +965,64 @@ impl Syscalls for MockRuntime {
) -> anyhow::Result<Cid> {
let exp =
self.expectations.borrow_mut().expect_compute_unsealed_sector_cid.take().ok_or_else(
|| actor_error!(ErrIllegalState; "Unexpected syscall to ComputeUnsealedSectorCID"),
|| actor_error!(illegal_state; "Unexpected syscall to ComputeUnsealedSectorCID"),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock is totally messed up here. It should not be returning ActorError for mock expectation violations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed as #196

@@ -24,7 +24,7 @@ impl ActorDowncast for anyhow::Error {
match downcast_util(self) {
Ok(actor_error) => actor_error.wrap(msg),
Err(other) => {
ActorError::new(default_exit_code, format!("{}: {}", msg.as_ref(), other))
ActorError::unchecked(default_exit_code, format!("{}: {}", msg.as_ref(), other))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This downcasting smells iffy, but is used a lot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the solution to this is: remove the usage of anyhow in all the ipld crates (amt,hamt, blockstore,encoding) using conrete error enums. anyhow is great as a "Final" error type, but not as a generic error from which you want to retrieve the original error again. (I have some wip code that does this, but it's a bit more work to finish)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now implemented in filecoin-project/ref-fvm#463

@jennijuju jennijuju requested a review from lemmih April 6, 2022 12:35
@@ -60,7 +60,7 @@ impl<B> FvmRuntime<B> {
fn assert_not_validated(&mut self) -> Result<(), ActorError> {
if self.caller_validated {
return Err(actor_error!(
SysErrIllegalActor,
unspecified, // FIXME is this the right one?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this would be "unspecified". But maybe we do need an extra USR_ASSERTION_FAILED error.

@@ -60,7 +60,7 @@ impl<B> FvmRuntime<B> {
fn assert_not_validated(&mut self) -> Result<(), ActorError> {
if self.caller_validated {
return Err(actor_error!(
SysErrIllegalActor,
unspecified, // FIXME is this the right one?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -263,28 +261,40 @@ where
value: TokenAmount,
) -> Result<RawBytes, ActorError> {
if self.in_transaction {
return Err(actor_error!(SysErrIllegalActor; "runtime.send() is not allowed"));
return Err(actor_error!(unspecified; "runtime.send() is not allowed"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match fvm::send::send(&to, method, params, value) {
Ok(ret) => {
if ret.exit_code.is_success() {
Ok(ret.return_data)
} else {
Err(ActorError::from(ret.exit_code))
// The returned code can't be simply propagated as it may be a system exit code.
Err(ActorError::internal_send_aborted(format!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this hide the error code from the caller?

Comment on lines 282 to 285
// How do we know the receiver is what triggered this?
// An error number doesn't carry enough information at this level
// above the raw syscall (and even then).
actor_error!(not_found; "receiver not found")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The send syscall is defined to return NotFound if, and only if, the receiver doesn't exist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but this is not the syscall! This is invoking an SDK method that makes three syscalls. We don't know where the NotFound came from here.

Comment on lines 288 to 292
// Actually this is more like an illegal argument where the caller attempted
// to transfer an amount larger than that representable by the VM's
// token amount type. Yes, the caller doesn't have that amount, but if they'd
// attempted to transfer a representable amount it would fail with
// SYS_INSUFFICIENT_FUNDS instead, so this difference is wierd.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll fail with InsufficientFunds (the syscall error) in both cases. See https://github.com/filecoin-project/fvm-project/blob/724650bf92b239bb7d2079cde7d5bad121d7c79f/07-errors.md#L53-L57 for the only "system" exit codes that can appear here.

Basically, the rules are:

  1. Syscall error numbers indicates that something went wrong with the syscall.
  2. Exit codes indicates that the syscall succeeded, but the called actor failed or did something wrong.

In other words:

  1. If we fail before we invoke the target actor, we get a syscall error.
  2. If we fail afterwards, we get an exit code. Usually a "user" exit code, but sometimes a "system" exit code if the actor fails without yielding a usable exit code.

Other than that, the system exit codes exist for recording failures on-chain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I traced the code. The FVM SDK will return ErrorNumber::InsufficientFunds prior to invoking the syscall in the case I describe. This is not the syscall interface. The SDK makes three syscalls and performs some other checks below here. This match branch is specifically for when we don't get a system exit code from the send (and so we can't invoke one here in userland).

I know we want it to behave how you described, but it doesn't. This is another example of error numbers being a poor representation except for directly at the actual syscall.

}
}
Err(err) => Err(match err {
// Some of these errors are from operations in the Runtime or SDK layer
// before or after the underlying VM send syscall.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using "numbers" here (in the syscall interface) because they're consensus critical and we want to keep things as "simple" as possible. If we can't distinguish between two cases, we should add additional exit codes for that. But I think all cases are covered.

https://github.com/filecoin-project/fvm-specs/blob/main/07-errors.md#error-numbers-by-syscall should tell you exactly what each error number means with respect to each syscall.

}
}
Err(err) => Err(match err {
// Some of these errors are from operations in the Runtime or SDK layer
// before or after the underlying VM send syscall.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my commentary here as I was trying to get to grips with things. I'm reaching the conclusion that error numbers are not good as error values, because context is lost and it's hard to know if the right message is being attached later.

Or are you referring to the exit code in ActorError? In that case, I agree. My attitude there is:

  1. At the point where we can definitively say "we should exit with some specific exit code", we should just exit with that exit code (abort).
  2. Before then, we should return descriptive (structured) errors.

dignifiedquire and others added 5 commits April 8, 2022 16:01
The real fix is discussed in #144, but this version is strictly "more
correct" than the previous code.
Many of these are temporary (waiting on #144), but this is as correct as
we can get it for now.
@Stebalien Stebalien enabled auto-merge (squash) April 11, 2022 13:55
@Stebalien Stebalien merged commit 293ceb4 into master Apr 11, 2022
@Stebalien Stebalien deleted the anorth/syserr branch April 11, 2022 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants