
[nexus] handle sled-agent errors as described in RFD 486 #6455

Merged: 25 commits merged into main from eliza/486-sa-errors on Sep 2, 2024

Conversation

@hawkw (Member) commented Aug 27, 2024

RFD 486 describes a principled design for how to use the Failed state
for instances and VMMs. In particular, it proposes that:

  • Instances and VMMs should move to the Failed state independently.
  • A VMM goes to Failed if either:
    • Nexus observes that a VMM it thought was resident on a sled is no
      longer resident there, or,
    • it observes through some other means (sled expungement, sled reboot)
      that all VMMs that were resident on that sled must now be gone.
  • Instances go to Failed when an instance-update saga observes that
    an instance's active VMM has been marked Failed. An update saga
    transitioning an instance to failed cleans up the instance's resource
    allocations, similarly to how update sagas handle transitioning an
    instance to Destroyed when its active VMM has been destroyed.
  • For purposes of the external API:
    • An instance that is itself in the Failed state appears externally
      to be Failed.
    • If an instance is in the "has-VMM" internal state and points to a
      Failed VMM, it appears externally to be Stopping and then
      transitions to Failed (see the sketch after this list).
    • Once the actual instance (not just its active VMM) is Failed, it
      can be started or destroyed.
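
A minimal sketch of that external-state mapping, using simplified
stand-in enums and a hypothetical helper rather than the actual Nexus
types:

// Sketch only: stand-in types; the real mapping lives in Nexus' external
// API code and covers more states than shown here.
enum InstanceState { NoVmm, Vmm, Failed, Destroyed }
enum VmmState { Starting, Running, Stopping, Failed, Destroyed }
enum ExternalState { Starting, Running, Stopping, Stopped, Failed, Destroyed }

fn external_state(instance: InstanceState, active_vmm: Option<VmmState>) -> ExternalState {
    match (instance, active_vmm) {
        // An instance that is itself Failed appears Failed externally.
        (InstanceState::Failed, _) => ExternalState::Failed,
        // An instance whose active VMM is Failed appears to be Stopping
        // until an update saga cleans up the VMM and marks the instance
        // itself as Failed.
        (InstanceState::Vmm, Some(VmmState::Failed)) => ExternalState::Stopping,
        (InstanceState::Vmm, Some(VmmState::Running)) => ExternalState::Running,
        (InstanceState::Vmm, Some(VmmState::Starting)) => ExternalState::Starting,
        // ...remaining combinations elided from this sketch...
        _ => ExternalState::Stopped,
    }
}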

This branch implements the behavior described above.

In particular, I've added code to the instance-update saga to handle
instances whose active or migration target VMMs have transitioned to
Failed, similarly to how it handles the VMMs transitioning to
Destroyed. I've changed the code that decides whether a sled-agent
error indicates instance failure so that it does so only when the error
is a 404 with the specific error code indicating that the sled-agent has
forgotten about a VMM it previously knew about. The Nexus functions that
ask sled-agents to update an instance's state, as well as the
instance_watcher background task, now mark instances as Failed only
when they encounter that specific error. I've also
changed the mark_instance_failed function in Nexus to instead mark the
VMM record as failed and trigger an instance-update saga (and it's now
named mark_vmm_failed), and made similar changes to the
instance_watcher background task. Finally, I've also made sure that
Failed instances can be destroyed and restarted, and that instances
with Failed active VMMs appear to be Stopping until the VMM is
cleaned up by an update saga.
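
Conceptually, the check for "this sled-agent error means the VMM is
gone" is deliberately narrow. A rough sketch, with a hypothetical helper
name and a stand-in error-code string rather than the actual Nexus code:

// Sketch: decide whether a sled-agent error means the VMM is actually gone
// (so its record should be marked Failed) rather than a transient problem.
// "NO_SUCH_INSTANCE" is a stand-in for the specific error code described
// above.
fn error_indicates_vmm_gone(status: http::StatusCode, error_code: Option<&str>) -> bool {
    // Only a 404 carrying the "sled-agent has forgotten this VMM" code
    // counts; timeouts, 5xx responses, and other 404s are treated as
    // transient and do not move the VMM to Failed.
    status == http::StatusCode::NOT_FOUND && error_code == Some("NO_SUCH_INSTANCE")
}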

In addition to this, it was necessary to (re)introduce a
VmmState::Creating variant. The purpose of this variant is to
differentiate VMMs which have been created in the database but do not
yet exist on a sled from those which are known to sled-agent and are
actually starting or migrating.

This is necessary because presently, the instance_watcher background
task will attempt to perform health checks for Starting and
Migrating VMMs, and --- if the sled-agent does not actually know about
the VMM yet --- will move it to Failed. But the sled-agent doesn't know
about a newly created VMM until that VMM has been registered with it, so
if the instance_watcher task activates while an instance-start or
instance-migrate saga is in flight, it may incorrectly move the saga's
VMM to Failed if the VMM record has been created but not yet registered.
Now,
after adding the Creating variant, we can avoid performing health
checks for those VMMs until the sled-agent actually acknowledges a
request to register the VMM.
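
A rough sketch of the idea, using a simplified enum and a hypothetical
helper rather than the actual instance_watcher code:

// Sketch: a simplified VMM state enum including the new Creating variant.
enum VmmState {
    // The VMM record exists in the database, but the VMM has not yet been
    // registered with a sled-agent; health checks are skipped in this state.
    Creating,
    Starting,
    Running,
    Migrating,
    Stopping,
    Stopped,
    Failed,
    Destroyed,
}

// The instance_watcher task should only health-check VMMs that a sled-agent
// is expected to know about; a 404 for a Creating VMM is expected rather
// than a sign of failure.
fn should_health_check(state: &VmmState) -> bool {
    !matches!(state, VmmState::Creating)
}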

Besides the behavior implemented here, RFD 486 also proposes that the
boot_on_fault field in the instance record be used by Nexus to
determine whether to automatically attempt to restart a Failed
instance. This will be implemented in a subsequent branch --- see issue
#6491 (and also #4872).

@hawkw force-pushed the eliza/486-sa-errors branch from 997f385 to d1e7165 on August 29, 2024 18:58
@hawkw changed the title from "[WIP] implement RFD 486" to "[nexus] implement RFD 486" on Aug 29, 2024
@hawkw requested a review from gjcolombo on August 29, 2024 22:39
@hawkw force-pushed the eliza/486-sa-errors branch from 0b61ff4 to b811827 on August 30, 2024 18:14
Comment on lines 102 to 139
match value.0 {
    // Errors communicating with the sled-agent should return 502 Bad
    // Gateway to the caller.
    progenitor_client::Error::CommunicationError(e) => {
        // N.B.: it sure *seems* like we should be using the
        // `HttpError::for_status` constructor here, but that
        // constructor secretly calls the `for_client_error`
        // constructor, which panics if we pass a status code that isn't
        // a 4xx error. So, instead, we construct an internal error and
        // then munge its status code.
        let mut error = HttpError::for_internal_error(e.to_string());
        error.status_code = if e.is_timeout() {
            // This isn't actually in the RFD, but I thought it might be
            // nice to return 504 Gateway Timeout when the communication
            // error is a timeout...
            http::StatusCode::GATEWAY_TIMEOUT
        } else {
            http::StatusCode::BAD_GATEWAY
        };
        error
    }
    // Invalid request errors from the sled-agent should return 500
    // Internal Server Errors.
    progenitor_client::Error::InvalidRequest(s) => {
        HttpError::for_internal_error(s)
    }
    e => HttpError::for_internal_error(e.to_string()),
}
@hawkw (Member Author) commented:
@gjcolombo do you think it's worth adding an external message to any of these errors? I was thinking that if self.instance_unhealthy() is true, it might be nice to return an error with a message like "instance failure detected" (and perhaps that should be a 503 Service Unavailable rather than 500?), and maybe the 502s and 504s should say "error communicating with sled-agent" or something? I dunno...

@hawkw changed the title from "[nexus] implement RFD 486" to "[nexus] handle sled-agent errors as described in RFD 486" on Aug 30, 2024
@hawkw marked this pull request as ready for review on August 30, 2024 18:15
@hawkw (Member Author) commented Aug 30, 2024

Noticed that I had forgotten to make instances with Failed active VMMs candidates for the background instance-updater task, so I'm doing that now as well.

Edit: done in 8c43ba1

@gjcolombo (Contributor) left a comment:

Great stuff here--this was much more straightforward than I had anticipated when I wrote the RFD! I have a few nits and small fixes to suggest but nothing blocking.

Review threads (resolved): nexus/src/app/instance.rs, nexus/src/app/sagas/instance_common.rs, nexus/db-model/src/vmm_state.rs, clients/nexus-client/src/lib.rs, nexus/db-queries/src/db/datastore/vmm.rs
@hawkw requested a review from smklein on August 30, 2024 22:24
@hawkw (Member Author) commented Aug 30, 2024

@gjcolombo okay, I think I've addressed all your suggestions!

@gjcolombo (Contributor) left a comment:

New changes LGTM!

hawkw added 13 commits September 1, 2024 13:17
Now, it just moves the `vmm` record to `Failed`, and triggers an update
saga, which will (eventually) advance the instance record to failed.
But I still need to do that part :)

Also, some error type munging, which I don't actually need yet.
This commit introduces a new `VmmState` variant, `Creating`. The purpose
of this variant is to differentiate between VMMs which have been created
in the database but do not yet exist on a sled, from those which are
known to sled-agent and are actually starting/migrating.

This is necessary because presently, the `instance-watcher` background
task will attempt to perform health checks for `Starting` and
`Migrating` VMMs, and --- if the sled-agent does not actually know about
the VMM yet --- will move it to `Failed`. But, when a VMM is newly
created, the sled-agent won't know about it yet until it has been
registered with the sled-agent, so if the `instance-watcher` task
activates while an `instance-start` or `instance-migrate` saga is in
flight, the `instance-watcher` task may incorrectly move its VMM to
`Failed` if the VMM record has been created but not yet registered. Now,
after adding the `Creating` variant, we can avoid performing health
checks for those VMMs until the sled-agent actually acknowledges a
request to register the VMM.
@hawkw force-pushed the eliza/486-sa-errors branch from 01454f5 to 1b3fcad on September 1, 2024 20:17
@hawkw (Member Author) commented Sep 1, 2024

I've manually verified this change on madrid by creating an instance, power-cycling the sled that it's on, watching the instance go to Failed, and then restarting it. Everything seems to work as expected.

On madrid in the switch zone, I run omdb db instances, and see that my instance is in the Running state, and on sled BRM42220007:

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running 7f62223f-8d31-419e-a4e2-1cff197f9879 4d86e152-aa9b-4bb5-a1ea-9b4161cbb81e BRM42220007

Then, I power-cycle that sled. Note that I didn't think long enough to realize that the sled I was about to power cycle was the scrimlet I was currently connected to, so I then had to wait for it to come back... 😅

root@oxz_switch1:~# pilot sp cycle BRM42220007
16  BRM42220007        client_loop: send disconnect: Broken pipe
eliza@jeeves ~ $ ssh madridswitch
(root@fe80::aa40:25ff:fe05:602%madrid_sw1tp0) Password:

eliza@jeeves ~ $ /staff/lab/madrid/enable-tp-announce.sh
+ ssh madridgz bash
+ mkdir -p /usr/share/compliance
+ pilot gimlet info -j
+ cat
+ zlogin oxz_switch bash /tmp/enable-tp-announce.sh
+ sed -i 's/^PASSREQ/#&/' /etc/default/login
+ sed -i /PermitRoot/s/no/yes/ /etc/ssh/sshd_config
+ sed -i /PermitEmpty/s/no/yes/ /etc/ssh/sshd_config
+ sed -i /PasswordAuth/s/no/yes/ /etc/ssh/sshd_config
+ sed -i '/AllowUsers/s/$/ root/' /etc/ssh/sshd_config
+ passwd -d root
passwd: password information changed for root
+ svcadm restart ssh
+ nohup pilot techport announce techport0 techport1
eliza@jeeves ~ $ ssh madridswitch
The illumos Project     helios-2.0.22845        August 2024

Now that the scrimlet is back up, I run omdb db instances again, and note that the instance has now transitioned to Failed:

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE  PROPOLIS_ID          SLED_ID           HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#

Back on my laptop, I restart the failed instance:

eliza@theseus ~/Code/oxide/oxide.rs $ oxide instance start --instance 8cb8e347-99bf-4769-a5db-022a14cb0fed
{
  "description": "",
  "hostname": "my-instance",
  "id": "8cb8e347-99bf-4769-a5db-022a14cb0fed",
  "memory": 8589934592,
  "name": "my-instance",
  "ncpus": 2,
  "project_id": "b432681c-fadc-479c-9b45-40043cceab25",
  "run_state": "starting",
  "time_created": "2024-09-01T21:19:52.057567Z",
  "time_modified": "2024-09-01T21:19:52.057567Z",
  "time_run_state_updated": "2024-09-01T21:33:07.391957Z"
}

Finally, back on the scrimlet, we can see the instance has been restarted successfully, and is now on a different sled.

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~#

@hawkw (Member Author) commented Sep 1, 2024

Power-cycling BRM42220004 also results in the instance going to Failed, although it takes a little while for the instance-watcher background task to actually run:

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~# pilot sp cycle BRM42220004
17  BRM42220004        ok
root@oxz_switch1:~# pilot sp status BRM42220004
17  BRM42220004        on (A0)
root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE  PROPOLIS_ID          SLED_ID           HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#

@hawkw merged commit 46eb2b4 into main on Sep 2, 2024
22 checks passed
@hawkw deleted the eliza/486-sa-errors branch on September 2, 2024 18:41
hawkw added a commit that referenced this pull request Sep 23, 2024
As of #6455, instances in the `Failed` state are allowed to be
restarted. As described in [RFD 486 § 4.3], Nexus should automatically
restart such instances when the instance record's `auto_restart_policy`
field (née `boot_on_fault`, see #6499) indicates that it is permitted to
do so. This branch implements this behavior, by adding a new
`instance_reincarnation` RPW. This RPW consists of a background task
which queries CRDB for instances in the `Failed` state which are
eligible to be automatically restarted, and starts a new
`instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the
following conditions are true (a rough sketch of this check follows the
list):

- **The instance is in the `InstanceState::Failed` state.** Meaning, it
  must have *no active VMM* --- if the instance is in the
  `InstanceState::Vmm` state with an active VMM which is in
  `VmmState::Failed`, that indicates no instance-update saga has run to
  clean up the old VMM's resources. In this case, we must wait until the
  instance record itself has transitioned to `Failed` before attempting
  to restart it.
- **The instance's `auto_restart_policy` permits it to be restarted.**
  The `auto_restart_policy` enum was changed from representing the
  classes of failures on which an instance may be restarted (`never`,
  `sled_failures_only`, and `all_failures`) to a more general
  quality-of-service setting for automatic restarts. Presently, this can
  either be `never`, meaning that the instance can never be restarted
  automatically; or `best_effort`, meaning that the control plane should
  try to keep the instance running but is permitted to not restart it if
  necessary to preserve the stability of the whole system. The default
  policy for instances which don't provide one is `best_effort`.
- **A cooldown period has elapsed since the instance's last automatic
  restart.** In order to prevent instances which fail frequently from
  compromising the reliability of the system as a whole, we enforce a
  cooldown period between automatic restarts. This is tracked by setting
  a last-auto-restart timestamp in the `instance` record. At present,
  the cooldown period is always one hour unless the instance record
  overrides it, which currently can only be done using a test-only Nexus
  method. In the future, we may consider allowing the cooldown period to
  be configured by users, but because it presents a denial-of-service
  risk, setting it should probably be a more privileged operation than
  creating an instance.
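
A rough sketch of the eligibility predicate described by this list
(simplified stand-in types and hypothetical names; the real check is a
database query plus logic in the background task, not a single in-memory
function):

use chrono::{DateTime, Duration, Utc};

// Sketch only: stand-in types for the fields discussed above.
#[derive(PartialEq)]
enum InstanceState { NoVmm, Vmm, Failed }
#[derive(PartialEq)]
enum AutoRestartPolicy { Never, BestEffort }

fn eligible_for_reincarnation(
    state: InstanceState,
    policy: AutoRestartPolicy,
    time_last_auto_restarted: Option<DateTime<Utc>>,
    cooldown: Duration,
    now: DateTime<Utc>,
) -> bool {
    // The instance must be Failed at the instance level (no active VMM left
    // to clean up), its policy must allow automatic restarts, and the
    // cooldown must have elapsed since the last automatic restart, if any.
    state == InstanceState::Failed
        && policy == AutoRestartPolicy::BestEffort
        && time_last_auto_restarted
            .map(|t| now - t >= cooldown)
            .unwrap_or(true)
}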

The implementation of the actual reincarnation background task is
relatively straightforward: it runs a new
`DataStore::find_reincarnatable_instances` query, which returns
instances in the `Failed` state, and it spawns `instance-start` sagas
for any instances which satisfy the above conditions. The restart
cooldown is implemented by adding a new `time_last_auto_restarted` field
to the `instance` table and to the `InstanceRuntimeState` type. We add a
`Reason` enum to the `instance-start` saga's `Params` that indicates why
the instance is being started, and if it is `Reason::AutoRestart`, then
the start saga will set the `time_last_auto_restarted` timestamp when
moving the instance record to the `Starting` state. Additionally, I've
added the `auto_restart_policy` field to the `params::InstanceCreate`
type in the Nexus API so that instances can be created with auto-restart
policies. Reviewers should note that many of the files changed in this
diff were only modified by adding that field to a
`params::InstanceCreate` literal.
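
For illustration, the shape of that saga-parameter change might look
roughly like the following (hypothetical helper, field, and struct
layout; only the `Reason` enum and the `time_last_auto_restarted` field
are described above):

// Sketch: why an instance-start saga is running, so the saga knows whether
// to record an automatic restart when moving the instance to Starting.
enum Reason {
    User,
    AutoRestart,
}

struct Params {
    // ...existing start-saga parameters elided from this sketch...
    reason: Reason,
}

// Only automatic restarts update the instance's last-auto-restart timestamp.
fn new_time_last_auto_restarted(
    reason: &Reason,
    now: chrono::DateTime<chrono::Utc>,
) -> Option<chrono::DateTime<chrono::Utc>> {
    match reason {
        Reason::AutoRestart => Some(now),
        Reason::User => None,
    }
}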

This branch adds a minimal, initial implementation of instance
auto-restarting.

[RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances

Fixes #5553
hawkw added a commit that referenced this pull request Sep 24, 2024
PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640

I believe this also fixes #2825, along with #6455 (which allowed
restarting `Failed` instances).
hawkw added a commit that referenced this pull request Sep 24, 2024
Currently, instances whose active VMM is `SagaUnwound` appear externally
as `Stopped`. We decided to report them as `Stopped` because start sagas
are permitted to run for instances with `SagaUnwound` active VMMs, and
--- at the time when the `SagaUnwound` VMM state was introduced ---
`Failed` instances could not be started. However, #6455 added the
ability to restart `Failed` instances, and #6652 will permit them to be
stopped. Therefore, we should recast instances with `SagaUnwound` active
VMMs as `Failed`: they weren't asked politely to stop; instead, we
attempted to start them and something went wrong...which sounds like
`Failed` to me.

This becomes more important in light of #6638: if we are going to
attempt to automatically restart such instances, they should definitely
appear to be `Failed`. The distinction between `Failed` and `Stopped` becomes
that `Failed` means "this thing isn't running, but it's supposed to be;
we may try to fix that for you if permitted to do so", while `Stopped`
means "this thing isn't running and that's fine, because you asked for
it to no longer be running". Thus, this commit changes `SagaUnwound`
VMMs to appear `Failed` externally.
hawkw added a commit that referenced this pull request Sep 24, 2024
PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640
Fixes #2825, along with #6455 (which allowed restarting `Failed`
instances).