Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata includes Nextstrain clades in the Nextclade_pango column #456

Closed
2 tasks done
joverlee521 opened this issue Jul 8, 2024 · 12 comments
Closed
2 tasks done
Assignees
Labels
bug Something isn't working

Comments

@joverlee521
Copy link
Contributor

joverlee521 commented Jul 8, 2024

TODOs

  • verify GISAID metadata output
  • verify open metadata output

Context

Determined as the root cause for nextstrain/forecasts-ncov#104

Both the private GISAID and public open metadata include Nextstrain clades (e.g. 24A (JN.1)) in the Nextclade_pango column.

The timing of the downstream errors coincides with the latest release of Nextclade datasets and the latest release of Nextclade v3.8.0

@joverlee521 joverlee521 added the bug Something isn't working label Jul 8, 2024
@joverlee521
Copy link
Contributor Author

The workflow blinding appends Nextclade TSV outputs. I'm assuming something changed in the outputs between Nextclade versions, leading to this error. (Will dig into this later...)

I'm going to cancel the currently running ncov-ingest jobs, upload the *.renew files to trigger a full Nextclade run, and re-run the ingest jobs.

@joverlee521
Copy link
Contributor Author

Uploaded *.renew files

$ aws s3 cp - s3://nextstrain-ncov-private/nextclade.tsv.zst.renew < /dev/null
$ aws s3 cp - s3://nextstrain-data/files/ncov/open/nextclade.tsv.zst.renew < /dev/null
$ aws s3 cp - s3://nextstrain-ncov-private/nextclade_21L.tsv.zst.renew < /dev/null
$ aws s3 cp - s3://nextstrain-data/files/ncov/open/nextclade_21L.tsv.zst.renew < /dev/null

Running GISAID and open workflows.

@joverlee521
Copy link
Contributor Author

Ah right, open workflow is blocked on #455. Oh how all stars aligned this weekend 😅

At least we'll resolve the issue in GISAID metadata for now.

@joverlee521
Copy link
Contributor Author

I'm assuming something changed in the outputs between Nextclade versions, leading to this error. (Will dig into this later...)

Nextclade 3.8.0 added new columns for the new relative mutations feature and changed the order of existing output columns:

order nextclade-3.7_columns_name nextclade-3.8_columns_name
1 index index
2 seqName seqName
3 clade clade
4 Nextclade_pango clade_display
5 partiallyAliased clade_who
6 clade_nextstrain clade_nextstrain
7 clade_who partiallyAliased
8 clade_display Nextclade_pango

The change in ordering is the root cause of this issue, but the new columns would have messed things up too.

Maybe to guard against this, we should at least have a check that the new Nextclade outputs columns match the columns of the Nextclade cache.

@corneliusroemer
Copy link
Member

Oh no, I always did renew whenever I updated Nextclade datasets, but the software version wasn't on my radar.

Thanks for tracking this down!

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jul 9, 2024

That's odd! The order was not supposed to be changed.

Would be nice to remove dependency on order where possible. Though I guess this is where the old rows are concatenated to the new rows, so this is not trivial to do. I'm thinking storing only the necessary columns in a predefined order might help. And then also transforming the incoming rows to the same shape. Actually, I just saw #457 and this is easier and more reliable.

Also I need to be more careful when releasing stuff :)

@corneliusroemer
Copy link
Member

Also I need to be more careful when releasing stuff :)

I don't think this is at all on you @ivan-aksamentov - TSV columns can change, we shouldn't rely on stability and invalidate cache properly.

@joverlee521
Copy link
Contributor Author

Oy, the GISAID ingest is still running after 20+ hours...if it runs past 24 hours, the GH Action workflow will show success/complete, but the job will continue to run on AWS Batch.

(I will add final stats.json to #446)

@corneliusroemer
Copy link
Member

Ah yeah - we almost never run with the 21L touchfiles, so it's quite some extra work to do. Choosing a larger machine in the future might work if we know there's a pending full rerun. Nextclade parallelizes very well, so can go to a machine with say 64 cores or even larger.

@joverlee521 joverlee521 self-assigned this Jul 9, 2024
@joverlee521
Copy link
Contributor Author

Nextclade parallelizes very well, so can go to a machine with say 64 cores or even larger.

Oh, I'm not sure the workflow is utilizing this because there are no threads defined for the Snakemake rule.

@corneliusroemer
Copy link
Member

Oh, I'm not sure the workflow is utilizing this because there are no threads defined for the Snakemake rule.

No need for threads, nextclade uses all by default. What I meant to say was that we can use more cores to speed things up if we find workflows take too long. See comment over at #459 :)

@joverlee521
Copy link
Contributor Author

Verified the full reruns of Nextclade fixed the metadata for both GISAID and open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants