Docs / SOP: How to review a data build #544

Open
joeflack4 opened this issue May 25, 2024 · 4 comments
Labels: documentation (Improvements or additions to documentation), ease:high, qc / test

Comments

@joeflack4
Contributor

joeflack4 commented May 25, 2024

Overview

Add / update a page to include these docs:


The data build is reviewed not for specific changes in content, but for general patterns of change. It is recommended to spend 10 minutes reviewing each data build.

There are two important reasons to review data builds: (1) looking out for large unexplainable changes and (2) increasing your familiarity with the data generated by the pipeline. The latter is as important as the former: as data stewards in the Mondo Ingest pipeline you should understand all generated data (every single file!) inside out, and the best way to do that is to review each file many times until it sticks. It is fine to use a data release to ask questions like: "what is the purpose of this file?".

Checklist

  1. Ensure that no files are added or removed. There are few good reasons for files being added or removed and if they happen they should be explained.
  2. ORDO, DOID and OMIM matches and migration files should have "reasonable" changes, i.e. be in line with what one could expect as a consequence of a few weeks' worth of curation (example: 1000 added lines is not a good sign, but 70 removed lines is within reason).
  3. Metrics and ontology-related files should change within reason: for numbers like axiom counts, changes of up to about 250 in either direction are nearly always OK, changes between 250 and 1000 are worth a second look, and changes beyond 1000 merit an investigation.
  4. (Almost) no file should be totally empty.
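Most of the checklist above could be pre-computed before the manual look. Below is a minimal sketch, assuming the previous and current builds are available as two local directories; the function names (`line_counts`, `classify_delta`, `compare_builds`) are made up for illustration and are not part of the pipeline. `classify_delta` encodes the thresholds from item 3 and is meant for numeric metrics such as axiom counts, while the raw line deltas for item 2 are only reported, since "reasonable" there is a judgement call.

```python
from pathlib import Path


def line_counts(build_dir):
    """Map each file's path (relative to build_dir) to its number of lines."""
    root = Path(build_dir)
    counts = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            with path.open("rb") as handle:
                counts[path.relative_to(root).as_posix()] = sum(1 for _ in handle)
    return counts


def classify_delta(delta):
    """Bucket a change in a numeric metric using the item 3 thresholds."""
    size = abs(delta)
    if size <= 250:
        return "ok"
    if size <= 1000:
        return "second look"
    return "investigate"


def compare_builds(old_dir, new_dir):
    """Summarize file-level differences between two data builds."""
    old, new = line_counts(old_dir), line_counts(new_dir)
    shared = sorted(set(old) & set(new))
    return {
        # Item 1: files that appeared or disappeared.
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        # Item 4: files in the new build with no content at all.
        "empty": sorted(name for name, n in new.items() if n == 0),
        # Item 2: per-file change in line count, for the reviewer to judge.
        "line_deltas": {
            name: new[name] - old[name] for name in shared if new[name] != old[name]
        },
    }
```

The output is only a starting point for the review: an entry under "added", "removed" or "empty" still needs a human to decide whether it is explainable.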

Checklist item details

4. (Almost) no file should be totally empty.

Nico:

Lexmatch files can be empty, although they should be predictably empty, i.e. the emptiness should be explainable, and I think such files should at least have the column headers in them.
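To tell a "predictably empty" file (headers present, no rows) from a totally empty one, something like the following could work. This is a sketch under assumptions: the files are TSVs, lines starting with `#` are metadata comments, and the first remaining non-blank line is the column header. The function names are hypothetical.

```python
def header_and_rows(lines):
    """Return (has_header, n_data_rows) for the lines of a TSV file.

    Assumption: lines starting with '#' are metadata comments, the first
    remaining non-blank line is the column header, the rest are data rows.
    """
    content = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    return (len(content) > 0, max(len(content) - 1, 0))


def describe_emptiness(lines):
    """Classify a file as 'totally empty', 'header only' or 'has data'."""
    has_header, n_rows = header_and_rows(lines)
    if not has_header:
        return "totally empty"
    if n_rows == 0:
        return "header only"
    return "has data"
```

A file that comes back as "header only" matches what Nico describes; "totally empty" is the case worth flagging.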

Additional info

Context: Original discussion

  • Approvers: At least 1 approver who is not the PR author is required.

Related

@joeflack4 joeflack4 self-assigned this May 25, 2024
@joeflack4 joeflack4 added the documentation and ease:high labels May 25, 2024
@joeflack4 joeflack4 changed the title Docs: How to review a data build Docs / SOP: How to review a data build Jun 13, 2024
@joeflack4
Contributor Author

Also maybe worth adding, since I remember these as well:

  • General face check (I suppose many of the bullets above qualify for that)
  • Lexmatch outputs: Check to make sure there are plenty of rows
  • Slurp outputs: Check to make sure there are plenty of rows
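"Plenty of rows" has no agreed number yet, so the sketch below takes the minimums as input rather than hard-coding them. It assumes row counts per file have already been collected; the glob patterns and thresholds in the usage example are placeholders, not actual file names or agreed limits.

```python
from fnmatch import fnmatch


def files_below_minimum(row_counts, minimums):
    """Return {filename: (rows, required)} for files with too few rows.

    row_counts: {filename: number of data rows}
    minimums:   {glob pattern: minimum rows}; the first matching pattern wins,
                and files matching no pattern are not checked.
    """
    flagged = {}
    for name, rows in row_counts.items():
        for pattern, required in minimums.items():
            if fnmatch(name, pattern):
                if rows < required:
                    flagged[name] = (rows, required)
                break
    return flagged
```

For example, with placeholder names and limits:

```python
counts = {"a.lexmatch.tsv": 3, "slurp/b.tsv": 500}
print(files_below_minimum(counts, {"*lexmatch*": 50, "*slurp*": 100}))
```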

@joeflack4
Contributor Author

We could also make a QC test / script, maybe even a GH action for it, but I don’t know if we’re at the point where that’s worth doing.

@joeflack4
Contributor Author

joeflack4 commented Jul 25, 2024

@matentzn About this criterion:

Ensure that no files are added or removed. There are few good reasons for files being added or removed and if they happen they should be explained.

I sometimes see files removed or added, but they are like lexical mapping closematch files, or broadmatch or maybe narrowmatch. I take it we're not quite so worried about some files being removed?

If exactmatch files, or certain other files, were being removed, I would be more worried.

@matentzn
Member

I sometimes see files removed or added, but they are like lexical mapping closematch files, or broadmatch or maybe narrowmatch. I take it we're not quite so worried about some files being removed?

Yeah, this can happen. Basically we need to learn as a group that "files removed" is only a warning sign, and judge internally whether it was expected or not (ideally by ourselves). Small non-exact files that disappear are usually no cause for concern.
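If it helps to make that judgement consistent, the rule of thumb could be written down roughly like this. It is only a sketch: it assumes exact-match files can be recognised by "exact" appearing in the file name, and the `small_file_rows` cutoff of 100 is an arbitrary placeholder, since "small" is not defined in this thread.

```python
def triage_removed_file(name, previous_rows, small_file_rows=100):
    """Suggest how seriously to take a file that disappeared from the build.

    Assumptions (not confirmed in this thread): exact-match files have
    'exact' in their name, and 'small' means at most small_file_rows rows.
    """
    if "exact" in name.lower():
        return "investigate"
    if previous_rows <= small_file_rows:
        return "probably fine"
    return "warning"
```

Anything returned here is a prompt for the reviewer, not a verdict: a removed file is still only a warning sign until someone decides whether it was expected.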
