RFO: Restructure, formalize, and validate MANIFEST files #50

Open
StevenCannon-USDA opened this issue May 29, 2024 · 5 comments


@StevenCannon-USDA
Contributor

We've had MANIFEST files nearly since the inception of the Data Store. (The content was originally included in the README file in each collection, but we decided early on to move the per-file metadata into two MANIFEST files.) However, these files aren't validated and, to my knowledge, aren't being used programmatically.

There are potential uses for per-file metadata, though. For example, in the diversity collections, there are often multiple VCFs. Which of these should be displayed in tools such as JBrowse and GCViT? We could use a file-naming convention, as we've done with e.g. "genome_main.gff3" in the annotations collections, but if multiple VCFs in a collection should be displayed, a label such as "main" isn't appropriate.

So, the proposal:

  1. Formalize the MANIFEST file as a (validated) yml document
  2. Merge the current "descriptions" and "correspondence" files into a single file MANIFEST.metadata_file_prefix.yml
  3. Allow additional fields with programmatic use, e.g. "display: true"

An example of the proposed merged, restructured file, from collection Glycine/max/diversity/Wm82.gnm2.div.Wickland_Battu_2017

cat MANIFEST.Wm82.gnm2.div.Wickland_Battu_2017.yml
---
- name: glyma.Wm82.gnm2.div.Wickland_Battu_2017.SNPdata1.vcf.gz
  description: genotype information from Population 1; 378 F2 lines resulting from
    a cross between Prize and an NMU-mutagenized individual of Williams 82.
  display: true
  prior_names:
    - glyma.Wm82.gnm2.div.RW0X.SNPdata1.vcf.gz
    - Pop1_SNPs_minDP2.vcf.gz
- name: glyma.Wm82.gnm2.div.Wickland_Battu_2017.SNPdata2.vcf.gz
  description: genotype information from Population 2; 391 F2 individuals from a
    cross between two breeding lines.
  display: true
  prior_names:
    - glyma.Wm82.gnm2.div.RW0X.SNPdata2.vcf.gz
    - Pop2_SNPs_minDP2.vcf.gz
- name: glyma.Wm82.gnm2.div.Wickland_Battu_2017.SNPdata3.vcf.gz
  description: genotype information from Population 3; 81 unrelated accessions
    that form an association panel.
  display: true
  prior_names:
    - glyma.Wm82.gnm2.div.RW0X.SNPdata3.vcf.gz
    - Pop3_SNPs_minDP2.vcf.gz
@StevenCannon-USDA
Contributor Author

Tagging especially @adf-ncgr, @ctcncgr, @nathanweeks for your review & consideration

@adf-ncgr
Contributor

Seems like a good idea to me, though I think some more thought is needed around the "additional fields with programmatic use" aspect. For example, "display: true" could be interpreted in a lot of ways (e.g. it could mean that any file not tagged with it shouldn't appear in the h5ai view). Maybe it would be swinging too far in the direction of specificity, but I could imagine using attributes in such a file to specify exactly which of our various systems a given data file has been (or should or should not be) included in (e.g. glycinemine: true, sequenceserver.legumeinfo.org: false). But I definitely like having a programmatic location for attributes like description that could be consumed by things like the autocontent scripts.

As far as file-naming conventions are concerned, I see your point but I do also think that we ought to maintain the established file naming conventions (e.g. genome_main.fna) where they suffice (you probably weren't suggesting that we overturn them, just wanted to be clear about it...)

@StevenCannon-USDA
Contributor Author

"we ought to maintain the established file naming conventions (e.g. genome_main.fna)"

Yes, for sure.

This conversation arose regarding the diversity collections, which have been minimally specified to date. It is possible that the extra field(s) would only be used for that file type, but I could imagine them being used for others such as synteny tracks, expression data, etc.

@StevenCannon-USDA
Contributor Author

OK, I have generated (provisional) MANIFEST files for all of the Glycine/max/diversity collections.
The intent is for those to be tracked as metadata, so I have modified the .gitignore to ignore the two-file MANIFEST files elsewhere throughout the Data Store. Specifically, these are ignored:

  MANIFEST.*.correspondence.yml
  MANIFEST.*.descriptions.yml

... while this one is tracked:

  MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml
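
As a rough illustration of how those two ignore patterns separate the old two-file MANIFESTs from the merged one, here is a minimal Python sketch. It uses shell-style matching from the standard library's `fnmatch` module, which only approximates gitignore semantics; the helper name `is_ignored` and the old-style `descriptions` filename in the usage lines are hypothetical.

```python
from fnmatch import fnmatchcase

# The two patterns listed above as ignored.
IGNORED_PATTERNS = [
    "MANIFEST.*.correspondence.yml",
    "MANIFEST.*.descriptions.yml",
]


def is_ignored(filename):
    """Return True if the bare filename matches one of the ignore patterns.

    Assumption: shell-style matching on the filename alone, which is only
    an approximation of how git evaluates .gitignore rules.
    """
    return any(fnmatchcase(filename, pattern) for pattern in IGNORED_PATTERNS)


# Hypothetical old-style name (ignored) versus the merged file (tracked).
print(is_ignored("MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.descriptions.yml"))
print(is_ignored("MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml"))
```

The merged file is not matched because neither pattern allows the name to end directly in the collection name followed by `.yml`.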

@StevenCannon-USDA
Contributor Author

After discussion at the LIS/PB/SB meeting today, I have renamed the "display" field to "applications", and implemented it for the Glycine/max/diversity collections. @nathanweeks @adf-ncgr

Here is an approximate specification -- which I'll put in place at the datastore-specifications repository pending discussion here:


For each Data Store collection, a single file MANIFEST.collection_name.yml will be used to provide basic information about data files in the collection.

The MANIFEST file must include, for each data file (bgzipped or, in rare special cases, gzipped), the name of the file and a description of the file. The MANIFEST file should NOT include index files (e.g. .gz.tbi or .gz.fai) and should NOT include other metadata files.

Optional additional fields: "applications", a YAML array of one or more applications that should use the indicated file; and "prior_names", a YAML array of one or more previous names for the indicated file.

If no application in LegumeInfo/PeanutBase/SoyBase directly consumes a file, the "applications" field should be omitted. The field should also be omitted when an application discovers the file by other means (e.g. genome_main.fna and gene_models_main.gff3).

The file must be valid YAML, as tested with yamllint or equivalent.
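
Beyond YAML syntax, the content rules above could also be checked programmatically. The following is a minimal sketch only, not an agreed validator: it assumes the file has already been parsed (by any YAML library) into a list of mappings, and the function name `validate_manifest` and the wording of its messages are hypothetical.

```python
# Sketch of a content check for a parsed MANIFEST document.
# Assumption: `records` is what a YAML parser returns for the file,
# i.e. a list of mappings with string keys.

REQUIRED_FIELDS = ("name", "description")
OPTIONAL_LIST_FIELDS = ("applications", "prior_names")
# Index-file suffixes named in the specification text.
INDEX_SUFFIXES = (".tbi", ".fai")


def validate_manifest(records):
    """Return a list of problem descriptions; an empty list means no problems."""
    if not isinstance(records, list):
        return ["top level must be a list of file entries"]

    problems = []
    for i, rec in enumerate(records):
        where = f"entry {i}"
        if not isinstance(rec, dict):
            problems.append(f"{where}: must be a mapping")
            continue

        # Each entry must carry a non-empty name and description.
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{where}: missing or empty '{field}'")

        # Index files should not be listed in the MANIFEST.
        name = rec.get("name")
        if isinstance(name, str) and name.endswith(INDEX_SUFFIXES):
            problems.append(f"{where}: index file '{name}' should not be listed")

        # Optional fields, when present, must be non-empty lists of strings.
        for field in OPTIONAL_LIST_FIELDS:
            if field in rec:
                value = rec[field]
                ok = (
                    isinstance(value, list)
                    and len(value) > 0
                    and all(isinstance(v, str) for v in value)
                )
                if not ok:
                    problems.append(
                        f"{where}: '{field}' must be a non-empty list of strings"
                    )
    return problems
```

A check like this would run after yamllint, since it presumes the file already parses.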

Example:

cat Wm82.gnm1.div.Hu_Zhang_2020/MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml 
---
- name: glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.hmp.gz
  description: Genotype information for 96 wild soybean accessions in hapmap format
  prior_names:
    - glyma.Wm82.gnm1.div.WKJG.SNPdata.hmp.gz
    - 96w_used_tassel.hmp.txt
- name: glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz
  description: Genotype information for 96 wild soybean accessions in VCF format
  applications:
    - jbrowse
  prior_names:
    - glyma.Wm82.gnm1.div.WKJG.SNPdata.vcf.gz
    - 96w_used.vcf.gz
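
To illustrate how a consuming tool might use the "applications" field, here is a small Python sketch over the two example entries above, written as already-parsed records. The helper name `files_for_application` is hypothetical, and descriptions and prior names are left out of the records for brevity.

```python
def files_for_application(records, application):
    """Return the names of files whose 'applications' list includes `application`.

    Entries without an 'applications' field are treated as not consumed
    directly by any application, as the specification describes.
    """
    return [
        rec["name"]
        for rec in records
        if application in (rec.get("applications") or [])
    ]


# The two entries from the example MANIFEST, reduced to the relevant fields.
manifest = [
    {"name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.hmp.gz"},
    {
        "name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz",
        "applications": ["jbrowse"],
    },
]

print(files_for_application(manifest, "jbrowse"))
# → ['glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz']
```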
