-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
99 lines (65 loc) · 5.03 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
Migration of valid Release 16 Metadata and Primary Files to Release 17 (Dictionary version 0.9a)
IMPORTANT NOTES:
================
- These scripts will overwrite the original submission files
- These scripts can only be used on valid Release 16 submission files
- One submission files are processed, please upload to DCC submission system and validate.
- Files must be valid against current dictionary in order to be included in Release 17
Contents:
=========
After archive is decompressed, a folder will be created called "migrate_icgc16_to_icgc17" with the following contents:
1. remove_columns.sh
A simple bash script that will modify validated Release 16 submission files to structurally conform to Release 17 data specifications by removing columns no longer required:
- Remove 'analyzed_sample_type' and 'analyzed_sample_type_other' fields from 'sample' file
- Remove 'note' column from all files
For more information, please see Release Notes available at http://docs.icgc.org/dictionary
2. specimen_type_mapping.py
This Python script will map specimen_type DCC terms used in Release 16 to the current specimen_type codelist in Release 17.
3. mapping_specimen_type.pl
An alternative Perl script that will perform the same function as the Python specimen_type_mapping.py script.
4. oldDCC_to_newDCC_specimen_type_mapping.txt
A tab-delimited file consisting of a mapping between the old DCC specimen_type terms and the current DCC specimen_type terms. For more details, please review documentation at http://docs.icgc.org/specimen-type-mapping
Requirements:
=============
- Linux
- If using Python script specimen_type_mapping.py:
- at least Python 2.7
Python Modules:
- gzip https://docs.python.org/2/library/gzip.html
- bz2 https://docs.python.org/2/library/bz2.html
- fnmatch https://docs.python.org/2/library/fnmatch.html
- collections https://docs.python.org/2/library/collections.html
- If using Perl script mapping_specimen_type.pl:
Perl Modules:
- PerlIO::gzip http://search.cpan.org/~majensen/PerlIO-via-gzip-0.021/lib/PerlIO/via/gzip.pm
- PerlIO::via::Bzip2 http://search.cpan.org/~arjen/PerlIO-via-Bzip2-0.02/lib/PerlIO/via/Bzip2.pm
Instructions:
=============
1. Run remove_columns.sh bash script to remove columns no longer needed
To execute:
If running the script from the same folder as submission files (only Release 16 files should be in this folder):
> ./remove_columns.sh
If running script outside of the submission file directory, you can use the "-i" flag to specify the directory where the Release 16 files reside:
> ./remove_columns.sh -i <name of directory where Release 16 submission files are located>
2. Convert old DCC specimen_type terms to new DCC specimen_type terms. Both a Perl version and a Python version are offered to perform this conversion - you only need to use one! (ie. if you are a Perl user, use can use the Perl version)
To execute:
- If using Python script (specimen_type_mapping.py):
> python specimen_type_mapping.py -i <name of input directory>
- Alternatively, if you prefer the Perl script (mapping_specimen_type.pl):
> perl mapping_specimen_type.pl <name of input directory>
IMPORTANT NOTES:
In most cases there is a one-to-one mapping between Release 16 specimen_type terms and Release 17 specimen_type terms. However, there are two previously used terms ("bone marrow" and "lymph node") which are ambigious (could be either normal or tumour) and will require the submitter to specify/convert these manually. Such cases will be reported in the DCC_specimen_type_mapping.log log file.
Example: If the specimen_type was specified as "7" (lymph node) in Release 16, you will need to chagne it to a more specific specimen type, either 107 (Normal - lymph node) or 119 (Metastatic tumour - lymph node)
_______________________________________________________________________________________________________________________
| | | | |
| Old specimen_type: | Old DCC codelist ID: | New specimen_type: | New DCC Codelist ID: |
|=======================|=======================|===============================================|======================|
| lymph_node | 7 | Normal – lymph node | 107 |
| | | Metastatic tumour – lymph node | 119 |
|_______________________|_______________________|_______________________________________________|______________________|
| bone marrow | 6 | Normal – bone marrow | 103 |
| | | Primary tumour – blood derived (bone marrow) | 111 |
| | | Recurrent tumour – blood derived (bone marrow)| 116 |
|_______________________|_______________________|_______________________________________________|______________________|
3. Once files are converted, upload submission files to submission system at http://submissions.dcc.icgc.org and validate. If files are valid, please sign-off on your submission for it be included in Release 17 (** before submission deadline August 15th).
For questions or concerns, please contact [email protected]