Merge pull request #26 from yaoluxun/master

Add move table function
WGLab · Nov 21, 2019 · 503af50 · 503af50
2 parents 9fea8e6 + c812282
commit 503af50
Show file tree

Hide file tree

Showing 6 changed files with 247 additions and 150 deletions.
diff --git a/README.md b/README.md
@@ -2,24 +2,24 @@
 
 ## Methodology of DeepMod
 
-DeepMod is a computational tool which takes long-read signals as input and outputs modification summary for each genomic position in a reference genome together with modification prediction for each base in a long read. The modification prediction model in DeepMod is a well-trained bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) units. LSTM RNN is a class of artificial neural network for modeling sequential behaviors with LSTM to preclude vanishing gradient problem.  To detect DNA modifications, normalized signals of events in a long read were rescaled from -5 and 5, and signal mean, standard deviation and the number of signals together with base information (denoted by 7-feature description) were obtained for each event as input of a LSTM unit with 100 hidden nodes. In DeepMod with 3 hidden layers in RNN. Predicted modification summary for each position would be generated in a BED format, suggesting how many reads cover genomic positions, how many mapped bases in long reads were predicted to be modified and the coverage percentage of prediction modifications. This modification prediction by DeepMod is strand-sensitive and single-nucleotide based. 
+DeepMod is a computational tool which takes long-read signals as input and outputs modification summary for each genomic position in a reference genome together with modification prediction for each base in a long read. The modification prediction model in DeepMod is a well-trained bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) units. LSTM RNN is a class of artificial neural network for modeling sequential behaviors with LSTM to preclude vanishing gradient problem.  To detect DNA modifications, normalized signals of events in a long read were rescaled from -5 and 5, and signal mean, standard deviation and the number of signals together with base information (denoted by 7-feature description) were obtained for each event as input of a LSTM unit with 100 hidden nodes. In DeepMod with 3 hidden layers in RNN. Predicted modification summary for each position would be generated in a BED format, suggesting how many reads cover genomic positions, how many mapped bases in long reads were predicted to be modified and the coverage percentage of prediction modifications. This modification prediction by DeepMod is strand-sensitive and single-nucleotide based.
 
 ### Inputs of DeepMod
 
-The input of DeepMod is Nanopore long read data together a refrence genome. 
+The input of DeepMod is Nanopore long read data together a refrence genome.
 
 ## System Requirements
 ### Hardware requirements
 DeepMod is based on deep learning framework, and needs to access raw data of Nanopore sequencing. Thus, it needs enough RAM to support deep learning framework and enough hard drive for raw data of Nanopore sequencing. GPU can substantially speedup the detection process. For optimal performance, we recommend a computer with:
  * RAM: 20+ GB per thread
  * GPU or CPU with 8+ cores
  * HDD or better with SSD. Dependent on how large raw data is (for 30X E coli data, it might need 10+GB, while for 30X human data, it might need 10+TB)
- 
+
 ### Software requirements
 The developmental version of DeepMod has been tested on Linux operating system: CentOS 7.0 with both CPU and GPU machines.
 
 ### Future improvement
-Now, DeepMod does not support multi-fast5 and Move table. For multi-fast5 issue, one can use API at https://github.com/nanoporetech/ont_fast5_api to convert multi-fast5 to single fast5 file, and then re-basecall to get event information as input of DeepMod. We have been working on improvement of DeepMod to support both multi-fast5 and Move table. 
+Now, DeepMod supports basecalled data with either event tables or move tables. But it does not support multi-fast5. For multi-fast5 issue, one can use API at https://github.com/nanoporetech/ont_fast5_api to convert multi-fast5 to single fast5 file, and then re-basecall to get event information as input of DeepMod. We have been working on improvement of DeepMod to support multi-fast5.
 
 ## Installation
 Please refer to [Installation](https://github.com/WGLab/DeepMod/blob/master/docs/Install.md) for how to install DeepMod.
@@ -38,7 +38,7 @@ For release history, please visit [here](https://github.com/WGLab/NanoDeepMod/re
 
 ## Contact
 
-If you have any questions/issues/bugs, please post them on [GitHub](https://github.com/WGLab/DeepMod/issues). They would also be helpful to other users. 
+If you have any questions/issues/bugs, please post them on [GitHub](https://github.com/WGLab/DeepMod/issues). They would also be helpful to other users.
 
 ## Reference
 **Please cite the publication below if you use our tool:**

diff --git a/bin/DeepMod.py b/bin/DeepMod.py
@@ -63,7 +63,7 @@ def mCommonParam(margs):
    if moptions['outFolder']==None or (not os.path.isdir(moptions['outFolder'])):
       try:
          os.system('mkdir -p '+moptions['outFolder']);
-      except: 
+      except:
          ErrorMessage = ErrorMessage + ("\n\tThe output folder (%s) does not exist and cannot be created." % moptions['outFolder'])
 
    # check all data in a recurive way
@@ -85,6 +85,8 @@ def mCommonParam(margs):
 
    moptions['SignalGroup'] = margs.SignalGroup;
 
+   moptions['move'] = margs.move
+
    return [moptions, ErrorMessage]
 
 #
@@ -99,7 +101,7 @@ def mDetect(margs):
    moptions['basecall_1d'] = margs.basecall_1d
    moptions['basecall_2strand'] = margs.basecall_2strand
    # Whether consider those chromosome which contain -_:/
-   # default: yes; 
+   # default: yes;
    moptions['ConUnk'] = margs.ConUnk
    # output layer information for deep learning
    moptions['outputlayer'] = margs.outputlayer
@@ -168,7 +170,7 @@ def mDetect(margs):
       parser.print_help();
       parser.parse_args(['detect', '--help']);
       sys.exit(1)
-   
+
    from scripts import myDetect
    myDetect.mDetect_manager(moptions)
 
@@ -211,7 +213,7 @@ def mTrain(margs):
       if moptions['test'][0] in ['-']:
          moptions['test'][1] = int(moptions['test'][1]) * (10**6)
          moptions['test'][2] = int(moptions['test'][2]) * (10**6)
-      else: moptions['test'][1] = int(moptions['test'][1])/100.0 
+      else: moptions['test'][1] = int(moptions['test'][1])/100.0
    else: moptions['test'] = ['N', '100']
 
    # print help document if necessary options are not provided.
@@ -232,7 +234,7 @@ def mTrain(margs):
 #
 def mGetFeatures(margs):
    from scripts import myGetFeatureBasedPos
-   
+
    # get common options
    moptions, ErrorMessage = mCommonParam(margs)
    # motif-based data: positive or negative control data
@@ -254,7 +256,7 @@ def mGetFeatures(margs):
       rsp = margs.region.split(':')
       for rv_ind in range(len(rsp)):
          rsp[rv_ind] = rsp[rv_ind].strip();
-         if not rsp[rv_ind]=='': 
+         if not rsp[rv_ind]=='':
             moptions['region'][rv_ind] = rsp[rv_ind]
 
    # referene genome
@@ -309,6 +311,7 @@ def mGetFeatures(margs):
 com_group_for_comparison.add_argument("--windowsize", type=int, default=21, help="The window size to extract features. Default: 21")
 com_group_for_comparison.add_argument("--alignStr", type=str, default='minimap2', choices=["bwa","minimap2"], help="Alignment tools (bwa or minimap2 is supported). Default: minimap2")
 com_group_for_comparison.add_argument("--SignalGroup", type=str, default='simple', choices=["simple","rundif"], help="How to associate signals to each called bases. Default: simple")
+com_group_for_comparison.add_argument("--move", default=False, action="store_true", help="Whether the basecalled data use move tables instead of event tables. Default: False")
 
 # add detection options
 parser_detect = subparsers.add_parser('detect', parents=[parent_parser], help="Detect modifications at a genomic scale", description="Detect modifications by integrating all long reads for a genome", epilog="For example, \n \
@@ -323,7 +326,7 @@ def mGetFeatures(margs):
 parser_detect.add_argument("--basecall_1d", default="Basecall_1D_000", help="Path for basecall_1d. Default: Basecall_1D_000")
 parser_detect.add_argument("--basecall_2strand", default="BaseCalled_template", help="Path for basecall_2strand. Default: BaseCalled_template")
 parser_detect.add_argument("--region", default=None, help="The region of interest: for example, chr:1:100000;chr2:10000");
-parser_detect.add_argument("--ConUnk", default=True, choices=[False, True], help="Whether contain unknown chromosome"); 
+parser_detect.add_argument("--ConUnk", default=True, choices=[False, True], help="Whether contain unknown chromosome");
 parser_detect.add_argument("--outputlayer", default="", choices=["", "sigmoid"], help="how to put activation function for output layer")
 parser_detect.add_argument("--Base", type=str, default='C', choices=['A', 'C', 'G', 'T'], help="Interest of bases");
 parser_detect.add_argument("--mod_cluster", default=0, choices=[0,1], help="1: CpG cluster effect; 0: not");
@@ -373,4 +376,3 @@ def mGetFeatures(margs):
 else:
    args = parser.parse_args()
    args.func(args);
-
diff --git a/bin/scripts/MoveTable.py b/bin/scripts/MoveTable.py
@@ -0,0 +1,79 @@
+
+import os,sys
+import numpy as np
+import h5py
+
+
+def getMove_Info(moptions, sp_param, move_data):
+    '''
+    sp_param.keys: fq_seq, raw_signals, first_sample_template, duration_template
+    '''
+
+    #sp_param['first_sample_template'] = sp_param['f5reader']['/Analyses/Segmentation_001/Summary/segmentation'].attrs['first_sample_template']
+    #sp_param['duration_template'] = sp_param['f5reader']['/Analyses/Segmentation_001/Summary/segmentation'].attrs['duration_template']
+
+    seg = "Segmentation_" + moptions['basecall_1d'].split('_')[-1]
+    attr_path = '/'.join(['', 'Analyses', seg, 'Summary', 'segmentation'])
+    #mv_str = '/'.join(['', 'Analyses', moptions['basecall_1d'], moptions['basecall_2strand'], 'Move'])
+    sp_param['first_sample_template'] = sp_param['f5reader'][attr_path].attrs['first_sample_template']
+    sp_param['duration_template'] = sp_param['f5reader'][attr_path].attrs['duration_template']
+    #move_data = sp_param['f5reader'][mv_str][()]
+    nrow = len(sp_param['fq_seq']) # row number of event_info; equals to the base number
+    nsig = len(sp_param['raw_signals'])
+    first = int(sp_param['first_sample_template'])
+    duration = int(sp_param['duration_template'])
+    move_info = np.empty([nrow], dtype=[('mean', '<f4'), ('stdv', '<f4'), ('start', np.uint64), ('length', np.uint64), ('model_state', 'U5')])
+    effect_sig_index = list(range(first, nsig))
+    pivot = first
+    seg_count = 0 #which segmentation
+    for i in range(1, len(move_data)):
+        if move_data[i] == 1:
+            move_info[seg_count]['mean'] = np.mean(sp_param['raw_signals'][pivot:(2*i + first)])
+            move_info[seg_count]['length'] = 2*i + first - pivot
+            move_info[seg_count]['stdv'] = np.std(sp_param['raw_signals'][pivot:(2*i + first)])
+            move_info[seg_count]['start'] = pivot
+            if seg_count == 0:
+                move_info[seg_count]['model_state'] = 'N'*2 + sp_param['fq_seq'][seg_count:seg_count+3]
+            elif seg_count == 1:
+                move_info[seg_count]['model_state'] = 'N' + sp_param['fq_seq'][seg_count-1:seg_count+3]
+            elif seg_count == nrow-2:
+                move_info[seg_count]['model_state'] = sp_param['fq_seq'][seg_count-2:seg_count+2] + 'N'
+            else:
+                move_info[seg_count]['model_state'] = sp_param['fq_seq'][seg_count-2 : seg_count+3]
+            pivot = 2*i + first
+            seg_count += 1
+    move_info[seg_count]['mean'] = np.mean(sp_param['raw_signals'][pivot:nsig])
+    move_info[seg_count]['length'] = nsig - pivot
+    move_info[seg_count]['stdv'] = np.std(sp_param['raw_signals'][pivot:nsig])
+    move_info[seg_count]['start'] = pivot
+    move_info[seg_count]['model_state'] = sp_param['fq_seq'][seg_count-2:seg_count+1] + 'N'*2
+    return move_info
+
+
+if __name__=='__main__':
+   moptions = {}
+   sp_param = {}
+
+   exple_data = ['/mnt/isilon/wang_lab/shared/temp_shared/fast5_move/IBDUCAL377261L_20170201_FNfab41074_MN17640_mux_scan_X209_66786_ch12_read102_strand.fast5', \
+                 '/mnt/isilon/wang_lab/shared/temp_shared/fast5_move/IBDUCAL377261L_20170201_FNfab41074_MN17640_mux_scan_X209_66786_ch47_read165_strand.fast5', \
+                 '/mnt/isilon/wang_lab/shared/temp_shared/fast5_move/IBDUCAL377261L_20170201_FNfab41074_MN17640_mux_scan_X209_66786_ch48_read38_strand.fast5', \
+                 '/mnt/isilon/wang_lab/shared/temp_shared/fast5_move/IBDUCAL377261L_20170201_FNfab41074_MN17640_mux_scan_X209_66786_ch52_read12_strand.fast5' \
+                 ]
+
+   sp_param['f5reader'] = h5py.File(exple_data[0], 'r');
+
+
+   fq_str = '/Analyses/Basecall_1D_001/BaseCalled_template/Fastq'
+   mv_str = '/Analyses/Basecall_1D_001/BaseCalled_template/Move'
+   sg_str = '/Raw/Reads/'
+
+   sp_param['fq_seq'] = sp_param['f5reader'][fq_str][()].splitlines()[1]
+   k = list(sp_param['f5reader'][sg_str].keys())[0]
+   sp_param['raw_signals'] = sp_param['f5reader'][sg_str][k]['Signal'][()]
+   sp_param['first_sample_template'] = sp_param['f5reader']['/Analyses/Segmentation_001/Summary/segmentation'].attrs['first_sample_template']
+   sp_param['duration_template'] = sp_param['f5reader']['/Analyses/Segmentation_001/Summary/segmentation'].attrs['duration_template']
+   move_data = sp_param['f5reader'][mv_str][()]
+
+   getEvent_Info(moptions, sp_param, move_data)
+
+   sp_param['f5reader'].close();