Skip to content

Commit

Permalink
Merge pull request #26 from yaoluxun/master
Browse files Browse the repository at this point in the history
Add move table function
  • Loading branch information
liuqianhn authored Nov 21, 2019
2 parents 9fea8e6 + c812282 commit 503af50
Show file tree
Hide file tree
Showing 6 changed files with 247 additions and 150 deletions.
10 changes: 5 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,24 +2,24 @@

## Methodology of DeepMod

DeepMod is a computational tool which takes long-read signals as input and outputs modification summary for each genomic position in a reference genome together with modification prediction for each base in a long read. The modification prediction model in DeepMod is a well-trained bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) units. LSTM RNN is a class of artificial neural network for modeling sequential behaviors with LSTM to preclude vanishing gradient problem. To detect DNA modifications, normalized signals of events in a long read were rescaled from -5 and 5, and signal mean, standard deviation and the number of signals together with base information (denoted by 7-feature description) were obtained for each event as input of a LSTM unit with 100 hidden nodes. In DeepMod with 3 hidden layers in RNN. Predicted modification summary for each position would be generated in a BED format, suggesting how many reads cover genomic positions, how many mapped bases in long reads were predicted to be modified and the coverage percentage of prediction modifications. This modification prediction by DeepMod is strand-sensitive and single-nucleotide based.
DeepMod is a computational tool which takes long-read signals as input and outputs modification summary for each genomic position in a reference genome together with modification prediction for each base in a long read. The modification prediction model in DeepMod is a well-trained bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) units. LSTM RNN is a class of artificial neural network for modeling sequential behaviors with LSTM to preclude vanishing gradient problem. To detect DNA modifications, normalized signals of events in a long read were rescaled from -5 and 5, and signal mean, standard deviation and the number of signals together with base information (denoted by 7-feature description) were obtained for each event as input of a LSTM unit with 100 hidden nodes. In DeepMod with 3 hidden layers in RNN. Predicted modification summary for each position would be generated in a BED format, suggesting how many reads cover genomic positions, how many mapped bases in long reads were predicted to be modified and the coverage percentage of prediction modifications. This modification prediction by DeepMod is strand-sensitive and single-nucleotide based.

### Inputs of DeepMod

The input of DeepMod is Nanopore long read data together a refrence genome.
The input of DeepMod is Nanopore long read data together a refrence genome.

## System Requirements
### Hardware requirements
DeepMod is based on deep learning framework, and needs to access raw data of Nanopore sequencing. Thus, it needs enough RAM to support deep learning framework and enough hard drive for raw data of Nanopore sequencing. GPU can substantially speedup the detection process. For optimal performance, we recommend a computer with:
* RAM: 20+ GB per thread
* GPU or CPU with 8+ cores
* HDD or better with SSD. Dependent on how large raw data is (for 30X E coli data, it might need 10+GB, while for 30X human data, it might need 10+TB)

### Software requirements
The developmental version of DeepMod has been tested on Linux operating system: CentOS 7.0 with both CPU and GPU machines.

### Future improvement
Now, DeepMod does not support multi-fast5 and Move table. For multi-fast5 issue, one can use API at https://github.com/nanoporetech/ont_fast5_api to convert multi-fast5 to single fast5 file, and then re-basecall to get event information as input of DeepMod. We have been working on improvement of DeepMod to support both multi-fast5 and Move table.
Now, DeepMod supports basecalled data with either event tables or move tables. But it does not support multi-fast5. For multi-fast5 issue, one can use API at https://github.com/nanoporetech/ont_fast5_api to convert multi-fast5 to single fast5 file, and then re-basecall to get event information as input of DeepMod. We have been working on improvement of DeepMod to support multi-fast5.

## Installation
Please refer to [Installation](https://github.com/WGLab/DeepMod/blob/master/docs/Install.md) for how to install DeepMod.
Expand All @@ -38,7 +38,7 @@ For release history, please visit [here](https://github.com/WGLab/NanoDeepMod/re

## Contact

If you have any questions/issues/bugs, please post them on [GitHub](https://github.com/WGLab/DeepMod/issues). They would also be helpful to other users.
If you have any questions/issues/bugs, please post them on [GitHub](https://github.com/WGLab/DeepMod/issues). They would also be helpful to other users.

## Reference
**Please cite the publication below if you use our tool:**
Expand Down
18 changes: 10 additions & 8 deletions bin/DeepMod.py
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ def mCommonParam(margs):
if moptions['outFolder']==None or (not os.path.isdir(moptions['outFolder'])):
try:
os.system('mkdir -p '+moptions['outFolder']);
except:
except:
ErrorMessage = ErrorMessage + ("\n\tThe output folder (%s) does not exist and cannot be created." % moptions['outFolder'])

# check all data in a recurive way
Expand All @@ -85,6 +85,8 @@ def mCommonParam(margs):

moptions['SignalGroup'] = margs.SignalGroup;

moptions['move'] = margs.move

return [moptions, ErrorMessage]

#
Expand All @@ -99,7 +101,7 @@ def mDetect(margs):
moptions['basecall_1d'] = margs.basecall_1d
moptions['basecall_2strand'] = margs.basecall_2strand
# Whether consider those chromosome which contain -_:/
# default: yes;
# default: yes;
moptions['ConUnk'] = margs.ConUnk
# output layer information for deep learning
moptions['outputlayer'] = margs.outputlayer
Expand Down Expand Up @@ -168,7 +170,7 @@ def mDetect(margs):
parser.print_help();
parser.parse_args(['detect', '--help']);
sys.exit(1)

from scripts import myDetect
myDetect.mDetect_manager(moptions)

Expand Down Expand Up @@ -211,7 +213,7 @@ def mTrain(margs):
if moptions['test'][0] in ['-']:
moptions['test'][1] = int(moptions['test'][1]) * (10**6)
moptions['test'][2] = int(moptions['test'][2]) * (10**6)
else: moptions['test'][1] = int(moptions['test'][1])/100.0
else: moptions['test'][1] = int(moptions['test'][1])/100.0
else: moptions['test'] = ['N', '100']

# print help document if necessary options are not provided.
Expand All @@ -232,7 +234,7 @@ def mTrain(margs):
#
def mGetFeatures(margs):
from scripts import myGetFeatureBasedPos

# get common options
moptions, ErrorMessage = mCommonParam(margs)
# motif-based data: positive or negative control data
Expand All @@ -254,7 +256,7 @@ def mGetFeatures(margs):
rsp = margs.region.split(':')
for rv_ind in range(len(rsp)):
rsp[rv_ind] = rsp[rv_ind].strip();
if not rsp[rv_ind]=='':
if not rsp[rv_ind]=='':
moptions['region'][rv_ind] = rsp[rv_ind]

# referene genome
Expand Down Expand Up @@ -309,6 +311,7 @@ def mGetFeatures(margs):
com_group_for_comparison.add_argument("--windowsize", type=int, default=21, help="The window size to extract features. Default: 21")
com_group_for_comparison.add_argument("--alignStr", type=str, default='minimap2', choices=["bwa","minimap2"], help="Alignment tools (bwa or minimap2 is supported). Default: minimap2")
com_group_for_comparison.add_argument("--SignalGroup", type=str, default='simple', choices=["simple","rundif"], help="How to associate signals to each called bases. Default: simple")
com_group_for_comparison.add_argument("--move", default=False, action="store_true", help="Whether the basecalled data use move tables instead of event tables. Default: False")

# add detection options
parser_detect = subparsers.add_parser('detect', parents=[parent_parser], help="Detect modifications at a genomic scale", description="Detect modifications by integrating all long reads for a genome", epilog="For example, \n \
Expand All @@ -323,7 +326,7 @@ def mGetFeatures(margs):
parser_detect.add_argument("--basecall_1d", default="Basecall_1D_000", help="Path for basecall_1d. Default: Basecall_1D_000")
parser_detect.add_argument("--basecall_2strand", default="BaseCalled_template", help="Path for basecall_2strand. Default: BaseCalled_template")
parser_detect.add_argument("--region", default=None, help="The region of interest: for example, chr:1:100000;chr2:10000");
parser_detect.add_argument("--ConUnk", default=True, choices=[False, True], help="Whether contain unknown chromosome");
parser_detect.add_argument("--ConUnk", default=True, choices=[False, True], help="Whether contain unknown chromosome");
parser_detect.add_argument("--outputlayer", default="", choices=["", "sigmoid"], help="how to put activation function for output layer")
parser_detect.add_argument("--Base", type=str, default='C', choices=['A', 'C', 'G', 'T'], help="Interest of bases");
parser_detect.add_argument("--mod_cluster", default=0, choices=[0,1], help="1: CpG cluster effect; 0: not");
Expand Down Expand Up @@ -373,4 +376,3 @@ def mGetFeatures(margs):
else:
args = parser.parse_args()
args.func(args);

79 changes: 79 additions & 0 deletions bin/scripts/MoveTable.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@

import os,sys
import numpy as np
import h5py


def getMove_Info(moptions, sp_param, move_data):
'''
sp_param.keys: fq_seq, raw_signals, first_sample_template, duration_template
'''

#sp_param['first_sample_template'] = sp_param['f5reader']['/Analyses/Segmentation_001/Summary/segmentation'].attrs['first_sample_template']
#sp_param['duration_template'] = sp_param['f5reader']['/Analyses/Segmentation_001/Summary/segmentation'].attrs['duration_template']

seg = "Segmentation_" + moptions['basecall_1d'].split('_')[-1]
attr_path = '/'.join(['', 'Analyses', seg, 'Summary', 'segmentation'])
#mv_str = '/'.join(['', 'Analyses', moptions['basecall_1d'], moptions['basecall_2strand'], 'Move'])
sp_param['first_sample_template'] = sp_param['f5reader'][attr_path].attrs['first_sample_template']
sp_param['duration_template'] = sp_param['f5reader'][attr_path].attrs['duration_template']
#move_data = sp_param['f5reader'][mv_str][()]
nrow = len(sp_param['fq_seq']) # row number of event_info; equals to the base number
nsig = len(sp_param['raw_signals'])
first = int(sp_param['first_sample_template'])
duration = int(sp_param['duration_template'])
move_info = np.empty([nrow], dtype=[('mean', '<f4'), ('stdv', '<f4'), ('start', np.uint64), ('length', np.uint64), ('model_state', 'U5')])
effect_sig_index = list(range(first, nsig))
pivot = first
seg_count = 0 #which segmentation
for i in range(1, len(move_data)):
if move_data[i] == 1:
move_info[seg_count]['mean'] = np.mean(sp_param['raw_signals'][pivot:(2*i + first)])
move_info[seg_count]['length'] = 2*i + first - pivot
move_info[seg_count]['stdv'] = np.std(sp_param['raw_signals'][pivot:(2*i + first)])
move_info[seg_count]['start'] = pivot
if seg_count == 0:
move_info[seg_count]['model_state'] = 'N'*2 + sp_param['fq_seq'][seg_count:seg_count+3]
elif seg_count == 1:
move_info[seg_count]['model_state'] = 'N' + sp_param['fq_seq'][seg_count-1:seg_count+3]
elif seg_count == nrow-2:
move_info[seg_count]['model_state'] = sp_param['fq_seq'][seg_count-2:seg_count+2] + 'N'
else:
move_info[seg_count]['model_state'] = sp_param['fq_seq'][seg_count-2 : seg_count+3]
pivot = 2*i + first
seg_count += 1
move_info[seg_count]['mean'] = np.mean(sp_param['raw_signals'][pivot:nsig])
move_info[seg_count]['length'] = nsig - pivot
move_info[seg_count]['stdv'] = np.std(sp_param['raw_signals'][pivot:nsig])
move_info[seg_count]['start'] = pivot
move_info[seg_count]['model_state'] = sp_param['fq_seq'][seg_count-2:seg_count+1] + 'N'*2
return move_info


if __name__=='__main__':
moptions = {}
sp_param = {}

exple_data = ['/mnt/isilon/wang_lab/shared/temp_shared/fast5_move/IBDUCAL377261L_20170201_FNfab41074_MN17640_mux_scan_X209_66786_ch12_read102_strand.fast5', \
'/mnt/isilon/wang_lab/shared/temp_shared/fast5_move/IBDUCAL377261L_20170201_FNfab41074_MN17640_mux_scan_X209_66786_ch47_read165_strand.fast5', \
'/mnt/isilon/wang_lab/shared/temp_shared/fast5_move/IBDUCAL377261L_20170201_FNfab41074_MN17640_mux_scan_X209_66786_ch48_read38_strand.fast5', \
'/mnt/isilon/wang_lab/shared/temp_shared/fast5_move/IBDUCAL377261L_20170201_FNfab41074_MN17640_mux_scan_X209_66786_ch52_read12_strand.fast5' \
]

sp_param['f5reader'] = h5py.File(exple_data[0], 'r');


fq_str = '/Analyses/Basecall_1D_001/BaseCalled_template/Fastq'
mv_str = '/Analyses/Basecall_1D_001/BaseCalled_template/Move'
sg_str = '/Raw/Reads/'

sp_param['fq_seq'] = sp_param['f5reader'][fq_str][()].splitlines()[1]
k = list(sp_param['f5reader'][sg_str].keys())[0]
sp_param['raw_signals'] = sp_param['f5reader'][sg_str][k]['Signal'][()]
sp_param['first_sample_template'] = sp_param['f5reader']['/Analyses/Segmentation_001/Summary/segmentation'].attrs['first_sample_template']
sp_param['duration_template'] = sp_param['f5reader']['/Analyses/Segmentation_001/Summary/segmentation'].attrs['duration_template']
move_data = sp_param['f5reader'][mv_str][()]

getEvent_Info(moptions, sp_param, move_data)

sp_param['f5reader'].close();
Loading

0 comments on commit 503af50

Please sign in to comment.