SGA Scoring script and related utilities
The following directory names are customary and (I think) not hardcoded anywhere except in the parameters you supply. You can therefore change them, as long as you set the parameter paths accordingly. Sub-directories and symbolic links also work fine.
rawdata/
: for input files
scored/
: for output files
refdata/
: for supplemental input files
Mandatory:
inputfile
: database dump to score (9 columns), customarily under rawdata/
outputfile
: output file pattern (no extension), customarily under scored/
smfitnessfile
: path to query / array fitness file (refdata/ etc.)
linkagefile
: path to linkage definition file (ignored if linkage is skipped, see below)
coord_file
: path to ORF coordinate file (ignored if linkage is skipped)
removearraylist
: path to list of known bad arrays (to be removed)
Overrideable (have a hard-coded default, but may drop you into the debugger for confirmation if not supplied):
border_strain_orf
: ORF_strainid found on the border
wild_type
: ORF_strainid to use as wild-type
skip_perl_step
: skips a preprocessing step. If you're scoring an inputfile you've already scored, you can skip this to save some time
Optional:
skip_linkage_detection
: set this to true to skip linkage detection altogether
skip_linkage_mask
: set this to true to replace linkage colonies after corrections
skip_wt_remove
: set this to true to include WT query strain(s) in the output
eps_qnorm_ref
: path to mat-file containing a quantile normalization table
You can paste this into MATLAB (lines starting with % are comments):
% -------------------------------------------------------------------------------------
% October 9, 2014
% nolink score of fg30 for adrian verster
% in which linkage data are held out from corrections, then replaced for analysis
% -------------------------------------------------------------------------------------
inputfile = 'rawdata/Collab/AdrianVerster/raw_sga_fg_t30_131130_adrian.txt';
outputfile= 'scored/Collab/AdrianVerster/nolink_sga_fg_t30_131130_adrian_scored_141009';
skip_perl_step = false;
skip_linkage_mask = true;
skip_wt_remove = false;
wild_type = 'URA3control_sn4757';
border_strain_orf = 'YOR202W_dma1';
smfitnessfile = 'refdata/smf_t30_130417.txt';
linkagefile = 'refdata/linkage-est_sga_merged_131122.txt';
coord_file = 'refdata/chrom_coordinates_111220.txt';
removearraylist = 'refdata/bad_array_strains_140526.csv';
compute_sgascore
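Since the mandatory parameters are plain workspace variables, a small pre-flight check can catch a forgotten one before scoring starts. The following is only a sketch: it is not part of compute_sgascore, and the variable names `mandatory` and `missing` are made up for illustration. Run it after setting the parameters and before calling the script.

```matlab
% Hypothetical pre-flight check (not part of the scoring script).
% Lists any mandatory parameter that is not yet defined in the workspace.
mandatory = {'inputfile', 'outputfile', 'smfitnessfile', ...
             'linkagefile', 'coord_file', 'removearraylist'};

missing = {};
for k = 1:numel(mandatory)
    % exist(name, 'var') is nonzero only if a variable of that name exists
    if ~exist(mandatory{k}, 'var')
        missing{end+1} = mandatory{k}; %#ok<AGROW>
    end
end

if isempty(missing)
    disp('All mandatory parameters are set.');
else
    fprintf('Missing mandatory parameter: %s\n', missing{:});
end
```

The check only confirms that each variable exists. It does not verify that the paths point at real files or that the files have the expected format.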
The script creates 4 output files, each beginning with the supplied output pattern but with a unique extension:
.txt
: final scores in 12-column format
.log
: a record of events, pretty much a copy of what you see on the screen when you run the script
.mat
: a mat-file containing everything in the workspace at the end of scoring, for post-mortems
.orf
: a list of unique strain_ids from columns 1 & 2 of .txt, which may come in handy (e.g. for building a hashmap to load the .txt file)
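As a rough illustration of the hashmap idea, the sketch below maps each strain_id to a matrix index with containers.Map and uses it to fill a score matrix. The ids are hard-coded stand-ins for the contents of the .orf file (whose exact layout is an assumption here), `YAL001C_example` is a made-up id, and the score values are placeholders rather than real data.

```matlab
% Stand-in for the strain_ids read from the .orf file (assumed layout:
% one id per line). 'YAL001C_example' is a made-up id for illustration.
strain_ids = {'URA3control_sn4757', 'YOR202W_dma1', 'YAL001C_example'};

% Map each strain_id to its position in the list
id2idx = containers.Map(strain_ids, 1:numel(strain_ids));

% Placeholder rows standing in for the .txt file:
% column 1 and column 2 are strain_ids, column 3 is a made-up value
rows = {'YOR202W_dma1',       'YAL001C_example', -0.12; ...
        'URA3control_sn4757', 'YOR202W_dma1',     0.03};

% Fill a square matrix indexed by strain; unseen pairs stay NaN
scores = nan(numel(strain_ids));
for r = 1:size(rows, 1)
    scores(id2idx(rows{r, 1}), id2idx(rows{r, 2})) = rows{r, 3};
end
```

Building the map once from the .orf list means the size of the matrix is known before the larger .txt file is read.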
The script has a number of places marked:
% SAFE ENTRY
This means "the only variables that change after this point are created after this point."
These markers were placed by manual inspection and are by no means exhaustive. They are used to resume the script mid-execution, which saves time when testing changes. I generally use this procedure:
- Locate the position of the change you wish to test (line XXX)
- Locate the first SAFE ENTRY above XXX (line YYY)
- Delete lines 1-YYY and save the result to resume.m
- To suppress logging to the log file, set this magic number at the top of resume.m: lfid = -11;
  - Or, to log: lfid = fopen([outputfile '.log'], 'a') ('a' to append, 'w' to overwrite)
- Load the mat-file you want to resume (if not already loaded)
- Run resume
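The manual steps above could be automated along these lines. This is a sketch under assumptions: `write_resume` is a hypothetical helper (it does not ship with the scripts), it finds the marker by searching each line for the text `SAFE ENTRY`, and it always writes the `lfid = -11;` line that suppresses logging.

```matlab
function entry_line = write_resume(script_path, change_line, resume_path)
% WRITE_RESUME  Hypothetical helper, not part of the SGA scripts.
%   Finds the nearest SAFE ENTRY marker above change_line (line XXX),
%   drops lines 1..YYY, and writes the remainder to resume_path.

% Read the script line by line so that blank lines keep their numbers
fid = fopen(script_path, 'r');
assert(fid ~= -1, 'Cannot open %s', script_path);
src = {};
txt = fgetl(fid);
while ischar(txt)                 % fgetl returns -1 at end of file
    src{end+1} = txt; %#ok<AGROW>
    txt = fgetl(fid);
end
fclose(fid);

% Locate the last marker strictly above the changed line
last = min(change_line - 1, numel(src));
is_entry = ~cellfun(@isempty, strfind(src, 'SAFE ENTRY'));
entry_line = find(is_entry(1:last), 1, 'last');
assert(~isempty(entry_line), 'No SAFE ENTRY above line %d', change_line);

% Write the logging switch, then everything after the marker
fid = fopen(resume_path, 'w');
assert(fid ~= -1, 'Cannot write %s', resume_path);
fprintf(fid, 'lfid = -11;\n');
for k = entry_line + 1:numel(src)
    fprintf(fid, '%s\n', src{k});
end
fclose(fid);
end
```

A call such as `write_resume('compute_sgascore.m', 480, 'resume.m')` would then produce resume.m, where the file name and line number 480 are placeholders. Loading the mat-file and running resume remain manual steps.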
See the Column_Key for a description of the input and output files.
- MATLAB (any version other than 2010b, which has a bug in svm())
- Image Processing Toolbox (image_toolbox)
- Statistics Toolbox (statistics_toolbox)