dot calling state of the art #287
Replies: 34 comments
-
You'd typically need ~10 kb and 5 kb coolers to perform GM-level dot-calling
the snake-thing about wildcards ...
get the combined bedpe lists with the dots
force rebalance - in case you'd want cis only, modify tolerance, diags, whatever ...
expected ...
chrM and sometimes chrY screw up dot-calling - remove them from the expected
dot-calling for each sample/resolution ...
merge dots using a little unfinished script that should probably end up in cooltools at some point
in case anyone needs this - to prepare a multires bedpe for higlass ingestion ...
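Since the actual script is elided above, here is a minimal pandas sketch of what combining per-resolution dot calls into one bedpe for HiGlass ingestion could look like (file names are hypothetical, and the calls are assumed to be headerless six-column bedpe):
import pandas as pd

bedpe_cols = ["chrom1", "start1", "end1", "chrom2", "start2", "end2"]
# concatenate the 5 kb and 10 kb call lists into a single 2D-annotation file
dots = pd.concat(
    [pd.read_table(f, header=None, names=bedpe_cols, comment="#")
     for f in ["dots_5kb.bedpe", "dots_10kb.bedpe"]],
    ignore_index=True,
)
dots.to_csv("dots_multires.bedpe", sep="\t", header=False, index=False)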
-
FYI the actual command that works for me is ... Feature request: specify which chromosome(s) or regions to use - it would be convenient to quickly test on just a small subset. Also, ...
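In the meantime, a hedged workaround sketch: carve a single chromosome out of a cooler and test on that (paths and the chromosome name are hypothetical):
import cooler

clr = cooler.Cooler("sample.10kb.cool")
# bins and pixels for one chromosome only
bins = clr.bins().fetch("chr17").reset_index(drop=True)
pixels = clr.matrix(as_pixels=True, balance=False).fetch("chr17")
# re-index the global bin ids against the new single-chromosome bin table
offset = clr.offset("chr17")
pixels["bin1_id"] -= offset
pixels["bin2_id"] -= offset
cooler.create_cooler("chr17.10kb.cool", bins=bins, pixels=pixels)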
-
Also, I would be curious to see rough timings for different resolutions/datasets/numbers of cores, and how much memory is required.
-
And I can't get it to work... Here is the error: ... I call it like this: ...
Output: ...
And this is followed by a long list of what looks like dataframes with information about, perhaps, chunks? The full output is attached. Similar errors have been reported when the objects passed in multiprocessing are too big, so that's why I tried reducing chunksize and max separation, but no luck. I can try to make them even smaller, but presumably you tried it with similar values and it worked, so maybe this is not the issue anyway?
-
Could you try:
#!/bin/sh
# Grid Engine options (lines prefixed with #$)
#$ -cwd
#$ -N call_dots
#$ -l h_rt=12:00:00
#$ -l h_vmem=64G
#$ -pe sharedmem 8
#$ -j yes
#$ -V
export MKL_NUM_THREADS=1
cooltools call-dots -n 8 \
-s ../output/dots/$(basename $1)_10kb_scores.hdf5 \
-o ../output/dots/$(basename $1)_10kb_dots.bedpe \
-v --fdr 0.1 \
--max-nans-tolerated 7 \
--max-loci-separation 2000000 \
--dots-clustering-radius 21000 \
--tile-size 2000000 \
$1 $1_expected.tsv
-
OK, it now worked, thank you! But it didn't actually save a bedpe file, only an hdf5 dump... Maybe when I was struggling with the environment I accidentally reinstalled from the main branch or smth? I'll try to reinstall and retry. Also, do you really think it requires so much memory? This is 64 GB per core, so 64*8 = 512 GB...
-
Ah actually, I checked, and it only used 33 GB total memory.
-
Weird, I reinstalled it from ...
Is it normal that the pool of 8 workers is going to tackle some of the tiles twice? And does it have anything to do with the error?
-
I see, thanks! Yeah, it ran now with 8*8 GB (when it gave that error). OK, maybe it's just because I have fewer contacts than in their data? Anyway, is there some quick fix for this I could implement?
-
So @Phlya, temporary solution: ... Let me use this as a justification to share some slides with you: https://docs.google.com/presentation/d/1-hwAS4G5G4LbmmelonE4uubmfd6-V1XzpDI3z72RdZA/edit?usp=sharing - slides 24-30 there are all about the improvements that ...
-
Does it automatically exclude the first two diagonals, btw? If not, this could cause very high read counts, I guess...
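For reference, NaN-ing out the first two diagonals of an expected table is a short pandas operation; a hedged sketch, assuming the older cooltools expected columns diag and balanced.avg and a hypothetical file name:
import numpy as np
import pandas as pd

exp = pd.read_table("sample_expected.tsv")
exp.loc[exp["diag"] < 2, "balanced.avg"] = np.nan  # mask diagonals 0 and 1
exp.to_csv("sample_expected.nan2diags.tsv", sep="\t", index=False)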
-
Again, all of that stems from the fact that I was trying to make ...
-
OK, let me do the fix, so you can run the thing at least
-
OK, very interesting presentation, thank you! Yeah, my expected has two diagonals NaN'd, so it won't be that - I saw you had that problem in the presentation, so I checked already. Thank you, let me know when/how it works!
-
my bad ... wait
-
dekkerlab@377106e |
-
No worries, thank you! Just need to find the value that works now, 43 still failed...
-
Set that argument from above to 50, and got a new problem: ...
-
Any chance you have some weird (almost empty, no Hi-C signal) chromosomes in the dataset? Ones that wouldn't have any "dots"?
-
Awesome, it's working! For future reference, here is the command I used to remove chrY and chrM: ...
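The command itself did not survive here; a pandas equivalent that drops those chromosomes from the expected table (file names hypothetical, assuming the expected has a chrom column) might look like:
import pandas as pd

exp = pd.read_table("sample_expected.tsv")
exp = exp[~exp["chrom"].isin(["chrY", "chrM"])]  # drop the troublesome chromosomes
exp.to_csv("sample_expected.noYM.tsv", sep="\t", index=False)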
-
Why do I get this error with some coolers?..
-
Another side effect of rapid feature-driven development ... The key-word here is "I tried" ... What I can do, for now, is to introduce a flag ...
-
Well, sounds like the "bug" actually doesn't affect any of the results, right? I'd be fine with a quick fix then. Except I don't quite understand why this happens with some files and not others?
-
Minor issue: in non-final files there is a ...
-
It would be great to have an option to pass an adjustment factor to thresholding_step() that allows for modifying the threshold required for the enrichment factors, e.g. enrichment_factor_1 = (enrichment_factor_1 - 1) * adjustment_factor + 1
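A minimal sketch of such an adjustment (the helper name and its wiring into thresholding_step() are hypothetical):
def adjust_enrichment_factor(enrichment_factor, adjustment_factor):
    # scale only the excess enrichment above 1.0:
    # adjustment_factor < 1 relaxes the threshold, > 1 tightens it,
    # and 1 reproduces the current behavior
    return (enrichment_factor - 1) * adjustment_factor + 1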
-
It would also be great to report (cluster_start1, cluster_stop1), (cluster_start2, cluster_stop2) in postprocessed calls for each cluster (i.e. the bounding rectangle).
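A pandas sketch of deriving such bounding rectangles from clustered calls (the column names cluster_id, start1/end1, start2/end2 are assumptions about the intermediate table):
import pandas as pd

def cluster_bounds(dots: pd.DataFrame) -> pd.DataFrame:
    # bounding rectangle of every cluster of dot calls
    return dots.groupby("cluster_id").agg(
        cluster_start1=("start1", "min"),
        cluster_stop1=("end1", "max"),
        cluster_start2=("start2", "min"),
        cluster_stop2=("end2", "max"),
    )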
-
Is there a way to adjust the resolution/bin size? I got this working but would like to try a smaller resolution, as I am using Micro-C and I see that call-dots missed lots of loops at smaller resolutions.
-
Hey @xinyangbing - you are absolutely right in the sense that the default ... At the moment, the best way to adjust any internal parameter (and maybe to skip some filtering steps) is to use the jupyter notebook https://cooltools.readthedocs.io/en/latest/notebooks/08_dot-calling-internals.html As far as higher resolutions go, the most important parameters to adjust would be w and p. Here are the sizes that they use in Rao et al 2014 (HICCUPS): ... - so pick what would be reasonable. Please let us know if you are willing to give the step-by-step notebook a try - we can provide further assistance with that.
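For reference, the published HiCCUPS defaults (Rao et al 2014) are p=4, w=7 at 5 kb, p=2, w=5 at 10 kb, and p=1, w=3 at 25 kb. A minimal numpy sketch of a HiCCUPS-style donut footprint built from p and w (an illustration, not the cooltools implementation):
import numpy as np

def donut_kernel(p, w):
    # ones over the (2w+1) x (2w+1) window
    k = np.ones((2 * w + 1, 2 * w + 1), dtype=int)
    # cut out the central (2p+1) x (2p+1) square ...
    k[w - p : w + p + 1, w - p : w + p + 1] = 0
    # ... and the central row and column of the window
    k[w, :] = 0
    k[:, w] = 0
    return k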
-
Proposal to "modularize" and transform "dotfinder" into a more generally applicable "2d-convolution/correlation" module ... Right now "dotfinder" is "tuned" to be a module that +/- re-implements HiCCUPS and is not very flexible. Ideally one could benefit from a function that has the following API:
def _convolve_or_correlate_tile(
tile_origin, # i,j of the upper left tile corner to keep track of pixel ids
observed_tile, # dense tile of the input matrix - to be convolved
expected_tile, # (?) dense tile of the expected matching the input matrix
kernels, # dict of the convolution/correlation kernels
targets = ["observed","expected","observed_over_expected"], # which matrices to convolve - obs, exp, obs/exp ?
mode = "convolution" # to convolve (i.e. sum with the weights in a given kernel) or to correlate
# something else ?
):
"""
Returns
----------
conv_df : pd.DataFrame
A pixel-table with the convolution/correlation results that is convenient for further scoring, filtering and transformations
Notes
--------
Reported columns:
bin1_id - matrix row index, adjusted to tile_origin_i
bin2_id - matrix column index, adjusted to tile_origin_j
kernel_name.target.conv - convolution results for a given pixel and a given target, examples:
["donut.observed.conv", "donut.expected.conv", "vertical.observed.conv", "vertical.expected.conv"]
kernel_name.target.corr - correlation results for a given pixel and a given target, examples:
["donut.observed_over_expected.corr", "vertical.observed_over_expected.corr"]
kernel_name.nnans - number of NaNs in the vicinity of a given pixel - e.g. overlap kernel footprint
kernel_name.nzeroes - number of zeroes in the vicinity of a given pixel - e.g. overlap kernel footprint
nnans and nzeroes are generally useful for various quality filtering of pixels - e.g. spuriously bright pixel in a generally sparse
region, pixels near the masked rows/columns etc.
More optional output columns could include - raw or balanced "count" for a given pixel, expected value matching that pixel - etc
""" Tiled output of such generalized function tile_df = _convolve_or_correlate_tile(
tile_origin,
observed_tile,
expected_tile,
kernels = {"donut": donut},
targets = ["observed","expected"],
mode = "convolution"
# output extra columns: expected column, raw counts column
)
# pre-filter pixels
# upper diagonal
# not too many zeroes/nans
# etc ...
# annotate with balancing weights weight1 and weight2 - i.e. for both bin1_id and bin2_id:
df = cooler.annotate(tile_df, clr.bins())  # clr being the source cooler.Cooler
df["hiccup.donut.la_exp_score"] = df["expected"] / (df["weight1"]*df["weight2"]) * (df["donut.ovserved.conv"] / df["donut.expected.conv"]) In other words this function would remove the main culprit of the existing |
-
This issue is going to briefly touch upon the state of the dot-calling in cooltools.

Intro

We have been trying to re-implement the HiCCUPS dot-calling algorithm under the cooltools umbrella for some time now. It is still under active development and right now the code is scattered across forks and branches.

master

The initial progress that we made with dot-calling, by implementing convolution-based calculation of the locally adjusted expected (donut, lowleft, vertical, horizontal), is reflected in this repository in the master branch. The post-processing steps in this master are closest to the so-called BH-FDR version of dot-calling in the original HiCCUPS paper (Rao et al 2014), in the sense that we do not do the lambda-chunking to perform multiple hypothesis testing. Moreover, this implementation simply ends with a dump of the pre-calculated adjusted expected for the different kernels (donut, lowleft, vertical, horizontal), and the post-processing is in a bad shape. Thus this isn't very usable for now, not for the final dot-calling at least.

new-dotfinder

The lambda-chunking procedure is implemented in the new-dotfinder branch of the dekkerlab fork of cooltools, which is pip-installable: ... The second command would allow one to modify the source code, whereas the first one would simply install it. One would want to modify the source code if, for instance, a modification of the enrichment thresholds is needed, as those are not implemented as CLI options just yet. A typical run would be: ... which would produce the list of dots that passed multiple hypothesis testing (the lambda-chunking step itself) but haven't been post-processed, i.e. clustered or filtered by enrichment. The post-processed list of dots would show up in the same folder as signif_dots but with the prefix final_ - we'll fix this ugliness later on, of course. The call_dots CLI determines the resolution and picks the correct kernel parameters w, p accordingly (all of the defaults are kept the same as in HiCCUPS).

This dot-caller, albeit very close to the HiCCUPS implementation, deviates from it in some regards; some of the most important aspects:
1. no dynamic resizing (enlargement) of the kernels (lowleft)
2. clustering is done via Birch, while HiCCUPS implements a special greedy algorithm for that - the results are very close anyways.
3. treatment of NaNs after balancing: HiCCUPS disregards pixels that are within ~5 pixels of bad rows/cols; instead we simply check the number of NaNs in a kernel footprint against --max-nans-tolerated - given the resolution/size of the kernels, one can work out which pixels would be discarded (a sketch follows below).

... I might add more details here later on, and edit/elaborate more on this here ...
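As an illustration of the NaN-footprint check in item 3 (not the actual cooltools code), the number of NaNs under a kernel footprint can be counted with a single convolution:
import numpy as np
from scipy.ndimage import convolve

mat = np.random.rand(100, 100)  # hypothetical balanced matrix tile
mat[:, 42] = np.nan             # a masked column

footprint = np.ones((7, 7))     # e.g. (2w+1) x (2w+1) with w=3
# NaN count under the footprint centered at every pixel
nnans = convolve(np.isnan(mat).astype(float), footprint, mode="constant", cval=0.0)
discard = nnans > 1             # pixels failing --max-nans-tolerated = 1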
shrink-donut-dotfinder

This branch elaborates further on top of new-dotfinder by dealing with discrepancy #1 - dynamic kernel resizing. As follows from the name, here we implemented only the near-diagonal kernel shrinkage - arguably the most important aspect of the dynamic kernels, which was preventing us from calling dots really close to the diagonal and driving the deviation between cooltools dot-calling and HiCCUPS. There are no additional parameters that the user needs to control in this case; everything is done the same way as in HiCCUPS - and this is data-independent kernel shrinkage, as opposed to enlarging the kernels based on the value of the lowleft kernel for each pixel tested. This branch also eliminates another difference between HiCCUPS and cooltools, which is related to the way lambda-chunking is implemented and is too technical to describe here. To give some numbers: for the Rao et al 2014 GM... primary dataset we are getting ~7700 dots vs ~8050 by HiCCUPS, with an overlap of ~7600.