dot calling state of the art #287
Replies: 34 comments
-
You'd typically need ~10 kb and 5 kb coolers to perform GM-level dot-calling
the snake-thing about wildcards ...
get the combined bedpe lists with the dots
force rebalance - in case you'd want cis only, modify tolerance, diags, whatever ...
expected ...
chrM and sometimes chrY screw up dot-calling - remove them from the expected
dot-calling for each sample/resolution ...
merge dots using a little unfinished script that should probably end up in cooltools at some point
in case anyone needs this - to prepare a multires bedpe for higlass ingestion ...
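Since the actual script is elided above, here is a minimal pandas sketch of what combining per-resolution dot calls into one bedpe for HiGlass ingestion could look like (file names are hypothetical, and the calls are assumed to be headerless six-column bedpe):
import pandas as pd

bedpe_cols = ["chrom1", "start1", "end1", "chrom2", "start2", "end2"]
# concatenate the 5 kb and 10 kb call lists into a single 2D-annotation file
dots = pd.concat(
    [pd.read_table(f, header=None, names=bedpe_cols, comment="#")
     for f in ["dots_5kb.bedpe", "dots_10kb.bedpe"]],
    ignore_index=True,
)
dots.to_csv("dots_multires.bedpe", sep="\t", header=False, index=False)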
-
FYI the actual command that works for me is ... Feature request: specify which chromosome(s) or regions to use - it would be convenient to quickly test on just a small subset. Also, ...
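In the meantime, a hedged workaround sketch: carve a single chromosome out of a cooler and test on that (paths and the chromosome name are hypothetical):
import cooler

clr = cooler.Cooler("sample.10kb.cool")
# bins and pixels for one chromosome only
bins = clr.bins().fetch("chr17").reset_index(drop=True)
pixels = clr.matrix(as_pixels=True, balance=False).fetch("chr17")
# re-index the global bin ids against the new single-chromosome bin table
offset = clr.offset("chr17")
pixels["bin1_id"] -= offset
pixels["bin2_id"] -= offset
cooler.create_cooler("chr17.10kb.cool", bins=bins, pixels=pixels)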
-
Also, I would be curious to see rough timings for different resolutions/datasets/numbers of cores, and how much memory is required.
-
And I can't get it to work... Here is the error: ... I call it like this: ...
Output: ...
And this is followed by a long list of what looks like dataframes with information about, perhaps, chunks? The full output is attached. Similar errors have been reported when the objects passed in multiprocessing are too big, so that's why I tried reducing chunksize and max separation, but no luck. I can try to make them even smaller, but presumably you tried it with similar values and it worked, so maybe this is not the issue anyway?
-
Could you try:
#!/bin/sh
# Grid Engine options (lines prefixed with #$)
#$ -cwd
#$ -N call_dots
#$ -l h_rt=12:00:00
#$ -l h_vmem=64G
#$ -pe sharedmem 8
#$ -j yes
#$ -V
export MKL_NUM_THREADS=1
cooltools call-dots -n 8 \
-s ../output/dots/$(basename $1)_10kb_scores.hdf5 \
-o ../output/dots/$(basename $1)_10kb_dots.bedpe \
-v --fdr 0.1 \
--max-nans-tolerated 7 \
--max-loci-separation 2000000 \
--dots-clustering-radius 21000 \
--tile-size 2000000 \
$1 $1_expected.tsv
-
OK, it now worked, thank you! But it didn't actually save a bedpe file, only an hdf5 dump... Maybe when I was struggling with the environment I accidentally reinstalled from the main branch or smth? I'll try to reinstall and retry. Also, do you really think it requires so much memory? This is 64 GB per core, so 64*8 = 512 GB...
-
Ah actually, I checked, and it only used 33 GB total memory.
-
Weird, I reinstalled it from ...
Is it normal that the pool of 8 workers is going to tackle some of the tiles twice? And does it have anything to do with the error?
-
I see, thanks! Yeah, it ran now with 8*8 GB (when it gave that error). OK, maybe it's just because I have fewer contacts than in their data? Anyway, is there some quick fix for this I could implement?
-
So @Phlya, temporary solution: ... Let me use this as a justification to share some slides with you: https://docs.google.com/presentation/d/1-hwAS4G5G4LbmmelonE4uubmfd6-V1XzpDI3z72RdZA/edit?usp=sharing - slides 24-30 there are all about the improvements that ...
-
Does it automatically exclude the first two diagonals, btw? If not, this could cause very high read counts, I guess...
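For reference, NaN-ing out the first two diagonals of an expected table is a short pandas operation; a hedged sketch, assuming the older cooltools expected columns diag and balanced.avg and a hypothetical file name:
import numpy as np
import pandas as pd

exp = pd.read_table("sample_expected.tsv")
exp.loc[exp["diag"] < 2, "balanced.avg"] = np.nan  # mask diagonals 0 and 1
exp.to_csv("sample_expected.nan2diags.tsv", sep="\t", index=False)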
-
Again, all of that stems from the fact that I was trying to make ...
-
OK, let me do the fix, so you can run the thing at least
-
OK, very interesting presentation, thank you! Yeah, my expected has two diagonals NaN'd, so it won't be that - I saw you had that problem in the presentation, so I checked already. Thank you, let me know when/how it works!
-
my bad ... wait
-
dekkerlab@377106e |
-
No worries, thank you! Just need to find the value that works now, 43 still failed...
-
Set that argument from above to 50, and got a new problem: ...
-
Any chance you have some weird (almost empty, no Hi-C signal) chromosomes in the dataset? Ones that wouldn't have any "dots"?
-
Awesome, it's working! For future reference, here is the command I used to remove chrY and chrM: ...
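The command itself did not survive here; a pandas equivalent that drops those chromosomes from the expected table (file names hypothetical, assuming the expected has a chrom column) might look like:
import pandas as pd

exp = pd.read_table("sample_expected.tsv")
exp = exp[~exp["chrom"].isin(["chrY", "chrM"])]  # drop the troublesome chromosomes
exp.to_csv("sample_expected.noYM.tsv", sep="\t", index=False)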
-
Why do I get this error with some coolers?..
-
Another side effect of rapid feature-driven development ... The key-word here is "I tried" ... What I can do, for now, is to introduce a flag ...
-
Well, sounds like the "bug" actually doesn't affect any of the results, right? I'd be fine with a quick fix then. Except I don't quite understand why this happens with some files and not others?
-
Minor issue: in non-final files there is a ...
-
It would be great to have an option to pass an adjustment factor to thresholding_step() that allows for modifying the threshold required for the enrichment factors, e.g. enrichment_factor_1 = (enrichment_factor_1 - 1) * adjustment_factor + 1
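A minimal sketch of such an adjustment (the helper name and its wiring into thresholding_step() are hypothetical):
def adjust_enrichment_factor(enrichment_factor, adjustment_factor):
    # scale only the excess enrichment above 1.0:
    # adjustment_factor < 1 relaxes the threshold, > 1 tightens it,
    # and 1 reproduces the current behavior
    return (enrichment_factor - 1) * adjustment_factor + 1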
-
It would also be great to report (cluster_start1, cluster_stop1), (cluster_start2, cluster_stop2) in postprocessed calls for each cluster (i.e. the bounding rectangle).
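A pandas sketch of deriving such bounding rectangles from clustered calls (the column names cluster_id, start1/end1, start2/end2 are assumptions about the intermediate table):
import pandas as pd

def cluster_bounds(dots: pd.DataFrame) -> pd.DataFrame:
    # bounding rectangle of every cluster of dot calls
    return dots.groupby("cluster_id").agg(
        cluster_start1=("start1", "min"),
        cluster_stop1=("end1", "max"),
        cluster_start2=("start2", "min"),
        cluster_stop2=("end2", "max"),
    )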
-
Is there a way to adjust the resolution/bin size? I got this working but would like to try a smaller resolution, as I am using Micro-C and I see that call-dots missed lots of loops at smaller resolutions.
-
Hey @xinyangbing - you are absolutely right in the sense that the default ... At the moment, the best way to adjust any internal parameter (and maybe to skip some filtering steps) is to use the jupyter notebook https://cooltools.readthedocs.io/en/latest/notebooks/08_dot-calling-internals.html As far as higher resolutions go, the most important parameters to adjust would be w and p. Here are the sizes that they use in Rao et al 2014 (HICCUPS): ... - so pick what would be reasonable. Please let us know if you are willing to give the step-by-step notebook a try - we can provide further assistance with that.
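For reference, the published HiCCUPS defaults (Rao et al 2014) are p=4, w=7 at 5 kb, p=2, w=5 at 10 kb, and p=1, w=3 at 25 kb. A minimal numpy sketch of a HiCCUPS-style donut footprint built from p and w (an illustration, not the cooltools implementation):
import numpy as np

def donut_kernel(p, w):
    # ones over the (2w+1) x (2w+1) window
    k = np.ones((2 * w + 1, 2 * w + 1), dtype=int)
    # cut out the central (2p+1) x (2p+1) square ...
    k[w - p : w + p + 1, w - p : w + p + 1] = 0
    # ... and the central row and column of the window
    k[w, :] = 0
    k[:, w] = 0
    return k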
-
Proposal to "modularize" and transform "dotfinder" into a more generally applicable "2d-convolution/correlation" module ... Right now "dotfinder" is "tuned" to be a module that +/- re-implements HiCCUPS and is not very flexible. Ideally one could benefit from a function that has the following API:
def _convolve_or_correlate_tile(
tile_origin, # i,j of the upper left tile corner to keep track of pixel ids
observed_tile, # dense tile of the input matrix - to be convolved
expected_tile, # (?) dense tile of the expected matching the input matrix
kernels, # dict of the convolution/correlation kernels
targets = ["observed","expected","observed_over_expected"], # which matrices to convolve - obs, exp, obs/exp ?
mode = "convolution" # to convolve (i.e. sum with the weights in a given kernel) or to correlate
# something else ?
):
"""
Returns
----------
conv_df : pd.DataFrame
A pixel-table with the convolution/correlation results that is convenient for further scoring, filtering and transformations
Notes
--------
Reported columns:
bin1_id - matrix row index, adjusted to tile_origin_i
bin2_id - matrix column index, adjusted to tile_origin_j
kernel_name.target.conv - convolution results for a given pixel and a given target, examples:
["donut.observed.conv", "donut.expected.conv", "vertical.observed.conv", "vertical.expected.conv"]
kernel_name.target.corr - correlation results for a given pixel and a given target, examples:
["donut.observed_over_expected.corr", "vertical.observed_over_expected.corr"]
kernel_name.nnans - number of NaNs in the vicinity of a given pixel - e.g. overlap kernel footprint
kernel_name.nzeroes - number of zeroes in the vicinity of a given pixel - e.g. overlap kernel footprint
nnans and nzeroes are generally useful for various quality filtering of pixels - e.g. spuriously bright pixel in a generally sparse
region, pixels near the masked rows/columns etc.
More optional output columns could include - raw or balanced "count" for a given pixel, expected value matching that pixel - etc
""" Tiled output of such generalized function tile_df = _convolve_or_correlate_tile(
tile_origin,
observed_tile,
expected_tile,
kernels = {"donut": donut},
targets = ["observed","expected"],
mode = "convolution"
# output extra columns: expected column, raw counts column
)
# pre-filter pixels
# upper diagonal
# not too many zeroes/nans
# etc ...
# annotate with balancing weights weight1 and weight2 - i.e. for both bin1_id and bin2_id:
df = cooler.annotate(tile_df, clr.bins())  # clr being the source cooler.Cooler
df["hiccup.donut.la_exp_score"] = df["expected"] / (df["weight1"]*df["weight2"]) * (df["donut.ovserved.conv"] / df["donut.expected.conv"]) In other words this function would remove the main culprit of the existing |
-
This issue is going to briefly touch upon the state of the dot-calling in cooltools.

Intro

We have been trying to re-implement the HiCCUPS dot-calling algorithm under the cooltools umbrella for some time now. It is still under active development and right now the code is scattered across forks and branches.

master

The initial progress that we made with dot-calling, by implementing convolution-based calculation of the locally adjusted expected (donut, lowleft, vertical, horizontal), is reflected in this repository in the master branch. The post-processing steps in this master are closest to the so-called BH-FDR version of dot-calling in the original HiCCUPS paper (Rao et al 2014), in the sense that we do not do the lambda-chunking to perform multiple hypothesis testing. Moreover, this implementation simply ends with a dump of the pre-calculated adjusted expected for the different kernels (donut, lowleft, vertical, horizontal), and the post-processing is in a bad shape. Thus this isn't very usable for now, not for the final dot-calling at least.

new-dotfinder

The lambda-chunking procedure is implemented in the new-dotfinder branch of the dekkerlab fork of cooltools, which is pip-installable: ... The second command would allow one to modify the source code, whereas the first one would simply install it. One would want to modify the source code if, for instance, a modification of the enrichment thresholds is needed, as those are not implemented as CLI options just yet. A typical run would be: ... which would produce the list of dots that passed multiple hypothesis testing (the lambda-chunking step itself) but haven't been post-processed, i.e. clustered or filtered by enrichment. The post-processed list of dots would show up in the same folder as signif_dots but with the prefix final_ - we'll fix this ugliness later on, of course. The call_dots CLI determines the resolution and picks the correct kernel parameters w, p accordingly (all of the defaults are kept the same as in HiCCUPS).

This dot-caller, albeit very close to the HiCCUPS implementation, deviates from it in some regards; some of the most important aspects:
1. no dynamic resizing (enlargement) of the kernels (lowleft)
2. clustering is done via Birch, while HiCCUPS implements a special greedy algorithm for that - the results are very close anyways.
3. treatment of NaNs after balancing: HiCCUPS disregards pixels that are within ~5 pixels of bad rows/cols; instead we simply check the number of NaNs in a kernel footprint against --max-nans-tolerated - given the resolution/size of the kernels, one can work out which pixels would be discarded (a sketch follows below).

... I might add more details here later on, and edit/elaborate more on this here ...
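As an illustration of the NaN-footprint check in item 3 (not the actual cooltools code), the number of NaNs under a kernel footprint can be counted with a single convolution:
import numpy as np
from scipy.ndimage import convolve

mat = np.random.rand(100, 100)  # hypothetical balanced matrix tile
mat[:, 42] = np.nan             # a masked column

footprint = np.ones((7, 7))     # e.g. (2w+1) x (2w+1) with w=3
# NaN count under the footprint centered at every pixel
nnans = convolve(np.isnan(mat).astype(float), footprint, mode="constant", cval=0.0)
discard = nnans > 1             # pixels failing --max-nans-tolerated = 1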
shrink-donut-dotfinder

This branch elaborates further on top of new-dotfinder by dealing with discrepancy #1 - dynamic kernel resizing. As follows from the name, here we implemented only the near-diagonal kernel shrinkage - arguably the most important aspect of the dynamic kernels, which was preventing us from calling dots really close to the diagonal and driving the deviation between cooltools dot-calling and HiCCUPS. There are no additional parameters that the user needs to control in this case; everything is done the same way as in HiCCUPS - and this is data-independent kernel shrinkage, as opposed to enlarging the kernels based on the value of the lowleft kernel for each pixel tested. This branch also eliminates another difference between HiCCUPS and cooltools, which is related to the way lambda-chunking is implemented and is too technical to describe here. To give some numbers: for the Rao et al 2014 GM... primary dataset we are getting ~7700 dots vs ~8050 by HiCCUPS, with an overlap of ~7600.