Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 81, "call empty droplets" #301

Merged
merged 34 commits into from
Mar 18, 2024
Merged

Fix issue 81, "call empty droplets" #301

merged 34 commits into from
Mar 18, 2024

Conversation

fmalmeida
Copy link
Contributor

Right now, just opening a draft PR on the attempt of solving issue #81 so that it is easier to keep track of modifications.

One work is "done" I will add a thorough overview of the changes, with explanations of the main modifications and listing on TODOs to be addressed before merging.

Then, of course, only then I will add some reviewers.

fmalmeida added 3 commits January 31, 2024 11:42
tested only with kallisto aligner (both with and without automated kallisto filtering with bustools --filter parameter)
@fmalmeida fmalmeida added the enhancement New feature or request label Feb 2, 2024
@fmalmeida fmalmeida self-assigned this Feb 2, 2024
Copy link

github-actions bot commented Feb 2, 2024

Python linting (black) is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks.
To fix this CI test, please run:

  • Install black: pip install black
  • Fix formatting errors in your pipeline: black .

Once you push these changes the test should pass, and you can hide this comment 👍

We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

Thanks again for your contribution!

Comment on lines 20 to 21
tuple val(meta), path ("*.count/counts_unfiltered"), emit: raw_counts // TODO: Add to nf-coew/modules before merging PR
tuple val(meta), path ("*.count/counts_filtered") , emit: filtered_counts, optional: true // TODO: Add to nf-coew/modules before merging PR
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I am aware that this modification must go to nf-core/modules and not here, thus I added the TODO so this removed once testing is done.

Copy link

github-actions bot commented Feb 2, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 2406600

+| ✅ 171 tests passed       |+
#| ❔   4 tests were ignored |#
!| ❗   3 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.13.1
  • Run at 2024-03-18 16:06:48

@fmalmeida
Copy link
Contributor Author

fmalmeida commented Feb 15, 2024

Hi @grst and @apeltzer ,
This PR is now ready for review and discussions.
As promised, I am going to add a description of changes and open things.

The empty-drops module

First of all, I have added a script to perform the empty-drops calling and filtering using a library that is available in bioconda, bioconductor-dropletutils.

With that, I added a module which is simple, it takes a matrix file, and performs the empty drops call on it, generating another matrix file.

Note: The module fails on small datasets or whenever there is not sufficient data points for filtering. Thus, it must work on "raw/unprocessed" matrices and should be skipped (--skip_emptydrops=true) for small datasets.

The inclusion in the workflow

With the module generated, I could then include it in the workflow.

    // Run emptydrops calling module
    if ( !params.skip_emptydrops ) {

        //
        // emptydrops should only run on the raw matrices thus, filter-out the filtered result of the aligners that can produce it
        //
        if ( params.aligner in [ 'cellranger', 'cellrangerarc', 'kallisto', 'star' ] ) {
            ch_mtx_matrices_for_emptydrops =
                ch_mtx_matrices.filter { meta, mtx_files ->
                    mtx_files.toString().contains("raw_feature_bc_matrix") || // cellranger
                    mtx_files.toString().contains("counts_unfiltered")  || // kallisto
                    mtx_files.toString().contains("raw")                   // star
            }
        } else {
            ch_mtx_matrices_for_emptydrops = ch_mtx_matrices
        }
        EMPTYDROPS_CELL_CALLING( ch_mtx_matrices_for_emptydrops )
        ch_mtx_matrices = ch_mtx_matrices.mix( EMPTYDROPS_CELL_CALLING.out.filtered_matrices )
    }

One thing to note above is that, as discussed previously, I have to add a checker/filter in order to only pass on the raw/unprocessed matrices generated by the assemblers, because, I think it does not make sense to run the module in the already filtered/processed matrices.

Or do you think we should invert, only running in the processed ones?

Changes in conversion modules

Because now we will have both the data directly from the aligners, and a custom-made filtering module, I had to change the conversion modules a bit so they are aware of that, and can understand the difference between raw/filtered from the aligners itself and what is the custom empty drops filter.

With that, to try to avoid confusion by the user, I had to add such "suffixes" to the generated converted files, so now we write data with such sufixes:

*_{raw,filtered,custom_emptydrops_filter}_matrix.{h5ad,rds}

In this case, the meanings are:

suffix meaning
raw Conversion of the raw/unprocessed matrix generated by the tool. It is also used for tools that generate only one matrix, such as alevin.
filtered Conversion of the filtered/processed matrix generated by the tool
custom_emptydrops_filter custom_emptydrops_filter Conversion of the matrix that was generated by the new custom empty drops filter module

I also had to update the two conversion-scripts for rds and h5ad, because now they also have to understand this new matrix generated, and to allow it to also transpose the matrices from cellranger, because, when we do the normal conversion of cellranger matrices, we use the .h5 files it generates, so it is very fine, but the cellranger-emptydrops filtering does not produce such file and then the conversion scripts need to understand it.

Integrating with the aligners

Finally, it has to be discussed the changes made in the aligners-connections to integrate this module. One by one.

Alevin

Because alevin only produce on matrix result, without producing a pair (raw&filtered) as others like cellranger, star and kallisto, the integration was seamless and required no major change, just connecting it to the module.

When using it, the results generated are the following:

alevin_run/
├── alevin
│   ├── mtx_conversions
│   │   ├── combined_custom_emptydrops_filter_matrix.h5ad
│   │   ├── combined_raw_matrix.h5ad
│   │   ├── pbmc8k
│   │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│   │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│   │   │   ├── pbmc8k_raw_matrix.h5ad
│   │   │   └── pbmc8k_raw_matrix.rds
│   │   └── versions.yml
│   ├── pbmc8k
│   │   └── emptydrops_filtered
│   │       ├── quants_mat_cols.txt
│   │       ├── quants_mat.mtx
│   │       └── quants_mat_rows.txt
│   ├── pbmc8k_alevin_results
│   │   ├── af_map
│   │   │   ├── alevin
│   │   │   ├── aux_info
│   │   │   ├── cmd_info.json
│   │   │   ├── libParams
│   │   │   ├── logs
│   │   │   ├── map.rad
│   │   │   └── unmapped_bc_count.bin
│   │   ├── af_quant
│   │   │   ├── alevin
│   │   │   ├── all_freq.bin
│   │   │   ├── collate.json
│   │   │   ├── featureDump.txt
│   │   │   ├── generate_permit_list.json
│   │   │   ├── map.collated.rad
│   │   │   ├── permit_freq.bin
│   │   │   ├── permit_map.bin
│   │   │   ├── quant.json
│   │   │   └── unmapped_bc_count_collated.bin
│   │   └── simpleaf_quant_log.json
├── alevinqc
├── fastqc
├── multiqc
└── pipeline_info

Kallisto

For kallisto, I first had to include a new parameter in the pipeline, called, --kb_filter because it is possible that kallisto produce only one matrix or a pair (raw/filtered) depending on whether the bustools --filter command is used or not. So, I added this new parameter to make sure I could include the module taking account of both scenarios.

Finally, to make sure I had a channel that could be properly used, and of course, that we could filter raw / filtered data to choose what to pass on to the empty drops filter module, I had to modify the generated channels by the module (see here).

tuple val(meta), path ("*.count")                  , emit: count
tuple val(meta), path ("*.count/counts_unfiltered"), emit: raw_counts                       // TODO: Add to nf-coew/modules before merging PR
tuple val(meta), path ("*.count/counts_filtered")  , emit: filtered_counts, optional: true  // TODO: Add to nf-coew/modules before merging PR

Then, of course, I updated the downstream snippets of the codes in the suf-workflows and workflow to understand it.

I know these changes shall got to nf-core/modules first. I added here first for discussion, when all solved, I can add to nf-core/modules whatever is needed.

Results look like this, when --filter is on.

kallisto_lamanno_run
├── fastqc
├── kallisto
│   ├── mtx_conversions
│   │   ├── combined_custom_emptydrops_filter_matrix.h5ad
│   │   ├── combined_filtered_matrix.h5ad
│   │   ├── combined_raw_matrix.h5ad
│   │   ├── pbmc8k
│   │   │   ├── pbmc8k_spliced_matrix.h5ad
│   │   │   ├── pbmc8k_spliced_matrix.rds
│   │   │   ├── pbmc8k_unspliced_matrix.h5ad
│   │   │   └── pbmc8k_unspliced_matrix.rds
│   │   └── versions.yml
│   ├── pbmc8k.count
│   │   ├── 10x_version2_whitelist.txt
│   │   ├── counts_filtered
│   │   │   ├── spliced.barcodes.txt
│   │   │   ├── spliced.genes.txt
│   │   │   ├── spliced.mtx
│   │   │   ├── unspliced.barcodes.txt
│   │   │   ├── unspliced.genes.txt
│   │   │   └── unspliced.mtx
│   │   ├── counts_unfiltered
│   │   │   ├── spliced.barcodes.txt
│   │   │   ├── spliced.genes.txt
│   │   │   ├── spliced.mtx
│   │   │   ├── unspliced.barcodes.txt
│   │   │   ├── unspliced.genes.txt
│   │   │   └── unspliced.mtx
│   │   ├── emptydrops_filtered
│   │   │   ├── spliced.barcodes.txt
│   │   │   ├── spliced.genes.txt
│   │   │   ├── spliced.mtx
│   │   │   ├── unspliced.barcodes.txt
│   │   │   ├── unspliced.genes.txt
│   │   │   └── unspliced.mtx
│   │   ├── filter_barcodes.txt
│   │   ├── inspect.json
│   │   ├── inspect.spliced.json
│   │   ├── inspect.unspliced.json
│   │   ├── kb_info.json
│   │   ├── matrix.ec
│   │   ├── output.bus
│   │   ├── output.filtered.bus
│   │   ├── output.unfiltered.bus
│   │   ├── run_info.json
│   │   ├── spliced.filtered.bus
│   │   ├── spliced.unfiltered.bus
│   │   ├── transcripts.txt
│   │   ├── unspliced.filtered.bus
│   │   └── unspliced.unfiltered.bus
│   └── versions.yml
├── multiqc
└── pipeline_info

kallisto_run
├── fastqc
├── kallisto
│   ├── mtx_conversions
│   │   ├── combined_custom_emptydrops_filter_matrix.h5ad
│   │   ├── combined_raw_matrix.h5ad
│   │   ├── pbmc8k
│   │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│   │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│   │   │   ├── pbmc8k_raw_matrix.h5ad
│   │   │   └── pbmc8k_raw_matrix.rds
│   │   └── versions.yml
│   ├── pbmc8k.count
│   │   ├── 10x_version2_whitelist.txt
│   │   ├── counts_unfiltered
│   │   │   ├── cells_x_genes.barcodes.txt
│   │   │   ├── cells_x_genes.genes.txt
│   │   │   └── cells_x_genes.mtx
│   │   ├── emptydrops_filtered
│   │   │   ├── cells_x_genes.barcodes.txt
│   │   │   ├── cells_x_genes.genes.txt
│   │   │   └── cells_x_genes.mtx
│   │   ├── inspect.json
│   │   ├── kb_info.json
│   │   ├── matrix.ec
│   │   ├── output.bus
│   │   ├── output.unfiltered.bus
│   │   ├── run_info.json
│   │   └── transcripts.txt
│   └── versions.yml
├── multiqc
└── pipeline_info

Cellranger

For cellranger, basically it happened the same to Kallisto. The difference is that cellranger always produces a pair of raw/filtered and then I had to just modify the channels to account for that so would make filtering later on easier (see here)

I also had to update the conversion-scripts because the emptydrops filter module does not produce a .h5 file so the scripts where not ready for converting .mtx cellranger files.

Results for it looks like this:

cellranger_run/
├── cellranger
│   ├── count
│   │   ├── pbmc8k
│   │   │   ├── emptydrops_filtered
│   │   │   └── outs
│   │   └── versions.yml
│   ├── mkgtf
│   │   └── genome_genes.filtered.gtf
│   ├── mkref
│   │   ├── cellranger_reference
│   │   │   ├── fasta
│   │   │   ├── genes
│   │   │   ├── reference.json
│   │   │   └── star
│   │   └── versions.yml
│   └── mtx_conversions
│       ├── combined_custom_emptydrops_filter_matrix.h5ad
│       ├── combined_filtered_matrix.h5ad
│       ├── combined_raw_matrix.h5ad
│       ├── pbmc8k
│       │   ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│       │   ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│       │   ├── pbmc8k_filtered_matrix.h5ad
│       │   ├── pbmc8k_filtered_matrix.rds
│       │   ├── pbmc8k_raw_matrix.h5ad
│       │   └── pbmc8k_raw_matrix.rds
│       └── versions.yml
├── fastqc
├── multiqc
└── pipeline_info

STAR

For star, basically the same for cellranger. It always produce a raw/filtered pair, but I had to adjust the out-channels to make the filtering/selection easier. Of course, adjusting all the downstream channel selections to account for them.

The results for it look like this:

star_run/
├── fastqc
├── multiqc
├── pipeline_info
└── star
    ├── mtx_conversions
    │   ├── combined_custom_emptydrops_filter_matrix.h5ad
    │   ├── combined_filtered_matrix.h5ad
    │   ├── combined_raw_matrix.h5ad
    │   ├── pbmc8k
    │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
    │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.rds
    │   │   ├── pbmc8k_filtered_matrix.h5ad
    │   │   ├── pbmc8k_filtered_matrix.rds
    │   │   ├── pbmc8k_raw_matrix.h5ad
    │   │   └── pbmc8k_raw_matrix.rds
    │   └── versions.yml
    └── pbmc8k
        ├── emptydrops_filtered
        │   ├── barcodes.tsv
        │   ├── features.tsv
        │   └── matrix.mtx
        ├── pbmc8k.Aligned.sortedByCoord.out.bam
        ├── pbmc8k.Log.final.out
        ├── pbmc8k.Log.out
        ├── pbmc8k.Log.progress.out
        ├── pbmc8k.SJ.out.tab
        ├── pbmc8k.Solo.out
        │   ├── Barcodes.stats
        │   └── Gene
        └── versions.yml

UniverSC and cellrangerarc

I could not even run them, so could not be tested nor integrated.
Thus, I believe they first need to be solved in order to have a proper testing profile to allow integrating these modules here.

Just not sure what should go first.

About the out-channels

You will see in the scrnaseq.nf workflow, that even after generating a proper channel for raw/filtered in the modules of cellranger, star and kallisto I still mix them all in the general ch_mtx_matrices channel.

I do this to guarantee all results are going to the downstream analysis.
The reason I modified the modules to generate these split channels, was just to make sure they grabbed the folder which contains only the raw/filtered results to allow a proper filtering.
Before they were using a higher level ** selection and then all files were being dumped in the channel in a flatten mode, so the filtering was very complicate and resulting in files with duplicated names.

You see that I had to modify the conversion subworkflow to account for that as well.

@fmalmeida fmalmeida marked this pull request as ready for review February 15, 2024 08:33
@fmalmeida fmalmeida requested review from grst and apeltzer February 15, 2024 08:33
@fmalmeida fmalmeida linked an issue Feb 29, 2024 that may be closed by this pull request
Copy link
Member

@apeltzer apeltzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me apart from some parts:

  • Changes in the module code --> need to go upstream, ideally open PRs already for this
  • Upgrades in the respective upstream modules might be necessary, check out the cellranger update PR, which could be merged prior to updating the modules here I believe --> should be easy to do..
  • Some more docs on what this sfeature does would be helpful - so that people can both see it in the changelog and also in the main documentation

@fmalmeida
Copy link
Contributor Author

Looks good to me apart from some parts:

  • Changes in the module code --> need to go upstream, ideally open PRs already for this
  • Upgrades in the respective upstream modules might be necessary, check out the cellranger update PR, which could be merged prior to updating the modules here I believe --> should be easy to do..
  • Some more docs on what this sfeature does would be helpful - so that people can both see it in the changelog and also in the main documentation

Yes, the changes in the modules I added as a TODO so that, once we know and agree on all changes required, it can be done in nf-core/modules.

About the docs, I will work on it.

bin/emptydrops_cell_calling.R Outdated Show resolved Hide resolved
bin/emptydrops_cell_calling.R Outdated Show resolved Hide resolved
modules/local/emptydrops.nf Outdated Show resolved Hide resolved
modules/local/emptydrops.nf Outdated Show resolved Hide resolved
@grst
Copy link
Member

grst commented Mar 11, 2024

sounds good

@grst
Copy link
Member

grst commented Mar 13, 2024

@nf-core-bot fix linting

@fmalmeida
Copy link
Contributor Author

Now that kallisto was updated, and the workflows it provides are different. I will have to test it with them as well.
Once they work, the last thing required will be the docs and the PRs to modules.

@fmalmeida
Copy link
Contributor Author

fmalmeida commented Mar 13, 2024

Hi @grst ,

Can you take a look again at the changes?

The only one missing is the last one, which is currently running.

@grst
Copy link
Member

grst commented Mar 13, 2024

I'm wondering if instead of updating all those modules it would be easier to do something like

ch_filtered = ch_out.map{
        meta, files -> [meta, out.findAll{ it -> it.contains("filtered") }]
}

@fmalmeida
Copy link
Contributor Author

I'm wondering if instead of updating all those modules it would be easier to do something like

ch_filtered = ch_out.map{
        meta, files -> [meta, out.findAll{ it -> it.contains("filtered") }]
}

That was my first try. But because many of them use the ** to grab the files. It catches the files and not the directories. So, when filtering, we only select the files that have it in its name instead of the directory and all that is inside.

I added some information in the last section of this comment #301 (comment)

@grst
Copy link
Member

grst commented Mar 13, 2024

I see, fair enough then

Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go as soon as the tests pass!

@fmalmeida
Copy link
Contributor Author

Pipeline execution terminated.

As said here, the nf-core modules were updated and TODOs removed. Documentation was updated. And also, workflow (kallisto) was updated so that the new modules works for both non-standard kallisto workflows, lamanno and nac.

Results structure organization and namings are being produced as said here.

Finally, all testings passed, so, merging the PR 😄

@fmalmeida fmalmeida enabled auto-merge March 18, 2024 16:05
@fmalmeida fmalmeida merged commit 1043441 into dev Mar 18, 2024
11 checks passed
@fmalmeida fmalmeida deleted the 81-call-empty-droplets branch March 18, 2024 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Call empty droplets
5 participants