diff --git a/Python/1_extract_histopathological_features/myslim/__init__.py b/.gitattributes similarity index 100% rename from Python/1_extract_histopathological_features/myslim/__init__.py rename to .gitattributes diff --git a/.github/.dockstore.yml b/.github/.dockstore.yml new file mode 100755 index 0000000..191fabd --- /dev/null +++ b/.github/.dockstore.yml @@ -0,0 +1,6 @@ +# Dockstore config version, not pipeline version +version: 1.2 +workflows: + - subclass: nfl + primaryDescriptorPath: /nextflow.config + publish: True diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md new file mode 100755 index 0000000..6576937 --- /dev/null +++ b/.github/CONTRIBUTING.md @@ -0,0 +1,125 @@ +# nf-core/spotlight: Contributing Guidelines + +Hi there! +Many thanks for taking an interest in improving nf-core/spotlight. + +We try to manage the required tasks for nf-core/spotlight using GitHub issues, you probably came to this page when creating one. +Please use the pre-filled template to save time. + +However, don't be put off by this template - other more general issues and suggestions are welcome! +Contributions to the code are even more welcome ;) + +> [!NOTE] +> If you need help using or modifying nf-core/spotlight then the best place to ask is on the nf-core Slack [#spotlight](https://nfcore.slack.com/channels/spotlight) channel ([join our Slack here](https://nf-co.re/join/slack)). + +## Contribution workflow + +If you'd like to write some code for nf-core/spotlight, the standard workflow is as follows: + +1. Check that there isn't already an issue about your idea in the [nf-core/spotlight issues](https://github.com/nf-core/spotlight/issues) to avoid duplicating work. If there isn't one already, please create one so that others know you're working on this +2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [nf-core/spotlight repository](https://github.com/nf-core/spotlight) to your GitHub account +3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions) +4. Use `nf-core schema build` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10). +5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged + +If you're not used to this workflow with git, you can start with some [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests) or even their [excellent `git` resources](https://try.github.io/). + +## Tests + +You have the option to test your changes locally by running the pipeline. For receiving warnings about process selectors and other `debug` information, it is recommended to use the debug profile. Execute all the tests with the following command: + +```bash +nf-test test --profile debug,test,docker --verbose +``` + +When you create a pull request with changes, [GitHub Actions](https://github.com/features/actions) will run automatic tests. +Typically, pull-requests are only fully reviewed when these tests are passing, though of course we can help out before then. + +There are typically two types of tests that run: + +### Lint tests + +`nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to. +To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. 
This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint ` command. + +If any failures or warnings are encountered, please follow the listed URL for more documentation. + +### Pipeline tests + +Each `nf-core` pipeline should be set up with a minimal set of test-data. +`GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully. +If there are any failures then the automated tests fail. +These tests are run both with the latest available version of `Nextflow` and also the minimum required version that is stated in the pipeline code. + +## Patch + +:warning: Only in the unlikely and regretful event of a release happening with a bug. + +- On your own fork, make a new branch `patch` based on `upstream/master`. +- Fix the bug, and bump version (X.Y.Z+1). +- A PR should be made on `master` from patch to directly this particular bug. + +## Getting help + +For further information/help, please consult the [nf-core/spotlight documentation](https://nf-co.re/spotlight/usage) and don't hesitate to get in touch on the nf-core Slack [#spotlight](https://nfcore.slack.com/channels/spotlight) channel ([join our Slack here](https://nf-co.re/join/slack)). + +## Pipeline contribution conventions + +To make the nf-core/spotlight code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written. + +### Adding a new step + +If you wish to contribute a new step, please use the following coding standards: + +1. Define the corresponding input channel into your new process from the expected previous process channel +2. Write the process block (see below). +3. Define the output channel if needed (see below). +4. Add any new parameters to `nextflow.config` with a default (see below). +5. Add any new parameters to `nextflow_schema.json` with help text (via the `nf-core schema build` tool). +6. Add sanity checks and validation for all relevant parameters. +7. Perform local tests to validate that the new code works as expected. +8. If applicable, add a new test command in `.github/workflow/ci.yml`. +9. Update MultiQC config `assets/multiqc_config.yml` so relevant suffixes, file name clean up and module plots are in the appropriate order. If applicable, add a [MultiQC](https://https://multiqc.info/) module. +10. Add a description of the output files and if relevant any appropriate images from the MultiQC report to `docs/output.md`. + +### Default values + +Parameters should be initialised / defined with default values in `nextflow.config` under the `params` scope. + +Once there, use `nf-core schema build` to add to `nextflow_schema.json`. + +### Default processes resource requirements + +Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels. 
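As a rough sketch of what this looks like in practice (label names follow the nf-core standard set; the CPU/memory/time figures below are illustrative placeholders, not this pipeline's actual defaults), a `conf/base.config` block could be written as:

```groovy
// Illustrative conf/base.config excerpt: shared resource defaults via withLabel selectors.
// Values are placeholders; real pipelines tune them and usually wrap them in check_max().
process {
    // default: single-core process, scaled up on retry
    cpus   = { 1     * task.attempt }
    memory = { 6.GB  * task.attempt }
    time   = { 4.h   * task.attempt }

    withLabel:process_low {
        cpus   = { 2     * task.attempt }
        memory = { 12.GB * task.attempt }
        time   = { 4.h   * task.attempt }
    }
    withLabel:process_medium {
        cpus   = { 6     * task.attempt }
        memory = { 36.GB * task.attempt }
        time   = { 8.h   * task.attempt }
    }
    withLabel:process_high {
        cpus   = { 12    * task.attempt }
        memory = { 72.GB * task.attempt }
        time   = { 16.h  * task.attempt }
    }
}
```

A process then opts in with `label 'process_medium'`, and the resolved values become available inside the task, as described next.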
+ +The process resources can be passed on to the tool dynamically within the process with the `${task.cpus}` and `${task.memory}` variables in the `script:` block. + +### Naming schemes + +Please use the following naming schemes, to make it easy to understand what is going where. + +- initial process channel: `ch_output_from_` +- intermediate and terminal channels: `ch__for_` + +### Nextflow version bumping + +If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]` + +### Images and figures + +For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines). + +## GitHub Codespaces + +This repo includes a devcontainer configuration which will create a GitHub Codespaces for Nextflow development! This is an online developer environment that runs in your browser, complete with VSCode and a terminal. + +To get started: + +- Open the repo in [Codespaces](https://github.com/nf-core/spotlight/codespaces) +- Tools installed + - nf-core + - Nextflow + +Devcontainer specs: + +- [DevContainer config](.devcontainer/devcontainer.json) diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100755 index 0000000..a75bac4 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,50 @@ +name: Bug report +description: Report something that is broken or incorrect +labels: bug +body: + - type: markdown + attributes: + value: | + Before you post this issue, please check the documentation: + + - [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting) + - [nf-core/spotlight pipeline documentation](https://nf-co.re/spotlight/usage) + + - type: textarea + id: description + attributes: + label: Description of the bug + description: A clear and concise description of what the bug is. + validations: + required: true + + - type: textarea + id: command_used + attributes: + label: Command used and terminal output + description: Steps to reproduce the behaviour. Please paste the command you used to launch the pipeline and the output from your terminal. + render: console + placeholder: | + $ nextflow run ... + + Some output where something broke + + - type: textarea + id: files + attributes: + label: Relevant files + description: | + Please drag and drop the relevant files here. Create a `.zip` archive if the extension is not allowed. + Your verbose log file `.nextflow.log` is often useful _(this is a hidden file in the directory where you launched the pipeline)_ as well as custom Nextflow configuration files. + + - type: textarea + id: system + attributes: + label: System information + description: | + * Nextflow version _(eg. 23.04.0)_ + * Hardware _(eg. HPC, Desktop, Cloud)_ + * Executor _(eg. slurm, local, awsbatch)_ + * Container engine: _(e.g. Docker, Singularity, Conda, Podman, Shifter, Charliecloud, or Apptainer)_ + * OS _(eg. CentOS Linux, macOS, Linux Mint)_ + * Version of nf-core/spotlight _(eg. 
1.1, 1.5, 1.8.2)_ diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100755 index 0000000..3689217 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,7 @@ +contact_links: + - name: Join nf-core + url: https://nf-co.re/join + about: Please join the nf-core community here + - name: "Slack #spotlight channel" + url: https://nfcore.slack.com/channels/spotlight + about: Discussion about the nf-core/spotlight pipeline diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml new file mode 100755 index 0000000..d2a1c0f --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -0,0 +1,11 @@ +name: Feature request +description: Suggest an idea for the nf-core/spotlight pipeline +labels: enhancement +body: + - type: textarea + id: description + attributes: + label: Description of feature + description: Please describe your suggestion for a new feature. It might help to describe a problem or use case, plus any alternatives that you have considered. + validations: + required: true diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100755 index 0000000..e398068 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,26 @@ + + +## PR checklist + +- [ ] This comment contains a description of changes (with reason). +- [ ] If you've fixed a bug or added code that should be tested, add tests! +- [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/spotlight/tree/master/.github/CONTRIBUTING.md) +- [ ] If necessary, also make a PR on the nf-core/spotlight _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. +- [ ] Make sure your code lints (`nf-core lint`). +- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir `). +- [ ] Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir `). +- [ ] Usage Documentation in `docs/usage.md` is updated. +- [ ] Output Documentation in `docs/output.md` is updated. +- [ ] `CHANGELOG.md` is updated. +- [ ] `README.md` is updated (including new tool citations and authors/contributors). diff --git a/.github/workflows/awsfulltest.yml b/.github/workflows/awsfulltest.yml new file mode 100755 index 0000000..52f7b86 --- /dev/null +++ b/.github/workflows/awsfulltest.yml @@ -0,0 +1,39 @@ +name: nf-core AWS full size tests +# This workflow is triggered on published releases. +# It can be additionally triggered manually with GitHub actions workflow dispatch button. 
+# It runs the -profile 'test_full' on AWS batch + +on: + release: + types: [published] + workflow_dispatch: +jobs: + run-platform: + name: Run AWS full tests + if: github.repository == 'nf-core/spotlight' + runs-on: ubuntu-latest + steps: + - name: Launch workflow via Seqera Platform + uses: seqeralabs/action-tower-launch@v2 + # TODO nf-core: You can customise AWS full pipeline tests as required + # Add full size test data (but still relatively small datasets for few samples) + # on the `test_full.config` test runs with only one set of parameters + with: + workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }} + access_token: ${{ secrets.TOWER_ACCESS_TOKEN }} + compute_env: ${{ secrets.TOWER_COMPUTE_ENV }} + revision: ${{ github.sha }} + workdir: s3://${{ secrets.AWS_S3_BUCKET }}/work/spotlight/work-${{ github.sha }} + parameters: | + { + "hook_url": "${{ secrets.MEGATESTS_ALERTS_SLACK_HOOK_URL }}", + "outdir": "s3://${{ secrets.AWS_S3_BUCKET }}/spotlight/results-${{ github.sha }}" + } + profiles: test_full + + - uses: actions/upload-artifact@v4 + with: + name: Seqera Platform debug log file + path: | + seqera_platform_action_*.log + seqera_platform_action_*.json diff --git a/.github/workflows/awstest.yml b/.github/workflows/awstest.yml new file mode 100755 index 0000000..4b0c80b --- /dev/null +++ b/.github/workflows/awstest.yml @@ -0,0 +1,33 @@ +name: nf-core AWS test +# This workflow can be triggered manually with the GitHub actions workflow dispatch button. +# It runs the -profile 'test' on AWS batch + +on: + workflow_dispatch: +jobs: + run-platform: + name: Run AWS tests + if: github.repository == 'nf-core/spotlight' + runs-on: ubuntu-latest + steps: + # Launch workflow using Seqera Platform CLI tool action + - name: Launch workflow via Seqera Platform + uses: seqeralabs/action-tower-launch@v2 + with: + workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }} + access_token: ${{ secrets.TOWER_ACCESS_TOKEN }} + compute_env: ${{ secrets.TOWER_COMPUTE_ENV }} + revision: ${{ github.sha }} + workdir: s3://${{ secrets.AWS_S3_BUCKET }}/work/spotlight/work-${{ github.sha }} + parameters: | + { + "outdir": "s3://${{ secrets.AWS_S3_BUCKET }}/spotlight/results-test-${{ github.sha }}" + } + profiles: test + + - uses: actions/upload-artifact@v4 + with: + name: Seqera Platform debug log file + path: | + seqera_platform_action_*.log + seqera_platform_action_*.json diff --git a/.github/workflows/branch.yml b/.github/workflows/branch.yml new file mode 100755 index 0000000..bb06097 --- /dev/null +++ b/.github/workflows/branch.yml @@ -0,0 +1,44 @@ +name: nf-core branch protection +# This workflow is triggered on PRs to master branch on the repository +# It fails when someone tries to make a PR against the nf-core `master` branch instead of `dev` +on: + pull_request_target: + branches: [master] + +jobs: + test: + runs-on: ubuntu-latest + steps: + # PRs to the nf-core repo master branch are only ok if coming from the nf-core repo `dev` or any `patch` branches + - name: Check PRs + if: github.repository == 'nf-core/spotlight' + run: | + { [[ ${{github.event.pull_request.head.repo.full_name }} == nf-core/spotlight ]] && [[ $GITHUB_HEAD_REF == "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]] + + # If the above check failed, post a comment on the PR explaining the failure + # NOTE - this doesn't currently work if the PR is coming from a fork, due to limitations in GitHub actions secrets + - name: Post PR comment + if: failure() + uses: mshick/add-pr-comment@b8f338c590a895d50bcbfa6c5859251edc8952fc # v2 + with: + message: | + 
## This PR is against the `master` branch :x: + + * Do not close this PR + * Click _Edit_ and change the `base` to `dev` + * This CI test will remain failed until you push a new commit + + --- + + Hi @${{ github.event.pull_request.user.login }}, + + It looks like this pull-request is has been made against the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `master` branch. + The `master` branch on nf-core repositories should always contain code from the latest release. + Because of this, PRs to `master` are only allowed if they come from the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `dev` branch. + + You do not need to close this PR, you can change the target branch to `dev` by clicking the _"Edit"_ button at the top of this page. + Note that even after this, the test will continue to show as failing until you push a new commit. + + Thanks again for your contribution! + repo-token: ${{ secrets.GITHUB_TOKEN }} + allow-repeats: false diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100755 index 0000000..4bec948 --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,46 @@ +name: nf-core CI +# This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors +on: + push: + branches: + - dev + pull_request: + release: + types: [published] + +env: + NXF_ANSI_LOG: false + +concurrency: + group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}" + cancel-in-progress: true + +jobs: + test: + name: Run pipeline with test data + # Only run on push if this is the nf-core dev branch (merged PRs) + if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/spotlight') }}" + runs-on: ubuntu-latest + strategy: + matrix: + NXF_VER: + - "23.04.0" + - "latest-everything" + steps: + - name: Check out pipeline code + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + + - name: Install Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "${{ matrix.NXF_VER }}" + + - name: Disk space cleanup + uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1 + + - name: Run pipeline with test data + # TODO nf-core: You can customise CI pipeline run tests as required + # For example: adding multiple test runs with different parameters + # Remember that you can parallelise this by using strategy.matrix + run: | + nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results diff --git a/.github/workflows/clean-up.yml b/.github/workflows/clean-up.yml new file mode 100755 index 0000000..0b6b1f2 --- /dev/null +++ b/.github/workflows/clean-up.yml @@ -0,0 +1,24 @@ +name: "Close user-tagged issues and PRs" +on: + schedule: + - cron: "0 0 * * 0" # Once a week + +jobs: + clean-up: + runs-on: ubuntu-latest + permissions: + issues: write + pull-requests: write + steps: + - uses: actions/stale@28ca1036281a5e5922ead5184a1bbf96e5fc984e # v9 + with: + stale-issue-message: "This issue has been tagged as awaiting-changes or awaiting-feedback by an nf-core contributor. Remove stale label or add a comment otherwise this issue will be closed in 20 days." + stale-pr-message: "This PR has been tagged as awaiting-changes or awaiting-feedback by an nf-core contributor. Remove stale label or add a comment if it is still useful." 
+ close-issue-message: "This issue was closed because it has been tagged as awaiting-changes or awaiting-feedback by an nf-core contributor and then staled for 20 days with no activity." + days-before-stale: 30 + days-before-close: 20 + days-before-pr-close: -1 + any-of-labels: "awaiting-changes,awaiting-feedback" + exempt-issue-labels: "WIP" + exempt-pr-labels: "WIP" + repo-token: "${{ secrets.GITHUB_TOKEN }}" diff --git a/.github/workflows/download_pipeline.yml b/.github/workflows/download_pipeline.yml new file mode 100755 index 0000000..2d20d64 --- /dev/null +++ b/.github/workflows/download_pipeline.yml @@ -0,0 +1,86 @@ +name: Test successful pipeline download with 'nf-core download' + +# Run the workflow when: +# - dispatched manually +# - when a PR is opened or reopened to master branch +# - the head branch of the pull request is updated, i.e. if fixes for a release are pushed last minute to dev. +on: + workflow_dispatch: + inputs: + testbranch: + description: "The specific branch you wish to utilize for the test execution of nf-core download." + required: true + default: "dev" + pull_request: + types: + - opened + - edited + - synchronize + branches: + - master + pull_request_target: + branches: + - master + +env: + NXF_ANSI_LOG: false + +jobs: + download: + runs-on: ubuntu-latest + steps: + - name: Install Nextflow + uses: nf-core/setup-nextflow@v2 + + - name: Disk space cleanup + uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1 + + - uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5 + with: + python-version: "3.12" + architecture: "x64" + - uses: eWaterCycle/setup-singularity@931d4e31109e875b13309ae1d07c70ca8fbc8537 # v7 + with: + singularity-version: 3.8.3 + + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install git+https://github.com/nf-core/tools.git@dev + + - name: Get the repository name and current branch set as environment variable + run: | + echo "REPO_LOWERCASE=${GITHUB_REPOSITORY,,}" >> ${GITHUB_ENV} + echo "REPOTITLE_LOWERCASE=$(basename ${GITHUB_REPOSITORY,,})" >> ${GITHUB_ENV} + echo "REPO_BRANCH=${{ github.event.inputs.testbranch || 'dev' }}" >> ${GITHUB_ENV} + + - name: Download the pipeline + env: + NXF_SINGULARITY_CACHEDIR: ./ + run: | + nf-core download ${{ env.REPO_LOWERCASE }} \ + --revision ${{ env.REPO_BRANCH }} \ + --outdir ./${{ env.REPOTITLE_LOWERCASE }} \ + --compress "none" \ + --container-system 'singularity' \ + --container-library "quay.io" -l "docker.io" -l "ghcr.io" \ + --container-cache-utilisation 'amend' \ + --download-configuration + + - name: Inspect download + run: tree ./${{ env.REPOTITLE_LOWERCASE }} + + - name: Run the downloaded pipeline (stub) + id: stub_run_pipeline + continue-on-error: true + env: + NXF_SINGULARITY_CACHEDIR: ./ + NXF_SINGULARITY_HOME_MOUNT: true + run: nextflow run ./${{ env.REPOTITLE_LOWERCASE }}/$( sed 's/\W/_/g' <<< ${{ env.REPO_BRANCH }}) -stub -profile test,singularity --outdir ./results + - name: Run the downloaded pipeline (stub run not supported) + id: run_pipeline + if: ${{ job.steps.stub_run_pipeline.status == failure() }} + env: + NXF_SINGULARITY_CACHEDIR: ./ + NXF_SINGULARITY_HOME_MOUNT: true + run: nextflow run ./${{ env.REPOTITLE_LOWERCASE }}/$( sed 's/\W/_/g' <<< ${{ env.REPO_BRANCH }}) -profile test,singularity --outdir ./results diff --git a/.github/workflows/fix-linting.yml b/.github/workflows/fix-linting.yml new file mode 100755 index 0000000..7d075b5 --- /dev/null +++ b/.github/workflows/fix-linting.yml @@ 
-0,0 +1,89 @@ +name: Fix linting from a comment +on: + issue_comment: + types: [created] + +jobs: + fix-linting: + # Only run if comment is on a PR with the main repo, and if it contains the magic keywords + if: > + contains(github.event.comment.html_url, '/pull/') && + contains(github.event.comment.body, '@nf-core-bot fix linting') && + github.repository == 'nf-core/spotlight' + runs-on: ubuntu-latest + steps: + # Use the @nf-core-bot token to check out so we can push later + - uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + with: + token: ${{ secrets.nf_core_bot_auth_token }} + + # indication that the linting is being fixed + - name: React on comment + uses: peter-evans/create-or-update-comment@71345be0265236311c031f5c7866368bd1eff043 # v4 + with: + comment-id: ${{ github.event.comment.id }} + reactions: eyes + + # Action runs on the issue comment, so we don't get the PR by default + # Use the gh cli to check out the PR + - name: Checkout Pull Request + run: gh pr checkout ${{ github.event.issue.number }} + env: + GITHUB_TOKEN: ${{ secrets.nf_core_bot_auth_token }} + + # Install and run pre-commit + - uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5 + with: + python-version: "3.12" + + - name: Install pre-commit + run: pip install pre-commit + + - name: Run pre-commit + id: pre-commit + run: pre-commit run --all-files + continue-on-error: true + + # indication that the linting has finished + - name: react if linting finished succesfully + if: steps.pre-commit.outcome == 'success' + uses: peter-evans/create-or-update-comment@71345be0265236311c031f5c7866368bd1eff043 # v4 + with: + comment-id: ${{ github.event.comment.id }} + reactions: "+1" + + - name: Commit & push changes + id: commit-and-push + if: steps.pre-commit.outcome == 'failure' + run: | + git config user.email "core@nf-co.re" + git config user.name "nf-core-bot" + git config push.default upstream + git add . + git status + git commit -m "[automated] Fix code linting" + git push + + - name: react if linting errors were fixed + id: react-if-fixed + if: steps.commit-and-push.outcome == 'success' + uses: peter-evans/create-or-update-comment@71345be0265236311c031f5c7866368bd1eff043 # v4 + with: + comment-id: ${{ github.event.comment.id }} + reactions: hooray + + - name: react if linting errors were not fixed + if: steps.commit-and-push.outcome == 'failure' + uses: peter-evans/create-or-update-comment@71345be0265236311c031f5c7866368bd1eff043 # v4 + with: + comment-id: ${{ github.event.comment.id }} + reactions: confused + + - name: react if linting errors were not fixed + if: steps.commit-and-push.outcome == 'failure' + uses: peter-evans/create-or-update-comment@71345be0265236311c031f5c7866368bd1eff043 # v4 + with: + issue-number: ${{ github.event.issue.number }} + body: | + @${{ github.actor }} I tried to fix the linting errors, but it didn't work. Please fix them manually. + See [CI log](https://github.com/nf-core/spotlight/actions/runs/${{ github.run_id }}) for more details. diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml new file mode 100755 index 0000000..1fcafe8 --- /dev/null +++ b/.github/workflows/linting.yml @@ -0,0 +1,68 @@ +name: nf-core linting +# This workflow is triggered on pushes and PRs to the repository. +# It runs the `nf-core lint` and markdown lint tests to ensure +# that the code meets the nf-core guidelines. 
+on: + push: + branches: + - dev + pull_request: + release: + types: [published] + +jobs: + pre-commit: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + + - name: Set up Python 3.12 + uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5 + with: + python-version: "3.12" + + - name: Install pre-commit + run: pip install pre-commit + + - name: Run pre-commit + run: pre-commit run --all-files + + nf-core: + runs-on: ubuntu-latest + steps: + - name: Check out pipeline code + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + + - name: Install Nextflow + uses: nf-core/setup-nextflow@v2 + + - uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5 + with: + python-version: "3.12" + architecture: "x64" + + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install nf-core + + - name: Run nf-core lint + env: + GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }} + run: nf-core -l lint_log.txt lint --dir ${GITHUB_WORKSPACE} --markdown lint_results.md + + - name: Save PR number + if: ${{ always() }} + run: echo ${{ github.event.pull_request.number }} > PR_number.txt + + - name: Upload linting log file artifact + if: ${{ always() }} + uses: actions/upload-artifact@65462800fd760344b1a7b4382951275a0abb4808 # v4 + with: + name: linting-logs + path: | + lint_log.txt + lint_results.md + PR_number.txt diff --git a/.github/workflows/linting_comment.yml b/.github/workflows/linting_comment.yml new file mode 100755 index 0000000..40acc23 --- /dev/null +++ b/.github/workflows/linting_comment.yml @@ -0,0 +1,28 @@ +name: nf-core linting comment +# This workflow is triggered after the linting action is complete +# It posts an automated comment to the PR, even if the PR is coming from a fork + +on: + workflow_run: + workflows: ["nf-core linting"] + +jobs: + test: + runs-on: ubuntu-latest + steps: + - name: Download lint results + uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe # v3 + with: + workflow: linting.yml + workflow_conclusion: completed + + - name: Get PR number + id: pr_number + run: echo "pr_number=$(cat linting-logs/PR_number.txt)" >> $GITHUB_OUTPUT + + - name: Post PR comment + uses: marocchino/sticky-pull-request-comment@331f8f5b4215f0445d3c07b4967662a32a2d3e31 # v2 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + number: ${{ steps.pr_number.outputs.pr_number }} + path: linting-logs/lint_results.md diff --git a/.github/workflows/release-announcements.yml b/.github/workflows/release-announcements.yml new file mode 100755 index 0000000..03ecfcf --- /dev/null +++ b/.github/workflows/release-announcements.yml @@ -0,0 +1,75 @@ +name: release-announcements +# Automatic release toot and tweet anouncements +on: + release: + types: [published] + workflow_dispatch: + +jobs: + toot: + runs-on: ubuntu-latest + steps: + - name: get topics and convert to hashtags + id: get_topics + run: | + echo "topics=$(curl -s https://nf-co.re/pipelines.json | jq -r '.remote_workflows[] | select(.full_name == "${{ github.repository }}") | .topics[]' | awk '{print "#"$0}' | tr '\n' ' ')" >> $GITHUB_OUTPUT + + - uses: rzr/fediverse-action@master + with: + access-token: ${{ secrets.MASTODON_ACCESS_TOKEN }} + host: "mstdn.science" # custom host if not "mastodon.social" (default) + # GitHub event payload + # 
https://docs.github.com/en/developers/webhooks-and-events/webhooks/webhook-events-and-payloads#release + message: | + Pipeline release! ${{ github.repository }} v${{ github.event.release.tag_name }} - ${{ github.event.release.name }}! + + Please see the changelog: ${{ github.event.release.html_url }} + + ${{ steps.get_topics.outputs.topics }} #nfcore #openscience #nextflow #bioinformatics + + send-tweet: + runs-on: ubuntu-latest + + steps: + - uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5 + with: + python-version: "3.10" + - name: Install dependencies + run: pip install tweepy==4.14.0 + - name: Send tweet + shell: python + run: | + import os + import tweepy + + client = tweepy.Client( + access_token=os.getenv("TWITTER_ACCESS_TOKEN"), + access_token_secret=os.getenv("TWITTER_ACCESS_TOKEN_SECRET"), + consumer_key=os.getenv("TWITTER_CONSUMER_KEY"), + consumer_secret=os.getenv("TWITTER_CONSUMER_SECRET"), + ) + tweet = os.getenv("TWEET") + client.create_tweet(text=tweet) + env: + TWEET: | + Pipeline release! ${{ github.repository }} v${{ github.event.release.tag_name }} - ${{ github.event.release.name }}! + + Please see the changelog: ${{ github.event.release.html_url }} + TWITTER_CONSUMER_KEY: ${{ secrets.TWITTER_CONSUMER_KEY }} + TWITTER_CONSUMER_SECRET: ${{ secrets.TWITTER_CONSUMER_SECRET }} + TWITTER_ACCESS_TOKEN: ${{ secrets.TWITTER_ACCESS_TOKEN }} + TWITTER_ACCESS_TOKEN_SECRET: ${{ secrets.TWITTER_ACCESS_TOKEN_SECRET }} + + bsky-post: + runs-on: ubuntu-latest + steps: + - uses: zentered/bluesky-post-action@80dbe0a7697de18c15ad22f4619919ceb5ccf597 # v0.1.0 + with: + post: | + Pipeline release! ${{ github.repository }} v${{ github.event.release.tag_name }} - ${{ github.event.release.name }}! + + Please see the changelog: ${{ github.event.release.html_url }} + env: + BSKY_IDENTIFIER: ${{ secrets.BSKY_IDENTIFIER }} + BSKY_PASSWORD: ${{ secrets.BSKY_PASSWORD }} + # diff --git a/.gitignore b/.gitignore index 02a6e52..19c927e 100644 --- a/.gitignore +++ b/.gitignore @@ -11,11 +11,9 @@ data_example .docker_temp* .txt *.pyc -data/checkpoint .vscode slurm_out output -output_example # Ignore nextflow files in development nf-* .nextflow.log* @@ -29,3 +27,37 @@ spatial_features_matrix_TCGA.csv TEST-* clincial_file*.tsv data/clinical_file_TCGA_SKCM.tsv +*.tar.gz + +# Skip data files +output_* + + +.nextflow* +work/ +data/ +results/ +.DS_Store +testing/ +testing* +*.pyc + +output/ +assets/checkpoint +assets/codebook.txt +asse + +spotlight.sif +plugins/* + + +DUMMY-* +BACKUP-* +test-data + + +ARCHIVE + + +gaitilab.config +nf_run_spotlight.sh diff --git a/Python/3_spatial_characterization/__init__.py b/CITATIONS.md similarity index 100% rename from Python/3_spatial_characterization/__init__.py rename to CITATIONS.md diff --git a/Dockerfile b/Dockerfile index cb0edf7..b786d78 100644 --- a/Dockerfile +++ b/Dockerfile @@ -7,17 +7,31 @@ RUN apt-get update && \ apt-get install -y python3-pip && \ apt-get install -y openslide-tools && \ apt-get install -y python3-openslide && \ - apt-get install -y libgl1-mesa-glx + apt-get install -y libgl1-mesa-glx && \ + apt-get install -y python3.8-dev && \ + apt-get install -y build-essential && \ + apt-get install -y pkg-config && \ + apt-get install -y python-dev && \ + apt-get install -y libhdf5-dev && \ + apt-get install -y libblosc-dev + # Set up python environment # RUN apt install python3.8-venv -RUN python3 -m venv /spotlight_venv -RUN . 
spotlight_venv/bin/activate + +# Add nf-bin with all Python/R scripts +ENV VIRTUAL_ENV=/spotlight_venv +RUN python3 -m venv ${VIRTUAL_ENV} +ENV PATH="${VIRTUAL_ENV}/bin:$PATH" + +# RUN . spotlight_venv/bin/activate COPY ./env_requirements.txt ./ -RUN pip3 install -r env_requirements.txt +RUN pip3 install --upgrade pip setuptools wheel +RUN pip3 install --default-timeout=900 -r env_requirements.txt # Set up directories # -RUN mkdir -p /project/Python/libs -WORKDIR / +# WORKDIR /nf-bin + +ENV PATH="nf-bin:${PATH}" + -ENV PYTHONPATH /project/Python/libs diff --git a/Python/1_extract_histopathological_features/codebook.txt b/Python/1_extract_histopathological_features/codebook.txt deleted file mode 100644 index 42ce278..0000000 --- a/Python/1_extract_histopathological_features/codebook.txt +++ /dev/null @@ -1,42 +0,0 @@ -ACC_T 0 -BLCA_T 1 -BRCA_N 2 -BRCA_T 3 -CESC_T 4 -COAD_N 5 -COAD_T 6 -ESCA_N 7 -ESCA_T 8 -GBM_T 9 -HNSC_N 10 -HNSC_T 11 -KICH_N 12 -KICH_T 13 -KIRC_N 14 -KIRC_T 15 -KIRP_N 16 -KIRP_T 17 -LGG_T 18 -LIHC_N 19 -LIHC_T 20 -LUAD_N 21 -LUAD_T 22 -LUSC_N 23 -LUSC_T 24 -MESO_T 25 -OV_N 26 -OV_T 27 -PCPG_T 28 -PRAD_N 29 -PRAD_T 30 -READ_T 31 -SARC_T 32 -STAD_N 33 -STAD_T 34 -TGCT_T 35 -THCA_N 36 -THCA_T 37 -THYM_T 38 -UCEC_T 39 -UVM_T 40 -SKCM_T 41 diff --git a/Python/1_extract_histopathological_features/myslim/create_clinical_file.py b/Python/1_extract_histopathological_features/myslim/create_clinical_file.py deleted file mode 100644 index 12ab064..0000000 --- a/Python/1_extract_histopathological_features/myslim/create_clinical_file.py +++ /dev/null @@ -1,154 +0,0 @@ -import argparse -import os -import os.path -import numpy as np -import pandas as pd -import sys - - -def create_TCGA_clinical_file( - class_names, - clinical_files_dir, - output_dir=None, - tumor_purity_threshold=80, - path_codebook=None -): - """ - Create a clinical file based on the slide metadata downloaded from the GDC data portal - 1. Read the files and add classname and id based on CODEBOOK.txt - 2. Filter tumor purity - 3. Save file - - Args: - class_names (str): single class name e.g. LUAD_T or path to file with class names - clinical_files_dir (str): String with path to folder with subfolders pointing to the raw clinical files (slide.tsv) - output_dir (str): Path to folder where the clinical file should be stored - tumor_purity_threshold (int): default=80 - multi_class_path (str): path to file with class names to be merged into one clinical file - - Returns: - {output_dir}/generated_clinical_file.txt" containing the slide_submitter_id, sample_submitter_id, image_file_name, percent_tumor_cells, class_name, class_id in columns and records (slides) in rows. - - """ - # ---- Setup parameters ---- # - if not os.path.isdir(output_dir): - os.mkdir(output_dir) - - if (os.path.isfile(class_names)): # multi class names - class_names = pd.read_csv( - class_names, header=None).to_numpy().flatten() - else: # single class names - class_name = class_names - - CODEBOOK = pd.read_csv( - path_codebook, - delim_whitespace=True, - header=None, names=["class_name", "value"] - ) - - # ---- 1. 
Constructing a merged clinical file ---- # - # Read clinical files - # a) Single class - if os.path.isfile(clinical_files_dir): - clinical_file = pd.read_csv(clinical_files_dir, sep="\t") - # only keep tissue (remove _T or _N) to check in filename - clinical_file["class_name"] = class_name - clinical_file["class_id"] = int( - CODEBOOK.loc[CODEBOOK["class_name"] == class_name].values[0][1] - ) - print(clinical_file) - print(CODEBOOK) - # b) Multiple classes - elif os.path.isdir(clinical_files_dir) & (len(class_names) > 1): - clinical_file_list = [] - # Combine all clinical raw files based on input - for class_name in class_names: - clinical_file_temp = pd.read_csv( - f"{clinical_files_dir}/clinical_file_TCGA_{class_name[:-2]}.tsv", - sep="\t", - ) - # only keep tissue (remove _T or _N) to check in filename - clinical_file_temp["class_name"] = class_name - clinical_file_temp["class_id"] = int( - CODEBOOK.loc[CODEBOOK["class_name"] == class_name].values[0][1] - ) - clinical_file_list.append(clinical_file_temp) - clinical_file = pd.concat( - clinical_file_list, axis=0).reset_index(drop=True) - - # ---- 2) Filter: Availability of tumor purity (percent_tumor_cells) ---- # - # Remove rows with missing tumor purity - clinical_file["percent_tumor_cells"] = ( - clinical_file["percent_tumor_cells"] - .replace("'--", np.nan, regex=True) - .astype(float) - ) - - # Convert strings to numeric type - clinical_file["percent_tumor_cells"] = pd.to_numeric( - clinical_file["percent_tumor_cells"] - ) - clinical_file = clinical_file.dropna(subset=["percent_tumor_cells"]) - clinical_file = clinical_file.where( - clinical_file["percent_tumor_cells"] >= float(tumor_purity_threshold) - ) - # ---- 3) Formatting and saving ---- # - clinical_file["image_file_name"] = [ - f"{slide_submitter_id}.{str(slide_id).upper()}.svs" - for slide_submitter_id, slide_id in clinical_file[ - ["slide_submitter_id", "slide_id"] - ].to_numpy() - ] - - clinical_file = clinical_file.dropna(how="all") - clinical_file = clinical_file.drop_duplicates() - clinical_file = clinical_file.drop_duplicates(subset="slide_submitter_id") - clinical_file = clinical_file[ - [ - "slide_submitter_id", - "sample_submitter_id", - "image_file_name", - "percent_tumor_cells", - "class_name", - "class_id", - ] - ] - clinical_file = clinical_file.dropna(how="any", axis=0) - clinical_file.to_csv( - f"{output_dir}/generated_clinical_file.txt", - index=False, - sep="\t", - ) - print("\nFinished creating a new clinical file") - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--class_names", - help="Either (a) single classname or (b) Path to file with classnames according to codebook.txt (e.g. 
LUAD_T)", required=True - ) - parser.add_argument( - "--clinical_files_dir", - help="Path to folders containing subfolders for multiple tumor types.", required=True - ) - parser.add_argument( - "--tumor_purity_threshold", - help="Integer for filtering tumor purity assessed by pathologists", - default=80, required=False - ) - parser.add_argument( - "--output_dir", help="Path to folder for saving all created files", default=None, required=False - ) - parser.add_argument( - "--path_codebook", help="Path to codebook", default=None, required=False - ) - args = parser.parse_args() - - create_TCGA_clinical_file( - class_names=args.class_names, - tumor_purity_threshold=args.tumor_purity_threshold, - clinical_files_dir=args.clinical_files_dir, - output_dir=args.output_dir, - path_codebook=args.path_codebook - ) diff --git a/Python/1_extract_histopathological_features/myslim/create_file_info_train.py b/Python/1_extract_histopathological_features/myslim/create_file_info_train.py deleted file mode 100644 index 9429dbf..0000000 --- a/Python/1_extract_histopathological_features/myslim/create_file_info_train.py +++ /dev/null @@ -1,97 +0,0 @@ -import argparse -import os -import os.path -import sys -import pandas as pd - -sys.path.append(f"{os.path.dirname(os.getcwd())}/Python/libs") -REPO_DIR = os.path.dirname(os.getcwd()) - -# trunk-ignore(flake8/E402) -import DL.utils as utils - -# trunk-ignore(flake8/E402) -from openslide import OpenSlide - -def format_tile_data_structure(slides_folder, output_folder, clinical_file_path): - """ - Specifying the tile data structure required to store tiles as TFRecord files (used in convert.py) - - Args: - slides_folder (str): path pointing to folder with all whole slide images (.svs files) - output_folder (str): path pointing to folder for storing all created files by script - clinical_file_path (str): path pointing to formatted clinical file (either generated or manually formatted) - - Returns: - {output_folder}/file_info_train.txt containing the path to the individual tiles, class name, class id, percent of tumor cells and JPEG quality - - """ - tiles_folder = output_folder + "/tiles" - - clinical_file = pd.read_csv(clinical_file_path, sep="\t") - clinical_file.dropna(how="all", inplace=True) - clinical_file.drop_duplicates(inplace=True) - clinical_file.drop_duplicates(subset="slide_submitter_id", inplace=True) - - # 2) Determine the paths paths of jpg tiles - all_tile_names = os.listdir(tiles_folder) - jpg_tile_names = [] - jpg_tile_paths = [] - - for tile_name in all_tile_names: - if "jpg" in tile_name: - jpg_tile_names.append(tile_name) - jpg_tile_paths.append(tiles_folder + "/" + tile_name) - - # 3) Get corresponding data from the clinical file based on the tile names - jpg_tile_names_stripped = [ - utils.get_slide_submitter_id(jpg_tile_name) for jpg_tile_name in jpg_tile_names - ] - jpg_tile_names_df = pd.DataFrame( - jpg_tile_names_stripped, columns=["slide_submitter_id"] - ) - jpg_tiles_df = pd.merge( - jpg_tile_names_df, clinical_file, on=["slide_submitter_id"], how="left" - ) - - # 4) Determine jpeg_quality of slides - slide_quality = [] - for slide_name in jpg_tiles_df.image_file_name.unique(): - print("{}/{}".format(slides_folder, slide_name)) - img = OpenSlide("{}/{}".format(slides_folder, slide_name)) - #print(img.properties.values) - #image_description = img.properties.values.__self__.get("tiff.ImageDescription").split("|")[0] - #image_description_split = image_description.split(" ") - #jpeg_quality = image_description_split[-1] - jpeg_quality = "80" - 
slide_quality.append([slide_name, "RGB" + jpeg_quality]) - - slide_quality_df = pd.DataFrame( - slide_quality, columns=["image_file_name", "jpeg_quality"] - ) - jpg_tiles_df = pd.merge( - jpg_tiles_df, slide_quality_df, on=["image_file_name"], how="left" - ) - jpg_tiles_df["tile_path"] = jpg_tile_paths - - # Create output dataframe - output = jpg_tiles_df[ - ["tile_path", "class_name", "class_id", "jpeg_quality", "percent_tumor_cells"] - ] - output.to_csv(output_folder + "/file_info_train.txt", index=False, sep="\t") - - print("Finished creating the necessary file for computing the features in the next step") - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("--slides_folder", help="Set slides folder") - parser.add_argument("--output_folder", help="Set output folder") - parser.add_argument("--clin_path", help="Set clinical file path") - args = parser.parse_args() - - format_tile_data_structure( - slides_folder=args.slides_folder, - output_folder=args.output_folder, - clinical_file_path=args.clin_path, - ) diff --git a/Python/1_extract_histopathological_features/myslim/create_tiles_from_slides.py b/Python/1_extract_histopathological_features/myslim/create_tiles_from_slides.py deleted file mode 100644 index 2e476df..0000000 --- a/Python/1_extract_histopathological_features/myslim/create_tiles_from_slides.py +++ /dev/null @@ -1,97 +0,0 @@ -#!/usr/bin/python -import os -import sys - -import numpy as np -import pandas as pd -from PIL import Image - -sys.path.append(f"{os.path.dirname(os.getcwd())}/Python/libs") -REPO_DIR = os.path.dirname(os.getcwd()) - -# trunk-ignore(flake8/E402) -from openslide import OpenSlide - -# trunk-ignore(flake8/E402) -import DL.image as im - -def create_tiles_from_slides(slides_folder, output_folder, clinical_file_path): - """ - Create tiles from slides - Dividing the whole slide images into tiles with a size of 512 x 512 pixels, with an overlap of 50 pixels at a magnification of 20x. In addition, remove blurred and non-informative tiles by using the weighted gradient magnitude. - - Source: - Fu, Y., Jung, A. W., Torne, R. V., Gonzalez, S., Vöhringer, H., Shmatko, A., Yates, L. R., Jimenez-Linan, M., Moore, L., & Gerstung, M. (2020). Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nature Cancer, 1(8), 800–810. https://doi.org/10.1038/s43018-020-0085-8 - - Args: - slides_folder (str): path pointing to folder with all whole slide images (.svs files) - output_folder (str): path pointing to folder for storing all created files by script (i.e. 
.jpg files for the created tiles) - - Returns: - jpg files for the created tiles in the specified folder {output_folder}/tiles - - """ - - # Create folder for storing the tiles if non-existent - tiles_folder = "{}/tiles".format(output_folder) - if not os.path.exists(tiles_folder): - os.makedirs(tiles_folder) - print(tiles_folder) - - # Subset images of interest (present in generated clinical file) - clinical_file = pd.read_csv(clinical_file_path, sep="\t") - clinical_file.dropna(how="all", inplace=True) - clinical_file.drop_duplicates(inplace=True) - clinical_file.drop_duplicates(subset="slide_submitter_id", inplace=True) - subset_images=clinical_file.image_file_name.tolist() - print(subset_images) - # Check if slides are among our data - available_images=os.listdir(slides_folder) - images_for_tiling=list(set(subset_images) & set(available_images)) - - print(len(images_for_tiling), 'images available:') - counter=1 - for slide_filename in images_for_tiling: - if slide_filename.endswith(('.svs','.ndpi')): - print(counter, ':', slide_filename) - slide = OpenSlide("{}/{}".format(slides_folder, slide_filename)) - slide_name = slide_filename.split(".")[0] - if ( - str(slide.properties.values.__self__.get("tiff.ImageDescription")).find( - "AppMag = 40" - ) - != -1 - ): - region_size = 1024 - tile_size = 924 - else: - region_size = 512 - tile_size = 462 - [width, height] = slide.dimensions - for x_coord in range(1, width, tile_size): - for y_coord in range(1, height, tile_size): - slide_region = slide.read_region( - location=(x_coord, y_coord), - level=0, - size=(region_size, region_size), - ) - slide_region_converted = slide_region.convert("RGB") - tile = slide_region_converted.resize((512, 512), Image.ANTIALIAS) - grad = im.getGradientMagnitude(np.array(tile)) - unique, counts = np.unique(grad, return_counts=True) - if counts[np.argwhere(unique <= 20)].sum() < 512 * 512 * 0.6: - tile.save( - "{}/{}_{}_{}.jpg".format( - tiles_folder, slide_name, x_coord, y_coord - ), - "JPEG", - optimize=True, - quality=94, - ) - counter=counter+1 - - print("Finished creating tiles from the given slides") - - -if __name__ == "__main__": - create_tiles_from_slides(sys.argv[1], sys.argv[2]) diff --git a/Python/1_extract_histopathological_features/myslim/post_process_features.py b/Python/1_extract_histopathological_features/myslim/post_process_features.py deleted file mode 100644 index bbc84a0..0000000 --- a/Python/1_extract_histopathological_features/myslim/post_process_features.py +++ /dev/null @@ -1,95 +0,0 @@ -#  Module imports -import argparse -import os -import sys -import dask.dataframe as dd -import pandas as pd - -#  Custom imports -import DL.utils as utils - - -def post_process_features(output_dir, slide_type, data_source="TCGA"): - """ - Format extracted histopathological features from bot.train.txt file generated by myslim/bottleneck_predict.py and extract the 1,536 features, tile names. Extract several variables from tile ID. 
- - Args: - output_dir (str): path pointing to folder for storing all created files by script - - Returns: - {output_dir}/features.txt contains the 1,536 features, followed by the sample_submitter_id, tile_ID, slide_submitter_id, Section, Coord_X and Coord_Y and in the rows the tiles - """ - # Read histopathological computed features - if slide_type == "FF": - features_raw = pd.read_csv( - output_dir + "/bot_train.txt", sep="\t", header=None) - # Extract the DL features (discard: col1 = tile paths, col2 = true class id) - features = features_raw.iloc[:, 2:] - features.columns = list(range(1536)) - # Add new column variables that define each tile - features["tile_ID"] = [utils.get_tile_name( - tile_path) for tile_path in features_raw.iloc[:, 0]] - features["Coord_X"] = [i[-2] - for i in features["tile_ID"].str.split("_")] - features["Coord_Y"] = [i[-1] - for i in features["tile_ID"].str.split("_")] - # FIX add sample_submitter_id and slide_submitter_id depending on data_source - if (data_source == "TCGA"): - features["sample_submitter_id"] = features["tile_ID"].str[0:16] - features["slide_submitter_id"] = features["tile_ID"].str[0:23] - features["Section"] = features["tile_ID"].str[20:23] - else: - features["sample_submitter_id"] = features['tile_ID'].str.split( - '_').str[0] - - #  Save features to .csv file - features.to_csv(output_dir + "/features.txt", sep="\t", header=True) - - elif slide_type == "FFPE": - features_raw = dd.read_csv( - output_dir + "/bot_train.txt", sep="\t", header=None) - features_raw['tile_ID'] = features_raw.iloc[:, 0] - features_raw.tile_ID = features_raw.tile_ID.map( - lambda x: x.split("/")[-1]) - features_raw['tile_ID'] = features_raw['tile_ID'].str.replace( - ".jpg'", "") - features = features_raw.map_partitions( - lambda df: df.drop(columns=[0, 1])) - new_names = list(map(lambda x: str(x), list(range(1536)))) - new_names.append('tile_ID') - features.columns = new_names - # FIX add sample_submitter_id and slide_submitter_id depending on data_source - if (data_source == "TCGA"): - features["sample_submitter_id"] = features["tile_ID"].str[0:16] - features["slide_submitter_id"] = features["tile_ID"].str[0:23] - features["Section"] = features["tile_ID"].str[20:23] - else: - features["sample_submitter_id"] = features['tile_ID'].str.split( - '_').str[0] - features['Coord_X'] = features['tile_ID'].str.split('_').str[1] - features['Coord_Y'] = features['tile_ID'].str.split('_').str[-1] - # Save features using parquet - # TODO TESTING move function to utils and convert to def instead of lambda - # name_function=lambda x: f"features-{x}.parquet" - OUTPUT_PATH = f"{output_dir}/features_format_parquet" - if os.path.exists(OUTPUT_PATH): - print("Folder exists") - else: - os.makedirs(OUTPUT_PATH) - - features.to_parquet(path=OUTPUT_PATH, compression='gzip', - name_function=utils.name_function) - - print("Formatted all features") - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("--output_dir", help="Set output folder") - parser.add_argument( - "--slide_type", help="Type of tissue slide (FF or FFPE)]") - parser.add_argument( - "--data_source", help="Data source, default='TCGA'") - args = parser.parse_args() - post_process_features(output_dir=args.output_dir, - slide_type=args.slide_type, data_source=args.data_source) diff --git a/Python/1_extract_histopathological_features/myslim/post_process_predictions.py b/Python/1_extract_histopathological_features/myslim/post_process_predictions.py deleted file mode 100644 index e126a8c..0000000 
--- a/Python/1_extract_histopathological_features/myslim/post_process_predictions.py +++ /dev/null @@ -1,215 +0,0 @@ -# Module imports -import argparse -import os -import sys -import dask.dataframe as dd -import pandas as pd - -#  Custom imports -import DL.utils as utils -import numpy as np - - -def post_process_predictions(output_dir, slide_type, path_codebook, path_tissue_classes): - """ - Format predicted tissue classes and derive tumor purity from pred.train.txt file generated by myslim/bottleneck_predict.py and - The pred.train.txt file contains the tile ID, the true class id and the 42 predicted probabilities for the 42 tissue classes. - - Args: - output_dir (str): path pointing to folder for storing all created files by script - - Returns: - {output_dir}/predictions.txt containing the following columns - - tile_ID, - - pred_class_id and true_class_id: class ids defined in codebook.txt) - - pred_class_name and true_class_name: class names e.g. LUAD_T, defined in codebook.txt) - - pred_probability: corresponding probability - - is_correct_pred (boolean): correctly predicted tissue class label - - tumor_label_prob and normal_label_prob: probability for predicting tumor and normal label (regardless of tumor or tissue type) - - is_correct_pred_label (boolean): correctly predicted 'tumor' or 'normal' tissue regardless of tumor or tissue type - In the rows the tiles. - """ - - # Initialize - # path_codebook = f"{os.path.dirname(os.getcwd())}/Python/1_extract_histopathological_features/codebook.txt" - # path_tissue_classes = f"{os.path.dirname(os.getcwd())}/Python/1_extract_histopathological_features/tissue_classes.csv" - codebook = pd.read_csv(path_codebook, delim_whitespace=True, header=None) - codebook.columns = ["class_name", "class_id"] - tissue_classes = pd.read_csv(path_tissue_classes, sep="\t") - - # Read predictions - if slide_type == "FF": - predictions_raw = pd.read_csv( - output_dir + "/pred_train.txt", sep="\t", header=None) - # Extract tile name incl. 
coordinates from path - tile_names = [utils.get_tile_name(tile_path) - for tile_path in predictions_raw[0]] - # Create output dataframe for post-processed data - predictions = pd.DataFrame(tile_names, columns=["tile_ID"]) - # Get predicted probabilities for all 42 classes + rename columns - pred_probabilities = predictions_raw.iloc[:, 2:] - pred_probabilities.columns = codebook["class_id"] - # Get predicted and true class ids - predictions["pred_class_id"] = pred_probabilities.idxmax( - axis="columns") - predictions["true_class_id"] = 41 - # Get corresponding max probabilities to the predicted class - predictions["pred_probability"] = pred_probabilities.max(axis=1) - # Replace class id with class name - predictions["true_class_name"] = predictions["true_class_id"].copy() - predictions["pred_class_name"] = predictions["pred_class_id"].copy() - found_class_ids = set(predictions["true_class_id"]).union( - set(predictions["pred_class_id"])) - for class_id in found_class_ids: - predictions["true_class_name"].replace( - class_id, codebook["class_name"][class_id], inplace=True - ) - predictions["pred_class_name"].replace( - class_id, codebook["class_name"][class_id], inplace=True - ) - - # Define whether prediction was right - predictions["is_correct_pred"] = ( - predictions["true_class_id"] == predictions["pred_class_id"]) - predictions["is_correct_pred"] = predictions["is_correct_pred"].replace( - False, "F") - predictions.is_correct_pred = predictions.is_correct_pred.astype(str) - # Get tumor and tissue ID - # TODO ERROR - temp = pd.DataFrame( - {"tumor_type": predictions["true_class_name"].str[:-2]}) - temp = pd.merge(temp, tissue_classes, on="tumor_type", how="left") - # Set of IDs for normal and tumor (because of using multiple classes) - IDs_tumor = list(set(temp["ID_tumor"])) - if list(set(temp.tumor_type.tolist()))[0] == 'SKCM': - # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) - predictions["tumor_label_prob"] = np.nan - predictions["normal_label_prob"] = np.nan - for ID_tumor in IDs_tumor: - vals = pred_probabilities.loc[temp["ID_tumor"] - == ID_tumor, ID_tumor] - predictions.loc[temp["ID_tumor"] == - ID_tumor, "tumor_label_prob"] = vals - - predictions["is_correct_pred_label"] = np.nan - else: - IDs_normal = list(set(temp["ID_normal"])) - # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) - predictions["tumor_label_prob"] = np.nan - predictions["normal_label_prob"] = np.nan - for ID_tumor in IDs_tumor: - vals = pred_probabilities.loc[temp["ID_tumor"] - == ID_tumor, ID_tumor] - predictions.loc[temp["ID_tumor"] == - ID_tumor, "tumor_label_prob"] = vals - - for ID_normal in IDs_normal: - vals = pred_probabilities.loc[temp["ID_normal"] - == ID_normal, ID_normal] - predictions.loc[temp["ID_normal"] == - ID_normal, "normal_label_prob"] = vals - - # Check if the correct label (tumor/normal) is predicted - temp_probs = predictions[["tumor_label_prob", "normal_label_prob"]] - is_normal_label_prob = ( - temp_probs["normal_label_prob"] > temp_probs["tumor_label_prob"] - ) - is_tumor_label_prob = ( - temp_probs["normal_label_prob"] < temp_probs["tumor_label_prob"] - ) - is_normal_label = predictions["true_class_name"].str.find( - "_N") != -1 - is_tumor_label = predictions["true_class_name"].str.find( - "_T") != -1 - - is_normal = is_normal_label & is_normal_label_prob - is_tumor = is_tumor_label & is_tumor_label_prob - - predictions["is_correct_pred_label"] = is_normal | is_tumor - 
predictions["is_correct_pred_label"].replace( - True, "T", inplace=True) - predictions["is_correct_pred_label"].replace( - False, "F", inplace=True) - - #  Save features to .csv file - predictions.to_csv(output_dir + "/predictions.txt", sep="\t") - - elif slide_type == "FFPE": - predictions_raw = dd.read_csv( - output_dir + "/pred_train.txt", sep="\t", header=None) - predictions_raw['tile_ID'] = predictions_raw.iloc[:, 0] - predictions_raw.tile_ID = predictions_raw.tile_ID.map( - lambda x: x.split("/")[-1]) - predictions_raw['tile_ID'] = predictions_raw['tile_ID'].str.replace( - ".jpg'", "") - predictions = predictions_raw.map_partitions( - lambda df: df.drop(columns=[0, 1])) - new_names = list(map(lambda x: str(x), codebook["class_id"])) - new_names.append('tile_ID') - predictions.columns = new_names - predictions = predictions.map_partitions(lambda x: x.assign( - pred_class_id=x.iloc[:, 0:41].idxmax(axis="columns"))) - predictions["true_class_id"] = 41 - predictions = predictions.map_partitions(lambda x: x.assign( - pred_probability=x.iloc[:, 0:41].max(axis="columns"))) - predictions["true_class_name"] = predictions["true_class_id"].copy() - predictions["pred_class_name"] = predictions["pred_class_id"].copy() - predictions.pred_class_id = predictions.pred_class_id.astype(int) - res = dict(zip(codebook.class_id, codebook.class_name)) - predictions = predictions.map_partitions(lambda x: x.assign( - pred_class_name=x.loc[:, 'pred_class_id'].replace(res))) - predictions = predictions.map_partitions(lambda x: x.assign( - true_class_name=x.loc[:, 'true_class_id'].replace(res))) - predictions["is_correct_pred"] = ( - predictions["true_class_id"] == predictions["pred_class_id"]) - predictions["is_correct_pred"] = predictions["is_correct_pred"].replace( - False, "F") - predictions.is_correct_pred = predictions.is_correct_pred.astype(str) - temp = predictions.map_partitions(lambda x: x.assign( - tumor_type=x["true_class_name"].str[:-2])) - temp = temp.map_partitions(lambda x: pd.merge( - x, tissue_classes, on="tumor_type", how="left")) - if (temp['tumor_type'].compute() == 'SKCM').any(): - # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) - predictions["tumor_label_prob"] = np.nan - predictions["normal_label_prob"] = np.nan - predictions = predictions.map_partitions( - lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) - predictions["is_correct_pred_label"] = np.nan - else: - # TO DO - predictions["tumor_label_prob"] = np.nan - predictions["normal_label_prob"] = np.nan - # predictions = predictions.map_partitions(lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) - # predictions = predictions.map_partitions(lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) - - # Save features using parquet - def name_function(x): return f"predictions-{x}.parquet" - OUTPUT_PATH = f"{output_dir}/predictions_format_parquet" - if os.path.exists(OUTPUT_PATH): - print("Folder exists") - else: - os.makedirs(OUTPUT_PATH) - - predictions.to_parquet( - path=OUTPUT_PATH, compression='gzip', name_function=name_function) - - print("Formatted all predicted tissue labels") - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("--output_dir", help="Set output folder") - parser.add_argument( - "--slide_type", help="Type of tissue slide (FF or FFPE)]") - parser.add_argument( - "--path_codebook", help="codebook.txt file", required=True, type=str) - parser.add_argument( - "--path_tissue_classes", help="Tissue_classes.csv file", required=True, type=str) 
- - args = parser.parse_args() - post_process_predictions(output_dir=args.output_dir, slide_type=args.slide_type, path_codebook=args.path_codebook, - path_tissue_classes=args.path_tissue_classes) - - -# $cur_dir/codebook.txt $cur_dir/tissue_classes.csv $output_dir diff --git a/Python/1_extract_histopathological_features/myslim/python_test.py b/Python/1_extract_histopathological_features/myslim/python_test.py deleted file mode 100644 index 82f0164..0000000 --- a/Python/1_extract_histopathological_features/myslim/python_test.py +++ /dev/null @@ -1,7 +0,0 @@ -import os -import DL.utils as utils -import sys -print(sys.path) - -print(os.path.dirname(os.getcwd())) -print(os.getcwd()) diff --git a/Python/1_extract_histopathological_features/post_processing.py b/Python/1_extract_histopathological_features/post_processing.py deleted file mode 100644 index c05ef15..0000000 --- a/Python/1_extract_histopathological_features/post_processing.py +++ /dev/null @@ -1,49 +0,0 @@ -import argparse -import os -from myslim.post_process_features import post_process_features -from myslim.post_process_predictions import post_process_predictions - - -def execute_postprocessing(output_dir, slide_type, path_codebook, path_tissue_classes, data_source): - """ - 1. Format extracted histopathological features - 2. Format predictions of the 42 classes - - Args: - output_dir (str): path pointing to folder for storing all created files by script - - Returns: - {output_dir}/features.txt - {output_dir}/predictions.txt - """ - post_process_features(output_dir=output_dir, - slide_type=slide_type, data_source=data_source) - post_process_predictions(output_dir=output_dir, slide_type=slide_type, - path_codebook=path_codebook, path_tissue_classes=path_tissue_classes) - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("--output_dir", help="Set output folder", type=str) - parser.add_argument( - "--slide_type", help="Type of tissue slide (FF or FFPE)", required=True, type=str) - parser.add_argument( - "--path_codebook", help="codebook.txt file", required=True, type=str) - parser.add_argument( - "--path_tissue_classes", help="Tissue_classes.csv file", required=True, type=str) - parser.add_argument( - "--data_source", help="Data source, default='TCGA'") - args = parser.parse_args() - - if os.path.exists(args.output_dir): - print("Output folder exists") - else: - os.makedirs(args.output_dir) - - execute_postprocessing( - output_dir=args.output_dir, - slide_type=args.slide_type, - path_codebook=args.path_codebook, - path_tissue_classes=args.path_tissue_classes, - data_source=args.data_source - ) diff --git a/Python/1_extract_histopathological_features/pre_processing.py b/Python/1_extract_histopathological_features/pre_processing.py deleted file mode 100755 index 06e5078..0000000 --- a/Python/1_extract_histopathological_features/pre_processing.py +++ /dev/null @@ -1,85 +0,0 @@ -import argparse -import os -import pandas as pd -import sys - -from myslim.create_file_info_train import format_tile_data_structure -from myslim.create_tiles_from_slides import create_tiles_from_slides -from myslim.datasets.convert import _convert_dataset - -# sys.path.append(f"{os.path.dirname(os.getcwd())}/Python/libs") -# REPO_DIR = os.path.dirname(os.getcwd()) - - -def execute_preprocessing(slides_folder, output_folder, clinical_file_path, N_shards=320): - """ - Execute several pre-processing steps necessary for extracting the histopathological features - 1. Create tiles from slides - 2. 
Construct file necessary for the deep learning architecture - 3. Convert images of tiles to TF records - - Args: - slides_folder (str): path pointing to folder with all whole slide images (.svs files) - output_folder (str): path pointing to folder for storing all created files by script - clinical_file_path (str): path pointing to formatted clinical file (either generated or manually formatted) - N_shards (int): default: 320 - checkpoint_path (str): path pointing to checkpoint to be used - - Returns: - {output_folder}/tiles/{tile files} - {output_folder}/file_info_train.txt file specifying data structure of the tiles required for inception architecture (to read the TF records) - {output_folder}/process_train/{TFrecord file} files that store the data as a series of binary sequencies - - """ - - # Create an empty folder for TF records if folder doesn't exist - process_train_dir = f"{output_folder}/process_train" - if not os.path.exists(process_train_dir): - os.makedirs(process_train_dir) - - # Perform image tiling, only kept images of interest - create_tiles_from_slides(slides_folder=slides_folder, - output_folder=output_folder, clinical_file_path=clinical_file_path) - - # File required for training - format_tile_data_structure( - slides_folder=slides_folder, - output_folder=output_folder, - clinical_file_path=clinical_file_path - ) - - # Convert tiles from jpg to TF record1 - file_info = pd.read_csv(f"{output_folder}/file_info_train.txt", sep="\t") - training_filenames = list(file_info["tile_path"].values) - training_classids = [int(id) for id in list(file_info["class_id"].values)] - tps = [int(id) for id in list(file_info["percent_tumor_cells"].values)] - Qs = list(file_info["jpeg_quality"].values) - - _convert_dataset( - split_name="train", - filenames=training_filenames, - tps=tps, - Qs=Qs, - classids=training_classids, - output_dir=process_train_dir, - NUM_SHARDS=N_shards, - ) - - print("Finished converting dataset") - print( - f"The converted data is stored in the directory: {process_train_dir}") - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("--slides_folder", help="Set slides folder") - parser.add_argument("--output_folder", help="Set output folder") - parser.add_argument("--clinical_file_path", help="Set clinical file path") - parser.add_argument("--N_shards", help="Number of shards", default=320) - args = parser.parse_args() - execute_preprocessing( - slides_folder=args.slides_folder, - output_folder=args.output_folder, - clinical_file_path=args.clinical_file_path, - N_shards=args.N_shards, - ) diff --git a/Python/2_train_multitask_models/checks.ipynb b/Python/2_train_multitask_models/checks.ipynb deleted file mode 100755 index d2af40b..0000000 --- a/Python/2_train_multitask_models/checks.ipynb +++ /dev/null @@ -1,841 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [], - "source": [ - "\"\"\" Compute and combine cell type abundances from different quantification methods necessary for TF learning\n", - "Args: \n", - " clinical_file (str): path pointing to a txt or tsv file\n", - " path_published_data (str): path pointing to a folder containing published computed features\n", - " path_computed_features (str): path pointing to a folder containing the files generated with immunedeconv\n", - " data_path (str): path pointing to a folder where the dataframe containing all features should be stored, stored as .txt file\n", - "\n", - "Returns: \n", - " ./task_selection_names.pkl: pickle 
file containing variable names. \n", - " {data_path}/TCGA_{cancer_type}_ensembled_selected_tasks.csv\" containing the following cell type quantification methods: \n", - " tumor_purity = [\n", - " 'tumor purity (ABSOLUTE)',\n", - " 'tumor purity (ESTIMATE)',\n", - " 'tumor purity (EPIC)'\n", - " ]\n", - "\n", - " T_cells = [\n", - " 'CD8 T cells (Thorsson)', \n", - " 'Cytotoxic cells',\n", - " 'Effector cells',\n", - " 'CD8 T cells (quanTIseq)', \n", - " 'TIL score',\n", - " 'Immune score', \n", - " ]\n", - "\n", - " endothelial_cells = [\n", - " 'Endothelial cells (xCell)',\n", - " 'Endothelial cells (EPIC)', \n", - " 'Endothelium', ]\n", - "\n", - " CAFs = [\n", - " 'Stromal score',\n", - " 'CAFs (MCP counter)', \n", - " 'CAFs (EPIC)',\n", - " 'CAFs (Bagaev)',\n", - " ]\n", - "\n", - "\"\"\"\n", - "\n", - "import os\n", - "import sys\n", - "\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "import seaborn as sns\n", - "\n", - "sys.path.append(f\"{os.path.dirname(os.getcwd())}/libs\")\n", - "import joblib\n", - "import model.preprocessing as preprocessing" - ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [], - "source": [ - "# Final feature selection\n", - "tumor_purity = [\n", - " 'tumor purity (ABSOLUTE)',\n", - " 'tumor purity (ESTIMATE)',\n", - " 'tumor purity (EPIC)'\n", - "]\n", - "\n", - "T_cells = [\n", - " 'CD8 T cells (Thorsson)', \n", - " 'Cytotoxic cells',\n", - " 'Effector cells',\n", - " 'CD8 T cells (quanTIseq)', \n", - " 'TIL score',\n", - " 'Immune score', \n", - "]\n", - "\n", - "endothelial_cells = [\n", - " 'Endothelial cells (xCell)',\n", - " 'Endothelial cells (EPIC)', \n", - " 'Endothelium', ]\n", - "\n", - "CAFs = [\n", - " 'Stromal score',\n", - " 'CAFs (MCP counter)', \n", - " 'CAFs (EPIC)',\n", - " 'CAFs (Bagaev)',\n", - "]\n", - "\n", - "IDs = ['slide_submitter_id', 'sample_submitter_id',\n", - " ]\n", - "\n", - "tile_vars = ['Section', 'Coord_X', 'Coord_Y', \"tile_ID\"]\n", - "var_dict = {\n", - " \"CAFs\": CAFs, \n", - " \"T_cells\": T_cells,\n", - " \"tumor_purity\": tumor_purity,\n", - " \"endothelial_cells\": endothelial_cells,\n", - " \"IDs\":IDs,\n", - " \"tile_IDs\": tile_vars\n", - "}\n" - ] - }, - { - "cell_type": "code", - "execution_count": 63, - "metadata": {}, - "outputs": [], - "source": [ - "ensembled_tasks = pd.read_csv(\"/Users/joankant/Library/CloudStorage/OneDrive-TUEindhoven/spotlight/data/TCGA_FF_SKCM_ensembled_selected_tasks.csv\", sep=\"\\t\")" - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Unnamed: 0slide_submitter_idsample_submitter_idStromal scoreCAFs (MCP counter)CAFs (EPIC)CAFs (Bagaev)Endothelial cells (xCell)Endothelial cells (EPIC)EndotheliumCD8 T cells (Thorsson)Cytotoxic cellsEffector cellsCD8 T cells (quanTIseq)TIL scoreImmune scoretumor purity (ABSOLUTE)tumor purity (ESTIMATE)tumor purity (EPIC)
00TCGA-D3-A8GP-06A-01-TSATCGA-D3-A8GP-06A-1344.9512.365915-0.558672-1.405636-9.9657840.398655-0.791017NaN-0.792303-0.322414-1.050480NaN-701.880.960.9539876.554462
11TCGA-D3-A8GP-06A-02-TSBTCGA-D3-A8GP-06A-1344.9512.365915-0.558672-1.405636-9.9657840.398655-0.791017NaN-0.792303-0.322414-1.050480NaN-701.880.960.9539876.554462
22TCGA-RP-A693-06A-01-TSATCGA-RP-A693-06A-111.9614.0268811.4724940.0342851.3815971.4864560.980811NaN0.3173240.6944571.946440NaN1021.120.320.7395196.392896
33TCGA-EE-A3JD-06A-01-TSATCGA-EE-A3JD-06A171.9513.0457820.177346-0.9535390.1729581.3463541.078788NaN0.6439991.8770633.106620NaN2555.570.180.5357876.133479
44TCGA-FS-A4FC-06A-01-TS1TCGA-FS-A4FC-06A-354.1914.8088032.4319911.194521-9.9657841.3193560.237883NaN-0.225290-0.333677-1.005219NaN-60.200.590.8555636.446450
............................................................
378378TCGA-ER-A194-01A-01-TSATCGA-ER-A194-01A-45.3213.4929521.2109050.4335633.5159781.7541012.1887830.95060.4215371.1539370.8964930.3931117.540.600.7231976.495422
379379TCGA-EE-A2MH-06A-01-TSATCGA-EE-A2MH-06A-466.2512.613355-0.395582-0.8359962.2859151.0993731.341252NaN0.3274901.2654091.950049NaN1472.690.430.7298316.422688
380380TCGA-EE-A3AF-06A-01-TSATCGA-EE-A3AF-06A-225.5413.9566271.8711600.340345-9.9657840.202723-0.229261NaN-0.0386720.3622250.714307NaN862.830.780.7657866.490166
381381TCGA-EE-A2MU-06A-02-TSBTCGA-EE-A2MU-06A375.1115.9270674.2094111.266537-9.965784-2.4617480.000569NaN0.5964991.6613152.790765NaN1806.580.290.6016486.178854
382382TCGA-EE-A2MD-06A-01-TSATCGA-EE-A2MD-06A-734.3213.6891842.073051-0.727678-9.965784-0.753840-0.625993NaN-0.330836-0.172356-0.605996NaN98.210.630.8719566.351655
\n", - "

383 rows × 19 columns

\n", - "
" - ], - "text/plain": [ - " Unnamed: 0 slide_submitter_id sample_submitter_id Stromal score \\\n", - "0 0 TCGA-D3-A8GP-06A-01-TSA TCGA-D3-A8GP-06A -1344.95 \n", - "1 1 TCGA-D3-A8GP-06A-02-TSB TCGA-D3-A8GP-06A -1344.95 \n", - "2 2 TCGA-RP-A693-06A-01-TSA TCGA-RP-A693-06A -111.96 \n", - "3 3 TCGA-EE-A3JD-06A-01-TSA TCGA-EE-A3JD-06A 171.95 \n", - "4 4 TCGA-FS-A4FC-06A-01-TS1 TCGA-FS-A4FC-06A -354.19 \n", - ".. ... ... ... ... \n", - "378 378 TCGA-ER-A194-01A-01-TSA TCGA-ER-A194-01A -45.32 \n", - "379 379 TCGA-EE-A2MH-06A-01-TSA TCGA-EE-A2MH-06A -466.25 \n", - "380 380 TCGA-EE-A3AF-06A-01-TSA TCGA-EE-A3AF-06A -225.54 \n", - "381 381 TCGA-EE-A2MU-06A-02-TSB TCGA-EE-A2MU-06A 375.11 \n", - "382 382 TCGA-EE-A2MD-06A-01-TSA TCGA-EE-A2MD-06A -734.32 \n", - "\n", - " CAFs (MCP counter) CAFs (EPIC) CAFs (Bagaev) \\\n", - "0 12.365915 -0.558672 -1.405636 \n", - "1 12.365915 -0.558672 -1.405636 \n", - "2 14.026881 1.472494 0.034285 \n", - "3 13.045782 0.177346 -0.953539 \n", - "4 14.808803 2.431991 1.194521 \n", - ".. ... ... ... \n", - "378 13.492952 1.210905 0.433563 \n", - "379 12.613355 -0.395582 -0.835996 \n", - "380 13.956627 1.871160 0.340345 \n", - "381 15.927067 4.209411 1.266537 \n", - "382 13.689184 2.073051 -0.727678 \n", - "\n", - " Endothelial cells (xCell) Endothelial cells (EPIC) Endothelium \\\n", - "0 -9.965784 0.398655 -0.791017 \n", - "1 -9.965784 0.398655 -0.791017 \n", - "2 1.381597 1.486456 0.980811 \n", - "3 0.172958 1.346354 1.078788 \n", - "4 -9.965784 1.319356 0.237883 \n", - ".. ... ... ... \n", - "378 3.515978 1.754101 2.188783 \n", - "379 2.285915 1.099373 1.341252 \n", - "380 -9.965784 0.202723 -0.229261 \n", - "381 -9.965784 -2.461748 0.000569 \n", - "382 -9.965784 -0.753840 -0.625993 \n", - "\n", - " CD8 T cells (Thorsson) Cytotoxic cells Effector cells \\\n", - "0 NaN -0.792303 -0.322414 \n", - "1 NaN -0.792303 -0.322414 \n", - "2 NaN 0.317324 0.694457 \n", - "3 NaN 0.643999 1.877063 \n", - "4 NaN -0.225290 -0.333677 \n", - ".. ... ... ... \n", - "378 0.9506 0.421537 1.153937 \n", - "379 NaN 0.327490 1.265409 \n", - "380 NaN -0.038672 0.362225 \n", - "381 NaN 0.596499 1.661315 \n", - "382 NaN -0.330836 -0.172356 \n", - "\n", - " CD8 T cells (quanTIseq) TIL score Immune score \\\n", - "0 -1.050480 NaN -701.88 \n", - "1 -1.050480 NaN -701.88 \n", - "2 1.946440 NaN 1021.12 \n", - "3 3.106620 NaN 2555.57 \n", - "4 -1.005219 NaN -60.20 \n", - ".. ... ... ... \n", - "378 0.896493 0.393 1117.54 \n", - "379 1.950049 NaN 1472.69 \n", - "380 0.714307 NaN 862.83 \n", - "381 2.790765 NaN 1806.58 \n", - "382 -0.605996 NaN 98.21 \n", - "\n", - " tumor purity (ABSOLUTE) tumor purity (ESTIMATE) tumor purity (EPIC) \n", - "0 0.96 0.953987 6.554462 \n", - "1 0.96 0.953987 6.554462 \n", - "2 0.32 0.739519 6.392896 \n", - "3 0.18 0.535787 6.133479 \n", - "4 0.59 0.855563 6.446450 \n", - ".. ... ... ... 
\n", - "378 0.60 0.723197 6.495422 \n", - "379 0.43 0.729831 6.422688 \n", - "380 0.78 0.765786 6.490166 \n", - "381 0.29 0.601648 6.178854 \n", - "382 0.63 0.871956 6.351655 \n", - "\n", - "[383 rows x 19 columns]" - ] - }, - "execution_count": 64, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "ensembled_tasks" - ] - }, - { - "cell_type": "code", - "execution_count": 65, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 65, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAd8AAAEFCAYAAACipe0RAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAABRzUlEQVR4nO2dd7wdVbn+v08SeheQagy9Q+jSi4jYKAoG8KpgQZQici38hEtTFIVrAxEQKSpSBQVEikDoJUgaoRMiIIpwgUAgtOT5/bHWTiaTvfeZfc4+5+xz8n75rE9m1rxrrXdmH2bNas+SbYIgCIIg6DuG9LcDQRAEQTCvEZVvEARBEPQxUfkGQRAEQR8TlW8QBEEQ9DFR+QZBEARBHxOVbxAEQRD0MVH5BkEQBIMeSedK+o+kBxtcl6RfSHpC0gRJm/SmP1H5BkEQBPMC5wO7Nbn+EWCNHA4CftWbzkTlGwRBEAx6bN8GvNTEZA/gt07cAywpaYXe8mdYb2UcDFYeC0m0IAgqsqZ6knqh4ftVft+8+czFXyG1WGucbfvsFopbCXimcP5sjvtXC3lUJirfIAiCoCORqnfO5oq2lcp2ruLqZduD/JoS3c5dIGlEeYBe0vGSvtlfPlWhqo+S/l+eYPCopA/3hW9BEARVEEMqhzbwLPC+wvnKwHPtyLgeUfnOw0haF9gXWI80EeEMSUP716sgCIKENKRyaANXAZ/Ls54/AEy13StdzhCVb4+RNFrSjyTdJ+kxSdvl+AMkXSHpOkmPS/pxIc2vJN0vaZKkEwrxUyT9QNLd+fomkq6X9KSkgwt235I0Jk+HL6Y/Ordg/wasVcH9PYCLbb9l+yngCWCLNjyWIAiCHjNkyNDKoSskXQTcDawl6VlJX5R0cOHdei0wmfQe/DXwtd66L4gx33YxzPYWkj4KHAfskuNHAhsDbwGPSjrN9jPA0bZfyq3MmyRtaHtCTvOM7a0k/ZQ0NX4bYEFgEnCmpF1JU+G3II1RXCVpe+B1Uit2Y9Lv+gDwd4DaH5ftM0t+rwTcUzivTTAIgiDoANrXPrS9XxfXDRzStgK7IFq+XdNowL0Yf0X+9+/AiEL8Tban2n4TeAh4f47/tKQHgLGkLt91C2muyv9OBO61/ZrtF4A3JS0J7JrDWFIFuzapMt4OuNL2G7ZfLeSD7TPrVLxQcYKBpINyS/z+s8++pE6SIAiC9tPH3c59SrR8u+b/gKVKce8Bniqcv5X/ncGcz/StwvEMYJikVYBvApvbflnS+aSWbTnNzFL6mTlvAT+0fVbRIUlH0PrMvEoTDOacRRhLjYIg6BsGYqValcF7Z23C9jTgX5I+CCDpPaTJSXd0M8vFSV3EUyUtR1JVaYXrgS9IWjT7s5Kk9wK3AXtJWkjSYsAnKuR1FbCvpAXyR8EawH0t+hMEQdAr9PFs5z4lWr7V+BzwS0n/m89PsP1kdzKyPV7SWNIY7mTgzhbT3yBpHeBuSQDTgP+y/YCkS4BxwD+A22tpGo352p4k6VJSl/i7wCG2Z3TnvoIgCNrNYG75Ko0xB0FVots5CIKq9Ezhapm1jqj8vnnx0Z/1qKy+Jlq+QRAEQUeiunNCBwdR+QZBEAQdyWDudo7KNwiCIOhIBnPlO6DvLKtLbZaPr83rYMs2faLDLGlaD9Pflf8dIWn/QvzILN7RYyStKOnyBtdmPcsgCIJOYDCv8x14HjfA9kdtv9LffnQX21vnwxHA/oVLI4GWKl9JdXs0bD9ne+/u+BcEQdD3DGkhDCwGjMeSFpH0F0njJT0oaVTp+hRJy+TjuhrHklbLWst/l3S7pLWblLdPLme8pNty3AGSTi/YXCNpx8L5/0p6QNJNkpbNcaMl/VTSbZIelrR51nx+XNL3C2lrLeeTge0kjZP0HeBEYFQ+H5Wfw7lZ23mspD0Kvl0m6Wrghgb3NGuHprwe+OKsD30JsFCXP0IQBEEfMmTIsMphoDFgKl+SsMVztjeyvT5wXT0jSZsyW+P4k8DmhctnA4fZ3pSkMnVGk/KOBT5seyNg9wr+LQI8YHsT4FaSxnONt21vD5wJ/JmkH7o+cICkpUv5HAXcbnuk7R9lPy7J55cARwM3294c2Ak4RdIiOe1WwOdt71zB368Cb9jeEDgJ2LSRYchLBkHQH4TIRmcwEThV0o+Aa2zfnkUmyszSOAaQdFX+d1Fga+CyQroFmpR3J3B+FqG4ooldjZlArWb6fSlNUa95Um2bKkmTSfKO/1ch/xq7ArsXxrEXBIbn4xttv1Qxn+2BXwDYniBpQiPDkJcMgqA/GIhjuVUZMJWv7cdyq/ajwA8l1e1arZnXiRsCvGJ7ZMXyDpa0JfAxYJykkSQVqOJfw4L10tbxoSu95lYQ8Cnbj84RmXx9vcW8oiINgqBjadDAGhQMmM8KSSuSukl/D5wKbNLAtK7Gcd7p5ylJ++T8JGmjJuWtZvte28cCL5JaqFOAkZKGSHofc+59OwSoTWban+5rP78GLNbk/HrgMOW/Skkbd7Oc24DP5DzWBzbsZj5BEAS9wmCe7TxgWr7ABqTxzZnAO6Qxy1PLRs00jkmVza8kHQPMB1wMjG9Q3imS1iC1NG8q2D1F6j5+kLSlX43XgfUk/R2YCswxIawFJgDvShpP2s/3AuAoSeOAHwLfA34GTMgV8BTg490o51fAebm7eRyxoUIQBB3GkPoLNwYFoe0ctEiM+QZBUJWeaTuvsvGPK79vnhr77QHVRz14PyuCIAiCAc1AnMVclXm+8pV0NLBPKfoy2yf1hz/tQNIGwO9K0W/Z3rI//AmCIOgWA3AstyrzROUraTTwTdv3l6/lSrbHFW0W2/im7e6Mv5KlHT9n+/Cc19u2a5KTewKP2X6oSl62J5KUseqVszuwru2T61ybZnvR7vgfBEHQbgbiRKqqzBOV70AgfxjUPg52BKYBd+XzPYFrSJveV0LSMN
vv1innKmavOw6CIOhYYqlRL1BPLlLSsVk28UFJZxeW03Qp0ZilEx+RdEGWTLxc0sJ1yt1V0t1ZBvKyLL7RyMeTJT2U8zs1x50vae+CTXFDhcUlXZnTnKn82SZpmqQfKcla/k3SFvmeJueWKJJ2VJKrHAEcDHxDSVJyB5LC1in5fDU1kMnMvv1E0i3Ajxrc0yyJTEmr5GcxRtL3Kv94QRAEfcAQDascBhr92aavJxd5uu3N8/lCzLmEpopE41rA2Vky8VXga8UClbSfjwF2yTKQ9wNH1nNO0nuAvYD1cn7fr2dXYgvgv0nLolYjyVtCkp4cnWUtX8t5fSjnf2IxA9tT8j3+NEtK3kpqqX4rnz9Jc5nMNfP9/XcFf38O/CpLVf67gn0QBEHfIVUPA4z+rHwnArvkFuF2tqcCO0m6V9JEYGdgvYL9XBKNtt8CahKNAM/YvjMf/x7YtlTmB4B1gTvzutnPA+9v4N+rwJvAOZI+CbxR4Z7usz3Z9gzgokL5bzNbi3oicKvtd/LxiAr5zkJzymSOA84CViiYXJbLr8I22U+Ye4JWsczQdg6CoO8ZvJsa9d+YbwO5yEOAzWw/I+l45pRvrCLRWF4TVj4XSf94vwr+vStpC+CDpI0aDiV9EMySmMzd4vM3Ka92/o5nL6ie5b/tmWqw/V8TupLJbLvEZGg7B0HQLwzAFm1V+nPMt5Fc5Iu5ddedfWeHS9oqH+/H3BKP9wDbSFo9+7CwpDUb+LcosITta4EjmD17eAqzdwDag6SUVWOLPI46hKRw1XaJyVZlMrvgTtKHBWSpySAIgo4hup17hQ2A+3LX6dGkcdBfk7pi/wSM6UaeDwOfz5KJ7yFJKM7C9gvAAcBF2eYeoNGevosB12S7W4Fv5PhfAztIug8ob2ZwN2k/3gdJMpRXduMeAK4m6VOPk7QdSQbzW0r7965Gqii/qCRBOYn0EdAdvg4cImkMsEQ38wiCIOgdBnG386CRl8yzhK/Jk7WCXiO6nYMgqErP5CXX2P6syu+bx2/7yoBq/g68+dlBEATBvMGQAVWftsQAbKzXx/aU7rZ689rccaXw4Xb72JdIOrDOPf2yv/0KgqD3WGj4cf3tQnuJMd9Bz/2kiVO15/EV29dLOqKeUEdvUhPb6Gk+ts/L64KL4ZB2+BgEQdAnqIUwwJjnu53z7OiPA5vYfisLcdSWDx1BWi881xpfSUNbWE/bcaiB/GQQBEHHEN3Og5oVgBezYAe2X7T9nKTDgRWBW7JcY00m8kRJ9wJbSTpSSQrzQUlHZJuazOU5Of5CSbtIulNJDnOLbLeFpLvyDOa7JK3VzElJ60m6L3cfT5C0Ro7/XD4fL+l3Oe79km7K8TdJGp7j55CfVAOZyiAIgo5giKqHAUZUvnAD8D5Jj0k6Q0lLGdu/AJ4DdrK9U7ZdBHgwb803HTiQtNzoA8CXJW2c7VYnSTduSFrKtD9J7eqbwHezzSPA9rY3Bo4FftCFnwcDP8/iGpsBz0paj7RMa2fbG5GWDgGcDvw2y2JeCPyikE9RfrKZTGUQBEH/EpXv4MX2NJJoxkHAC8Alkg5oYD4D+GM+3ha40vbrOY8rgO3ytadsT7Q9k7QO96ascFWUk1yCJBH5IPBT5pTSrMfdwHclfQd4v+3pJMWty22/mO/lpWy7FfCHfPw75pTZvMz2jAoylbMIeckgCPqFGPMd3OSx29HA6Kwr/Xng/DqmbxbGeZv93GX5y6I0Zu2Zfw+4xfZeeY3y6C58/EPu7v4YcL2kL2UfqqyDK9rUREG6kqkslh3ykkEQ9Dlu8yxmSbuReiWHAueU9zWXtARpns9w0rv6VNvntdWJzDzf8pW0Vm38NDMS+Ec+Lss8FrkN2DNLVC5C2qHo9haKXgL4Zz4+oIKfqwKTc3f4VaQu7ZuATyvv6qS0ExOkfYCLspFzyVy2WaYyCIKg/bSx21nSUOCXwEdIG+zsJ2ndktkhwEN5GG9H4H8lzU8vMM9XvsCiwAXK+/aSfpTj87Wzgb/WJlwVsf0AqXV8H3Av6StqbAvl/pi0ocSdpK+wrhgFPJi7iNcmjelOAk4Cbs1Skz/JtocDB+b7+Syzx4LLtEumMgiCoP20t9t5C+CJvPPc2yTZ3vI7z8BikkSqG14ibabTdgaNvGTQV0S3cxB0KgsNP47pT5/Q324U6Jm85Op7/rby++bJP3/+K6S5OzXOzkNmAEjaG9jN9pfy+WeBLW0fWrBZjNSzuDap13OU7b/05B4aEWO+QRAEQWfSQtU959yUyrmVK/cPA+NIk1lXA26UdHsepmsr0e0cBEHQgEEn1zjQaK+85LPA+wrnK5OWkxY5ELjCiSdIu9P1iv5BVL5BEARBZ9LeyncMsIbSnuvzkyalXlWyeRr4YCpaywFrAZPbeEez6KjKV9Lyki6W9GSeAHWtCpvdS/qGpDfzdPBa3I6SphY2D/hbC+WtUNNRzvlY0hcL1zfOcd8sxH0zK1g9mFWlPpfjR0t6NMfd2ZViVW+Q72HrbqTbQNL5veBSEARB92njfr5ZTvdQ4HrS3u+X2p4k6WBJB2ez7wFb5yWnNwHfqekotJuOGfPNs8uuBC6wvW+OGwksBzyWzfYjfb3sxZzrcG+3/fFuFHsk8OvC+UTSrOLf5PN9gfEFHw8GPgRsYfvV/BGwZyH9Z2zfL+kg4BRg92741BN2BKaRlhpVQknjeaKklSUNt/10r3kXBEHQCm1WrrJ9LXBtKe7MwvFzwK5tLbQBndTy3Ql4p/Qgxtm+HUDSaqSp38eQKuGmSNqn0Dq9rYHZp4DrCudPAwtKWi5/DOwG/LVw/bvA12qD77an2r6gTr63kSQmyz6tLulv2acHlLSVJemU7OtESaOy7Ry7G0k6XVl5S9IUSSfkPCZKWjsLdRwMfCP3AGwnaVlJf5Q0JodtcvrjJZ0t6Qbgt7mIq5m9NjgIgqDf8RBVDgONjmn5AusDf29yfT/gIpKQxVqS3mv7P/nadnn9KyT5xJNIeskftv1PSUuWM5O0CvBybUOFApcD+wBjgQfI6lR5Cvpitp+scC+fILWiy1wInGz7SkkLkj5+PkkS9tgIWAYY0+RjociLtjeR9DXgm7a/JOlMYJrtU7PPfwB+avsOpc0VrgfWyek3BbbNMpWQtlU8irT+OAiCoP8ZgPv0VqWTWr5dsS9wcdZLvoJUQda4vbBn7Uk57k7gfElfpr6IxQokLecyl+a8a5V9jSpSjhfmj4BtSBsVzE6cKu+VbF8JYPtN22+QdJcvsj3D9vPArcDmXZQD6RlA+mAZ0cBmF+D07NNVwOLZD4CrChUvwH9IuzjNhULbOQiC/iC0nfuEScDe9S5I2hBYg7TmCtJ+u5NJUmF1sX2wpC1JWsjjJI20/X8Fk+nAgnXS/VvSO6Sx3a+TNh8gj/G+LmlV241mv33G9v0NrjX682gU/y5zfhyVfa212GfQ+HccAmxVqmTJz/D1ku2CpGcyF6HtHARBvzAAu5Or0kkt35uBBXJLFQBJmytt8bcfcLztETmsCKwk6f2NMpO0mu17bR8LvMic67sgT
eIa0SD5saRZbjNK8T8Efilp8VzG4nlyVZfkceJnJe2Z0y4gaWHS+PAoSUMlLQtsT5Ks/AewbrZbgjz9vQvKWtQ3kGb3kcsc2STtmsCDVe4lCIKgT2jvUqOOomMq37zl3l7Ah5SWGk0iaSw/R+pyvrKU5EqaTxA6JU9GepBUwY0vXrT9OvCkpLkmRtm+y/af6uT5K+AW0rjsg6Qu4jcq3F6NzwKHK2ku3wUsn+9jQvbvZuDbtv9t+xlSF/gE0lhxFd3oq4G9ahOuSBrPm0maIOkh0oSsRuwE9IqMWhAEQbcYquphgDFPaztL2gvY1PYx/e1LfyJpAdKHxLZ5LVwTots5mHfoPK3k5nSevz3Tdl7ti5dV13b+zT4DqgbupDHfPifPOl66v/3oAIYDR3Vd8QZBEPQdHlDVaWvM05UvgO1z+tuH/sb248Dj/e1HEATBHAziCVfzfOUbBEEQdCgDcCJVVTpmwtVAQf2vP13MZ5ykXfK1Gfn8QUmX5ZnUSJpWyGvN7O8Tkh6WdKmSmldoOwdB0HkMUfUwwIiWbwtIHaE/3Sif6bZHZp8uJM1s/knB9wVJs5mPtH11jtsJWDa0nYMg6EgG4CzmqkTLtzU6QX+6Crczt7b0/sDdtYo3+36L7dra3tB2DoKgsxjELd+ofFujZf3pwrXtCl3FR+e4mv70RtTZAamB/nQxn3G5wi+mGQZ8hLm1pbvy/X5gu3oXQl4yCIL+wFLlMNCIbuf2si+wl+2Zkmr60zUJzHrdxTX96UuZrdVcpJ7+dKNu54UKm0vczuxtEavSUNs55CWDIOgXBnHzMCrf1ugI/ekGzBrzbeL7Dk2uN9R2DoIg6BcGYHdyVQbxd0Wv0En6063yB2BrSR8rlL+bpA3yaWg7B0HQWQwdUj0MMAaex/1Ih+hPl8d867bE6/g+Hfg4cJikx7PW8wGk7mYIbecgCDqN2FIwqGH7OeDTdS6tUsf2yMLp6DrXP1mhyNNJleQxtkcDS9Qzsr1oV/G2HwF2K9tkbefNgCMq+BMEQdAneBB3O0fl2+H0kf50aDsHQdB5ROUb9Ce9rT8d2s5BEHQkA3AJUVXmyTHfDpKInCDpb6X1wP1Cfh5r9LcfQRAEsxjSQhhgDECXe0ZBInK07dVsrwt8lyQRWaMoEVnkdtsjc9ilhWLrSUSOtL1hLueQlm+k/fwK+HZ/OxEEQTCLmO08qOgYicj8IbAY8HI+30LSXZLG5n/XyvEL500QJki6RNK9kjbL136V1acmSTqhkPemkm6V9HdJ1+fW9zqS7ivYjJA0IZ/eDuySFbKCIAj6n0EsLzkvvmhbloi0XVuOs11BReoy2ycxWyLyn5KWLGfWTCISWBp4ndTyBngE2N72u0q7Ff2AVHF/LeexoaT1gXGFvI62/ZKkocBNWezjYeA0YA/bL0gaBZxk+wuS5pe0qu3JwCjgUoCsyvUEsFEXzycIgqBPGIiykVWZF1u+XbEvcLHtmSTJx30K14rdzifluJpE5JeBoXXyayQROdL2+4DzgB/n+CWAy/K6358C6+X4bYGLAfJGCBMKeX1a0gPA2Gy/LrAW6SPjxlzJHwOsnO0vZfZSqVFAUay5rsRkaDsHQdAvDOIx33mx5dtpEpFXAX/Mx98DbrG9l6QRzF4bXPfzL7eqvwlsbvtlpT15F8z2k2xvVSfZJaQK/orkvouznOtKTIa2cxAE/UK0fAcVnSYRuS3wZD5eAvhnPj6gYHMHubUqaV2gJgm5OKnbeqqk5Ui7GQE8CiwraaucZj5J6wHYfhKYAfwPc7Z6IUlMTmriaxAEQd8RY76DB9uWtBfwM0lHAW8CU0jqTvsyuwKrUZOIvLdBlqfkJToCbqKORGRe0rS67SdydG3MV8BU4Es5/sfABZKOJH0k1Dgjx08gdS9PAKbaflzSWFKFOZnUBY7tt7Ps5C/ycqlhwM+YXbFeApxCQZUrV97Tbf+rwX0GQRD0LUMHXqVaFSW54qA3yZX9praP6Wb6ocB8tt/Ms7FvAta0/XYbffwG8KrtLrYijG7nYN5hoeHHMf3pE7o27BA6z981e1R7Dv/JLZXfN08fuVOf1tSSlgW+Q5pnM2to0fbOVdLPcy3f/qANEpELA7dImo/UWv5qOyvezCvA79qcZxAEQffp7DHfC0m9iB8DDgY+z9yTaxsSlW8f0ROJSNuvkTY+6DVsn9eb+QdBELRMZ4/lLm37N5K+bvtW4FZJt1ZNHJVvEARB0Jl0dN3LO/nffyntk/4cs5d0dkmPZztLmqE595c9qsX0UyQt0+T6kpK+VjjfsaaT3EIZ5+cJSEg6J88YbmY/uqYg1V2yetSD+bglnyUtlNWp6q0bLtptIek2SY9KeiTf28JN7Gf5IekASafn40MlHVjVvyAIgr5g6NDqoR/4fp7Q+t+kJZ/n0MK2rO1o+U63PbIN+TRiSZLC0xntyMz2l7q26ne+AFxhe0Yjgzw7+TJgX9t3Ky1M/hRJrvKNFss7lzRTOrqegyDoGNo95CtpN+DnJEGkc2yfXMdmR9LqkPmAF23v0CC7l21PJa1Y2Smn3aaqL722zje3aE+Q9ICkiZLWzvFLS7pBSb/4LAodC5KOVNJJflDSETn6ZGC13Ko+JcctKuny3Nq7MFc8dfWM6/g1q1WrBrrITe5pcyXN5fGS7pO0mKShkk6RNEZJe/krXeSxQ6GXYKykxeqYfQb4c7bfS2nnIynpMz8maXnSZgwX2L4b0hIq25fbfl7SIpLOzT6NlbRHM59svwFMkbRFV88gCIKgr5BUOVTIayhJMOkjpBnK+5V7QZUkgs8Adre9HnMqHJY5rWJcXdpR+S5U6nYeVbj2ou1NSDvmfDPHHQfcYXtjkrrTcEgVJ3AgsCXwAeDLkjYGjgKezHKM38p5bExq3q8LrApsozQT+DRgb9ubklpzNQnIRhxtezNgQ2AHJYWrukianzSz7eu2NwJ2IalBfZG05nZzYPPs9yqN8snP4ZDcW7AdJUWpXM6qtqdAmikN/JtU2f4aOM72v2muUX00cHP2aSfSWuRFmvgEcH/2Zy4U8pJBEPQDUvVQgS2AJ2xPzqtFLgbKDZP9Sb2OTwMUdP0LPmkrSf9NEjI6shCOp77EcF16u9v5ivzv34FP5uPta8e2/yLp5Ry/LXCl7dcBlOQPtyNV0GXus/1sthtHUpB6hdl6xpAeQleCEZ+WdBDpOaxAqswnNLBdC/iX7THZ91dz+bsCGyqPKZNUqtYgKVvV407gJ5IuJP3Iz5auL5PvpchhwIPAPbYv6uKeAHYFdpdU++BZkPyR04T/AGvXuxDykkEQ9AetdDvnd/lBhaiz87urxkrAM4XzZ0mNvSJrAvNJGk0awvu57d+WbOYn7Xw3LNvUeJUG0sX16O3ZzrWdfGaUyqr3Am+ld7+4Q1At72Z6xnMX1lgXuWESGvt9mO3rS/mPqJeJ7ZMl/QX4KHCPpF1sP1IwqacFvRIwE1hO0pC86cMkYFNy93Qdnz5l+9GST8vVsa1RV9c5CIKgv1ALfbNzNhLqZ1cvWel8GOm9+kFgIeBuSffY
ntWYKiwrOt/2P6p7OCf9oe18G2lME0kfAZYqxO+ptHftIqSN7G8HXmPOr4tGNNQzbkAjXeRGPAKsKGnznP9iSnvfXg98NXd7I2nNZl28SlrQE23/iNTVO0dr0/bLwFBJC2b7YaSJUPuTtgo8MpueDnxeaVOHWt7/lceDrwcOK4yFb9zFvUH64nuwgl0QBEGfMHRI9VCBZ5lTe39l0vKgss11tl+3/SKpXtqoQX5v5Pk+10q6uRaq3ltvjPnONXusxAnA9krb4O0K1PrWHwDOB+4j6SifY3ts3iHoTqVJWKc0yJPch7838CNJ40l73m7dxH48SSd5ErNn+zYk5z8KOC3nfyOptXgO8BDwgNLSorNo3qNwRL6X8aSW5l/r2NxA6oaHtNfv7bZvJ1W8X5K0ju3nSZrTpyotNXqY1E3/Kml3pPmACdmn7zW7t8w2wN8q2AVBEPQJbR7zHQOsIWmVPLdmX+Ye1vwzSXt/mNKyzS1JjZ56XEhqlK1Cqtem5DKq3VtoO3ceuaV6pO3Pdl55MeYbzDt0nlZyczrP355pO6933m2V3zeTDty+y7IkfZS0jGgocK7tkyQdDGD7zGzzLdLk35mkRuDPGuT1d9ubSppge8Mcd2uTpUlzEApXHYjtsZJukTS02VrfNrIMaYvBIAiCjqHKEqJWsH0tcG0p7szS+SmkXd+6okcKV1H5dii2z+3Dsm7sq7KCIAiq0sqEq36gqHB1Gmke0TeqJm7rrSmkJhvl0RapyZzP9NIz/ly2m6IkZjJeScRk+UL8Mvl4eUkXK+0v/FCeKLCmpGUlXdeTewyCIGg3bR7zbSu2r7E91faDtneyvantektj69Lulm9ITbafWVKTuQvmySbPeCfbL0r6AWmi1uG1C3nm85UkVax9c9xIYDnbj0n6l6RtbDedeBYEQdBXDOnQlq+knYBDmb1a5WHgdNujq+bRJ7emkJpslEdLUpMtcBuweiluJ+Cd4viG7XF5FjXAn3JZQRAEHcEQVQ99RR7fPRe4hrQE9DOkceRz84SuSrS75buQkuJUjR/arukRvmh7k9xt/E3gS8yWmjwx39BBMJfUpIB7lfZJPApYv9byUxLA3hhYjzTYfSdJavJeUh/8HrZfUJK8PInUimzE0bZfUtL/vEnShrbrql1pttTkKNtjJC1OSWpS0gKkJVI3UF+cA2ZLTd4paVHgzTrlzJKazKxWesaHFSrQGh8HJpbimslRQlpz/P0m14MgCPqU/uhOrsC3gD3zctUa4yTdT6p3rq2fbE76sts5pCbnpjtSk826nW+RNCP7fUwDm0b8B1ix3gUVZNvOOutEDjpoVD2zIAiCttKhle/ypYoXANsT1FxFcA76crZzSE2W6KbUZDN2yqos9ZhEc93RhvKSoe0cBEF/oL7sT67O6928Ngf9PZwdUpMtSE32kJuBBSR9uVD+5pJqC8JDXjIIgo5iyJDqoQ9ZTdJVdcLVpF32KtHbY77X2W623OgE4CIlqclbKUhN5tbnfdnuHNtjASTdqbRs56/AX+plavvt3PX7C6V1WMNIqiaTGtiPl1STmpxMBanJPI58mqSFSC3GXUhSkyNIUpMCXgD2bJLVEXnW3AySRGUzqcma9GN5zPdc279o5m/22ZL2An6mtATsTZIc2hHZZCcaPM8gCIL+oEO7nZvtj35q1UxCXrLDUR9JTUq6jTRB7eXmltHtHMw7dJ5cY3M6z9+eyUtufcUdld83d31y286sqhsQClcdTl9ITUpaFvhJ1xVvEARB39GhLd+2EJXvAKC3pSZtv0Ba5xsEQdAxdLi8ZI8YxLfWczQI5TKDIAgGCp0sL9lTouXbnJDLDIIg6CeGdOZSIwDy7ObymPRU0sqVs2y/OXeq2UTLtxtoYMhlTivE751nj9dayr/K48iTlSQuz5X0cM0mCIKgE+jwlu9kYBrw6xxeBZ4nLdv8dVeJo+XbnIEsl9mMpYCdgd2Bq4Ftsv9jJI20Pa5J2iAIgj6hw7uTN7a9feH8akm32d5eUt1lrUWi5duc6bZHFsIlhWtFucwR+Xh74PeQ5DKBueQybU/LabdrUOZ9tp+1PRMYl/Nei9lymeNI0pGVN22uw9VOa8wmAs9noY+ZpHXOI8rGkg5S2nTi/rPPvqR8OQiCoFfoxI0VCiwraXjtJB/X5vi83VXiaPl2n46Vy6zjR1khq1bGzFJ5M6nzNxHykkEQ9AcdPOQL8N/AHZKeJL2jVwG+llUNL+gqcbR820unyGUCPC9pHUlDcnlBEAQDiiFy5dDX2L6WtHHOETmsZfsvuYfzZ12lj5ZvcwakXGbmKNJ+k8+QNJsXbWIbBEHQcQzr7JYvwKakobphpB3tsP3bKglDXjJokeh2DuYdOk+usTmd52/P5CU/cePtld83V39ouz6tqiX9DliNNDenpj5o24dXSR8t3yAIgqAj6fAx382Add3NFmxUvkFLLDT8uFnH058+Ya4v7eL1MjW7YpqafS2vRulqacrlN8qj7Fuj9MX4eunLtvX8quJPveuN4srXanT1fBpRxfdiXL1n1yxNzabZs+6unz1N21W6KrR6H/1B0afe8K3Z/xuN7NL5RT0qt8MnJT0ILA/8qzuJo/INgiAIOpIOb/kuAzwk6T4Kq0Zs714lca9VvpKWJ00K2pzk2BTSjLC3gYdJG9IvSJrx+0vbF+R0S5DWyg7P/p1q+7xS3vcCCwDvARYC/pkv7Wl7Sjf9HQFcY3v9LHbxTdsfr5h2IeA64HBmTzEfTpIamwq8CHy/lTz7GkkXA/9j+/H+9iUIggBg6JCOnmJyfE8S90rlmyURrwQusL1vjhsJLEeaffuk7Y1z/KrAFZKG5Er2EOAh25/IW909KulC27MWLdveMqc9ANjM9qG9cR8t8AXgCtvjgZGQZBxJlfnl+XzHnhTQm1sKZn4FfBv4ci+WEQRBUJlO7na2fWtP0vfWve0EvGP7zFqE7XG2by8b2p4MHElqNUISh1gsV+CLAi8B73bHCUmbS7pL0nhJ90laTNJQSadIGiNpgqSvdJHHDoVdjcZKqrcu9zPAnyu41Ei3+YM574lZZ3mBHD9F0rGS7gD2kXS4pIey3xc38k+JU7KO9MQsR1nbNWl0PR9I6453kRRDEUEQdASduM43v4+R9JqkVwvhNUmvVs2nt16065NkF6vyALB2Pj4duIqkbbwYMCpLH7aEpPmBS3L6MZIWB6YDXwSm2t48V3J3SrqB+spUkHSbD7F9p6RFgTl2qsjlrFqxu7uebvP9wPnAB20/Jum3wFdJXfYAb9reNpf1HLCK7bckLdnEv0+SWuAbkcYlxki6rZEPJD3qmZKeyGnm+O0kHUTWqR621GYMW3T1CrcaBEHQMzpxzLf2PrZdRSCpIZ3Sqi8+4g+T1k2tSKpATs8VZ6usBfzL9hgA26/afhfYFfhcFs+4F1iapFLSiDuBn0g6HFgy51FkGeCVij410m1+yvZj2eYCkkZ0jaKY8gTgQkn/xezegHr+bQtcZHuG7edJgh+bN/Ghxn9Iz30ObJ9tezPbm0XFGwRBXzGkhdD
XSFqt0Eu5Y+6ZXLJq+t7yeRJJ+aMqG5MmYUHa/ecKJ54AnmJ2q7gVRGOd5cMKmyWsYvuGRpnYPpm0489CwD3K2wcWmM7c2smNaKTb3IzXC8cfA35JerZ/lzSsgX/N8qznQ40FSfcTBEHQ73T4xgp/BGZIWh34DUnb+Q9VE/dW5XszsICkWZN38vjrDmXDPMv4VNKWeZAkGT+Yry1HahlO7oYPjwArSto857VYHs+8HviqpPly/JpKest1kbRa3vXnR6RNkueofG2/DAyVVLUCrufniPwDAnyW1FIt+zEEeJ/tW0gTo5YkjSHX8+82YFQe316W1JK+r5xnHdakuWRlEARBnzFsiCuHfmBm7mncC/iZ7W8Ac+2z3oheGfO1bUl7AT+TdBRpHHIKaakRpA3kxzJ7qdFpheVE3wPOlzSR1IL7ju0Xu+HD23mi0Wl5KdB0YBfgHFJX6wN5stELwJ5NsjpC0k6kVuJDJA3mMjeQunr/1g0/35R0IHBZ/jgYA5xZx3Qo8HulpVgCfmr7FUnfq+Pf28BWwHhS6//btv9dp9U+i/yhM912txaMB0EQtJtOGRdtwDuS9gM+D3wix81XNXGvzWy1/Rzw6QaXF+oi3a4VyzifNFmp0fUxwAfqXPpuDkWmkiaKYXs0MDofH1bBldNJM7ZnVb62Dyj5MivPfH5o4fgmUtd72f8RheN3SBV82aaRf9/KoZIPwP7AWQ3yCoIg6HP6Y7eiFjgQOBg4yfZTklYh7+dehdhYoU1I+gJpXXNvrsXtNXLr+3d1JpSViI0VgnmHztuooDmd52/PNlb42l23VH7fnLH1Tv02N1rSUqRhwQlV08SazjZh+9z+9qEnlFXEgiAI+ptOXGpUQ9JoYHdSPToOeEHSrbaPrJK+w7vUew9Jy0u6WNKTWbjiWklrNrEfKemjFfLdUdLWPfDr2lamq7eY92hJm+XjKZKW6Y1ygiAI2sEwuXKogqTdJD0q6Yk8H6mR3eaSZijto96IJWy/StJVOM/2pqR5RZWYJyvfPNHqSmC07dVsr0saA16uSbKRQJeVL7Aj0O3K1/ZHbb/S3fRBEASDhXYuNZI0lLRU8yPAusB+ktZtYPcj0sqYZgyTtAJpbtM1Ld7avFn50kT+UtLvJO1Ri88SjLsDJ5KW74yTNErSeyT9SUnq8R5JG+ZlUwcD38h220l6v6Sbst1NkoZLWiJ/fa2Vy7iotiyr2CKV9LmcbrzSxs1zIGlRSecpSUhOkPSpHL+rpLslPSDpMiXlq7pIWkTSX3IZD+YZ4kEQBP1Om0U2tgCesD057xVwMbBHHbvDSGt4/9NFfieSKugnsoriqkDljWnm1THfZvKX5wDfAP6cl/VsTZpKfiyFTRwknQaMtb2npJ2B39oeKelMYJrtU7Pd1fnaBXlS1i9ymkNJS6p+Dixl+9dFJyStBxwNbGP7RUnvqePr/5CkMjfIaZbKFfcxwC62X5f0HdJM7BMb3O9uwHO2P5bzWKKLZxcEQdAntDLmq4IMbuZs22cXzlcibexT41lgy1IeK5HW7e7MbFXAuti+DLiscD4Z+FRVf+fVyrchtm+V9EtJ7yX15f/R9rvSXH8F25IftO2bJS3doOLaKucD8DvgxznNjZL2IXWDbFQn3c7A5bU1zrZfqmOzC7BvwfeXJX2c1KVyZ/Z5fuDuJrc8EThV0o9IuzDNtflF8Y/6rLNO5KCDonEcBEHvoxaWGuWK9uwmJvWq8nIBPyNpS8yo887PPunbtn+cG2BzOWj78DrJ5mJerXwnAc0G0n9H2qloX9J2gfWo8kPWwzBLsWodkvjHe0hfYeX8u8qvno2AG23vV8EX8mYOm5LGs38o6QbbJ5ZsCn/UsdQoCIK+oc2znZ8F3lc4X5m0wUyRzYCLc8W7DPBRSe/a/lPBpiaFfH9PnJlXx3y7kr88n6zGZbsmt/gaaZelGreRKujaXr0v5plvZbu7mN06/QxwRz7+BulH3A84V1nussBNwKclLZ3LqNftfAMwSygjrzW7h7Rb0uo5buEuZnGvCLxh+/ckmc9NGtkGQRD0JW2e7TwGWEPSKkq70e1L2kFvFlnrf0QWOLoc+Fqp4sX21fnfC+qFyvdW1XAw0ZX8pe3nJT0M/KmQ7BbgKKXdkH4IHA+cJ2kC8AZpXBjgauDyPGnrMNI+xedK+hZJyvLAXBl+CdjC9mtK2/0dAxxX8HGSpJOAWyXNAMYCB5Ru5fvALyU9SJKXPMH2FZIOAC5S3nEj5/0Y9dkAOEXSTOAd0naGQRAE/U47W755+PBQ0iSpocC5+T17cL5eT9Z3LiRd1ey67d0r5RMKV3MjaWHSWOgmtqf2tz+dRXQ7B/MOnacY1ZzO87dnClcnjP1b5ffNcRvv0ieSHJJeIE3cuoi0Le0c5dqea2OcesyTLd9mSNoFOBf4SVS8QRAE/cfQ/nagPssDHyINGe4P/IW0f3pLO8JF5VvC9t+A4f3tRxAEwbxOJ26skPX7rwOuy0N7+wGjJZ1o+7TmqWczr064aoqSrNi4Qjgqx28naVKOW0jSKfn8lG6UUd5VqVeRNCKPDdckMFtWZAmCIOhL2qlw1U4kLSDpk6RdjA4BfgFc0Uoe0fKtz3TbI+vEfwY4tbYJgaSvAMvafqsbZXwX+EFVY6W577I9sxtlBUEQDDjm68DmoaQLSEJNfyVNcn2wO/l04K11JpK+RNLwPFZJcvIqYBHgXiW5yWUl/VHSmBy2yenmkoCUdDKwUG5BX5jtjszyjg9KOiLHjZD0sKQzgAeYc41abXnUXUrSkPdJWkzS0NwiH5PL+0oX97VDoYU/VtJizeyDIAj6ig5t+X4WWBP4OnCXpFdzeE3Sq1UziZZvfRbKS4pq/ND2OZK2JalAXQ4gaVqthSzpD8BPbd8haThpOvs61JGAtP1HSYcW0m5K2ph5S9LMuXsl3Qq8DKwFHGj7a0UH8zq1S4BRWVd0cZJgxxdzeZvn8Yg7Jd1AY8GObwKH2L5TSQP6zW4/tSAIgjbSoWO+bWm0RuVbn0bdzs3YBVhXsyXJFs+tyLkkIOuk3Ra40vbrAJKuALYjLQD/h+176qRZC/iX7TE531dz2l2BDTV7K6wlgDVovM73TuAnuQV+he2y0lbISwZB0C908n6+PSUq3/YxBNjK9vRiZB6rrSIT2YjXm6Spl6+Aw2zPsR2W0o5Lc2H7ZEl/IclL3iNpF9uPlGxCXjIIgj6nQ5catYUY820fZanHkQ3il8qH72i2pORtwJ5ZCnIR0q4ac21wUOIRYEVJm+d8F5M0jNTd/dVa3pLWzHnWRdJqtifa/hFJq3TtSncbBEHQywwb4sphoBGVb31qk6Fq4eQKaQ4HNsuTnB4i7esLSQJyqTyRajxpL2FILckJki60/QBJT/o+kmLKObbHNiss70c5Cjgt53sjsCBpS8SHgAfy0qKzaN7DcUTBt+mkGXxBEAT9zlBVDwONkJcMWiS6nYN5h86Ta2xO5/nbM3nJ8x67vvL75sA1PzygquAY8w2CIAg6kphwFQRBEAR9TFS+QRAEQdDHDO3Adb7tok
cTriQtL+liSU9KekjStXl27QhJ07Ni0sNZfenzhXRLSLo6KzNNknRgnbzvzZOdnpb0QmHy04ge+NttfWMlLedbJfXJ7Pdm9y9piqRl2ljWBpLOb1d+QRAE7WCYqoeBRrdbvnn96pXABbb3zXEjgeVIex0+aXvjHL8qcIWkIVkX+RDgIdufkLQs8Gie9ft2LX/bW+a0BwCb2T6U/uULJBGKGX1RWLP7Lwh5tKusiZJWljTc9tNtzTwIgqCbDOZu5560fHcC3rF9Zi3C9jjbc61PtT0ZOJK0HAeSOMRiuQJfFHgJeLc7Tqjv9I0/A/w520vS6bm1/5fc4t87X5vVKpW0maTR+XiL7OfY/O9aOf4ASVdIuk7S45J+3MK9L5LLH5+XC43K8ZvmVvrfJV0vaYVC/HhJd+fnUxQEv5qCElcQBEF/M1SuHAYaPal81wf+3oL9A8wWcDidpHv8HDAR+Hp3duvRbH3jr9veiCTlOIe+MbA58GVJqzTJqqZvPJIk61hWqZofWNX2lBy1F0necQPgy8DWFdx9BNg+9wYcy5w7Go0krdndABgl6X1zJ6/LbsBztjeyvT5pf8n5gNOAvW1vCpwLnJTtzwMOt71VnbzuJ917EARBR9ChGyu0hb4U2Sg+ng8D44AVSRXP6UobA7TKXPrGtt8FdgU+p7Q5wr3A0iR940bU9I0PB5bMeRRZBnilcL49cJHtGbafA26u4OsSwGW5tflTYL3CtZtsT7X9Jkkg4/0V8oP04bKLpB9J2s72VNIzWR+4Md//McDKkpbI93ZrTvu7Ul7/If0ecyHpIEn3S7r/7LMvqehaEARBzxjMlW9PZjtPAvbu0mo2GwMP5+MDgZOdFD6ekPQUqVV8X4s+9JW+8XSSetQcyRr49C6zP2qKab4H3GJ7r+zH6MK14n7AM6j4u9h+TGlHpI8CP1TavehKYFK5dStpySY+13ydXu9CaDsHQdAfDMRKtSo9afneDCwg6cu1iDz+ukPZMFc2p5K6QwGeBj6Yry1Haq1N7oYPfaJvnHciGiqpVpneBuybx5ZXYLZkJMAUYNN8/KlC/BLAP/PxAS3faX2/VwTesP170vPdBHgUWFbSVtlmPknr2X4FmKq0LSKkMewiawLd2hQ6CIKgN5hviCuHgUa3W762LWkv4GeSjiLtAzsFOCKbrCZpLKlF9RpwWp7pDKkVeL6kiaRW6ndsv9gNH97Ok4xOk7QQqeW2C0nfeARJ31jAC8CeTbI6QtJOpFbnQ9TXN76BtPXf30ity51J3b6PAbcW7E4AfiPpu6Qu7xo/Bi6QdCTVuqmrsAFwiqSZwDvAV/Mz2Rv4Re5qHgb8jNRTcSBwrqQ3SB8oRXYC/tImv4IgCHrMYN58ILSdKyJpY+BI25+tc+184Brbl/e5Y90k90ZcY3t9SQuQPiC2rTPeXSK6nYN5h87TSm5O5/nbM23nm5+7tvL7ZucVPzqgOqlD4aoitsdKukXS0L5a69uHDAeO6rriTf9z15j+9Alz/c9evF6mZldMU7Ov5dUoXS1NufxGeZR9a5S+GF8vfdm2nl9V/Kl3vVFc+VqNrp5PI6r4Xoyr9+yapanZNHvW3fWzp2m7SleFVu+jPyj61Bu+Nft/o5FdOr+oR+UOxN2KqhKVbwvYPrdB/AF97EqPycum1s/HjwOP96tDQRAEJYYMwPW7VYnKt5eRtDRwUz5dnjSu/EI+X9P2wsUu4H5wMQiCoCMZNogHfaPy7WVs/x9pLTOSjgem2T41n0/rL78Gafd5EASDiEFc9w7qextUSNonS0iOl3Rbjhsq6VRJE5VkNA/L8R/MMpYTJZ2bJ1TVpC+PlXQHsI+kXbPU5AOSLpO0aD/eYhAEwRxI1cNAIyrfgcOxwIezjObuOe4gYBVgY9sbAhfmtcjnA6Nsb0Dq3fhqIZ83bdeWTB0D7GJ7E9L65iP75E6CIAgqoBbCQCMq34HDnaS10V8Gatsa7gKcWZulbPslkmDJU7YfyzYXkOQwa9T0IT8ArAvcmWUoP08DWcuivOS7055o4y0FQRA0ZjC3fGPMd4Bg+2BJWwIfA8Ypbd9YT16zqz/D1wt2N9rer0LZs+QlFxq+3+CdfhgEQUcxmFuHg/neBhVZAvNe28cCLwLvI6luHZwlNZH0HpLk5ghJq+ekn2VOBa4a9wDb1OwkLSxpzd6+jyAIgqoMkSuHgUZUvp3DWpKeLYR9StdPyROoHiRpS48nyWg+DUyQNB7YP++MdCBpB6WJwEzgzFJe2H6BpDF9kaQJpMp47bJdEARBfxHdzkFbsH186XzR/O8UYL4u0n6yTvS7pElSR5ZsbyLtIlXOY0Tp/GbSfsdBEAQdxwCsUysTLd8gCIKgI2n3fr6SdpP0qKQn8oZA5eufycs2J0i6S9JG7b6nWWXFxgpBa8TGCkHQqQy2jRUmvXxN5ffNekt9vGlZkoaSdqH7EPAsMAbYz/ZDBZutgYdtvyzpI8DxtrfslvNdEN3OQRAEQUfS5rHcLYAnbE9OeetiYA/SNrIA2L6rYH8PsHJbPSgwT3c796e8YxAEQdCcIS2Eoh5BDgeVslsJeKZw/myOa8QXqb+3e1uIlu88Smg7B0HQ6VQdy4U59QgaUC+3ut3aknYiVb7bVvegNebplm8NSTtKulXSpZIek3RyHni/Ly/vWS3bnS/pV3lf38mSdsjayQ9LOr+Q37TC8d61azn9L/JA/mRJexfsviVpTB7on2vQJus4n5/1nSdK+kaOX13S37Lm8wOSVlPilILtqMJ93iLpD8DEnOcphXK/0kuPOAiCoGXaLC/5LEkfocbKwHNzlSltSFrGuUfeGKdXiJbvbDYC1gFeAiYD59jeQtLXgcOAI7LdUsDOJH3lq4FtgC8BYySNtD2ui3JWIH1NrQ1cBVwuaVdgDdKYhICrJG1v+7ZCupHASrVtByUtmeMvBE62fWXWdR4CfDLbbwQsk32r5bUFsL7tp3K3zFTbm+fNF+6UdIPtpyo+syAIgl5D7RXPGAOsIWkV4J/AvsD+c5an4cAVwGcLEr29QrR8ZzPG9r9svwU8SVKPApgIjCjYXe00RXwi8LztibZnApNKdo34k+2ZeYbdcjlu1xzGAg+QKuY1SukmA6tKOk3SbsCrkhYjVchXAth+0/YbpMr9ItszbD9PUriqree9r1C57gp8Lms73wssXafcOcZSzj77kvLlIAiCXqGdLd+sgX8ocD3wMHCp7UmSDpZ0cDY7lvQePEPSOEn3t/N+ikTLdzZvFY5nFs5nMudzequOTdmu+Lm2YJNyVPj3h7bPauRcnvq+EfBh4BDg08xujZdp9rf4euFYwGG2r29iXxpLiaVGQRD0DUPbrLJh+1rg2lLcmYXjL5F6MnudaPn2Ds9LWkfSEGCvCvbXA19Q3k9X0kqS3ls0kLQMMMT2H4H/ATax/SrwrKQ9s80CkhYmyU+OymO6y5J2NbqvQblflTRfTr+mpEW6c8NBEATtJuQlg1Y5CriGNK39QaDpJvW2b5C0DnC30l/RNOC/gP8UzFYCzssVOsD/y/9+FjhL0onAO
8A+wJXAViT9ZwPftv1vSWXt5nNIXeUPKBX8ArBnqzcbBEHQGwzAOrUyoXAVtEh0OwdBpzLYFK6eef3qyu+b9y3yiQFVV0e3czAgWGj4cfNUuUF9uvo9+vr3ald57cqnsyrentPmpUYdRXQ7B0EQBB1JKyIbA42mLV9JS0r6Wl8501tIWlHS5fl4pKSPdiOPPSUdW4obL+miUtz5kp7K09QfkXRc4doXsujFhCyAsUeOl6RjJD2eRT5ukbReId2UPOGqWM600vkBkk6XdHQue5ykGYXjwyUdL+mfhbhx+TfeQAWRkCAIgk5giFw5DDS6avkuCXwNOKP3XUmozbKHkobZfg6oqUmNBDajNN28At8mCWvU8l2H9PGyvaRFbBeX8HzL9uVZ9OIhSb8lTYY6mjRLeWqe2bxstj8E2BrYyPYbWXTjKknr2X6zFSdtnwSclH2cZntkwefjgZ/aPrWU7BVJK0sabvvpVsoLgiDoLQZxw7fLMd+TgdVyC+mULE94Te1ibmkdkI+nSPqBpLuzIMMmkq6X9GRtAXNu4XUpe1h2QtI0Sf+rJJ94U14+g6TRkjbLx8tImpKPD5B0maSrgRskjchlzg+cSFqGM07SqNzarOU3RGmfx3Irc03gLdsvFqL3B35HEuPYnfrU1vi+DrwXeI00kxnb0wpiF98hrbd9I1+7AbgL+EzDX6b9XE1SfAmCIOgIBvNSo64q36OAJ22PtP2tCvk9Y3sr4HbgfFJr8wOkCg/mlD3cBThF0gr52hbA0bbXrZPvIsADtjchqTVVmZ2wFfB52zvXImy/TVIwuSTf0yXA75ldye0CjC9VspAkJB8oxY0CLgEuAvYrXTtFSTXqWeBi2/8hLft5HnhK0nmSPgEgaXFgEdtPlvK4H1iP9vONQpfzLaXytuuF8oIgCLrFYJ5w1e7ZzlflfycC99p+zfYLwJtKWsRVZQ/LzCRVdJAqyyo7Tdxo+6UKducCn8vHXwDOq2OzAmkNLACSNgdesP0P4CZgE0lLFey/lbt7lwc+KGnr3JW+G+mD5DHgp7kbuBGiwY4bTahi/9P84THS9k6F+P8AK9Z1JOQlgyDoB1rZUnCg0eps53eZ8z4bSSc2kl6sKnvYFbVKpuhP2ZdK+dl+RtLzknYGtqR+V+90YInC+X7A2rVubmBx4FMk0Ypi3tMkjSZ9LNyVNaHvA+6TdCNwnu3jJb0uadXaJs+ZWiu/EdMlzZ9b8wDvAcot9lZYkHSfcxHykkEQ9AcaiP3JFenqg+E1YLHC+T+AdZVkDJcAPthieVVlD+v5WZswtT9wRz6eAmyaj/emGuV7glRp/p4ktF1vstfDwOqQxoVJKlIb2h5hewSwB3N3PSNpGKlCf1JpxvUmhcsjSc8T4BTgF5IWyul2IVXYf2hyH7eSVLDI6T4N3NLEvivWJKlxBUEQdARq4b+BRtPKN+9leGeerHSK7WeAS4EJpK3sxrZY3pU57XjgZrLsYYV0rwPrSfo7aTu/2hjyqSRt4rtIW+dV4RbSB8S42oQvUnf5otTvcob00bCx0mfY9sA/bf+zdH3dwvh1bcx3AqkL/gpgPuBUpeVH40hjxl/P9qeRtruaKOlRknbzHraLLdEJkp7N4Sc57SdzXvcAl5W2IGxEccx3nKQROX4n4C8V0gdBEPQJ0pDKYaAxIOQl85KZpvrIPcx/M9JYaMMJR5J+TtpO8G+95Ud/obSX763AtnnbrSb0T7dzf8nmdZ5c37xNV79HX/9e7Spv8P6d9Uxe8pW3/1r5fbPk/B8ZUM3fgfe50GYkHQX8kdkbFTTiB8DCve9RvzAcOKrrijcIgqDvGMzdzgOi5Rt0EjHhKgg6lc5rQfes5Tv17esrv2+WmP/DA6oGbqnlq5CbLOYxS25SjWUbF5Z0YRYUeVDSHZLeX7D5dynd/MqykVkYxJK+VyhzGUnvSDq95Mt4ZZlLSQcW8ns7lz1O0slZfOSFkp/rSlpW0nU9eaZBEATtRhpaOQw0Wl1qtCQhN1ljDrlJ6sg2Svp/wPO2N8jnawH/rkk+5nW+04rpSlPrJwMfJ03AgjTLelKpjLLM5XnkiWN5KdRONdEQJTWyS2wfWr4ZSf+StI3tO6s/giAIgt5jIHYnV6XVMd+Qm6Sh3GQ9VgBmzYq2/ajtt5rYl5kOPFy7J9IM6UtLNlVkLqvwJ/pWzjIIgqApg3nMt9XKN+QmE/XkJotLeGrrbc8FvpM/QL4vaY0Kfpa5GNhX0srADOC50vVmMpf1GKU5u50XyvEhLxkEQYcxeDWuetvjeUJuMjOXbKPtccCqJBGN9wBjcjdxK1wHfIhUsc6h7aiuZS7rcUnBz5GFtcQN5SWDIAj6A0mVw0Cjp5XvoJSbBIpyk3+tYza9Tv6N8ptm+wrbXyN9KLQ0uSu3zv8O/DdpSVSRoszlk8yWuewODeUlFdrOQRD0C4N3a4VWK9+Qm0zMkptshqRtai3RPL68LrMlJVvhf4HvZMWxWt6VZS4r0lBe0vbZtjezvdlBB42qZxIEQdB2xNDKYaDRUuUbcpOzKMpN1iiO+dZkG1cDbpU0kfRs7mfu1muX2J5k+4JSdBWZy3qUx3y3zvEhLxkEQUcxmCdcDUiRDYXcZNuRdBtJT/rl5pYhshEEncpgE9l4c8bdld83Cw7dakDVwANvilgvo3lQbjJ3+f+k64o3CIKgL4kx346iN1u9tk+2/X7bd3Rh97ztq5rZDBRsv2D7T/3tRxB0GgsNr7KKMegtxJDKYaDRqsJVEARBEPQJA3GrwKp05J0pNKSLeVTRkN5R0lRJYyU9LOm4bF9WIPtIXjL0sNK+wqfm+EMlHdieuw6CIGgX0e3c1yxJ0pDuM9RmZW5lDWnbRQ3plitfkoZ0UUu7KOYx0vYrOf522xuTdKr/S9KmxUwkrQ+cDvyX7XWA9Una0ZCERQ7vhm9BEAS9xmDudu5Uj0NDmpY0pGdh+3WSKMdqpUvfBk6y/Ui2e9f2Gfn4DWCKpC2qlhMEQdD7RMu3rwkN6URVDelZSFqadO+TSpfWJ1XKjQht5yAIOorBvM63UyvfVpmnNaQz20kaS9rd6GTb5cq3KxpqO4e8ZBAE/YEGsbbzQJntPCg1pCUVNaTrbec3HViiom+32/54k+uTSNKb4xtcb6jtbPts4Ox0FiIbQRD0DQNRNrIqndryDQ3pRCUN6YqcAnw3jyPXxpmPLFxvqO0cBEHQP8SYb58SGtKzqKoh3SW2JwBHABdJephU0RY1oLcBBoVUZhAEg4N2dztL2k3So3mC61F1rkvSL/L1CZI2aftN1coaiNrOfYXmEQ1pSRsDR9r+bNfW0e0czDt0nlZyczrP355pO5tHK79vxFpNy8rLSR8j7Y/+LDAG2M/2QwWbjwKHkZaFbgn83PaW3XC9Szqy5TsvoM7SkF4G+J9eLiMIgqAl2jzbeQvgCduT8wqUi0nbsBbZA/itE/cAS6r5LnHdx3aECJUDcFC7bSPPyHMw5DnY7qe38uytABxEWjJZCweVru8NnFM4/yxwesnmGmDbwvlNwGa94W+0fINWOagX
bCPPyHMw5DnY7qe38uwVbJ9te7NCOLtkUq95XO7WrmLTFqLyDYIgCOYFngXeVzhfGXiuGzZtISrfIAiCYF5gDLCGpFWy5O++zBZoqnEV8Lk86/kDwFTb/+oNZwaKyEbQOZS7ctphG3lGnoMhz8F2P72VZ79g+11JhwLXA0OBc21Pqu0BYPtM4FrSTOcngDeAXtvtLZYaBUEQBEEfE93OQRAEQdDHROUbBEEQBH1MVL5BEARB0MfEhKugKZLWJqm+rERa7/YccJXth/vVsSAIggFMtHyDhkj6DkmCTaRdoMbk44vqiZJXzHMJSSdLekTS/+XwcI5bsmC3WynNb7LQ+R8kLdcpeQbVycs3tpT0SUl75eOu9HgXlbRJo2ceeXZ+nkED+lsSLELnBpII+Xx14ucHHi/F7VY4XgL4DWknqT8AyxWuXQ98B1i+ELd8jruxEPdA4fgc4PvA+4FvAH8qld1veTZ4bosCmwBL1rkmkmD7J4G98rEq/BYDOk9gV9Lyjb/mZ38OcF2O27Vgd0bheFvgadKOYM8AH408B1aeEZr8v9LfDkTo3AA8Ary/Tvz7gUdLcZUqtnK6Uh6PNshvXMmufN5veebzSi+jqi+3QZrnw8CIOs9yFeDhBr/RLcAm+XhV4P5S2sizw/OM0DjEmG/QjCOAmyQ9TnqZAgwHVgcObZJuM9sj8/FPJX2+cO0fkr4NXGD7eYDc5XtAoQyA90o6ktQCW1ySnP8PZ+7hkv7ME+ADhePvAXvafkDSqqR9qK/N134O7GJ7SjGxpFWyzTqDOM9hJOm+Mv8E5qsTD7C47QcAbE9W2hKuSOTZ+XkGDYjKN2iI7eskrUnaimslUqX1LDDG9oySedWKbRRwFHBrrswMPE+Sdft0we7XwGL5+ALStocvSFoeGFcquzfzHK3Z48H18izT7GXUnZfbYMnzXGCMpIuZ/fHyPpLE328KdmtLmkD6OxohaSnbL0saUqfsenkOJ/127cyzN/wcKHn29HkGDQiFq6AtSDquFHWG7VrF9mPbnyvYbgHY9hhJ6wG7kbq1rqUJkn5bzKcQvyXwiO2pkhYmVZqbAJOAH9iemu0OB660XW65NipvddJ45/uAd0lj4BfV8ivYvUHqkhUwAhheeBlNsL1+tvt/pIq73gvzUts/HKx5Ztt1mD1zvvYhd5Xn3Mz8/aWf4Tnb70haBtje9hWlZ9+dPP9l++0+znNdYPc233tv5Nn23yioT1S+Qa8j6UDb5+Xj44CPkFpXN5Ja1bcCuwDX2z4p25UFzwF2Bm4GsL17If9JwEZO2q1nA68DfwQ+mOM/me2m5mtPAhcBl9l+oYHPhwMfB24jab2OA14mVcZfsz26YFv5RVzlhdmNPLt8YXaCn32BpPfa/k9fljnQaOUZSVra9v/1tk/zJO0aPI4QoVEAni4cTySJmi8MvErqqgRYiNRSqtmNBX4P7AjskP/9Vz7eoZR/3ckg+XxcKc8hpAlFvwFeIE0k+jywWCndRGBoPl4YGJ2PhwNj+/uZtvG3WboDfPhr4Xhx4IfA74D9S3ZnlM7fUydMAZYC3lOwK87EX5IGM/Frfz/AMcCqXfi8GWmy0e9JvQI3AlNJy/E2LtkuAZxMmsD4fzk8nOOWbPUZ5fPlgV8BvwSWBo7P93QpsEKTZ7R0vWeUbU8GlsnHmwKTgceBfxT/nys8o9X6+29nIIdY5xu0BaX1svXCRKC4hvZd2zNsvwE8aftVANvTgZkFu02BvwNHk7b1Gg1Mt32r7VtLxT8oqbb7yHhJm2Wf1gTeKdjZ9kzbN9j+IrAicAap23tynduqzYlYgDxWbPtpSuNaqr7OeHFJP5T0O0n7lfI4o3S+vKRfSfqlpKUlHS9poqRLJa1QsCuvXT5Hjdcun5xbpEjaTNJk4F5J/5C0Q8FuM0m3SPq9pPdJulHSK5LGSNq4lOeikk6UNEnSVEkvSLpH0gElu00ahE2BkQXT80gt7T8C+0r6o6QF8rXi5C6AF0l/I8WwEqlyuL9g94PC8amkj7hPkCrKs0p5LkWqoEdLuk/SNyStyNycAfwY+AtwF3CW7SVIQx5nlGwvJfWa7Gh7adtLAzsBrwCXdeMZAZwPPEQaFrgFmE7qqbkdOLPJM7q/wTMC+JjtFwvPaZTtNYAPAf9b5xnd0sUzCprR37V/hMERSJORRpKWFxXDCNKYUM3uXmDhfDykEL8EpVZrjl+Z9II6nUILumSzBOll9GTO/x1SZXorqdu5Zje2if8Llc6/TmpJnE1qsRyY45cFbivZVl1n/EdS62JP0sStPwIL5GvlFvt1wGGkl/mEnNfwHPfngl0ra5cnFo5vATbPx2tSWCJCElT5CLAf6eW+d47/IHB3Kc8/k2aArwwcCfwPsAZpQtsPCnYzSEMGt9QJ0wt240r5Hw3cSWqxlZ/RN/Nz2qAQ91Sd37aVJWZF2+1IFem/s58H1ftbKv9dlv/OqL4UrtIzqlD+uMJxpWeU4x8BhuXje5r87VR6RhGah353IMLgCKSuvG0bXPtD4XiBBjbLFF8Qda5/rPgyb2CzGLARqdW8XJ3ra7Z4T+sBewNrd2FX9eU6rnStWcUytnDc7OXaSsVS9eXarOyxpfPxpfMx+d8hpElwtfgHgTUaPKNnCscPU/goy3GfJ02e+0edtLWPs5/k339yHZtnSR8G/036KFPh2oSSbb0PwKGk3pHzCnF3k4Yv9iF1y+6Z43dg7vWzNwDfZk6xmeVIH1R/a/UZlZ878P3StfI9dfmMst1h2dedSd3YPwO2B04AftfqM4rQPMRSo6AtOHXjNrq2f+H4rQY2L5K6yBrl8RdSF18zH14Dxje5/liz9HXsJ5Fe+l3xD1VbE7yApCG2Z+b8T5L0LGlS16KlPItDQr8tXSsu4Wll7fIvgWslnQxcJ+lnwBWkFu24gt2bknYl9ShY0p62/5S7pstLzF6XtK3tOyR9Angp39tMaQ5ZwuPr+FPjsMLx1aSX/99qEbYvkPQ8cFo5oe1ngX1y2TeSxufLtLLEbK6/EadlddflUONgUrfzTODDwFclnU9ajvXlUhZVl8IdT7VnBPBnSYvanmb7mFqk0gz9Oe6h4jPC9ml5mOirpN6QYfnfP5F6VGpUfUZBM/q79o8QYaAH0hjYj0gty5dyeDjHLVWw+zFJvKKcfjfmlus8EVi0ju3qwOWF8+NKYdkcvzzw2zrpdwQuIU0+m0gSwTiI3CLONhuRutL/CqxNEt14hfQhsnUpvw1J3dSvAHeQexdI3fOHl2zXJlX0i5bvv6LdR+rczyxb0qS99VvMc7dmeXbh5zot5LkFs7v51yO1wueSYizZrUtqsdeVbKxqW7LbgDRZqkqebfEzQv3Q7w5EiDCYA3msuF12AzVP4HDgUVIragqwR+Fasev8sCp2vZhnJdtc9iMV8zwOuIc0wemHwE3AsaQej6Ob2N1cz64V26pl95afEZr8/9HfDkSIMJgDDSaJddduoOZJamUvmo9H5Bf31/P52FbtBmCeVZbXVbIbSHlGaBxizDcIeoiS1F7dSxSWWVW1G4x5ktZ
MTwOwPUXSjsDlWaRD3bAbSHm+6zQm+oakOZbXSZrZDbuBlGfQgKh8g6DnLEeadPNyKV6kNaCt2g3GPP8taaTtcQC2p0n6OElPeINu2A2kPN+WtLDT2vZNZz0gaQnmXNte1W4g5Rk0or+b3hEiDPRA9WVWlewGaZ4rU1gHXbLbplW7AZZnpeV1Ve0GUp4RGofQdg6CIAiCPibkJYMgCIKgj4nKNwiCIAj6mKh8gyAIgqCPico3CIIgCPqY/w/ERFppM7WbXAAAAABJRU5ErkJggg==", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "sns.heatmap(ensembled_tasks.isna().transpose(),\n", - " cmap=\"YlGnBu\",\n", - " cbar_kws={'label': 'Missing Data'})" - ] - }, - { - "cell_type": "code", - "execution_count": 66, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Unnamed: 0 0.000000\n", - "slide_submitter_id 0.000000\n", - "sample_submitter_id 0.000000\n", - "Stromal score 0.007833\n", - "CAFs (MCP counter) 0.013055\n", - "CAFs (EPIC) 0.013055\n", - "CAFs (Bagaev) 0.013055\n", - "Endothelial cells (xCell) 0.013055\n", - "Endothelial cells (EPIC) 0.013055\n", - "Endothelium 0.013055\n", - "CD8 T cells (Thorsson) 0.796345\n", - "Cytotoxic cells 0.010444\n", - "Effector cells 0.013055\n", - "CD8 T cells (quanTIseq) 0.013055\n", - "TIL score 0.796345\n", - "Immune score 0.007833\n", - "tumor purity (ABSOLUTE) 0.023499\n", - "tumor purity (ESTIMATE) 0.007833\n", - "tumor purity (EPIC) 0.013055\n", - "dtype: float64" - ] - }, - "execution_count": 66, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "(ensembled_tasks.isna().transpose().sum(axis=1)) / len(ensembled_tasks) #" - ] - }, - { - "cell_type": "code", - "execution_count": 67, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "383" - ] - }, - "execution_count": 67, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(ensembled_tasks)" - ] - }, - { - "cell_type": "code", - "execution_count": 68, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "382\n" - ] - }, - { - "data": { - "text/plain": [ - "tumor purity (ABSOLUTE) 0.020942\n", - "tumor purity (ESTIMATE) 0.005236\n", - "tumor purity (EPIC) 0.010471\n", - "dtype: float64" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/plain": [ - "tumor purity (ABSOLUTE) 8\n", - "tumor purity (ESTIMATE) 2\n", - "tumor purity (EPIC) 4\n", - "dtype: int64" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "370\n" - ] - } - ], - "source": [ - "tmp = ensembled_tasks[tumor_purity].dropna(how=\"all\")\n", - "print(len(tmp))\n", - "display((tmp.isna().transpose().sum(axis=1)) / len(tmp))\n", - "display((tmp.isna().transpose().sum(axis=1)))\n", - "print(len(tmp.dropna(how=\"any\")))" - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "381\n" - ] - }, - { - "data": { - "text/plain": [ - "CD8 T cells (Thorsson) 0.795276\n", - "Cytotoxic cells 0.005249\n", - "Effector cells 0.007874\n", - "CD8 T cells (quanTIseq) 0.007874\n", - "TIL score 0.795276\n", - "Immune score 0.002625\n", - "dtype: float64" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/plain": [ - "CD8 T cells (Thorsson) 303\n", - "Cytotoxic cells 2\n", - "Effector cells 3\n", - "CD8 T cells (quanTIseq) 3\n", - "TIL score 303\n", - "Immune score 1\n", - "dtype: int64" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "77\n" - ] - } - ], - "source": [ - "tmp = ensembled_tasks[T_cells].dropna(how=\"all\")\n", - "print(len(tmp))\n", - "display((tmp.isna().transpose().sum(axis=1)) / len(tmp))\n", - "display((tmp.isna().transpose().sum(axis=1)))\n", - "print(len(tmp.dropna(how=\"any\")))" - ] 
- }, - { - "cell_type": "code", - "execution_count": 70, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "380\n" - ] - }, - { - "data": { - "text/plain": [ - "Stromal score 0.000000\n", - "CAFs (MCP counter) 0.005263\n", - "CAFs (EPIC) 0.005263\n", - "CAFs (Bagaev) 0.005263\n", - "dtype: float64" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/plain": [ - "Stromal score 0\n", - "CAFs (MCP counter) 2\n", - "CAFs (EPIC) 2\n", - "CAFs (Bagaev) 2\n", - "dtype: int64" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "378\n" - ] - } - ], - "source": [ - "tmp = ensembled_tasks[CAFs].dropna(how=\"all\")\n", - "print(len(tmp))\n", - "display((tmp.isna().transpose().sum(axis=1)) / len(tmp))\n", - "display((tmp.isna().transpose().sum(axis=1)))\n", - "print(len(tmp.dropna(how=\"any\")))" - ] - }, - { - "cell_type": "code", - "execution_count": 71, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "378\n" - ] - }, - { - "data": { - "text/plain": [ - "Endothelial cells (xCell) 0.0\n", - "Endothelial cells (EPIC) 0.0\n", - "Endothelium 0.0\n", - "dtype: float64" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/plain": [ - "Endothelial cells (xCell) 0\n", - "Endothelial cells (EPIC) 0\n", - "Endothelium 0\n", - "dtype: int64" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "378\n" - ] - } - ], - "source": [ - "tmp = ensembled_tasks[endothelial_cells].dropna(how=\"all\")\n", - "print(len(tmp))\n", - "display((tmp.isna().transpose().sum(axis=1)) / len(tmp))\n", - "display((tmp.isna().transpose().sum(axis=1)))\n", - "print(len(tmp.dropna(how=\"any\")))" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3.9.10 ('base')", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:27:43) \n[Clang 11.1.0 ]" - }, - "orig_nbformat": 4, - "vscode": { - "interpreter": { - "hash": "aa98a7cfca054c33f70b1b98f933bb29fd610054927dca4b481403e32f3903ef" - } - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/Python/2_train_multitask_models/processing_transcriptomics.py b/Python/2_train_multitask_models/processing_transcriptomics.py deleted file mode 100644 index 70e6c22..0000000 --- a/Python/2_train_multitask_models/processing_transcriptomics.py +++ /dev/null @@ -1,243 +0,0 @@ -# Module imports -import os -import sys -import argparse -import joblib - -import numpy as np -import pandas as pd -import git -REPO_DIR= git.Repo('.', search_parent_directories=True).working_tree_dir -sys.path.append(f"{REPO_DIR}/Python/libs") -import model.preprocessing as preprocessing -from model.constants import TUMOR_PURITY, T_CELLS, ENDOTHELIAL_CELLS, CAFS, IDS, TILE_VARS - -# cancer_type="SKCM" -# slide_type="FFPE" -# # clinical_file = pd.read_csv("../../data/SKCM/slide.tsv", sep="\t") -# clinical_file = pd.read_csv("../../data/FFPE_generated_clinical_file.txt", sep="\t") - -# # Set paths: 1) folder with all published data, 2) folder with all computed data, 3) folder for storing the ensembled 
tasks -# path_published_data = "../../data/published" -# path_computed_features= "../../data/" -# output_dir = "../../data/" - -def processing_transcriptomics(cancer_type, slide_type, clinical_file_path, tpm_path, output_dir, path_data=None,): - """ Compute and combine cell type abundances from different quantification methods necessary for TF learning - Args: - cancer_type (str): abbreviation of cancer_type - slide_type (str): type of slide either 'FFPE' or 'FF' for naming and necessary for merging data - clinical_file_path (str): clinical_file_path: path to clinical_file - tpm_path (str): path pointing to the tpm file - output_dir (str): path pointing to a folder where the dataframe containing all features should be stored, stored as .txt file - - - Returns: - ./task_selection_names.pkl: pickle file containing variable names. - {output_dir}/TCGA_{cancer_type}_ensembled_selected_tasks.csv" containing the following cell type quantification methods: - tumor_purity = [ - 'tumor purity (ABSOLUTE)', - 'tumor purity (estimate)', - 'tumor purity (EPIC)' - ] - T_cells = [ - 'CD8 T cells (Thorsson)', - 'Cytotoxic cells', - 'Effector cells', - 'CD8 T cells (quanTIseq)', - 'TIL score', - 'Immune score', - ] - endothelial_cells = [ - 'Endothelial cells (xCell)', - 'Endothelial cells (EPIC)', - 'Endothelium', ] - CAFs = [ - 'Stromal score', - 'CAFs (MCP counter)', - 'CAFs (EPIC)', - 'CAFs (Bagaev)', - ] - """ - full_output_dir = f"{output_dir}/2_TF_training" - if not os.path.exists(full_output_dir): - os.makedirs(full_output_dir) - - var_dict = { - "CAFs": CAFS, - "T_cells": T_CELLS, - "tumor_purity": TUMOR_PURITY, - "endothelial_cells": ENDOTHELIAL_CELLS, - "IDs":IDS, - "tile_IDs": TILE_VARS - } - joblib.dump(var_dict, "./task_selection_names.pkl") - clinical_file = pd.read_csv(clinical_file_path, sep="\t") - - var_IDs = ['sample_submitter_id','slide_submitter_id'] - all_slide_features = clinical_file.loc[:,var_IDs] - - # Published Data - Thorsson = pd.read_csv(f"{REPO_DIR}/data/published/Thorsson_Scores_160_Signatures.tsv", sep="\t") - estimate = pd.read_csv(f"{REPO_DIR}/data/published/Yoshihara_ESTIMATE_{cancer_type}_RNAseqV2.txt", sep="\t") - tcga_absolute = pd.read_csv(f"{REPO_DIR}/data/published/TCGA_ABSOLUTE_tumor_purity.txt", sep="\t") - gibbons = pd.read_excel(f"{REPO_DIR}/data/published/Gibbons_supp1.xlsx", skiprows=2, sheet_name="DataFileS1 - immune features") - - # Computed Data: Immunedeconv - mcp_counter = pd.read_csv(f"{output_dir}/immunedeconv/mcp_counter.csv", index_col=0, sep=",") - quantiseq = pd.read_csv(f"{output_dir}/immunedeconv/quantiseq.csv", index_col=0, sep=",") - xCell = pd.read_csv(f"{output_dir}/immunedeconv/xcell.csv", index_col=0, sep=",", header=[0]) - EPIC = pd.read_csv(f"{output_dir}/immunedeconv/epic.csv", index_col=0, sep=",") - - # Re(compute) Fges scores with TPM - Fges_computed = preprocessing.compute_gene_signature_scores(tpm_path) - Fges_computed = Fges_computed.loc[:, ["Effector_cells", "Endothelium", "CAF"]] - Fges_computed.columns = ["Effector cells", "Endothelium", "CAFs (Bagaev)"] - - Fges_computed = Fges_computed.reset_index() - Fges_computed = Fges_computed.rename(columns={"index": "TCGA_sample"}) - - # From immunedeconv - quantiseq = preprocessing.process_immunedeconv(quantiseq, "quanTIseq") - EPIC = preprocessing.process_immunedeconv(EPIC, "EPIC") - mcp_counter = preprocessing.process_immunedeconv(mcp_counter, "MCP") - xCell = preprocessing.process_immunedeconv(xCell, "xCell") - - # Merge cell fractions - cellfrac = pd.merge(xCell, quantiseq, 
on=["TCGA_sample"]) - cellfrac = pd.merge(cellfrac, mcp_counter, on=["TCGA_sample"]) - cellfrac = pd.merge(cellfrac, EPIC, on=["TCGA_sample"]) - - # Merge cell fractions - cellfrac = pd.merge(xCell, quantiseq, on=["TCGA_sample"]) - cellfrac = pd.merge(cellfrac, mcp_counter, on=["TCGA_sample"]) - cellfrac = pd.merge(cellfrac, EPIC, on=["TCGA_sample"]) - - # estimate data - estimate = estimate.rename(columns={"ID": "TCGA_sample"}) - estimate = estimate.set_index("TCGA_sample") - estimate.columns = ["Stromal score", "Immune score", "ESTIMATE score"] - - # According the tumor purity formula provided in the paper - estimate["tumor purity (ESTIMATE)"] = np.cos( - 0.6049872018 + .0001467884 * estimate["ESTIMATE score"]) - estimate = estimate.drop(columns=["ESTIMATE score"]) - - # Thorsson data - Thorsson = Thorsson.drop(columns="Source") - Thorsson = Thorsson.set_index("SetName").T - Thorsson = Thorsson.rename_axis(None, axis=1) - Thorsson.index.name="TCGA_aliquot" - Thorsson = Thorsson.loc[:, ["LIexpression_score", "CD8_PCA_16704732"]] - Thorsson.columns = ["TIL score", "CD8 T cells (Thorsson)"] - - # TCGA PanCanAtlas - tcga_absolute = tcga_absolute.rename(columns = {"purity": "tumor purity (ABSOLUTE)", "sample": "TCGA_aliquot"}) - tcga_absolute = tcga_absolute.set_index("TCGA_aliquot") - tcga_absolute = pd.DataFrame(tcga_absolute.loc[:, "tumor purity (ABSOLUTE)"]) - - gibbons = gibbons.rename(columns={'Unnamed: 1': "id"}) - gibbons["slide_submitter_id"] = gibbons["id"].str[0:23] - gibbons["Cytotoxic cells"] = gibbons["Cytotoxic cells"].astype(float) - gibbons = gibbons.set_index("slide_submitter_id") - - all_slide_features["TCGA_sample"] = clinical_file["slide_submitter_id"].str[0:15] - - # add IDs - Thorsson["TCGA_sample"] = Thorsson.index.str[0:15] - tcga_absolute["TCGA_sample"] = tcga_absolute.index.str[0:15] - gibbons["TCGA_sample"] = gibbons.index.str[0:15] - - tcga_absolute_merged = pd.merge(all_slide_features, tcga_absolute, on=["TCGA_sample", ], how="left") - Thorsson_merged = pd.merge(all_slide_features, Thorsson, on=["TCGA_sample",], how="left") - gibbons_merged = pd.merge(all_slide_features, gibbons, on=["TCGA_sample"], how="left") - - cellfrac_merged = pd.merge(all_slide_features, cellfrac, on=["TCGA_sample"], how="left") - estimate_merged = pd.merge(all_slide_features, estimate, on=["TCGA_sample" ], how="left") - Fges_computed_merged = pd.merge(all_slide_features, Fges_computed, on=["TCGA_sample"], how="left") - - # Combine in one dataframe - all_merged = pd.merge(all_slide_features, tcga_absolute_merged, how="left") - all_merged = pd.merge(all_merged, Thorsson_merged ,how="left") - all_merged = pd.merge(all_merged, gibbons_merged,how="left") - all_merged = pd.merge(all_merged, estimate_merged, how="left") - all_merged = pd.merge(all_merged, cellfrac_merged, how="left") - all_merged = pd.merge(all_merged, Fges_computed_merged, how="left") - - # ---- Transform features to get a normal distribution (immunedeconv) ---- # - featuresnames_transform = ["CAFs (MCP counter)", - 'CAFs (EPIC)',] - feature_data = all_merged.loc[:, CAFS].astype(float) - data_log2_transformed = feature_data.copy() - data_log2_transformed[featuresnames_transform] = np.log2(feature_data[featuresnames_transform] * 100 + 0.001) - CAFs_transformed = data_log2_transformed - - featuresnames_transform = ["Endothelial cells (xCell)", - "Endothelial cells (EPIC)",] - feature_data = all_merged.loc[:, ENDOTHELIAL_CELLS].astype(float) - data_log2_transformed = feature_data.copy() - 
data_log2_transformed[featuresnames_transform] = np.log2(feature_data[featuresnames_transform] * 100 + 0.001) - endothelial_cells_transformed = data_log2_transformed - - feature_data = all_merged.loc[:, T_CELLS].astype(float) - featuresnames_transform = ['CD8 T cells (quanTIseq)'] - data_log2_transformed = feature_data.copy() - data_log2_transformed[featuresnames_transform] = np.log2(feature_data[featuresnames_transform] * 100 + 0.001) - T_cells_transformed = data_log2_transformed - - feature_data = all_merged.loc[:, TUMOR_PURITY].astype(float) - featuresnames_transform = ["tumor purity (EPIC)"] - data_log2_transformed = feature_data.copy() - data_log2_transformed[featuresnames_transform] = np.log2(feature_data[featuresnames_transform] * 100 + 0.001) - tumor_cells_transformed = data_log2_transformed - - # Store processed data - IDs = ['slide_submitter_id', 'sample_submitter_id', "TCGA_sample"] - metadata = all_merged[IDs] - merged = pd.concat([ - metadata, - CAFs_transformed, endothelial_cells_transformed, T_cells_transformed, tumor_cells_transformed], axis=1) - merged = merged.fillna(np.nan) - - # Remove slides if there are no values at all - merged = merged.dropna(axis=0, subset=T_CELLS + CAFS + ENDOTHELIAL_CELLS + TUMOR_PURITY, how="all") - merged.to_csv(f"{full_output_dir}/ensembled_selected_tasks.csv", sep="\t") - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description='Process transcriptomics data for use in TF learning') - parser.add_argument( - "--cancer_type", - help="Abbreviation of cancer type for naming of generated files", - ) - parser.add_argument( - "--clinical_file_path", - help="Full path to clinical file", default=None - ) - parser.add_argument( - "--slide_type", - help="Type of pathology slides either 'FF' (fresh frozen) or 'FFPE' (Formalin-Fixed Paraffin-Embedded) by default 'FF'", - type=str, required=True - ) - parser.add_argument( - "--tpm_path", help="Path to tpm file", type=str, required=True - ) - - parser.add_argument( - "--output_dir", help="Path to folder for generated file") - args = parser.parse_args() - - # old_stdout = sys.stdout - # log_file = open(f"{REPO_DIR}/logs/processing_transcriptomics.log", "w") - # sys.stdout = log_file - - processing_transcriptomics( - cancer_type=args.cancer_type, - slide_type=args.slide_type, - tpm_path=args.tpm_path, - path_data=args.path_data, - clinical_file_path=args.clinical_file_path, - output_dir=args.output_dir, - ) - - # sys.stdout = old_stdout - # log_file.close() diff --git a/Python/2_train_multitask_models/run_TF_pipeline.py b/Python/2_train_multitask_models/run_TF_pipeline.py deleted file mode 100644 index 8cc891e..0000000 --- a/Python/2_train_multitask_models/run_TF_pipeline.py +++ /dev/null @@ -1,220 +0,0 @@ -# Module imports -import argparse -import os -import sys -import dask.dataframe as dd -import joblib -import git -import numpy as np -import pandas as pd -from sklearn import linear_model, metrics -from sklearn.model_selection import GridSearchCV, GroupKFold -from sklearn.preprocessing import StandardScaler - -# Custom imports -import model.evaluate as meval -import model.preprocessing as preprocessing -import model.utils as utils - -def nested_cv_multitask( - output_dir, - category, - alpha_min, - alpha_max, - n_steps=40, - n_outerfolds=5, - n_innerfolds=10, - n_tiles=50, - split_level="sample_submitter_id", - slide_type="FF" -): - """ - Transfer Learning to quantify the cell types on a tile-level - Use a nested cross-validation strategy to train a multi-task lasso algorithm. 
Tuning and evaluation based on spearman correlation. - - Args: - output_dir (str): Path pointing to folder where models will be stored - category (str): cell type - alpha_min (int): Min. value of hyperparameter alpha - alpha_max (int): Max. value of hyperparameter alpha - n_steps (int): Stepsize for grid [alpha_min, alpha_max] - slide_type (str): slide format (FF or FFPE) - n_outerfolds (int): Number of outer loops - n_innerfolds (int): Number of inner loops - n_tiles (int): Number of tiles to select per slide - split_level (str): Split level of slides for creating splits - - Returns: - {output_dir}/: Pickle files containing the created splits, selected tiles, learned models, scalers, and evaluation of the slides and tiles using the spearman correlation for both train and test sets - """ - # Hyperparameter grid for tuning - alphas = np.logspace(int(alpha_min), int(alpha_max), int(n_steps)) - scoring = meval.custom_spearmanr - N_JOBS = -1 - OUTPUT_PATH = f"{output_dir}/2_TF_training/models/{category}" - - print(slide_type) - - # Load data - var_names = joblib.load("./task_selection_names.pkl") - var_names['T_cells'] = ['Cytotoxic cells', 'Effector cells', 'CD8 T cells (quanTIseq)', 'Immune score'] - target_features = pd.read_csv(f"{output_dir}/2_TF_training/ensembled_selected_tasks.csv", sep="\t", index_col=0) - if slide_type == "FF": - bottleneck_features = pd.read_csv(f"{output_dir}/1_histopathological_features/features.txt", sep="\t", index_col=0) - elif slide_type == "FFPE": - bottleneck_features = dd.read_parquet(f"{output_dir}/1_histopathological_features/features.parquet") - - target_vars = var_names[category] - metadata_colnames = var_names["tile_IDs"] + var_names["IDs"] + var_names[category] - - if os.path.exists(OUTPUT_PATH): - print("Folder exists") - else: - os.makedirs(OUTPUT_PATH) - - # Preprocessing - IDs = ["sample_submitter_id", "slide_submitter_id"] - merged_data = preprocessing.clean_data( - bottleneck_features, target_features.loc[:, target_vars + IDs], slide_type - ) - total_tile_selection = utils.selecting_tiles(merged_data, n_tiles, slide_type) - X, Y = utils.split_in_XY( - total_tile_selection, metadata_colnames, var_names[category] - ) - - # TF learning - ## Create variables for storing - model_learned = dict.fromkeys(range(n_outerfolds)) - x_train_scaler = dict.fromkeys(range(n_outerfolds)) - y_train_scaler = dict.fromkeys(range(n_outerfolds)) - - multi_task_lasso = linear_model.MultiTaskLasso() - - ## Setup nested cv - sample_id = pd.factorize(total_tile_selection[split_level])[0] - cv_outer = GroupKFold(n_splits=n_outerfolds) - cv_inner = GroupKFold(n_splits=n_innerfolds) - cv_outer_splits = list(cv_outer.split(X, Y, groups=sample_id)) - - ## Storing scores - tiles_spearman_train = {} - tiles_spearman_test = {} - slides_spearman_train = {} - slides_spearman_test = {} - - print("Feature matrix dimensions [tiles, features]:", X.shape) - print("Response matrix dimensions:", Y.shape) - - ## Run nested cross-validation - for outerfold in range(n_outerfolds): - print(f"Outerfold {outerfold}") - train_index, test_index = cv_outer_splits[outerfold] - x_train, x_test = X.iloc[train_index], X.iloc[test_index] - y_train, y_test = Y.iloc[train_index], Y.iloc[test_index] - - ### Standardizing predictors - scaler_x = StandardScaler() - scaler_x.fit(x_train) - x_train_z = scaler_x.transform(x_train) - x_test_z = scaler_x.transform(x_test) - - ### Standardizing targets - scaler_y = StandardScaler() - scaler_y.fit(y_train) - y_train_z = scaler_y.transform(y_train) - y_test_z = 
scaler_y.transform(y_test) - grid = GridSearchCV( - estimator=multi_task_lasso, - param_grid=[{"alpha": alphas}], - cv=cv_inner, - scoring=metrics.make_scorer(scoring), - return_train_score=True, - n_jobs=N_JOBS, - ) - grid.fit(x_train_z, y_train_z, groups=sample_id[train_index]) - - ### Evaluate on tile level (spearmanr) - y_train_z_pred = grid.predict(x_train_z) - y_test_z_pred = grid.predict(x_test_z) - - tiles_spearman_train[outerfold] = meval.custom_spearmanr( - y_train_z, y_train_z_pred, in_gridsearch=False - ) - tiles_spearman_test[outerfold] = meval.custom_spearmanr( - y_test_z, y_test_z_pred, in_gridsearch=False - ) - - ### Evaluate on Slide level - #### For aggregation for evaluation in gridsearch for hyper parameter choosing - slide_IDs_train = total_tile_selection["slide_submitter_id"].iloc[train_index] - slide_IDs_test = total_tile_selection["slide_submitter_id"].iloc[test_index] - - ##### 1. Aggregate on slide level (averaging) - Y_train_true_agg, Y_train_pred_agg = meval.compute_aggregated_scores( - y_train_z, y_train_z_pred, target_vars, slide_IDs_train - ) - Y_test_true_agg, Y_test_pred_agg = meval.compute_aggregated_scores( - y_test_z, y_test_z_pred, target_vars, slide_IDs_test - ) - - ###### 2. Compute spearman correlation between ground truth and predictions - slides_spearman_train[outerfold] = meval.custom_spearmanr( - Y_train_true_agg, Y_train_pred_agg, in_gridsearch=False - ) - slides_spearman_test[outerfold] = meval.custom_spearmanr( - Y_test_true_agg, Y_test_pred_agg, in_gridsearch=False - ) - - ###### Store scalers for future predictions/later use - x_train_scaler[outerfold] = scaler_x - y_train_scaler[outerfold] = scaler_y - model_learned[outerfold] = grid - - ## Store for reproduction of outer scores - joblib.dump(cv_outer_splits, f"{OUTPUT_PATH}/cv_outer_splits.pkl") - joblib.dump(total_tile_selection, f"{OUTPUT_PATH}/total_tile_selection.pkl") - joblib.dump(model_learned, f"{OUTPUT_PATH}/outer_models.pkl") - joblib.dump(x_train_scaler, f"{OUTPUT_PATH}/x_train_scaler.pkl") - joblib.dump(y_train_scaler, f"{OUTPUT_PATH}/y_train_scaler.pkl") - joblib.dump(slides_spearman_train, f"{OUTPUT_PATH}/outer_scores_slides_train.pkl") - joblib.dump(slides_spearman_test, f"{OUTPUT_PATH}/outer_scores_slides_test.pkl") - joblib.dump(tiles_spearman_train, f"{OUTPUT_PATH}/outer_scores_tiles_train.pkl") - joblib.dump(tiles_spearman_test, f"{OUTPUT_PATH}/outer_scores_tiles_test.pkl") - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--path_target_features", help="Path pointing to file containing the cell type abundances", - ) - parser.add_argument( - "--output_dir", help="Path pointing to folder where models will be stored" - ) - parser.add_argument( - "--path_bottleneck_features", help="Path pointing to file containing the histopathological features", - ) - parser.add_argument("--category", help="Cell type") - parser.add_argument("--alpha_min", help="Min. value of hyperparameter alpha") - parser.add_argument("--alpha_max", help="Max. 
value of hyperparameter alpha") - parser.add_argument("--n_steps", help="Stepsize for grid [alpha_min, alpha_max]", default=40) - parser.add_argument("--n_innerfolds", help="Number of inner loops", default=10) - parser.add_argument("--n_outerfolds", help="Number of outer loops", default=5) - parser.add_argument("--n_tiles", help="Number of tiles to select per slide", default=50) - parser.add_argument("--split_level", help="Split level of slides for creating splits", - default="sample_submitter_id", - ) - parser.add_argument("--slide_type", help="Type of tissue slide (FF or FFPE)]") - - args = parser.parse_args() - nested_cv_multitask( - output_dir=args.output_dir, - category=args.category, - alpha_min=args.alpha_min, - alpha_max=args.alpha_max, - n_steps=args.n_steps, - n_outerfolds=args.n_outerfolds, - n_innerfolds=args.n_innerfolds, - n_tiles=args.n_tiles, - split_level=args.split_level, - slide_type=args.slide_type - ) diff --git a/Python/3_spatial_characterization/compute_clustering_features.py b/Python/3_spatial_characterization/compute_clustering_features.py deleted file mode 100644 index c1635a5..0000000 --- a/Python/3_spatial_characterization/compute_clustering_features.py +++ /dev/null @@ -1,262 +0,0 @@ -import multiprocessing -import sys -import joblib -import pandas as pd -from joblib import Parallel, delayed -import argparse - -import features.clustering as clustering # trunk-ignore(flake8/E402) -import features.features as features # trunk-ignore(flake8/E402) -import features.graphs as graphs # trunk-ignore(flake8/E402) - -NUM_CORES = multiprocessing.cpu_count() - - -def compute_clustering_features(tile_quantification_path, output_dir, slide_type="FF", cell_types=None, graphs_path=None): - - if cell_types is None: - cell_types = ["CAFs", "T_cells", "endothelial_cells", "tumor_purity"] - - predictions = pd.read_csv(tile_quantification_path, sep="\t", index_col=0)] - - ##################################### - # ---- Constructing the graphs ---- # - ##################################### - - if graphs_path is None: - results= Parallel(n_jobs=NUM_CORES)( - delayed(graphs.construct_graph)(predictions=predictions, - slide_submitter_id=slide_submitter_id) - for _, slide_submitter_id in slides.to_numpy() - ) - # Extract/format graphs - all_graphs = { - list(slide_graph.keys())[0]: list(slide_graph.values())[0] - for slide_graph in results - } - joblib.dump( - all_graphs, f"{output_dir}/{slide_type}_graphs.pkl") - else: - all_graphs = joblib.load(graphs_path) - - ###################################################################### - # ---- Fraction of cell type clusters (simultaneous clustering) ---- # - ###################################################################### - - # Spatially Hierarchical Constrained Clustering with all quantification of all cell types - slide_clusters= Parallel(n_jobs=NUM_CORES)(delayed(clustering.schc_all)(predictions, all_graphs[slide_submitter_id], slide_submitter_id) for subtype, slide_submitter_id in slides.to_numpy()) - # Combine the tiles labeled with their cluster id for all slides - tiles_all_schc = pd.concat(slide_clusters, axis=0) - - # Assign a cell type label based on the mean of all cluster means across all slides - all_slide_clusters_characterized = clustering.characterize_clusters(tiles_all_schc) - - # Count the number of clusters per cell type for each slide - num_clust_by_slide = features.n_clusters_per_cell_type(all_slide_clusters_characterized) - - ###################################################################################### 
- # ---- Fraction of highly abundant cell types (individual cell type clustering) ---- # - ###################################################################################### - - # Spatially Hierarchical Constrained Clustering with all quantification of all cell types for each individual cell type - slide_indiv_clusters= Parallel(n_jobs=NUM_CORES)(delayed(clustering.schc_individual)(predictions, all_graphs[slide_submitter_id], slide_submitter_id) for subtype, slide_submitter_id in slides.to_numpy()) - all_slide_indiv_clusters = pd.concat(slide_indiv_clusters, axis=0) - - # Add metadata - all_slide_indiv_clusters = pd.merge(predictions, all_slide_indiv_clusters, on="tile_ID") - - # Add abundance label 'high' or 'low' based on cluster means - slide_indiv_clusters_labeled = clustering.label_cell_type_map_clusters(all_slide_indiv_clusters) - - # Count the fraction of 'high' clusters - frac_high = features.n_high_clusters(slide_indiv_clusters_labeled) - - ################################################################## - # ---- Compute proximity features (simultaneous clustering) ---- # - ################################################################## - - ## Computing proximity for clusters derived with all cell types simultaneously - clusters_all_schc_long = all_slide_clusters_characterized.melt(id_vars=["MFP", "slide_submitter_id", "cluster_label"], value_name="is_assigned", var_name="cell_type") - # remove all cell types that are not assigned to the cluster - clusters_all_schc_long = clusters_all_schc_long[clusters_all_schc_long["is_assigned"]] - clusters_all_schc_long = clusters_all_schc_long.drop(columns="is_assigned") - - results_schc_all= Parallel(n_jobs=NUM_CORES)(delayed(features.compute_proximity_clusters_pairs)(tiles_all_schc, slide_submitter_id, method="all") for _, slide_submitter_id in slides.to_numpy()) - prox_all_schc = pd.concat(results_schc_all) - - # Label clusters (a number) with the assigned cell types - prox_all_schc = pd.merge(prox_all_schc, clusters_all_schc_long, left_on=["slide_submitter_id", "cluster1"], right_on=["slide_submitter_id", "cluster_label"]) - prox_all_schc = prox_all_schc.rename(columns={"cell_type": "cluster1_label"}) - prox_all_schc = prox_all_schc.drop(columns=["cluster_label", "MFP"]) - - prox_all_schc = pd.merge(prox_all_schc, clusters_all_schc_long, left_on=["slide_submitter_id", "cluster2"], right_on=["slide_submitter_id", "cluster_label"]) - prox_all_schc = prox_all_schc.rename(columns={"cell_type": "cluster2_label"}) - # prox_all_schc = prox_all_schc.drop(columns=["cluster_label", "cluster1", "cluster2" ]) - - # Order doesn't matter: x <-> - prox_all_schc["pair"] = [f"{sorted([i, j])[0]}-{sorted([i, j])[1]}" for i, j in prox_all_schc[["cluster1_label", "cluster2_label"]].to_numpy()] - prox_all_schc = prox_all_schc[((prox_all_schc.cluster1 == prox_all_schc.cluster2) & (prox_all_schc.cluster2_label != prox_all_schc.cluster1_label)) | (prox_all_schc.cluster1 != prox_all_schc.cluster2)] - # prox_all_schc.to_csv(f"{output_dir}/{slide_type}_features_clust_all_schc_prox.txt", sep="\t") - - slides = prox_all_schc[["MFP", "slide_submitter_id"]].drop_duplicates().to_numpy() - - # Post Processing - results_schc_all= Parallel(n_jobs=NUM_CORES)(delayed(features.post_processing_proximity)(prox_df=prox_all_schc, slide_submitter_id=slide_submitter_id,subtype=subtype, method="all") for subtype, slide_submitter_id in slides) - all_prox_df = pd.concat(results_schc_all) - # Remove rows with a proximity of NaN - all_prox_df = all_prox_df.dropna(axis=0) - - 
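As a concrete illustration of the pairwise cluster proximity idea used in the block above, the hedged sketch below scores how close two tile clusters on a slide are as the fraction of cross-cluster tile pairs whose centres lie within a distance threshold. The actual computation is performed by `features.compute_proximity_clusters_pairs` (not shown in this diff); the `Coord_X`/`Coord_Y` column names, the threshold value, and the "fraction of close pairs" rule are illustrative assumptions, not the repository's implementation.

```python
# Hypothetical sketch: proximity between two tile clusters on one slide.
# The real logic lives in features.compute_proximity_clusters_pairs (not shown here);
# column names and the distance rule below are assumptions for illustration only.
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def cluster_pair_proximity(tiles: pd.DataFrame, cluster_a: int, cluster_b: int,
                           max_dist: float = 1024.0) -> float:
    """Fraction of cross-cluster tile pairs whose centres lie within `max_dist` pixels."""
    coords_a = tiles.loc[tiles["cluster_label"] == cluster_a, ["Coord_X", "Coord_Y"]].to_numpy()
    coords_b = tiles.loc[tiles["cluster_label"] == cluster_b, ["Coord_X", "Coord_Y"]].to_numpy()
    if len(coords_a) == 0 or len(coords_b) == 0:
        return np.nan
    dists = cdist(coords_a, coords_b)          # all pairwise centre-to-centre distances
    return float((dists <= max_dist).mean())   # share of pairs counted as "close"
```

However the score is defined, the post-processing shown above then averages it per slide and cluster-label pair and drops NaN rows before the features are pivoted to wide format.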
########################################################################## - # ---- Compute proximity features (individual cell type clustering) ---- # - ########################################################################## - - ## Computing proximity for clusters derived for each cell type individually - # Between clusters - slides = ( - predictions[["MFP", "slide_submitter_id"]].drop_duplicates().reset_index(drop=True)) - results_schc_indiv= Parallel(n_jobs=NUM_CORES)(delayed(features.compute_proximity_clusters_pairs)(all_slide_indiv_clusters, slide_submitter_id, method="individual_between") for _, slide_submitter_id in slides.to_numpy()) - prox_indiv_schc = pd.concat(results_schc_indiv) - - # Formatting - prox_indiv_schc = pd.merge(prox_indiv_schc,slide_indiv_clusters_labeled, left_on=["slide_submitter_id", "cluster1_label", "cluster1"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) - prox_indiv_schc = prox_indiv_schc.drop(columns=["cell_type_map", "MFP", "cluster_label"]) - prox_indiv_schc = prox_indiv_schc.rename(columns={"is_high": "cluster1_is_high"}) - prox_indiv_schc = pd.merge(prox_indiv_schc,slide_indiv_clusters_labeled, left_on=["slide_submitter_id", "cluster2_label", "cluster2"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) - prox_indiv_schc = prox_indiv_schc.rename(columns={"is_high": "cluster2_is_high"}) - prox_indiv_schc = prox_indiv_schc.drop(columns=["cell_type_map", "cluster_label"]) - - # Order matters - prox_indiv_schc["ordered_pair"] = [f"{i}-{j}" for i, j in prox_indiv_schc[["cluster1_label", "cluster2_label"]].to_numpy()] - prox_indiv_schc["comparison"] = [f"cluster1={i}-cluster2={j}" for i, j in prox_indiv_schc[["cluster1_is_high", "cluster2_is_high"]].to_numpy()] - - # Post-processing - slides = prox_indiv_schc[["MFP", "slide_submitter_id"]].drop_duplicates().to_numpy() - results_schc_indiv= pd.concat(Parallel(n_jobs=NUM_CORES)(delayed(features.post_processing_proximity)(prox_df=prox_indiv_schc, slide_submitter_id=slide_submitter_id,subtype=subtype, method="individual_between") for subtype, slide_submitter_id in slides)) - - # Within clusters - slides = ( - predictions[["MFP", "slide_submitter_id"]].drop_duplicates().reset_index(drop=True)) - results_schc_indiv_within= Parallel(n_jobs=NUM_CORES)(delayed(features.compute_proximity_clusters_pairs)(all_slide_indiv_clusters, slide_submitter_id, method="individual_within") for _, slide_submitter_id in slides.to_numpy()) - prox_indiv_schc_within = pd.concat(results_schc_indiv_within) - - prox_indiv_schc_within = pd.merge(prox_indiv_schc_within,slide_indiv_clusters_labeled, left_on=["slide_submitter_id", "cell_type", "cluster1"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) - prox_indiv_schc_within = prox_indiv_schc_within.drop(columns=["MFP", "cluster_label"]) - prox_indiv_schc_within = prox_indiv_schc_within.rename(columns={"is_high": "cluster1_is_high", "cell_type_map":"cell_type_map1"}) - prox_indiv_schc_within = pd.merge(prox_indiv_schc_within,slide_indiv_clusters_labeled, left_on=["slide_submitter_id", "cell_type", "cluster2"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) - prox_indiv_schc_within = prox_indiv_schc_within.rename(columns={"is_high": "cluster2_is_high", "cell_type_map": "cell_type_map2"}) - prox_indiv_schc_within = prox_indiv_schc_within.drop(columns=["cluster_label"]) - - # Order doesn't matter (only same cell type combinations) - prox_indiv_schc_within["pair"] = [f"{i}-{j}" for i, j in 
prox_indiv_schc_within[["cell_type_map1", "cell_type_map2"]].to_numpy()] - prox_indiv_schc_within["comparison"] = [f"cluster1={sorted([i,j])[0]}-cluster2={sorted([i,j])[1]}" for i, j in prox_indiv_schc_within[["cluster1_is_high", "cluster2_is_high"]].to_numpy()] - - # prox_indiv_schc_within.to_csv(f"{output_dir}/{slide_type}_features_clust_indiv_schc_prox_within.txt", sep="\t") - slides = prox_indiv_schc_within[["slide_submitter_id", "MFP"]].drop_duplicates().to_numpy() - results_schc_indiv_within= pd.concat(Parallel(n_jobs=NUM_CORES)(delayed(features.post_processing_proximity)(prox_df=prox_indiv_schc_within, slide_submitter_id=slide_submitter_id,subtype=subtype, method="individual_within") for slide_submitter_id, subtype in slides)) - - # Concatenate within and between computed proximity values - prox_indiv_schc_combined = pd.concat([results_schc_indiv_within, results_schc_indiv]) - - # Remove rows with a proximity of NaN - prox_indiv_schc_combined = prox_indiv_schc_combined.dropna(axis=0) - - #################################### - # ---- Compute shape features ---- # - #################################### - - # Compute shape features based on clustering with all cell types simultaneously - slides = ( - predictions[["MFP", "slide_submitter_id"]].drop_duplicates().reset_index(drop=True)) - - all_slide_clusters_characterized = all_slide_clusters_characterized.rename(columns=dict(zip(cell_types, [f"is_{cell_type}_cluster" for cell_type in cell_types]))) - tiles_all_schc = pd.merge(tiles_all_schc, all_slide_clusters_characterized, on=["slide_submitter_id", "MFP", "cluster_label"]) - res = pd.concat(Parallel(n_jobs=NUM_CORES)(delayed(features.compute_shape_features)(tiles=tiles_all_schc, slide_submitter_id=slide_submitter_id,subtype=subtype) for subtype, slide_submitter_id in slides.to_numpy())) - res = res.drop(axis=1, labels=["cluster_label"]) - shape_feature_means = res.groupby(["slide_submitter_id", "cell_type"]).mean().reset_index() - - ############################################## - # ---- Formatting all computed features ---- # - ############################################## - - frac_high_sub = frac_high[frac_high["is_high"]].copy() - frac_high_sub = frac_high_sub.drop(columns=["is_high", "n_clusters", "n_total_clusters"]) - - frac_high_wide = frac_high_sub.pivot(index=["MFP", "slide_submitter_id"], columns=["cell_type_map"])["fraction"] - new_cols=[('fraction {0} clusters labeled high'.format(col)) for col in frac_high_wide.columns] - frac_high_wide.columns = new_cols - frac_high_wide = frac_high_wide.sort_index(axis="columns").reset_index() - - num_clust_by_slide_sub = num_clust_by_slide.copy() - num_clust_by_slide_sub = num_clust_by_slide_sub.drop(columns=["is_assigned", "n_clusters"]) - - num_clust_slide_wide = num_clust_by_slide_sub.pivot(index=["MFP", "slide_submitter_id"], columns=["cell_type"])["fraction"] - new_cols=[('fraction {0} clusters'.format(col)) for col in num_clust_slide_wide.columns] - num_clust_slide_wide.columns = new_cols - num_clust_slide_wide = num_clust_slide_wide.sort_index(axis="columns").reset_index() - - all_prox_df_wide = all_prox_df.pivot(index=["MFP", "slide_submitter_id"], columns=["pair"])["proximity"] - new_cols = [f'prox CC {col.replace("_", " ")} clusters' for col in all_prox_df_wide.columns] - all_prox_df_wide.columns = new_cols - all_prox_df_wide = all_prox_df_wide.reset_index() - - prox_indiv_schc_combined.comparison = prox_indiv_schc_combined.comparison.replace(dict(zip(['cluster1=True-cluster2=True', 'cluster1=True-cluster2=False', - 
'cluster1=False-cluster2=True', 'cluster1=False-cluster2=False'], ["high-high", "high-low", "low-high", "low-low"]))) - prox_indiv_schc_combined["pair (comparison)"] = [f"{pair} ({comp})" for pair, comp in prox_indiv_schc_combined[["pair", "comparison"]].to_numpy()] - prox_indiv_schc_combined = prox_indiv_schc_combined.drop(axis=1, labels=["pair", "comparison"]) - prox_indiv_schc_combined_wide = prox_indiv_schc_combined.pivot(index=["MFP", "slide_submitter_id"], columns=["pair (comparison)"])["proximity"] - new_cols = [f'prox CC {col.replace("_", " ")}' for col in prox_indiv_schc_combined_wide.columns] - prox_indiv_schc_combined_wide.columns = new_cols - prox_indiv_schc_combined_wide = prox_indiv_schc_combined_wide.reset_index() - - shape_feature_means_wide = shape_feature_means.pivot(index=["slide_submitter_id"], columns="cell_type")[["solidity", "roundness"]] - new_cols = [f'prox CC {col.replace("_", " ")}' for col in prox_indiv_schc_combined_wide.columns] - shape_feature_means_wide.columns = [f"{i.capitalize()} {j}" for i, j in shape_feature_means_wide.columns] - shape_feature_means_wide = shape_feature_means_wide.reset_index() - - # Store features - all_features = pd.merge(frac_high_wide, num_clust_slide_wide, on=["MFP", "slide_submitter_id"]) - all_features = pd.merge(all_features, all_prox_df_wide) - all_features = pd.merge(all_features, prox_indiv_schc_combined_wide) - all_features = pd.merge(all_features, shape_feature_means_wide) - - tiles_all_schc = tiles_all_schc.drop(axis=1, columns=cell_types) # drop the predicted probabilities - all_slide_indiv_clusters = all_slide_indiv_clusters.drop(axis=1, columns=cell_types)# drop the predicted probabilities - - ################################ - # ---- Store all features ---- # - ################################ - - # tiles_all_schc (DataFrame): dataframe containing the metadata columns and the cluster_label (int) - # all_slide_clusters_characterized (DataFrame): dataframe containing the slide_submitter_id, and the the columns for the cell types filled with booleans (True if the cluster is assigned with that cell type) - # all_slide_indiv_clusters (DataFrame): dataframe containing the metadata columns and columns with to which cell type cluster the tile belongs to - # slide_indiv_clusters_labeled (DataFrame): dataframe containing the slide_submitter_id, cell_type_map, cluster_label (int), and is_high (abundance) - # all_prox_df (DataFrame): dataframe containing slide_submitter_id, pair, proximity - # prox_indiv_schc_combined (DataFrame): dataframe containing slide_submitter_id, comparison (high/low abundance label), pair (cell type pair) and proximity - # shape_features_mean (DataFrame): dataframe containing slide_submitter_id, cell_type, slide_submitter_id, solidity, roundness - frac_high.to_csv(f"{output_dir}/{slide_type}_num_clusters_per_cell_type_indiv_clustering.csv", sep="\t", index=False) - num_clust_by_slide.to_csv(f"{output_dir}/{slide_type}_num_clusters_per_cell_type_all_clustering.csv", sep="\t", index=False) - tiles_all_schc.to_csv(f"{output_dir}/{slide_type}_all_schc_tiles.csv", sep="\t", index=False) - all_slide_clusters_characterized.to_csv(f"{output_dir}/{slide_type}_all_schc_clusters_labeled.csv", sep="\t", index=False) - all_slide_indiv_clusters.to_csv(f"{output_dir}/{slide_type}_indiv_schc_tiles.csv", sep="\t", index=False) - slide_indiv_clusters_labeled.to_csv(f"{output_dir}/{slide_type}_indiv_schc_clusters_labeled.csv", sep="\t", index=False) - 
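To make the repeated long-to-wide reshaping above easier to follow, here is a small self-contained toy example of the same pandas `pivot` plus column-prefix pattern; the slide IDs and values are invented purely for demonstration.

```python
# Toy example of the long-to-wide reshaping used for the clustering features:
# one row per slide/cell-type pair becomes one row per slide with prefixed columns.
import pandas as pd

long_df = pd.DataFrame({
    "slide_submitter_id": ["TCGA-XX-0001-01A"] * 2 + ["TCGA-XX-0002-01A"] * 2,
    "cell_type": ["CAFs", "T_cells"] * 2,
    "fraction": [0.25, 0.50, 0.40, 0.10],
})

wide = long_df.pivot(index="slide_submitter_id", columns="cell_type")["fraction"]
wide.columns = [f"fraction {col} clusters" for col in wide.columns]
wide = wide.sort_index(axis="columns").reset_index()
print(wide)
# Approximate output:
#   slide_submitter_id  fraction CAFs clusters  fraction T_cells clusters
# 0   TCGA-XX-0001-01A                    0.25                       0.50
# 1   TCGA-XX-0002-01A                    0.40                       0.10
```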
all_prox_df.to_csv(f"{output_dir}/{slide_type}_features_clust_all_schc_prox.csv", sep="\t", index=False) - prox_indiv_schc_combined.to_csv(f"{output_dir}/{slide_type}_features_clust_indiv_schc_prox.csv", sep="\t", index=False) - shape_feature_means.to_csv(f"{output_dir}/{slide_type}_features_clust_shapes.csv", sep="\t", index=False) - all_features.to_csv(f"{output_dir}/{slide_type}_clustering_features.csv", sep="\t", index=False) - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="Compute clustering features") - parser.add_argument("--tile_quantification_path", help="Path to csv file with tile-level quantification (predictions)") - parser.add_argument("--output_dir", help="Path to output folder to store generated files") - parser.add_argument("--slide_type", help="Type of slides 'FFPE' or 'FF' used for naming generated files (by default='FF')", default="FF") - parser.add_argument("--cell_types", help="List of cell types", default=None) # TODO: adapt to external file with the cell types, easier for parsing - parser.add_argument("--graphs_path", help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) - - args=parser.parse_args() - - compute_clustering_features( - tile_quantification_path=args.tile_quantification_path, - output_dir=args.output_dir, - slide_type=args.slide_type, - cell_types=args.cell_types, - graphs_path=args.graphs_path) diff --git a/Python/3_spatial_characterization/compute_network_features.py b/Python/3_spatial_characterization/compute_network_features.py deleted file mode 100644 index 9888206..0000000 --- a/Python/3_spatial_characterization/compute_network_features.py +++ /dev/null @@ -1,175 +0,0 @@ -import multiprocessing -import sys -import argparse -import joblib -from joblib import Parallel, delayed -import pandas as pd - - -import features.features as features # trunk-ignore(flake8/E402) -import features.graphs as graphs # trunk-ignore(flake8/E402) -import features.utils as utils # trunk-ignore(flake8/E402) -def compute_network_features(tile_quantification_path, output_dir, slide_type="FF", cell_types=None, graphs_path=None): - NUM_CORES = multiprocessing.cpu_count() - - if cell_types is None: - cell_types = ["CAFs", "T_cells", "endothelial_cells", "tumor_purity"] - - predictions = pd.read_csv(tile_quantification_path, sep="\t", index_col=0) - - ##################################### - # ---- Constructing the graphs ---- # - ##################################### - - if graphs_path is None: - results = Parallel(n_jobs=NUM_CORES)( - delayed(graphs.construct_graph)(predictions=predictions, slide_submitter_id=slide_submitter_id) - for _, slide_submitter_id in slides.to_numpy() - ) - # Extract/format graphs - all_graphs = { - list(slide_graph.keys())[0]: list(slide_graph.values())[0] - for slide_graph in results - } - joblib.dump( - all_graphs, f"{output_dir}/{slide_type}_graphs.pkl") - else: - all_graphs = joblib.load(graphs_path) - - ####################################################### - # ---- Compute connectedness and co-localization ---- # - ####################################################### - - all_largest_cc_sizes = [] - all_dual_nodes_frac = [] - for _, slide_submitter_id in slides.to_numpy(): - slide_data = utils.get_slide_data(predictions, slide_submitter_id) - node_cell_types = utils.assign_cell_types(slide_data) - lcc = features.determine_lcc( - graph=all_graphs[slide_submitter_id], cell_type_assignments=node_cell_types - ) - 
lcc["slide_submitter_id"] = slide_submitter_id - all_largest_cc_sizes.append(lcc) - - dual_nodes_frac = features.compute_dual_node_fractions(node_cell_types) - dual_nodes_frac["slide_submitter_id"] = slide_submitter_id - all_dual_nodes_frac.append(dual_nodes_frac) - - all_largest_cc_sizes = pd.concat(all_largest_cc_sizes, axis=0) - all_dual_nodes_frac = pd.concat(all_dual_nodes_frac, axis=0) - - ####################################################### - # ---- Compute N shortest paths with max. length ---- # - ####################################################### - - results = Parallel(n_jobs=NUM_CORES)( - delayed(features.compute_n_shortest_paths_max_length)( - predictions=predictions, slide_submitter_id=slide_submitter_id, graph=all_graphs[slide_submitter_id] - ) - for _, slide_submitter_id in slides.to_numpy() - ) - # Formatting and count the number of shortest paths with max length - all_shortest_paths_thresholded = pd.concat(results, axis=0) - all_shortest_paths_thresholded["n_paths"] = 1 - proximity_graphs = ( - all_shortest_paths_thresholded.groupby( - ["slide_submitter_id", "source", "target"] - ) - .sum() - .reset_index() - ) - # Post-processing - proximity_graphs["pair"] = [f"{source}-{target}" for source, target in proximity_graphs[["source", "target"]].to_numpy()] - proximity_graphs = proximity_graphs.drop(columns=["path_length"]) - - ############################################### - # ---- Compute ES based on ND difference ---- # - ############################################### - - nd_results= Parallel(n_jobs=NUM_CORES)(delayed(features.node_degree_wrapper)(all_graphs[slide_submitter_id], predictions, slide_submitter_id) for _, slide_submitter_id in slides.to_numpy()) - - # Format results - all_sims_nd = [] - all_mean_nd_df = [] - example_simulations = {} - for sim_assignments, sim, mean_nd_df in nd_results: - all_mean_nd_df.append(mean_nd_df) - all_sims_nd.append(sim) - example_simulations.update(sim_assignments) - - all_sims_nd = pd.concat(all_sims_nd, axis=0).reset_index() - all_mean_nd_df =pd.concat(all_mean_nd_df).reset_index(drop=True) - - # Testing normality - # shapiro_tests = Parallel(n_jobs=NUM_CORES)(delayed(utils.test_normality)(sims_nd=all_sims_nd, slide_submitter_id=slide_submitter_id) for slide_submitter_id in all_sims_nd.slide_submitter_id.unique()) - # all_shapiro_tests = pd.concat(shapiro_tests, axis=0) - # print(f"Number of samples from normal distribution { len(all_shapiro_tests) - all_shapiro_tests.is_not_normal.sum()}/{len(all_shapiro_tests)}") - - # Computing Cohen's d effect size and perform t-test - effect_sizes = Parallel(n_jobs=NUM_CORES)(delayed(features.compute_effect_size)(all_mean_nd_df, all_sims_nd, slide_submitter_id) for slide_submitter_id in all_sims_nd.slide_submitter_id.unique()) - all_effect_sizes = pd.concat(effect_sizes, axis=0) - all_effect_sizes["pair"] = [f"{c}-{n}" for c, n in all_effect_sizes[["center", "neighbor"]].to_numpy()] - - ######################## - # ---- Formatting ---- # - ######################## - all_largest_cc_sizes = all_largest_cc_sizes.reset_index(drop=True) - all_largest_cc_sizes_wide = all_largest_cc_sizes.pivot(index=["slide_submitter_id"], columns="cell_type")["type_spec_frac"] - new_cols = [f'LCC {col.replace("_", " ")} clusters' for col in all_largest_cc_sizes_wide.columns] - all_largest_cc_sizes_wide.columns = new_cols - all_largest_cc_sizes_wide = all_largest_cc_sizes_wide.reset_index() - - shortest_paths_wide = proximity_graphs.pivot(index=["slide_submitter_id"], columns="pair")["n_paths"] - new_cols = 
[f'Prox graph {col.replace("_", " ")} clusters' for col in shortest_paths_wide.columns] - shortest_paths_wide.columns = new_cols - shortest_paths_wide = shortest_paths_wide.reset_index() - - colocalization_wide = all_dual_nodes_frac.pivot(index=["slide_submitter_id"], columns="pair")["frac"] - new_cols = [f'Co-loc {col.replace("_", " ")} clusters' for col in colocalization_wide.columns] - colocalization_wide.columns = new_cols - colocalization_wide = colocalization_wide.reset_index() - - all_features = pd.merge(all_largest_cc_sizes_wide, shortest_paths_wide) - all_features = pd.merge(all_features, colocalization_wide) - - ################################ - # ---- Store all features ---- # - ################################ - - # all_effect_sizes (DataFrame): dataframe containing the slide_submitter_id, center, neighbor, effect_size (Cohen's d), Tstat, pval, and the pair (string of center and neighbor) - # all_sims_nd (DataFrame): dataframe containing slide_submitter_id, center, neighbor, simulation_nr and degree (node degree) - # all_mean_nd_df (DataFrame): dataframe containing slide_submitter_id, center, neighbor, mean_sim (mean node degree across the N simulations), mean_obs - # all_largest_cc_sizes (DataFrame): dataframe containing slide_submitter_id, cell type and type_spec_frac (fraction of LCC w.r.t. all tiles for cell type) - # shortest_paths_slide (DataFrame): dataframe containing slide_submitter_id, source, target, pair and n_paths (number of shortest paths for a pair) - # all_dual_nodes_frac (DataFrame): dataframe containing slide_submitter_id, pair, counts (absolute) and frac - - all_effect_sizes.to_csv( - f"{output_dir}/{slide_type}_features_ND_ES.csv", sep="\t", index=False) - all_sims_nd.to_csv( - f"{output_dir}/{slide_type}_features_ND_sims.csv", sep="\t", index=False) - all_mean_nd_df.to_csv( - f"{output_dir}/{slide_type}_features_ND.csv", sep="\t", index=False) - joblib.dump(example_simulations, - f"{output_dir}/{slide_type}_features_ND_sim_assignments.pkl") - - all_largest_cc_sizes_wide.to_csv(f"{output_dir}/{slide_type}_features_lcc_fraction.csv", sep="\t", index=False) - proximity_graphs.to_csv(f"{output_dir}/{slide_type}_features_shortest_paths_thresholded.csv", sep="\t", index=False) - all_dual_nodes_frac.to_csv(f"{output_dir}/{slide_type}_features_coloc_fraction.csv", sep="\t", index=False) - - all_features.to_csv(f"{output_dir}/{slide_type}_all_graph_features.csv", sep="\t", index=False) - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="Compute network features") - parser.add_argument("--tile_quantification_path", help="Path to csv file with tile-level quantification (predictions)") - parser.add_argument("--output_dir", help="Path to output folder to store generated files") - parser.add_argument("--slide_type", help="Type of slides 'FFPE' or 'FF' used for naming generated files (by default='FF')", default="FF") - parser.add_argument("--cell_types", help="List of cell types", default=None) - parser.add_argument("--graphs_path", help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) - - args=parser.parse_args() - compute_network_features( - tile_quantification_path=args.tile_quantification_path, - output_dir=args.output_dir, - slide_type=args.slide_type, - cell_types=args.cell_types, - graphs_path=args.graphs_path) diff --git a/Python/3_spatial_characterization/computing_features.py b/Python/3_spatial_characterization/computing_features.py deleted file mode 
100755 index 2318296..0000000 --- a/Python/3_spatial_characterization/computing_features.py +++ /dev/null @@ -1,681 +0,0 @@ -import os -import sys -import joblib -import pandas as pd -from joblib import Parallel, delayed -import argparse -from os import path - -# Own modules -import features.clustering as clustering -import features.features as features -import features.graphs as graphs -import features.utils as utils -from model.constants import DEFAULT_SLIDE_TYPE, DEFAULT_CELL_TYPES, NUM_CORES, METADATA_COLS - - -def compute_network_features(tile_quantification_path, output_dir, slide_type=DEFAULT_SLIDE_TYPE, cell_types=None, graphs_path=None, - abundance_threshold=0.5, shapiro_alpha=0.05, cutoff_path_length=2): - """ - Compute network features - 1. effect sizes based on difference in node degree between simulated slides and actual slide - 2. fraction largest connected component - 3. number of shortest paths with a max length. - - Args: - tile_quantification_path (str) - output_dir (str) - slide_type (str): type of slide either 'FF' or 'FFPE' - cell_types (list): list of cell types - graphs_path (str): path to pkl file with generated graphs [optional] - abundance_threshold (float): threshold for assigning cell types to tiles based on the predicted probability (default=0.5) - shapiro_alpha (float): significance level for shapiro tests for normality (default=0.05) - cutoff_path_length (int): max. length of shortest paths (default=2) - - Returns: - all_effect_sizes (DataFrame): dataframe containing the slide_submitter_id, center, neighbor, effect_size (Cohen's d), Tstat, pval, and the pair (string of center and neighbor) - all_sims_nd (DataFrame): dataframe containing slide_submitter_id, center, neighbor, simulation_nr and degree (node degree) - all_mean_nd_df (DataFrame): dataframe containing slide_submitter_id, center, neighbor, mean_sim (mean node degree across the N simulations), mean_obs - all_largest_cc_sizes (DataFrame): dataframe containing slide_submitter_id, cell type and type_spec_frac (fraction of LCC w.r.t. 
all tiles for cell type) - shortest_paths_slide (DataFrame): dataframe containing slide_submitter_id, source, target, pair and n_paths (number of shortest paths for a pair) - all_dual_nodes_frac (DataFrame): dataframe containing slide_submitter_id, pair, counts (absolute) and frac - - """ - if cell_types is None: - cell_types = DEFAULT_CELL_TYPES - - predictions = pd.read_csv(tile_quantification_path, sep="\t") - slide_submitter_ids = list(set(predictions.slide_submitter_id)) - - ##################################### - # ---- Constructing the graphs ---- # - ##################################### - - if graphs_path is None: - results = Parallel(n_jobs=NUM_CORES)( - delayed(graphs.construct_graph)( - predictions=predictions, slide_submitter_id=id) - for id in slide_submitter_ids - ) - # Extract/format graphs - all_graphs = { - list(slide_graph.keys())[0]: list(slide_graph.values())[0] - for slide_graph in results - } - joblib.dump(all_graphs, f"{output_dir}/{slide_type}_graphs.pkl") - else: - all_graphs = joblib.load(graphs_path) - - ####################################################### - # ---- Compute connectedness and co-localization ---- # - ####################################################### - - all_largest_cc_sizes = [] - all_dual_nodes_frac = [] - for id in slide_submitter_ids: - slide_data = utils.get_slide_data(predictions, id) - node_cell_types = utils.assign_cell_types( - slide_data=slide_data, cell_types=cell_types, threshold=abundance_threshold) - lcc = features.determine_lcc( - graph=all_graphs[id], cell_type_assignments=node_cell_types, cell_types=cell_types - ) - lcc["slide_submitter_id"] = id - all_largest_cc_sizes.append(lcc) - - dual_nodes_frac = features.compute_dual_node_fractions( - node_cell_types, cell_types) - dual_nodes_frac["slide_submitter_id"] = id - all_dual_nodes_frac.append(dual_nodes_frac) - - all_largest_cc_sizes = pd.concat(all_largest_cc_sizes, axis=0) - all_dual_nodes_frac = pd.concat(all_dual_nodes_frac, axis=0) - - ####################################################### - # ---- Compute N shortest paths with max. 
length ---- # - ####################################################### - - results = Parallel(n_jobs=NUM_CORES)( - delayed(features.compute_n_shortest_paths_max_length)( - predictions=predictions, slide_submitter_id=id, graph=all_graphs[ - id], cutoff=cutoff_path_length - ) - for id in slide_submitter_ids - ) - # Formatting and count the number of shortest paths with max length - all_shortest_paths_thresholded = pd.concat(results, axis=0) - all_shortest_paths_thresholded["n_paths"] = 1 - proximity_graphs = ( - all_shortest_paths_thresholded.groupby( - ["slide_submitter_id", "source", "target"] - ) - .sum(numeric_only=True) - .reset_index() - ) - # Post-processing - proximity_graphs["pair"] = [f"{source}-{target}" for source, - target in proximity_graphs[["source", "target"]].to_numpy()] - proximity_graphs = proximity_graphs.drop(columns=["path_length"]) - - # ---- Formatting ---- # - all_largest_cc_sizes = all_largest_cc_sizes.reset_index(drop=True) - all_largest_cc_sizes_wide = all_largest_cc_sizes.pivot( - index=["slide_submitter_id"], columns="cell_type")["type_spec_frac"] - new_cols = [ - f'LCC {col.replace("_", " ")} clusters' for col in all_largest_cc_sizes_wide.columns] - all_largest_cc_sizes_wide.columns = new_cols - all_largest_cc_sizes_wide = all_largest_cc_sizes_wide.reset_index() - - shortest_paths_wide = proximity_graphs.pivot( - index=["slide_submitter_id"], columns="pair")["n_paths"] - new_cols = [ - f'Prox graph {col.replace("_", " ")} clusters' for col in shortest_paths_wide.columns] - shortest_paths_wide.columns = new_cols - shortest_paths_wide = shortest_paths_wide.reset_index() - - colocalization_wide = all_dual_nodes_frac.pivot( - index=["slide_submitter_id"], columns="pair")["frac"] - new_cols = [ - f'Co-loc {col.replace("_", " ")} clusters' for col in colocalization_wide.columns] - colocalization_wide.columns = new_cols - colocalization_wide = colocalization_wide.reset_index() - - all_features = pd.merge(all_largest_cc_sizes_wide, shortest_paths_wide) - all_features = pd.merge(all_features, colocalization_wide) - - # ---- Save to file ---- # - all_largest_cc_sizes_wide.to_csv( - f"{output_dir}/{slide_type}_features_lcc_fraction.csv", sep="\t", index=False) - proximity_graphs.to_csv( - f"{output_dir}/{slide_type}_features_shortest_paths_thresholded.csv", sep="\t", index=False) - all_dual_nodes_frac.to_csv( - f"{output_dir}/{slide_type}_features_coloc_fraction.csv", sep="\t", index=False) - all_features.to_csv( - f"{output_dir}/{slide_type}_all_graph_features.csv", sep="\t", index=False) - - ############################################### - # ---- Compute ES based on ND difference ---- # - ############################################### - # Remove one slide for which node degree could not be resolved (no node with 8 neighbours) - # problematic_slide = 'TCGA-D3-A2JE-06A-01-TS1' # just 63 tiles - # filtered_slides = list(filter(lambda id: id != problematic_slide, slide_submitter_ids)) - nd_results = Parallel(n_jobs=NUM_CORES)(delayed(features.node_degree_wrapper)( - all_graphs[id], predictions, id) for id in slide_submitter_ids) - nd_results = list(filter(lambda id: id != None, nd_results)) - - # Format results - all_sims_nd = [] - all_mean_nd_df = [] - example_simulations = {} - - for sim_assignments, sim, mean_nd_df in nd_results: - all_mean_nd_df.append(mean_nd_df) - all_sims_nd.append(sim) - example_simulations.update(sim_assignments) - - all_sims_nd = pd.concat(all_sims_nd, axis=0).reset_index() - all_mean_nd_df = pd.concat(all_mean_nd_df).reset_index(drop=True) - 
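For readers unfamiliar with the "number of shortest paths below a maximum length" feature computed above, the following is a minimal sketch of the idea under the assumption that each slide graph is a `networkx.Graph` whose nodes are tile IDs; the real logic lives in `features.compute_n_shortest_paths_max_length` and may differ in detail (for example in how tiles are assigned to cell types).

```python
# Hedged sketch: count source/target tile pairs connected within `cutoff` hops.
# Assumes networkx graphs keyed by tile ID; the repository's own implementation
# (features.compute_n_shortest_paths_max_length) is not shown in this diff.
import networkx as nx

def count_short_paths(graph: nx.Graph, assignments: dict, source_type: str,
                      target_type: str, cutoff: int = 2) -> int:
    """`assignments` maps tile_ID -> set of cell types assigned to that tile."""
    sources = [n for n, types in assignments.items() if source_type in types]
    targets = {n for n, types in assignments.items() if target_type in types}
    n_paths = 0
    for src in sources:
        # nodes reachable from src within `cutoff` hops (returned with their distances)
        reachable = nx.single_source_shortest_path_length(graph, src, cutoff=cutoff)
        n_paths += sum(1 for node in reachable if node != src and node in targets)
    return n_paths
```

Counting reachable target tiles within the cutoff (default 2 in the function above) yields one integer per ordered cell-type pair per slide, matching the `pair` construction and the subsequent pivot into the wide `Prox graph ...` columns.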
- # Testing normality - shapiro_tests = Parallel(n_jobs=NUM_CORES)(delayed(utils.test_normality)(sims_nd=all_sims_nd, slide_submitter_id=id, - alpha=shapiro_alpha, cell_types=cell_types) for id in all_sims_nd.slide_submitter_id.unique()) - all_shapiro_tests = pd.concat(shapiro_tests, axis=0) - - # Computing Cohen's d effect size and perform t-test - effect_sizes = Parallel(n_jobs=NUM_CORES)(delayed(features.compute_effect_size)( - all_mean_nd_df, all_sims_nd, slide_submitter_id) for slide_submitter_id in all_sims_nd.slide_submitter_id.unique()) - all_effect_sizes = pd.concat(effect_sizes, axis=0) - all_effect_sizes["pair"] = [ - f"{c}-{n}" for c, n in all_effect_sizes[["center", "neighbor"]].to_numpy()] - - # ---- Save to file ---- # - all_effect_sizes.to_csv( - f"{output_dir}/{slide_type}_features_ND_ES.csv", sep="\t", index=False) - all_sims_nd.to_csv( - f"{output_dir}/{slide_type}_features_ND_sims.csv", sep="\t", index=False) - all_mean_nd_df.to_csv( - f"{output_dir}/{slide_type}_features_ND.csv", sep="\t", index=False) - joblib.dump(example_simulations, - f"{output_dir}/{slide_type}_features_ND_sim_assignments.pkl") - all_shapiro_tests.to_csv( - f"{output_dir}/{slide_type}_shapiro_tests.csv", index=False, sep="\t") - - -def compute_clustering_features( - tile_quantification_path, output_dir, slide_type=DEFAULT_SLIDE_TYPE, cell_types=None, graphs_path=None, n_clusters=8, max_dist=None, max_n_tiles_threshold=2, tile_size=512, overlap=50): - - if cell_types is None: - cell_types = DEFAULT_CELL_TYPES - - predictions = pd.read_csv(tile_quantification_path, sep="\t") - slide_submitter_ids = list(set(predictions.slide_submitter_id)) - - ##################################### - # ---- Constructing the graphs ---- # - ##################################### - - if graphs_path is None: - results = Parallel(n_jobs=NUM_CORES)( - delayed(graphs.construct_graph)( - predictions=predictions, slide_submitter_id=id) - for id in slide_submitter_ids - ) - # Extract/format graphs - all_graphs = { - list(slide_graph.keys())[0]: list(slide_graph.values())[0] - for slide_graph in results - } - joblib.dump( - all_graphs, f"{output_dir}/{slide_type}_graphs.pkl") - else: - all_graphs = joblib.load(graphs_path) - - ###################################################################### - # ---- Fraction of cell type clusters (simultaneous clustering) ---- # - ###################################################################### - - # Spatially Hierarchical Constrained Clustering with all quantification of all cell types - slide_clusters = Parallel(n_jobs=NUM_CORES)(delayed(clustering.schc_all)( - predictions, all_graphs[id], id) for id in slide_submitter_ids) - # Combine the tiles labeled with their cluster id for all slides - tiles_all_schc = pd.concat(slide_clusters, axis=0) - - # Assign a cell type label based on the mean of all cluster means across all slides - all_slide_clusters_characterized = clustering.characterize_clusters( - tiles_all_schc) - - # Count the number of clusters per cell type for each slide - num_clust_by_slide = features.n_clusters_per_cell_type( - all_slide_clusters_characterized, cell_types=cell_types) - - ###################################################################################### - # ---- Fraction of highly abundant cell types (individual cell type clustering) ---- # - ###################################################################################### - - # Spatially Hierarchical Constrained Clustering with all quantification of all cell types for each individual cell type - 
slide_indiv_clusters = Parallel(n_jobs=NUM_CORES)(delayed(clustering.schc_individual)( - predictions, all_graphs[id], id) for id in slide_submitter_ids) - all_slide_indiv_clusters = pd.concat(slide_indiv_clusters, axis=0) - - # Add metadata - all_slide_indiv_clusters = pd.merge( - predictions, all_slide_indiv_clusters, on="tile_ID") - - # Add abundance label 'high' or 'low' based on cluster means - slide_indiv_clusters_labeled = clustering.label_cell_type_map_clusters( - all_slide_indiv_clusters) - - # Count the fraction of 'high' clusters - frac_high = features.n_high_clusters(slide_indiv_clusters_labeled) - - ################################################################## - # ---- Compute proximity features (simultaneous clustering) ---- # - ################################################################## - - # Computing proximity for clusters derived with all cell types simultaneously - clusters_all_schc_long = all_slide_clusters_characterized.melt( - id_vars=["slide_submitter_id", "cluster_label"], value_name="is_assigned", var_name="cell_type") - # remove all cell types that are not assigned to the cluster - clusters_all_schc_long = clusters_all_schc_long[clusters_all_schc_long["is_assigned"]] - clusters_all_schc_long = clusters_all_schc_long.drop(columns="is_assigned") - - results_schc_all = Parallel(n_jobs=NUM_CORES)(delayed(features.compute_proximity_clusters_pairs)( - tiles=tiles_all_schc, slide_submitter_id=id, n_clusters=n_clusters, cell_types=cell_types, max_dist=max_dist, max_n_tiles_threshold=max_n_tiles_threshold, tile_size=tile_size, overlap=overlap, method="all") for id in slide_submitter_ids) - prox_all_schc = pd.concat(results_schc_all) - - # Label clusters (a number) with the assigned cell types - prox_all_schc = pd.merge(prox_all_schc, clusters_all_schc_long, left_on=[ - "slide_submitter_id", "cluster1"], right_on=["slide_submitter_id", "cluster_label"]) - prox_all_schc = prox_all_schc.rename( - columns={"cell_type": "cluster1_label"}) - prox_all_schc = prox_all_schc.drop(columns=["cluster_label"]) - - prox_all_schc = pd.merge(prox_all_schc, clusters_all_schc_long, left_on=[ - "slide_submitter_id", "cluster2"], right_on=["slide_submitter_id", "cluster_label"]) - prox_all_schc = prox_all_schc.rename( - columns={"cell_type": "cluster2_label"}) - - # Order doesn't matter: x <-> - prox_all_schc["pair"] = [f"{sorted([i, j])[0]}-{sorted([i, j])[1]}" for i, - j in prox_all_schc[["cluster1_label", "cluster2_label"]].to_numpy()] - prox_all_schc = prox_all_schc[((prox_all_schc.cluster1 == prox_all_schc.cluster2) & ( - prox_all_schc.cluster2_label != prox_all_schc.cluster1_label)) | (prox_all_schc.cluster1 != prox_all_schc.cluster2)] - - # slides = prox_all_schc[["MFP", "slide_submitter_id"]].drop_duplicates().to_numpy() - slide_submitter_ids = list(set(prox_all_schc.slide_submitter_id)) - - # Post Processing - results_schc_all = Parallel(n_jobs=NUM_CORES)(delayed(features.post_processing_proximity)( - prox_df=prox_all_schc, slide_submitter_id=id, method="all") for id in slide_submitter_ids) - all_prox_df = pd.concat(results_schc_all) - # Remove rows with a proximity of NaN - all_prox_df = all_prox_df.dropna(axis=0) - - ########################################################################## - # ---- Compute proximity features (individual cell type clustering) ---- # - ########################################################################## - - # Computing proximity for clusters derived for each cell type individually - # Between clusters - slide_submitter_ids = 
list(set(predictions.slide_submitter_id)) - results_schc_indiv = Parallel(n_jobs=NUM_CORES)(delayed(features.compute_proximity_clusters_pairs)(all_slide_indiv_clusters, slide_submitter_id=id, method="individual_between", - n_clusters=n_clusters, cell_types=cell_types, max_dist=max_dist, max_n_tiles_threshold=max_n_tiles_threshold, tile_size=tile_size, overlap=overlap) for id in slide_submitter_ids) - prox_indiv_schc = pd.concat(results_schc_indiv) - - # Formatting - prox_indiv_schc = pd.merge(prox_indiv_schc, slide_indiv_clusters_labeled, left_on=[ - "slide_submitter_id", "cluster1_label", "cluster1"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) - prox_indiv_schc = prox_indiv_schc.drop( - columns=["cell_type_map", "cluster_label"]) - prox_indiv_schc = prox_indiv_schc.rename( - columns={"is_high": "cluster1_is_high"}) - prox_indiv_schc = pd.merge(prox_indiv_schc, slide_indiv_clusters_labeled, left_on=[ - "slide_submitter_id", "cluster2_label", "cluster2"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) - prox_indiv_schc = prox_indiv_schc.rename( - columns={"is_high": "cluster2_is_high"}) - prox_indiv_schc = prox_indiv_schc.drop( - columns=["cell_type_map", "cluster_label"]) - - # Order matters - prox_indiv_schc["ordered_pair"] = [ - f"{i}-{j}" for i, j in prox_indiv_schc[["cluster1_label", "cluster2_label"]].to_numpy()] - prox_indiv_schc["comparison"] = [ - f"cluster1={i}-cluster2={j}" for i, j in prox_indiv_schc[["cluster1_is_high", "cluster2_is_high"]].to_numpy()] - - # Post-processing - slide_submitter_ids = list(set(predictions.slide_submitter_id)) - results_schc_indiv = pd.concat(Parallel(n_jobs=NUM_CORES)(delayed(features.post_processing_proximity)( - prox_df=prox_indiv_schc, slide_submitter_id=id, method="individual_between") for id in slide_submitter_ids)) - - # Within clusters - slide_submitter_ids = list(set(predictions.slide_submitter_id)) - results_schc_indiv_within = Parallel(n_jobs=NUM_CORES)(delayed(features.compute_proximity_clusters_pairs)(all_slide_indiv_clusters, slide_submitter_id=id, method="individual_within", - n_clusters=n_clusters, cell_types=cell_types, max_dist=max_dist, max_n_tiles_threshold=max_n_tiles_threshold, tile_size=tile_size, overlap=overlap,) for id in slide_submitter_ids) - prox_indiv_schc_within = pd.concat(results_schc_indiv_within) - - prox_indiv_schc_within = pd.merge(prox_indiv_schc_within, slide_indiv_clusters_labeled, left_on=[ - "slide_submitter_id", "cell_type", "cluster1"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) - prox_indiv_schc_within = prox_indiv_schc_within.drop( - columns=["cluster_label"]) - prox_indiv_schc_within = prox_indiv_schc_within.rename( - columns={"is_high": "cluster1_is_high", "cell_type_map": "cell_type_map1"}) - prox_indiv_schc_within = pd.merge(prox_indiv_schc_within, slide_indiv_clusters_labeled, left_on=[ - "slide_submitter_id", "cell_type", "cluster2"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) - prox_indiv_schc_within = prox_indiv_schc_within.rename( - columns={"is_high": "cluster2_is_high", "cell_type_map": "cell_type_map2"}) - prox_indiv_schc_within = prox_indiv_schc_within.drop( - columns=["cluster_label"]) - - # Order doesn't matter (only same cell type combinations) - prox_indiv_schc_within["pair"] = [ - f"{i}-{j}" for i, j in prox_indiv_schc_within[["cell_type_map1", "cell_type_map2"]].to_numpy()] - prox_indiv_schc_within["comparison"] = [ - f"cluster1={sorted([i,j])[0]}-cluster2={sorted([i,j])[1]}" for i, j in 
prox_indiv_schc_within[["cluster1_is_high", "cluster2_is_high"]].to_numpy()] - - # Post-processing - slide_submitter_ids = list(set(prox_indiv_schc_within.slide_submitter_id)) - results_schc_indiv_within = pd.concat(Parallel(n_jobs=NUM_CORES)(delayed(features.post_processing_proximity)( - prox_df=prox_indiv_schc_within, slide_submitter_id=id, method="individual_within") for id in slide_submitter_ids)) - - # Concatenate within and between computed proximity values - prox_indiv_schc_combined = pd.concat( - [results_schc_indiv_within, results_schc_indiv]) - - # Remove rows with a proximity of NaN - prox_indiv_schc_combined = prox_indiv_schc_combined.dropna(axis=0) - - #################################### - # ---- Compute shape features ---- # - #################################### - - # Compute shape features based on clustering with all cell types simultaneously - # slide_submitter_ids = list(set(predictions.slide_submitter_id)) - # all_slide_clusters_characterized = all_slide_clusters_characterized.rename(columns=dict(zip(cell_types, [f"is_{cell_type}_cluster" for cell_type in cell_types]))) - # tiles_all_schc = pd.merge(tiles_all_schc, all_slide_clusters_characterized, on=["slide_submitter_id", "cluster_label"]) - - # res = pd.concat(Parallel(n_jobs=NUM_CORES)(delayed(features.compute_shape_features)(tiles=tiles_all_schc, slide_submitter_id=id, tile_size=tile_size, overlap=overlap, cell_types=cell_types) for id in -# slide_submitter_ids)) - # res = res.drop(axis=1, labels=["cluster_label"]) - # shape_feature_means = res.groupby(["slide_submitter_id", "cell_type"]).mean().reset_index() - - ############################################## - # ---- Formatting all computed features ---- # - ############################################## - - frac_high_sub = frac_high[frac_high["is_high"]].copy() - frac_high_sub = frac_high_sub.drop( - columns=["is_high", "n_clusters", "n_total_clusters"]) - - frac_high_wide = frac_high_sub.pivot(index=["slide_submitter_id"], columns=[ - "cell_type_map"])["fraction"] - new_cols = [('fraction {0} clusters labeled high'.format(col)) - for col in frac_high_wide.columns] - frac_high_wide.columns = new_cols - frac_high_wide = frac_high_wide.sort_index(axis="columns").reset_index() - - num_clust_by_slide_sub = num_clust_by_slide.copy() - num_clust_by_slide_sub = num_clust_by_slide_sub.drop( - columns=["is_assigned", "n_clusters"]) - - num_clust_slide_wide = num_clust_by_slide_sub.pivot( - index=["slide_submitter_id"], columns=["cell_type"])["fraction"] - new_cols = [('fraction {0} clusters'.format(col)) - for col in num_clust_slide_wide.columns] - num_clust_slide_wide.columns = new_cols - num_clust_slide_wide = num_clust_slide_wide.sort_index( - axis="columns").reset_index() - - all_prox_df_wide = all_prox_df.pivot( - index=["slide_submitter_id"], columns=["pair"])["proximity"] - new_cols = [ - f'prox CC {col.replace("_", " ")} clusters' for col in all_prox_df_wide.columns] - all_prox_df_wide.columns = new_cols - all_prox_df_wide = all_prox_df_wide.reset_index() - - prox_indiv_schc_combined.comparison = prox_indiv_schc_combined.comparison.replace(dict(zip(['cluster1=True-cluster2=True', 'cluster1=True-cluster2=False', - 'cluster1=False-cluster2=True', 'cluster1=False-cluster2=False'], ["high-high", "high-low", "low-high", "low-low"]))) - prox_indiv_schc_combined["pair (comparison)"] = [ - f"{pair} ({comp})" for pair, comp in prox_indiv_schc_combined[["pair", "comparison"]].to_numpy()] - prox_indiv_schc_combined = prox_indiv_schc_combined.drop( - axis=1, labels=["pair", 
"comparison"]) - prox_indiv_schc_combined_wide = prox_indiv_schc_combined.pivot( - index=["slide_submitter_id"], columns=["pair (comparison)"])["proximity"] - new_cols = [ - f'prox CC {col.replace("_", " ")}' for col in prox_indiv_schc_combined_wide.columns] - prox_indiv_schc_combined_wide.columns = new_cols - prox_indiv_schc_combined_wide = prox_indiv_schc_combined_wide.reset_index() - - # shape_feature_means_wide = shape_feature_means.pivot(index=["slide_submitter_id"], columns="cell_type")[["solidity", "roundness"]] - # new_cols = [f'prox CC {col.replace("_", " ")}' for col in prox_indiv_schc_combined_wide.columns] - # shape_feature_means_wide.columns = [f"{i.capitalize()} {j}" for i, j in shape_feature_means_wide.columns] - # shape_feature_means_wide = shape_feature_means_wide.reset_index() - - # Store features - all_features = pd.merge(frac_high_wide, num_clust_slide_wide, on=[ - "slide_submitter_id"]) - all_features = pd.merge(all_features, all_prox_df_wide) - all_features = pd.merge(all_features, prox_indiv_schc_combined_wide) - # all_features = pd.merge(all_features, shape_feature_means_wide) - - # drop the predicted probabilities - tiles_all_schc = tiles_all_schc.drop(axis=1, columns=cell_types) - all_slide_indiv_clusters = all_slide_indiv_clusters.drop( - axis=1, columns=cell_types) # drop the predicted probabilities - - ################################ - # ---- Store all features ---- # - ################################ - - # tiles_all_schc (DataFrame): dataframe containing the metadata columns and the cluster_label (int) - # all_slide_clusters_characterized (DataFrame): dataframe containing the slide_submitter_id, and the the columns for the cell types filled with booleans (True if the cluster is assigned with that cell type) - # all_slide_indiv_clusters (DataFrame): dataframe containing the metadata columns and columns with to which cell type cluster the tile belongs to - # slide_indiv_clusters_labeled (DataFrame): dataframe containing the slide_submitter_id, cell_type_map, cluster_label (int), and is_high (abundance) - # all_prox_df (DataFrame): dataframe containing slide_submitter_id, pair, proximity - # prox_indiv_schc_combined (DataFrame): dataframe containing slide_submitter_id, comparison (high/low abundance label), pair (cell type pair) and proximity - # shape_features_mean (DataFrame): dataframe containing slide_submitter_id, cell_type, slide_submitter_id, solidity, roundness - tiles_all_schc.to_csv( - f"{output_dir}/{slide_type}_all_schc_tiles.csv", sep="\t", index=False) - all_slide_clusters_characterized.to_csv( - f"{output_dir}/{slide_type}_all_schc_clusters_labeled.csv", sep="\t", index=False) - all_slide_indiv_clusters.to_csv( - f"{output_dir}/{slide_type}_indiv_schc_tiles.csv", sep="\t", index=False) - slide_indiv_clusters_labeled.to_csv( - f"{output_dir}/{slide_type}_indiv_schc_clusters_labeled.csv", sep="\t", index=False) - all_prox_df.to_csv( - f"{output_dir}/{slide_type}_features_clust_all_schc_prox.csv", sep="\t", index=False) - prox_indiv_schc_combined.to_csv( - f"{output_dir}/{slide_type}_features_clust_indiv_schc_prox.csv", sep="\t", index=False) - # shape_feature_means.to_csv(f"{output_dir}/{slide_type}_features_clust_shapes.csv", sep="\t", index=False) - all_features.to_csv( - f"{output_dir}/{slide_type}_clustering_features.csv", sep="\t", index=False) - - -def post_processing(output_dir, slide_type="FF", metadata_path="", is_TCGA=False, merge_var="slide_submitter_id", sheet_name=None): - """ - Combine network and clustering features into a single 
file. If metadata_path is not None, add the metadata as well, based on variable slide_submitter_id - - Args: - output_dir (str): directory containing the graph and clustering features - slide_type (str): slide type to identify correct files for merging, either "FF" or "FFPE" (default="FF") - metadata_path (str): path to file containing metadata - is_TCGA (bool): whether data is from TCGA - merge_var (str): variable on which to merge (default: slide_submitter_id) - - """ - all_features_graph = pd.read_csv( - f"{output_dir}/{slide_type}_all_graph_features.csv", sep="\t") - all_features_clustering = pd.read_csv( - f"{output_dir}/{slide_type}_clustering_features.csv", sep="\t") - - all_features_combined = pd.merge( - all_features_graph, all_features_clustering) - - # Add additional identifiers for TCGA - if is_TCGA: - all_features_combined["TCGA_patient_ID"] = all_features_combined.slide_submitter_id.str[0:12] - all_features_combined["TCGA_sample_ID"] = all_features_combined.slide_submitter_id.str[0:15] - all_features_combined["sample_submitter_id"] = all_features_combined.slide_submitter_id.str[0:16] - - if path.isfile(metadata_path): - file_extension = metadata_path.split(".")[-1] - if (file_extension.startswith("xls")): - if sheet_name is None: - metadata = pd.read_excel(metadata_path) - elif (file_extension == "txt") or (file_extension == "csv"): - metadata = pd.read_csv(metadata_path, sep="\t") - all_features_combined = pd.merge( - all_features_combined, metadata, on=merge_var, how="left") - all_features_combined.to_csv( - f"{output_dir}/all_features_combined.csv", sep="\t", index=False) - - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="Derive spatial features") - parser.add_argument("--workflow_mode", type=int, - help="Choose which steps to execute: all = 1, graph-based = 2, clustering-based = 3, combining features = 4 (default: 1)", default=1) - parser.add_argument("--tile_quantification_path", type=str, - help="Path to csv file with tile-level quantification (predictions)", required=True) - parser.add_argument("--output_dir", type=str, - help="Path to output folder to store generated files", required=True) - parser.add_argument("--slide_type", type=str, - help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") - parser.add_argument("--cell_types_path", type=str, - help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default="") - parser.add_argument("--graphs_path", type=str, - help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) - - parser.add_argument("--cutoff_path_length", type=int, - help="Max path length for proximity based on graphs", default=2, required=False) - parser.add_argument("--shapiro_alpha", type=float, - help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) - parser.add_argument("--abundance_threshold", type=float, - help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) - - parser.add_argument("--n_clusters", type=int, - help="Number of clusters for SCHC (default: 8)", required=False, default=8) - parser.add_argument("--max_dist", type=int, - help="Maximum distance between clusters", required=False, default=None) - parser.add_argument("--max_n_tiles_threshold", type=int, - help="Number of tiles for computing max. 
distance between two points in two different clusters", default=2, required=False) - parser.add_argument("--tile_size", type=int, - help="Size of tile (default: 512)", default=512, required=False) - parser.add_argument("--overlap", type=int, - help="Overlap of tiles (default: 50)", default=50, required=False) - - parser.add_argument("--metadata_path", type=str, - help="Path to tab-separated file with metadata", default="") - parser.add_argument("--is_TCGA", type=bool, - help="dataset is from TCGA (default: True)", default=True, required=False) - parser.add_argument("--merge_var", type=str, - help="Variable to merge metadata and computed features on", default=None) - parser.add_argument("--sheet_name", type=str, - help="Name of sheet for merging in case a path to xls(x) file is given for metadata_path", default=None) - - args = parser.parse_args() - - # Common variables - workflow_mode = args.workflow_mode - tile_quantification_path = args.tile_quantification_path - slide_type = args.slide_type - cell_types_path = args.cell_types_path - graphs_path = args.graphs_path - output_dir = args.output_dir - - # Variables for network/graph based features - cutoff_path_length = args.cutoff_path_length - shapiro_alpha = args.shapiro_alpha - abundance_threshold = args.abundance_threshold - - # Variables clustering - n_clusters = args.n_clusters - max_dist = args.max_dist - max_n_tiles_threshold = args.max_n_tiles_threshold - tile_size = args.tile_size - overlap = args.overlap - - # Variables post-processing - metadata_path = args.metadata_path - merge_var = args.merge_var - sheet_name = args.sheet_name - - # FIX create output directory for saving - full_output_dir = f"{output_dir}" - print(full_output_dir) - if not os.path.isdir(full_output_dir): - os.makedirs(full_output_dir) - - if path.isfile(cell_types_path): - cell_types = pd.read_csv( - cell_types_path, header=None).to_numpy().flatten() - else: - cell_types = DEFAULT_CELL_TYPES - - if (workflow_mode in [1, 2, 3]) & (graphs_path is None): - predictions = pd.read_csv(tile_quantification_path, sep="\t") - slide_submitter_ids = list(set(predictions.slide_submitter_id)) - results = Parallel(n_jobs=NUM_CORES)( - delayed(graphs.construct_graph)( - predictions=predictions, slide_submitter_id=id) - for id in slide_submitter_ids - ) - # Extract/format graphs - all_graphs = { - list(slide_graph.keys())[0]: list(slide_graph.values())[0] - for slide_graph in results - } - joblib.dump( - all_graphs, f"{output_dir}/{slide_type}_graphs.pkl") - - graphs_path = f"{output_dir}/{slide_type}_graphs.pkl" - - if workflow_mode == 1: - print("Workflow mode: all steps") - - print("Compute network features...") - compute_network_features( - tile_quantification_path=tile_quantification_path, - output_dir=output_dir, - slide_type=slide_type, - cell_types=cell_types, - graphs_path=graphs_path, cutoff_path_length=cutoff_path_length, shapiro_alpha=shapiro_alpha, abundance_threshold=abundance_threshold) - - print("Compute clustering features...") - compute_clustering_features( - tile_quantification_path=tile_quantification_path, - output_dir=output_dir, - slide_type=slide_type, - cell_types=cell_types, - graphs_path=graphs_path, n_clusters=n_clusters, max_dist=max_dist, max_n_tiles_threshold=max_n_tiles_threshold, tile_size=tile_size, overlap=overlap) - - print("Post-processing: combining all features") - post_processing(output_dir=output_dir, slide_type=slide_type, metadata_path=metadata_path, - is_TCGA=False, merge_var=merge_var, sheet_name=sheet_name) - - print("Finished with 
all steps.") - elif workflow_mode == 2: - print("Compute network features...") - compute_network_features( - tile_quantification_path=tile_quantification_path, - output_dir=output_dir, - slide_type=slide_type, - cell_types=cell_types, - graphs_path=graphs_path, cutoff_path_length=cutoff_path_length, shapiro_alpha=shapiro_alpha, abundance_threshold=abundance_threshold) - print("Finished.") - elif workflow_mode == 3: - print("Compute clustering features...") - compute_clustering_features( - tile_quantification_path=tile_quantification_path, - output_dir=output_dir, - slide_type=slide_type, - cell_types=cell_types, - graphs_path=graphs_path, n_clusters=n_clusters, max_dist=max_dist, max_n_tiles_threshold=max_n_tiles_threshold, tile_size=tile_size, overlap=overlap) - print("Finished.") - - elif workflow_mode == 4: - print("Post-processing: combining all features") - - post_processing(output_dir=output_dir, slide_type=slide_type, metadata_path=metadata_path, - is_TCGA=False, merge_var=merge_var, sheet_name=sheet_name) - print("Finished.") - else: - raise Exception( - "Invalid workflow mode, please choose one of the following (int): all = 1, graph-based = 2, clustering-based = 3, combining features = 4 (default: 1)") diff --git a/README.md b/README.md index 27c6f20..690f618 100644 --- a/README.md +++ b/README.md @@ -16,52 +16,33 @@ See also the figures below. ## Run SPoTLIghT -1. Build the docker image as follows: +1. Pull the Docker container: ```bash -docker build -t run_spotlight:vfinal . --platform linux/amd64 +docker pull joank23/spotlight:latest ``` Alternatively you can use Singularity/Apptainer (HPCs): ```bash # 1. save docker as tar or tar.gz (compressed) -docker save run_spotlight:vfinal -o {output_dir}/spotlight_docker.tar.gz +docker save joank23/spotlight -o spotlight.tar.gz # 2. build apptainer (.sif) from docker (.tar) -apptainer build {output_dir}/spotlight_apptainer.sif docker-archive:spotlight_docker.tar.gz +apptainer build spotlight.sif docker-archive:spotlight.tar.gz ``` -2. Add your FF histopathology slides to a subdirectory in the `spotlight_docker` directory, e.g. `data_example/images` +> Please rename your images file names, so they only include "-", to follow the same sample coding used by the TCGA. -3. Download retrained models to extract the histopathological features, available from Fu et al., Nat Cancer, 2020 ([Retrained_Inception_v4](https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BSST292)). +1. Download retrained models to extract the histopathological features, available from Fu et al., Nat Cancer, 2020 ([Retrained_Inception_v4](https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BSST292)). Once you unzip the folder, extract the files to the `data/checkpoint/Retrained_Inception_v4/` folder. - -4. If a TCGA dataset is used, please download metadata (i.e. "biospecimen -> TSV", unzip and keep slide.tsv), then rename `slide.tsv` to `clinical_file_TCGA_{cancer_type_abbrev}` such as `clinical_file_TCGA_SKCM.tsv` and copy to `/data`. Example dataset TCGA-SKCM can be downloaded [here](https://portal.gdc.cancer.gov/projects/TCGA-SKCM) - -5. Setup your paths and variables in `run_pipeline.sh` +2. If a TCGA dataset is used, please download metadata (i.e. "biospecimen -> TSV", unzip and keep slide.tsv), then rename `slide.tsv` to `clinical_file_TCGA_{cancer_type_abbrev}` such as `clinical_file_TCGA_SKCM.tsv` and copy to `/data`. Example dataset TCGA-SKCM can be downloaded [here](https://portal.gdc.cancer.gov/projects/TCGA-SKCM). 
For non-TCGA datasets, please omit this step. +3. Setup your paths and variables in `run_pipeline.sh` +4. Set a config ensuring compatibility with available resources, you can use `custom.config` as a template. (see `nextflow.config` for all parameters, if a parameter is 'assets/NO_FILE' or 'dummy', they are optional parameters, if not used please leave as is) +5. Run the Nextflow Pipeline as follows: ```bash -# Directory 'spotlight_docker' - -work_dir="/path/to/spotlight_docker" -spotlight_sif="path/to/spotlight_sif" - -# Define directories/files in container (mounted) - -folder_images="/path/to/images_dir" -output_dir="/path/to/output_dir" - -# Relative to docker, i.e. start with /data - -checkpoint="/data/checkpoint/Retrained_Inception_v4/model.ckpt-100000" -clinical_files_dir="/data/path/to/clinical/TCGA/file.tsv" - -# Remaining parameters (this configuration has been tested) -slide_type="FF" -tumor_purity_threshold=80 -class_names="SKCM_T" -model_name="inception_v4" +nextflow run . -profile apptainer -c "${your_config_file}" ```` @@ -70,42 +51,63 @@ model_name="inception_v4" SPoTLIghT generates the following output directory structure: ```bash -{output_dir} -├── 1_histopathological_features -│ ├── bot_train.txt -│ ├── features.txt +{outdir} +├── 1_extract_histopatho_features +│ ├── avail_slides_for_img.csv +│ ├── features-0.parquet │ ├── file_info_train.txt -│ ├── final_clinical_file.txt │ ├── generated_clinical_file.txt -│ ├── pred_train.txt -│ ├── predictions.txt +│ ├── ok.txt +│ ├── predictions-0.parquet │ ├── process_train │ │ ├── images_train_00001-of-00320.tfrecord -│ │ └── images_train_00002-of-00320.tfrecord +│ │ ├── images_train_00002-of-00320.tfrecord +│ │ ├── images_train_00004-of-00320.tfrecord │ └── tiles -│ ├── TCGA-EB-A3XC-01Z-00-DX1_2773_15709.jpg -│ └── TCGA-EE-A3JE-01Z-00-DX1_25873_12013.jpg +│ ├── xenium-skin-panel_10165_10165.jpg +│ ├── xenium-skin-panel_10165_10627.jpg +│ ├── xenium-skin-panel_10165_11089.jpg ├── 2_tile_level_quantification │ ├── test_tile_predictions_proba.csv │ └── test_tile_predictions_zscores.csv ├── 3_spatial_features -│ ├── FF_all_graph_features.csv -│ ├── FF_all_schc_clusters_labeled.csv -│ ├── FF_all_schc_tiles.csv -│ ├── FF_clustering_features.csv -│ ├── FF_features_ND.csv -│ ├── FF_features_ND_ES.csv -│ ├── FF_features_ND_sim_assignments.pkl -│ ├── FF_features_ND_sims.csv -│ ├── FF_features_clust_all_schc_prox.csv -│ ├── FF_features_clust_indiv_schc_prox.csv -│ ├── FF_features_coloc_fraction.csv -│ ├── FF_features_lcc_fraction.csv -│ ├── FF_features_shortest_paths_thresholded.csv -│ ├── FF_graphs.pkl -│ ├── FF_indiv_schc_clusters_labeled.csv -│ ├── FF_indiv_schc_tiles.csv -│ ├── FF_shapiro_tests.csv -│ └── all_features_combined.csv -└── list_images.txt +│ ├── clustering_features +│ │ ├── FFPE_all_schc_clusters_labeled.csv +│ │ ├── FFPE_all_schc_tiles.csv +│ │ ├── FFPE_all_schc_tiles_raw.csv +│ │ ├── FFPE_features_clust_all_schc_prox_wide.csv +│ │ ├── FFPE_features_clust_indiv_schc_prox_between.csv +│ │ ├── FFPE_features_clust_indiv_schc_prox.csv +│ │ ├── FFPE_features_clust_indiv_schc_prox_within.csv +│ │ ├── FFPE_frac_high_wide.csv +│ │ ├── FFPE_graphs.pkl +│ │ ├── FFPE_indiv_schc_clusters_labeled.csv +│ │ ├── FFPE_indiv_schc_tiles.csv +│ │ ├── FFPE_indiv_schc_tiles_raw.csv +│ │ └── FFPE_nclusters_wide.csv +│ ├── FFPE_all_features_combined.csv +│ ├── FFPE_all_graph_features.csv +│ ├── FFPE_clustering_features.csv +│ ├── FFPE_graphs.pkl +│ └── network_features +│ ├── FFPE_features_coloc_fraction.csv +│ ├── 
FFPE_features_coloc_fraction_wide.csv +│ ├── FFPE_features_lcc_fraction_wide.csv +│ ├── FFPE_features_ND.csv +│ ├── FFPE_features_ND_ES.csv +│ ├── FFPE_features_ND_sim_assignments.pkl +│ ├── FFPE_features_ND_sims.csv +│ ├── FFPE_features_shortest_paths_thresholded.csv +│ ├── FFPE_features_shortest_paths_thresholded_wide.csv +│ ├── FFPE_graphs.pkl +│ └── FFPE_shapiro_tests.csv +├── bottleneck +│ ├── bot_train.txt +│ ├── ok.txt +│ └── pred_train.txt +└── pipeline_info + ├── execution_report_2024-09-23_21-07-41.html + ├── execution_timeline_2024-09-23_21-07-41.html + ├── execution_trace_2024-09-23_21-07-41.txt + └── pipeline_dag_2024-09-23_21-07-41.html ``` diff --git a/Python/libs/DL/__init__.py b/assets/NO_FILE old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/DL/__init__.py rename to assets/NO_FILE diff --git a/data/TF_models/SKCM_FF/CAFs/outer_models.pkl b/assets/TF_models/SKCM_FF/CAFs/outer_models.pkl old mode 100644 new mode 100755 similarity index 100% rename from data/TF_models/SKCM_FF/CAFs/outer_models.pkl rename to assets/TF_models/SKCM_FF/CAFs/outer_models.pkl diff --git a/data/TF_models/SKCM_FF/CAFs/x_train_scaler.pkl b/assets/TF_models/SKCM_FF/CAFs/x_train_scaler.pkl old mode 100644 new mode 100755 similarity index 100% rename from data/TF_models/SKCM_FF/CAFs/x_train_scaler.pkl rename to assets/TF_models/SKCM_FF/CAFs/x_train_scaler.pkl diff --git a/data/TF_models/SKCM_FF/T_cells/outer_models.pkl b/assets/TF_models/SKCM_FF/T_cells/outer_models.pkl old mode 100644 new mode 100755 similarity index 100% rename from data/TF_models/SKCM_FF/T_cells/outer_models.pkl rename to assets/TF_models/SKCM_FF/T_cells/outer_models.pkl diff --git a/data/TF_models/SKCM_FF/T_cells/x_train_scaler.pkl b/assets/TF_models/SKCM_FF/T_cells/x_train_scaler.pkl old mode 100644 new mode 100755 similarity index 100% rename from data/TF_models/SKCM_FF/T_cells/x_train_scaler.pkl rename to assets/TF_models/SKCM_FF/T_cells/x_train_scaler.pkl diff --git a/data/TF_models/SKCM_FF/endothelial_cells/outer_models.pkl b/assets/TF_models/SKCM_FF/endothelial_cells/outer_models.pkl old mode 100644 new mode 100755 similarity index 100% rename from data/TF_models/SKCM_FF/endothelial_cells/outer_models.pkl rename to assets/TF_models/SKCM_FF/endothelial_cells/outer_models.pkl diff --git a/data/TF_models/SKCM_FF/endothelial_cells/x_train_scaler.pkl b/assets/TF_models/SKCM_FF/endothelial_cells/x_train_scaler.pkl old mode 100644 new mode 100755 similarity index 100% rename from data/TF_models/SKCM_FF/endothelial_cells/x_train_scaler.pkl rename to assets/TF_models/SKCM_FF/endothelial_cells/x_train_scaler.pkl diff --git a/data/TF_models/SKCM_FF/tumor_purity/outer_models.pkl b/assets/TF_models/SKCM_FF/tumor_purity/outer_models.pkl old mode 100644 new mode 100755 similarity index 100% rename from data/TF_models/SKCM_FF/tumor_purity/outer_models.pkl rename to assets/TF_models/SKCM_FF/tumor_purity/outer_models.pkl diff --git a/data/TF_models/SKCM_FF/tumor_purity/x_train_scaler.pkl b/assets/TF_models/SKCM_FF/tumor_purity/x_train_scaler.pkl old mode 100644 new mode 100755 similarity index 100% rename from data/TF_models/SKCM_FF/tumor_purity/x_train_scaler.pkl rename to assets/TF_models/SKCM_FF/tumor_purity/x_train_scaler.pkl diff --git a/assets/adaptivecard.json b/assets/adaptivecard.json new file mode 100755 index 0000000..e8b1a9f --- /dev/null +++ b/assets/adaptivecard.json @@ -0,0 +1,67 @@ +{ + "type": "message", + "attachments": [ + { + "contentType": "application/vnd.microsoft.card.adaptive", + 
"contentUrl": null, + "content": { + "\$schema": "http://adaptivecards.io/schemas/adaptive-card.json", + "msteams": { + "width": "Full" + }, + "type": "AdaptiveCard", + "version": "1.2", + "body": [ + { + "type": "TextBlock", + "size": "Large", + "weight": "Bolder", + "color": "<% if (success) { %>Good<% } else { %>Attention<%} %>", + "text": "nf-core/spotlight v${version} - ${runName}", + "wrap": true + }, + { + "type": "TextBlock", + "spacing": "None", + "text": "Completed at ${dateComplete} (duration: ${duration})", + "isSubtle": true, + "wrap": true + }, + { + "type": "TextBlock", + "text": "<% if (success) { %>Pipeline completed successfully!<% } else { %>Pipeline completed with errors. The full error message was: ${errorReport}.<% } %>", + "wrap": true + }, + { + "type": "TextBlock", + "text": "The command used to launch the workflow was as follows:", + "wrap": true + }, + { + "type": "TextBlock", + "text": "${commandLine}", + "isSubtle": true, + "wrap": true + } + ], + "actions": [ + { + "type": "Action.ShowCard", + "title": "Pipeline Configuration", + "card": { + "type": "AdaptiveCard", + "\$schema": "http://adaptivecards.io/schemas/adaptive-card.json", + "body": [ + { + "type": "FactSet", + "facts": [<% out << summary.collect{ k,v -> "{\"title\": \"$k\", \"value\" : \"$v\"}"}.join(",\n") %> + ] + } + ] + } + } + ] + } + } + ] +} diff --git a/Python/2_train_multitask_models/cell_types.txt b/assets/cell_types.txt old mode 100644 new mode 100755 similarity index 100% rename from Python/2_train_multitask_models/cell_types.txt rename to assets/cell_types.txt diff --git a/assets/email_template.html b/assets/email_template.html new file mode 100755 index 0000000..a1c2f76 --- /dev/null +++ b/assets/email_template.html @@ -0,0 +1,53 @@ + + + + + + + + nf-core/spotlight Pipeline Report + + +
+  [nf-core/spotlight HTML e-mail report body: shows "nf-core/spotlight ${version}" and "Run Name: $runName"; on failure it reports that execution completed unsuccessfully together with the exit status ($exitStatus) and the full error message (${errorReport}); on success it reports that execution completed successfully; it then lists the completion time ($dateComplete) and duration ($duration), the command used to launch the workflow ($commandLine), and the Pipeline Configuration summary table, and closes with a footer linking to https://github.com/nf-core/spotlight]
+ + + diff --git a/assets/email_template.txt b/assets/email_template.txt new file mode 100755 index 0000000..dfac538 --- /dev/null +++ b/assets/email_template.txt @@ -0,0 +1,39 @@ +---------------------------------------------------- + ,--./,-. + ___ __ __ __ ___ /,-._.--~\\ + |\\ | |__ __ / ` / \\ |__) |__ } { + | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-, + `._,._,' + nf-core/spotlight ${version} +---------------------------------------------------- +Run Name: $runName + +<% if (success){ + out << "## nf-core/spotlight execution completed successfully! ##" +} else { + out << """#################################################### +## nf-core/spotlight execution completed unsuccessfully! ## +#################################################### +The exit status of the task that caused the workflow execution to fail was: $exitStatus. +The full error message was: + +${errorReport} +""" +} %> + + +The workflow was completed at $dateComplete (duration: $duration) + +The command used to launch the workflow was as follows: + + $commandLine + + + +Pipeline Configuration: +----------------------- +<% out << summary.collect{ k,v -> " - $k: $v" }.join("\n") %> + +-- +nf-core/spotlight +https://github.com/nf-core/spotlight diff --git a/assets/methods_description_template.yml b/assets/methods_description_template.yml new file mode 100755 index 0000000..3bc853e --- /dev/null +++ b/assets/methods_description_template.yml @@ -0,0 +1,29 @@ +id: "nf-core-spotlight-methods-description" +description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication." +section_name: "nf-core/spotlight Methods Description" +section_href: "https://github.com/nf-core/spotlight" +plot_type: "html" +## TODO nf-core: Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline +## You inject any metadata in the Nextflow '${workflow}' object +data: | +

+  <h4>Methods</h4>
+  <p>Data was processed using nf-core/spotlight v${workflow.manifest.version} ${doi_text} of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.</p>
+  <p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (Di Tommaso et al., 2017) with the following command:</p>
+  <pre><code>${workflow.commandLine}</code></pre>
+  <p>${tool_citations}</p>
+  <h4>References</h4>
+  <ul>
+    <li>Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319. doi: 10.1038/nbt.3820</li>
+    <li>Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276-278. doi: 10.1038/s41587-020-0439-x</li>
+    <li>Grüning, B., Dale, R., Sjödin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., Köster, J., & Bioconda Team. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 15(7), 475–476. doi: 10.1038/s41592-018-0046-7</li>
+    <li>da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J., Barsnes, H., Vaudel, M., Moreno, P., Gatto, L., Weber, J., Bai, M., Jimenez, R. C., Sachsenberg, T., Pfeuffer, J., Vera Alvarez, R., Griss, J., Nesvizhskii, A. I., & Perez-Riverol, Y. (2017). BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics (Oxford, England), 33(16), 2580–2582. doi: 10.1093/bioinformatics/btx192</li>
+    ${tool_bibliography}
+  </ul>
+  <p>Notes:</p>
+  <ul>
+    ${nodoi_text}
+    <li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
+    <li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
+  </ul>
diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml new file mode 100755 index 0000000..81edd73 --- /dev/null +++ b/assets/multiqc_config.yml @@ -0,0 +1,15 @@ +report_comment: > + This report has been generated by the nf-core/spotlight + analysis pipeline. For information about how to interpret these results, please see the + documentation. +report_section_order: + "nf-core-spotlight-methods-description": + order: -1000 + software_versions: + order: -1001 + "nf-core-spotlight-summary": + order: -1002 + +export_plots: true + +disable_version_detection: true diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv new file mode 100755 index 0000000..5f653ab --- /dev/null +++ b/assets/samplesheet.csv @@ -0,0 +1,3 @@ +sample,fastq_1,fastq_2 +SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz +SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz, diff --git a/assets/schema_input.json b/assets/schema_input.json new file mode 100755 index 0000000..772ae5d --- /dev/null +++ b/assets/schema_input.json @@ -0,0 +1,33 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema", + "$id": "https://raw.githubusercontent.com/nf-core/spotlight/master/assets/schema_input.json", + "title": "nf-core/spotlight pipeline - params.input schema", + "description": "Schema for the file provided with params.input", + "type": "array", + "items": { + "type": "object", + "properties": { + "sample": { + "type": "string", + "pattern": "^\\S+$", + "errorMessage": "Sample name must be provided and cannot contain spaces", + "meta": ["id"] + }, + "fastq_1": { + "type": "string", + "format": "file-path", + "exists": true, + "pattern": "^\\S+\\.f(ast)?q\\.gz$", + "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'" + }, + "fastq_2": { + "type": "string", + "format": "file-path", + "exists": true, + "pattern": "^\\S+\\.f(ast)?q\\.gz$", + "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'" + } + }, + "required": ["sample", "fastq_1"] + } +} diff --git a/assets/sendmail_template.txt b/assets/sendmail_template.txt new file mode 100755 index 0000000..a7c115c --- /dev/null +++ b/assets/sendmail_template.txt @@ -0,0 +1,53 @@ +To: $email +Subject: $subject +Mime-Version: 1.0 +Content-Type: multipart/related;boundary="nfcoremimeboundary" + +--nfcoremimeboundary +Content-Type: text/html; charset=utf-8 + +$email_html + +--nfcoremimeboundary +Content-Type: image/png;name="nf-core-spotlight_logo.png" +Content-Transfer-Encoding: base64 +Content-ID: +Content-Disposition: inline; filename="nf-core-spotlight_logo_light.png" + +<% out << new File("$projectDir/assets/nf-core-spotlight_logo_light.png"). + bytes. + encodeBase64(). + toString(). + tokenize( '\n' )*. + toList()*. + collate( 76 )*. + collect { it.join() }. + flatten(). + join( '\n' ) %> + +<% +if (mqcFile){ +def mqcFileObj = new File("$mqcFile") +if (mqcFileObj.length() < mqcMaxSize){ +out << """ +--nfcoremimeboundary +Content-Type: text/html; name=\"multiqc_report\" +Content-Transfer-Encoding: base64 +Content-ID: +Content-Disposition: attachment; filename=\"${mqcFileObj.getName()}\" + +${mqcFileObj. + bytes. + encodeBase64(). + toString(). + tokenize( '\n' )*. + toList()*. + collate( 76 )*. + collect { it.join() }. + flatten(). 
+ join( '\n' )} +""" +}} +%> + +--nfcoremimeboundary-- diff --git a/assets/slackreport.json b/assets/slackreport.json new file mode 100755 index 0000000..9d76280 --- /dev/null +++ b/assets/slackreport.json @@ -0,0 +1,34 @@ +{ + "attachments": [ + { + "fallback": "Plain-text summary of the attachment.", + "color": "<% if (success) { %>good<% } else { %>danger<%} %>", + "author_name": "nf-core/spotlight ${version} - ${runName}", + "author_icon": "https://www.nextflow.io/docs/latest/_static/favicon.ico", + "text": "<% if (success) { %>Pipeline completed successfully!<% } else { %>Pipeline completed with errors<% } %>", + "fields": [ + { + "title": "Command used to launch the workflow", + "value": "```${commandLine}```", + "short": false + } + <% + if (!success) { %> + , + { + "title": "Full error message", + "value": "```${errorReport}```", + "short": false + }, + { + "title": "Pipeline configuration", + "value": "<% out << summary.collect{ k,v -> k == "hook_url" ? "_${k}_: (_hidden_)" : ( ( v.class.toString().contains('Path') || ( v.class.toString().contains('String') && v.contains('/') ) ) ? "_${k}_: `${v}`" : (v.class.toString().contains('DateTime') ? ("_${k}_: " + v.format(java.time.format.DateTimeFormatter.ofLocalizedDateTime(java.time.format.FormatStyle.MEDIUM))) : "_${k}_: ${v}") ) }.join(",\n") %>", + "short": false + } + <% } + %> + ], + "footer": "Completed at <% out << dateComplete.format(java.time.format.DateTimeFormatter.ofLocalizedDateTime(java.time.format.FormatStyle.MEDIUM)) %> (duration: ${duration})" + } + ] +} diff --git a/Python/2_train_multitask_models/task_selection_names.pkl b/assets/task_selection_names.pkl old mode 100644 new mode 100755 similarity index 100% rename from Python/2_train_multitask_models/task_selection_names.pkl rename to assets/task_selection_names.pkl diff --git a/Python/1_extract_histopathological_features/tissue_classes.csv b/assets/tissue_classes.csv old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/tissue_classes.csv rename to assets/tissue_classes.csv diff --git a/assets/tmp_clinical_file.txt b/assets/tmp_clinical_file.txt new file mode 100644 index 0000000..76aa7ff --- /dev/null +++ b/assets/tmp_clinical_file.txt @@ -0,0 +1 @@ +slide_submitter_id sample_submitter_id image_file_name percent_tumor_cells class_name class_id diff --git a/Python/1_extract_histopathological_features/myslim/bottleneck_predict.py b/bin/bottleneck_predict.py old mode 100644 new mode 100755 similarity index 85% rename from Python/1_extract_histopathological_features/myslim/bottleneck_predict.py rename to bin/bottleneck_predict.py index ae46b59..35e847c --- a/Python/1_extract_histopathological_features/myslim/bottleneck_predict.py +++ b/bin/bottleneck_predict.py @@ -6,19 +6,15 @@ sys.path.append(os.getcwd()) -# trunk-ignore(flake8/E402) import tf_slim as slim - -# trunk-ignore(flake8/E402) from nets import nets_factory - -# trunk-ignore(flake8/E402) from preprocessing import preprocessing_factory tf.compat.v1.disable_eager_execution() tf.app.flags.DEFINE_integer("num_classes", 42, "The number of classes.") -tf.app.flags.DEFINE_string("bot_out", None, "Output file for bottleneck features.") +tf.app.flags.DEFINE_string( + "bot_out", None, "Output file for bottleneck features.") tf.app.flags.DEFINE_string("pred_out", None, "Output file for predictions.") tf.app.flags.DEFINE_string( "model_name", "inception_v4", "The name of the architecture to evaluate.") @@ -26,8 +22,10 @@ "checkpoint_path", None, "The directory 
where the model was written to.") tf.app.flags.DEFINE_integer("eval_image_size", 299, "Eval image size.") tf.app.flags.DEFINE_string("file_dir", "../Output/process_train/", "") + FLAGS = tf.app.flags.FLAGS + def main(_): model_name_to_variables = { "inception_v3": "InceptionV3", @@ -37,13 +35,16 @@ def main(_): "inception_v4": "InceptionV4/Logits/AvgPool_1a/AvgPool:0", "inception_v3": "InceptionV3/Logits/AvgPool_1a_8x8/AvgPool:0", } - bottleneck_tensor_name = model_name_to_bottleneck_tensor_name.get(FLAGS.model_name) + bottleneck_tensor_name = model_name_to_bottleneck_tensor_name.get( + FLAGS.model_name) preprocessing_name = FLAGS.model_name eval_image_size = FLAGS.eval_image_size model_variables = model_name_to_variables.get(FLAGS.model_name) if model_variables is None: - tf.logging.error("Unknown model_name provided `%s`." % FLAGS.model_name) + tf.logging.error("Unknown model_name provided `%s`." % + FLAGS.model_name) sys.exit(-1) + # Either specify a checkpoint_path directly or find the path if tf.gfile.IsDirectory(FLAGS.checkpoint_path): checkpoint_path = tf.train.latest_checkpoint(FLAGS.checkpoint_path) print(checkpoint_path) @@ -62,7 +63,8 @@ def main(_): network_fn = nets_factory.get_network_fn( FLAGS.model_name, FLAGS.num_classes, is_training=False ) - processed_image = image_preprocessing_fn(image, eval_image_size, eval_image_size) + processed_image = image_preprocessing_fn( + image, eval_image_size, eval_image_size) processed_images = tf.expand_dims(processed_image, 0) logits, _ = network_fn(processed_images) @@ -72,14 +74,15 @@ def main(_): ) print(FLAGS.bot_out) - + sess = tf.Session() init_fn(sess) fto_bot = open(FLAGS.bot_out, "w") fto_pred = open(FLAGS.pred_out, "w") - filelist = os.listdir(FLAGS.file_dir) + filelist = [file_path for file_path in os.listdir( + FLAGS.file_dir) if (file_path.startswith("images_train") & file_path.endswith(".tfrecord"))] for i in range(len(filelist)): file = filelist[i] fls = tf.python_io.tf_record_iterator(FLAGS.file_dir + "/" + file) @@ -97,7 +100,8 @@ def main(_): example.features.feature["image/class/label"].int64_list.value[0] ) preds = sess.run(probabilities, feed_dict={image_string: x}) - bottleneck_values = sess.run(bottleneck_tensor_name, {image_string: x}) + bottleneck_values = sess.run( + bottleneck_tensor_name, {image_string: x}) fto_pred.write(filenames + "\t" + label) fto_bot.write(filenames + "\t" + label) for p in range(len(preds[0])): diff --git a/bin/clustering_schc_individual.py b/bin/clustering_schc_individual.py new file mode 100755 index 0000000..a7f1ba5 --- /dev/null +++ b/bin/clustering_schc_individual.py @@ -0,0 +1,143 @@ +#!/usr/bin/env python3 +import os +import joblib +import pandas as pd +from joblib import Parallel, delayed +import argparse + +# Own modules +import features.clustering as clustering +import features.graphs as graphs +from model.constants import DEFAULT_CELL_TYPES + +import multiprocessing + +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + + +def get_args(): + # Script description + description = """Spatial Clustering with SCHC (individual)""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--tile_quantification_path", type=str, + help="Path to csv file with tile-level quantification (predictions)", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + 
parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--cutoff_path_length", type=int, + help="Max path length for proximity based on graphs", default=2, required=False) + parser.add_argument("--shapiro_alpha", type=float, + help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + + parser.add_argument("--n_clusters", type=int, + help="Number of clusters for SCHC (default: 8)", required=False, default=8) + parser.add_argument("--n_cores", type = int, help = "Number of cores to use (parallelization)") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + +def clustering_schc_individual( + tile_quantification_path, + cell_types=None, graphs_path=None, n_cores = multiprocessing.cpu_count): + + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + predictions = pd.read_csv(tile_quantification_path, sep="\t") + slide_submitter_ids = list(set(predictions.slide_submitter_id)) + + ##################################### + # ---- Constructing the graphs ---- # + ##################################### + + if graphs_path is None: + results = Parallel(n_jobs=n_cores)( + delayed(graphs.construct_graph)( + predictions=predictions, slide_submitter_id=id) + for id in slide_submitter_ids + ) + # Extract/format graphs + all_graphs = { + list(slide_graph.keys())[0]: list(slide_graph.values())[0] + for slide_graph in results + } + + else: + all_graphs = joblib.load(graphs_path) + + ###################################################################################### + # ---- Fraction of highly abundant cell types (individual cell type clustering) ---- # + ###################################################################################### + + # Spatially Hierarchical Constrained Clustering with all quantification of all cell types for each individual cell type + slide_indiv_clusters = Parallel(n_jobs=n_cores)(delayed(clustering.schc_individual)( + predictions, all_graphs[id], id) for id in slide_submitter_ids) + all_slide_indiv_clusters = pd.concat(slide_indiv_clusters, axis=0) + + # Add metadata + all_slide_indiv_clusters = pd.merge( + predictions, all_slide_indiv_clusters, on="tile_ID") + + # Add abundance label 'high' or 'low' based on cluster means + slide_indiv_clusters_labeled = clustering.label_cell_type_map_clusters( + all_slide_indiv_clusters) + + all_slide_indiv_clusters_final = all_slide_indiv_clusters.drop( + axis=1, columns=cell_types) 
# drop the predicted probabilities + + + return all_slide_indiv_clusters, slide_indiv_clusters_labeled, all_slide_indiv_clusters_final, all_graphs + + +def main(args): + all_slide_indiv_clusters, slide_indiv_clusters_labeled, all_slide_indiv_clusters_final,all_graphs = clustering_schc_individual( + tile_quantification_path = args.tile_quantification_path, + cell_types=args.cell_types_path, + graphs_path=args.graphs_path, + n_cores = args.n_cores) + + all_slide_indiv_clusters.to_csv( + Path(args.output_dir, f"{args.prefix}_indiv_schc_tiles_raw.csv", index = False)) + + all_slide_indiv_clusters_final.to_csv(Path(args.output_dir, f"{args.prefix}_indiv_schc_tiles.csv"), index = False) + + slide_indiv_clusters_labeled.to_csv( + Path(args.output_dir,f"{args.prefix}_indiv_schc_clusters_labeled.csv", index= False )) + + if (args.graphs_path is None): + joblib.dump(all_graphs, + Path(args.output_dir, + f"{args.prefix}_graphs.pkl")) + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/clustering_schc_simultaneous.py b/bin/clustering_schc_simultaneous.py new file mode 100755 index 0000000..0f9f3a1 --- /dev/null +++ b/bin/clustering_schc_simultaneous.py @@ -0,0 +1,138 @@ +#!/usr/bin/env python3 +import os +import joblib +import pandas as pd +from joblib import Parallel, delayed +import argparse + +# Own modules +import features.clustering as clustering +import features.graphs as graphs +from model.constants import DEFAULT_CELL_TYPES +import multiprocessing + +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Compute Spatial Network Features: Compute Connectedness""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--tile_quantification_path", type=str, + help="Path to csv file with tile-level quantification (predictions)", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--cutoff_path_length", type=int, + help="Max path length for proximity based on graphs", default=2, required=False) + parser.add_argument("--shapiro_alpha", type=float, + help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + + parser.add_argument("--n_clusters", type=int, + help="Number of clusters for SCHC (default: 8)", required=False, default=8) + parser.add_argument("--n_cores", type = int, help = "Number of cores to use (parallelization)") + + parser.add_argument("--version", 
action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + +def clustering_schc_simultaneous( + tile_quantification_path, cell_types=None, graphs_path=None, n_cores = multiprocessing.cpu_count()): + + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + print(cell_types) + + predictions = pd.read_csv(tile_quantification_path, sep="\t") + slide_submitter_ids = list(set(predictions.slide_submitter_id)) + + ##################################### + # ---- Constructing the graphs ---- # + ##################################### + + if graphs_path is None: + results = Parallel(n_jobs=n_cores)( + delayed(graphs.construct_graph)( + predictions=predictions, slide_submitter_id=id) + for id in slide_submitter_ids + ) + # Extract/format graphs + all_graphs = { + list(slide_graph.keys())[0]: list(slide_graph.values())[0] + for slide_graph in results + } + else: + all_graphs = joblib.load(graphs_path) + + ###################################################################### + # ---- Fraction of cell type clusters (simultaneous clustering) ---- # + ###################################################################### + + # Spatially Hierarchical Constrained Clustering with all quantification of all cell types + slide_clusters = Parallel(n_jobs=n_cores)(delayed(clustering.schc_all)( + predictions, all_graphs[id], id) for id in slide_submitter_ids) + # Combine the tiles labeled with their cluster id for all slides + tiles_all_schc = pd.concat(slide_clusters, axis=0) + + # Assign a cell type label based on the mean of all cluster means across all slides + all_slide_clusters_characterized = clustering.characterize_clusters( + tiles_all_schc) + + formatted_tiles_all_schc = tiles_all_schc.drop(axis=1, columns=cell_types) + # drop the predicted probabilities + return tiles_all_schc, all_slide_clusters_characterized, formatted_tiles_all_schc, all_graphs + + +def main(args): + tiles_all_schc, all_slide_clusters_characterized,formatted_tiles_all_schc, all_graphs = clustering_schc_simultaneous( + tile_quantification_path = args.tile_quantification_path, + cell_types=args.cell_types_path, + graphs_path=args.graphs_path, + n_cores = args.n_cores) + + tiles_all_schc.to_csv( + Path(args.output_dir, f"{args.prefix}_all_schc_tiles_raw.csv", index = False)) + + formatted_tiles_all_schc.to_csv( + Path(args.output_dir, f"{args.prefix}_all_schc_tiles.csv", index = False) + ) + + all_slide_clusters_characterized.to_csv( + Path(args.output_dir,f"{args.prefix}_all_schc_clusters_labeled.csv", index= False )) + + + if (args.graphs_path is None): + joblib.dump(all_graphs, + Path(args.output_dir, + f"{args.prefix}_graphs.pkl")) + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/combine_all_spatial_features.py b/bin/combine_all_spatial_features.py new file mode 100755 index 0000000..c25806c --- /dev/null +++ b/bin/combine_all_spatial_features.py @@ -0,0 +1,95 @@ +#!/usr/bin/env python3 +import os +import pandas as pd +import argparse +from os import path +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from 
pathlib import Path + + +def get_args(): + # Script description + description = """Combining all computed spatial features""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--graph_features", type=str, + help="Path to tab-separated file with the graph features", default="") + + parser.add_argument("--clustering_features", type=str, + help="Path to tab-separated file with the graph features", default="") + parser.add_argument("--metadata_path", type=str, + help="Path to tab-separated file with metadata", default="") + parser.add_argument("--is_tcga", type=int, + help="dataset is from TCGA (default: True)", default=True, required=False) + parser.add_argument("--merge_var", type=str, + help="Variable to merge metadata and computed features on", default="slide_submitter_id") + parser.add_argument("--sheet_name", type=str, + help="Name of sheet for merging in case a path to xls(x) file is given for metadata_path", default=None) + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + return arg + + +def combine_all_spatial_features(graph_features, clustering_features, metadata_path="", is_TCGA=False, merge_var="slide_submitter_id", sheet_name=None): + """ + Combine network and clustering features into a single file. 
If metadata_path is not None, add the metadata as well, based on variable slide_submitter_id + + Args: + output_dir (str): directory containing the graph and clustering features + slide_type (str): slide type to identify correct files for merging, either "FF" or "FFPE" (default="FF") + metadata_path (str): path to file containing metadata + is_TCGA (bool): whether data is from TCGA + merge_var (str): variable on which to merge (default: slide_submitter_id) + + """ + all_features_graph = pd.read_csv(graph_features, sep=",", index_col=False) + all_features_clustering = pd.read_csv(clustering_features, sep=",", index_col=False) + + all_features_combined = pd.merge( + all_features_graph, all_features_clustering) + + # Add additional identifiers for TCGA + if is_TCGA: + all_features_combined["TCGA_patient_ID"] = all_features_combined.slide_submitter_id.str[0:12] + all_features_combined["TCGA_sample_ID"] = all_features_combined.slide_submitter_id.str[0:15] + all_features_combined["sample_submitter_id"] = all_features_combined.slide_submitter_id.str[0:16] + + # Add metadata if available + if path.isfile(metadata_path): + file_extension = metadata_path.split(".")[-1] + if (file_extension.startswith("xls")): + if sheet_name is None: + metadata = pd.read_excel(metadata_path) + elif (file_extension == "txt") or (file_extension == "csv"): + metadata = pd.read_csv(metadata_path, sep="\t") + all_features_combined = pd.merge( + all_features_combined, metadata, on=merge_var, how="left") + return all_features_combined + +def main(args): + print("Post-processing: combining all features") + combine_all_spatial_features(args.graph_features, args.clustering_features, args.metadata_path, args.is_tcga, args.merge_var, args.sheet_name).to_csv( + Path(args.output_dir, f"{args.prefix}_all_features_combined.csv"), sep="\t", index=False) + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/combine_clustering_features.py b/bin/combine_clustering_features.py new file mode 100755 index 0000000..163fd71 --- /dev/null +++ b/bin/combine_clustering_features.py @@ -0,0 +1,76 @@ +#!/usr/bin/env python3 +import os +import pandas as pd +import argparse +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Combine all clustering features""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + + + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + + parser.add_argument("--frac_high_wide", type=str, + help="Path to csv", default="") + + parser.add_argument("--num_clust_slide_wide", type=str, + help="Path to csv", default="") + + parser.add_argument("--all_prox_df_wide", type=str, + help="Path to csv", default="") + parser.add_argument("--prox_indiv_schc_combined_wide", type=str, + help="Path to csv", default="") + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + 
os.mkdir(arg.output_dir) + return arg + +def combine_clustering_features(frac_high_wide, + num_clust_slide_wide, + all_prox_df_wide, + prox_indiv_schc_combined_wide): + + frac_high_wide = pd.read_csv(frac_high_wide, index_col = False, header = 0) + num_clust_slide_wide = pd.read_csv(num_clust_slide_wide, index_col = False, header = 0) + + all_prox_df_wide = pd.read_csv(all_prox_df_wide, index_col = False, header =0) + prox_indiv_schc_combined_wide = pd.read_csv(prox_indiv_schc_combined_wide, index_col = False, header = 0, sep = "\t") + + + # Store features + all_features = pd.merge(frac_high_wide, num_clust_slide_wide, on=[ + "slide_submitter_id"]) + all_features = pd.merge(all_features, all_prox_df_wide) + all_features = pd.merge(all_features, prox_indiv_schc_combined_wide) + # all_features = pd.merge(all_features, shape_feature_means_wide) + + return all_features + +def main(args): + + combine_clustering_features(frac_high_wide = args.frac_high_wide, + num_clust_slide_wide = args.num_clust_slide_wide, all_prox_df_wide = args.all_prox_df_wide, prox_indiv_schc_combined_wide = args.prox_indiv_schc_combined_wide).to_csv( + Path(args.output_dir, f"{args.prefix}_clustering_features.csv"), index=False) + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/combine_network_features.py b/bin/combine_network_features.py new file mode 100755 index 0000000..40d2a74 --- /dev/null +++ b/bin/combine_network_features.py @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +import argparse +from argparse import ArgumentParser as AP +import os +import pandas as pd +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Combine all network Features""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--all_largest_cc_sizes_wide", type=str, + help="Path to csv file ", required=True) + parser.add_argument("--shortest_paths_wide", type=str, + help="Path to csv file ", required=True) + parser.add_argument("--colocalization_wide", type=str, + help="Path to csv file ", required=True) + + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + + return arg + +def combine_network_features(all_largest_cc_sizes_wide, shortest_paths_wide, colocalization_wide): + """ + Compute network features + 1. effect sizes based on difference in node degree between simulated slides and actual slide + 2. fraction largest connected component + 3. number of shortest paths with a max length. 
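+    As a toy illustration of item 2 above (an assumption-laden sketch, not the pipeline's own
+    features.determine_lcc): given a tile graph whose nodes carry a boolean label for one cell
+    type, the normalized LCC size is the size of the largest connected component among labeled
+    nodes divided by the number of labeled nodes:
+
+        import networkx as nx
+
+        def lcc_fraction(graph, attr="is_T_cells"):
+            nodes = [n for n, d in graph.nodes(data=True) if d.get(attr)]
+            if not nodes:
+                return 0.0
+            sub = graph.subgraph(nodes)
+            return len(max(nx.connected_components(sub), key=len)) / len(nodes)
+
+        g = nx.Graph()
+        g.add_nodes_from([(1, {"is_T_cells": True}), (2, {"is_T_cells": True}),
+                          (3, {"is_T_cells": True}), (4, {"is_T_cells": False})])
+        g.add_edges_from([(1, 2), (2, 4), (4, 3)])
+        lcc_fraction(g)  # 2/3: tiles 1 and 2 form the largest T-cell component
+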
+ + Args: + tile_quantification_path (str) + output_dir (str) + slide_type (str): type of slide either 'FF' or 'FFPE' + cell_types (list): list of cell types + graphs_path (str): path to pkl file with generated graphs [optional] + abundance_threshold (float): threshold for assigning cell types to tiles based on the predicted probability (default=0.5) + shapiro_alpha (float): significance level for shapiro tests for normality (default=0.05) + cutoff_path_length (int): max. length of shortest paths (default=2) + + Returns: + all_effect_sizes (DataFrame): dataframe containing the slide_submitter_id, center, neighbor, effect_size (Cohen's d), Tstat, pval, and the pair (string of center and neighbor) + all_sims_nd (DataFrame): dataframe containing slide_submitter_id, center, neighbor, simulation_nr and degree (node degree) + all_mean_nd_df (DataFrame): dataframe containing slide_submitter_id, center, neighbor, mean_sim (mean node degree across the N simulations), mean_obs + all_largest_cc_sizes (DataFrame): dataframe containing slide_submitter_id, cell type and type_spec_frac (fraction of LCC w.r.t. all tiles for cell type) + shortest_paths_slide (DataFrame): dataframe containing slide_submitter_id, source, target, pair and n_paths (number of shortest paths for a pair) + all_dual_nodes_frac (DataFrame): dataframe containing slide_submitter_id, pair, counts (absolute) and frac + + """ + all_largest_cc_sizes_wide = pd.read_csv(all_largest_cc_sizes_wide, index_col = False, sep = "\t") + shortest_paths_wide = pd.read_csv(shortest_paths_wide, index_col = False, sep = "\t") + colocalization_wide = pd.read_csv(colocalization_wide, index_col = False, sep = "\t") + + + all_features = pd.merge(all_largest_cc_sizes_wide, shortest_paths_wide) + all_features = pd.merge(all_features, colocalization_wide) + return (all_features) + +def main(args): + combine_network_features(args.all_largest_cc_sizes_wide, args.shortest_paths_wide, args.colocalization_wide + ).to_csv(Path(args.output_dir, f"{args.prefix}_all_graph_features.csv"), index = False) + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_colocalization.py b/bin/compute_colocalization.py new file mode 100755 index 0000000..3348742 --- /dev/null +++ b/bin/compute_colocalization.py @@ -0,0 +1,155 @@ +#!/usr/bin/env python3 +import argparse +from argparse import ArgumentParser as AP +import os +import joblib +import pandas as pd +from joblib import Parallel, delayed +import argparse +import multiprocessing +# Own modules +import features.features as features +import features.graphs as graphs +import features.utils as utils +from model.constants import DEFAULT_CELL_TYPES + +from os.path import abspath +import time +from pathlib import Path + + +def get_args(): + # Script description + description = """Compute Spatial Network Features: Compute co-localization""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--tile_quantification_path", type=str, + help="Path to csv file with tile-level quantification (predictions)", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default="") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) 
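+    # Note: --cell_types_path is optional; when it is omitted the scripts fall back to
+    # DEFAULT_CELL_TYPES from model.constants. A minimal, hypothetical loader for such a file
+    # (one cell type per line) could look like the commented sketch below; the exact file
+    # format is an assumption and is not fixed by this script:
+    #
+    #   def read_cell_types(cell_types_path):
+    #       if cell_types_path is None:
+    #           return None  # let downstream code use DEFAULT_CELL_TYPES
+    #       with open(cell_types_path) as f:
+    #           return [line.strip() for line in f if line.strip()]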
+ parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + + parser.add_argument("--n_cores", type=int, + help="Number of cores to use (parallelization)") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir, "process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + +def compute_colocalization(tile_quantification_path, + cell_types=None, + graphs_path=None, + abundance_threshold=0.5, n_cores=multiprocessing.cpu_count()): + """ + Compute network features: co-localization + + Args: + tile_quantification_path (str) + output_dir (str) + cell_types (list): list of cell types + graphs_path (str): path to pkl file with generated graphs [optional] + abundance_threshold (float): threshold for assigning cell types to tiles based on the predicted probability (default=0.5) + + Returns: + all_dual_nodes_frac (DataFrame): dataframe containing slide_submitter_id, pair, counts (absolute) and frac + + """ + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + predictions = pd.read_csv(tile_quantification_path, sep="\t") + slide_submitter_ids = list(set(predictions.slide_submitter_id)) + + ##################################### + # ---- Constructing the graphs ---- # + ##################################### + + # TODO use 'generate_graphs.py' for this to replace + if graphs_path is None: + results = Parallel(n_jobs=n_cores)( + delayed(graphs.construct_graph)( + predictions=predictions, slide_submitter_id=id) + for id in slide_submitter_ids + ) + # Extract/format graphs + all_graphs = { + list(slide_graph.keys())[0]: list(slide_graph.values())[0] + for slide_graph in results + } + else: + all_graphs = joblib.load(graphs_path) + + ####################################################### + # ---- Compute connectedness and co-localization ---- # + ####################################################### + + all_dual_nodes_frac = [] + for id in slide_submitter_ids: + slide_data = utils.get_slide_data(predictions, id) + node_cell_types = utils.assign_cell_types( + slide_data=slide_data, cell_types=cell_types, threshold=abundance_threshold) + + dual_nodes_frac = features.compute_dual_node_fractions( + node_cell_types, cell_types) + dual_nodes_frac["slide_submitter_id"] = id + all_dual_nodes_frac.append(dual_nodes_frac) + + all_dual_nodes_frac = pd.concat(all_dual_nodes_frac, axis=0) + + colocalization_wide = all_dual_nodes_frac.pivot( + index=["slide_submitter_id"], columns="pair")["frac"] + new_cols = [ + f'Co-loc {col.replace("_", " ")} clusters' for col in colocalization_wide.columns] + colocalization_wide.columns = new_cols + colocalization_wide = colocalization_wide.reset_index() + + return (all_dual_nodes_frac, colocalization_wide, all_graphs) + + +def main(args): + all_dual_nodes_frac, colocalization_wide, all_graphs = compute_colocalization( + 
tile_quantification_path=args.tile_quantification_path, + cell_types=args.cell_types_path, + graphs_path=args.graphs_path, + abundance_threshold=args.abundance_threshold, + n_cores=args.n_cores) + + all_dual_nodes_frac.to_csv( + Path(args.output_dir, + f"{args.prefix}_features_coloc_fraction.csv"), sep="\t", index=False) + + colocalization_wide.to_csv( + Path(args.output_dir, + f"{args.prefix}_features_coloc_fraction_wide.csv"), sep="\t", index=False) + + if (args.graphs_path is None): + joblib.dump(all_graphs, + Path(args.output_dir, + f"{args.prefix}_graphs.pkl")) + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_connectedness.py b/bin/compute_connectedness.py new file mode 100755 index 0000000..5caf7cd --- /dev/null +++ b/bin/compute_connectedness.py @@ -0,0 +1,153 @@ +#!/usr/bin/env python3 +import argparse +from argparse import ArgumentParser as AP +import os +import sys +import joblib +import pandas as pd +from joblib import Parallel, delayed +import argparse +from os import path +import multiprocessing +# Own modules +import features.clustering as clustering +import features.features as features +import features.graphs as graphs +import features.utils as utils +from model.constants import DEFAULT_SLIDE_TYPE, DEFAULT_CELL_TYPES, NUM_CORES, METADATA_COLS + +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Compute Spatial Network Features: Compute LCC""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--tile_quantification_path", type=str, + help="Path to csv file with tile-level quantification (predictions)", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + parser.add_argument("--n_cores", type = int, help = "Number of cores to use (parallelization)") + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + + +def compute_connectedness(tile_quantification_path, + cell_types=None, + graphs_path=None, + abundance_threshold=0.5, + n_cores = multiprocessing.cpu_count()): + """ + Compute network features: LCC + Normalized size of the largest connected component (LCC) for cell type A. 
This is defined as the largest set of nodes of cell type A + connected with at least one path between every pair of nodes, + divided by the total number of nodes of cell type A. Nodes are assigned a cell type label as described above. + + Args: + tile_quantification_path (str) + cell_types (list): list of cell types + graphs_path (str): path to pkl file with generated graphs [optional] + abundance_threshold (float): threshold for assigning cell types to tiles based on the predicted probability (default=0.5) + + Returns: + all_largest_cc_sizes (DataFrame): dataframe containing slide_submitter_id, cell type and type_spec_frac (fraction of LCC w.r.t. all tiles for cell type) + all_graphs (dict): dictionary with graphs [only if not existing] + + """ + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + predictions = pd.read_csv(tile_quantification_path, sep="\t") + slide_submitter_ids = list(set(predictions.slide_submitter_id)) + + ##################################### + # ---- Constructing the graphs ---- # + ##################################### + + if graphs_path is None: + results = Parallel(n_jobs=n_cores)( + delayed(graphs.construct_graph)( + predictions=predictions, slide_submitter_id=id) + for id in slide_submitter_ids + ) + # Extract/format graphs + all_graphs = { + list(slide_graph.keys())[0]: list(slide_graph.values())[0] + for slide_graph in results + } + else: + all_graphs = joblib.load(graphs_path) + + ####################################################### + # ---- Compute connectedness and co-localization ---- # + ####################################################### + + all_largest_cc_sizes = [] + for id in slide_submitter_ids: + slide_data = utils.get_slide_data(predictions, id) + node_cell_types = utils.assign_cell_types( + slide_data=slide_data, cell_types=cell_types, threshold=abundance_threshold) + lcc = features.determine_lcc( + graph=all_graphs[id], cell_type_assignments=node_cell_types, cell_types=cell_types + ) + lcc["slide_submitter_id"] = id + all_largest_cc_sizes.append(lcc) + + + all_largest_cc_sizes = pd.concat(all_largest_cc_sizes, axis=0) + all_largest_cc_sizes = all_largest_cc_sizes.reset_index(drop=True) + all_largest_cc_sizes_wide = all_largest_cc_sizes.pivot( + index=["slide_submitter_id"], columns="cell_type")["type_spec_frac"] + new_cols = [ + f'LCC {col.replace("_", " ")} clusters' for col in all_largest_cc_sizes_wide.columns] + all_largest_cc_sizes_wide.columns = new_cols + all_largest_cc_sizes_wide = all_largest_cc_sizes_wide.reset_index() + + return(all_largest_cc_sizes_wide, all_graphs) + +def main(args): + all_largest_cc_sizes_wide, all_graphs = compute_connectedness( + tile_quantification_path=args.tile_quantification_path, + cell_types=args.cell_types_path, + graphs_path=args.graphs_path, + abundance_threshold=args.abundance_threshold, n_cores=args.n_cores) + all_largest_cc_sizes_wide.to_csv( + Path(args.output_dir, + f"{args.prefix}_features_lcc_fraction_wide.csv"), + sep="\t", + index=False) + + if (args.graphs_path is None): + joblib.dump(all_graphs, + Path(args.output_dir, + f"{args.prefix}_graphs.pkl")) + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_frac_high.py b/bin/compute_frac_high.py new file mode 100755 index 0000000..110f658 --- /dev/null +++ b/bin/compute_frac_high.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python3 +import os +import pandas as pd +import argparse + +# Own modules +import 
features.features as features
+
+
+from argparse import ArgumentParser as AP
+from os.path import abspath
+import time
+from pathlib import Path
+
+def get_args():
+    # Script description
+    description = """Compute Clustering Features"""
+
+    # Add parser
+    parser = AP(description=description,
+                formatter_class=argparse.RawDescriptionHelpFormatter)
+    parser.add_argument("--slide_indiv_clusters_labeled", type=str,
+                        help="Path to csv file", required=True)
+    parser.add_argument("--output_dir", type=str,
+                        help="Path to output folder to store generated files", required=False, default="")
+    parser.add_argument("--prefix", type=str,
+                        help="Prefix for output file", default="")
+    parser.add_argument("--version", action="version", version="0.1.0")
+    arg = parser.parse_args()
+    arg.output_dir = abspath(arg.output_dir)
+
+    if arg.output_dir != "" and not os.path.isdir(arg.output_dir):
+        # Create the output folder if it doesn't exist yet
+        arg.output_dir = Path(arg.output_dir, "process_train")
+        os.mkdir(arg.output_dir)
+
+    return arg
+
+def compute_frac_high(slide_indiv_clusters_labeled):
+    """
+    Compute clustering features: fraction of clusters labeled 'high' per cell type.
+
+    Args:
+        slide_indiv_clusters_labeled: dataframe with the labeled clusters based on individual cell type SCHC
+
+    Returns:
+        frac_high_wide (DataFrame)
+
+    """
+
+    slide_indiv_clusters_labeled = pd.read_csv(slide_indiv_clusters_labeled)
+
+    # Count the fraction of 'high' clusters
+    frac_high = features.n_high_clusters(slide_indiv_clusters_labeled)
+
+    frac_high_sub = frac_high[frac_high["is_high"]].copy()
+    frac_high_sub = frac_high_sub.drop(
+        columns=["is_high", "n_clusters", "n_total_clusters"])
+
+    frac_high_wide = frac_high_sub.pivot(index=["slide_submitter_id"], columns=[
+        "cell_type_map"])["fraction"]
+    new_cols = [('fraction {0} clusters labeled high'.format(col))
+                for col in frac_high_wide.columns]
+    frac_high_wide.columns = new_cols
+    frac_high_wide = frac_high_wide.sort_index(axis="columns").reset_index()
+    return frac_high_wide
+
+def main(args):
+    compute_frac_high(
+        slide_indiv_clusters_labeled=args.slide_indiv_clusters_labeled).to_csv(
+        Path(args.output_dir, f"{args.prefix}_frac_high_wide.csv"), index=False)
+
+if __name__ == "__main__":
+    args = get_args()
+    st = time.time()
+    main(args)
+    rt = time.time() - st
+    print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s")
diff --git a/bin/compute_nclusters.py b/bin/compute_nclusters.py
new file mode 100755
index 0000000..25c33cf
--- /dev/null
+++ b/bin/compute_nclusters.py
@@ -0,0 +1,115 @@
+#!/usr/bin/env python3
+import os
+import sys
+import joblib
+import pandas as pd
+from joblib import Parallel, delayed
+import argparse
+from os import path
+
+# Own modules
+import features.clustering as clustering
+import features.features as features
+import features.graphs as graphs
+import features.utils as utils
+from model.constants import DEFAULT_SLIDE_TYPE, DEFAULT_CELL_TYPES, METADATA_COLS
+
+
+import multiprocessing
+
+from argparse import ArgumentParser as AP
+from os.path import abspath
+import time
+from pathlib import Path
+
+
+
+def get_args():
+    # Script description
+    description = """Compute Clustering Features: Number of clusters per cell type"""
+
+    # Add parser
+    parser = AP(description=description,
+                formatter_class=argparse.RawDescriptionHelpFormatter)
+    parser.add_argument("--all_slide_clusters_characterized", type=str,
+                        help="Path to csv file", required=True)
+    parser.add_argument("--output_dir", type=str,
+                        help="Path to output folder to store generated files", required=False,
default = "") + parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--cutoff_path_length", type=int, + help="Max path length for proximity based on graphs", default=2, required=False) + parser.add_argument("--shapiro_alpha", type=float, + help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + + parser.add_argument("--n_clusters", type=int, + help="Number of clusters for SCHC (default: 8)", required=False, default=8) + parser.add_argument("--n_cores", type = int, help = "Number of cores to use (parallelization)") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + + +def compute_nclusters(all_slide_clusters_characterized, cell_types = DEFAULT_CELL_TYPES): + + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + # # Assign a cell type label based on the mean of all cluster means across all slides + # all_slide_clusters_characterized = clustering.characterize_clusters( + # tiles_all_schc) + all_slide_clusters_characterized = pd.read_csv(all_slide_clusters_characterized, + sep = ",", header = 0) + + + # Count the number of clusters per cell type for each slide + num_clust_by_slide = features.n_clusters_per_cell_type( + all_slide_clusters_characterized, cell_types=cell_types) + + num_clust_by_slide_sub = num_clust_by_slide.copy() + num_clust_by_slide_sub = num_clust_by_slide_sub.drop( + columns=["is_assigned", "n_clusters"]) + + num_clust_slide_wide = num_clust_by_slide_sub.pivot( + index=["slide_submitter_id"], columns=["cell_type"])["fraction"] + new_cols = [('fraction {0} clusters'.format(col)) + for col in num_clust_slide_wide.columns] + num_clust_slide_wide.columns = new_cols + num_clust_slide_wide = num_clust_slide_wide.sort_index( + axis="columns").reset_index() + return num_clust_slide_wide + + +def main(args): + num_clust_slide_wide = compute_nclusters( + all_slide_clusters_characterized = args.all_slide_clusters_characterized, + cell_types=args.cell_types_path) + + num_clust_slide_wide.to_csv( + Path(args.output_dir, f"{args.prefix}_nclusters_wide.csv", sep = "\t", index = False)) + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_node_degree_with_es.py b/bin/compute_node_degree_with_es.py new file mode 100755 index 0000000..7c9ff0a --- /dev/null +++ 
b/bin/compute_node_degree_with_es.py @@ -0,0 +1,197 @@ +#!/usr/bin/env python3 +import argparse +from argparse import ArgumentParser as AP +import os +import joblib +import pandas as pd +from joblib import Parallel, delayed +import multiprocessing +# Own modules +import features.features as features +import features.graphs as graphs +import features.utils as utils +from model.constants import DEFAULT_CELL_TYPES + +from os.path import abspath +import time +from pathlib import Path + + +def get_args(): + # Script description + description = """Spatial Network (graph) features: Compute 'mean_ND' and 'ND_effsize' features""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--tile_quantification_path", type=str, + help="Path to csv file with tile-level quantification (predictions)", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default="") + parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--cutoff_path_length", type=int, + help="Max path length for proximity based on graphs", default=2, required=False) + parser.add_argument("--shapiro_alpha", type=float, + help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + + parser.add_argument("--n_clusters", type=int, + help="Number of clusters for SCHC (default: 8)", required=False, default=8) + parser.add_argument("--max_dist", type=int, + help="Maximum distance between clusters", required=False, default=None) + parser.add_argument("--max_n_tiles_threshold", type=int, + help="Number of tiles for computing max. 
distance between two points in two different clusters", default=2, required=False) + parser.add_argument("--tile_size", type=int, + help="Size of tile (default: 512)", default=512, required=False) + parser.add_argument("--overlap", type=int, + help="Overlap of tiles (default: 50)", default=50, required=False) + + parser.add_argument("--metadata_path", type=str, + help="Path to tab-separated file with metadata", default="") + parser.add_argument("--is_TCGA", type=bool, + help="dataset is from TCGA (default: True)", default=True, required=False) + parser.add_argument("--merge_var", type=str, + help="Variable to merge metadata and computed features on", default=None) + parser.add_argument("--sheet_name", type=str, + help="Name of sheet for merging in case a path to xls(x) file is given for metadata_path", default=None) + parser.add_argument("--n_cores", type=int, + help="Number of cores to use (parallelization)") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir, "process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + +def compute_node_degree_with_es(tile_quantification_path, + cell_types=None, + graphs_path=None, + shapiro_alpha=0.05, + n_cores=multiprocessing.cpu_count()): + """ + Compute network features + 'mean_ND': Average number of neighbor nodes of cell type B surrounding nodes of cell type A. Nodes are assigned a cell type label if the predicted probability of that node (tile) for the given cell type is higher than 0.5. + 'ND_effsize': Cohen's d measure of effect size computed comparing the mean_ND(A,B) with the null distribution obtained by recomputing the mean_ND randomly assigning the A or B cell type label to each node preserving the total number of cell type A and B nodes in the network. For a negative effect size, the true average mean_ND(A,B) is larger than the simulated average mean_ND(A,B) meaning that the two cell types in the actual slide are closer together compared to a random distribution of these two cell types. Vice versa for a positive effect size. Nodes are assigned a cell type label as described above. 
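+    As a rough illustration of the effect size for one (center, neighbor) pair (a sketch under
+    assumed variable names; the pipeline's own computation lives in features.compute_effect_size):
+
+        import numpy as np
+
+        sim_nd = np.array([1.8, 2.1, 2.0, 1.9])  # mean_ND over N random label reassignments
+        obs_nd = 2.6                              # mean_ND observed in the actual slide
+        effect_size = (sim_nd.mean() - obs_nd) / sim_nd.std(ddof=1)
+        # negative here, i.e. the two cell types sit closer together than expected by chance
+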
+ + + Args: + tile_quantification_path (str): path to tile quantification path (csv) + cell_types (list): list of cell types (or path to csv file with cell types) + graphs_path (str): path to pkl file with generated graphs [optional] + shapiro_alpha (float): significance level for shapiro tests for normality (default=0.05) + + Returns: + all_effect_sizes (DataFrame): dataframe containing the slide_submitter_id, center, neighbor, effect_size (Cohen's d), Tstat, pval, and the pair (string of center and neighbor) + all_sims_nd (DataFrame): dataframe containing slide_submitter_id, center, neighbor, simulation_nr and degree (node degree) + all_mean_nd_df (DataFrame): dataframe containing slide_submitter_id, center, neighbor, mean_sim (mean node degree across the N simulations), mean_obs + + """ + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + predictions = pd.read_csv(tile_quantification_path, sep="\t") + slide_submitter_ids = list(set(predictions.slide_submitter_id)) + + ##################################### + # ---- Constructing the graphs ---- # + ##################################### + # TODO use 'generate_graphs.py' for this to replace + if graphs_path is None: + results = Parallel(n_jobs=n_cores)( + delayed(graphs.construct_graph)( + predictions=predictions, slide_submitter_id=id) + for id in slide_submitter_ids + ) + # Extract/format graphs + all_graphs = { + list(slide_graph.keys())[0]: list(slide_graph.values())[0] + for slide_graph in results + } + else: + all_graphs = joblib.load(graphs_path) + + ############################################### + # ---- Compute ES based on ND difference ---- # + ############################################### + nd_results = Parallel(n_jobs=n_cores)(delayed(features.node_degree_wrapper)( + all_graphs[id], predictions, id) for id in slide_submitter_ids) + nd_results = list(filter(lambda id: id is not None, nd_results)) + + # Format results + all_sims_nd = [] + all_mean_nd_df = [] + example_simulations = {} + + for sim_assignments, sim, mean_nd_df in nd_results: + all_mean_nd_df.append(mean_nd_df) + all_sims_nd.append(sim) + example_simulations.update(sim_assignments) + + all_sims_nd = pd.concat(all_sims_nd, axis=0).reset_index() + all_mean_nd_df = pd.concat(all_mean_nd_df).reset_index(drop=True) + + # Testing normality + shapiro_tests = Parallel(n_jobs=n_cores)(delayed(utils.test_normality)( + sims_nd=all_sims_nd, + slide_submitter_id=id, + alpha=shapiro_alpha, + cell_types=cell_types) for id in all_sims_nd.slide_submitter_id.unique()) + all_shapiro_tests = pd.concat(shapiro_tests, axis=0) + + # Computing Cohen's d effect size and perform t-test + effect_sizes = Parallel(n_jobs=n_cores)(delayed(features.compute_effect_size)( + all_mean_nd_df, all_sims_nd, slide_submitter_id) for slide_submitter_id in all_sims_nd.slide_submitter_id.unique()) + all_effect_sizes = pd.concat(effect_sizes, axis=0) + all_effect_sizes["pair"] = [ + f"{c}-{n}" for c, n in all_effect_sizes[["center", "neighbor"]].to_numpy()] + + return all_effect_sizes, all_sims_nd, all_mean_nd_df, example_simulations, all_shapiro_tests, all_graphs + + +def main(args): + all_effect_sizes, all_sims_nd, all_mean_nd_df, example_simulations, all_shapiro_tests, all_graphs = compute_node_degree_with_es( + tile_quantification_path=args.tile_quantification_path, + cell_types=args.cell_types_path, + graphs_path=args.graphs_path, + shapiro_alpha=args.shapiro_alpha, + n_cores=args.n_cores + ) + + all_effect_sizes.to_csv( + Path(args.output_dir, f"{args.prefix}_features_ND_ES.csv"), sep="\t", 
index=False) + all_sims_nd.to_csv( + Path(args.output_dir, f"{args.prefix}_features_ND_sims.csv"), sep="\t", index=False) + all_mean_nd_df.to_csv( + Path(args.output_dir, f"{args.prefix}_features_ND.csv"), sep="\t", index=False) + joblib.dump(example_simulations, + Path(args.output_dir, f"{args.prefix}_features_ND_sim_assignments.pkl")) + all_shapiro_tests.to_csv( + Path(args.output_dir, f"{args.prefix}_shapiro_tests.csv"), index=False, sep="\t") + + if (args.graphs_path is None): + joblib.dump(all_graphs, + Path(args.output_dir, + f"{args.prefix}_graphs.pkl")) + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_nshortest_with_max_length.py b/bin/compute_nshortest_with_max_length.py new file mode 100755 index 0000000..78ef8ef --- /dev/null +++ b/bin/compute_nshortest_with_max_length.py @@ -0,0 +1,191 @@ +#!/usr/bin/env python3 +import argparse +from argparse import ArgumentParser as AP +import os +import sys +import joblib +import pandas as pd +from joblib import Parallel, delayed +import argparse +from os import path +import multiprocessing +# Own modules +import features.clustering as clustering +import features.features as features +import features.graphs as graphs +import features.utils as utils +from model.constants import DEFAULT_SLIDE_TYPE, DEFAULT_CELL_TYPES, NUM_CORES, METADATA_COLS + +from os.path import abspath +import time +from pathlib import Path + + +def get_args(): + # Script description + description = """Compute Spatial Network Features: Compute Connectedness""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--tile_quantification_path", type=str, + help="Path to csv file with tile-level quantification (predictions)", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default="") + parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--cutoff_path_length", type=int, + help="Max path length for proximity based on graphs", default=2, required=False) + parser.add_argument("--shapiro_alpha", type=float, + help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + + parser.add_argument("--n_clusters", type=int, + help="Number of clusters for SCHC (default: 8)", required=False, default=8) + parser.add_argument("--n_cores", type=int, + help="Number of cores to use (parallelization)") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an 
empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir, "process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + +def compute_n_shortest_paths_with_max_length(tile_quantification_path, + cell_types=None, + graphs_path=None, + cutoff_path_length=2, + n_cores=multiprocessing.cpu_count()): + """ + Compute network features + 1. effect sizes based on difference in node degree between simulated slides and actual slide + 2. fraction largest connected component + 3. number of shortest paths with a max length. + + Args: + tile_quantification_path (str) + output_dir (str) + slide_type (str): type of slide either 'FF' or 'FFPE' + cell_types (list): list of cell types + graphs_path (str): path to pkl file with generated graphs [optional] + abundance_threshold (float): threshold for assigning cell types to tiles based on the predicted probability (default=0.5) + shapiro_alpha (float): significance level for shapiro tests for normality (default=0.05) + cutoff_path_length (int): max. length of shortest paths (default=2) + + Returns: + all_effect_sizes (DataFrame): dataframe containing the slide_submitter_id, center, neighbor, effect_size (Cohen's d), Tstat, pval, and the pair (string of center and neighbor) + all_sims_nd (DataFrame): dataframe containing slide_submitter_id, center, neighbor, simulation_nr and degree (node degree) + all_mean_nd_df (DataFrame): dataframe containing slide_submitter_id, center, neighbor, mean_sim (mean node degree across the N simulations), mean_obs + all_largest_cc_sizes (DataFrame): dataframe containing slide_submitter_id, cell type and type_spec_frac (fraction of LCC w.r.t. all tiles for cell type) + shortest_paths_slide (DataFrame): dataframe containing slide_submitter_id, source, target, pair and n_paths (number of shortest paths for a pair) + all_dual_nodes_frac (DataFrame): dataframe containing slide_submitter_id, pair, counts (absolute) and frac + + """ + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + predictions = pd.read_csv(tile_quantification_path, sep="\t") + slide_submitter_ids = list(set(predictions.slide_submitter_id)) + + ##################################### + # ---- Constructing the graphs ---- # + ##################################### + + # TODO use 'generate_graphs.py' for this to replace + if graphs_path is None: + results = Parallel(n_jobs=n_cores)( + delayed(graphs.construct_graph)( + predictions=predictions, slide_submitter_id=id) + for id in slide_submitter_ids + ) + # Extract/format graphs + all_graphs = { + list(slide_graph.keys())[0]: list(slide_graph.values())[0] + for slide_graph in results + } + else: + all_graphs = joblib.load(graphs_path) + + ####################################################### + # ---- Compute N shortest paths with max. 
length ---- # + ####################################################### + + results = Parallel(n_jobs=n_cores)( + delayed(features.compute_n_shortest_paths_max_length)( + predictions=predictions, slide_submitter_id=id, graph=all_graphs[ + id], cutoff=cutoff_path_length + ) + for id in slide_submitter_ids + ) + # Formatting and count the number of shortest paths with max length + all_shortest_paths_thresholded = pd.concat(results, axis=0) + all_shortest_paths_thresholded["n_paths"] = 1 + proximity_graphs = ( + all_shortest_paths_thresholded.groupby( + ["slide_submitter_id", "source", "target"] + ) + .sum(numeric_only=True) + .reset_index() + ) + # Post-processing + proximity_graphs["pair"] = [f"{source}-{target}" for source, + target in proximity_graphs[["source", "target"]].to_numpy()] + proximity_graphs = proximity_graphs.drop(columns=["path_length"]) + + # Additional formatting + shortest_paths_wide = proximity_graphs.pivot( + index=["slide_submitter_id"], columns="pair")["n_paths"] + new_cols = [ + f'Prox graph {col.replace("_", " ")} clusters' for col in shortest_paths_wide.columns] + shortest_paths_wide.columns = new_cols + shortest_paths_wide = shortest_paths_wide.reset_index() + + return (proximity_graphs, shortest_paths_wide, all_graphs) + + +def main(args): + proximity_graphs, shortest_paths_wide, all_graphs = compute_n_shortest_paths_with_max_length( + tile_quantification_path=args.tile_quantification_path, + cell_types=args.cell_types_path, + graphs_path=args.graphs_path, cutoff_path_length=args.cutoff_path_length, + n_cores=args.n_cores) + + proximity_graphs.to_csv( + Path(args.output_dir, + f"{args.prefix}_features_shortest_paths_thresholded.csv"), + sep="\t", + index=False) + + shortest_paths_wide.to_csv( + Path(args.output_dir, + f"{args.prefix}_features_shortest_paths_thresholded_wide.csv"), + sep="\t", + index=False) + + if (args.graphs_path is None): + joblib.dump(all_graphs, + Path(args.output_dir, + f"{args.prefix}_graphs.pkl")) + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_proximity_from_indiv_schc_between.py b/bin/compute_proximity_from_indiv_schc_between.py new file mode 100755 index 0000000..74216dc --- /dev/null +++ b/bin/compute_proximity_from_indiv_schc_between.py @@ -0,0 +1,148 @@ +#!/usr/bin/env python3 +import os +import pandas as pd +from joblib import Parallel, delayed +import argparse +import features.features as features +from model.constants import DEFAULT_CELL_TYPES + + +import multiprocessing +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Compute Clustering features Features: Compute proximity (between clusters)""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--slide_clusters", type=str, + help="Path to csv file", required=True) + parser.add_argument("--tiles_schc", type=str, + help="Path to csv file", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types 
(default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--cutoff_path_length", type=int, + help="Max path length for proximity based on graphs", default=2, required=False) + parser.add_argument("--shapiro_alpha", type=float, + help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) + parser.add_argument("--max_dist", type=int, + help="Max dist", required=False, default= None) + + parser.add_argument("--max_n_tiles_threshold", type=int, + help="Max dist", required=False, default= 2) + parser.add_argument("--tile_size", type=int, + help="Max dist", required=False, default= 512) + parser.add_argument("--overlap", type=int, + help="Max dist", required=False, default= 50) + + parser.add_argument("--n_clusters", type=int, + help="Number of clusters for SCHC (default: 8)", required=False, default=8) + parser.add_argument("--n_cores", type = int, help = "Number of cores to use (parallelization)") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + +def compute_proximity_from_indiv_schc_between( + slide_clusters, tiles_schc, + cell_types=None, + n_clusters=8, max_dist=None, + max_n_tiles_threshold=2, + tile_size=512, + overlap=50, + n_cores = multiprocessing.cpu_count()): + + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + all_slide_indiv_clusters = pd.read_csv(slide_clusters, sep = ",", header = 0) + slide_submitter_ids = list(set(all_slide_indiv_clusters.slide_submitter_id)) + + slide_indiv_clusters_labeled = pd.read_csv(tiles_schc, sep = ",", header = 0) + + ########################################################################## + # ---- Compute proximity features (individual cell type clustering) ---- # + ########################################################################## + + # Computing proximity for clusters derived for each cell type individually + # Between clusters + slide_submitter_ids = list(set(slide_indiv_clusters_labeled.slide_submitter_id)) + results_schc_indiv = Parallel(n_jobs=n_cores)(delayed(features.compute_proximity_clusters_pairs)(all_slide_indiv_clusters, + slide_submitter_id=id, method="individual_between", + n_clusters=n_clusters, + cell_types=cell_types, + max_dist=max_dist, + max_n_tiles_threshold=max_n_tiles_threshold, + tile_size=tile_size, + overlap=overlap) for id in slide_submitter_ids) + prox_indiv_schc = pd.concat(results_schc_indiv) + + # Formatting + prox_indiv_schc = pd.merge(prox_indiv_schc, slide_indiv_clusters_labeled, left_on=[ + "slide_submitter_id", "cluster1_label", "cluster1"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) + prox_indiv_schc = prox_indiv_schc.drop( + columns=["cell_type_map", "cluster_label"]) + prox_indiv_schc = prox_indiv_schc.rename( + columns={"is_high": "cluster1_is_high"}) + prox_indiv_schc = pd.merge(prox_indiv_schc, slide_indiv_clusters_labeled, 
left_on=[ + "slide_submitter_id", "cluster2_label", "cluster2"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) + prox_indiv_schc = prox_indiv_schc.rename( + columns={"is_high": "cluster2_is_high"}) + prox_indiv_schc = prox_indiv_schc.drop( + columns=["cell_type_map", "cluster_label"]) + + # Order matters + prox_indiv_schc["ordered_pair"] = [ + f"{i}-{j}" for i, j in prox_indiv_schc[["cluster1_label", "cluster2_label"]].to_numpy()] + prox_indiv_schc["comparison"] = [ + f"cluster1={i}-cluster2={j}" for i, j in prox_indiv_schc[["cluster1_is_high", "cluster2_is_high"]].to_numpy()] + + # Post-processing + results_schc_indiv = pd.concat(Parallel(n_jobs=n_cores)(delayed(features.post_processing_proximity)( + prox_df=prox_indiv_schc, slide_submitter_id=id, method="individual_between") for id in slide_submitter_ids)) + + return results_schc_indiv + +def main(args): + compute_proximity_from_indiv_schc_between( + slide_clusters = args.slide_clusters, + tiles_schc = args.tiles_schc, + cell_types=args.cell_types_path, + n_cores = args.n_cores, + n_clusters=args.n_clusters, + max_dist=args.max_dist, + max_n_tiles_threshold=args.max_n_tiles_threshold, + tile_size=args.tile_size, + overlap=args.overlap).to_csv( + Path(args.output_dir, f"{args.prefix}_features_clust_indiv_schc_prox_between.csv"),index=False) + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_proximity_from_indiv_schc_combine.py b/bin/compute_proximity_from_indiv_schc_combine.py new file mode 100755 index 0000000..73d3953 --- /dev/null +++ b/bin/compute_proximity_from_indiv_schc_combine.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python3 +import os +import pandas as pd +import argparse +import multiprocessing +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Compute Clustering Features: Proximity""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--prox_within", type=str, + help="Path to csv file", required=True) + parser.add_argument("--prox_between", type=str, + help="Path to csv file", required=True) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", + required=False, default = "") + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + + return arg + + +def compute_proximity_from_indiv_schc_combine(prox_within, prox_between): + + results_schc_indiv_within = pd.read_csv(prox_within, header = 0, index_col=False) + results_schc_indiv_between = pd.read_csv(prox_between, header = 0, index_col = False) + + # Concatenate within and between computed proximity values + prox_indiv_schc_combined = pd.concat( + [results_schc_indiv_within, results_schc_indiv_between]) + + # Remove rows with a proximity of NaN + prox_indiv_schc_combined = prox_indiv_schc_combined.dropna(axis=0) + + prox_indiv_schc_combined.comparison = 
prox_indiv_schc_combined.comparison.replace(dict(zip(['cluster1=True-cluster2=True', 'cluster1=True-cluster2=False', + 'cluster1=False-cluster2=True', 'cluster1=False-cluster2=False'], ["high-high", "high-low", "low-high", "low-low"]))) + prox_indiv_schc_combined["pair (comparison)"] = [ + f"{pair} ({comp})" for pair, comp in prox_indiv_schc_combined[["pair", "comparison"]].to_numpy()] + prox_indiv_schc_combined = prox_indiv_schc_combined.drop( + axis=1, labels=["pair", "comparison"]) + + + prox_indiv_schc_combined_wide = prox_indiv_schc_combined.pivot( + index=["slide_submitter_id"], columns=["pair (comparison)"])["proximity"] + new_cols = [ + f'prox CC {col.replace("_", " ")}' for col in prox_indiv_schc_combined_wide.columns] + prox_indiv_schc_combined_wide.columns = new_cols + prox_indiv_schc_combined_wide = prox_indiv_schc_combined_wide.reset_index() + return prox_indiv_schc_combined_wide + +def main(args): + compute_proximity_from_indiv_schc_combine(prox_within=args.prox_within, prox_between= args.prox_between).to_csv( + Path(args.output_dir, + f"{args.prefix}_features_clust_indiv_schc_prox.csv"), sep="\t", + index=False) + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_proximity_from_indiv_schc_within.py b/bin/compute_proximity_from_indiv_schc_within.py new file mode 100755 index 0000000..f2481ed --- /dev/null +++ b/bin/compute_proximity_from_indiv_schc_within.py @@ -0,0 +1,159 @@ +#!/usr/bin/env python3 +import os +import sys +import pandas as pd +from joblib import Parallel, delayed +import argparse + +# Own modules +import features.features as features +from model.constants import DEFAULT_CELL_TYPES + + +import multiprocessing + +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Compute Clustering Features: Proximity (within clusters)""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--slide_clusters", type=str, + help="Path to csv file", required=True) + parser.add_argument("--tiles_schc", type=str, + help="Path to csv file", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--cutoff_path_length", type=int, + help="Max path length for proximity based on graphs", default=2, required=False) + parser.add_argument("--shapiro_alpha", type=float, + help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + 
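+    # The next four arguments mirror the other feature scripts, where they are documented more
+    # fully: --max_dist is the maximum distance between clusters, --max_n_tiles_threshold the
+    # number of tiles used when computing the maximum distance between two points in different
+    # clusters, --tile_size the tile size (default: 512) and --overlap the tile overlap
+    # (default: 50).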
parser.add_argument("--max_dist", type=int, + help="Max dist", required=False, default= None) + + parser.add_argument("--max_n_tiles_threshold", type=int, + help="Max dist", required=False, default= 2) + parser.add_argument("--tile_size", type=int, + help="Max dist", required=False, default= 512) + parser.add_argument("--overlap", type=int, + help="Max dist", required=False, default= 50) + + parser.add_argument("--n_clusters", type=int, + help="Number of clusters for SCHC (default: 8)", required=False, default=8) + parser.add_argument("--n_cores", type = int, help = "Number of cores to use (parallelization)") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + + +def compute_proximity_from_indiv_schc_within( + slide_clusters, tiles_schc, + cell_types=None, + n_clusters=8, max_dist=None, + max_n_tiles_threshold=2, + tile_size=512, + overlap=50, + n_cores = multiprocessing.cpu_count()): + + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + all_slide_indiv_clusters = pd.read_csv(slide_clusters, sep = ",", header = 0, index_col = False) + slide_submitter_ids = list(set(all_slide_indiv_clusters.slide_submitter_id)) + + slide_indiv_clusters_labeled = pd.read_csv(tiles_schc, sep = ",", header = 0, index_col = False) + + ########################################################################## + # ---- Compute proximity features (individual cell type clustering) ---- # + ########################################################################## + + # Computing proximity for clusters derived for each cell type individually + # Between clusters + slide_submitter_ids = list(set(slide_indiv_clusters_labeled.slide_submitter_id)) + + # # Within clusters + + print(cell_types) + + print(all_slide_indiv_clusters.head()) + + results_schc_indiv_within = Parallel(n_jobs=n_cores)(delayed(features.compute_proximity_clusters_pairs)(all_slide_indiv_clusters, + slide_submitter_id=id, + method="individual_within", + n_clusters=n_clusters, + cell_types=cell_types, + max_dist=max_dist, + max_n_tiles_threshold=max_n_tiles_threshold, + tile_size=tile_size, overlap=overlap,) for id in slide_submitter_ids) + prox_indiv_schc_within = pd.concat(results_schc_indiv_within) + + prox_indiv_schc_within = pd.merge(prox_indiv_schc_within, slide_indiv_clusters_labeled, left_on=[ + "slide_submitter_id", "cell_type", "cluster1"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) + prox_indiv_schc_within = prox_indiv_schc_within.drop( + columns=["cluster_label"]) + prox_indiv_schc_within = prox_indiv_schc_within.rename( + columns={"is_high": "cluster1_is_high", "cell_type_map": "cell_type_map1"}) + prox_indiv_schc_within = pd.merge(prox_indiv_schc_within, slide_indiv_clusters_labeled, left_on=[ + "slide_submitter_id", "cell_type", "cluster2"], right_on=["slide_submitter_id", "cell_type_map", "cluster_label"]) + prox_indiv_schc_within = prox_indiv_schc_within.rename( + columns={"is_high": "cluster2_is_high", "cell_type_map": "cell_type_map2"}) + prox_indiv_schc_within = prox_indiv_schc_within.drop( + columns=["cluster_label"]) + + # Order doesn't matter (only same cell type combinations) + 
prox_indiv_schc_within["pair"] = [ + f"{i}-{j}" for i, j in prox_indiv_schc_within[["cell_type_map1", "cell_type_map2"]].to_numpy()] + prox_indiv_schc_within["comparison"] = [ + f"cluster1={sorted([i,j])[0]}-cluster2={sorted([i,j])[1]}" for i, j in prox_indiv_schc_within[["cluster1_is_high", "cluster2_is_high"]].to_numpy()] + + # Post-processing + slide_submitter_ids = list(set(prox_indiv_schc_within.slide_submitter_id)) + results_schc_indiv_within = pd.concat(Parallel(n_jobs=n_cores)(delayed(features.post_processing_proximity)( + prox_df=prox_indiv_schc_within, slide_submitter_id=id, method="individual_within") for id in slide_submitter_ids)) + return results_schc_indiv_within + +def main(args): + compute_proximity_from_indiv_schc_within( + slide_clusters = args.slide_clusters, + tiles_schc = args.tiles_schc, + cell_types=args.cell_types_path, + n_cores = args.n_cores, + n_clusters=args.n_clusters, + max_dist=args.max_dist, max_n_tiles_threshold=args.max_n_tiles_threshold, + tile_size=args.tile_size, + overlap=args.overlap).to_csv( + Path(args.output_dir, f"{args.prefix}_features_clust_indiv_schc_prox_within.csv"),index=False) + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/compute_proximity_from_simultaneous_schc.py b/bin/compute_proximity_from_simultaneous_schc.py new file mode 100755 index 0000000..a4311c3 --- /dev/null +++ b/bin/compute_proximity_from_simultaneous_schc.py @@ -0,0 +1,150 @@ +#!/usr/bin/env python3 +import os +import pandas as pd +from joblib import Parallel, delayed +import argparse +import features.features as features +from model.constants import DEFAULT_CELL_TYPES + +import multiprocessing + +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Compute Spatial Network Features: Compute Connectedness""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--slide_clusters_characterized", type=str, + help="Path to csv file", required=True) + parser.add_argument("--tiles_schc", type=str, + help="Path to csv file", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", required=False, default = "") + parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--graphs_path", type=str, + help="Path to pkl with generated graphs in case this was done before (OPTIONAL) if not specified, graphs will be generated", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--cutoff_path_length", type=int, + help="Max path length for proximity based on graphs", default=2, required=False) + parser.add_argument("--shapiro_alpha", type=float, + help="Choose significance level alpha (default: 0.05)", default=0.05, required=False) + parser.add_argument("--abundance_threshold", type=float, + help="Threshold for assigning cell types based on predicted probability (default: 0.5)", default=0.5, required=False) + parser.add_argument("--max_dist", 
type=int, + help="Maximum distance between tiles for proximity computation (default: None)", required=False, default= None) + parser.add_argument("--max_n_tiles_threshold", type=int, + help="Maximum number of tiles threshold (default: 2)", required=False, default= 2) + parser.add_argument("--tile_size", type=int, + help="Tile size in pixels (default: 512)", required=False, default= 512) + parser.add_argument("--overlap", type=int, + help="Tile overlap in pixels (default: 50)", required=False, default= 50) + parser.add_argument("--n_clusters", type=int, + help="Number of clusters for SCHC (default: 8)", required=False, default=8) + parser.add_argument("--n_cores", type = int, help = "Number of cores to use (parallelization)") + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create the output folder if it does not exist yet + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + +def compute_proximity_from_simultaneous_schc(slide_clusters_characterized, tiles_schc, + cell_types = DEFAULT_CELL_TYPES, + n_clusters=8, + max_dist=None, + max_n_tiles_threshold=2, + tile_size=512, + overlap=50, + n_cores = multiprocessing.cpu_count()): + + all_slide_clusters_characterized = pd.read_csv(slide_clusters_characterized, + sep = ",", header = 0, index_col = 0) + + slide_submitter_ids = list(set(all_slide_clusters_characterized.slide_submitter_id)) + + tiles_all_schc = pd.read_csv(tiles_schc, sep = ",", header = 0, index_col = 0) + + + # Computing proximity for clusters derived with all cell types simultaneously + clusters_all_schc_long = all_slide_clusters_characterized.melt( + id_vars=["slide_submitter_id", "cluster_label"], value_name="is_assigned", var_name="cell_type") + # remove all cell types that are not assigned to the cluster + clusters_all_schc_long = clusters_all_schc_long[clusters_all_schc_long["is_assigned"]] + clusters_all_schc_long = clusters_all_schc_long.drop(columns="is_assigned") + + results_schc_all = Parallel(n_jobs=n_cores)(delayed(features.compute_proximity_clusters_pairs)( + tiles=tiles_all_schc, slide_submitter_id=id, n_clusters=n_clusters, cell_types=cell_types, max_dist=max_dist, max_n_tiles_threshold=max_n_tiles_threshold, tile_size=tile_size, overlap=overlap, method="all") for id in slide_submitter_ids) + prox_all_schc = pd.concat(results_schc_all) + + # Label clusters (a number) with the assigned cell types + prox_all_schc = pd.merge(prox_all_schc, clusters_all_schc_long, left_on=[ + "slide_submitter_id", "cluster1"], right_on=["slide_submitter_id", "cluster_label"]) + prox_all_schc = prox_all_schc.rename( + columns={"cell_type": "cluster1_label"}) + prox_all_schc = prox_all_schc.drop(columns=["cluster_label"]) + + prox_all_schc = pd.merge(prox_all_schc, clusters_all_schc_long, left_on=[ + "slide_submitter_id", "cluster2"], right_on=["slide_submitter_id", "cluster_label"]) + prox_all_schc = prox_all_schc.rename( + columns={"cell_type": "cluster2_label"}) + + # Order doesn't matter: x <-> y + prox_all_schc["pair"] = [f"{sorted([i, j])[0]}-{sorted([i, j])[1]}" for i, + j in prox_all_schc[["cluster1_label", "cluster2_label"]].to_numpy()] + prox_all_schc = prox_all_schc[((prox_all_schc.cluster1 == prox_all_schc.cluster2) & ( + prox_all_schc.cluster2_label != prox_all_schc.cluster1_label)) | (prox_all_schc.cluster1 != prox_all_schc.cluster2)] + + # slides = prox_all_schc[["MFP", "slide_submitter_id"]].drop_duplicates().to_numpy() +
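+    # Note on the filter above: a cluster can be assigned more than one cell type, so pairs are kept
+    # when the two clusters differ, or when the same cluster carries two different cell-type labels;
+    # identical cluster/label combinations are dropped.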
slide_submitter_ids = list(set(prox_all_schc.slide_submitter_id)) + + # Post Processing + results_schc_all = Parallel(n_jobs=n_cores)(delayed(features.post_processing_proximity)( + prox_df=prox_all_schc, slide_submitter_id=id, method="all") for id in slide_submitter_ids) + all_prox_df = pd.concat(results_schc_all) + # Remove rows with a proximity of NaN + all_prox_df = all_prox_df.dropna(axis=0) + + + all_prox_df_wide = all_prox_df.pivot( + index=["slide_submitter_id"], columns=["pair"])["proximity"] + new_cols = [ + f'prox CC {col.replace("_", " ")} clusters' for col in all_prox_df_wide.columns] + all_prox_df_wide.columns = new_cols + all_prox_df_wide = all_prox_df_wide.reset_index() + + return all_prox_df_wide + +def main(args): + compute_proximity_from_simultaneous_schc( + slide_clusters_characterized = args.slide_clusters_characterized, + tiles_schc = args.tiles_schc, + cell_types=args.cell_types_path, + n_cores = args.n_cores, + n_clusters=args.n_clusters, + max_dist=args.max_dist, max_n_tiles_threshold=args.max_n_tiles_threshold, + tile_size=args.tile_size, + overlap=args.overlap).to_csv(Path(args.output_dir, f"{args.prefix}_features_clust_all_schc_prox_wide.csv"), index = False) + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/create_clinical_file.py b/bin/create_clinical_file.py new file mode 100755 index 0000000..8052786 --- /dev/null +++ b/bin/create_clinical_file.py @@ -0,0 +1,217 @@ +#!/usr/bin/env python3 +import argparse +from argparse import ArgumentParser as AP +import os +import os.path +import numpy as np +import pandas as pd + +from os.path import abspath +import time +from pathlib import Path + + +def get_args(): + # Script description + description = """Creating a clinical file for TCGA dataset(s)""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + + # Sections + parser.add_argument( + "--clinical_files_input", + help="Path to either a folder for multiple cancer types or single txt file.", required=False, + default=None + ) + parser.add_argument("--out_file", type=str, required=False, default="generated_clinical_file.txt", + help="Output filename with .txt extension (default='generated_clinical_file.txt')") + parser.add_argument( + "--path_codebook", + help="Path to codebook", + default=None, required=False, type=str + ) + parser.add_argument( + "--output_dir", help="Path to folder for saving all created files", default="", required=False, type=str + ) + parser.add_argument( + "--class_name", + help="Single classname or (b) Path to file with classnames according to codebook.txt (e.g. 
LUAD_T)", default=None, + type=str + ) + parser.add_argument("--class_names_path", + type=str, + help="Path to file with classnames according to codebook.txt", + default=None) + parser.add_argument( + "--tumor_purity_threshold", + help="Integer for filtering tumor purity assessed by pathologists", + default=80, required=False, type=int + ) + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + os.mkdir(arg.output_dir) + return arg + +def handle_single_class(input, class_name, codebook): + input = pd.read_csv(input, sep="\t") + # only keep tissue (remove _T or _N) to check in filename + input["class_name"] = class_name + input["class_id"] = int( + codebook.loc[codebook["class_name"] + == class_name].values[0][1] + ) + return (input) + +def handle_multi_class(input, class_names, codebook): + clinical_file_list = [] + # Combine all clinical raw files based on input + for class_name in class_names: + clinical_file_temp = pd.read_csv( + f"{input}/clinical_file_TCGA_{class_name[:-2]}.tsv", + sep="\t", + ) + # only keep tissue (remove _T or _N) to check in filename + clinical_file_temp["class_name"] = class_name + clinical_file_temp["class_id"] = int( + codebook.loc[codebook["class_name"] + == class_name].values[0][1] + ) + clinical_file_list.append(clinical_file_temp) + clinical_file = pd.concat( + clinical_file_list, axis=0).reset_index(drop=True) + return (clinical_file) + + +def filter_tumor_purity(df, threshold = 80): + # ---- 2) Filter: Availability of tumor purity (percent_tumor_cells) ---- # + # Remove rows with missing tumor purity + df["percent_tumor_cells"] = ( + df["percent_tumor_cells"] + .replace("'--", np.nan, regex=True) + .astype(float) + ) + + # Convert strings to numeric type + df["percent_tumor_cells"] = pd.to_numeric( + df["percent_tumor_cells"] + ) + df = df.dropna(subset=["percent_tumor_cells"]) + df = df.where( + df["percent_tumor_cells"] >= float( + threshold) + ) + return(df) + +def is_valid_class_name_input (input, codebook): + res = None + if (input is not None): + if (input in codebook["class_name"].values): + res = "single" + elif(Path(input)): + res = "multi" + return(res) + + +def create_TCGA_clinical_file( + class_name, + class_names_path, + clinical_files_input, + tumor_purity_threshold=80, + path_codebook=None +): + """ + Create a clinical file based on the slide metadata downloaded from the GDC data portal + 1. Read the files and add classname and id based on codebook_df.txt + 2. Filter tumor purity + 3. Save file + + Args: + class_names (str): single class name e.g. LUAD_T or path to file with class names + clinical_files_input (str): String with path to folder with subfolders pointing to the raw clinical files (slide.tsv) + tumor_purity_threshold (int): default=80 + multi_class_path (str): path to file with class names to be merged into one clinical file + + Returns: + {output_dir}/generated_clinical_file.txt" containing the slide_submitter_id, sample_submitter_id, image_file_name, percent_tumor_cells, class_name, class_id in columns and records (slides) in rows. 
+ + """ + codebook_df = pd.read_csv( + path_codebook, + delim_whitespace=True, + header=None, names=["class_name", "value"] + ) + init_check_single_class =is_valid_class_name_input(input = class_name, codebook= codebook_df) + init_check_multi_class = is_valid_class_name_input(input = class_names_path, codebook = codebook_df) + is_single_class = init_check_single_class == "single" + is_multi_class = init_check_multi_class == "multi" + + passes_input_check = (is_single_class| is_multi_class) & (clinical_files_input is not None) + + if passes_input_check: + if (is_multi_class): # multi class names + class_names = pd.read_csv( + class_names_path, header=None).to_numpy().flatten() + if os.path.isdir(clinical_files_input) & (len(class_names) > 1): + clinical_file = handle_multi_class(input = clinical_files_input, class_names = class_names, codebook=codebook_df) + elif (is_single_class): # single class names + # a) Single class + if os.path.isfile(clinical_files_input): + clinical_file = handle_single_class(input = clinical_files_input, class_name=class_name, codebook = codebook_df) + + clinical_file = filter_tumor_purity(df = clinical_file, threshold= tumor_purity_threshold) + + # ---- 3) Formatting ---- # + clinical_file["image_file_name"] = [ + f"{slide_submitter_id}.{str(slide_id).upper()}.svs" + for slide_submitter_id, slide_id in clinical_file[ + ["slide_submitter_id", "slide_id"] + ].to_numpy() + ] + + clinical_file = clinical_file.dropna(how="all") + clinical_file = clinical_file.drop_duplicates() + clinical_file = clinical_file.drop_duplicates( + subset="slide_submitter_id") + clinical_file = clinical_file[ + [ + "slide_submitter_id", + "sample_submitter_id", + "image_file_name", + "percent_tumor_cells", + "class_name", + "class_id", + ] + ] + clinical_file = clinical_file.dropna(how="any", axis=0) + return clinical_file + + +def main(args): + # Generate clinical file + clinical_file = create_TCGA_clinical_file( + class_name=args.class_name, + class_names_path = args.class_names_path, + tumor_purity_threshold=args.tumor_purity_threshold, + clinical_files_input=args.clinical_files_input, + path_codebook=args.path_codebook, + ) + # Save file + clinical_file.to_csv( + Path(args.output_dir, args.out_file), + index=False, + sep="\t", + ) + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/create_list_avail_img_for_tiling.py b/bin/create_list_avail_img_for_tiling.py new file mode 100755 index 0000000..eafbc6b --- /dev/null +++ b/bin/create_list_avail_img_for_tiling.py @@ -0,0 +1,80 @@ +#!/usr/bin/env python3 +import argparse +import os +import pandas as pd +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Creating list of available slide images that have to be tiled""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + + parser = argparse.ArgumentParser() + parser.add_argument("--slides_folder", help="Set slides folder", default = None) + parser.add_argument("--output_dir", help="Set output folder", default = "") + parser.add_argument("--clinical_file_path", help="Set clinical file path") + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + arg.slides_folder = abspath(arg.slides_folder) + + 
if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + os.mkdir(arg.output_dir) + return arg + + +def create_list_avail_img_for_tiling(slides_folder, clinical_file_path): + """ + Create tiles from slides + Dividing the whole slide images into tiles with a size of 512 x 512 pixels, with an overlap of 50 pixels at a magnification of 20x. In addition, remove blurred and non-informative tiles by using the weighted gradient magnitude. + + Source: + Fu, Y., Jung, A. W., Torne, R. V., Gonzalez, S., Vöhringer, H., Shmatko, A., Yates, L. R., Jimenez-Linan, M., Moore, L., & Gerstung, M. (2020). Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nature Cancer, 1(8), 800–810. https://doi.org/10.1038/s43018-020-0085-8 + + Args: + slides_folder (str): path pointing to folder with all whole slide images (.svs files) + clinical_file_path (str): path pointing to file with clinical file + + Returns: + txt containing list of slides available for tiling + """ + + # Subset images of interest (present in generated clinical file) + clinical_file = pd.read_csv(clinical_file_path, sep="\t", index_col=False) + print(clinical_file) + clinical_file.dropna(how="all", inplace=True) + clinical_file.drop_duplicates(inplace=True) + clinical_file.drop_duplicates(subset="slide_submitter_id", inplace=True) + subset_images = clinical_file.image_file_name.tolist() + print(subset_images) + + # Check if slides are among our data + available_images = os.listdir(slides_folder) + print(available_images) + images_for_tiling = list(set(subset_images) & set(available_images)) + + return(pd.DataFrame([[name.split(".")[0], name] for name in images_for_tiling], columns=["slide_id", "slide_filename"])) + + +def main(args): + list_avail_img = create_list_avail_img_for_tiling(slides_folder=args.slides_folder, + clinical_file_path=args.clinical_file_path) + + list_avail_img.to_csv(Path(args.output_dir, "avail_slides_for_img.csv"), index=False) + print("Generated list of available images for tiling...") + + + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/create_tiles_from_slides.py b/bin/create_tiles_from_slides.py new file mode 100755 index 0000000..0c29369 --- /dev/null +++ b/bin/create_tiles_from_slides.py @@ -0,0 +1,118 @@ +#!/usr/bin/env python3 +import tiffslide as openslide +import os +import argparse +import glob +import numpy as np +from PIL import Image + +from argparse import ArgumentParser as AP +from os.path import abspath +import time +from pathlib import Path + +import DL.image as im + +def get_args(): + # Script description + description = """Creating tiles from a slide""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + + # Sections + parser = argparse.ArgumentParser() + parser.add_argument("--filename_slide", help="Name of slide", default = "") + parser.add_argument("--slides_folder", help="Set slides folder", default = None) + parser.add_argument("--slide_path", help="Path to individual slide", default = None) + parser.add_argument("--output_dir", help="Set output folder", default = "") + parser.add_argument("--clin_path", help="Set clinical file path", default = None) + parser.add_argument("--gradient_mag_filter", help = "Threshold for filtering", default = 20) + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + 
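+    # Illustrative example invocation (hypothetical filenames and paths):
+    #   create_tiles_from_slides.py --filename_slide TCGA-XX-XXXX.svs --slides_folder /data/svs --output_dir . --gradient_mag_filter 20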
arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + arg.output_dir = Path(arg.output_dir, "tiles") + os.mkdir(arg.output_dir) + return arg + +def create_tiles_from_slide(slide_filename, slides_folder, gradient_mag_filter = 20, slide_path = None): + """ + Create tiles from a single slide + Dividing the whole slide images into tiles with a size of 512 x 512 pixels, with an overlap of 50 pixels at a magnification of 20x. In addition, remove blurred and non-informative tiles by using the weighted gradient magnitude. + + Source: + Fu, Y., Jung, A. W., Torne, R. V., Gonzalez, S., Vöhringer, H., Shmatko, A., Yates, L. R., Jimenez-Linan, M., Moore, L., & Gerstung, M. (2020). Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nature Cancer, 1(8), 800–810. https://doi.org/10.1038/s43018-020-0085-8 + + Args: + slide_filename (str): name of slide to use for creating tiles + slides_folder (str): path pointing to folder with all whole slide images (.svs files) + tiles (str): path pointing to folder for storing all created files by script (i.e. .jpg files for the created tiles) + grad_mag_filter (int): remove tiles that are blurred or non-informative based on weighted gradient magnitude (default=20) + + Returns: + jpg files for the created tiles in the specified folder {output_dir}/tiles + + """ + # Accept different file types + if slide_filename.endswith(('.svs', '.ndpi', '.tif')): + if (slide_path is not None): + slide = openslide.OpenSlide(slide_path) + else: + slide = openslide.OpenSlide( + "{}/{}".format(slides_folder, slide_filename)) + slide_name = slide_filename.split(".")[0] + if ( + str(slide.properties["tiff.ImageDescription"]).find( + "AppMag = 40" + ) + != -1 + ): + region_size = 1024 + tile_size = 924 + else: + region_size = 512 + tile_size = 462 + [width, height] = slide.dimensions + + list_of_tiles = [] + for x_coord in range(1, width, tile_size): + for y_coord in range(1, height, tile_size): + slide_region = slide.read_region( + location=(x_coord, y_coord), + level=0, + size=(region_size, region_size), + ) + slide_region_converted = slide_region.convert("RGB") + tile = slide_region_converted.resize( + (512, 512), Image.ANTIALIAS) + grad = im.getGradientMagnitude(np.array(tile)) + unique, counts = np.unique(grad, return_counts=True) + if counts[np.argwhere(unique <= int(gradient_mag_filter))].sum() < 512 * 512 * 0.6: + list_of_tiles.append((tile, slide_name, x_coord, y_coord)) + return(list_of_tiles) + +def main(args): + list_of_tiles = create_tiles_from_slide(slides_folder=args.slides_folder, gradient_mag_filter=args.gradient_mag_filter, slide_filename = args.filename_slide, slide_path= args.slide_path) + n_tiles = len(list_of_tiles) + for tile in list_of_tiles: + tile[0].save( + "{}/{}_{}_{}.jpg".format( + args.output_dir, tile[1], tile[2], tile[3] + ), + "JPEG", + optimize=True, + quality=94, + ) + # Check if all tiles were saved + assert len(glob.glob1(Path(args.output_dir), "*.jpg")) == n_tiles + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/format_tile_data_structure.py b/bin/format_tile_data_structure.py new file mode 100755 index 0000000..a69267e --- /dev/null +++ b/bin/format_tile_data_structure.py @@ -0,0 +1,130 @@ +#!/usr/bin/env python3 + +import tiffslide as openslide +import DL.utils as utils +import glob +import argparse +import os 
+import os.path +import pandas as pd + +from os.path import abspath +import time +from pathlib import Path +from argparse import ArgumentParser as AP + +def get_args(): + # Script description + description = """Format tile data structure""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + + # Sections + parser.add_argument("--slides_folder", help="Set slides folder", default = "") + parser.add_argument("--tiles_folder", help = "Directory with the tiles", default = "") + parser.add_argument("--output_dir", help="Set output folder", default = "") + parser.add_argument("--clin_path", help="Set clinical file path", default = "") + parser.add_argument("--is_tcga", help="Whether the slides are from TCGA: 1 (yes) or 0 (no)", type=int) + + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + os.mkdir(arg.output_dir) + return arg + +def assess_tcga_slide_quality(slide_name, slides_folder): + print("{}/{}".format(slides_folder, slide_name)) + + img = openslide.OpenSlide( + "{}/{}".format(slides_folder, slide_name)) + image_description = str(img.properties["tiff.ImageDescription"]).split("|")[0] + image_description_split = image_description.split(" ") + jpeg_quality = image_description_split[-1] + return([slide_name, "RGB" + jpeg_quality]) + + +def format_tile_data_structure(slides_folder, tiles_folder, output_dir, clinical_file_path, is_tcga=True): + """ + Specifying the tile data structure required to store tiles as TFRecord files (used in convert.py) + + Args: + slides_folder (str): path pointing to folder with all whole slide images (.svs files) + tiles_folder (str): path pointing to folder with the tile images (.jpg files); if empty, output_dir is used + output_dir (str): path pointing to folder for storing all created files by script + clinical_file_path (str): path pointing to formatted clinical file (either generated or manually formatted) + is_tcga (bool): default = True + + Returns: + {output_dir}/file_info_train.txt containing the path to the individual tiles, class name, class id, percent of tumor cells and JPEG quality + + """ + if (tiles_folder == ""): + tiles_folder = output_dir + + clinical_file = pd.read_csv(clinical_file_path, sep="\t") + clinical_file.dropna(how="all", inplace=True) + clinical_file.drop_duplicates(inplace=True) + clinical_file.drop_duplicates(subset="slide_submitter_id", inplace=True) + + # 2) Determine the paths of the jpg tiles + jpg_tile_names = glob.glob1(Path(tiles_folder), "*.jpg") + jpg_tile_paths = [Path(tiles_folder, tile_name) for tile_name in jpg_tile_names] + + # 3) Get corresponding data from the clinical file based on the tile names + jpg_tile_names_stripped = [ + utils.get_slide_submitter_id(jpg_tile_name) for jpg_tile_name in jpg_tile_names + ] + jpg_tile_names_df = pd.DataFrame( + jpg_tile_names_stripped, columns=["slide_submitter_id"] + ) + jpg_tiles_df = pd.merge( + jpg_tile_names_df, clinical_file, on=["slide_submitter_id"], how="left" + ) + # 4) Determine jpeg_quality of slides + slide_quality = [] + if is_tcga: + for slide_name in jpg_tiles_df.image_file_name.unique(): + slide_quality.append(assess_tcga_slide_quality(slide_name = slide_name, slides_folder = slides_folder)) + else: + jpeg_quality = 100 # assuming no loss + slide_quality = [[slide_name, f"RGB{jpeg_quality}"] for slide_name in jpg_tiles_df.image_file_name.unique()] + + slide_quality_df = pd.DataFrame( + slide_quality, columns=["image_file_name", "jpeg_quality"] + ) + jpg_tiles_df = pd.merge( +
jpg_tiles_df, slide_quality_df, on=["image_file_name"], how="left" + ) + jpg_tiles_df["tile_path"] = jpg_tile_paths + + # Create output dataframe + output = jpg_tiles_df[ + ["tile_path", "class_name", "class_id", + "jpeg_quality", "percent_tumor_cells"] + ] + return (output) + + +def main(args): + output = format_tile_data_structure( + slides_folder=args.slides_folder, + tiles_folder= args.tiles_folder, + output_dir=args.output_dir, + clinical_file_path=args.clin_path, + is_tcga=args.is_tcga + ) + output.to_csv(Path(args.output_dir, "file_info_train.txt"), + index=False, sep="\t") + + print("Finished creating the necessary file for computing the features in the next step") + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/generate_graphs.py b/bin/generate_graphs.py new file mode 100755 index 0000000..fbc33ef --- /dev/null +++ b/bin/generate_graphs.py @@ -0,0 +1,98 @@ +#!/usr/bin/env python3 +import argparse +import multiprocessing +from argparse import ArgumentParser as AP +import os +import joblib +import pandas as pd +from joblib import Parallel, delayed +import argparse + +# Own modules +import features.graphs as graphs +from model.constants import DEFAULT_CELL_TYPES + +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Generating graphs for computing spatial network features""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--tile_quantification_path", type=str, + help="Path to csv file with tile-level quantification (predictions)", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output folder to store generated files", default = "") + parser.add_argument("--slide_type", type=str, + help="Type of slides 'FFPE' or 'FF' used for naming generated files (default: 'FF')", default="FF") + parser.add_argument("--cell_types_path", type=str, + help="Path to file with list of cell types (default: CAFs, endothelial_cells, T_cells, tumor_purity)", default=None) + parser.add_argument("--prefix", type=str, + help="Prefix for output file", default="") + parser.add_argument("--n_cores", type = int, help = "Number of cores to use (parallelization)") + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + os.mkdir(arg.output_dir) + + if arg.n_cores is None: + arg.n_cores = multiprocessing.cpu_count() + return arg + +def generate_graphs(tile_quantification_path, cell_types=None, n_cores = multiprocessing.cpu_count()): + """ + Generating graphs + + Args: + tile_quantification_path (str) + cell_types (list): list of cell types + n_cores (int): Number of cores to use (parallelization) + + Returns: + Graphs for all slides (dict) + + """ + if cell_types is None: + cell_types = DEFAULT_CELL_TYPES + + predictions = pd.read_csv(tile_quantification_path, sep="\t") + slide_submitter_ids = list(set(predictions.slide_submitter_id)) + + ##################################### + # ---- Constructing the graphs ---- # + ##################################### + + results = Parallel(n_jobs=n_cores)( + delayed(graphs.construct_graph)( + predictions=predictions, slide_submitter_id=id) + for id in slide_submitter_ids + ) + # Extract/format graphs 
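+    # Each parallel worker returns a single-item dict {slide_submitter_id: graph};
+    # merge them into one dictionary keyed by slide_submitter_id.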
+ all_graphs = { + list(slide_graph.keys())[0]: list(slide_graph.values())[0] + for slide_graph in results + } + return all_graphs + +def main(args): + all_graphs = generate_graphs( + tile_quantification_path = args.tile_quantification_path, + n_cores=args.n_cores) + out_filepath = Path(args.output_dir, + f"{args.prefix}_graphs.pkl") + + joblib.dump(all_graphs, out_filepath) + print(f"Generated all graphs and stored in: {out_filepath}") + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/post_process_features.py b/bin/post_process_features.py new file mode 100755 index 0000000..cc3bf54 --- /dev/null +++ b/bin/post_process_features.py @@ -0,0 +1,132 @@ +#!/usr/bin/env python3 +#  Module imports +import argparse +from argparse import ArgumentParser as AP +import os +import dask.dataframe as dd +import pandas as pd + +#  Custom imports +import DL.utils as utils +from os.path import abspath +import time +from pathlib import Path + + +def get_args(): + # Script description + description = """Post processing features""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + + # Sections + parser.add_argument("--output_dir", help="Set output folder (default='.')", default = ".") + parser.add_argument("--create_parquet_subdir", help = "Whether to create a subdirectory called 'features_format_parquet' if slide_type == 'FFPE', default=False", default = False) + parser.add_argument( + "--slide_type", help="Type of tissue slide (FF or FFPE)") + parser.add_argument( + "--is_tcga", help="Is TCGA dataset, default=False", type = int, default = 0) + parser.add_argument("--bot_train_file", type = str, default = None, help = "Txt file") + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + + if arg.bot_train_file is None: + arg.bot_train_file = Path(arg.output_dir, "bot_train.txt") + + if (arg.create_parquet_subdir): + arg.output_dir = abspath(Path(arg.output_dir, "features_format_parquet")) + + if not os.path.isdir(arg.output_dir): + os.mkdir(arg.output_dir) + + return arg + + +def handle_ff_slides(bot_train_file, is_tcga): + features_raw = pd.read_csv(bot_train_file, sep="\t", header=None) + # Extract the DL features (discard: col1 = tile paths, col2 = true class id) + features = features_raw.iloc[:, 2:] + features.columns = list(range(1536)) + # Add new column variables that define each tile + features["tile_ID"] = [utils.get_tile_name( + tile_path) for tile_path in features_raw.iloc[:, 0]] + features["Coord_X"] = [i[-2] + for i in features["tile_ID"].str.split("_")] + features["Coord_Y"] = [i[-1] + for i in features["tile_ID"].str.split("_")] + # FIX add sample_submitter_id and slide_submitter_id depending on is_tcga + if is_tcga: + features["sample_submitter_id"] = features["tile_ID"].str[0:16] + features["slide_submitter_id"] = features["tile_ID"].str[0:23] + features["Section"] = features["tile_ID"].str[20:23] + else: + features["sample_submitter_id"] = features['tile_ID'].str.split( + '_').str[0] + return(features) + +def handle_ffpe_slides(bot_train_file, is_tcga): + features_raw = dd.read_csv(bot_train_file, sep="\t", header=None) + features_raw['tile_ID'] = features_raw.iloc[:, 0] + features_raw.tile_ID = features_raw.tile_ID.map( + lambda x: x.split("/")[-1]) + features_raw['tile_ID'] = features_raw['tile_ID'].str.replace( + ".jpg'", "") + features = 
features_raw.map_partitions( + lambda df: df.drop(columns=[0, 1])) + new_names = list(map(lambda x: str(x), list(range(1536)))) + new_names.append('tile_ID') + features.columns = new_names + # FIX add sample_submitter_id and slide_submitter_id depending on is_tcga + if is_tcga: + features["sample_submitter_id"] = features["tile_ID"].str[0:16] + features["slide_submitter_id"] = features["tile_ID"].str[0:23] + features["Section"] = features["tile_ID"].str[20:23] + else: + features["sample_submitter_id"] = features['tile_ID'].str.split( + '_').str[0] + features['Coord_X'] = features['tile_ID'].str.split('_').str[1] + features['Coord_Y'] = features['tile_ID'].str.split('_').str[-1] + return(features) + +def post_process_features(bot_train_file, slide_type = "FF", is_tcga="TCGA"): + """ + Format extracted histopathological features from bot.train.txt file generated by myslim/bottleneck_predict.py and extract the 1,536 features, tile names. Extract several variables from tile ID. + + Args: + bot_train_file (txt) + slide_type (str) + is_tcga (bool) + + Returns: + features (dataframe) contains the 1,536 features, followed by the sample_submitter_id, tile_ID, slide_submitter_id, Section, Coord_X and Coord_Y and in the rows the tiles + """ + # Read histopathological computed features + if slide_type == "FF": + return(handle_ff_slides(bot_train_file=bot_train_file, is_tcga=is_tcga)) + elif slide_type == "FFPE": + return(handle_ffpe_slides(bot_train_file=bot_train_file, is_tcga=is_tcga)) + else: + raise Exception("Invalid `slide_type`, please choose 'FF' or 'FFPE' ") + +def main(args): + features = post_process_features( + bot_train_file=args.bot_train_file, + slide_type=args.slide_type, + is_tcga=args.is_tcga) + if (args.slide_type == "FF"): + #  Save features to .csv file + features.to_csv(Path(args.output_dir, "features.txt"), sep="\t", header=True) + elif (args.slide_type == "FFPE"): + features.to_parquet(path= args.output_dir, compression='gzip', + name_function=utils.name_function) + print("Finished post-processing of features...") + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/post_process_predictions.py b/bin/post_process_predictions.py new file mode 100755 index 0000000..7973c92 --- /dev/null +++ b/bin/post_process_predictions.py @@ -0,0 +1,242 @@ +#!/usr/bin/env python + +# Module imports +import argparse +from argparse import ArgumentParser as AP +import os +import dask.dataframe as dd +import pandas as pd + +#  Custom imports +import DL.utils as utils +import numpy as np +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Post-processing predictions""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--output_dir", help="Set output folder", default = ".") + parser.add_argument("--create_parquet_subdir", help = "Whether to create a subdirectory called 'predictions_format_parquet' if slide_type == 'FFPE', default=False", default = False) + parser.add_argument( + "--slide_type", help="Type of tissue slide (FF or FFPE) (default='FF')", type = str, default = "FF") + parser.add_argument( + "--path_codebook", help="codebook.txt file", required=True, type=str) + parser.add_argument( + "--path_tissue_classes", help="Tissue_classes.csv file", required=True, type=str) + 
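+    # Note: codebook.txt is read as a whitespace-delimited file with class_name and class_id columns;
+    # Tissue_classes.csv is read as a tab-separated file with tumor_type, ID_tumor and ID_normal columns (see below).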
parser.add_argument("--cancer_type", help = "Cancer type", required = True, type =str) + parser.add_argument("--pred_train_file", help = "", type = str, default = None) + arg = parser.parse_args() + + if arg.pred_train_file is None: + arg.pred_train_file = Path(arg.output_dir, "pred_train.txt") + + if (arg.create_parquet_subdir): + arg.output_dir = abspath(Path(arg.output_dir, "predictions_format_parquet")) + + if not os.path.isdir(arg.output_dir): + os.mkdir(arg.output_dir) + + return arg + + + +def handle_ff_slides(pred_train_file, codebook, tissue_classes, cancer_type): + predictions_raw = pd.read_csv(pred_train_file, sep="\t", header=None) + # Extract tile name incl. coordinates from path + tile_names = [utils.get_tile_name(tile_path) + for tile_path in predictions_raw[0]] + # Create output dataframe for post-processed data + predictions = pd.DataFrame(tile_names, columns=["tile_ID"]) + # Get predicted probabilities for all 42 classes + rename columns + pred_probabilities = predictions_raw.iloc[:, 2:] + pred_probabilities.columns = codebook["class_id"] + # Get predicted and true class ids + predictions["pred_class_id"] = pred_probabilities.idxmax( + axis="columns") + predictions["true_class_id"] = 41 + # Get corresponding max probabilities to the predicted class + predictions["pred_probability"] = pred_probabilities.max(axis=1) + # Replace class id with class name + predictions["true_class_name"] = predictions["true_class_id"].copy() + predictions["pred_class_name"] = predictions["pred_class_id"].copy() + found_class_ids = set(predictions["true_class_id"]).union( + set(predictions["pred_class_id"])) + for class_id in found_class_ids: + predictions["true_class_name"].replace( + class_id, codebook["class_name"][class_id], inplace=True + ) + predictions["pred_class_name"].replace( + class_id, codebook["class_name"][class_id], inplace=True + ) + + # Define whether prediction was right + predictions["is_correct_pred"] = ( + predictions["true_class_id"] == predictions["pred_class_id"]) + predictions["is_correct_pred"] = predictions["is_correct_pred"].replace( + False, "F") + predictions.is_correct_pred = predictions.is_correct_pred.astype(str) + # Get tumor and tissue ID + temp = pd.DataFrame( + {"tumor_type": predictions["true_class_name"].str[:-2]}) + temp = pd.merge(temp, tissue_classes, on="tumor_type", how="left") + # Set of IDs for normal and tumor (because of using multiple classes) + IDs_tumor = list(set(temp["ID_tumor"])) + if list(set(temp.tumor_type.tolist()))[0] == cancer_type: + # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) + predictions["tumor_label_prob"] = np.nan + predictions["normal_label_prob"] = np.nan + for ID_tumor in IDs_tumor: + vals = pred_probabilities.loc[temp["ID_tumor"] + == ID_tumor, ID_tumor] + predictions.loc[temp["ID_tumor"] == + ID_tumor, "tumor_label_prob"] = vals + + predictions["is_correct_pred_label"] = np.nan + else: + IDs_normal = list(set(temp["ID_normal"])) + # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) + predictions["tumor_label_prob"] = np.nan + predictions["normal_label_prob"] = np.nan + for ID_tumor in IDs_tumor: + vals = pred_probabilities.loc[temp["ID_tumor"] + == ID_tumor, ID_tumor] + predictions.loc[temp["ID_tumor"] == + ID_tumor, "tumor_label_prob"] = vals + + for ID_normal in IDs_normal: + vals = pred_probabilities.loc[temp["ID_normal"] + == ID_normal, ID_normal] + predictions.loc[temp["ID_normal"] == + ID_normal, "normal_label_prob"] = vals + + # Check if 
the correct label (tumor/normal) is predicted + temp_probs = predictions[["tumor_label_prob", "normal_label_prob"]] + is_normal_label_prob = ( + temp_probs["normal_label_prob"] > temp_probs["tumor_label_prob"] + ) + is_tumor_label_prob = ( + temp_probs["normal_label_prob"] < temp_probs["tumor_label_prob"] + ) + is_normal_label = predictions["true_class_name"].str.find( + "_N") != -1 + is_tumor_label = predictions["true_class_name"].str.find( + "_T") != -1 + + is_normal = is_normal_label & is_normal_label_prob + is_tumor = is_tumor_label & is_tumor_label_prob + + predictions["is_correct_pred_label"] = is_normal | is_tumor + predictions["is_correct_pred_label"].replace( + True, "T", inplace=True) + predictions["is_correct_pred_label"].replace( + False, "F", inplace=True) + return(predictions) + +def handle_ffpe_slides(pred_train_file, codebook, tissue_classes, cancer_type): + predictions_raw = dd.read_csv(pred_train_file, sep="\t", header=None) + predictions_raw['tile_ID'] = predictions_raw.iloc[:, 0] + predictions_raw.tile_ID = predictions_raw.tile_ID.map( + lambda x: x.split("/")[-1]) + predictions_raw['tile_ID'] = predictions_raw['tile_ID'].str.replace( + ".jpg'", "") + predictions = predictions_raw.map_partitions( + lambda df: df.drop(columns=[0, 1])) + new_names = list(map(lambda x: str(x), codebook["class_id"])) + new_names.append('tile_ID') + predictions.columns = new_names + predictions = predictions.map_partitions(lambda x: x.assign( + pred_class_id=x.iloc[:, 0:41].idxmax(axis="columns"))) + predictions["true_class_id"] = 41 + predictions = predictions.map_partitions(lambda x: x.assign( + pred_probability=x.iloc[:, 0:41].max(axis="columns"))) + predictions["true_class_name"] = predictions["true_class_id"].copy() + predictions["pred_class_name"] = predictions["pred_class_id"].copy() + predictions.pred_class_id = predictions.pred_class_id.astype(int) + res = dict(zip(codebook.class_id, codebook.class_name)) + predictions = predictions.map_partitions(lambda x: x.assign( + pred_class_name=x.loc[:, 'pred_class_id'].replace(res))) + predictions = predictions.map_partitions(lambda x: x.assign( + true_class_name=x.loc[:, 'true_class_id'].replace(res))) + predictions["is_correct_pred"] = ( + predictions["true_class_id"] == predictions["pred_class_id"]) + predictions["is_correct_pred"] = predictions["is_correct_pred"].replace( + False, "F") + predictions.is_correct_pred = predictions.is_correct_pred.astype(str) + temp = predictions.map_partitions(lambda x: x.assign( + tumor_type=x["true_class_name"].str[:-2])) + temp = temp.map_partitions(lambda x: pd.merge( + x, tissue_classes, on="tumor_type", how="left")) + if (temp['tumor_type'].compute() == cancer_type).any(): + # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) + predictions["tumor_label_prob"] = np.nan + predictions["normal_label_prob"] = np.nan + predictions = predictions.map_partitions( + lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) + predictions["is_correct_pred_label"] = np.nan + else: + # TO DO + predictions["tumor_label_prob"] = np.nan + predictions["normal_label_prob"] = np.nan + # predictions = predictions.map_partitions(lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) + # predictions = predictions.map_partitions(lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) + return predictions + +def post_process_predictions(pred_train_file, slide_type, path_codebook, path_tissue_classes, cancer_type): + """ + Format predicted tissue classes and derive tumor purity from pred.train.txt 
file generated by myslim/bottleneck_predict.py and + The pred.train.txt file contains the tile ID, the true class id and the 42 predicted probabilities for the 42 tissue classes. + + Args: + output_dir (str): path pointing to folder for storing all created files by script + + Returns: + {output_dir}/predictions.txt containing the following columns + - tile_ID, + - pred_class_id and true_class_id: class ids defined in codebook.txt) + - pred_class_name and true_class_name: class names e.g. LUAD_T, defined in codebook.txt) + - pred_probability: corresponding probability + - is_correct_pred (boolean): correctly predicted tissue class label + - tumor_label_prob and normal_label_prob: probability for predicting tumor and normal label (regardless of tumor or tissue type) + - is_correct_pred_label (boolean): correctly predicted 'tumor' or 'normal' tissue regardless of tumor or tissue type + In the rows the tiles. + """ + + # Initialize + codebook = pd.read_csv(path_codebook, delim_whitespace=True, header=None) + codebook.columns = ["class_name", "class_id"] + tissue_classes = pd.read_csv(path_tissue_classes, sep="\t") + + # Read predictions + if slide_type == "FF": + return(handle_ff_slides(pred_train_file=pred_train_file, codebook=codebook, tissue_classes=tissue_classes, cancer_type = cancer_type)) + #  Save features to .csv file + elif slide_type == "FFPE": + return(handle_ffpe_slides(pred_train_file=pred_train_file, codebook=codebook, tissue_classes= tissue_classes, cancer_type=cancer_type)) + else: + raise Exception("Invalid `slide_type`, please choose 'FF' or 'FFPE' ") + +def main(args): + predictions = post_process_predictions(pred_train_file = args.pred_train_file, slide_type=args.slide_type, path_codebook=args.path_codebook, + path_tissue_classes=args.path_tissue_classes, cancer_type=args.cancer_type) + if (args.slide_type == "FF"): + predictions.to_csv(Path(args.output_dir, "predictions.txt"), sep="\t") + elif (args.slide_type == "FFPE"): + # Save features using parquet + def name_function(x): return f"predictions-{x}.parquet" + predictions.to_parquet( + path=args.output_dir, compression='gzip', name_function=name_function) + print("Finished post-processing of predictions...") + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/bin/pre_processing.py b/bin/pre_processing.py new file mode 100755 index 0000000..cc71baa --- /dev/null +++ b/bin/pre_processing.py @@ -0,0 +1,98 @@ +#!/usr/bin/env python3 + +import argparse +from argparse import ArgumentParser as AP +import os +import pandas as pd +import sys + +import glob +from myslim.datasets.convert import _convert_dataset + +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Convert tiles to TFrecords""" + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + + # Sections + parser.add_argument("--slides_folder", help="Set slides folder", default = "") + parser.add_argument("--output_dir", help="Set output folder", default = "") + parser.add_argument("--file_info_train", + help="Set to path to 'file_info_train.txt' generated by create_file_info_train.py") + parser.add_argument( + "--N_shards", help="Number of shards", default=320, type=int) + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + arg.output_dir = abspath(arg.output_dir) + + 
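+    # Illustrative example invocation (hypothetical paths):
+    #   pre_processing.py --file_info_train file_info_train.txt --output_dir . --N_shards 320
+    # A 'process_train' subfolder for the TFRecords is created below if the output folder does not exist yet.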
if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + arg.output_dir = Path(arg.output_dir,"process_train") + os.mkdir(arg.output_dir) + return arg + +def execute_preprocessing(file_info_train, output_dir, N_shards=320): + """ + Execute several pre-processing steps necessary for extracting the histopathological features + 1. Create tiles from slides + 2. Construct file necessary for the deep learning architecture + 3. Convert images of tiles to TF records + + Args: + slides_folder (str): path pointing to folder with all whole slide images (.svs files) + output_dir (str): path pointing to folder for storing all created files by script + clinical_file_path (str): path pointing to formatted clinical file (either generated or manually formatted) + N_shards (int): default: 320 + + Returns: + {output_dir}/tiles/{tile files} + {output_dir}/file_info_train.txt file specifying data structure of the tiles required for inception architecture (to read the TF records) + {output_dir}/process_train/{TFrecord file} files that store the data as a series of binary sequencies + + """ + # Convert tiles from jpg to TF record1 + file_info = pd.read_csv(file_info_train, sep="\t") + training_filenames = list(file_info["tile_path"].values) + training_classids = [int(id) for id in list(file_info["class_id"].values)] + tps = [int(id) for id in list(file_info["percent_tumor_cells"].values)] + Qs = list(file_info["jpeg_quality"].values) + + _convert_dataset( + split_name="train", + filenames=training_filenames, + tps=tps, + Qs=Qs, + classids=training_classids, + output_dir=output_dir, + NUM_SHARDS=N_shards, + ) + +def main(args): + execute_preprocessing( + output_dir=args.output_dir, + file_info_train=args.file_info_train, + N_shards=args.N_shards + ) + + out_files = glob.glob1(Path(args.output_dir), "*.tfrecord") + print(len(out_files)) + + assert len(out_files) == args.N_shards + + print("Finished converting dataset") + print( + f"The converted data is stored in the directory: {args.output_dir}") + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/Python/2_train_multitask_models/tile_level_cell_type_quantification.py b/bin/tile_level_cell_type_quantification.py old mode 100644 new mode 100755 similarity index 70% rename from Python/2_train_multitask_models/tile_level_cell_type_quantification.py rename to bin/tile_level_cell_type_quantification.py index 4134cc0..6394c2e --- a/Python/2_train_multitask_models/tile_level_cell_type_quantification.py +++ b/bin/tile_level_cell_type_quantification.py @@ -1,16 +1,74 @@ +#!/usr/bin/env python3 + import os -import sys import pandas as pd import dask.dataframe as dd import argparse import joblib import scipy.stats as stats - +from pathlib import Path from model.constants import DEFAULT_CELL_TYPES from model.evaluate import compute_tile_predictions +import time +from argparse import ArgumentParser as AP +import glob -def tile_level_quantification(models_dir, output_dir, var_names_path, histopatho_features_dir, prediction_mode="all", n_outerfolds=5, cell_types_path="", slide_type="FF"): +def get_args(): + # Script description + description = """Tile-level cell type quantification""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--models_dir", type=str, + help="Path to models 
directory", required=True) + parser.add_argument("--output_dir", type=str, + help="Path to output directory", required=False, default="") + parser.add_argument("--histopatho_features_dir", type=str, + help="Path to histopathological features file", required=False, default="") + parser.add_argument("--var_names_path", type=str, + help="Path to variable names pkl file", required=True) + parser.add_argument("--features_input", type=str, default=None) + parser.add_argument("--prediction_mode", type=str, + help="Choose prediction mode 'performance' or 'all' (default='all')", default="all", required=False) + parser.add_argument("--n_outerfolds", type=int, default=5, + help="Number of outer folds (default=5)", required=False) + parser.add_argument("--cell_types", type=str, default=None, + help="List of cell types by default=['T_cells','CAFs', 'tumor_purity','endothelial_cells']", required=False) + parser.add_argument( + "--slide_type", help="Type of tissue slide (FF or FFPE)", type=str, required=True) + + arg = parser.parse_args() + + if (arg.features_input is None): + if arg.slide_type == "FF": + arg.features_input = Path( + arg.histopatho_features_dir, "features.txt") + + elif arg.slide_type == "FFPE": + parquet_files = glob.glob1("", "*.parquet") + if (len(parquet_files) > 0): + if not (os.path.isdir("features_format_parquet")): + os.mkdir("features_format_parquet") + for parquet_file in parquet_files: + os.replace(parquet_file, Path( + "features_format_parquet", parquet_file)) + + arg.features_input = Path( + arg.histopatho_features_dir, "features_format_parquet") + + if (not Path(arg.features_input).exists()): + raise Exception( + "Invalid argument, please check `features_input` or `histopatho_features_dir`") + + if ((arg.output_dir != "") & (not os.path.isdir(arg.output_dir))): + # Create an empty folder for TF records if folder doesn't exist + os.mkdir(arg.output_dir) + return arg + + +def tile_level_quantification(features_input, models_dir, var_names_path, prediction_mode="all", n_outerfolds=5, cell_types="", slide_type="FF"): """ Quantify the cell type abundances for the different tiles. 
Creates three files: (1) z-scores and @@ -29,34 +87,23 @@ def tile_level_quantification(models_dir, output_dir, var_names_path, histopatho """ # Read data - if os.path.isfile(cell_types_path): - cell_types = pd.read_csv( - cell_types_path, header=None).to_numpy().flatten() - else: + if cell_types is None: cell_types = DEFAULT_CELL_TYPES - print(cell_types) - - full_output_dir = f"{output_dir}" - print(full_output_dir) - if not os.path.isdir(full_output_dir): - os.makedirs(full_output_dir) - var_names = joblib.load(var_names_path) print(var_names) if slide_type == "FF": - FEATURES_PATH = f"{histopatho_features_dir}/features.txt" - histopatho_features = pd.read_csv(FEATURES_PATH, sep="\t", index_col=0) + histopatho_features = pd.read_csv( + features_input, sep="\t", index_col=0) elif slide_type == "FFPE": - FEATURES_PATH = f"{histopatho_features_dir}/features_format_parquet" - histopatho_features = dd.read_parquet(FEATURES_PATH) + histopatho_features = dd.read_parquet(features_input) print(histopatho_features.head()) # Compute predictions based on bottleneck features tile_predictions = pd.DataFrame() - bottleneck_features = histopatho_features.loc[:, [ + bottleneck_features = histopatho_features.loc[:, [ str(i) for i in range(1536)]] bottleneck_features.index = histopatho_features.tile_ID var_names['IDs'] = 'sample_submitter_id' @@ -67,7 +114,6 @@ def tile_level_quantification(models_dir, output_dir, var_names_path, histopatho metadata = metadata.compute() print("Computing tile predictions for each cell type...") - ############################################################################## # If predicting on all FFPE slides, we do this by chunks: # if any([prediction_mode == item for item in ['tcga_train_validation', 'test']]): @@ -132,42 +178,29 @@ def tile_level_quantification(models_dir, output_dir, var_names_path, histopatho columns={'sample_submitter_id': 'slide_submitter_id'}) pred_proba = pred_proba.rename( columns={'sample_submitter_id': 'slide_submitter_id'}) + return (tile_predictions, pred_proba) - tile_predictions.to_csv( - f"{full_output_dir}/{prediction_mode}_tile_predictions_zscores.csv", sep="\t", index=False) - pred_proba.to_csv( - f"{full_output_dir}/{prediction_mode}_tile_predictions_proba.csv", sep="\t", index=False) - - -if __name__ == "__main__": - - parser = argparse.ArgumentParser( - description="Predict cell type abundances for the tiles") - parser.add_argument("--models_dir", type=str, - help="Path to models directory", required=True) - parser.add_argument("--output_dir", type=str, - help="Path to output directory", required=True) - parser.add_argument("--histopatho_features_dir", type=str, - help="Path to histopathological features file", required=True) - parser.add_argument("--var_names_path", type=str, - help="Path to variable names pkl file", required=True) - - parser.add_argument("--prediction_mode", type=str, - help="Choose prediction mode 'performance' or 'all' (default='all')", default="all", required=False) - parser.add_argument("--n_outerfolds", type=int, default=5, - help="Number of outer folds (default=5)", required=False) - parser.add_argument("--cell_types_path", type=str, default="", - help="List of cell types by default=['T_cells','CAFs', 'tumor_purity','endothelial_cells']", required=False) - parser.add_argument( - "--slide_type", help="Type of tissue slide (FF or FFPE)", type=str, required=True) - args = parser.parse_args() - tile_level_quantification( +def main(args): + tile_predictions, pred_proba = tile_level_quantification( + 
features_input=args.features_input, models_dir=args.models_dir, - output_dir=args.output_dir, - histopatho_features_dir=args.histopatho_features_dir, prediction_mode=args.prediction_mode, n_outerfolds=args.n_outerfolds, - cell_types_path=args.cell_types_path, + cell_types=args.cell_types, var_names_path=args.var_names_path, slide_type=args.slide_type) + + tile_predictions.to_csv( + Path(args.output_dir, f"{args.prediction_mode}_tile_predictions_zscores.csv"), sep="\t", index=False) + pred_proba.to_csv( + Path(args.output_dir, f"{args.prediction_mode}_tile_predictions_proba.csv"), sep="\t", index=False) + print("Finished tile predictions...") + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/conf/base.config b/conf/base.config new file mode 100755 index 0000000..4ce8036 --- /dev/null +++ b/conf/base.config @@ -0,0 +1,139 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + nf-core/spotlight Nextflow base config file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + A 'blank slate' config file, appropriate for general use on most high performance + compute environments. Assumes that all software is installed and available on + the PATH. Runs in `local` mode - all jobs will be run on the logged in environment. +---------------------------------------------------------------------------------------- +*/ + +process { + + // TODO nf-core: Check the defaults for all processes + cpus = { check_max( 1 * task.attempt, 'cpus' ) } + memory = { check_max( 6.GB * task.attempt, 'memory' ) } + time = { check_max( 4.h * task.attempt, 'time' ) } + + errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' } + maxRetries = 1 + maxErrors = '-1' + + // Process-specific resource requirements + // NOTE - Please try and re-use the labels below as much as possible. + // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules. + // If possible, it would be nice to keep the same label naming convention when + // adding in your local modules too. + // TODO nf-core: Customise requirements for specific processes. 
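+    // Illustrative example: a module can request these resources by declaring labels in its process
+    // definition, e.g. `label 'process_medium'`, or the SPOTLIGHT-specific labels defined below such as `label 'mem_16G'` and `label 'time_2h'`.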
+ // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors + withLabel:process_single { + cpus = { check_max( 1 , 'cpus' ) } + memory = { check_max( 6.GB * task.attempt, 'memory' ) } + time = { check_max( 4.h * task.attempt, 'time' ) } + } + withLabel:process_low { + cpus = { check_max( 2 * task.attempt, 'cpus' ) } + memory = { check_max( 12.GB * task.attempt, 'memory' ) } + time = { check_max( 4.h * task.attempt, 'time' ) } + } + withLabel:process_medium { + cpus = { check_max( 6 * task.attempt, 'cpus' ) } + memory = { check_max( 36.GB * task.attempt, 'memory' ) } + time = { check_max( 8.h * task.attempt, 'time' ) } + } + withLabel:process_high { + cpus = { check_max( 12 * task.attempt, 'cpus' ) } + memory = { check_max( 72.GB * task.attempt, 'memory' ) } + time = { check_max( 16.h * task.attempt, 'time' ) } + } + withLabel:process_long { + time = { check_max( 20.h * task.attempt, 'time' ) } + } + withLabel:process_high_memory { + memory = { check_max( 200.GB * task.attempt, 'memory' ) } + } + withLabel:error_ignore { + errorStrategy = 'ignore' + } + withLabel:error_retry { + errorStrategy = 'retry' + maxRetries = 2 + } + + // ---- NEW LABELS for SPOTLIGHT ---- // + // Memory labels + withLabel:mem_4G { + memory = { check_max( 4.GB * task.attempt, 'memory' ) } + // queue = { assign_queue ( 4.GB * task.attempt )} + } + + withLabel:mem_8G { + memory = { check_max( 8.GB * task.attempt, 'memory' ) } + // queue = { assign_queue ( 8.GB * task.attempt )} + + } + + withLabel:mem_16G { + memory = { check_max( 16.GB * task.attempt, 'memory' ) } + // queue = { assign_queue ( 16.GB * task.attempt )} + + } + + withLabel:mem_32G { + memory = { check_max( 32.GB * task.attempt, 'memory' ) } + // queue = { assign_queue ( 32.GB * task.attempt )} + + } + + withLabel:mem_64G { + memory = { check_max( 64.GB * task.attempt, 'memory' ) } + // queue = { assign_queue ( 64.GB * task.attempt )} + + } + + withLabel:mem_128G { + memory = { check_max( 128.GB * task.attempt, 'memory' ) } + // queue = { assign_queue ( 128.GB * task.attempt )} + + } + + // Time label + withLabel:time_10m { + time = { check_max( 10.m * task.attempt, 'time' ) } + } + + withLabel:time_30m { + time = { check_max( 30.m * task.attempt, 'time' ) } + } + + withLabel:time_1h { + time = { check_max( 1.h * task.attempt, 'time' ) } + } + + withLabel:time_2h { + time = { check_max( 2.h * task.attempt, 'time' ) } + } + + withLabel:time_4h { + time = { check_max( 4.h * task.attempt, 'time' ) } + } + + withLabel:time_8h { + time = { check_max( 8.h * task.attempt, 'time' ) } + } + + withLabel:time_12h { + time = { check_max( 12.h * task.attempt, 'time' ) } + } + + withLabel:time_24h { + time = { check_max( 1.d * task.attempt, 'time' ) } + } + + withLabel:time_24h { + time = { check_max( 2.d * task.attempt, 'time' ) } + } + +} + + diff --git a/conf/igenomes.config b/conf/igenomes.config new file mode 100755 index 0000000..3f11437 --- /dev/null +++ b/conf/igenomes.config @@ -0,0 +1,440 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for iGenomes paths +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines reference genomes using iGenome paths. 
+ Can be used by any config that customises the base path using: + $params.igenomes_base / --igenomes_base +---------------------------------------------------------------------------------------- +*/ + +params { + // illumina iGenomes reference file paths + genomes { + 'GRCh37' { + fasta = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt" + mito_name = "MT" + macs_gsize = "2.7e9" + blacklist = "${projectDir}/assets/blacklists/GRCh37-blacklist.bed" + } + 'GRCh38' { + fasta = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed" + mito_name = "chrM" + macs_gsize = "2.7e9" + blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed" + } + 'CHM13' { + fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAIndex/" + bwamem2 = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAmem2Index/" + gtf = "${params.igenomes_base}/Homo_sapiens/NCBI/CHM13/Annotation/Genes/genes.gtf" + gff = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz" + mito_name = "chrM" + } + 'GRCm38' { + fasta = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/README.txt" + mito_name = "MT" + macs_gsize = "1.87e9" + blacklist = "${projectDir}/assets/blacklists/GRCm38-blacklist.bed" + } + 'TAIR10' { + fasta = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/" + 
star = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/README.txt" + mito_name = "Mt" + } + 'EB2' { + fasta = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/README.txt" + } + 'UMD3.1' { + fasta = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/README.txt" + mito_name = "MT" + } + 'WBcel235' { + fasta = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.bed" + mito_name = "MtDNA" + macs_gsize = "9e7" + } + 'CanFam3.1' { + fasta = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.bed" + readme = 
"${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/README.txt" + mito_name = "MT" + } + 'GRCz10' { + fasta = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.bed" + mito_name = "MT" + } + 'BDGP6' { + fasta = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.bed" + mito_name = "M" + macs_gsize = "1.2e8" + } + 'EquCab2' { + fasta = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/README.txt" + mito_name = "MT" + } + 'EB1' { + fasta = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/README.txt" + } + 'Galgal4' { + fasta = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/STARIndex/" + bismark = 
"${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.bed" + mito_name = "MT" + } + 'Gm01' { + fasta = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/README.txt" + } + 'Mmul_1' { + fasta = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/README.txt" + mito_name = "MT" + } + 'IRGSP-1.0' { + fasta = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.bed" + mito_name = "Mt" + } + 'CHIMP2.1.4' { + fasta = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/README.txt" + mito_name = "MT" + } + 'Rnor_5.0' { + fasta = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_5.0/Sequence/WholeGenomeFasta/genome.fa" + bwa = 
"${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_5.0/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_5.0/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_5.0/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_5.0/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_5.0/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_5.0/Annotation/Genes/genes.bed" + mito_name = "MT" + } + 'Rnor_6.0' { + fasta = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.bed" + mito_name = "MT" + } + 'R64-1-1' { + fasta = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed" + mito_name = "MT" + macs_gsize = "1.2e7" + } + 'EF2' { + fasta = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/README.txt" + mito_name = "MT" + macs_gsize = "1.21e7" + } + 'Sbi1' { + fasta = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.gtf" + bed12 = 
"${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/README.txt" + } + 'Sscrofa10.2' { + fasta = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/README.txt" + mito_name = "MT" + } + 'AGPv3' { + fasta = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.bed" + mito_name = "Mt" + } + 'hg38' { + fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed" + mito_name = "chrM" + macs_gsize = "2.7e9" + blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed" + } + 'hg19' { + fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/README.txt" + mito_name = "chrM" + macs_gsize = "2.7e9" + blacklist = "${projectDir}/assets/blacklists/hg19-blacklist.bed" + } + 'mm10' { + fasta = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/STARIndex/" + bismark = 
"${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/README.txt" + mito_name = "chrM" + macs_gsize = "1.87e9" + blacklist = "${projectDir}/assets/blacklists/mm10-blacklist.bed" + } + 'bosTau8' { + fasta = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.bed" + mito_name = "chrM" + } + 'ce10' { + fasta = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/README.txt" + mito_name = "chrM" + macs_gsize = "9e7" + } + 'canFam3' { + fasta = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/README.txt" + mito_name = "chrM" + } + 'danRer10' { + fasta = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.bed" + mito_name = "chrM" + macs_gsize = "1.37e9" + } + 'dm6' { + fasta = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/WholeGenomeFasta/genome.fa" + bwa = 
"${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.bed" + mito_name = "chrM" + macs_gsize = "1.2e8" + } + 'equCab2' { + fasta = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/README.txt" + mito_name = "chrM" + } + 'galGal4' { + fasta = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/README.txt" + mito_name = "chrM" + } + 'panTro4' { + fasta = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/README.txt" + mito_name = "chrM" + } + 'rn6' { + fasta = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.bed" + mito_name = "chrM" + } + 
'sacCer3' { + fasta = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BismarkIndex/" + readme = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Annotation/README.txt" + mito_name = "chrM" + macs_gsize = "1.2e7" + } + 'susScr3' { + fasta = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/WholeGenomeFasta/genome.fa" + bwa = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BWAIndex/version0.6.0/" + bowtie2 = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/Bowtie2Index/" + star = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/STARIndex/" + bismark = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BismarkIndex/" + gtf = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.gtf" + bed12 = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.bed" + readme = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/README.txt" + mito_name = "chrM" + } + } +} diff --git a/conf/modules.config b/conf/modules.config new file mode 100755 index 0000000..61d4b67 --- /dev/null +++ b/conf/modules.config @@ -0,0 +1,119 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Config file for defining DSL2 per module options and publishing paths +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Available keys to override module options: + ext.args = Additional arguments appended to command in module. + ext.args2 = Second set of arguments appended to command in module (multi-tool modules). + ext.args3 = Third set of arguments appended to command in module (multi-tool modules). + ext.prefix = File name prefix for output files. +---------------------------------------------------------------------------------------- +*/ + +process { + // Setting defaults + publishDir = [ + path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> + if (filename.equals('versions.yml') || filename == "ok.txt") { null } + else { filename } + } + ] + + // The three main subworkflows + withName: 'BOTTLENECK_PREDICT' { + publishDir = [ + path: { "${params.outdir}/1_extract_histopatho_features" }, + // mode: params.publish_dir_mode, + // saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + withLabel: 'tf_learning_celltyp_quant' { + publishDir = [ + path: { "${params.outdir}/2_tile_level_quantification" }, + // mode: params.publish_dir_mode, + // saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + // Type of spatial features + withLabel: 'spatial_clustering_features' { + publishDir = [ + path: { "${params.outdir}/3_spatial_features/clustering_features" }, + // mode: params.publish_dir_mode, + // saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + + } + + withLabel: 'spatial_features' { + publishDir = [ + path: { "${params.outdir}/3_spatial_features" }, + // mode: params.publish_dir_mode, + // saveAs: { filename -> filename.equals('versions.yml') ?
null : filename } + ] + + } + + withLabel: 'spatial_network_features' { + publishDir = [ + path: { "${params.outdir}/3_spatial_features/network_features" }, + // mode: params.publish_dir_mode, + // saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + + + } + + // Individual modules (processes) + withName: 'CREATE_CLINICAL_FILE' { + ext.prefix = {"generated_clinical_file"} + } + + withName: 'TILING_SINGLE_SLIDE' { + publishDir = [ + path: { "${params.outdir}/1_extract_histopatho_features/tiles" }, + mode: "symlink", + // saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + withName: 'PREPROCESSING_SLIDES' { + publishDir = [ + path: { "${params.outdir}/1_extract_histopatho_features/process_train" }, + mode: "symlink", + // saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + withLabel: 'compute_spatial_features' { + // use the 'cpus' key so check_max caps this against params.max_cpus + cpus = { check_max( 16, 'cpus' ) } + time = { check_max ( 8.h * task.attempt, 'time')} + memory = {check_max ( 32.GB * task.attempt, 'memory')} + + } + + withName: 'COMPUTE_NETWORK_FEATURES' { + publishDir = [ + path: { "${params.outdir}/3_spatial_features/network_features" }, + mode: "copy", + // saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + + } + + withName: FASTQC { + ext.args = '--quiet' + } + + withName: 'MULTIQC' { + ext.args = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' } + publishDir = [ + path: { "${params.outdir}/multiqc" }, + // mode: params.publish_dir_mode, + // saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + +} diff --git a/conf/test.config b/conf/test.config new file mode 100755 index 0000000..db52cbd --- /dev/null +++ b/conf/test.config @@ -0,0 +1,29 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running minimal tests +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a fast and simple pipeline test. + + Use as follows: + nextflow run nf-core/spotlight -profile test,<docker/singularity> --outdir <OUTDIR> + +---------------------------------------------------------------------------------------- +*/ + +params { + config_profile_name = 'Test profile' + config_profile_description = 'Minimal test dataset to check pipeline function' + + // Limit resources so that this can run on GitHub Actions + max_cpus = 2 + max_memory = '6.GB' + max_time = '6.h' + + // Input data + // TODO nf-core: Specify the paths to your test data on nf-core/test-datasets + // TODO nf-core: Give any required params for the test so that command line flags are not needed + input = params.pipelines_testdata_base_path + 'viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv' + + // Genome references + genome = 'R64-1-1' +} diff --git a/conf/test_full.config b/conf/test_full.config new file mode 100755 index 0000000..227b4ca --- /dev/null +++ b/conf/test_full.config @@ -0,0 +1,24 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running full-size tests +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a full size pipeline test.
+ + Use as follows: + nextflow run nf-core/spotlight -profile test_full,<docker/singularity> --outdir <OUTDIR> + +---------------------------------------------------------------------------------------- +*/ + +params { + config_profile_name = 'Full test profile' + config_profile_description = 'Full test dataset to check pipeline function' + + // Input data for full size test + // TODO nf-core: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA) + // TODO nf-core: Give any required params for the test so that command line flags are not needed + input = params.pipelines_testdata_base_path + 'viralrecon/samplesheet/samplesheet_full_illumina_amplicon.csv' + + // Genome references + genome = 'R64-1-1' +} diff --git a/custom.config b/custom.config new file mode 100755 index 0000000..90a3cf8 --- /dev/null +++ b/custom.config @@ -0,0 +1,114 @@ +params { + config_profile_description = 'GaitiLab cluster profile' + + max_cpus = 24 + max_memory = 184.GB + // 7.d for himem, but not sure yet how to handle this + max_time = 5.d + maxRetries = 10 + + clinical_files_input = "${projectDir}/assets/codebook.txt" + path_codebook= 'assets/NO_FILE' + class_name='SKCM' + clinical_file_out_file = 'generated_clinical_file' + tumor_purity_threshold=80 + is_tcga = false + image_dir = "${projectDir}/data_example/tiny_xenium_set" + gradient_mag_filter=10 + n_shards=320 + bot_out = 'bot_train' + pred_out = 'pred_train' + model_name='inception_v4' + + + checkpoint_path = "${projectDir}/assets/checkpoint/Retrained_Inception_v4/model.ckpt-100000" + slide_type = 'FFPE' + path_tissue_classes= "${projectDir}/assets/tissue_classes.csv" + + celltype_models = "${projectDir}/assets/TF_models/SKCM_FF" + var_names_path = "${projectDir}/assets/task_selection_names.pkl" + prediction_mode='test' + + cell_types_path = 'assets/NO_FILE' + n_outerfolds = 5 + + // Prefix for spatial features output filenames, else 'slide_type' is used + out_prefix = 'dummy' + + + // Spatial features parameters + graphs_path = 'assets/NO_FILE' + n_outerfolds = 5 + abundance_threshold = 0.5 + shapiro_alpha = 0.05 + cutoff_path_length = 2 + + n_clusters = 8 + max_dist = 'dummy' + max_n_tiles_threshold = 2 + tile_size = 512 + overlap = 50 + + metadata_path = 'assets/NO_FILE' + merge_var = "slide_submitter_id" + sheet_name = 'dummy' + + outdir = "output" + + +} +nextflow.enable.moduleBinaries = true + +process { + executor = "local" +} + +// Perform work directory cleanup after a successful run +cleanup = true +// env.PYTHONPATH = "${projectDir}/lib:${projectDir}/lib/myslim" + +// Profile to deactivate automatic cleanup of work directory after a successful run. Overwrites cleanup option.
+profiles { + debug { + cleanup = false + } + slurm { + process { + executor = "slurm" + jobName = { "$task.hash" } + // Select right queue + queue = { assign_queue( task.memory * task.attempt ) } + } + } + h4h { + // When on cluster ensure apptainer and java/18 are loaded + process { + beforeScript = """module load apptainer""".stripIndent() + } + } + apptainer { + process.container = "${projectDir}/spotlight.sif" + + } +} + + +def assign_queue (mem){ + def queue = "" + switch ( mem ) { + case { it > 185.GB }: + queue = 'superhimem' + break + case { it > 61.4.GB }: + queue = 'veryhimem' + break + case { it > 30.72.GB }: + queue = 'himem' + break + default: + queue = 'all' + break + } + return queue +} + diff --git a/docs/README.md b/docs/README.md new file mode 100755 index 0000000..0a970e6 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,10 @@ +# nf-core/spotlight: Documentation + +The nf-core/spotlight documentation is split into the following pages: + +- [Usage](usage.md) + - An overview of how the pipeline works, how to run it and a description of all of the different command-line flags. +- [Output](output.md) + - An overview of the different results produced by the pipeline and how to interpret them. + +You can find a lot more documentation about installing, configuring and running nf-core pipelines on the website: [https://nf-co.re](https://nf-co.re) diff --git a/docs/images/mqc_fastqc_adapter.png b/docs/images/mqc_fastqc_adapter.png new file mode 100755 index 0000000..361d0e4 Binary files /dev/null and b/docs/images/mqc_fastqc_adapter.png differ diff --git a/docs/images/mqc_fastqc_counts.png b/docs/images/mqc_fastqc_counts.png new file mode 100755 index 0000000..cb39ebb Binary files /dev/null and b/docs/images/mqc_fastqc_counts.png differ diff --git a/docs/images/mqc_fastqc_quality.png b/docs/images/mqc_fastqc_quality.png new file mode 100755 index 0000000..a4b89bf Binary files /dev/null and b/docs/images/mqc_fastqc_quality.png differ diff --git a/docs/output.md b/docs/output.md new file mode 100755 index 0000000..c9b0705 --- /dev/null +++ b/docs/output.md @@ -0,0 +1,71 @@ +# nf-core/spotlight: Output + +## Introduction + +This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. + +The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. + + + +## Pipeline overview + +The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: + +- [FastQC](#fastqc) - Raw read QC +- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline +- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution + +### FastQC + +
+<details markdown="1"> +<summary>Output files</summary> + +- `fastqc/` + - `*_fastqc.html`: FastQC report containing quality metrics. + - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images. + +</details>
+ +[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). + +![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png) + +![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png) + +![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png) + +:::note +The FastQC plots displayed in the MultiQC report show _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality. +::: + +### MultiQC + +<details markdown="1">
+<summary>Output files</summary> + +- `multiqc/` + - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser. + - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline. + - `multiqc_plots/`: directory containing static images from the report in various formats. + +</details>
+ +[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. + +Results generated by MultiQC collate pipeline QC from supported tools, e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>. + +### Pipeline information + +<details markdown="1">
+<summary>Output files</summary> + +- `pipeline_info/` + - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. + - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline. + - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`. + - Parameters used by the pipeline run: `params.json`. + +</details>
+ +[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. diff --git a/docs/usage.md b/docs/usage.md new file mode 100755 index 0000000..83a672f --- /dev/null +++ b/docs/usage.md @@ -0,0 +1,226 @@ +# nf-core/spotlight: Usage + +## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/spotlight/usage](https://nf-co.re/spotlight/usage) + +> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._ + +## Introduction + + + +## Samplesheet input + +You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. + +```bash +--input '[path to samplesheet file]' +``` + +### Multiple runs of the same sample + +The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes: + +```csv title="samplesheet.csv" +sample,fastq_1,fastq_2 +CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz +CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz +CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz +``` + +### Full samplesheet + +The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. + +A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice. + +```csv title="samplesheet.csv" +sample,fastq_1,fastq_2 +CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz +CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz +CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz +TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, +TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, +TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, +TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, +``` + +| Column | Description | +| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | +| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | +| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". 
| + +An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. + +## Running the pipeline + +The typical command for running the pipeline is as follows: + +```bash +nextflow run nf-core/spotlight --input ./samplesheet.csv --outdir ./results --genome GRCh37 -profile docker +``` + +This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. + +Note that the pipeline will create the following files in your working directory: + +```bash +work # Directory containing the nextflow working files +<OUTDIR> # Finished results in specified location (defined with --outdir) +.nextflow.log # Log file from Nextflow +# Other nextflow hidden files, e.g. history of pipeline runs and old logs. +``` + +If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file. + +Pipeline settings can be provided in a `yaml` or `json` file via `-params-file <file>`. + +:::warning +Do not use `-c <file>` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args). +::: + +The above pipeline run specified with a params file in yaml format: + +```bash +nextflow run nf-core/spotlight -profile docker -params-file params.yaml +``` + +with `params.yaml` containing: + +```yaml +input: './samplesheet.csv' +outdir: './results/' +genome: 'GRCh37' +<...> +``` + +You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch). + +### Updating the pipeline + +When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: + +```bash +nextflow pull nf-core/spotlight +``` + +### Reproducibility + +It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. + +First, go to the [nf-core/spotlight releases page](https://github.com/nf-core/spotlight/releases) and find the latest pipeline version - numeric only (e.g. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - e.g. `-r 1.3.1`. Of course, you can switch to another version by changing the number after the `-r` flag. + +This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. For example, at the bottom of the MultiQC reports. + +To further assist in reproducibility, you can share and re-use [parameter files](#running-the-pipeline) to repeat pipeline runs with the same settings without having to write out a command with every single parameter.
:::tip +If you wish to share such a profile (for example, uploading it as supplementary material for academic publications), make sure to NOT include cluster-specific paths to files, nor institution-specific profiles. +::: + +## Core Nextflow arguments + +:::note +These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen). +::: + +### `-profile` + +Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. + +Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below. + +:::info +We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported. +::: + +The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation). + +Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important! +They are loaded in sequence, so later profiles can overwrite earlier profiles. + +If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended, since it can lead to different results on different machines dependent on the computer environment. + +- `test` + - A profile with a complete configuration for automated testing + - Includes links to test data so needs no other parameters +- `docker` + - A generic configuration profile to be used with [Docker](https://docker.com/) +- `singularity` + - A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) +- `podman` + - A generic configuration profile to be used with [Podman](https://podman.io/) +- `shifter` + - A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/) +- `charliecloud` + - A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/) +- `apptainer` + - A generic configuration profile to be used with [Apptainer](https://apptainer.org/) +- `wave` + - A generic configuration profile to enable [Wave](https://seqera.io/wave/) containers. Use together with one of the above (requires Nextflow `24.03.0-edge` or later). +- `conda` + - A generic configuration profile to be used with [Conda](https://conda.io/docs/). Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer. + +### `-resume` + +Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see [this blog post](https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html). + +You can also supply a run name to resume a specific run: `-resume [run-name]`.
Use the `nextflow log` command to show previous run names. + +### `-c` + +Specify the path to a specific config file (this is a core Nextflow command). See the [nf-core website documentation](https://nf-co.re/usage/configuration) for more information. + +## Custom configuration + +### Resource requests + +Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified [here](https://github.com/nf-core/rnaseq/blob/4c27ef5610c87db00c3c5a3eed10b1d161abf575/conf/base.config#L18) it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped. + +To change the resource requests, please see the [max resources](https://nf-co.re/docs/usage/configuration#max-resources) and [tuning workflow resources](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources) section of the nf-core website. A minimal example config is sketched at the end of this section. + +### Custom Containers + +In some cases, you may wish to change the container or conda environment that a step of the pipeline uses for a particular tool. By default, nf-core pipelines use containers and software from the [biocontainers](https://biocontainers.pro/) or [bioconda](https://bioconda.github.io/) projects. However, in some cases the version specified in the pipeline may be out of date. + +To use a different container from the default container or conda environment specified in a pipeline, please see the [updating tool versions](https://nf-co.re/docs/usage/configuration#updating-tool-versions) section of the nf-core website. + +### Custom Tool Arguments + +A pipeline might not always support every possible argument or option of a particular tool used in the pipeline. Fortunately, nf-core pipelines provide some freedom to users to insert additional parameters that the pipeline does not include by default. + +To learn how to provide additional arguments to a particular tool of the pipeline, please see the [customising tool arguments](https://nf-co.re/docs/usage/configuration#customising-tool-arguments) section of the nf-core website. + +### nf-core/configs + +In most cases, you will only need to create a custom config as a one-off, but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings regularly, it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this, please test that the config file works with your pipeline of choice using the `-c` parameter. You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile. + +See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information about creating your own configuration files.
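+ +For example, a minimal custom config passed with `-c` might look like the following sketch. The process name `COMPUTE_NETWORK_FEATURES` is taken from `conf/modules.config`; the file name and resource values are illustrative only, not pipeline defaults: + +```groovy +// my_resources.config (hypothetical file name) +// Run with: nextflow run nf-core/spotlight -profile docker -c my_resources.config +process { + // Bump the resources for one resource-hungry step of the pipeline + withName: 'COMPUTE_NETWORK_FEATURES' { + cpus = 8 + memory = 64.GB + time = 12.h + } +} +```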
+
+If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack) on the [`#configs` channel](https://nfcore.slack.com/channels/configs).
+
+## Azure Resource Requests
+
+To be used with the `azurebatch` profile by specifying `-profile azurebatch`.
+We recommend `Standard_D16_v3` VMs as the default `params.vm_type`, but this option can be changed if required.
+
+Note that the choice of VM size depends on your quota and the overall workload during the analysis.
+For a thorough list, please refer to the [Azure Sizes for virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes).
+
+## Running in the background
+
+Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
+
+The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
+
+Alternatively, you can use `screen` / `tmux` or a similar tool to create a detached session which you can log back into at a later time.
+Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
+
+## Nextflow memory requirements
+
+In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
+We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~/.bash_profile`):
+
+```bash
+NXF_OPTS='-Xms1g -Xmx4g'
+```
diff --git a/env_requirements.txt b/env_requirements.txt index 1bc1cb1..0221518 100644 --- a/env_requirements.txt +++ b/env_requirements.txt @@ -8,7 +8,7 @@ six==1.16.0 tensorflow==2.11.0 tf_slim==1.1.0 opencv-python==4.6.0.66 -openslide-python==1.2.0 +tiffslide==2.4.0 tornado==6.2 scikit-learn==1.2.0 dask==2022.12.1 diff --git a/Python/libs/MFP/__init__.py b/lib/DL/__init__.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/MFP/__init__.py rename to lib/DL/__init__.py diff --git a/Python/libs/DL/image.py b/lib/DL/image.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/DL/image.py rename to lib/DL/image.py diff --git a/Python/libs/DL/utils.py b/lib/DL/utils.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/DL/utils.py rename to lib/DL/utils.py diff --git a/Python/libs/MFP/portraits/__init__.py b/lib/MFP/__init__.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/MFP/portraits/__init__.py rename to lib/MFP/__init__.py diff --git a/Python/libs/MFP/license.md b/lib/MFP/license.md old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/MFP/license.md rename to lib/MFP/license.md diff --git a/Python/libs/features/__init__.py b/lib/MFP/portraits/__init__.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/features/__init__.py rename to lib/MFP/portraits/__init__.py diff --git a/Python/libs/MFP/portraits/utils.py b/lib/MFP/portraits/utils.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/MFP/portraits/utils.py rename to lib/MFP/portraits/utils.py diff --git a/Python/libs/MFP/signatures/gene_signatures.gmt b/lib/MFP/signatures/gene_signatures.gmt old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/MFP/signatures/gene_signatures.gmt rename to lib/MFP/signatures/gene_signatures.gmt diff --git
a/Python/libs/MFP/signatures/gene_signatures_order.tsv b/lib/MFP/signatures/gene_signatures_order.tsv old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/MFP/signatures/gene_signatures_order.tsv rename to lib/MFP/signatures/gene_signatures_order.tsv diff --git a/Python/libs/model/__init__.py b/lib/features/__init__.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/model/__init__.py rename to lib/features/__init__.py diff --git a/Python/libs/features/clustering.py b/lib/features/clustering.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/features/clustering.py rename to lib/features/clustering.py diff --git a/Python/libs/features/features.py b/lib/features/features.py old mode 100644 new mode 100755 similarity index 99% rename from Python/libs/features/features.py rename to lib/features/features.py index 5436730..9cf38ba --- a/Python/libs/features/features.py +++ b/lib/features/features.py @@ -16,9 +16,6 @@ from scipy.spatial import ConvexHull from sklearn.metrics import pairwise_distances_argmin_min -# Point to folder with custom imports -sys.path.append(f"{os.path.dirname(os.getcwd())}/Python/libs") - # Own modules from model.constants import * import features.utils as utils @@ -34,6 +31,7 @@ def determine_lcc(graph, cell_type_assignments, cell_types=None): Args: graph (Networkx Graph): graph representing the slide constructed with Networkx cell_type_assignments (DataFrame): Dataframe containing the cell type labels of the individual tiles indicated with booleans based on P > threshold + slide_submitter_id (str): string with slide submitter ID (default='Slide_1') cell_types (list): list of cell types """ if cell_types is None: @@ -52,7 +50,8 @@ def determine_lcc(graph, cell_type_assignments, cell_types=None): graph_temp.nodes() ) lcc.append([cell_type, lcc_frac]) - return pd.DataFrame(lcc, columns=["cell_type", "type_spec_frac"]) + lcc = pd.DataFrame(lcc, columns=["cell_type", "type_spec_frac"]) + return lcc def compute_dual_node_fractions(cell_type_assignments, cell_types=None): @@ -398,6 +397,7 @@ def _individual_between( cluster_pairs = list( itertools.product(list(range(n_clusters)), list(range(n_clusters))) ) + print(tiles.head()) for cell_type1, cell_type2 in cell_type_pairs: for i, j in cluster_pairs: cluster1_tiles = tiles.loc[ diff --git a/Python/libs/features/graphs.py b/lib/features/graphs.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/features/graphs.py rename to lib/features/graphs.py diff --git a/lib/features/lcc.py b/lib/features/lcc.py new file mode 100755 index 0000000..9924454 --- /dev/null +++ b/lib/features/lcc.py @@ -0,0 +1,93 @@ +import sys +import os +import argparse +from argparse import ArgumentParser as AP +import networkx as nx +import pandas as pd +import time +from os.path import abspath + +# Own modules +import features.features as features +import features.utils as utils +from model.constants import DEFAULT_CELL_TYPES + +# Point to folder with custom imports +sys.path.append(f"{os.path.dirname(os.getcwd())}/Python/libs") + +# def get_args(): +# # Script description +# description = """Computing LCC""" + +# # Add parser +# parser = AP(description=description, +# formatter_class=argparse.RawDescriptionHelpFormatter) + +# # Sections +# parser.add_argument( +# "--clinical_files_input", +# help="Path to either a folder for multiple cancer types or single txt file.", required=False, +# default=None +# ) +# # TODO add arguments +# 
parser.add_argument("--version", action="version", version="0.1.0")
+#     arg = parser.parse_args()
+#     arg.output = abspath(arg.output)
+#     return arg
+
+def determine_lcc(graph, cell_type_assignments, cell_types=None):
+    """ Determine the fraction of the largest connected component (LCC) of a
+    cell type w.r.t. all nodes (tiles) of that cell type.
+    1. Determine the number of nodes N in the LCC for the probability map of a
+    cell type.
+    2. Determine the total number of nodes (tiles) T for that cell type
+    3. Determine the fraction of nodes that are connected: N/T
+
+    Args:
+        graph (Networkx Graph): graph representing the slide constructed with Networkx
+        cell_type_assignments (DataFrame): Dataframe containing the cell type labels of the individual tiles indicated with booleans based on P > threshold
+        cell_types (list): list of cell types
+    """
+    if cell_types is None:
+        cell_types = DEFAULT_CELL_TYPES
+
+    lcc = []
+    for cell_type in cell_types:
+        graph_temp = graph.copy()
+        # Keep only the tiles assigned to this cell type
+        graph_temp.remove_nodes_from(
+            list(cell_type_assignments[~cell_type_assignments[cell_type]].index)
+        )
+        if len(graph_temp.nodes()) > 0:
+            # Fraction of cell-type-specific tiles that belong to the largest connected component
+            lcc_frac = len(max(nx.connected_components(graph_temp), key=len)) / len(
+                graph_temp.nodes()
+            )
+            lcc.append([cell_type, lcc_frac])
+    lcc = pd.DataFrame(lcc, columns=["cell_type", "type_spec_frac"])
+    return lcc
+
+
+def lcc_wrapper(id, slide_data, predictions, graph, cell_types, abundance_threshold):
+    # Compute the LCC fractions for a single slide and label them with its submitter ID
+    slide_data = utils.get_slide_data(predictions, id)
+    node_cell_types = utils.assign_cell_types(
+        slide_data=slide_data, cell_types=cell_types, threshold=abundance_threshold)
+    lcc = features.determine_lcc(
+        graph=graph, cell_type_assignments=node_cell_types, cell_types=cell_types
+    )
+    lcc["slide_submitter_id"] = id
+    return lcc
+
+
+# def main(args):
+#     if not os.path.isdir(args.output_dir):
+#         os.mkdir(args.output_dir)
+#     lcc_wrapper(args.id, args.slide_data, args.predictions, args.graph, args.cell_types, args.abundance_threshold)
+
+# if __name__ == "__main__":
+#     args = get_args()
+#     st = time.time()
+#     main(args)
+#     rt = time.time() - st
+#     print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s")
+
diff --git a/lib/features/tests/test_determine_lcc.py b/lib/features/tests/test_determine_lcc.py new file mode 100755 index 0000000..c568423 --- /dev/null +++ b/lib/features/tests/test_determine_lcc.py @@ -0,0 +1,18 @@
+import networkx as nx
+import pandas as pd
+
+import features.features as features
+
+
+def test_determine_lcc():
+    # Toy slide graph: tiles 0-1-2 form a path; tiles 0 and 1 are tumor tiles
+    graph = nx.path_graph(3)
+    cell_type_assignments = pd.DataFrame({"tumor_purity": [True, True, False]})
+
+    lcc = features.determine_lcc(
+        graph=graph, cell_type_assignments=cell_type_assignments, cell_types=["tumor_purity"]
+    )
+
+    # The two tumor tiles are adjacent, so the LCC fraction is 1.0
+    assert list(lcc.columns) == ["cell_type", "type_spec_frac"]
+    assert lcc.loc[0, "type_spec_frac"] == 1.0
diff --git a/Python/libs/features/utils.py b/lib/features/utils.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/features/utils.py rename to lib/features/utils.py diff --git a/Python/libs/features/vis.py b/lib/features/vis.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/features/vis.py rename to lib/features/vis.py diff --git a/lib/model/__init__.py b/lib/model/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/Python/libs/model/constants.py b/lib/model/constants.py old mode 100644 new mode 100755 similarity index 89% rename from Python/libs/model/constants.py rename to lib/model/constants.py index 29ec09e..bdac420 --- a/Python/libs/model/constants.py +++ b/lib/model/constants.py @@ -1,7 +1,7 @@ import multiprocessing import sys -NUM_CORES = multiprocessing.cpu_count() - 4 +NUM_CORES =
multiprocessing.cpu_count() METADATA_COLS = ['tile_ID', 'slide_submitter_id', 'Section', 'Coord_X', 'Coord_Y', 'TCGA_patient_ID', ] DEFAULT_CELL_TYPES = ["CAFs", "T_cells", "endothelial_cells", "tumor_purity"] @@ -38,4 +38,4 @@ ] IDS = ['slide_submitter_id', 'sample_submitter_id'] -TILE_VARS = ['Section', 'Coord_X', 'Coord_Y', "tile_ID"] \ No newline at end of file +TILE_VARS = ['Section', 'Coord_X', 'Coord_Y', "tile_ID"] diff --git a/Python/libs/model/evaluate.py b/lib/model/evaluate.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/model/evaluate.py rename to lib/model/evaluate.py diff --git a/Python/libs/model/preprocessing.py b/lib/model/preprocessing.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/model/preprocessing.py rename to lib/model/preprocessing.py diff --git a/Python/libs/model/utils.py b/lib/model/utils.py old mode 100644 new mode 100755 similarity index 100% rename from Python/libs/model/utils.py rename to lib/model/utils.py diff --git a/lib/myslim/__init__.py b/lib/myslim/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/lib/myslim/bottleneck_predict.py b/lib/myslim/bottleneck_predict.py new file mode 100755 index 0000000..35e847c --- /dev/null +++ b/lib/myslim/bottleneck_predict.py @@ -0,0 +1,124 @@ +#!/usr/bin/env python +import os +import sys +import time +import tensorflow.compat.v1 as tf + +sys.path.append(os.getcwd()) + +import tf_slim as slim +from nets import nets_factory +from preprocessing import preprocessing_factory + +tf.compat.v1.disable_eager_execution() + +tf.app.flags.DEFINE_integer("num_classes", 42, "The number of classes.") +tf.app.flags.DEFINE_string( + "bot_out", None, "Output file for bottleneck features.") +tf.app.flags.DEFINE_string("pred_out", None, "Output file for predictions.") +tf.app.flags.DEFINE_string( + "model_name", "inception_v4", "The name of the architecture to evaluate.") +tf.app.flags.DEFINE_string( + "checkpoint_path", None, "The directory where the model was written to.") +tf.app.flags.DEFINE_integer("eval_image_size", 299, "Eval image size.") +tf.app.flags.DEFINE_string("file_dir", "../Output/process_train/", "") + +FLAGS = tf.app.flags.FLAGS + + +def main(_): + model_name_to_variables = { + "inception_v3": "InceptionV3", + "inception_v4": "InceptionV4", + } + model_name_to_bottleneck_tensor_name = { + "inception_v4": "InceptionV4/Logits/AvgPool_1a/AvgPool:0", + "inception_v3": "InceptionV3/Logits/AvgPool_1a_8x8/AvgPool:0", + } + bottleneck_tensor_name = model_name_to_bottleneck_tensor_name.get( + FLAGS.model_name) + preprocessing_name = FLAGS.model_name + eval_image_size = FLAGS.eval_image_size + model_variables = model_name_to_variables.get(FLAGS.model_name) + if model_variables is None: + tf.logging.error("Unknown model_name provided `%s`." 
% + FLAGS.model_name) + sys.exit(-1) + # Either specify a checkpoint_path directly or find the path + if tf.gfile.IsDirectory(FLAGS.checkpoint_path): + checkpoint_path = tf.train.latest_checkpoint(FLAGS.checkpoint_path) + print(checkpoint_path) + if checkpoint_path is None: + sys.exit(-1) + else: + checkpoint_path = FLAGS.checkpoint_path + + image_string = tf.placeholder(tf.string) + image = tf.image.decode_jpeg( + image_string, channels=3, try_recover_truncated=True, acceptable_fraction=0.3 + ) + image_preprocessing_fn = preprocessing_factory.get_preprocessing( + preprocessing_name, is_training=False + ) + network_fn = nets_factory.get_network_fn( + FLAGS.model_name, FLAGS.num_classes, is_training=False + ) + processed_image = image_preprocessing_fn( + image, eval_image_size, eval_image_size) + processed_images = tf.expand_dims(processed_image, 0) + + logits, _ = network_fn(processed_images) + probabilities = tf.nn.softmax(logits) + init_fn = slim.assign_from_checkpoint_fn( + checkpoint_path, slim.get_model_variables(model_variables) + ) + + print(FLAGS.bot_out) + + sess = tf.Session() + init_fn(sess) + + fto_bot = open(FLAGS.bot_out, "w") + fto_pred = open(FLAGS.pred_out, "w") + + filelist = [file_path for file_path in os.listdir( + FLAGS.file_dir) if (file_path.startswith("images_train") & file_path.endswith(".tfrecord"))] + for i in range(len(filelist)): + file = filelist[i] + fls = tf.python_io.tf_record_iterator(FLAGS.file_dir + "/" + file) + tf.logging.info("reading from: %s" % file) + start_time = time.time() + c = 0 + for fl in fls: + example = tf.train.Example() + example.ParseFromString(fl) + x = example.features.feature["image/encoded"].bytes_list.value[0] + filenames = str( + example.features.feature["image/filename"].bytes_list.value[0] + ) + label = str( + example.features.feature["image/class/label"].int64_list.value[0] + ) + preds = sess.run(probabilities, feed_dict={image_string: x}) + bottleneck_values = sess.run( + bottleneck_tensor_name, {image_string: x}) + fto_pred.write(filenames + "\t" + label) + fto_bot.write(filenames + "\t" + label) + for p in range(len(preds[0])): + fto_pred.write("\t" + str(preds[0][p])) + fto_pred.write("\n") + for p in range(len(bottleneck_values[0][0][0])): + fto_bot.write("\t" + str(bottleneck_values[0][0][0][p])) + fto_bot.write("\n") + c += 1 + used_time = time.time() - start_time + tf.logging.info("processed images: %s" % c) + tf.logging.info("used time: %s" % used_time) + + fto_bot.close() + fto_pred.close() + sess.close() + + +if __name__ == "__main__": + tf.app.run() diff --git a/Python/1_extract_histopathological_features/myslim/datasets/__init__.py b/lib/myslim/datasets/__init__.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/datasets/__init__.py rename to lib/myslim/datasets/__init__.py diff --git a/Python/1_extract_histopathological_features/myslim/datasets/convert.py b/lib/myslim/datasets/convert.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/datasets/convert.py rename to lib/myslim/datasets/convert.py diff --git a/Python/1_extract_histopathological_features/myslim/datasets/dataset_factory.py b/lib/myslim/datasets/dataset_factory.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/datasets/dataset_factory.py rename to lib/myslim/datasets/dataset_factory.py diff --git 
a/Python/1_extract_histopathological_features/myslim/datasets/dataset_utils.py b/lib/myslim/datasets/dataset_utils.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/datasets/dataset_utils.py rename to lib/myslim/datasets/dataset_utils.py diff --git a/Python/1_extract_histopathological_features/myslim/datasets/tumors_all.py b/lib/myslim/datasets/tumors_all.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/datasets/tumors_all.py rename to lib/myslim/datasets/tumors_all.py diff --git a/Python/1_extract_histopathological_features/myslim/deployment/__init__.py b/lib/myslim/deployment/__init__.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/deployment/__init__.py rename to lib/myslim/deployment/__init__.py diff --git a/Python/1_extract_histopathological_features/myslim/deployment/model_deploy.py b/lib/myslim/deployment/model_deploy.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/deployment/model_deploy.py rename to lib/myslim/deployment/model_deploy.py diff --git a/Python/1_extract_histopathological_features/myslim/eval_image_classifier.py b/lib/myslim/eval_image_classifier.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/eval_image_classifier.py rename to lib/myslim/eval_image_classifier.py diff --git a/Python/1_extract_histopathological_features/myslim/nets/__init__.py b/lib/myslim/nets/__init__.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/nets/__init__.py rename to lib/myslim/nets/__init__.py diff --git a/Python/1_extract_histopathological_features/myslim/nets/inception.py b/lib/myslim/nets/inception.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/nets/inception.py rename to lib/myslim/nets/inception.py diff --git a/Python/1_extract_histopathological_features/myslim/nets/inception_alt.py b/lib/myslim/nets/inception_alt.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/nets/inception_alt.py rename to lib/myslim/nets/inception_alt.py diff --git a/Python/1_extract_histopathological_features/myslim/nets/inception_utils.py b/lib/myslim/nets/inception_utils.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/nets/inception_utils.py rename to lib/myslim/nets/inception_utils.py diff --git a/Python/1_extract_histopathological_features/myslim/nets/inception_v4.py b/lib/myslim/nets/inception_v4.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/nets/inception_v4.py rename to lib/myslim/nets/inception_v4.py diff --git a/Python/1_extract_histopathological_features/myslim/nets/inception_v4_alt.py b/lib/myslim/nets/inception_v4_alt.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/nets/inception_v4_alt.py rename to lib/myslim/nets/inception_v4_alt.py diff --git a/Python/1_extract_histopathological_features/myslim/nets/nets_factory.py b/lib/myslim/nets/nets_factory.py old mode 100644 new mode 100755 similarity index 100% rename from 
Python/1_extract_histopathological_features/myslim/nets/nets_factory.py rename to lib/myslim/nets/nets_factory.py diff --git a/Python/1_extract_histopathological_features/myslim/nets/overfeat.py b/lib/myslim/nets/overfeat.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/nets/overfeat.py rename to lib/myslim/nets/overfeat.py diff --git a/lib/myslim/post_process_features.py b/lib/myslim/post_process_features.py new file mode 100755 index 0000000..84204c6 --- /dev/null +++ b/lib/myslim/post_process_features.py @@ -0,0 +1,134 @@ +#  Module imports +import argparse +from argparse import ArgumentParser as AP +import os +import dask.dataframe as dd +import pandas as pd + +#  Custom imports +import DL.utils as utils +from os.path import abspath +import time +from pathlib import Path + + +def get_args(): + # Script description + description = """Post processing features""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + + # Sections + parser.add_argument("--output_dir", help="Set output folder (default='.')", default = ".") + parser.add_argument("--create_parquet_subdir", help = "Whether to create a subdirectory called 'features_format_parquet' if slide_type == 'FFPE', default=False", default = False) + parser.add_argument( + "--slide_type", help="Type of tissue slide (FF or FFPE)]") + parser.add_argument( + "--is_tcga", help="Is TCGA dataset, default=False", type = bool, default = False) + # TODO add more details + parser.add_argument("--bot_train_file", type = str, default = None, help = "Txt file") + parser.add_argument("--version", action="version", version="0.1.0") + arg = parser.parse_args() + + if arg.bot_train_file is None: + arg.bot_train_file = Path(arg.output_dir, "bot_train.txt") + + if (arg.create_parquet_subdir): + arg.output_dir = abspath(Path(arg.output_dir, "features_format_parquet")) + + if not os.path.isdir(arg.output_dir): + os.mkdir(arg.output_dir) + + return arg + + +def handle_ff_slides(bot_train_file, is_tcga): + features_raw = pd.read_csv(bot_train_file, sep="\t", header=None) + # Extract the DL features (discard: col1 = tile paths, col2 = true class id) + features = features_raw.iloc[:, 2:] + features.columns = list(range(1536)) + # Add new column variables that define each tile + features["tile_ID"] = [utils.get_tile_name( + tile_path) for tile_path in features_raw.iloc[:, 0]] + features["Coord_X"] = [i[-2] + for i in features["tile_ID"].str.split("_")] + features["Coord_Y"] = [i[-1] + for i in features["tile_ID"].str.split("_")] + # FIX add sample_submitter_id and slide_submitter_id depending on is_tcga + if is_tcga: + features["sample_submitter_id"] = features["tile_ID"].str[0:16] + features["slide_submitter_id"] = features["tile_ID"].str[0:23] + features["Section"] = features["tile_ID"].str[20:23] + else: + features["sample_submitter_id"] = features['tile_ID'].str.split( + '_').str[0] + return(features) + +def handle_ffpe_slides(bot_train_file, is_tcga): + features_raw = dd.read_csv(bot_train_file, sep="\t", header=None) + features_raw['tile_ID'] = features_raw.iloc[:, 0] + features_raw.tile_ID = features_raw.tile_ID.map( + lambda x: x.split("/")[-1]) + features_raw['tile_ID'] = features_raw['tile_ID'].str.replace( + ".jpg'", "") + features = features_raw.map_partitions( + lambda df: df.drop(columns=[0, 1])) + new_names = list(map(lambda x: str(x), list(range(1536)))) + new_names.append('tile_ID') + features.columns = new_names + # 
FIX add sample_submitter_id and slide_submitter_id depending on is_tcga + if is_tcga: + features["sample_submitter_id"] = features["tile_ID"].str[0:16] + features["slide_submitter_id"] = features["tile_ID"].str[0:23] + features["Section"] = features["tile_ID"].str[20:23] + else: + features["sample_submitter_id"] = features['tile_ID'].str.split( + '_').str[0] + features['Coord_X'] = features['tile_ID'].str.split('_').str[1] + features['Coord_Y'] = features['tile_ID'].str.split('_').str[-1] + return(features) + + +def post_process_features(bot_train_file, slide_type = "FF", is_tcga="TCGA"): + """ + Format extracted histopathological features from bot.train.txt file generated by myslim/bottleneck_predict.py and extract the 1,536 features, tile names. Extract several variables from tile ID. + + Args: + bot_train_file (txt) + slide_type (str) + is_tcga (bool) + + Returns: + features (dataframe) contains the 1,536 features, followed by the sample_submitter_id, tile_ID, slide_submitter_id, Section, Coord_X and Coord_Y and in the rows the tiles + """ + # Read histopathological computed features + if slide_type == "FF": + return(handle_ff_slides(bot_train_file=bot_train_file, is_tcga=is_tcga)) + elif slide_type == "FFPE": + return(handle_ffpe_slides(bot_train_file=bot_train_file, is_tcga=is_tcga)) + else: + raise Exception("Invalid `slide_type`, please choose 'FF' or 'FFPE' ") + + +def main(args): + features = post_process_features( + bot_train_file=args.bot_train_file, + slide_type=args.slide_type, + is_tcga=args.is_tcga) + if (args.slide_type == "FF"): + #  Save features to .csv file + features.to_csv(Path(args.output_dir, "features.txt"), sep="\t", header=True) + elif (args.slide_type == "FFPE"): + features.to_parquet(path= args.output_dir, compression='gzip', + name_function=utils.name_function) + print("Finished post-processing of features...") + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/lib/myslim/post_process_predictions.py b/lib/myslim/post_process_predictions.py new file mode 100755 index 0000000..9730749 --- /dev/null +++ b/lib/myslim/post_process_predictions.py @@ -0,0 +1,240 @@ +# Module imports +import argparse +from argparse import ArgumentParser as AP +import os +import dask.dataframe as dd +import pandas as pd + +#  Custom imports +import DL.utils as utils +import numpy as np +from os.path import abspath +import time +from pathlib import Path + +def get_args(): + # Script description + description = """Post-processing predictions""" + + # Add parser + parser = AP(description=description, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--output_dir", help="Set output folder", default = ".") + parser.add_argument("--create_parquet_subdir", help = "Whether to create a subdirectory called 'predictions_format_parquet' if slide_type == 'FFPE', default=False", default = False) + parser.add_argument( + "--slide_type", help="Type of tissue slide (FF or FFPE) (default='FF')", type = str, default = "FF") + parser.add_argument( + "--path_codebook", help="codebook.txt file", required=True, type=str) + parser.add_argument( + "--path_tissue_classes", help="Tissue_classes.csv file", required=True, type=str) + parser.add_argument("--cancer_type", help = "Cancer type", required = True, type =str) + parser.add_argument("--pred_train_file", help = "", type = str, default = None) + arg = parser.parse_args() + + if arg.pred_train_file is 
None: + arg.pred_train_file = Path(arg.output_dir, "pred_train.txt") + + if (arg.create_parquet_subdir): + arg.output_dir = abspath(Path(arg.output_dir, "predictions_format_parquet")) + + if not os.path.isdir(arg.output_dir): + os.mkdir(arg.output_dir) + + return arg + + + +def handle_ff_slides(pred_train_file, codebook, tissue_classes, cancer_type): + predictions_raw = pd.read_csv(pred_train_file, sep="\t", header=None) + # Extract tile name incl. coordinates from path + tile_names = [utils.get_tile_name(tile_path) + for tile_path in predictions_raw[0]] + # Create output dataframe for post-processed data + predictions = pd.DataFrame(tile_names, columns=["tile_ID"]) + # Get predicted probabilities for all 42 classes + rename columns + pred_probabilities = predictions_raw.iloc[:, 2:] + pred_probabilities.columns = codebook["class_id"] + # Get predicted and true class ids + predictions["pred_class_id"] = pred_probabilities.idxmax( + axis="columns") + predictions["true_class_id"] = 41 + # Get corresponding max probabilities to the predicted class + predictions["pred_probability"] = pred_probabilities.max(axis=1) + # Replace class id with class name + predictions["true_class_name"] = predictions["true_class_id"].copy() + predictions["pred_class_name"] = predictions["pred_class_id"].copy() + found_class_ids = set(predictions["true_class_id"]).union( + set(predictions["pred_class_id"])) + for class_id in found_class_ids: + predictions["true_class_name"].replace( + class_id, codebook["class_name"][class_id], inplace=True + ) + predictions["pred_class_name"].replace( + class_id, codebook["class_name"][class_id], inplace=True + ) + + # Define whether prediction was right + predictions["is_correct_pred"] = ( + predictions["true_class_id"] == predictions["pred_class_id"]) + predictions["is_correct_pred"] = predictions["is_correct_pred"].replace( + False, "F") + predictions.is_correct_pred = predictions.is_correct_pred.astype(str) + # Get tumor and tissue ID + # TODO ERROR + temp = pd.DataFrame( + {"tumor_type": predictions["true_class_name"].str[:-2]}) + temp = pd.merge(temp, tissue_classes, on="tumor_type", how="left") + # Set of IDs for normal and tumor (because of using multiple classes) + IDs_tumor = list(set(temp["ID_tumor"])) + if list(set(temp.tumor_type.tolist()))[0] == cancer_type: + # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) + predictions["tumor_label_prob"] = np.nan + predictions["normal_label_prob"] = np.nan + for ID_tumor in IDs_tumor: + vals = pred_probabilities.loc[temp["ID_tumor"] + == ID_tumor, ID_tumor] + predictions.loc[temp["ID_tumor"] == + ID_tumor, "tumor_label_prob"] = vals + + predictions["is_correct_pred_label"] = np.nan + else: + IDs_normal = list(set(temp["ID_normal"])) + # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) + predictions["tumor_label_prob"] = np.nan + predictions["normal_label_prob"] = np.nan + for ID_tumor in IDs_tumor: + vals = pred_probabilities.loc[temp["ID_tumor"] + == ID_tumor, ID_tumor] + predictions.loc[temp["ID_tumor"] == + ID_tumor, "tumor_label_prob"] = vals + + for ID_normal in IDs_normal: + vals = pred_probabilities.loc[temp["ID_normal"] + == ID_normal, ID_normal] + predictions.loc[temp["ID_normal"] == + ID_normal, "normal_label_prob"] = vals + + # Check if the correct label (tumor/normal) is predicted + temp_probs = predictions[["tumor_label_prob", "normal_label_prob"]] + is_normal_label_prob = ( + temp_probs["normal_label_prob"] > temp_probs["tumor_label_prob"] + 
) + is_tumor_label_prob = ( + temp_probs["normal_label_prob"] < temp_probs["tumor_label_prob"] + ) + is_normal_label = predictions["true_class_name"].str.find( + "_N") != -1 + is_tumor_label = predictions["true_class_name"].str.find( + "_T") != -1 + + is_normal = is_normal_label & is_normal_label_prob + is_tumor = is_tumor_label & is_tumor_label_prob + + predictions["is_correct_pred_label"] = is_normal | is_tumor + predictions["is_correct_pred_label"].replace( + True, "T", inplace=True) + predictions["is_correct_pred_label"].replace( + False, "F", inplace=True) + return(predictions) + +def handle_ffpe_slides(pred_train_file, codebook, tissue_classes, cancer_type): + predictions_raw = dd.read_csv(pred_train_file, sep="\t", header=None) + predictions_raw['tile_ID'] = predictions_raw.iloc[:, 0] + predictions_raw.tile_ID = predictions_raw.tile_ID.map( + lambda x: x.split("/")[-1]) + predictions_raw['tile_ID'] = predictions_raw['tile_ID'].str.replace( + ".jpg'", "") + predictions = predictions_raw.map_partitions( + lambda df: df.drop(columns=[0, 1])) + new_names = list(map(lambda x: str(x), codebook["class_id"])) + new_names.append('tile_ID') + predictions.columns = new_names + predictions = predictions.map_partitions(lambda x: x.assign( + pred_class_id=x.iloc[:, 0:41].idxmax(axis="columns"))) + predictions["true_class_id"] = 41 + predictions = predictions.map_partitions(lambda x: x.assign( + pred_probability=x.iloc[:, 0:41].max(axis="columns"))) + predictions["true_class_name"] = predictions["true_class_id"].copy() + predictions["pred_class_name"] = predictions["pred_class_id"].copy() + predictions.pred_class_id = predictions.pred_class_id.astype(int) + res = dict(zip(codebook.class_id, codebook.class_name)) + predictions = predictions.map_partitions(lambda x: x.assign( + pred_class_name=x.loc[:, 'pred_class_id'].replace(res))) + predictions = predictions.map_partitions(lambda x: x.assign( + true_class_name=x.loc[:, 'true_class_id'].replace(res))) + predictions["is_correct_pred"] = ( + predictions["true_class_id"] == predictions["pred_class_id"]) + predictions["is_correct_pred"] = predictions["is_correct_pred"].replace( + False, "F") + predictions.is_correct_pred = predictions.is_correct_pred.astype(str) + temp = predictions.map_partitions(lambda x: x.assign( + tumor_type=x["true_class_name"].str[:-2])) + temp = temp.map_partitions(lambda x: pd.merge( + x, tissue_classes, on="tumor_type", how="left")) + if (temp['tumor_type'].compute() == cancer_type).any(): + # Probability for predicting tumor and normal label (regardless of tumor [tissue] type) + predictions["tumor_label_prob"] = np.nan + predictions["normal_label_prob"] = np.nan + predictions = predictions.map_partitions( + lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) + predictions["is_correct_pred_label"] = np.nan + else: + # TO DO + predictions["tumor_label_prob"] = np.nan + predictions["normal_label_prob"] = np.nan + # predictions = predictions.map_partitions(lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) + # predictions = predictions.map_partitions(lambda x: x.assign(tumor_label_prob=x.loc[:, '41'])) + +def post_process_predictions(pred_train_file, slide_type, path_codebook, path_tissue_classes, cancer_type): + """ + Format predicted tissue classes and derive tumor purity from pred.train.txt file generated by myslim/bottleneck_predict.py and + The pred.train.txt file contains the tile ID, the true class id and the 42 predicted probabilities for the 42 tissue classes. 
+ + Args: + output_dir (str): path pointing to folder for storing all created files by script + + Returns: + {output_dir}/predictions.txt containing the following columns + - tile_ID, + - pred_class_id and true_class_id: class ids defined in codebook.txt) + - pred_class_name and true_class_name: class names e.g. LUAD_T, defined in codebook.txt) + - pred_probability: corresponding probability + - is_correct_pred (boolean): correctly predicted tissue class label + - tumor_label_prob and normal_label_prob: probability for predicting tumor and normal label (regardless of tumor or tissue type) + - is_correct_pred_label (boolean): correctly predicted 'tumor' or 'normal' tissue regardless of tumor or tissue type + In the rows the tiles. + """ + + # Initialize + codebook = pd.read_csv(path_codebook, delim_whitespace=True, header=None) + codebook.columns = ["class_name", "class_id"] + tissue_classes = pd.read_csv(path_tissue_classes, sep="\t") + + # Read predictions + if slide_type == "FF": + return(handle_ff_slides(pred_train_file=pred_train_file, codebook=codebook, tissue_classes=tissue_classes, cancer_type = cancer_type)) + #  Save features to .csv file + elif slide_type == "FFPE": + return(handle_ffpe_slides(pred_train_file=pred_train_file, codebook=codebook, tissue_classes= tissue_classes, cancer_type=cancer_type)) + else: + raise Exception("Invalid `slide_type`, please choose 'FF' or 'FFPE' ") + +def main(args): + predictions = post_process_predictions(output_dir=args.output_dir, slide_type=args.slide_type, path_codebook=args.path_codebook, + path_tissue_classes=args.path_tissue_classes, cancer_type=args.cancer_type) + if (args.slide_type == "FF"): + predictions.to_csv(Path(args.output_dir, "predictions.txt"), sep="\t") + elif (args.slide_type == "FFPE"): + # Save features using parquet + def name_function(x): return f"predictions-{x}.parquet" + predictions.to_parquet( + path=args.output_dir, compression='gzip', name_function=name_function) + print("Finished post-processing of predictions...") + + +if __name__ == "__main__": + args = get_args() + st = time.time() + main(args) + rt = time.time() - st + print(f"Script finished in {rt // 60:.0f}m {rt % 60:.0f}s") diff --git a/Python/1_extract_histopathological_features/myslim/preprocessing/__init__.py b/lib/myslim/preprocessing/__init__.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/preprocessing/__init__.py rename to lib/myslim/preprocessing/__init__.py diff --git a/Python/1_extract_histopathological_features/myslim/preprocessing/inception_preprocessing.py b/lib/myslim/preprocessing/inception_preprocessing.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/preprocessing/inception_preprocessing.py rename to lib/myslim/preprocessing/inception_preprocessing.py diff --git a/Python/1_extract_histopathological_features/myslim/preprocessing/inception_preprocessing_dataAug.py b/lib/myslim/preprocessing/inception_preprocessing_dataAug.py old mode 100644 new mode 100755 similarity index 99% rename from Python/1_extract_histopathological_features/myslim/preprocessing/inception_preprocessing_dataAug.py rename to lib/myslim/preprocessing/inception_preprocessing_dataAug.py index f6310fb..68a517d --- a/Python/1_extract_histopathological_features/myslim/preprocessing/inception_preprocessing_dataAug.py +++ b/lib/myslim/preprocessing/inception_preprocessing_dataAug.py @@ -246,7 +246,7 @@ def preprocess_for_train(image, 
height, width, bbox, tf.expand_dims(rotated_image, 0)) # crop image in the center - #rotated_image = tf.image.central_crop(rotated_image, 0.6) + # rotated_image = tf.image.central_crop(rotated_image, 0.6) if add_image_summaries: tf.summary.image('5_centralcropped_image', diff --git a/Python/1_extract_histopathological_features/myslim/preprocessing/preprocessing_factory.py b/lib/myslim/preprocessing/preprocessing_factory.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/preprocessing/preprocessing_factory.py rename to lib/myslim/preprocessing/preprocessing_factory.py diff --git a/Python/1_extract_histopathological_features/myslim/run/bottleneck_predict.sh b/lib/myslim/run/bottleneck_predict.sh old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/run/bottleneck_predict.sh rename to lib/myslim/run/bottleneck_predict.sh diff --git a/Python/1_extract_histopathological_features/myslim/run/convert.sh b/lib/myslim/run/convert.sh old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/run/convert.sh rename to lib/myslim/run/convert.sh diff --git a/Python/1_extract_histopathological_features/myslim/run/eval.sh b/lib/myslim/run/eval.sh old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/run/eval.sh rename to lib/myslim/run/eval.sh diff --git a/Python/1_extract_histopathological_features/myslim/run/load_inception_v4.sh b/lib/myslim/run/load_inception_v4.sh old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/run/load_inception_v4.sh rename to lib/myslim/run/load_inception_v4.sh diff --git a/Python/1_extract_histopathological_features/myslim/run/load_inception_v4_alt.sh b/lib/myslim/run/load_inception_v4_alt.sh old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/run/load_inception_v4_alt.sh rename to lib/myslim/run/load_inception_v4_alt.sh diff --git a/Python/1_extract_histopathological_features/myslim/train_image_classifier.py b/lib/myslim/train_image_classifier.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/train_image_classifier.py rename to lib/myslim/train_image_classifier.py diff --git a/Python/1_extract_histopathological_features/myslim/train_image_classifier_jpeg.py b/lib/myslim/train_image_classifier_jpeg.py old mode 100644 new mode 100755 similarity index 100% rename from Python/1_extract_histopathological_features/myslim/train_image_classifier_jpeg.py rename to lib/myslim/train_image_classifier_jpeg.py diff --git a/run_pipeline.sh b/run_pipeline.sh deleted file mode 100644 index 6e90915..0000000 --- a/run_pipeline.sh +++ /dev/null @@ -1,69 +0,0 @@ -#!/usr/bin/env bash -#SBATCH -J spotlight-docker -#SBATCH --mail-type=END,FAIL -#SBATCH --mail-user=JohnDoe@mail.com -#SBATCH --partition=veryhimem -#SBATCH --ntasks=1 -#SBATCH --nodes=1 -#SBATCH --cpus-per-task=1 -#SBATCH --mem=64G -#SBATCH --time=01:00:00 -#SBATCH --output=slurm_out/%x_%j.out -#SBATCH --error=slurm_out/%x_%j.out - -module load apptainer - -# Directory 'spotlight_docker' - -work_dir="/path/to/spotlight_docker" -spotlight_sif="path/to/spotlight_sif" - -# Define directories/files in container (mounted) - -folder_images="/path/to/images_dir" -output_dir="/path/to/output_dir" - -# 
Relative to docker, i.e. start with /data - -checkpoint="/data/checkpoint/Retrained_Inception_v4/model.ckpt-100000" -clinical_files_dir="/data/path/to/clinical/TCGA/file.tsv" - -# Remaining parameters (this configuration has been tested) -slide_type="FF" -tumor_purity_threshold=80 -class_names="SKCM_T" -model_name="inception_v4" - -echo "Create output directory: ${output_dir}..." -mkdir -p ${output_dir} - -echo "Binding directories..." -# Bind directories + give r/o/w access (do not touch) -# Automatically binds the following -# - Included in repository: /data, /Python -# - Defined by used: {folder_images}, {output_dir} -export APPTAINER_BINDPATH=${work_dir}/data/:/project/data:ro,${folder_images}:/project/images:ro,${output_dir}:/project/output:rw,${work_dir}/run_scripts:/project/run_scripts:ro,${work_dir}/Python:/project/Python:ro - -echo "Run pipeline..." -echo "Extract histopathological features (1 out of 3)" -apptainer exec \ - --cleanenv \ - -c \ - ${spotlight_sif} \ - bash "/project/run_scripts/1_extract_histopatho_features.sh" ${checkpoint} ${clinical_files_dir} ${slide_type} ${class_names} ${tumor_purity_threshold} ${model_name} - -echo "Tile level cell type quanitification (2 out of 3)" -apptainer exec \ - --cleanenv \ - -c \ - ${spotlight_sif} \ - bash "/project/run_scripts/2_tile_level_cell_type_quantification.sh" $slide_type - -echo "Compute spatial features (3 out of 3)" -apptainer exec \ - --cleanenv \ - -c \ - ${spotlight_sif} \ - bash "/project/run_scripts/3_compute_spatial_features.sh" ${slide_type} - -echo "COMPLETED!" \ No newline at end of file diff --git a/run_scripts/1_extract_histopatho_features.sh b/run_scripts/1_extract_histopatho_features.sh deleted file mode 100755 index f8c1a27..0000000 --- a/run_scripts/1_extract_histopatho_features.sh +++ /dev/null @@ -1,94 +0,0 @@ -#!/bin/bash - -# Adjusted pipeline from PC-CHiP workflow: -# Fu, Y., Jung, A.W., Torne, R.V. et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat Cancer 1, 800–810 (2020).> - -# To execute the pipeline, define: input_dir, output_dir, cancer_type, class_name, checkpoint_path and TCGA_clinical_files. -# cancertype = TCGA_abbreviation_tumor/normal e.g. TCGA_COAD_tumor (see https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations) -# class_name = e.g. 
COAD_T (see codebook.txt) - -# Define type of slide (Fresh-Frozen [FF] vs Formalin-Fixed Paraffin-Embedded [FFPE]) - -# General setup -repo_dir="/project" - -echo "REPO_DIR: ${repo_dir}" -echo "Model checkpoints: $1"; -echo "Dir w/ clinical files: $2"; - -echo "Slide type: $3"; -echo "Class names: $4"; -echo "Tumor purity threshold: $5"; -echo "Model name: $6"; - -# ---- Relative to /project ---- # -# User input -checkpoint_path=${repo_dir}/$1 -clinical_files_dir=${repo_dir}/$2 - -# Fixed dir -slides_dir=${repo_dir}/images -output_dir=${repo_dir}/output - -# Fixed files -path_codebook=${repo_dir}/Python/1_extract_histopathological_features/codebook.txt -path_tissue_classes=${repo_dir}/Python/1_extract_histopathological_features/tissue_classes.csv - -# ---- Parameters ---- # -slide_type=$3 -class_names=$4 -tumor_purity_threshold=$5 -model_name=$6 - -# ---------------------------------- # -# ---- create new clinical file ---- # -# ---------------------------------- # - -python $repo_dir/Python/1_extract_histopathological_features/myslim/create_clinical_file.py \ - --class_names $class_names \ - --clinical_files_dir $clinical_files_dir \ - --tumor_purity_threshold $tumor_purity_threshold \ - --output_dir $output_dir/1_histopathological_features \ - --path_codebook ${path_codebook} - -clinical_file=$output_dir/1_histopathological_features/generated_clinical_file.txt - -ls $slides_dir | tee ${output_dir}/list_images.txt -awk -v a=81 -v b="${class_names}" -v c=41 'FNR==NR{print; next}{split($1, tmp, "."); OFS="\t"; print tmp[1], tmp[1], $1, a, b, c}' $clinical_file ${output_dir}/list_images.txt > $output_dir/1_histopathological_features/final_clinical_file.txt - -# --------------------------------------------------------- # -# ---- image tiling and image conversion to TF records ---- # -# --------------------------------------------------------- # - -python $repo_dir/Python/1_extract_histopathological_features/pre_processing.py \ - --slides_folder $slides_dir \ - --output_folder $output_dir/1_histopathological_features \ - --clinical_file_path $output_dir/1_histopathological_features/final_clinical_file.txt - -# ------------------------------------------------------ # -# ---- Compute predictions and bottlenecks features ---- # -# ------------------------------------------------------ # - -# Compute predictions and bottlenecks features using the Retrained_Inception_v4 checkpoints -python $repo_dir/Python/1_extract_histopathological_features/myslim/bottleneck_predict.py \ - --num_classes=42 \ - --bot_out=$output_dir/1_histopathological_features/bot_train.txt \ - --pred_out=$output_dir/1_histopathological_features/pred_train.txt \ - --model_name $model_name \ - --checkpoint_path $checkpoint_path \ - --file_dir $output_dir/1_histopathological_features/process_train - -# ----------------------------------------------------- # -# ---- Post-processing of predictions and futures ----- # -# ----------------------------------------------------- # - -# Transform bottleneck features, add dummy variable for tissue type for each tile, save predictions in seperate files -# (= input for pipeline part 2) - -python $repo_dir/Python/1_extract_histopathological_features/post_processing.py \ - --output_dir $output_dir/1_histopathological_features \ - --slide_type $slide_type \ - --path_codebook $path_codebook \ - --path_tissue_classes $path_tissue_classes \ - -# # outputs two files: $output_dir/features $output_dir/predictions diff --git a/run_scripts/2_tile_level_cell_type_quantification.sh 
b/run_scripts/2_tile_level_cell_type_quantification.sh deleted file mode 100644 index 989d19c..0000000 --- a/run_scripts/2_tile_level_cell_type_quantification.sh +++ /dev/null @@ -1,45 +0,0 @@ -#!/bin/bash - -##################################################################### -## Compute cell-type quantification from transfer learning models ## -##################################################################### - -# ----------------------------------- # -# --------- Setup file paths -------- # -# ----------------------------------- # - -# General setup -repo_dir=project - -# command line arguments -echo "Slide type: $1"; - -# Define type of slide -slide_type=$1 - -# Fixed dir -output_dir=${repo_dir}/output -histopatho_features_dir=${output_dir}/1_histopathological_features - -# Transfer Learning trained models directory (default: use of FF here) -models_dir=${repo_dir}/data/TF_models/SKCM_FF -var_names_path=${repo_dir}/Python/2_train_multitask_models/task_selection_names.pkl - -# Compute predictions using models learned from unseen folds -prediction_mode="test" # (tcga_validation, tcga_train_validation) - -echo "Prediction mode: $prediction_mode" - -# ---------------------------------------------------- # -# ---- Predict cell type abundances on tile level ---- # -# ---------------------------------------------------- # - -# For now, we use models trained on FF slides - -python ${repo_dir}/Python/2_train_multitask_models/tile_level_cell_type_quantification.py \ - --models_dir $models_dir \ - --output_dir "$output_dir/2_tile_level_quantification" \ - --histopatho_features_dir $histopatho_features_dir \ - --prediction_mode $prediction_mode \ - --var_names_path $var_names_path \ - --slide_type $slide_type diff --git a/run_scripts/3_compute_spatial_features.sh b/run_scripts/3_compute_spatial_features.sh deleted file mode 100644 index 3f27244..0000000 --- a/run_scripts/3_compute_spatial_features.sh +++ /dev/null @@ -1,75 +0,0 @@ -#!/bin/bash - -############################### -## Compute spatial features ## -############################### - -# ----------------------------------- # -# --------- Setup file paths -------- # -# ----------------------------------- # - -# General setup -repo_dir="/project" - -# command line rguments -echo "Slide type: $1"; - - -# Fixed dir -output_dir=${repo_dir}/output - -# Fixed files -tile_quantification_path="${output_dir}/2_tile_level_quantification/test_tile_predictions_proba.csv" - -# Define type of slide -slide_type=$1 - -# ---------------------------------- # -# ---- Compute all features -------- # -# ---------------------------------- # -run_mode=1 -python $repo_dir/Python/3_spatial_characterization/computing_features.py \ - --workflow_mode $run_mode \ - --tile_quantification_path $tile_quantification_path \ - --output_dir $output_dir/3_spatial_features \ - --metadata_path $output_dir/3_spatial_features/metadata.csv \ - --slide_type $slide_type # OPTIONAL BY DEFAULT FF - # --cell_types=$cell_types \ # OPTIONAL - #--graphs_path=$graphs_path # OPTIONAL - -# # ---------------------------------- # -# # ---- Compute network features ---- # -# # ---------------------------------- # -# workflow=2 -# python $repo_path/Python/computing_features.py \ -# --workflow=$workflow \ -# --tile_quantification_path=$tile_quantification_path \ -# --output_dir=$output_dir -# # --slide_type=$slide_type \ # OPTIONAL BY DEFAULT FF -# # --cell_types=$cell_types \ # OPTIONAL -# # --graphs_path=$graphs_path \ # OPTIONAL - -# # ------------------------------------- # -# # ---- 
Compute clustering features ---- # -# # ------------------------------------- # -# workflow=3 -# python $repo_path/Python/computing_features.py \ -# --workflow=$workflow \ -# --tile_quantification_path=$tile_quantification_path \ -# --output_dir=$output_dir \ -# # --slide_type=$slide_type \ # OPTIONAL BY DEFAULT FF -# # --cell_types=$cell_types \ # OPTIONAL -# # --graphs_path=$graphs_path \ # OPTIONAL - - -# # -------------------------- # -# # ---- Combine features ---- # -# # -------------------------- # -# workflow=4 -# python $repo_path/Python/computing_features.py \ -# --workflow=$workflow \ -# --tile_quantification_path=$tile_quantification_path \ -# --output_dir=$output_dir \ -# # --slide_type=$slide_type \ # OPTIONAL BY DEFAULT FF -# # --cell_types=$cell_types \ # OPTIONAL -# # --graphs_path=$graphs_path \ # OPTIONAL diff --git a/run_scripts/create_tmp_clinical_file.sh b/run_scripts/create_tmp_clinical_file.sh deleted file mode 100644 index f9e692c..0000000 --- a/run_scripts/create_tmp_clinical_file.sh +++ /dev/null @@ -1,5 +0,0 @@ -#!/bin/bash -clinical_file="$data_example" -ls $slides_dir | tee list_images.txt -awk -v a=81 -v b="SKCM_T" -v c=41 'FNR==NR{print; next}{split($1, tmp, "."); print tmp[1], tmp[1], $1, a, b, c}' $clinical_file list_images.txt > $clinical_file/final_clinical_file.txt - diff --git a/run_scripts/requirements.txt b/run_scripts/requirements.txt deleted file mode 100644 index 79d7ac6..0000000 --- a/run_scripts/requirements.txt +++ /dev/null @@ -1,15 +0,0 @@ -caffeine==0.5 -GitPython==3.1.30 -joblib==1.1.1 -matplotlib==3.6.2 -networkx==3.0 -numpy==1.24.1 -pandas==1.5.2 -Pillow==9.4.0 -scikit_learn==1.2.0 -scipy==1.10.0 -seaborn==0.12.2 -six==1.16.0 -tensorflow==2.11.0 -tf_slim==1.1.0 -openpyxl \ No newline at end of file diff --git a/run_scripts/task_selection_names.pkl b/run_scripts/task_selection_names.pkl deleted file mode 100644 index f1e9f67..0000000 Binary files a/run_scripts/task_selection_names.pkl and /dev/null differ diff --git a/tower.yml b/tower.yml new file mode 100644 index 0000000..e69de29