diff --git a/ABOUT_THIS_TEMPLATE.md b/ABOUT_THIS_TEMPLATE.md deleted file mode 100644 index 27a6104..0000000 --- a/ABOUT_THIS_TEMPLATE.md +++ /dev/null @@ -1,198 +0,0 @@ -# About this template - -Hi, I created this template to help you get started with a new project. - -I have created and maintained a number of python libraries, applications and -frameworks and during those years I have learned a lot about how to create a -project structure and how to structure a project to be as modular and simple -as possible. - -Some decisions I have made while creating this template are: - - - Create a project structure that is as modular as possible. - - Keep it simple and easy to maintain. - - Allow for a lot of flexibility and customizability. - - Low dependency (this template doesn't add dependencies) - -## Structure - -Lets take a look at the structure of this template: - -```text -├── Containerfile # The file to build a container using buildah or docker -├── CONTRIBUTING.md # Onboarding instructions for new contributors -├── docs # Documentation site (add more .md files here) -│   └── index.md # The index page for the docs site -├── .github # Github metadata for repository -│   ├── release_message.sh # A script to generate a release message -│   └── workflows # The CI pipeline for Github Actions -├── .gitignore # A list of files to ignore when pushing to Github -├── HISTORY.md # Auto generated list of changes to the project -├── LICENSE # The license for the project -├── Makefile # A collection of utilities to manage the project -├── MANIFEST.in # A list of files to include in a package -├── mkdocs.yml # Configuration for documentation site -├── metaloci # The main python package for the project -│   ├── base.py # The base module for the project -│   ├── __init__.py # This tells Python that this is a package -│   ├── __main__.py # The entry point for the project -│   └── VERSION # The version for the project is kept in a static file -├── README.md # The main readme for the project -├── setup.py # The setup.py file for installing and packaging the project -├── requirements.txt # An empty file to hold the requirements for the project -├── requirements-test.txt # List of requirements for testing and devlopment -├── setup.py # The setup.py file for installing and packaging the project -└── tests # Unit tests for the project (add mote tests files here) - ├── conftest.py # Configuration, hooks and fixtures for pytest - ├── __init__.py # This tells Python that this is a test package - └── test_base.py # The base test case for the project -``` - -## FAQ - -Frequent asked questions. - -### Why this template is not using [Poetry](https://python-poetry.org/) ? - -I really like Poetry and I think it is a great tool to manage your python projects, -if you want to switch to poetry, you can run `make switch-to-poetry`. - -But for this template I wanted to keep it simple. - -Setuptools is the most simple and well supported way of packaging a Python project, -it doesn't require extra dependencies and is the easiest way to install the project. - -Also, poetry doesn't have a good support for installing projects in development mode yet. - -### Why the `requirements.txt` is empty ? - -This template is a low dependency project, so it doesn't have any extra dependencies. -You can add new dependencies as you will or you can use the `make init` command to -generate a `requirements.txt` file based on the template you choose `flask, fastapi, click etc`. - -### Why there is a `requirements-test.txt` file ? 
- -This file lists all the requirements for testing and development, -I think the development environment and testing environment should be as similar as possible. - -Except those tools that are up to the developer choice (like ipython, ipdb etc). - -### Why the template doesn't have a `pyproject.toml` file ? - -It is possible to run `pip install https://github.com/name/repo/tarball/main` and -have pip to download the package direcly from Git repo. - -For that to work you need to have a `setup.py` file, and `pyproject.toml` is not -supported for that kind of installation. - -I think it is easier for example you want to install specific branch or tag you can -do `pip install https://github.com/name/repo/tarball/{TAG|REVISON|COMMIT}` - -People automating CI for your project will be grateful for having a setup.py file - -### Why isn't this template made as a cookiecutter template? - -I really like [cookiecutter](https://github.com/cookiecutter/cookiecutter) and it is a great way to create new projects, -but for this template I wanted to use the Github `Use this template` button, -to use this template doesn't require to install extra tooling such as cookiecutter. - -Just click on [Use this template](https://github.com/rochacbruno/python-project-template/generate) and you are good to go. - -The substituions are done using github actions and a simple sed script. - -### Why `VERSION` is kept in a static plain text file? - -I used to have my version inside my main module in a `__version__` variable, then -I had to do some tricks to read that version variable inside the setuptools -`setup.py` file because that would be available only after the installation. - -I decided to keep the version in a static file because it is easier to read from -wherever I want without the need to install the package. - -e.g: `cat metaloci/VERSION` will get the project version without harming -with module imports or anything else, it is useful for CI, logs and debugging. - -### Why to include `tests`, `history` and `Containerfile` as part of the release? - -The `MANIFEST.in` file is used to include the files in the release, once the -project is released to PyPI all the files listed on MANIFEST.in will be included -even if the files are static or not related to Python. - -Some build systems such as RPM, DEB, AUR for some Linux distributions, and also -internal repackaging systems tends to run the tests before the packaging is performed. - -The Containerfile can be useful to provide a safer execution environment for -the project when running on a testing environment. - -I added those files to make it easier for packaging in different formats. - -### Why conftest includes a go_to_tmpdir fixture? - -When your project deals with file system operations, it is a good idea to use -a fixture to create a temporary directory and then remove it after the test. - -Before executing each test pytest will create a temporary directory and will -change the working directory to that path and run the test. - -So the test can create temporary artifacts isolated from other tests. - -After the execution Pytest will remove the temporary directory. - -### Why this template is not using [pre-commit](https://pre-commit.com/) ? - -pre-commit is an excellent tool to automate checks and formatting on your code. - -However I figured out that pre-commit adds extra dependency and it an entry barrier -for new contributors. 
- -Having the linting, checks and formatting as simple commands on the [Makefile](Makefile) -makes it easier to undestand and change. - -Once the project is bigger and complex, having pre-commit as a dependency can be a good idea. - -### Why the CLI is not using click? - -I wanted to provide a simple template for a CLI application on the project main entry point -click and typer are great alternatives but are external dependencies and this template -doesn't add dependencies besides those used for development. - -### Why this doesn't provide a full example of application using Flask or Django? - -as I said before, I want it to be simple and multipurpose, so I decided to not include -external dependencies and programming design decisions. - -It is up to you to decide if you want to use Flask or Django and to create your application -the way you think is best. - -This template provides utilities in the Makefile to make it easier to you can run: - -```bash -$ make init -Which template do you want to apply? [flask, fastapi, click, typer]? > flask -Generating a new project with Flask ... -``` - -Then the above will download the Flask template and apply it to the project. - -## The Makefile - -All the utilities for the template and project are on the Makefile - -```bash -❯ make -Usage: make - -Targets: -help: ## Show the help. -install: ## Install the project in dev mode. -fmt: ## Format code using black & isort. -lint: ## Run pep8, black, mypy linters. -test: lint ## Run tests and generate coverage report. -watch: ## Run tests on every change. -clean: ## Clean unused files. -virtualenv: ## Create a virtual environment. -release: ## Create a new tag for release. -docs: ## Build the documentation. -switch-to-poetry: ## Switch to poetry package manager. -init: ## Initialize the project based on an application template. -``` diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md deleted file mode 100644 index 8efbe10..0000000 --- a/CONTRIBUTING.md +++ /dev/null @@ -1,113 +0,0 @@ -# How to develop on this project - -metaloci welcomes contributions from the community. - -**You need PYTHON3!** - -This instructions are for linux base systems. (Linux, MacOS, BSD, etc.) -## Setting up your own fork of this repo. - -- On github interface click on `Fork` button. -- Clone your fork of this repo. `git clone git@github.com:YOUR_GIT_USERNAME/METALoci.git` -- Enter the directory `cd METALoci` -- Add upstream repo `git remote add upstream https://github.com/3DGenomes/METALoci` - -## Setting up your own virtual environment - -Run `make virtualenv` to create a virtual environment. -then activate it with `source .venv/bin/activate`. - -## Install the project in develop mode - -Run `make install` to install the project in develop mode. - -## Run the tests to ensure everything is working - -Run `make test` to run the tests. - -## Create a new branch to work on your contribution - -Run `git checkout -b my_contribution` - -## Make your changes - -Edit the files using your preferred editor. (we recommend VIM or VSCode) - -## Format the code - -Run `make fmt` to format the code. - -## Run the linter - -Run `make lint` to run the linter. - -## Test your changes - -Run `make test` to run the tests. - -Ensure code coverage report shows `100%` coverage, add tests to your PR. - -## Build the docs locally - -Run `make docs` to build the docs. - -Ensure your new changes are documented. - -## Commit your changes - -This project uses [conventional git commit messages](https://www.conventionalcommits.org/en/v1.0.0/). 
- -Example: `fix(package): update setup.py arguments 🎉` (emojis are fine too) - -## Push your changes to your fork - -Run `git push origin my_contribution` - -## Submit a pull request - -On github interface, click on `Pull Request` button. - -Wait CI to run and one of the developers will review your PR. -## Makefile utilities - -This project comes with a `Makefile` that contains a number of useful utility. - -```bash -❯ make -Usage: make - -Targets: -help: ## Show the help. -install: ## Install the project in dev mode. -fmt: ## Format code using black & isort. -lint: ## Run pep8, black, mypy linters. -test: lint ## Run tests and generate coverage report. -watch: ## Run tests on every change. -clean: ## Clean unused files. -virtualenv: ## Create a virtual environment. -release: ## Create a new tag for release. -docs: ## Build the documentation. -switch-to-poetry: ## Switch to poetry package manager. -init: ## Initialize the project based on an application template. -``` - -## Making a new release - -This project uses [semantic versioning](https://semver.org/) and tags releases with `X.Y.Z` -Every time a new tag is created and pushed to the remote repo, github actions will -automatically create a new release on github and trigger a release on PyPI. - -For this to work you need to setup a secret called `PIPY_API_TOKEN` on the project settings>secrets, -this token can be generated on [pypi.org](https://pypi.org/account/). - -To trigger a new release all you need to do is. - -1. If you have changes to add to the repo - * Make your changes following the steps described above. - * Commit your changes following the [conventional git commit messages](https://www.conventionalcommits.org/en/v1.0.0/). -2. Run the tests to ensure everything is working. -4. Run `make release` to create a new tag and push it to the remote repo. - -the `make release` will ask you the version number to create the tag, ex: type `0.1.1` when you are asked. - -> **CAUTION**: The make release will change local changelog files and commit all the unstaged changes you have. diff --git a/Containerfile b/Containerfile deleted file mode 100644 index 9a1b3b4..0000000 --- a/Containerfile +++ /dev/null @@ -1,5 +0,0 @@ -FROM python:3.7-alpine -COPY . /app -WORKDIR /app -RUN pip install . -CMD ["metaloci"] diff --git a/HISTORY.md b/HISTORY.md deleted file mode 100644 index 9bf6ef0..0000000 --- a/HISTORY.md +++ /dev/null @@ -1,13 +0,0 @@ -Changelog -========= - - -0.1.2 (2021-08-14) ------------------- -- Fix release, README and windows CI. [Bruno Rocha] -- Release: version 0.1.0. [Bruno Rocha] - - -0.1.0 (2021-08-14) ------------------- -- Add release command. [Bruno Rocha] diff --git a/LICENSE b/LICENSE index f288702..e72bfdd 100644 --- a/LICENSE +++ b/LICENSE @@ -671,4 +671,4 @@ into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read -. +. \ No newline at end of file diff --git a/Makefile b/Makefile deleted file mode 100644 index 5b94999..0000000 --- a/Makefile +++ /dev/null @@ -1,117 +0,0 @@ -.ONESHELL: -ENV_PREFIX=$(shell python -c "if __import__('pathlib').Path('.venv/bin/pip').exists(): print('.venv/bin/')") -USING_POETRY=$(shell grep "tool.poetry" pyproject.toml && echo "yes") - -.PHONY: help -help: ## Show the help. 
- @echo "Usage: make " - @echo "" - @echo "Targets:" - @fgrep "##" Makefile | fgrep -v fgrep - - -.PHONY: show -show: ## Show the current environment. - @echo "Current environment:" - @if [ "$(USING_POETRY)" ]; then poetry env info && exit; fi - @echo "Running using $(ENV_PREFIX)" - @$(ENV_PREFIX)python -V - @$(ENV_PREFIX)python -m site - -.PHONY: install -install: ## Install the project in dev mode. - @if [ "$(USING_POETRY)" ]; then poetry install && exit; fi - @echo "Don't forget to run 'make virtualenv' if you got errors." - $(ENV_PREFIX)pip install -e .[test] - -.PHONY: fmt -fmt: ## Format code using black & isort. - $(ENV_PREFIX)isort metaloci/ - $(ENV_PREFIX)black -l 79 metaloci/ - $(ENV_PREFIX)black -l 79 tests/ - -.PHONY: lint -lint: ## Run pep8, black, mypy linters. - $(ENV_PREFIX)flake8 metaloci/ - $(ENV_PREFIX)black -l 79 --check metaloci/ - $(ENV_PREFIX)black -l 79 --check tests/ - $(ENV_PREFIX)mypy --ignore-missing-imports metaloci/ - -.PHONY: test -test: lint ## Run tests and generate coverage report. - $(ENV_PREFIX)pytest -v --cov-config .coveragerc --cov=metaloci -l --tb=short --maxfail=1 tests/ - $(ENV_PREFIX)coverage xml - $(ENV_PREFIX)coverage html - -.PHONY: watch -watch: ## Run tests on every change. - ls **/**.py | entr $(ENV_PREFIX)pytest -s -vvv -l --tb=long --maxfail=1 tests/ - -.PHONY: clean -clean: ## Clean unused files. - @find ./ -name '*.pyc' -exec rm -f {} \; - @find ./ -name '__pycache__' -exec rm -rf {} \; - @find ./ -name 'Thumbs.db' -exec rm -f {} \; - @find ./ -name '*~' -exec rm -f {} \; - @rm -rf .cache - @rm -rf .pytest_cache - @rm -rf .mypy_cache - @rm -rf build - @rm -rf dist - @rm -rf *.egg-info - @rm -rf htmlcov - @rm -rf .tox/ - @rm -rf docs/_build - -.PHONY: virtualenv -virtualenv: ## Create a virtual environment. - @if [ "$(USING_POETRY)" ]; then poetry install && exit; fi - @echo "creating virtualenv ..." - @rm -rf .venv - @python3 -m venv .venv - @./.venv/bin/pip install -U pip - @./.venv/bin/pip install -e .[test] - @echo - @echo "!!! Please run 'source .venv/bin/activate' to enable the environment !!!" - -.PHONY: release -release: ## Create a new tag for release. - @echo "WARNING: This operation will create s version tag and push to github" - @read -p "Version? (provide the next x.y.z semver) : " TAG - @echo "$${TAG}" > metaloci/VERSION - @$(ENV_PREFIX)gitchangelog > HISTORY.md - @git add metaloci/VERSION HISTORY.md - @git commit -m "release: version $${TAG} 🚀" - @echo "creating git tag : $${TAG}" - @git tag $${TAG} - @git push -u origin HEAD --tags - @echo "Github Actions will detect the new tag and release the new version." - -.PHONY: docs -docs: ## Build the documentation. - @echo "building documentation ..." - @$(ENV_PREFIX)mkdocs build - URL="site/index.html"; xdg-open $$URL || sensible-browser $$URL || x-www-browser $$URL || gnome-open $$URL - -.PHONY: switch-to-poetry -switch-to-poetry: ## Switch to poetry package manager. - @echo "Switching to poetry ..." - @if ! 
poetry --version > /dev/null; then echo 'poetry is required, install from https://python-poetry.org/'; exit 1; fi - @rm -rf .venv - @poetry init --no-interaction --name=a_flask_test --author=rochacbruno - @echo "" >> pyproject.toml - @echo "[tool.poetry.scripts]" >> pyproject.toml - @echo "metaloci = 'metaloci.__main__:main'" >> pyproject.toml - @cat requirements.txt | while read in; do poetry add --no-interaction "$${in}"; done - @cat requirements-test.txt | while read in; do poetry add --no-interaction "$${in}" --dev; done - @poetry install --no-interaction - @mkdir -p .github/backup - @mv requirements* .github/backup - @mv setup.py .github/backup - @echo "You have switched to https://python-poetry.org/ package manager." - @echo "Please run 'poetry shell' or 'poetry run metaloci'" - -.PHONY: init -init: ## Initialize the project based on an application template. - @./.github/init.sh - diff --git a/README.md b/README.md index a772d0d..7bcb232 100644 --- a/README.md +++ b/README.md @@ -1,32 +1,10 @@ -# metaloci +# METALoci -[![codecov](https://codecov.io/gh/3DGenomes/METALoci/branch/main/graph/badge.svg?token=METALoci_token_here)](https://codecov.io/gh/3DGenomes/METALoci) -[![CI](https://github.com/3DGenomes/METALoci/actions/workflows/main.yml/badge.svg)](https://github.com/3DGenomes/METALoci/actions/workflows/main.yml) - -Awesome metaloci created by 3DGenomes +Spatially auto-correlated signals in 3D genomes ## Install it from PyPI ```bash pip install metaloci ``` - -## Usage - -```py -from metaloci import BaseClass - -BaseClass().base_method() -base_function() -``` - -```bash -$ python -m metaloci -#or -$ metaloci -``` - -## Development - -Read the [CONTRIBUTING.md](CONTRIBUTING.md) file. diff --git a/docs/index.md b/docs/index.md deleted file mode 100644 index 000ea34..0000000 --- a/docs/index.md +++ /dev/null @@ -1,17 +0,0 @@ -# Welcome to MkDocs - -For full documentation visit [mkdocs.org](https://www.mkdocs.org). - -## Commands - -* `mkdocs new [dir-name]` - Create a new project. -* `mkdocs serve` - Start the live-reloading docs server. -* `mkdocs build` - Build the documentation site. -* `mkdocs -h` - Print help message and exit. - -## Project layout - - mkdocs.yml # The configuration file. - docs/ - index.md # The documentation homepage. - ... # Other markdown pages, images and other files. diff --git a/metaloci/VERSION b/metaloci/VERSION index 6e8bf73..08456a4 100644 --- a/metaloci/VERSION +++ b/metaloci/VERSION @@ -1 +1 @@ -0.1.0 +0.2.8 \ No newline at end of file diff --git a/metaloci/__init__.py b/metaloci/__init__.py index 378c440..8699e93 100644 --- a/metaloci/__init__.py +++ b/metaloci/__init__.py @@ -1,5 +1,3 @@ -from .base import BaseClass, base_function - -__all__ = ["BaseClass", "base_function"] +from . import mlo import os, sys; sys.path.append(os.path.dirname(os.path.realpath(__file__))) diff --git a/metaloci/__main__.py b/metaloci/__main__.py index 274e056..54f8e44 100644 --- a/metaloci/__main__.py +++ b/metaloci/__main__.py @@ -1,61 +1,89 @@ -import argparse # pragma: no cover - -from . import BaseClass, base_function # pragma: no cover - - -def main() -> None: # pragma: no cover - """ - The main function executes on commands: - `python -m metaloci` and `$ metaloci `. - - This is your program's entry point. - - You can change this function to do whatever you want. 
-    Examples:
-    * Run a test suite
-    * Run a server
-    * Do some other stuff
-    * Run a command line application (Click, Typer, ArgParse)
-    * List all available tasks
-    * Run an application (Flask, FastAPI, Django, etc.)
-    """
-    parser = argparse.ArgumentParser(
-        description="metaloci.",
-        epilog="Enjoy the metaloci functionality!",
-    )
-    # This is required positional argument
-    parser.add_argument(
-        "name",
-        type=str,
-        help="The username",
-        default="3DGenomes",
-    )
-    # This is optional named argument
-    parser.add_argument(
-        "-m",
-        "--message",
-        type=str,
-        help="The Message",
-        default="Hello",
-        required=False,
-    )
-    parser.add_argument(
-        "-v",
-        "--verbose",
-        action="store_true",
-        help="Optionally adds verbosity",
-    )
-    args = parser.parse_args()
-    print(f"{args.message} {args.name}!")
-    if args.verbose:
-        print("Verbose mode is on.")
-
-    print("Executing main function")
-    base = BaseClass()
-    print(base.base_method())
-    print(base_function())
-    print("End of main function")
-
-
-if __name__ == "__main__":  # pragma: no cover
-    main()
+import sys
+from argparse import ArgumentParser, HelpFormatter, RawDescriptionHelpFormatter
+from importlib.metadata import version
+
+from metaloci.tools import figure, layout, ml, prep
+
+
+def main(arguments) -> None:  # pragma: no cover
+
+    DESCRIPTION = "METALoci: spatially auto-correlated signals in 3D genomes.\n"
+
+    if len(arguments) > 1:
+
+        subcommand = arguments[1]
+
+        if subcommand == "version" or subcommand == "--version":
+
+            print("METALoci v" + version("metaloci"))
+            return
+
+    parser = ArgumentParser()
+
+    parser.formatter_class = lambda prog: HelpFormatter(prog, width=120,
+                                                        max_help_position=60)
+
+    subparser = parser.add_subparsers(title="Available programs")
+
+    args_pp = {}
+
+    # prep
+    args_pp["prep"] = subparser.add_parser("prep",
+                                           description=prep.DESCRIPTION,
+                                           help=prep.DESCRIPTION,
+                                           add_help=False,
+                                           formatter_class=RawDescriptionHelpFormatter)
+    args_pp["prep"].set_defaults(func=prep.run)
+    prep.populate_args(args_pp["prep"])
+
+    # layout
+    args_pp["layout"] = subparser.add_parser("layout",
+                                             description=layout.DESCRIPTION,
+                                             help=layout.DESCRIPTION,
+                                             add_help=False,
+                                             formatter_class=RawDescriptionHelpFormatter)
+    args_pp["layout"].set_defaults(func=layout.run)
+    layout.populate_args(args_pp["layout"])
+
+    # lm (implemented in metaloci.tools.ml)
+    args_pp["lm"] = subparser.add_parser("lm",
+                                         description=ml.DESCRIPTION,
+                                         help=ml.DESCRIPTION,
+                                         add_help=False,
+                                         formatter_class=RawDescriptionHelpFormatter)
+    args_pp["lm"].set_defaults(func=ml.run)
+    ml.populate_args(args_pp["lm"])
+
+    # figure
+    args_pp["figure"] = subparser.add_parser("figure",
+                                             description=figure.DESCRIPTION,
+                                             help=figure.DESCRIPTION,
+                                             add_help=False,
+                                             formatter_class=RawDescriptionHelpFormatter)
+    args_pp["figure"].set_defaults(func=figure.run)
+    figure.populate_args(args_pp["figure"])
+
+    if len(arguments) == 1:
+
+        print(DESCRIPTION)
+        parser.print_help()
+        return
+
+    if len(arguments) == 2:
+
+        try:
+
+            args_pp[arguments[1]].print_help()
+            return
+
+        except KeyError:
+
+            pass
+
+    args = parser.parse_args(arguments[1:])
+
+    args.func(args)
+
+
+if __name__ == "__main__":  # pragma: no cover
+
+    sys.exit(main(sys.argv))
diff --git a/metaloci/base.py b/metaloci/base.py
deleted file mode 100644
index 3c8cef2..0000000
--- a/metaloci/base.py
+++ /dev/null
@@ -1,32 +0,0 @@
-"""
-metaloci base module.
-
-This is the principal module of the metaloci project.
-here you put your main classes and objects.
-
-Be creative! do whatever you want!
-
-If you want to replace this with a Flask application run:
-
-    $ make init
-
-and then choose `flask` as template.
-"""
-
-
-class BaseClass:
-    def base_method(self) -> str:
-        """
-        Base method.
-        """
-        return "hello from BaseClass"
-
-    def __call__(self) -> str:
-        return self.base_method()
-
-
-def base_function() -> str:
-    """
-    Base function.
-    """
-    return "hello from base function"
diff --git a/metaloci/io/hic_input.py b/metaloci/io/hic_input.py
deleted file mode 100644
index e69de29..0000000
diff --git a/metaloci/io/ml_output.py b/metaloci/io/ml_output.py
deleted file mode 100644
index e69de29..0000000
diff --git a/metaloci/io/signal_input.py b/metaloci/io/signal_input.py
deleted file mode 100644
index e69de29..0000000
diff --git a/metaloci/metaloci b/metaloci/metaloci
deleted file mode 100644
index e69de29..0000000
diff --git a/metaloci/misc/misc.py b/metaloci/misc/misc.py
index c6406b6..ee04d39 100644
--- a/metaloci/misc/misc.py
+++ b/metaloci/misc/misc.py
@@ -82,11 +82,11 @@ def check_diagonal(diagonal: np.ndarray) -> tuple[int, float, float, list]:
 
     return total, percentage_stretch, percentage_zeroes, zero_loc
 
-def clean_matrix(mlobject: mlo.MetalociObject, bad_regions: pd.DataFrame) -> np.ndarray:
+def clean_matrix(mlobject: mlo.MetalociObject) -> mlo.MetalociObject:
     """
     Clean a given HiC matrix. It checks if the matrix has too many zeroes at
     the diagonal, removes values that are zero at the diagonal but are not in
-    the resto of the matrix, adds pseudocounts to zeroes depending on the min
+    the rest of the matrix, adds pseudocounts to zeroes depending on the min
     value, scales all values depending on the min value and computes the log10
     of all values.
 
@@ -94,8 +94,6 @@
     ----------
     mlobject : mlo.MetalociObject
         METALoci object with a matrix in it.
-    bad_regions : dict
-        Dictionay {"region": [], "reason": []} in which to append bad regions.
 
     Returns
     -------
@@ -108,20 +106,16 @@
 
     if total_zeroes == len(diagonal):
 
-        # bad_regions[mlobject.region].append(mlobject.region)
-        bad_regions[mlobject.region].append("empty")
-
-        return None
+        mlobject.bad_region = "empty"
+        return mlobject
 
     if percentage_zeroes >= 50:
 
-        # bad_regions[mlobject.region].append(mlobject.region)
-        bad_regions[mlobject.region].append("percentage_of_zeroes")
+        mlobject.bad_region = "too many zeroes"
 
     if max_stretch >= 20:
 
-        # bad_regions[mlobject.region].append(mlobject.region)
-        bad_regions[mlobject.region].append("stretch")
+        mlobject.bad_region = "stretch"
 
     mlobject.matrix[zero_loc] = 0
     mlobject.matrix[:, zero_loc] = 0
@@ -146,7 +140,7 @@
mlobject.matrix = np.log10(mlobject.matrix) - return mlobject.matrix + return mlobject def signal_normalization(region_signal: pd.DataFrame, pseudocounts: float = None, norm=None) -> np.ndarray: @@ -173,7 +167,9 @@ def signal_normalization(region_signal: pd.DataFrame, pseudocounts: float = None if pseudocounts is None: - signal = [0.0 if np.isnan(index) else index for index in region_signal] + # signal = [0.0 if np.isnan(index) else index for index in region_signal] + median_default = np.nanmedian(region_signal) # jfk fix + signal = [median_default if np.isnan(index) else index for index in region_signal] # jfk fix else: @@ -198,7 +194,7 @@ def signal_normalization(region_signal: pd.DataFrame, pseudocounts: float = None return np.array(signal) -def check_chromosome_names(hic_file: Path, data: Path, coords: bool): +def check_cooler_names(hic_file: Path, data: Path, coords: bool): with open(data[0], "r") as handler: @@ -210,28 +206,16 @@ def check_chromosome_names(hic_file: Path, data: Path, coords: bool): signal_chr_nom = "N" - try: - - hic_file = cooler.Cooler(hic_file) - - if "chr" in [hic_file.chromnames][0][0]: - - cooler_chr_nom = "chrN" - - else: - - cooler_chr_nom = "N" + hic_file = cooler.Cooler(hic_file) - except OSError: #CHECK EXCEPTION + if "chr" in [hic_file.chromnames][0][0]: - if "chr" in hicstraw.HiCFile(hic_file).getChromosomes()[1].name: + cooler_chr_nom = "chrN" - cooler_chr_nom = "chrN" - - else: - - cooler_chr_nom = "N" + else: + cooler_chr_nom = "N" + del hic_file with open(coords, "r") as handler: @@ -256,95 +240,46 @@ def check_chromosome_names(hic_file: Path, data: Path, coords: bool): "\n\nExiting..." ) - -def bed_to_metaloci(data, coords, resolution): - - boundaries_dictionary = defaultdict(dict) - - # Open centromeres and telomeres coordinates file and assign the corresponding values to variables. 
- with open(file=coords, mode="r", encoding="utf-8") as chrom: - - for l in chrom: - - line = l.rstrip().split("\t") - boundaries_dictionary[line[0]]["end"] = int(line[1]) # tl_st: end of the initial telomere - - file_info = pd.read_table(data[0]) - - for i in range(1, len(data)): - - temp = pd.read_table(data[i]) - file_info = pd.merge(file_info, temp, on=["chrom", "start", "end"], how="inner") - - col_names = file_info.columns.tolist() - - bad_col_names = [] - - for i in range(3, len(col_names)): - - if len(re.findall("_", col_names[i])) != 1: - - bad_col_names.append(col_names[i]) - if len(bad_col_names) > 0: +def check_hic_names(hic_file: Path, data: Path, coords: bool): - print("Problems with the following signal names:") - print(", ".join(str(x) for x in bad_col_names)) - print("Names for signal must be in the following format: CLASS_NAME.") - print("Class refers to data that can be potentially merged in downstream analysis.") - - sys.exit("Exiting due to improper signal names.") - - for chrm, chrm_value in boundaries_dictionary.items(): - - print(f"chromosome {chrm.rsplit('r', 1)[1]} in progress.") - - pbar = tqdm(total=int(chrm_value["end"] / resolution) + 1) - - if chrm not in file_info["chrom"].unique(): - - print(f"chromosome {chrm.rsplit('r', 1)[1]} not found in the signal file(s), skipping...") - continue + with open(data[0], "r") as handler: - bin_start = 0 - bin_end = bin_start + resolution + if [line.strip() for line in handler][1].startswith("chr"): - info = defaultdict(list) + signal_chr_nom = "chrN" - while bin_start <= chrm_value["end"]: + else: - pbar.update() + signal_chr_nom = "N" - info["Chr"].append(chrm) - info["St"].append(bin_start) - info["Sp"].append(bin_end) + if "chr" in hicstraw.HiCFile(hic_file).getChromosomes()[1].name: - tmp_bin = file_info[ - (file_info["start"] >= int(bin_start)) - & (file_info["end"] <= int(bin_end)) - & (file_info["chrom"] == chrm) - ] + hic_chr_nom = "chrN" - # Go over the columns to get all the signals. - for j in range(3, tmp_bin.shape[1]): + else: - if tmp_bin.shape[0] == 0: + hic_chr_nom = "N" - info[col_names[j]].append(np.nan) + del hic_file - else: + with open(coords, "r") as handler: - info[col_names[j]].append(np.nanmedian(tmp_bin.iloc[:, j].tolist())) + if [line.strip() for line in handler][1].startswith("chr"): - # If the end of the current bin is the start of the terminal telomere, stop - if bin_end == boundaries_dictionary[f"{chrm}"]["end"]: + coords_chr_nom = "chrN" - break + else: - # Creating tmp variables for an easier check of overlap with centromeres and telomeres. - bin_start = bin_end - bin_end = bin_start + resolution + coords_chr_nom = "N" - pbar.close() + if not signal_chr_nom == hic_chr_nom == coords_chr_nom: - return info + exit( + "\nThe signal, cooler and chromosome sizes files do not have the same nomenclature for chromosomes:\n" + f"\n\tSignal chromosomes nomenclature is '{signal_chr_nom}'. " + f"\n\tHi-C chromosomes nomenclature is '{hic_chr_nom}'. " + f"\n\tChromosome sizes nomenclature is '{coords_chr_nom}'. " + "\n\nPlease, rename the chromosome names. " + "\n\nExiting..." 
+ ) diff --git a/metaloci/mlo.py b/metaloci/mlo.py index 0576084..e2f2235 100644 --- a/metaloci/mlo.py +++ b/metaloci/mlo.py @@ -14,6 +14,7 @@ def __init__(self, region, chrom, start, end, resolution, poi, persistence_lengt self.save_path = save_path self.subset_matrix = None self.matrix = None + self.bad_region = None self.mixed_matrices = None self.signals_dict = None self.flat_matrix = None diff --git a/metaloci/plot/plot.py b/metaloci/plot/plot.py index ef08093..bab99ef 100644 --- a/metaloci/plot/plot.py +++ b/metaloci/plot/plot.py @@ -14,13 +14,12 @@ from scipy.ndimage import rotate from scipy.stats import linregress, zscore from shapely.geometry import Point -from shapely.geometry.multipolygon import MultiPolygon from metaloci import mlo from metaloci.misc import misc -def get_kk_plot(mlobject: mlo.MetalociObject, restraints: bool = True): +def get_kk_plot(mlobject: mlo.MetalociObject, restraints: bool = True, neighbourhood: float = None): """ Generate Kamada-Kawai plot from pre-calculated restraints. @@ -36,7 +35,6 @@ def get_kk_plot(mlobject: mlo.MetalociObject, restraints: bool = True): matplotlib.pyplot.figure.Figure Kamada-Kawai layout plot object. """ - PLOTSIZE = 10 POINTSIZE = PLOTSIZE * 5 @@ -48,7 +46,7 @@ def get_kk_plot(mlobject: mlo.MetalociObject, restraints: bool = True): options = {"node_size": 50, "edge_color": "black", "linewidths": 0.1, "width": 0.05} if restraints == True: - + nx.draw( mlobject.kk_graph, mlobject.kk_nodes, @@ -69,8 +67,15 @@ def get_kk_plot(mlobject: mlo.MetalociObject, restraints: bool = True): g.annotate(f" {mlobject.chrom}:{mlobject.start}", (xs[0], ys[0]), size=8) g.annotate(f" {mlobject.chrom}:{mlobject.end}", (xs[len(xs) - 1], ys[len(ys) - 1]), size=8) + poi_x, poi_y = xs[mlobject.poi], ys[mlobject.poi] + + if neighbourhood: + + circle = plt.Circle((poi_x, poi_y), neighbourhood, color="red", fill=False, lw=1, zorder=3) + plt.gca().add_patch(circle) + sns.scatterplot( - x=[xs[mlobject.poi]], y=[ys[mlobject.poi]], s=POINTSIZE * 1.5, ec="lime", fc="none", zorder=3 + x=[poi_x], y=[poi_y], s=POINTSIZE * 1.5, ec="lime", fc="none", zorder=4 ) return kk_plt @@ -207,8 +212,6 @@ def get_gaudi_signal_plot(mlobject: mlo.MetalociObject, lmi_geometry: pd.DataFra matplotlib figure containing the Gaudi signal plot. """ - poi = lmi_geometry.loc[lmi_geometry["moran_index"] == mlobject.poi, "bin_index"].iloc[0] - cmap = "PuOr_r" min_value = lmi_geometry.signal.min() max_value = lmi_geometry.signal.max() @@ -216,8 +219,7 @@ def get_gaudi_signal_plot(mlobject: mlo.MetalociObject, lmi_geometry: pd.DataFra gsp, ax = plt.subplots(figsize=(12, 10), subplot_kw={"aspect": "equal"}) lmi_geometry.plot(column="signal", cmap=cmap, linewidth=2, edgecolor="white", ax=ax) - sns.scatterplot(x=[lmi_geometry.X[poi]], y=[lmi_geometry.Y[poi]], s=50, ec="none", fc="lime") - + sns.scatterplot(x=[lmi_geometry.X[mlobject.poi]], y=[lmi_geometry.Y[mlobject.poi]], s=50, ec="none", fc="lime") sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=min_value, vmax=max_value)) sm.set_array([1, 2, 3, 4]) @@ -262,7 +264,6 @@ def get_gaudi_type_plot( gtp : matplotlib.pyplot.figure.Figure matplotlib figure containing the Gaudi type plot. 
""" - poi = lmi_geometry.loc[lmi_geometry["moran_index"] == mlobject.poi, "bin_index"].iloc[0] legend_elements = [ Line2D([0], [0], marker="o", color="w", markerfacecolor=colors[1], label="HH", markersize=20), @@ -272,15 +273,15 @@ def get_gaudi_type_plot( ] cmap = LinearSegmentedColormap.from_list("Custom cmap", [colors[nu] for nu in colors], len(colors)) - alpha = [1.0 if quad <= signipval else 0.3 for quad in lmi_geometry.LMI_pvalue] + alpha = [1.0 if pval <= signipval else 0.3 for pval in lmi_geometry.LMI_pvalue] gtp, ax = plt.subplots(figsize=(12, 10), subplot_kw={"aspect": "equal"}) lmi_geometry.plot(column="moran_quadrant", cmap=cmap, alpha=alpha, linewidth=2, edgecolor="white", ax=ax) plt.axis("off") sns.scatterplot( - x=[lmi_geometry.X[poi]], - y=[lmi_geometry.Y[poi]], + x=[lmi_geometry.X[mlobject.poi]], + y=[lmi_geometry.Y[mlobject.poi]], s=50, ec="none", fc="lime", @@ -334,32 +335,128 @@ def get_lmi_scatterplot( x = zscore(lmi_geometry.signal) y = zscore(y_lag) - _, _, r_value_scat, p_value_scat, _ = linregress(x, y) - scatter, ax = plt.subplots(figsize=(5, 5)) # , subplot_kw={'aspect':'equal'}) + try: - alpha_sp = [1.0 if val < signipval else 0.1 for val in lmi_geometry.LMI_pvalue] - colors_sp = [colors_lmi[val] for val in lmi_geometry.moran_quadrant] + _, _, r_value_scat, p_value_scat, _ = linregress(x, y) + scatter, ax = plt.subplots(figsize=(5, 5)) # , subplot_kw={'aspect':'equal'}) - plt.scatter(x=x, y=y, s=100, ec="white", fc=colors_sp, alpha=alpha_sp) + alpha_sp = [1.0 if val < signipval else 0.1 for val in lmi_geometry.LMI_pvalue] + colors_sp = [colors_lmi[val] for val in lmi_geometry.moran_quadrant] - sns.scatterplot( - x=[x[mlobject.poi]], y=[y[mlobject.poi]], s=150, ec="lime", fc="none", zorder=len(lmi_geometry) - ) - sns.regplot(x=x, y=y, scatter=False, color="k") - sns.despine(top=True, right=True, left=False, bottom=False, offset=10, trim=False) + plt.scatter(x=x, y=y, s=100, ec="white", fc=colors_sp, alpha=alpha_sp) + + sns.scatterplot( + x=[x[mlobject.poi]], y=[y[mlobject.poi]], s=150, ec="lime", fc="none", zorder=len(lmi_geometry) + ) + sns.regplot(x=x, y=y, scatter=False, color="k") + sns.despine(top=True, right=True, left=False, bottom=False, offset=10, trim=False) - plt.title(f"Moran Local Scatterplot\nr: {r_value_scat:4.2f} p-value: {p_value_scat:.1e}") - plt.axvline(x=0, color="k", linestyle=":") - plt.axhline(y=0, color="k", linestyle=":") + plt.title(f"Moran Local Scatterplot\nr: {r_value_scat:4.2f} p-value: {p_value_scat:.1e}") + plt.axvline(x=0, color="k", linestyle=":") + plt.axhline(y=0, color="k", linestyle=":") - ax.set_xlabel("Z-score(Signal)") - ax.set_ylabel("Z-score(Signal Spatial Lag)") + ax.set_xlabel("Z-score(Signal)") + ax.set_ylabel("Z-score(Signal Spatial Lag)") - r_value_scat = float(r_value_scat) - p_value_scat = float(r_value_scat) + r_value_scat = float(r_value_scat) + p_value_scat = float(r_value_scat) + + except ValueError: + + print("\t\tCannot compute lineal regression as all values are identical.") + + return None, None, None return scatter, r_value_scat, p_value_scat +# def signal_bed( +# mlobject: mlo.MetalociObject, +# lmi_geometry: pd.DataFrame, +# neighbourhood: float, +# quadrants: list = [1, 3], +# signipval: float = 0.05, +# ): +# """ +# signal_bed _summary_ + +# Parameters +# ---------- +# mlobject : mlo.MetalociObject +# _description_ +# lmi_geometry : pd.DataFrame +# _description_ +# neighbourhood : float +# _description_ +# quadrants : list, optional +# _description_, by default [1, 3] +# signipval : float, optional +# 
_description_, by default 0.05 + +# Returns +# ------- +# _type_ +# _description_ +# """ + +# poi_distance = mlobject.kk_distances[mlobject.poi] + +# # Select the polygons that are in the quadrant of interest and are significative. +# ml_indexes = lmi_geometry[ +# (lmi_geometry.moran_quadrant.isin(quadrants)) & (lmi_geometry.LMI_pvalue <= signipval) +# ].bin_index.values.tolist() + +# # Make a big polygon from the small poligons that are significant. +# metalocis = lmi_geometry[lmi_geometry.bin_index.isin(ml_indexes)].unary_union + +# if metalocis and metalocis.geom_type == "Polygon": + +# metalocis = MultiPolygon([metalocis]) # Need a multipolygon in order for the code to work. + +# poi_point = Point( +# (lmi_geometry[lmi_geometry.moran_index == mlobject.poi].X, lmi_geometry[lmi_geometry.moran_index == mlobject.poi].Y) +# ) + +# metalocis_bed = [] + +# bed_data = defaultdict(list) + +# try: + +# for metaloci in metalocis.geoms: + +# metaloci = gpd.GeoSeries(metaloci) + +# if poi_point.within(metaloci[0]): + +# for _, row_ml in lmi_geometry.iterrows(): + +# adjacent_point = Point((row_ml.X, row_ml.Y)) + +# if adjacent_point.within(metaloci[0]): + +# metalocis_bed.append(row_ml.bin_index) + +# # Add close particles +# metalocis_bed.sort() +# close_bins = [i for i, distance in enumerate(poi_distance) if distance <= neighbourhood / 2] +# metalocis_bed = np.sort(list(set(close_bins + metalocis_bed))) + +# for point in metalocis_bed: + +# bed_data["chr"].append(lmi_geometry.bin_chr[point]) +# bed_data["start"].append(lmi_geometry.bin_start[point]) +# bed_data["end"].append(lmi_geometry.bin_end[point]) +# bed_data["bin"].append(point) + +# except: + +# pass + +# bed_data = pd.DataFrame(bed_data) + +# return bed_data, metalocis_bed + + def signal_bed( mlobject: mlo.MetalociObject, @@ -367,6 +464,7 @@ def signal_bed( neighbourhood: float, quadrants: list = [1, 3], signipval: float = 0.05, + metaloci_only: bool = False, ): """ signal_bed _summary_ @@ -383,6 +481,8 @@ def signal_bed( _description_, by default [1, 3] signipval : float, optional _description_, by default 0.05 + metaloci_only : bool, optional + _description_, by default False Returns ------- @@ -390,65 +490,56 @@ def signal_bed( _description_ """ - poi_distance = mlobject.kk_distances[mlobject.poi] - - # Select the polygons that are in the quadrant of interest and are significative. - ml_indexes = lmi_geometry[ - (lmi_geometry.moran_quadrant.isin(quadrants)) & (lmi_geometry.LMI_pvalue <= signipval) - ].bin_index.values.tolist() + metalocis_bed = [] - # Make a big polygon from the small poligons that are significant. - metalocis = lmi_geometry[lmi_geometry.bin_index.isin(ml_indexes)].unary_union + bed_data = defaultdict(list) - if metalocis: + if metaloci_only == False: - if metalocis.geom_type == "Polygon": + for _, row_ml in lmi_geometry.iterrows(): - metalocis = MultiPolygon([metalocis]) # Need a multipolygon in order for the code to work. 
+            if row_ml.LMI_pvalue <= signipval and row_ml.moran_quadrant in quadrants:
 
-    poi_point = Point(
-        (lmi_geometry[lmi_geometry.bin_index == mlobject.poi].X, lmi_geometry[lmi_geometry.bin_index == mlobject.poi].Y)
-    )
+                metalocis_bed.append(row_ml.bin_index)
 
-    metalocis_bed = []
+    else:
 
-    beddata = defaultdict(list)
+        poi_row = lmi_geometry[lmi_geometry.bin_index == mlobject.poi].squeeze()
 
-    try:
 
-        for metaloci in metalocis.geoms:
+        if poi_row.LMI_pvalue <= signipval and poi_row.moran_quadrant in quadrants:
 
-            metaloci = gpd.GeoSeries(metaloci)
 
-            if poi_point.within(metaloci[0]):
+            poi_point = Point(lmi_geometry.loc[lmi_geometry["bin_index"] == mlobject.poi, ["X", "Y"]].iloc[0])
 
-                for _, row_ml in lmi_geometry.iterrows():
+            for _, row_ml in lmi_geometry.iterrows():
 
-                    adjacent_point = Point((row_ml.X, row_ml.Y))
+                adjacent_point = Point((row_ml.X, row_ml.Y))
 
-                    if adjacent_point.within(metaloci[0]):
+                if adjacent_point.distance(poi_point) <= neighbourhood:
 
-                        metalocis_bed.append(row_ml.bin_index)
+                    metalocis_bed.append(row_ml.bin_index)
 
-            # Add close particles
-            metalocis_bed.sort()
-            close_bins = [i for i, distance in enumerate(poi_distance) if distance <= neighbourhood / 2]
-            metalocis_bed = np.sort(list(set(close_bins + metalocis_bed)))
+        metalocis_bed.sort()
 
-            for point in metalocis_bed:
 
-                beddata["chr"].append(lmi_geometry.bin_chr[point])
-                beddata["start"].append(lmi_geometry.bin_start[point])
-                beddata["end"].append(lmi_geometry.bin_end[point])
-                beddata["bin"].append(point)
-    except:
+    for point in metalocis_bed:
 
-        pass
+        bed_data["chr"].append(lmi_geometry.bin_chr[point])
+        bed_data["start"].append(lmi_geometry.bin_start[point])
+        bed_data["end"].append(lmi_geometry.bin_end[point])
+        bed_data["bin"].append(point)
+        bed_data["quadrant"].append(lmi_geometry.moran_quadrant[point])
 
-    beddata = pd.DataFrame(beddata)
-    return beddata, metalocis_bed
+    bed_data = pd.DataFrame(bed_data)
+    return bed_data, metalocis_bed
 
 
 def signal_plot(mlobject: mlo.MetalociObject, lmi_geometry: pd.DataFrame, metalocis, bins_sig, coords_sig):
@@ -459,7 +550,7 @@
 
     for p in metalocis:
 
-        plt.axvline(x=p, color="red", linestyle=":", lw=1.5)
+        plt.axvline(x=p, color="red", linestyle=":", lw=1, zorder=0, alpha=0.3)
 
     plt.axvline(x=mlobject.poi, color="lime", linestyle="--", lw=1.5)
 
@@ -473,7 +564,7 @@
 
     sns.despine(top=True, right=True, left=False, bottom=True, offset=None, trim=False)
 
-    return sig_plt
+    return sig_plt, g
 
 
 def place_composite(new_PI, ifile, ifactor, ixloc, iyloc):
diff --git a/metaloci/spatial_stats/lmi.py b/metaloci/spatial_stats/lmi.py
index 75d351f..670b0cd 100644
--- a/metaloci/spatial_stats/lmi.py
+++ b/metaloci/spatial_stats/lmi.py
@@ -1,6 +1,5 @@
 import glob
 import os
-import re
 import warnings
 from collections import defaultdict
 from pathlib import Path
@@ -85,8 +84,8 @@ def construct_voronoi(mlobject: mlo.MetalociObject, buffer: float):
 
         df_geometry["bin_index"].append(x)
         df_geometry["moran_index"].append(i)
-        df_geometry["X"].append(mlobject.kk_coords[i][0])
-        df_geometry["Y"].append(mlobject.kk_coords[i][1])
+        df_geometry["X"].append(mlobject.kk_coords[x][0])
+        df_geometry["Y"].append(mlobject.kk_coords[x][1])
df_geometry["geometry"].append(geometry_data.loc[i, "geometry"]) df_geometry = pd.DataFrame(df_geometry) @@ -197,20 +196,34 @@ def load_region_signals(mlobject: mlo.MetalociObject, signal_data: dict, signal_ """ # Read signal file. Will only process the signals present in this list. - with open(signal_file) as signals_handler: - signal_types = [line.rstrip() for line in signals_handler] + if os.path.isfile(signal_file): + + with open(signal_file) as signals_handler: + + signal_types = [line.rstrip() for line in signals_handler] + + else: + + signal_types = [signal_file] + + # region_signal = signal_data[mlobject.chrom][ + # (signal_data[mlobject.chrom]["start"] >= int(mlobject.start / mlobject.resolution) * mlobject.resolution) + # & (signal_data[mlobject.chrom]["end"] <= int(mlobject.end / mlobject.resolution) * mlobject.resolution) + # ] region_signal = signal_data[mlobject.chrom][ - (signal_data[mlobject.chrom]["start"] >= int(mlobject.start / mlobject.resolution) * mlobject.resolution) - & (signal_data[mlobject.chrom]["end"] <= int(mlobject.end / mlobject.resolution) * mlobject.resolution) - ] + (signal_data[mlobject.chrom]["start"] >= int(np.floor(mlobject.start / mlobject.resolution)) * mlobject.resolution) + & (signal_data[mlobject.chrom]["end"] <= int(np.ceil(mlobject.end / mlobject.resolution)) * mlobject.resolution) + ] # jfm: fix if len(region_signal) != len(mlobject.kk_coords): tmp = len(mlobject.kk_coords) - len(region_signal) tmp = np.empty((tmp, len(region_signal.columns))) - tmp[:] = 0 + # tmp[:] = 0 + tmp[:] = np.nan # jfm: leave NAs in for now. (in misc.signal_normalization() those NaNS will be substituted for + # the median of the signal of the region) region_signal = pd.concat([region_signal, pd.DataFrame(tmp, columns=list(region_signal))], ignore_index=True) @@ -218,7 +231,13 @@ def load_region_signals(mlobject: mlo.MetalociObject, signal_data: dict, signal_ for signal_type in signal_types: - signals_dict[signal_type] = misc.signal_normalization(region_signal[signal_type]) + try: + + signals_dict[signal_type] = misc.signal_normalization(region_signal[signal_type]) + + except KeyError: + + return None, signal_type return signals_dict, signal_types @@ -284,7 +303,7 @@ def compute_lmi( # Calculate Local Moran's I moran_local_object = Moran_Local( - y, weights, permutations=n_permutations + y, weights, permutations=n_permutations, n_jobs=1 ) # geoda_quadsbool (default=False) If False use PySAL Scheme: HH=1, LH=2, LL=3, HL=4 lags = lag_spatial(moran_local_object.w, moran_local_object.z) @@ -317,7 +336,7 @@ def compute_lmi( df_lmi["LMI_score"].append(round(moran_local_object.Is[row.moran_index], 9)) df_lmi["LMI_pvalue"].append(round(moran_local_object.p_sim[row.moran_index], 9)) df_lmi["LMI_inv_pval"].append(round((1 - moran_local_object.p_sim[row.moran_index]), 9)) - df_lmi["ZSig"].append(y[row.moran_index]) ## MAYBE IT'S MORAN_INDEX INSTED OF BIN_INDEX. 
CHECK + df_lmi["ZSig"].append(y[row.moran_index]) df_lmi["ZLag"].append(lags[row.moran_index]) df_lmi = pd.DataFrame(df_lmi) diff --git a/metaloci/tools/__init__.py b/metaloci/tools/__init__.py index c7df41c..0980070 100644 --- a/metaloci/tools/__init__.py +++ b/metaloci/tools/__init__.py @@ -1 +1 @@ -import os, sys; sys.path.append(os.path.dirname(os.path.realpath(__file__))) \ No newline at end of file +import os, sys; sys.path.append(os.path.dirname(os.path.realpath(__file__))) diff --git a/metaloci/tools/figure.py b/metaloci/tools/figure.py index 6a1ef85..a8a2fd1 100644 --- a/metaloci/tools/figure.py +++ b/metaloci/tools/figure.py @@ -1,379 +1,400 @@ """ This script generates METALoci plots. """ -import argparse import os import pathlib import pickle import re -import sys +from argparse import HelpFormatter from datetime import timedelta from time import time import geopandas as gpd import matplotlib.pyplot as plt import pandas as pd -from matplotlib.lines import Line2D from PIL import Image from metaloci.plot import plot -description = "This script outputs the different plots to show METALoci.\n" -description += "It creates the following plots:\n" -description += " - HiC matrix\n" -description += " - Signal plot\n" -description += " - Kamada-Kawai layout\n" -description += " - Local Moran's I scatterplot\n" -description += " - Gaudí plot for signal\n" -description += " - Gaudí plot for LMI quadrant\n" - - -parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter, description=description, add_help=False) - -# The script will create a folder called 'datasets' and a subfolder for the signal, -# with subfolders for every chromosome and region of choice. -input_arg = parser.add_argument_group(title="Input arguments") - -input_arg.add_argument( - "-w", "--work-dir", dest="work_dir", required=True, metavar="PATH", type=str, help="Path to working directory." 
-) - -signal_arg = parser.add_argument_group(title="Signal arguments", description="Choose one of the following options:") - -signal_arg.add_argument( - "-s", - "--types", - dest="signals", - metavar="STR", - type=str, - nargs="*", - action="extend", - help="Space-separated list of signals to plot.", -) - -signal_arg.add_argument( - "-S", - "--types-file", - dest="signals", - metavar="STR", - type=str, - help="Path to the file with the list of signals to plot, one per line.", -) - -region_input = parser.add_argument_group(title="Region arguments", description="Choose one of the following options:") - -region_input.add_argument( - "-g", - "--region", - dest="region_file", - metavar="FILE", - type=str, - nargs="*", - action="extend", - help="Region to apply LMI in the format chrN:start-end_midpoint " "gene_symbol gene_id.\n" "Space separated ", -) - -region_input.add_argument( - "-G", - "--region-file", - dest="region_file", - metavar="FILE", - type=str, - help="Path to the file with the regions of interest.", -) - -optional_arg = parser.add_argument_group(title="Optional arguments") - -optional_arg.add_argument("-h", "--help", action="help", help="Show this help message and exit.") - -optional_arg.add_argument( - "-e", - "--delete", - dest="rm_types", - required=False, - metavar="STR", - type=str, - nargs="*", - default=["png"], - help="Delete temporal image files, determined by extension " "(default: %(default)s)", -) - -optional_arg.add_argument( - "-a", - "--aggregated", - dest="agg", - required=False, - action="store_true", - help="Use the file with aggregated signals (*_LMI_byType.pkl)", -) - -optional_arg.add_argument( - "-q", - "--quarts", - dest="quart", - default=[1, 3], - metavar="INT", - nargs="*", - help="Space-separated list with the LMI quadrants to highlight " - "(default: %(default)s)\n" - "1: High-high (signal in bin is high, signal on neighbours is " - "high)\n" - "2: Low-High (signal in bin is low, signal on neighbours is high)\n" - "3: Low-Low (signal in bin is low, signal on neighbours is low)\n" - "4: High-Low (signal in bin is high, signal on neighbours is low)", -) - -optional_arg.add_argument( - "-v", - "--pval", - dest="signipval", - default=0.05, - metavar="FLOAT", - type=float, - help="P-value significance threshold (default: %(default)d).", -) - -optional_arg.add_argument("-u", "--debug", dest="debug", action="store_true", help=argparse.SUPPRESS) - -args = parser.parse_args(None if sys.argv[1:] else ["-h"]) +DESCRIPTION = "Outputs the different plots to show METALoci." +DESCRIPTION += " It creates the following plots:\n" +DESCRIPTION += "HiC matrix" +DESCRIPTION += ", Signal plot" +DESCRIPTION += ", Kamada-Kawai layout" +DESCRIPTION += ", Local Moran's I scatterplot" +DESCRIPTION += ", Gaudí plot for signal" +DESCRIPTION += ", Gaudí plot for LMI quadrant" +DESCRIPTION += " and a composite image with all the above." + + +def populate_args(parser): + + parser.formatter_class=lambda prog: HelpFormatter(prog, width=120, + max_help_position=60) + + input_arg = parser.add_argument_group(title="Input arguments") + optional_arg = parser.add_argument_group(title="Optional arguments") + + input_arg.add_argument( + "-w", "--work-dir", dest="work_dir", required=True, metavar="PATH", type=str, help="Path to working directory." 
+ ) + + input_arg.add_argument( + "-s", + "--types", + dest="signals", + metavar="STR", + type=str, + nargs="*", + action="extend", + help="Space-separated list of signals to plot or path to the file with the list of signals to plot, one per line.", + ) + + input_arg.add_argument( + "-g", + "--region", + dest="regions", + metavar="PATH", + type=str, + required=True, + help="Region to apply LMI in format chrN:start-end_midpoint or file with the regions of interest. If a file is provided, " + "it must contain as a header 'coords', 'symbol' and 'id', and one region per line, tab separated.", + + ) + + optional_arg.add_argument( + "-e", + "--delete", + dest="rm_types", + required=False, + metavar="STR", + type=str, + nargs="*", + default=["png"], + help="Delete temporal image files, determined by extension " "(default: %(default)s)", + ) + + optional_arg.add_argument( + "-m", + "--metalocis", + dest="metalocis", + action="store_true", + help="Flag to select highlightning of the signal plots. If True, only the neighbouring bins from the point of interest will be " + "highlighted (independently of the quadrant and significance of those bins, but only if the point of interest is significant). " + "If False, all significant regions that correspond to the quadrant selected with -q will be highlighted (default: False).", + ) + + optional_arg.add_argument( + "-a", + "--aggregated", + dest="agg", + required=False, + action="store_true", + help="Use the file with aggregated signals (*_LMI_byType.pkl)", + ) + + optional_arg.add_argument( + "-q", + "--quarts", + dest="quart", + default=[1, 3], + metavar="INT", + nargs="*", + help="Space-separated list with the LMI quadrants to highlight " + "(default: %(default)s) " + "1: High-high (signal in bin is high, signal on neighbours is " + "high) " + "2: Low-High (signal in bin is low, signal on neighbours is high) " + "3: Low-Low (signal in bin is low, signal on neighbours is low) " + "4: High-Low (signal in bin is high, signal on neighbours is low).", + ) + + optional_arg.add_argument( + "-v", + "--pval", + dest="signipval", + default=0.05, + metavar="FLOAT", + type=float, + help="P-value significance threshold (default: %(default)s).", + ) + + optional_arg.add_argument("-h", "--help", action="help", help="Show this help message and exit.") + + +def run(opts): + + work_dir = opts.work_dir + regions = opts.regions + signals = opts.signals + metaloci_bed = opts.metalocis + quadrants = opts.quart + signipval = opts.signipval + rmtypes = opts.rm_types + agg = opts.agg + + quadrants = [int(x) for x in quadrants] + + if not work_dir.endswith("/"): + + work_dir += "/" + + INFLUENCE = 1.5 + BFACT = 2 + + colors = {1: "firebrick", 2: "lightskyblue", 3: "steelblue", 4: "orange"} + + start_timer = time() + + if os.path.isfile(regions): + + df_regions = pd.read_table(regions) + + else: + + df_regions = pd.DataFrame({"coords": [regions], "symbol": ["symbol"], "id": ["id"]}) + + # Parse list of signals to plot. 
If it is a file, strip it, if there are + if os.path.isfile(signals[0]) and os.access(signals[0], os.R_OK): -work_dir = args.work_dir -regions = args.region_file -signals = args.signals -quadrants = args.quart -signipval = args.signipval -rmtypes = args.rm_types -debug = args.debug -agg = args.agg + with open(signals[0], "r", encoding="utf-8") as handler: -quadrants = [int(x) for x in quadrants] + signals = [line.strip() for line in handler] -if not work_dir.endswith("/"): - work_dir += "/" + plot_opt = {"bbox_inches": "tight", "dpi": 300, "transparent": True} -if debug: + for i, region_iter in df_regions.iterrows(): - table = [ - ["work_dir", work_dir], - ["rfile", regions], - ["signals", signals], - ["quart", quadrants], - ["rmtypes", rmtypes], - ["debug", debug], - ] - print(table) - sys.exit() + region = region_iter.coords -INFLUENCE = 1.5 -BFACT = 2 + print(f"\n------> Working on region {region} [{i + 1}/{len(df_regions)}]") -colors = {1: "firebrick", 2: "lightskyblue", 3: "steelblue", 4: "orange"} + try: -legend_elements = [ - Line2D([0], [0], marker="o", color="w", markerfacecolor=colors[1], label="HH", markersize=20), - Line2D([0], [0], marker="o", color="w", markerfacecolor=colors[2], label="LH", markersize=20), - Line2D([0], [0], marker="o", color="w", markerfacecolor=colors[3], label="LL", markersize=20), - Line2D([0], [0], marker="o", color="w", markerfacecolor=colors[4], label="HL", markersize=20), -] + with open( + f"{work_dir}{region.split(':', 1)[0]}/{re.sub(':|-', '_', region)}.mlo", + "rb", + ) as mlobject_handler: -start_timer = time() + mlobject = pickle.load(mlobject_handler) -if os.path.isfile(regions): + except FileNotFoundError: - df_regions = pd.read_table(regions) + print("\n\t.mlo file not found for this region. \n\tSkipping to the next one.") -else: + continue - df_regions = pd.DataFrame({"coords": [regions], "symbol": ["symbol"], "id": ["id"]}) + if mlobject.lmi_info == None: -# hacer que un archivo de señales sera una lista. cambia esta lógica que es super enrevesada. -if os.path.isfile(signals[0]) and os.access(signals[0], os.R_OK): + print("\n\tLMI not calculated for this region. \n\tSkipping to the next one...") + continue - with open(signals[0], "r", encoding="utf-8") as handler: + buffer = mlobject.kk_distances.diagonal(1).mean() * INFLUENCE - signals = [line.strip() for line in handler] + bins = [] + coords_b = [] -plot_opt = {"bbox_inches": "tight", "dpi": 300, "transparent": True} -data_moran = {"Coords": [], "Symbol": [], "Gene_id": [], "Signal": [], "R_value": [], "P_value": []} + for i in range(1, 4): -for i, region_iter in df_regions.iterrows(): + if i == 1: - region = region_iter.coords + bins.append(int(0)) - print(f"\n------> Working on region {region} [{i + 1}/{len(df_regions)}]") + elif i == 4: - try: + bins.append(len(mlobject.lmi_geometry) - 1) - with open( - f"{work_dir}{region.split(':', 1)[0]}/{re.sub(':|-', '_', region)}.mlo", - "rb", - ) as mlobject_handler: + else: - mlobject = pickle.load(mlobject_handler) + bins.append(int(((i - 1) / 2) * len(mlobject.lmi_geometry)) - 1) - except FileNotFoundError: + coords_b.append(f"{mlobject.start + bins[i - 1] * mlobject.resolution:,}") - print(".mlo file not found for this region. 
+        for signal in signals:
-        continue
-    buffer = mlobject.kk_distances.diagonal(1).mean() * INFLUENCE
-    bbfact = buffer * BFACT
+            if len(signal) == 0:
-    bins = []
-    coords_b = []
+                continue
+            print(f"\n\tPlotting signal: {signal}")
-    for i in range(1, 4):
+            plot_filename = os.path.join(work_dir, mlobject.chrom, "plots", signal, mlobject.region)
+            pathlib.Path(plot_filename).mkdir(parents=True, exist_ok=True)
-        if i == 1:
+            plot_filename = os.path.join(
+                plot_filename,
+                f"{mlobject.chrom}_{mlobject.start}_{mlobject.end}_{mlobject.poi}_{mlobject.resolution}_{signal}",
+            )
-            bins.append(int(0))
+            merged_lmi_geometry = pd.merge(
+                mlobject.lmi_info[signal],
+                mlobject.lmi_geometry,
+                on=["bin_index", "moran_index"],
+                how="inner",
+            )
-        elif i == 4:
+            merged_lmi_geometry = gpd.GeoDataFrame(merged_lmi_geometry, geometry=merged_lmi_geometry.geometry)
-            bins.append(len(mlobject.lmi_geometry) - 1)
+            # TODO The creation of the merged dataframe should be a function in misc. Put there the conditions of aggregation.
-        else:
+            print("\t\tHiC plot", end="\r")
+            hic_plt = plot.get_hic_plot(mlobject)
+            hic_plt.savefig(f"{plot_filename}_hic.pdf", **plot_opt)
+            hic_plt.savefig(f"{plot_filename}_hic.png", **plot_opt)
+            plt.close()
+            print("\t\tHiC plot -> done.")
-            bins.append(int(((i - 1) / 2) * len(mlobject.lmi_geometry)) - 1)
+            print("\t\tKamada-Kawai plot", end="\r")
+            kk_plt = plot.get_kk_plot(mlobject)
+            kk_plt.savefig(f"{plot_filename}_kk.pdf", **plot_opt)
+            kk_plt.savefig(f"{plot_filename}_kk.png", **plot_opt)
+            plt.close()
+            print("\t\tKamada-Kawai plot -> done.")
-        coords_b.append(f"{mlobject.start + bins[i - 1] * mlobject.resolution:,}")
+            print("\t\tGaudi Signal plot", end="\r")
+            gs_plt = plot.get_gaudi_signal_plot(mlobject, merged_lmi_geometry)
+            gs_plt.savefig(f"{plot_filename}_gsp.pdf", **plot_opt)
+            gs_plt.savefig(f"{plot_filename}_gsp.png", **plot_opt)
+            plt.close()
+            print("\t\tGaudi Signal plot -> done.")
-    for signal in signals:
+            print("\t\tGaudi Type plot", end="\r")
+            gt_plt = plot.get_gaudi_type_plot(mlobject, merged_lmi_geometry, signipval, colors)
+            gt_plt.savefig(f"{plot_filename}_gtp.pdf", **plot_opt)
+            gt_plt.savefig(f"{plot_filename}_gtp.png", **plot_opt)
+            plt.close()
+            print("\t\tGaudi Type plot -> done.")
-        if len(signal) == 0:
+            print("\t\tSignal plot", end="\r")
+            bed_data, selmetaloci = plot.signal_bed(
+                mlobject,
+                merged_lmi_geometry,
+                buffer * BFACT,
+                quadrants,
+                signipval,
+                metaloci_bed,
+            )
-            continue
+
+            sig_plt, ax = plot.signal_plot(mlobject, merged_lmi_geometry, selmetaloci, bins, coords_b)
+            sig_plt.savefig(f"{plot_filename}_signal.pdf", **plot_opt)
+            sig_plt.savefig(f"{plot_filename}_signal.png", **plot_opt)
+            plt.close()
+            print("\t\tSignal plot -> done.")
+
+            print("\t\tLMI Scatter plot", end="\r")
+            lmi_plt, r_value, p_value = plot.get_lmi_scatterplot(
+                mlobject, merged_lmi_geometry, buffer * BFACT, signipval, colors
+            )
+
+            if lmi_plt is not None:
+
+                lmi_plt.savefig(f"{plot_filename}_lmi.pdf", **plot_opt)
+                lmi_plt.savefig(f"{plot_filename}_lmi.png", **plot_opt)
+                plt.close()
+
+                print("\t\tLMI Scatter plot -> done.")
+
+            else:
+
+                if signal == signals[-1]:
+
+                    print("\t\tSkipping to next region...")
+
+                else:
+
+                    print("\t\tSkipping to next signal...")
+
+                continue
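The moran_quadrant and LMI_pvalue columns consumed by get_gaudi_type_plot, signal_bed and get_lmi_scatterplot come from a Local Moran's I computation; the legacy 02_LMI_anysignal.py script below runs it with esda. A compact sketch, reusing coords and w from the previous sketch (toy signal values; in this tool the results are precomputed and read from mlobject.lmi_info):

import numpy as np
from esda.moran import Moran_Local

rng = np.random.default_rng(0)
y = rng.random(len(coords))  # toy signal, one value per bin

# Same call as in the legacy script: permutation-based Local Moran's I.
moran_loc = Moran_Local(y, w, permutations=10000)

# moran_loc.q holds the quadrant per bin (1=HH, 2=LH, 3=LL, 4=HL) and
# moran_loc.p_sim its pseudo p-value; -q and -v filter exactly these.
selected = np.isin(moran_loc.q, [1, 3]) & (moran_loc.p_sim <= 0.05)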
+            print(f"\t\tFinal composite figure for region '{region}' and signal '{signal}'", end="\r")
+            img1 = Image.open(f"{plot_filename}_lmi.png")
+            img2 = Image.open(f"{plot_filename}_gsp.png")
+            img3 = Image.open(f"{plot_filename}_gtp.png")
+
+            # Canvas width estimated from the scaled sizes of the three bottom-row panels.
+            maxx = int((img1.size[1] * 0.4 + img2.size[1] * 0.25 + img3.size[1] * 0.25) * 1.3)
+
+            # Left offset (px) for the signal panel, indexed by the character width of the longest y-axis tick label.
+            yticks_signal = ax.get_yticks()[1:-1]
+            max_chr_yax = max(len(str(max(yticks_signal))), len(str(min(yticks_signal))))
+            signal_left = {3: 41, 4: 30, 5: 20, 6: 9}
+
+            if max_chr_yax not in signal_left:
+
+                signal_left[max_chr_yax] = 0
+
+            composite_image = Image.new(mode="RGBA", size=(maxx, 1550))
+
+            composite_image = plot.place_composite(composite_image, f"{plot_filename}_hic.png", 0.5, 100, 50)  # HiC image
+            composite_image = plot.place_composite(composite_image, f"{plot_filename}_signal.png", 0.4, signal_left[max_chr_yax], 660)  # Signal image
+            composite_image = plot.place_composite(composite_image, f"{plot_filename}_kk.png", 0.3, 1300, 50)  # KK image
+            composite_image = plot.place_composite(composite_image, f"{plot_filename}_lmi.png", 0.4, 75, 900)  # LMI scatter image
+            composite_image = plot.place_composite(composite_image, f"{plot_filename}_gsp.png", 0.25, 900, 900)  # Gaudi signal image
+            composite_image = plot.place_composite(composite_image, f"{plot_filename}_gtp.png", 0.25, 1600, 900)  # Gaudi type image
+
+            composite_image.save(f"{plot_filename}.png")
+
+            plt.figure(figsize=(15, 15))
+            plt.imshow(composite_image)
+            plt.axis("off")
+            plt.savefig(f"{plot_filename}.pdf", **plot_opt)
+            plt.close()
+            print(f"\t\tFinal composite figure for region '{region}' and signal '{signal}' -> done.")
+
+            if len(bed_data) > 0:
+
+                metaloci_bed_path = f"{work_dir}{mlobject.chrom}/metalocis_log/{signal}"
+                pathlib.Path(metaloci_bed_path).mkdir(parents=True, exist_ok=True)
+
+                bed_data.to_csv(f"{metaloci_bed_path}/{mlobject.region}_{mlobject.poi}_{signal}_metalocis.bed", sep="\t", index=False)
+                print(f"\t\tBed file with metalocis location saved to: "
+                      f"{metaloci_bed_path}/{mlobject.region}_{mlobject.poi}_{signal}_metalocis.bed")
+
+        # Log Moran info for every signal of this region.
+        for signal_key, df in mlobject.lmi_info.items():
+
+            sq = [0] * 4
+            q = [0] * 4
+
+            for _, row in df.iterrows():
+
+                quadrant = row["moran_quadrant"]
+                lmi_pvalue = row["LMI_pvalue"]
+
+                q[quadrant - 1] += 1
+
+                if lmi_pvalue <= signipval:
+
+                    sq[quadrant - 1] += 1
+
+            q_string = "\t".join([f"{sq[i]}\t{q[i]}" for i in range(4)])
+
+            with open(f"{work_dir}moran_info.txt", "a+") as handler:
+
+                log = f"{region}\t{region_iter.symbol}\t{region_iter.id}\t{signal_key}\t{r_value}\t{p_value}\t{q_string}\n"
+
+                handler.seek(0)
+
+                if os.stat(f"{work_dir}moran_info.txt").st_size == 0:
+
+                    handler.write("region\tsymbol\tgene_id\tsignal\tr_value\tp_value\tsq1\tq1\tsq2\tq2\tsq3\tq3\tsq4\tq4\n")
+
+                if not any(log in line for line in handler):
+
+                    handler.write(log)
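moran_info.txt accumulates one row per region/signal pair, so it can be reloaded as a plain tab-separated table. A minimal sketch of reading it back (pandas assumed, as elsewhere in this tool; the hh_enriched filter is purely illustrative):

import pandas as pd

# Columns, as written above: region, symbol, gene_id, signal, r_value,
# p_value, then sqN/qN pairs = significant vs. total bins in Moran
# quadrant N (1 = HH, 2 = LH, 3 = LL, 4 = HL).
moran_info = pd.read_table("moran_info.txt")

# Illustrative filter: significant scatterplot fit and at least one
# significant High-High bin.
hh_enriched = moran_info[(moran_info.p_value <= 0.05) & (moran_info.sq1 > 0)]

+        # Remove the images used for the composite figure.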
+ if rmtypes: + + for ext in rmtypes: + + os.remove(f"{plot_filename}_hic.{ext}") + os.remove(f"{plot_filename}_signal.{ext}") + os.remove(f"{plot_filename}_kk.{ext}") + os.remove(f"{plot_filename}_lmi.{ext}") + os.remove(f"{plot_filename}_gsp.{ext}") + os.remove(f"{plot_filename}_gtp.{ext}") - print(f"\n\tPlotting signal: {signal}") - - plot_filename = os.path.join(work_dir, mlobject.chrom, "plots", signal, mlobject.region) - pathlib.Path(plot_filename).mkdir(parents=True, exist_ok=True) - - plot_filename = os.path.join( - plot_filename, - f"{mlobject.chrom}_{mlobject.start}_{mlobject.end}_{mlobject.poi}_{mlobject.resolution}_{signal}", - ) - - merged_lmi_geometry = pd.merge( - mlobject.lmi_info[signal], - mlobject.lmi_geometry, - on=["bin_index", "moran_index"], - how="inner", - ) - - merged_lmi_geometry = gpd.GeoDataFrame(merged_lmi_geometry, geometry=merged_lmi_geometry.geometry) - - # The creation of the merged dataframe should be a function in misc. Put there the coditions of aggregation. - - print("\t\tHiC plot", end="\r") - hic_plt = plot.get_hic_plot(mlobject) - hic_plt.savefig(f"{plot_filename}_hic.pdf", **plot_opt) - hic_plt.savefig(f"{plot_filename}_hic.png", **plot_opt) - plt.close() - print("\t\tHiC plot -> done.") - - print("\t\tKamada-Kawai plot", end="\r") - kk_plt = plot.get_kk_plot(mlobject) - kk_plt.savefig(f"{plot_filename}_kk.pdf", **plot_opt) - kk_plt.savefig(f"{plot_filename}_kk.png", **plot_opt) - plt.close() - print("\t\tKamada-Kawai plot -> done.") # - - print("\t\tGaudi Signal plot", end="\r") - gs_plt = plot.get_gaudi_signal_plot(mlobject, merged_lmi_geometry) - gs_plt.savefig(f"{plot_filename}_gsp.pdf", **plot_opt) - gs_plt.savefig(f"{plot_filename}_gsp.png", **plot_opt) - plt.close() - print("\t\tGaudi Signal plot -> done.") - - print("\t\tGaudi Type plot", end="\r") - gt_plt = plot.get_gaudi_type_plot(mlobject, merged_lmi_geometry, signipval, colors) - gt_plt.savefig(f"{plot_filename}_gtp.pdf", **plot_opt) - gt_plt.savefig(f"{plot_filename}_gtp.png", **plot_opt) - plt.close() - print("\t\tGaudi Type plot -> done.") - - print("\t\tLMI Scatter plot", end="\r") - lmi_plt, r_value, p_value = plot.get_lmi_scatterplot( - mlobject, merged_lmi_geometry, buffer * BFACT, signipval, colors - ) - lmi_plt.savefig(f"{plot_filename}_lmi.pdf", **plot_opt) - lmi_plt.savefig(f"{plot_filename}_lmi.png", **plot_opt) - plt.close() - print("\t\tLMI Scatter plot -> done.") - - print("\t\tSignal plot", end="\r") - bed_data, selmetaloci = plot.signal_bed( - mlobject, - merged_lmi_geometry, - buffer * BFACT, - quadrants, - signipval, - ) - - selmetaloci = [] - - sig_plt = plot.signal_plot(mlobject, merged_lmi_geometry, selmetaloci, bins, coords_b) - sig_plt.savefig(f"{plot_filename}_signal.pdf", **plot_opt) - sig_plt.savefig(f"{plot_filename}_signal.png", **plot_opt) - plt.close() - print("\t\tSignal plot -> done.") - - print(f"\t\tFinal composite figure for region of interest: {region} and signal: {signal}", end="\r") - img1 = Image.open(f"{plot_filename}_lmi.png") - img2 = Image.open(f"{plot_filename}_gsp.png") - img3 = Image.open(f"{plot_filename}_gtp.png") - - maxx = int((img1.size[1] * 0.4 + img2.size[1] * 0.25 + img3.size[1] * 0.25) * 1.3) - - composite_image = Image.new(mode="RGBA", size=(maxx, 1550)) - - # HiC image - composite_image = plot.place_composite(composite_image, f"{plot_filename}_hic.png", 0.5, 100, 50) - # Singal image - composite_image = plot.place_composite(composite_image, f"{plot_filename}_signal.png", 0.4, 42, 660) - # KK image - composite_image = 
plot.place_composite(composite_image, f"{plot_filename}_kk.png", 0.3, 1300, 50) - # MLI scatter image - composite_image = plot.place_composite(composite_image, f"{plot_filename}_lmi.png", 0.4, 75, 900) - # Gaudi signal image - composite_image = plot.place_composite(composite_image, f"{plot_filename}_gsp.png", 0.25, 900, 900) - # Gaudi signi image - composite_image = plot.place_composite(composite_image, f"{plot_filename}_gtp.png", 0.25, 1600, 900) - - composite_image.save(f"{plot_filename}.png") - - fig = plt.figure(figsize=(15, 15)) - plt.imshow(composite_image) - plt.axis("off") - plt.savefig(f"{plot_filename}.pdf", **plot_opt) - plt.close() - print(f"\t\tFinal composite figure for region of interest: {region} and signal: {signal} -> done.") - - data_moran["Coords"].append(region) - data_moran["Symbol"].append(region_iter.name) - data_moran["Gene_id"].append(region_iter.id) - data_moran["Signal"].append(signal) - data_moran["R_value"].append(r_value) - data_moran["P_value"].append(p_value) - - if rmtypes: - - for ext in rmtypes: - - os.remove(f"{plot_filename}_hic.{ext}") - os.remove(f"{plot_filename}_signal.{ext}") - os.remove(f"{plot_filename}_kk.{ext}") - os.remove(f"{plot_filename}_lmi.{ext}") - os.remove(f"{plot_filename}_gsp.{ext}") - os.remove(f"{plot_filename}_gtp.{ext}") - -data_moran = pd.DataFrame(data_moran) -data_moran.to_csv(f"{os.path.join(work_dir, 'moran_info.txt')}", sep="\t", index=False, mode="a") - -print(f"\nInformation saved to {os.path.join(work_dir, 'moran_info.txt')}") - -print(f"\nTotal time spent: {timedelta(seconds=round(time() - start_timer))}") - -print("\nall done.") + print(f"\nInformation saved to: '{os.path.join(work_dir, 'moran_info.txt')}'") + print(f"\nTotal time spent: {timedelta(seconds=round(time() - start_timer))}") + print("\nall done.") diff --git a/metaloci/tools/from_Marc/00_parse_coords.py b/metaloci/tools/from_Marc/00_parse_coords.py deleted file mode 100755 index ce1c3b9..0000000 --- a/metaloci/tools/from_Marc/00_parse_coords.py +++ /dev/null @@ -1,93 +0,0 @@ -import sys -import warnings -warnings.filterwarnings('ignore') # Warnings filtered -import argparse - -# Instantiate the parser -parser = argparse.ArgumentParser(description='Generate a list of regions from a list of genes.') - -# Required arguments -parser.add_argument('-c','--cfile', required=True, dest='cfile', - help='Path to the chromosome size file (two column).') -parser.add_argument('-i','--ifile', required=True, dest='ifile', - help='Path to the annotation file (three column).') -parser.add_argument('-n','--numpf', required=True, dest='numpf', type=int, - help='Split total in n lines per file.') -parser.add_argument('-r','--resolution', required=True, dest='reso', type=int, - help='Resolution in bp to work at. ') -parser.add_argument('-e','--extension', required=True, dest='exte', type=int, - help='Extension from TSS up- and down-stream (in bp). 
') - -# Parse arguments -args = parser.parse_args() - -print("Input argument values:") -for k in args.__dict__: - if args.__dict__[k] is not None: - print("\t{} --> {}".format(k,args.__dict__[k])) -#sys.exit() - -# Variables -cfile = args.cfile -ifile = args.ifile -numpf = args.numpf -reso = args.reso -exte = args.exte - -# LIBRARIES -import os -import re -import pandas as pd -from tqdm import tqdm - -# Read chrom sizes -csizes = pd.read_csv(cfile, sep='\t', header=None) -csizes.columns = ['chrom','size'] -print('Read size for {} chromosomes'.format(len(csizes))) -print(csizes) - -# Read coords -coords = pd.read_csv(ifile, sep='\t') -print('Read a total of {} entries'.format(len(coords))) -print(coords) - -# Parse data -data = pd.DataFrame(columns='coords id'.split()) -for i,row in tqdm(coords.iterrows(), total=coords.shape[0]): - chrm, ini, end = row.chr, row.start, row.end - if (chrm in list(csizes.chrom)): - if (row.strand == "+"): - tss = ini - else: - tss = end - tss = int(tss) - ini = tss-exte - end = tss+exte - csize = csizes['size'][csizes.chrom==chrm].values[0] - #print(chrm,tss,ini,end,csize) - # Check start and end beyond chromosome - if (ini<1): - ini = 1 - if (end>csize): - end = csize - # Determine bin of TSS - mid = int((tss-ini)/reso) - tot = int((end-ini)/reso) - # Write results - coo = '{}:{}-{}_{}'.format(chrm,ini,end,mid) - #print(coo) - nrow= {'coords':coo, - 'id':row.gene} - data = data.append(nrow, ignore_index=True) - #print("{}:{}-{}\t{}\t{}".format(chrm,ini,end,row.symbol,row.id)) - -# Write files... -print('Writting files...') -n = 0 -for id, i in enumerate(range(0,len(data),numpf)): - n = n + 1 - start = i - end = i + numpf-1 #neglect last row ... - data.iloc[start:end].to_csv('{}_{}_{}.{}.lst'.format(ifile,reso,exte,n), sep='\t', index=False) - -print('Done!') diff --git a/metaloci/tools/from_Marc/01_kk_layouts.py b/metaloci/tools/from_Marc/01_kk_layouts.py deleted file mode 100755 index deeacc3..0000000 --- a/metaloci/tools/from_Marc/01_kk_layouts.py +++ /dev/null @@ -1,520 +0,0 @@ -#!/usr/bin/env python -import sys -import warnings -warnings.filterwarnings('ignore') # Warnings filtered -import argparse - -# Instantiate the parser -parser = argparse.ArgumentParser(description='Generate a Kamada-Kawai layout for a list of regions.') - -# Required arguments -parser.add_argument('-w','--wfolder', required=True, dest='work_dir', - help='Path to the folder where results will be stored. THe script will create subfolders inside.') -parser.add_argument('-c','--cfolder', required=True, dest='cooler_dir', - help='Path to the folder where the cooler files are stored. You only need to specify the folder, not each cooler file.') -parser.add_argument('-g','--gfile', required=True, dest='gene_file', - help='Path to the file that contains the regions to analyze.') -parser.add_argument('-r','--resolution', required=True, dest='reso', type=int, - help='Resolution in bp to work at. 
') -# Optional arguments -parser.add_argument('-p','--pcutoff', required=False, dest='cutoff', type=float, default=20, - help='Select the top X percentage of HiC interaction to generate Kamada-Kawai graph [20].') -parser.add_argument('-x','--silent', required=False, dest='silent', action='store_false', - help='Silent mode [True]') -parser.add_argument('-f','--force', required=False, dest='force', action='store_true', - help='Force rewriting existing data [False]') - -# Parse arguments -args = parser.parse_args() - -print("Input argument values:") -for k in args.__dict__: - if args.__dict__[k] is not None: - print("\t{} --> {}".format(k,args.__dict__[k])) -#sys.exit() - -# Variables -work_dir = args.work_dir -cooler_dir = args.cooler_dir -genes_2_do = args.gene_file -reso = args.reso -cutoff = args.cutoff -silent = args.silent -force = args.force - -# Massage arguments -# Divide the cutoff by 100, so we can just multiply the length of the array of contacts -cutoff = int(cutoff)/100 - -# Checking if the user has ended the path with the slash, if not add it -if not args.cooler_dir.endswith("/"): - cooler_dir = args.cooler_dir+"/" -else: - cooler_dir = args.cooler_dir -if not args.work_dir.endswith("/"): - work_dir = args.work_dir+"/" -else: - work_dir = args.work_dir - -# LIBRARIES -import matplotlib.pyplot as plt -import numpy as np -import seaborn as sns -import pandas as pd -import networkx as nx -import os -import h5py -import cooler -import pickle -import subprocess -import pathlib -from scipy.spatial import distance -from scipy.sparse import csr_matrix -from scipy.signal import argrelextrema - -# FUNCTIONS -def CreateStructure(work_dir): - - ## This function creates the folder structure in the working directory on where to store the - ## Kamada-Kawai layaouts extracted from the Hi-C matrices - - subfolders = ["mls/coords", "mls/datas", "mls/figures", "mls/pdfs/png", "mls/pdfs/pdf", "mls/pdfs/matdata"] - - for folder in subfolders: - dir_com = work_dir+folder - - ## parents is to create the structure "above", exist_ok is to skip the folder in case it exists - pathlib.Path(dir_com).mkdir(parents=True, exist_ok=True) - -def CreateSubStructure(work_dir, dirnum): - - ## This function creates the a subfolder structure in order to separate the outputs in a way - ## that the OS does not explode (separates the output by chromosome) - - subfolders = ["mls/coords/", "mls/datas/", "mls/pdfs/png/", "mls/pdfs/pdf/", "mls/pdfs/matdata/"] - - for folder in subfolders: - dir_com = work_dir+folder+str(dirnum) - - ## parents is to create the structure "above", exist_ok is to skip the folder in case it exists - pathlib.Path(dir_com).mkdir(parents=True, exist_ok=True) - -def cool2mat(cfile, region, pdff, silent): - region,nu = region.split('_') - c = cooler.Cooler(cfile) - - print("COOL2MAT> ",region) - print("COOL2MAT> ",c.info) - - mat = c.matrix(sparse=True, balance=True).fetch(region) - arr = mat.toarray() - #print(arr) - arr[np.isnan(arr)] = 0 # Replace NaNs by 0 - fig = plt.figure(figsize=(10, 10)) - ax = fig.add_subplot(111) - - ## This line was added in order to avoid the "divided by 0" error from the log10 function below. - ## It deactivates the error, and the script reactivates it after the calculation. 
- np.seterr(divide='ignore') - im = ax.matshow(np.log10(arr), cmap='YlOrRd') - - np.seterr(divide='warn') - - fig.colorbar(im) - - if (pdff): - plt.savefig(pdff) - if (silent==False): - plt.show() - - plt.close() - - return(arr) - -def MatTransform(mat): - - ## We transform the matrix (normalized contacts?) in order to work more easily with it - - print('Matrix data...') - print('\tMin: {}, Max: {}'.format(np.min(mat), np.max(mat))) - - ## First of all we check if the diagonal has 0s. If this is the case, we set the row and - ## column for that point on the diagonal to also 0s (sometimes the normalization algorithm - ## has artifacts) - - diag = np.array(mat.diagonal()) - - for i in range(0, len(diag)): - if (diag[i] == 0): - - mat[i] = 0 - mat[:, i] = 0 - - ## Pseudocounts if min is zero - if (np.min(mat) == 0): - pc = np.min(mat[mat>0]) - print('Pseudocounts: {}'.format(pc)) - mat = mat+pc - - ## Scale if all below 1 - if (np.max(mat)<=1 or np.min(mat)<=1): ## Why np.min? - sf = 1/np.min(mat) - print('Scaling factor: {}'.format(sf)) - mat = mat*sf - - ## Recheck that the mnimum of the matrix is, at least, 1 - if (np.min(mat)<1): - mat[mat<1] = 1 - - print('\tMin: {}, Max: {}'.format(np.min(mat), np.max(mat))) - - ## Log10 the data - print('Log10 matrix...') - mat = np.log10(mat) - print('\tMin: {}, Max: {}'.format(np.min(mat), np.max(mat))) - - ## Mean of the data - print('Mean of non zero...') - me = mat[mat>1].mean() - print('\tMean: {}'.format(me)) - print() - - return(mat) - -def CheckDiagonal(diag): - - ## This function checks the number of 0s of the diagonal and return - ## the total number of 0s from the diagonal and the maximum stretch of - ## consecutive 0s at the diagonal. - - total = 0 - stretch = 0 - max_stretch = 0 - - for i in range(0, len(diag)): - - if (diag[i] == 0): - total += 1 - stretch +=1 - - if (stretch > max_stretch): - max_stretch = stretch - - else: - stretch = 0 - - return(total, max_stretch) - -def AnaMat(runid, mat, reso, pl, silent, work_dir, dirnum, cutoff): - - ## Function to change the matrix in order to get the cutoff% of max hits - - ## runid is used to print the coordinates of the gene, but most OSs do not - ## like to have ":" or "-" in the filename. It is changed to "_" - runid_4_linux = runid.replace(":", "_").replace("-", "_") - - plotsize = 5 - - ## Always remember to copy the numpy array using copy. If you do the usual - ## (temp = mat), it acts as a link between them and changes in temp will affect - ## the original matrix! - temp = mat.copy().flatten() - print("Matrix size: {}\n".format(len(temp))) - - ## Put the minimum values as NaNs, (remember, the minimums of this matrix correspond - ## to the pseudocounts, which are the 0 of the Hi-C) - temp[temp<=np.min(temp)] = np.nan - - ## Give the user info about how empty (or filled) is the matrix. 
- non0 = int(len(temp)-len(temp[np.isnan(temp)])) - perc_non0 = np.round(non0 / len(temp) * 100, decimals = 2) - - print("Non 0 matrix size: {}".format(non0)) - print("% of non 0 in the original matrix: {}%\n".format(perc_non0)) - - ## Cutoff percentil - print('Cutoff...') - per = cutoff - subf = mat.flatten() - subf_nonz = subf[subf>min(subf)] # Percentil of NON-ZEROES in original matrix - #top = int(len(subf)*per) - top = int(len(subf_nonz)*per) - ind = np.argpartition(subf, -top)[-top:] - temp = subf[ind] - print("\tSize of the whole matrix: {}".format(len(subf))) - print("\tNumber of top interactions to pick and get the cutoff: {}".format(top)) - print("\tSize of the submatrix with the top interations: {}".format(len(ind))) - print("\tSize of the new matrix with {} elements: {}".format(top, len(temp))) - print() - ct = np.min(temp) - ctmax = np.max(temp) - print('\tCutoff {}'.format(ct)) - print('\tCutoff (%) {}'.format(len(temp)/len(subf)*100)) - print() - - # Data distribution - print('Data distr...') - #sns.distplot(mat) - if (silent==False): - plt.show() - plt.close() - - # Subset to cutoff percentil - logcutoff = ct - subm = mat.copy() - subm = np.where(subm==1., 0, subm) - subm[subm < logcutoff] = 0 - - # Connectivity (~persistance length) - print('Connectivity between consequtive particles...') - if (pl): - perlen = pl - else: - perlen = np.nanmax(subm[subm>0])**2 - print('\t"Persistance lenght": {}'.format(perlen)) - print() - - rng = np.arange(len(subm)-1) - subm[rng, rng+1] = perlen - - # Remove exact diagonal - subm[rng, rng] = 0 - - # Plot - #fig, ax = plt.subplots(figsize=(plotsize, plotsize)) - #ax.imshow(mat, cmap='YlOrRd', vmax=ctmax) - #temp = work_dir+'mls/pdfs/matdata/subfolder_'+str(dirnum)+'/{}_hic.pdf' - #plt.savefig(temp.format(runid_4_linux)) - #plt.close() - - # Mix matrices to plot - u = np.triu(mat+1,k=1) - l = np.tril(subm,k=-1) - mixmat = u+l - - # Plot data - #fig = plt.figure() - fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5)) - fig.suptitle('Matrix for {}'.format(runid)) - ax1.imshow(mixmat, cmap='YlOrRd', vmax=ctmax+1) - #sns.histplot(data=temp.flatten(), stat="density", alpha=0.4, kde=True, legend=False, - # kde_kws={"cut": 3}, **{"linewidth" : 0}) - fig.tight_layout() - - temp = work_dir+'mls/pdfs/matdata/'+str(dirnum)+'/{}_matdata.pdf' - if (silent==False): - plt.savefig(temp.format(runid_4_linux)) - plt.close() - - # Modify the matrix and transform to restraints - subm = np.where(subm==0, np.nan, subm) # Remove zeroes - subm = 1/subm # Convert to distance Matrix instead of similarity matrix - subm = np.triu(subm, k=0) # Remove lower diagonal - - # This function (nan_to_num) allows the user to change nans, posinf and neginf - # to a number the user specifies - subm = np.nan_to_num(subm, nan=0, posinf=0, neginf=0) # Clean nans and infs - - # FINAL RESTRAINTS - #print("FINAL RESTRAINTS CALCULATED...") - #fig, ax = plt.subplots(figsize=(plotsize, plotsize)) - #ax.imshow(subm) - #plt.savefig('mls/pdfs/{}_restraints.pdf'.format(runid_4_linux)) - #if (silent=False) - # plt.show() - #plt.close() - - return(subm) - -def Mat2KK(runid, mat, reso, midp, silent, work_dir, dirnum): - - runid_4_linux = runid.replace(":", "_").replace("-", "_") - - ## Create sparse matrix - matrix = csr_matrix(mat) - print("Sparse matrix contains {:,} restraints".format(np.count_nonzero(matrix.toarray()))) - - ## Get the KK layout - G = nx.from_scipy_sparse_array(matrix) - #G = nx.from_scipy_sparse_matrix(matrix) - print("Layouting KK...") - pos = nx.kamada_kawai_layout(G) - 
- ## Get distance matrix - coords = list(pos.values()) - dists = distance.cdist(coords, coords, 'euclidean') - ## Plot KK - ## The following code is duplicated in order to produce visible lines in both pdf - ## and png. Silver lines in png where almost invisible. - print("Plotting KK...") - - plotsize = 10 - color_map = [] - plt.figure(figsize=(plotsize,plotsize)) - options = { - 'node_size': 50, - 'edge_color': 'silver', - 'linewidths': 0.1, - 'width': 0.05 - } - - ## With nx.draw we draw the network (the Kamada-Kawai layout), to add the numbers of the - ## nodes we use plt.text. - nx.draw(G, pos, node_color=range(len(pos)), cmap=plt.cm.coolwarm, **options) - if (midp): - plt.scatter(pos[midp][0],pos[midp][1], s=80, facecolors='none', edgecolors='r') - xs = [pos[n][0] for n in pos] - ys = [pos[n][1] for n in pos] - sns.lineplot(x=xs, y=ys, sort=False, lw=2, color='black', legend = False, zorder=1) - for p in range(1,len(coords)+1):#range(Dpar-Dlen,Dpar+Dlen+1): - x = coords[p-1][0] - y = coords[p-1][1] - plt.text(x, y, s=p, color='black', fontsize=10) - - temp = work_dir+'mls/pdfs/pdf/'+str(dirnum)+'/{}_kk.pdf' - if (silent==False): - plt.savefig(temp.format(runid_4_linux)) - plt.close() - - if (silent==False): - plt.show() - - options['edge_color'] = 'black' - plotsize = 15 - plt.figure(figsize=(plotsize,plotsize)) - - nx.draw(G, pos, node_color=range(len(pos)), cmap=plt.cm.coolwarm, **options) - if (midp): - plt.scatter(pos[midp][0],pos[midp][1], s=80, facecolors='none', edgecolors='r') - xs = [pos[n][0] for n in pos] - ys = [pos[n][1] for n in pos] - sns.lineplot(x=xs, y=ys, sort=False, lw=2, color='black', legend = False, zorder=1) - for p in range(1,len(coords)+1):#range(Dpar-Dlen,Dpar+Dlen+1): - x = coords[p-1][0] - y = coords[p-1][1] - plt.text(x, y, s=p, color='black', fontsize=15) - - temp = work_dir+'mls/pdfs/png/'+str(dirnum)+'/{}_kk.png' - if (silent==False): - plt.savefig(temp.format(runid_4_linux)) - - plt.close() - - ## Save KK data in pkl data, so it can be read at later time/scripts - temp = work_dir+'mls/coords/'+str(dirnum)+'/{}.pkl' - cfile = temp.format(runid_4_linux) - with open(cfile, 'wb') as output: - pickle.dump(matrix, output) - pickle.dump(pos, output) - pickle.dump(dists, output) - pickle.dump(coords, output) - pickle.dump(G, output) - - return(coords,pos) - -CreateStructure(work_dir) - -## In this file the script stores bad regions (> 50% of the diagonal are 0s or -## the stretch of 0s is bigger than 50) -perc_file = work_dir+"bad_regions.txt" -perc_f = open(perc_file, 'a') - -# Input file with all regions to KK -regions = pd.read_csv(genes_2_do, sep='\t') - -for i,row in regions.iterrows(): - - region = row.coords - - runid = '{}_{}'.format(region,reso) - #runid = '{}'.format(region) - - runid_4_linux = runid.replace(":", "_").replace("-", "_") - - chrom = runid_4_linux.split("_")[0] - - CreateSubStructure(work_dir, chrom) - - grep_com = "ls -1v "+cooler_dir+" | grep mcool" - grep_res = subprocess.getoutput(grep_com) - #print("GC>>>>>>>>>>>", grep_res) - if (len(grep_res.split()) > 1): # Multiple chromosomes files in folder - grep_com = "ls -1v "+cooler_dir+" | grep mcool | grep _"+chrom+"_" - grep_res = subprocess.getoutput(grep_com) - #print("GP>>>>>>>>>>>", grep_res) - - cfile = cooler_dir+grep_res+"::/resolutions/"+str(reso) - - print('\n---> Working on region {}: {} [{}/{}]'.format(row.id,runid,i+1,len(regions))) - print("Will use {} file for chromosome {}... 
".format(cfile, chrom)) - - # Check if file already exists - temp = work_dir+'mls/coords/'+str(chrom)+'/{}.pkl' - - coordsfile = temp.format(runid_4_linux) - - print("Checking if region {} is already done...".format(region)) - - grep_com = "grep {} {}".format(region, perc_file) - grep_check = subprocess.getoutput(grep_com) - - - ## This part is modified. In some of the datasets (NeuS) the matrix was completely empty, - ## so the pkl file was not created (gave an error in the MatTransform function: pc = np.min(mat[mat>0])) - ## To bypass this part, the script checks the bad_regions.txt; if the regions is also found in this - ## file, the script skips the calculations (as the script already has checked the region in question). - if os.path.isfile(coordsfile) and force==False: - print('WARNING! already done! If you want to rewrite it, use -f or --force.') - continue - elif (grep_check != ""): - print('ERROR! no data file... {}!'.format(perc_file)) - continue - - try: - ## Run the code... - ## Get the submatrix for the region from the general Hi-C matrix - print(">>>>>>>>",cfile, region, None, silent) - mat = cool2mat(cfile, region, None, silent) - - ## Get the diagonal to check if it is correct - diag = np.array(mat.diagonal()) - - total_0, max_stretch = CheckDiagonal(diag) - perc_0 = np.round(total_0 / len(diag)*100, decimals = 2) - - ## If the whole diagonal is 0s, the script cannot do anything, so it goes to the next - ## region after saying so to the user. - ## flush is needed in order to write to the file and continue (memory can get full and - ## not write properly to file) - if (total_0 == len(diag)): - print("Void matrix; passing to the next one") - perc_f.write("{}\tvoid\n".format(region)) - perc_f.flush() - continue - - ## If half (or more) of the diagonal corresponds to 0s, save that region in - ## a file to check - if (int(perc_0) >= 50): - perc_f.write("{}\tperc\n".format(region)) - perc_f.flush() - - ## If half (or more) of the diagonal corresponds to 0s, save that region in - ## a file to check - elif (int(max_stretch) >= 50): - perc_f.write("{}\tstretch\n".format(region)) - perc_f.flush() - - ## Change the matrix in order to properly calculate the Kamada-Kawai layout - mat = MatTransform(mat) - - ## Get submatrix of restraints - subm = AnaMat(runid, mat, reso, None, silent, work_dir, chrom, cutoff) - - ## Kamada Kawai Layout - coords,pos = Mat2KK(runid, subm, reso, None, silent, work_dir, chrom) - - print('<---done!') - - except Exception: - print('ERROR!>Failed: {}'.format(runid)) - diff --git a/metaloci/tools/from_Marc/02_LMI_anysignal.py b/metaloci/tools/from_Marc/02_LMI_anysignal.py deleted file mode 100755 index 2769abb..0000000 --- a/metaloci/tools/from_Marc/02_LMI_anysignal.py +++ /dev/null @@ -1,571 +0,0 @@ -#!/usr/bin/env python -import sys -import warnings -warnings.filterwarnings('ignore') # Warnings filtered -import argparse - -# Instantiate the parser -parser = argparse.ArgumentParser(description='Generate LMI for any genomic signal for a list of regions.') - -# Required arguments -parser.add_argument('-w','--wfolder', required=True, dest='work_dir', - help='Path to the folder where results will be stored. THe script will create subfolders inside.') -parser.add_argument('-c','--cfolder', required=True, dest='cooler_dir', - help='Path to the folder where the cooler files are stored. 
You only need to specify the folder, not each cooler file.') -parser.add_argument('-g','--gfile', required=True, dest='gene_file', - help='Path to the file that contains the regions to analyze.') -parser.add_argument('-s','--signal_dir', required=True, dest='signal_dir', - help='Path to the folder where the signal bigwig files are stored. Need to have the same name as signals and end with \".bw\". ') -parser.add_argument('-t','--typesignals', required=True, dest='signals', - help='Signals to run seperated by \",\". ') -parser.add_argument('-r','--resolution', required=True, dest='reso', type=int, - help='Resolution in bp to work at. ') -# Optional arguments -parser.add_argument('-p','--pval', required=False, dest='signipval', default=0.05, - help='P-value to select significant LMIs [0.05].') -parser.add_argument('-i','--influence', required=False, dest='influence', default=1.5, - help='Bins influence radius [1.5 bins average distance].') -parser.add_argument('-b','--buffer', required=False, dest='bfact', default=2.0, - help='Buffer to multiply influence [2.0].') -parser.add_argument('-x','--silent', required=False, dest='silent', action='store_false', - help='Silent mode [True]') -parser.add_argument('-f','--force', required=False, dest='force', action='store_true', - help='Force rewriting existing data [False]') - -# Parse arguments -args = parser.parse_args() - -print("Input argument values:") -for k in args.__dict__: - if args.__dict__[k] is not None: - print("\t{} --> {}".format(k,args.__dict__[k])) -#sys.exit() - -# Variables -gene_file = args.gene_file -reso = args.reso -signipval = args.signipval -silent = args.silent -force = args.force -signals = args.signals.split(',') -influence = args.influence -bfact = args.bfact - -# Massage arguments -# Checking if the user has ended the path with the slash, if not add it -if not args.cooler_dir.endswith("/"): - cooler_dir = args.cooler_dir+"/" -else: - cooler_dir = args.cooler_dir -if not args.work_dir.endswith("/"): - work_dir = args.work_dir+"/" -else: - work_dir = args.work_dir -if not args.signal_dir.endswith("/"): - signal_dir = args.signal_dir+"/" -else: - signal_dir = args.signal_dir - -# LIBRARIES -print('Importing...') -import time -import os -import math -import re -import pickle -import pyBigWig -import pathlib -import glob -import re - -import matplotlib.pyplot as plt -import numpy as np -import seaborn as sns -import pandas as pd -import geopandas as gpd -import networkx as nx -import libpysal as lp - -import shapely.geometry -import shapely.ops - -from esda.moran import Moran_Local -from libpysal.weights.spatial_lag import lag_spatial -from splot.esda import moran_scatterplot, plot_moran, lisa_cluster -from scipy.spatial import Voronoi, voronoi_plot_2d -from shapely.geometry import Polygon, LineString, Point -print("Imports done!") - -# FUNCTIONS -def signalProfile(bw,reg,res,norm,log10): - sig = [] - c,i,e = re.split(':|-',reg) - i = int(i) - e = int(e)+1 - for coor in range(i,e,res): - amean = np.mean(bw.values(c, coor, coor+res)) - sig.append(amean) - print("signal min and max: {} {}".format(np.nanmin(sig),np.nanmax(sig))) - if (isinstance(norm, (int, float))): - sig = [float(i)/norm for i in sig] - if (norm == 'max'): - sig = [float(i)/max(sig) for i in sig] - if (norm == 'sum'): - sig = [float(i)/sum(sig) for i in sig] - if (norm == '01'): - sig = [(float(i)-min(sig))/(max(sig)-min(sig)+0.01) for i in sig] - - if (log10): - sig = np.log10(sig) - - return(sig) - -def signalProfile20230123(bw,reg,res,pc,norm,log10): - sig = 
[] - c,i,e = re.split(':|-',reg) - i = int(i) - e = int(e)+1 - for coor in range(i,e,res): - try: - #asum = np.nansum(bw.values(c, coor, coor+res)) # SUMS ALL OVER 10K TIMES! - asum = np.nanmean(bw.values(c, coor, coor+res)) - if (asum < pc): - #asum = 0 - #asum = np.NaN - asum = pc - except RuntimeError: - asum = pc - #asum = np.NaN - sig.append(asum) - if (isinstance(norm, (int, float))): - sig = [float(i)/norm for i in sig] - if (norm == 'max'): - sig = [float(i)/max(sig) for i in sig] - if (norm == 'sum'): - sig = [float(i)/sum(sig) for i in sig] - if (norm == '01'): - sig = [(float(i)-min(sig))/(max(sig)-min(sig)+0.01) for i in sig] - - if (log10): - sig = np.log10(sig) - - return(sig) - -def signalProfileBED(bd,reg,res,pc,norm): - sig = [] - c,i,e = re.split(':|-',reg) - i = int(i) - e = int(e)+1 - sig = bd[(bd.chromo ==c) & (bd.start>=i) & (bd.end<=e)].iloc[:, 3] - if (isinstance(norm, (int, float))): - sig = [float(i)/norm for i in sig] - if (norm == 'max'): - sig = [float(i)/max(sig) for i in sig] - if (norm == 'sum'): - sig = [float(i)/sum(sig) for i in sig] - if (norm == '01'): - sig = [(float(i)-min(sig))/(max(sig)-min(sig)+0.01) for i in sig] - return(sig) - -def signalProfileOK(bw,reg,res,pc,norm,log10): - sig = [] - c,i,e = re.split(':|-',reg) ## Split the region name intro chromosome, start and end (chr11:128908798-132928998). Get the name from the pkl file? - i = int(i) - e = int(e)+1 - - for coor in range(i,e,res): ## Coords from start to end by resolution step - try: - ## Get the signal data from the bigwig file (be careful with the namings of the chr) - if (np.all(np.isnan(bw.values(c, coor, coor+res)))): - asum = pc - else: - asum = np.nanmedian(bw.values(c, coor, coor+res)) - - except RuntimeError: - asum = pc - - if (asum < pc): ## If the signal is below a minimum, asume it's the minimum - asum = pc - - sig.append(asum) - - if (isinstance(norm, (int, float))): - sig = [float(i)/int(norm) for i in sig] - elif (norm == 'max'): - sig = [float(i)/max(sig) for i in sig] - elif (norm == 'sum'): - sig = [float(i)/sum(sig) for i in sig] - elif (norm == '01'): - sig = [(float(i)-min(sig))/(max(sig)-min(sig)+0.01) for i in sig] - - if (log10): - sig = np.log10(sig) - - return(sig) - -def GetPointId(cs,p): - i = 0 - for c in cs: - if (p.contains(Point(c))): - return(i) - continue - else: - i=i+1 - -def GetRegionFromFile(cfile): ## Function to get the region from the filename of the pkl (assuming files where generated with the pipeline) - - region_temp = cfile.split("/")[-1] - - region_split = region_temp.split("_") - - region = region_split[0] + ":" + region_split[1] + "-" + region_split[2] - - return(region) - -def GetSignalPlot(cfile, sig_data, reso, G, pos): ## This function "paints" each node of the Kamada-Kawai with the signal from the signal file - - region = GetRegionFromFile(cfile) - - #minsig = -1 - #minsig = min(-1,sig_data.header()['minVal'])#*reso - #print('Signal profiling over {} max value and a pseudo-count of {}'.format(sig_data.header()['maxVal'], minsig)) - #signal = signalProfile(sig_data,region,reso,minsig,'01',False) # Decided to test non normalized data - #signal = signalProfile(sig_data,region,reso,minsig,None,False) - signal = signalProfile(sig_data,region,reso,None,False) - signal = np.array(signal) - - plotsize = 10 - - options = { - 'node_size': 50, - 'edge_color': 'silver', - 'linewidths': 0.1, - 'width': 0.05, - } - - print("Mapping into KK signal data...") - color_map = [] - plt.figure(figsize=(plotsize,plotsize)) - nx.draw(G, pos, 
node_color=signal[0:len(pos)], cmap=plt.cm.coolwarm, **options) - plt.scatter(pos[midp][0],pos[midp][1], s=80, facecolors='none', edgecolors='r') - xs = [pos[n][0] for n in pos] - ys = [pos[n][1] for n in pos] - sns.lineplot(x = xs, y = ys, sort=False, lw=2, color='black', legend = False, zorder=1) - for p in range(1,len(coords)+1):#range(Dpar-Dlen,Dpar+Dlen+1): - x = coords[p-1][0] - y = coords[p-1][1] - plt.text(x, y, s=p, color='black', fontsize=10) - - plt.show() - plt.close() - -def LMI_data(cfile, sig_data, reso, poly_from_lines, coords, buffer): - - region = GetRegionFromFile(cfile) - - #minsig = -1 - #minsig = min(-1,sig_data.header()['minVal'])#*reso - #print('Signal profiling over {} max value and a pseudo-count of {}'.format(sig_data.header()['maxVal'], minsig)) - #signal = signalProfile(sig_data,region,reso,minsig,'01',False) # Decided not to use normalized data - #signal = signalProfile(sig_data,region,reso,minsig,None,False) - signal = signalProfile(sig_data,region,reso,None,False) - signal = np.array(signal) - - mydata = gpd.GeoDataFrame(columns=['v','geometry']) ## Stored in Geopandas DataFrame to do LMI - - for poly in poly_from_lines: - coordid = GetPointId(coords,poly) - try: - v = signal[coordid] - except IndexError: - v = 0. - - mydata = mydata.append({'v':v,'geometry':poly}, ignore_index=True) - - # Voronoi shaped & closed - shape = LineString(coords).buffer(buffer) - close = mydata.convex_hull.union(mydata.buffer(0.1, resolution=1)).geometry.unary_union - i = 0 - for q in mydata.geometry: - mydata.geometry[i] = q.intersection(close) - mydata.geometry[i] = shape.intersection(q) - i = i+1 - - return(mydata) - -# START PROCESS -startTime = time.perf_counter() - -if (signals == ""): - signal_switch = 0 - print("All signals in the signal folder will be used") -else: - signal_switch = 1 - print("Signals choosen are: " + " ".join(signals)) - -c_dir = os.path.join(work_dir, "mls", "coords") - -info = {} - -signal_files = [f for f in glob.glob(signal_dir + "*.bw")] - -signal_names = [] - -for sf in signal_files: - # print(row.IID) - - name_f = os.path.splitext(os.path.split(sf)[1])[0] - - if (signal_switch): - - if (name_f in signals): - - signal_names.append(name_f) - - else: - - continue - - else: - - signal_names.append(name_f) - - cosa = pyBigWig.open(sf) - - info[name_f] = cosa -print(">>>>>>>>",signal_names) - -#How to read the genes -genes = pd.read_table(gene_file) - -for i_gene, line in genes.iterrows(): - - gene_c = line.coords - gene_n = line.id - gene_i = line.id - print(gene_c,gene_n,gene_i) - - region = gene_c.split("_")[0] - midp = gene_c.split("_")[1] - - # region_lin = region. - - gene_i_t = gene_i.split(".")[0] - - chr_temp = region.split(":")[0] - - cfile_temp = region.replace(":", "_").replace("-", "_") - - # cfile_temp = cfile_temp + "_" + str(reso) + ".pkl" - - cfile = os.path.join(c_dir, chr_temp, cfile_temp + "_" + str(midp) + "_" + str(reso) + ".pkl") - - if os.path.isfile(cfile): - print ("File exists. Loading data...") - with open(cfile, 'rb') as input: - matrix = pickle.load(input) - pos = pickle.load(input) - dists = pickle.load(input) - coords = pickle.load(input) - G = pickle.load(input) - else: - print("File does not exist {}. 
Exiting script...".format(cfile)) - continue - #sys.exit(1) - -# scatter_2_save = os.path.join(work_dir, "datasets", dataset_name, chr_temp, cfile_temp, "scatter") -# lisa_2_save = os.path.join(work_dir, "datasets", dataset_name, chr_temp, cfile_temp, "lisa_cluster") -# choro_2_save = os.path.join(work_dir, "datasets", dataset_name, chr_temp, cfile_temp, "choropleth") -# moran_2_save = os.path.join(work_dir, "datasets", dataset_name, chr_temp, cfile_temp, "moran_geom") -# moran_2_save = os.path.join(work_dir, dataset_name, "mls/datas", chr_temp) - moran_2_save = os.path.join(work_dir, "mls/datas", chr_temp) - - -# pathlib.Path(scatter_2_save).mkdir(parents=True, exist_ok=True) -# pathlib.Path(lisa_2_save).mkdir(parents=True, exist_ok=True) -# pathlib.Path(choro_2_save).mkdir(parents=True, exist_ok=True) -# pathlib.Path(moran_2_save).mkdir(parents=True, exist_ok=True) - - # info_file = open(os.path.join(work_dir, "datasets", dataset_name, chr_temp, gene_i_t, "info_file.txt"), 'w') - # info_file.write("IID\tMoran_I\tMoran_pvalue\n") - - - # From points to Voronoi polygons - print("Voronoing KK...") - # set limits - lims = 2 - points = coords.copy() - points.append(np.array([-lims,lims])) - points.append(np.array([lims,-lims])) - points.append(np.array([lims,lims])) - points.append(np.array([-lims,-lims])) - points.append(np.array([-lims,0])) - points.append(np.array([0,-lims])) - points.append(np.array([lims,0])) - points.append(np.array([0,lims])) - # get voronoi - vor = Voronoi(points) - #voronoi_plot_2d(vor, show_vertices=False) - - print("Voronoi Polygons and mapped data...") - lines = [ - shapely.geometry.LineString(vor.vertices[line]) - for line in vor.ridge_vertices - if -1 not in line - ] - - poly_from_lines = list(shapely.ops.polygonize(lines)) - - print("Voronois polygons for {} points...".format(sum(1 for _ in poly_from_lines))) - - # Get average distance between consequitive points to define influence, which should be 2 particles - mdist = dists.diagonal(1).mean() - buffer = mdist*influence - print("Average distance between consecutive particles {:6.4f} [{:6.4f}]".format(mdist,buffer)) - - j = 0 - print(signal_names) - for signal_type in signal_names: - print(signal_type) - - dfile = os.path.join(moran_2_save, "{}_{}_{}.tsv").format(cfile_temp, reso, signal_type) - pfile = os.path.join(moran_2_save, "{}_{}_{}_{}.midp").format(cfile_temp, reso, midp, signal_type) - - print('\n---> Working on gene {} [{}/{}] and signal {} [{}/{}]'.format(gene_n,i_gene+1,len(genes), signal_type, j+1, len(signal_names))) - - j += 1 - - if os.path.isfile(dfile and force==False): - print(dfile) - print("LMI for gene {} and signal {} already done, skipping".format(gene_n, signal_type)) - continue - - mydata = LMI_data(cfile, info[signal_type], reso, poly_from_lines, coords, buffer) - - # Get weights for geometric distance - print("Getting weights and geometric distance for LM") - y = mydata['v'].values - #w = Queen.from_dataframe(mydata) - w = lp.weights.DistanceBand.from_dataframe(mydata, buffer*bfact) - w.transform = 'r' - - # # Get Global Moran Index - # moran = Moran(y, w, permutations=10000) - # if (math.isnan(moran.I)): - # print(">>> ERROR... 
MATH NAN") - # #return() - # print("Global Moran Index: {:4.2f} with p-val: {:8.6f} ".format(moran.I,moran.p_sim)) - # info_file.write(row.IID + "\t" + str(moran.I) + "\t" + str(moran.p_sim) + "\n") - - # Show Global Moran Plot - # #plot_moran(moran, zstandard=False, figsize=(10,4)) - - # calculate Moran_Local and plot - moran_loc = Moran_Local(y, w, permutations=10000) - lags = lag_spatial(moran_loc.w, moran_loc.z) - print("There are a total of {} significant points in Local Moran Index".format(len(moran_loc.p_sim[(moran_loc.p_sim {}".format(k,args.__dict__[k])) -#sys.exit() - -# Variables -run = args.work_dir -minpv = args.signipval -qs = args.qs -rm = args.rm.split(',') -reso = args.reso -silent = args.silent -force = args.force -rfile = args.gene_file -signals = args.signals2choose.split(',') - -# Massage arguments -# default quartiles -if not qs: - qs = [1,2] -# Checking if the user has ended the path with the slash, if not add it -if not args.cooler_dir.endswith("/"): - cooler_dir = args.cooler_dir+"/" -else: - cooler_dir = args.cooler_dir -if not args.work_dir.endswith("/"): - work_dir = args.work_dir+"/" -else: - work_dir = args.work_dir - -# LIBRARIES -print('Importing...') -import matplotlib.pyplot as plt - -import os -import re -import time -import datetime -import subprocess -import numpy as np -import fanc as fanc -import cooler -import seaborn as sns -import pandas as pd -import geopandas as gpd -import libpysal as lp -import scipy.stats as stats -import matplotlib.ticker as tick - -from scipy.spatial import distance -from scipy.ndimage import rotate -from descartes import PolygonPatch -from matplotlib.lines import Line2D -from PIL import Image, ImageEnhance, ImageDraw, ImageFont -from matplotlib.ticker import MaxNLocator -from matplotlib.ticker import FormatStrFormatter -from matplotlib.colors import ListedColormap, LinearSegmentedColormap -from shapely.geometry import Point -from shapely.geometry.multipolygon import MultiPolygon -print("Imports done!") - -# FUNCTIONS -def Region2Parts(r): - c,nu = r.split(':') - i,e = nu.split('-') - return(c,int(i),int(e)) - -def GetMLData(mlfile): - # Read data - #Chr Start End Region Index X Y Signal ZSig Zlag LMIpval LMIq LMIs geometry - mlgdata = gpd.read_file(mlfile, GEOM_POSSIBLE_NAMES="geometry", KEEP_GEOM_COLUMNS="NO") - mlgdata.Signal = mlgdata.Signal.astype('float') # Otherwise geopandas thinks is not a float! - mlgdata.LMIpval = mlgdata.LMIpval.astype('float') # Otherwise geopandas thinks is not a float! - mlgdata.Index = mlgdata.Index.astype('int') # Otherwise geopandas thinks is not a float! - mlgdata.X = mlgdata.X.astype('float') # Otherwise geopandas thinks is not a float! - mlgdata.Y = mlgdata.Y.astype('float') # Otherwise geopandas thinks is not a float! - mlgdata.ZSig = mlgdata.ZSig.astype('float') # Otherwise geopandas thinks is not a float! - mlgdata.ZLag = mlgdata.ZLag.astype('float') # Otherwise geopandas thinks is not a float! - mlgdata.LMIq = mlgdata.LMIq.astype('int') # Otherwise geopandas thinks is not a float! - mlgdata.LMIs = mlgdata.LMIs.astype('float') # Otherwise geopandas thinks is not a float! 
- return(mlgdata) - -def GetHiCPlot(hicfile,region,midp,signal): - # HiC - c = cooler.Cooler(hicfile) - mat = c.matrix(sparse=True, balance=False).fetch(region) - arr = mat.toarray() - arr = MatTransform(arr) - minv= np.min(arr) - maxv= np.max(arr) - lmat= arr.shape[0] - - # Massage the matrix - arr = np.triu(arr, -1) - arr = rotate(arr, angle=45) - fac = midp/lmat - midm= int(arr.shape[0]/2) - midp= int(arr.shape[0]*fac) - arr = arr[:midm,:] - arr[arr == 0] = np.nan - - # Plot - fig = plt.figure(figsize=(10, 10)) - ax = fig.add_subplot(111) - im = ax.matshow(arr, cmap='YlOrRd', vmin=minv, vmax=maxv) - sns.scatterplot(x=[midp],y=[midm+4], color='lime', marker="^") - plt.axis('off') - #ax.legend().set_visible(False) - nbar = 6 - sm = plt.cm.ScalarMappable(cmap='YlOrRd') - cbar = plt.colorbar(sm, ticks=np.linspace(minv,maxv,nbar), - values=np.linspace(minv,maxv,nbar), - shrink=0.3) - cbar.set_label('log10(Hi-C interactions)', rotation=270, size=14, labelpad=20) - cbar.ax.yaxis.set_major_formatter(tick.FormatStrFormatter('%.2f')) - plt.title('{} [{}] {}'.format(gene,region, signal)) - return(plt) - -def GetKKPlot(mlgdata,region,midp): - # Kamada Kawai Plot - chrm,cini,cend = Region2Parts(region) - plotsize = 10 - poinsize = plotsize*5 - fig = plt.figure(figsize=(plotsize, plotsize)) - x = mlgdata.X - y = mlgdata.Y - plt.axis('off') - g = sns.lineplot(x=x, y=y, sort=False, lw=1, color='grey', legend = False) - g.set_aspect('equal', adjustable='box') - sns.scatterplot(x=x, y=y, hue=range(0,len(x)), palette='coolwarm', legend = False, s=poinsize) - g.set(ylim=(-1.1, 1.1)) - g.set(xlim=(-1.1, 1.1)) - g.tick_params(bottom=False, left=False) - g.annotate('{}:{:,}'.format(chrm,cini), (x[0], y[0])) - g.annotate('{}:{:,}'.format(chrm,cend), (x[len(x)-1], y[len(y)-1])) - sns.scatterplot(x=[x[midp-1]], y=[y[midp-1]], s=poinsize*2, ec="lime", fc="none") - return(plt) - -def GaudiSignalPlot(mlgdata,region,midp,vmin,vmax): - # Gaudi plot Signal - cmap = 'copper_r' - cmap = 'PuOr_r' - x = mlgdata.X - y = mlgdata.Y - q1 = mlgdata.Signal.quantile(0.95) - q2 = mlgdata.Signal.quantile(0.05) - lims = max(abs(q1),abs(q2)) - if (np.sign(q1) == np.sign(q2)): - vmax = q1 - vmin = q2 - if (vmax): - maxv = vmax - else: - #maxv = max(mlgdata.Signal) - maxv = lims - if (vmin): - minv = vmin - else: - #minv = min(mlgdata.Signal) - minv = -lims - polh = mlgdata[mlgdata.Index == midp] - fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'}) - mlgdata.plot(column='Signal', cmap=cmap, linewidth=2, edgecolor='white', vmax=maxv, vmin=minv, ax=ax) - sns.scatterplot(x=[x[midp-1]], y=[y[midp-1]], s=50, ec="none", fc="lime") - sm = plt.cm.ScalarMappable(cmap=cmap) - sm.set_array([]) - nbar = 11 - cbar = plt.colorbar(sm, ticks=np.linspace(minv,maxv,nbar), - values=np.linspace(minv,maxv,nbar), - shrink=0.5) - #cbar.ax.label_params(labelsize=20) - cbar.set_label('Signal', rotation=270, size=20, labelpad=35) - cbar.ax.yaxis.set_major_formatter(tick.FormatStrFormatter('%.2f')) - plt.axis('off') - return(plt) - -def GaudiTypePlot(mlgdata,quartiles,minpv,region,midp): - # Gaudi plot LMI type - x = mlgdata.X - y = mlgdata.Y - colors= ['firebrick', 'lightskyblue', 'steelblue', 'orange'] - fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'}) - ax = fig.gca() - alphas = [] - polc = [] - for i,row in mlgdata.iterrows(): - pol = row.geometry - pty = row.LMIq - ppv = row.LMIpval - polc.append(colors[pty-1]) - if (ppv<=minpv): - alphas.append(1.0) - else: - alphas.append(0.3) - #ax.add_patch(PolygonPatch(pol, 
ec='white', fc=colors[pty-1], alpha=alpha, linewidth=2)) - mlgdata.plot(column='LMIq', linewidth=2, edgecolor='white', color=polc, alpha=alphas, ax=ax) - sns.scatterplot(x=[x[midp-1]], y=[y[midp-1]], s=50, ec="none", fc="lime", zorder=len(mlgdata)) - plt.axis('off') - legend_elements = [Line2D([0], [0], marker='o', color='w', markerfacecolor=colors[0], label='HH', markersize=15), - Line2D([0], [0], marker='o', color='w', markerfacecolor=colors[1], label='LH', markersize=15), - Line2D([0], [0], marker='o', color='w', markerfacecolor=colors[2], label='LL', markersize=15), - Line2D([0], [0], marker='o', color='w', markerfacecolor=colors[3], label='HL', markersize=15)] - plt.legend(handles=legend_elements, frameon=False, fontsize=15, loc='center right', bbox_to_anchor=(1.2, 0.5)) - return(plt) - -# THIS NEEDS TO BE FIXED AT THE LEVEL OF SCRIPT 01. -# The problem is that ZSig or ZLag are not really Z-scored and NOT correctly ordered!!! -# Once fixed in SCRIPT 01, we will go back here. -def MLIScatterPlotFastNotWorking(mlgdata,quartiles,minpv,region,midp): - # LM I's scatter - colors= ['firebrick', 'lightskyblue', 'steelblue', 'orange'] - #x,y = mlgdata.ZSig,mlgdata.ZLag - mlgdata.ZSig = stats.zscore(mlgdata.ZSig) - mlgdata.ZLag = stats.zscore(mlgdata.ZLag) - print('\tGet spatial slope...') - slope, intercept, r_value, p_value, std_err = stats.linregress(mlgdata.ZSig,mlgdata.ZLag) - print('\tGet spatial plot...') - fig, ax = plt.subplots(figsize=(5,5))#, subplot_kw={'aspect':'equal'}) - for i,row in mlgdata.iterrows(): - idx = row.Index - pty = row.LMIq - ppv = row.LMIpval - x = row.ZSig - y = row.ZLag - if (ppv<=minpv): - alpha = 1.0 - else: - alpha = 0.1 - print(idx,pty,ppv,x,y,colors[pty-1],alpha) -# plt.scatter(x=x,y=y, s=100, ec='white', fc=colors[pty-1], alpha=alpha) - #sns.scatterplot(x=[x[midp-1]], y=[y[midp-1]], s=150, ec="lime", fc="none", zorder=len(mlgdata)) -# sns.regplot(x=x, y=y, scatter=False, color='k') -# plt.title('Moran Local Scatterplot\nr: {:4.2f} p-value: {:.1e}'.format(r_value, p_value)) -# plt.axvline(x=0, color='k', linestyle=':') -# plt.axhline(y=0, color='k', linestyle=':') -# ax.set_xlabel('Z-score(Signal)') -# ax.set_ylabel('Z-score(Signal Spatial Lag)') -# sns.despine(top=True, right=True, left=False, bottom=False, offset=10, trim=False) -# plt.xlim(min(x)-0.5,max(x)+0.5) -# plt.ylim(min(y)-0.5,max(y)+0.5) - return(plt) - -def MLIScatterPlot(mlgdata,quartiles,minpv,region,midp): - # LM I's scatter - # Get distances to get spatial lag - #print('\tGet spatial Lag...') - coords = mlgdata[['X', 'Y']].values.tolist() - dists = distance.cdist(coords, coords, 'euclidean') - influence = 1.5 - bfact = 2 - colors= ['firebrick', 'lightskyblue', 'steelblue', 'orange'] - mdist = dists.diagonal(1).mean() - buffer = mdist*influence - wq = lp.weights.DistanceBand.from_dataframe(mlgdata,buffer*bfact) - wq.transform = 'r' - y = mlgdata['Signal'] - ylag = lp.weights.lag_spatial(wq, y) - x,y = stats.zscore(mlgdata.Signal),stats.zscore(ylag) - #print('\tGet spatial slope...') - slope, intercept, r_value, p_value, std_err = stats.linregress(x,y) - #print('\tGet spatial plot...') - fig, ax = plt.subplots(figsize=(5,5))#, subplot_kw={'aspect':'equal'}) - for i,row in mlgdata.iterrows(): -# print(i,'...') - pty = row.LMIq - ppv = row.LMIpval - if (ppv<=minpv): - alpha = 1.0 - else: - alpha = 0.1 - plt.scatter(x=x[i],y=y[i], s=100, ec='white', fc=colors[pty-1], alpha=alpha) - #print('\tGet spatial point...') - sns.scatterplot(x=[x[midp-1]], y=[y[midp-1]], s=150, ec="lime", fc="none", 
zorder=len(mlgdata)) - #print('\tGet regression...') - sns.regplot(x=x, y=y, scatter=False, color='k') - #print('\tGet decoration...') - plt.title('Moran Local Scatterplot\nr: {:4.2f} p-value: {:.1e}'.format(r_value, p_value)) - plt.axvline(x=0, color='k', linestyle=':') - plt.axhline(y=0, color='k', linestyle=':') - ax.set_xlabel('Z-score(Signal)') - ax.set_ylabel('Z-score(Signal Spatial Lag)') - sns.despine(top=True, right=True, left=False, bottom=False, offset=10, trim=False) - plt.xlim(min(x)-0.5,max(x)+0.5) - plt.ylim(min(y)-0.5,max(y)+0.5) - #print('\tReturning...') - return(plt) - -def SignalPlotBed(mlgdata,quartiles,minpv,region,midp): - # Get Bed file - chrm,cini,cend = Region2Parts(region) - reso = (cend-cini)/len(mlgdata) - coords= mlgdata[['X', 'Y']].values.tolist() - dists = distance.cdist(coords, coords, 'euclidean') - midds = dists[midp-1] - mdist = dists.diagonal(1).mean() - influence = 1.5 - bfact = 2 - buffer = mdist*influence - close = midds[midds<=buffer*bfact] - selsebins = [] - selsebins = mlgdata[(mlgdata.LMIq.isin(quartiles)) & (mlgdata.LMIpval <= minpv)].Index.values.tolist() - mls = mlgdata[mlgdata.Index.isin(selsebins)].unary_union - if (mls): - if mls.geom_type == 'Polygon': - mls = MultiPolygon([mls]) - x,y = mlgdata[mlgdata.Index==midp].X,mlgdata[mlgdata.Index==midp].Y - s = Point((x,y)) - selmetaloci = [] - beddata = pd.DataFrame(columns=['chr','start','end','bin']) -# print(mls) -# for i, ml in enumerate(mls): -# ml = gpd.GeoSeries(ml) -# if (s.within(ml[0])): -# if (silent==False): -# ax = ml.plot() -# sns.scatterplot(x=x,y=y, color='red', ax=ax) -# plt.show() -# plt.close() -# for j,row in mlgdata.iterrows(): -# p,x,y = row.Index,row.X,row.Y -# s2 = Point((x,y)) -# if (s2.within(ml[0])): -# selmetaloci.append(p) -# # Add close particles -# selmetaloci.sort() -# closebins = [i for i, val in enumerate(midds) if val<=buffer*bfact] -# selmetaloci = np.sort(list(set(closebins + selmetaloci))) -# print('\tGetting BED...') -# for p in selmetaloci: -# pini = int(cini+(p*reso)) -# pend = int(pini+reso-1) -# bed = '{}\t{}\t{}\t{}'.format(chrm,pini,pend,p) -# row = {'chr': chrm, -# 'start': pini, -# 'end': pend, -# 'bin': p} -# beddata = beddata.append(row, ignore_index=True) - # Signal plot - print('\tSignal profile...') - fig = plt.figure(figsize=(10, 1)) - x = mlgdata.Index - y = mlgdata.Signal - g = sns.lineplot(x=x,y=y) - for p in selmetaloci: - plt.axvline(x=p, color='red', linestyle=':', lw=0.5) - plt.axvline(x=midp, color='lime', linestyle='--', lw=0.5) - bins = list(range(50,len(mlgdata),int((len(mlgdata)+52)/6))) - coords = ['{:,}'.format(int(cini+x*reso)) for x in bins] - plt.xticks(bins,coords) - #g.yaxis.set_major_locator(MaxNLocator(integer=True)) - g.yaxis.set_major_formatter(FormatStrFormatter('%.1f')) - plt.xlabel('chromosome {}'.format(chrm)) - g.margins(x=0) - sns.despine(top=True, right=True, left=False, bottom=True, offset=None, trim=False) - return(plt,beddata) - -def PlaceImage(new,ifile,ifactor,ixloc,iyloc): - img = Image.open(ifile) - niz = tuple([int(i*ifactor) for i in img.size]) - img = img.resize(niz, Image.ANTIALIAS) - new.paste(img, (ixloc,iyloc)) - return(new) - -def MatTransform(mat): - if (silent == False): - print('Matrix data...') - print('\tMin: {}, Max: {}'.format(np.min(mat), np.max(mat))) - - diag = np.array(mat.diagonal()) - - for i in range(0, len(diag)): - if (diag[i] == 0): - - mat[i] = 0 - mat[:, i] = 0 - - # Pseudocounts if min is zero - if (np.min(mat) == 0): - pc = np.min(mat[mat>0]) - if (silent == False): - 
print('Pseudocounts: {}'.format(pc)) - mat = mat+pc - - # Scale if all below 1 - if (np.max(mat)<=1 or np.min(mat)<=1): ## Why np.min? - sf = 1/np.min(mat) - if (silent == False): - print('Scaling factor: {}'.format(sf)) - mat = mat*sf - - if (np.min(mat)<1): - mat[mat<1] = 1 - - if (silent == False): - print('\tMin: {}, Max: {}'.format(np.min(mat), np.max(mat))) - - # Log10 the data - if (silent == False): - print('Log10 matrix...') - mat = np.log10(mat) - if (silent == False): - print('\tMin: {}, Max: {}'.format(np.min(mat), np.max(mat))) - - # Mean of the data - if (silent == False): - print('Mean of non zero...') - me = mat[mat>1].mean() - if (silent == False): - print('\tMean: {}'.format(me)) - print() - - return(mat) - -def AnaML(work_dir,gene,midp,signal,mlname,mlfile,hicfile,minpv,quartiles,silent,rmfiles): - - # Create directory if does not exist - if not os.path.exists(work_dir+"/mls/figures/"+mlname): - os.makedirs(work_dir+"/mls/figures/"+mlname) - else: - print('\tFolder already exists... overwriting...') - - # Get data - print('\tReading data file: {}'.format(mlfile)) - mlgdata = GetMLData(mlfile) - if (silent == False): - print(mlgdata.head()) - - # Define region coordinates - chrm = mlgdata.Chr[0] - cini = int(mlgdata.Start[0]) - cend = int(mlgdata.End[0]) - region = mlgdata.Region[0] - print('\tWorking on region: {}'.format(region)) - - # Hi-C data and plot - print('\tHi-C matrix...') - hic_plt = GetHiCPlot(hicfile,region,midp,signal) - hic_pdf = '{0}/mls/figures/{1}/{1}_hic.pdf'.format(work_dir,mlname) - hic_png = '{0}/mls/figures/{1}/{1}_hic.png'.format(work_dir,mlname) - hic_plt.savefig(hic_pdf, bbox_inches='tight', transparent=True) - hic_plt.savefig(hic_png, bbox_inches='tight', dpi=300, transparent=True) - if (silent == False): - hic_plt.show() - hic_plt.close() - - # Kamada-Kawai plot - print('\tKamada-Kaway layout...') - kk_plt = GetKKPlot(mlgdata,region,midp) - kk_pdf = '{0}/mls/figures/{1}/{1}_kk.pdf'.format(work_dir,mlname) - kk_png = '{0}/mls/figures/{1}/{1}_kk.png'.format(work_dir,mlname) - kk_plt.savefig(kk_pdf, bbox_inches='tight', transparent=True) - kk_plt.savefig(kk_png, bbox_inches='tight', dpi=300, transparent=True) - if (silent == False): - kk_plt.show() - kk_plt.close() - - # Gaudi plot Signal - print('\tGaudi signal...') - gs_plt = GaudiSignalPlot(mlgdata,region,midp,None,None) - gs_pdf = '{0}/mls/figures/{1}/{1}_ml_signal.pdf'.format(work_dir,mlname) - gs_png = '{0}/mls/figures/{1}/{1}_ml_signal.png'.format(work_dir,mlname) - gs_plt.savefig(gs_pdf, bbox_inches='tight', transparent=True) - gs_plt.savefig(gs_png, bbox_inches='tight', dpi=300, transparent=True) - if (silent == False): - gs_plt.show() - gs_plt.close() - - # Gaudi plot Type - print('\tGaudi type...') - gt_plt = GaudiTypePlot(mlgdata,quartiles,minpv,region,midp) - gt_pdf = '{0}/mls/figures/{1}/{1}_type.pdf'.format(work_dir,mlname) - gt_png = '{0}/mls/figures/{1}/{1}_type.png'.format(work_dir,mlname) - gt_plt.savefig(gt_pdf, bbox_inches='tight', transparent=True) - gt_plt.savefig(gt_png, bbox_inches='tight', dpi=300, transparent=True) - if (silent == False): - gt_plt.show() - gt_plt.close() - - # Local Moran I scatter plot - print('\tLocal Moran I scatter...') - mli_plt = MLIScatterPlot(mlgdata,quartiles,minpv,region,midp) - #print('\tReturned...') - mli_pdf = '{0}/mls/figures/{1}/{1}_mli.pdf'.format(work_dir,mlname) - mli_png = '{0}/mls/figures/{1}/{1}_mli.png'.format(work_dir,mlname) - mli_plt.savefig(mli_pdf, bbox_inches='tight', transparent=True) - mli_plt.savefig(mli_png, 
bbox_inches='tight', dpi=300, transparent=True) - #print('\tSaved...') - if (silent == False): - mli_plt.show() - mli_plt.close() - #print('\tClosed..') - - # Signal Plot and Bed file - print('\tSignal plot and bed file (if relevant)...') - s_plt,bed_data = SignalPlotBed(mlgdata,quartiles,minpv,region,midp) - s_pdf = '{0}/mls/figures/{1}/{1}_signal.pdf'.format(work_dir,mlname) - s_png = '{0}/mls/figures/{1}/{1}_signal.png'.format(work_dir,mlname) - if (len(bed_data)): - print('\t\tSaving bed data with {} bins...'.format(len(bed_data))) - b_file= '{0}/mls/figures/{1}/{1}.bed'.format(work_dir,mlname) - bed_data.to_csv(b_file, index=False, sep='\t') - s_plt.savefig(s_pdf, bbox_inches='tight', transparent=True) - s_plt.savefig(s_png, bbox_inches='tight', dpi=300, transparent=True) - if (silent == False): - s_plt.show() - s_plt.close() - - # ALL TOGETHER - print('\tFinal composite figure:') - img1 = Image.open(mli_png) - img2 = Image.open(gs_png) - img3 = Image.open(gt_png) - maxx = int((img1.size[1]*0.4+img2.size[1]*0.25+img3.size[1]*0.25)*1.3) - new = Image.new("RGBA", (maxx,1550)) - # HiC image - new = PlaceImage(new,hic_png,0.5,100,50) - # Singal image - new = PlaceImage(new,s_png,0.4,42,660) - # KK image - new = PlaceImage(new,kk_png,0.3,1300,50) - # MLI scatter image - new = PlaceImage(new,mli_png,0.4,75,900) - # Gaudi signal image - new = PlaceImage(new,gs_png,0.25,900,900) - # Gaudi signi image - new = PlaceImage(new,gt_png,0.25,1600,900) - - # Remove files? - if (rmfiles): - for rf in rmfiles: - folderfiles = os.listdir(work_dir+"/mls/figures/"+mlname) - for item in folderfiles: - if item.endswith(".{}".format(rf)): - os.remove(os.path.join(work_dir+"/mls/figures/"+mlname, item)) - - # Save image -# if (silent == False): - fig = plt.figure(figsize=(15, 15)) - plt.imshow(new) - plt.axis('off') - plt.show() - plt.close() - print('\tFinal figure saved here: {0}/mls/figures/{1}/{1}.png'.format(work_dir,mlname)) - new.save('{0}/mls/figures/{1}/{1}.png'.format(work_dir,mlname)) - - # Return composite image - return(new) - -##################### -### START THE ACTION! 
-##################### - -# Regions to work on -reglst = pd.read_csv(rfile, sep='\t') - -# Create plots≤ -start_time = time.time() -for i, region in reglst.iterrows(): - for signal in signals: - print ('>>>> {}\t{}\t{} [{}/{}]'.format(region.id, run, signal, i+1, reglst.shape[0])) - (chrm,start,end,midp) = re.split(':|_|-',region.coords) - start = int(start) - end = int(end) - midp = int(midp)#-1 # TO MAKE IT PYTHON BASED (0->N instead 1->N) - gene = region.id - linux_region = '{}_{}_{}'.format(chrm,start,end) - mlname = '{}_{}_{}'.format(linux_region,reso,signal) - mlfile = '{}/mls/datas/{}/{}.tsv'.format(run,chrm,mlname) - runname = '{}_{}_{}'.format(mlname,gene,run) - print(gene,runname,mlfile,minpv,qs,silent,rm) - if os.path.isfile(mlfile): - mldata = pd.read_csv(mlfile,sep='\t') - #grep_com = "ls -1v "+cooler_dir+" | grep \".mcool\"" - #hicfile = cooler_dir+subprocess.getoutput(grep_com)+"::/resolutions/"+str(reso) - grep_com = "ls -1v "+cooler_dir+" | grep mcool" - grep_res = subprocess.getoutput(grep_com) - #print("GC>>>>>>>>>>>", grep_res) - if (len(grep_res.split()) > 1): # Multiple chromosomes files in folder - grep_com = "ls -1v "+cooler_dir+" | grep mcool | grep _"+chrm+"_" - grep_res = subprocess.getoutput(grep_com) - #print("GP>>>>>>>>>>>", grep_res) - hicfile = cooler_dir+grep_res+"::/resolutions/"+str(reso) - print("Will use {} as cool file...".format(hicfile)) - if os.path.isdir(work_dir+"/mls/figures/"+runname) and force==False: - print('WARNING! {} ALREADY DONE! If you want to recreate it, use -f or --force ...'.format(work_dir+"/mls/figures/"+runname)) - else: - print('Generating {}'.format(work_dir+"/mls/figures/"+runname)) - figure = AnaML(work_dir,gene,midp,signal,runname,mlfile,hicfile,minpv,qs,silent,rm) - else: - print('ERROR! File {} does not exists. Cannot produce the figure...'.format(mlfile)) - continue - print ('>>>>\n\n') -print("--- {} ---\n".format(str(datetime.timedelta(seconds=(time.time() - start_time))))) diff --git a/metaloci/tools/layout.py b/metaloci/tools/layout.py index 787be4a..66c5a14 100644 --- a/metaloci/tools/layout.py +++ b/metaloci/tools/layout.py @@ -2,12 +2,12 @@ import os import pathlib import re -import sys import warnings -from argparse import SUPPRESS, ArgumentParser, RawDescriptionHelpFormatter +from argparse import SUPPRESS, HelpFormatter from collections import defaultdict from datetime import timedelta from time import time +import subprocess as sp import cooler import hicstraw @@ -16,192 +16,204 @@ import pandas as pd from scipy.sparse import csr_matrix from scipy.spatial import distance -from tqdm.contrib.concurrent import process_map from metaloci import mlo from metaloci.graph_layout import kk from metaloci.misc import misc from metaloci.plot import plot - +import pickle warnings.filterwarnings("ignore", category=RuntimeWarning) -description = """ +DESCRIPTION = """ Creates a Kamada-Kawai layout from a Hi-C for a given region. 
""" -parser = ArgumentParser(formatter_class=RawDescriptionHelpFormatter, description=description, add_help=False) - -input_arg = parser.add_argument_group(title="Input arguments") - -input_arg.add_argument( - "-w", - "--work-dir", - dest="work_dir", - metavar="PATH", - type=str, - required=True, - help="Path to working directory.", -) - -input_arg.add_argument( - "-c", - "--hic", - dest="hic_file", - metavar="PATH", - type=str, - required=True, - help="Complete path to the cooler file.", -) - -input_arg.add_argument( - "-r", - "--resolution", - dest="reso", - metavar="INT", - type=int, - required=True, - help="Resolution of the HI-C files to be used (in bp).", -) - -input_arg.add_argument( - "-g", - "--region", - dest="regions", - metavar="PATH", - type=str, - help="Region to apply LMI in format chrN:start-end_midpoint or file with the regions of interest. If a file is provided, " - "it must contain as a header 'coords', 'symbol' and 'id', and one region per line, tab separated.", -) - -optional_arg = parser.add_argument_group(title="Optional arguments") -optional_arg.add_argument("-h", "--help", action="help", help="Show this help message and exit.") -optional_arg.add_argument( - "-o", - "--cutoff", - dest="cutoff", - nargs="*", - type=float, - action="extend", - help="Percentage of top interactions to keep, space separated (default: 0.2)", -) - -optional_arg.add_argument("-f", "--force", dest="force", action="store_true", help="force rewriting existing data.") - -optional_arg.add_argument( - "-p", - "--plot", - dest="save_plots", - action="store_true", - help="Plot the matrix, density and Kamada-Kawai plots, even when a single cutoff is selected.", -) - -optional_arg.add_argument( - "-m", - "--mp", - dest="multiprocess", - action="store_true", - help="Flag to set use of multiprocessing.", -) - -optional_arg.add_argument( - "-t", "--threads", dest="threads", type=int, action="store", help="Number of threads to use in multiprocessing." -) - -optional_arg.add_argument( - "-l", - "--pl", - dest="persistence_length", - action="store", - help="Set a persistence length for the Kamada-Kawai layout.", -) -# TODO add sort of silent argument? 
-
-optional_arg.add_argument("-u", "--debug", dest="debug", action="store_true", help=SUPPRESS)
-
-args = parser.parse_args(None if sys.argv[1:] else ["-h"])
-
-work_dir = args.work_dir
-hic_path = args.hic_file
-regions = args.regions
-resolution = args.reso
-cutoffs = args.cutoff
-force = args.force
-save_plots = args.save_plots
-multiprocess = args.multiprocess
-cores = args.threads
-persistence_length = args.persistence_length
-debug = args.debug
-
-if not work_dir.endswith("/"):
-
- work_dir += "/"
-
-if multiprocess is None:
-
- multiprocess = False
-
-if cores is None:
-
- cores = mp.cpu_count() - 2
-
-if cores > mp.cpu_count():
-
- cores = mp.cpu_count()
-
-if cutoffs is None:
-
- cutoffs = [0.2]
-
-# To sort the floats from the cutoffs
-cutoffs.sort(key=float, reverse=True)
-
-if debug:
-
- table = [
- ["work_dir", work_dir],
- ["hic_file", hic_path],
- ["region_file", regions],
- ["reso", resolution],
- ["cutoffs", cutoffs],
- ["force", force],
- ["debug", debug],
- ]
- print(table)
- sys.exit()
+def populate_args(parser):
+
+ parser.formatter_class=lambda prog: HelpFormatter(prog, width=120,
+ max_help_position=60)
+
+ input_arg = parser.add_argument_group(title="Input arguments")
+ optional_arg = parser.add_argument_group(title="Optional arguments")
+
+ input_arg.add_argument(
+ "-w",
+ "--work-dir",
+ dest="work_dir",
+ metavar="PATH",
+ type=str,
+ required=True,
+ help="Path to working directory.",
+ )
+
+ input_arg.add_argument(
+ "-c",
+ "--hic",
+ dest="hic_file",
+ metavar="PATH",
+ type=str,
+ required=True,
+ help="Complete path to the cooler file.",
+ )
+
+ input_arg.add_argument(
+ "-r",
+ "--resolution",
+ dest="reso",
+ metavar="INT",
+ type=int,
+ required=True,
+ help="Resolution of the Hi-C files to be used (in bp).",
+ )
+
+ input_arg.add_argument(
+ "-g",
+ "--region",
+ dest="regions",
+ metavar="PATH",
+ type=str,
+ required=True,
+ help="Region to apply LMI in format chrN:start-end_midpoint or file with the regions of interest. If a file is provided, "
+ "it must contain as a header 'coords', 'symbol' and 'id', and one region per line, tab separated.",
+ )
+
+ optional_arg.add_argument(
+ "-o",
+ "--cutoff",
+ dest="cutoff",
+ nargs="*",
+ type=float,
+ action="extend",
+ help="Fraction of top interactions to keep, space separated (default: 0.2)",
+ )
+
+ optional_arg.add_argument("-f", "--force", dest="force", action="store_true", help="Force rewriting existing data.")
+
+ optional_arg.add_argument(
+ "-p",
+ "--plot",
+ dest="save_plots",
+ action="store_true",
+ help="Plot the matrix, density and Kamada-Kawai plots, even when a single cutoff is selected.",
+ )
+
+ optional_arg.add_argument(
+ "-m",
+ "--mp",
+ dest="multiprocess",
+ action="store_true",
+ help="Flag to set use of multiprocessing.",
+ )
+
+ optional_arg.add_argument(
+ "-t", "--threads", dest="threads", type=int, action="store", help="Number of threads to use in multiprocessing."
+ ) + + optional_arg.add_argument( + "-l", + "--pl", + dest="persistence_length", + action="store", + help="Set a persistence length for the Kamada-Kawai layout.", + ) + + optional_arg.add_argument("-h", "--help", action="help", help="Show this help message and exit.") + + +def get_region_layout(row, opts, progress=None, silent: bool = True): + + work_dir = opts.work_dir + regions = opts.regions + hic_path = opts.hic_file + resolution = opts.reso + cutoffs = opts.cutoff + force = opts.force + save_plots = opts.save_plots + persistence_length = opts.persistence_length + + if not work_dir.endswith("/"): + + work_dir += "/" + + if cutoffs is None: + + cutoffs = [0.2] + + if os.path.isfile(regions): + + df_regions = pd.read_table(regions) + + else: + df_regions = pd.DataFrame({"coords": [regions], "symbol": ["symbol"], "id": ["id"]}) -def get_region_layout(row, silent: bool = True): + cutoffs.sort(key=float, reverse=True) region_chrom, region_start, region_end, poi = re.split(":|-|_", row.coords) region_coords = f"{region_chrom}_{region_start}_{region_end}_{poi}" - pathlib.Path(os.path.join(work_dir, region_chrom), "plots", "KK").mkdir(parents=True, exist_ok=True) + pathlib.Path(os.path.join(work_dir, region_chrom)).mkdir(parents=True, exist_ok=True) save_path = os.path.join(work_dir, region_chrom, f"{region_coords}.mlo") - if not os.path.isfile(save_path): + # if not os.path.isfile(save_path): - mlobject = mlo.MetalociObject( - f"{region_chrom}:{region_start}-{region_end}", - str(region_chrom), - int(region_start), - int(region_end), - resolution, - int(poi), - persistence_length, - save_path, - ) - - elif ( - region_coords in bad_regions["region"] - and bad_regions["reason"][bad_regions["region"].index(region_coords)] == "empty" - ): + # mlobject = mlo.MetalociObject( + # f"{region_chrom}:{region_start}-{region_end}", + # str(region_chrom), + # int(region_start), + # int(region_end), + # resolution, + # int(poi), + # persistence_length, + # save_path, + # ) + + # elif mlobject.bad_region == "empty": - if silent == False: + # if silent == False: - print(f"\n------> Region {region_coords} already done (no data).") + # print(f"\n------> Region {region_coords} already done (Hi-C empty in that region).") - return + # if progress is not None: progress["done"] = True - else: + # return + + # else: + + # if silent == False: + + # print(f"\n------> Region {region_coords} already done.") + + # if progress is not None: progress["done"] = True + + # if force: + + # if silent == False: + + # print( + # "\tForce option (-f) selected, recalculating " + # "the Kamada-Kawai layout (files will be overwritten)" + # ) + + # mlobject = mlo.MetalociObject( + # f"{region_chrom}:{region_start}-{region_end}", + # str(region_chrom), + # int(region_start), + # int(region_end), + # resolution, + # int(poi), + # persistence_length, + # save_path, + # ) + + # else: + + # return + + if os.path.isfile(save_path): + + with open(save_path, "rb") as mlobject_handler: + + mlobject = pickle.load(mlobject_handler) if silent == False: @@ -213,10 +225,57 @@ def get_region_layout(row, silent: bool = True): print( "\tForce option (-f) selected, recalculating " - "the Kamada-Kawai layout (files will be overwritten)\n" + "the Kamada-Kawai layout (files will be overwritten)" + ) + + os.remove(f"{save_path}") + + else: + + if save_plots: + + if silent == False: print(f"\tPlotting Kamada-Kawai...") + + pathlib.Path(os.path.join(work_dir, region_chrom), "plots", "KK").mkdir(parents=True, exist_ok=True) + 
pathlib.Path(os.path.join(work_dir, region_chrom), "plots", "mixed_matrices").mkdir( + parents=True, exist_ok=True + ) + + plot.get_kk_plot(mlobject).savefig(os.path.join( + work_dir, + region_chrom, + f"plots/KK/{region_coords}_" f"{mlobject.kk_cutoff}_KK.pdf", + ), + dpi=300) + + plt.close() + + plot.mixed_matrices_plot(mlobject).savefig( + os.path.join( + work_dir, + region_chrom, + f"plots/mixed_matrices/{region_coords}_" f"{mlobject.kk_cutoff}_mixed-matrices.pdf", + ), + dpi=300, ) - mlobject = mlo.MetalociObject( + plt.close() + + if progress is not None: progress["plots"] = True + + if progress is not None: progress["done"] = True + + if mlobject.bad_region == "empty": + + if silent == False: + + print(f"\n------> Region {region_coords} already done (Hi-C empty in that region).") + + if progress is not None: progress["done"] = True + + if not os.path.isfile(save_path): + + mlobject = mlo.MetalociObject( f"{region_chrom}:{region_start}-{region_end}", str(region_chrom), int(region_start), @@ -227,187 +286,187 @@ def get_region_layout(row, silent: bool = True): save_path, ) - else: + if silent == False: + print(f"\n------> Working on region: {mlobject.region}\n") - return + if hic_path.endswith(".cool") or hic_path.endswith(".mcool"): - if silent == False: - print(f"\n------> Working on region: {mlobject.region}\n") + mlobject.matrix = cooler.Cooler(hic_path + "::/resolutions/" + str(mlobject.resolution)).matrix(sparse=True).fetch(mlobject.region).toarray() - if hic_path.endswith(".cool") or hic_path.endswith(".mcool"): + elif hic_path.endswith(".hic"): - mlobject.matrix = cooler.Cooler(hic_path + "::/resolutions/" + str(mlobject.resolution)).matrix(sparse=True).fetch(mlobject.region).toarray() + chrm, start, end = re.split(':|-', mlobject.region) + mlobject.matrix = hicstraw.HiCFile(hic_path).getMatrixZoomData(chrm, chrm, 'observed', 'VC_SQRT', 'BP', mlobject.resolution).getRecordsAsMatrix(int(start), int(end), int(start), int(end)) + + mlobject = misc.clean_matrix(mlobject) - elif hic_path.endswith(".hic"): + # This if statement is for detecting empty arrays. If the array is too empty, + # clean_matrix() will return mlobject.matrix as None. + if mlobject.matrix is None: - chrm, start, end = re.split(':|-', mlobject.region) - mlobject.matrix = hicstraw.HiCFile(hic_path).getMatrixZoomData(chrm, chrm, 'observed', 'VC_SQRT', 'BP', mlobject.resolution).getRecordsAsMatrix(int(start), int(end), int(start), int(end)) - - mlobject.matrix = misc.clean_matrix(mlobject, bad_regions) + return - # This if statement is for detecting empty arrays. If the array is too empty, - # clean_matrix() would return mlobject.matrix as None. 
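# The two Hi-C readers above differ only in the fetch API; a self-contained
# sketch of the same logic, assuming a multi-resolution .mcool or .hic file and
# a region string like "chr1:1000000-5000000" (the helper name and example
# values are illustrative, not part of the patch):
import re

import cooler
import hicstraw


def fetch_region_matrix(hic_path, region, resolution):

    if hic_path.endswith((".cool", ".mcool")):

        # cooler keeps multi-resolution matrices under ::/resolutions/<bp>.
        return cooler.Cooler(f"{hic_path}::/resolutions/{resolution}").matrix(sparse=True).fetch(region).toarray()

    if hic_path.endswith(".hic"):

        chrom, start, end = re.split(":|-", region)

        # hicstraw returns a dense, VC_SQRT-normalised observed matrix.
        return hicstraw.HiCFile(hic_path).getMatrixZoomData(
            chrom, chrom, "observed", "VC_SQRT", "BP", resolution
        ).getRecordsAsMatrix(int(start), int(end), int(start), int(end))

    raise ValueError(f"Unsupported Hi-C format: {hic_path}")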
- if mlobject.matrix is None:
+ time_per_region = time()
- return
+ for cutoff in cutoffs:
- for cutoff in cutoffs:
+ mlobject.kk_cutoff = cutoff
- time_per_kk = time()
+ if silent == False:
+ print(f"\tCutoff: {int(mlobject.kk_cutoff * 100)} %")
- mlobject.kk_cutoff = cutoff
+ # Get submatrix of restraints
+ restraints_matrix, mlobject = kk.get_restraints_matrix(mlobject)
- if silent == False:
- print(f"\tCutoff to be used is: {int(mlobject.kk_cutoff * 100)} %")
- # Get submatrix of restraints
- restraints_matrix, mlobject = kk.get_restraints_matrix(mlobject)
+ if silent == False:
- if save_plots:
+ print("\tComputing the Kamada-Kawai layout...")
- mixed_matrices_plot = plot.mixed_matrices_plot(mlobject)
+ mlobject.kk_graph = nx.from_scipy_sparse_array(csr_matrix(restraints_matrix))
+ mlobject.kk_nodes = nx.kamada_kawai_layout(mlobject.kk_graph)
+ mlobject.kk_coords = list(mlobject.kk_nodes.values())
+ mlobject.kk_distances = distance.cdist(mlobject.kk_coords, mlobject.kk_coords, "euclidean")
- pathlib.Path(os.path.join(work_dir, region_chrom), "plots", "mixed_matrices").mkdir(
- parents=True, exist_ok=True
- )
+ if len(cutoffs) > 1 or save_plots:
+
+ if silent == False: print(f"\tPlotting Kamada-Kawai...")
+
+ pathlib.Path(os.path.join(work_dir, region_chrom), "plots", "KK").mkdir(parents=True, exist_ok=True)
+ pathlib.Path(os.path.join(work_dir, region_chrom), "plots", "mixed_matrices").mkdir(
+ parents=True, exist_ok=True
+ )
- mixed_matrices_plot.savefig(
- os.path.join(
+ plot.get_kk_plot(mlobject).savefig(os.path.join(
work_dir,
region_chrom,
- f"plots/mixed_matrices/{region_coords}_" f"{mlobject.kk_cutoff}_mixed-matrices.pdf",
+ f"plots/KK/{region_coords}_" f"{mlobject.kk_cutoff}_KK.pdf",
),
- dpi=300,
- )
+ dpi=300)
- plt.close()
+ plt.close()
- if silent == False:
+ plot.mixed_matrices_plot(mlobject).savefig(
+ os.path.join(
+ work_dir,
+ region_chrom,
+ f"plots/mixed_matrices/{region_coords}_" f"{mlobject.kk_cutoff}_mixed-matrices.pdf",
+ ),
+ dpi=300,
+ )
- print("\tLayouting Kamada-Kawai...")
+ plt.close()
- mlobject.kk_graph = nx.from_scipy_sparse_array(csr_matrix(restraints_matrix))
- mlobject.kk_nodes = nx.kamada_kawai_layout(mlobject.kk_graph)
- mlobject.kk_coords = list(mlobject.kk_nodes.values())
- mlobject.kk_distances = distance.cdist(mlobject.kk_coords, mlobject.kk_coords, "euclidean")
+ if progress is not None: progress["plots"] = True
- if len(cutoffs) > 1 or save_plots:
+ if silent == False:
- kk_plt = plot.get_kk_plot(mlobject)
+ print(f"\tdone in {timedelta(seconds=round(time() - time_per_region))}.\n")
- fig_name = os.path.join(
- work_dir,
- region_chrom,
- f"plots/KK/{region_coords}_" f"{mlobject.kk_cutoff}_KK.pdf",
- )
+ if len(cutoffs) == 1:
- kk_plt.savefig(fig_name, dpi=300)
- plt.close()
+ if silent == False:
+ print(
+ f"\tKamada-Kawai layout of region '{mlobject.region}' "
+ f"at {int(cutoff * 100)} % cutoff saved to file: '{mlobject.save_path}'"
+ )
- if silent == False:
+ # Write to file a list of bad regions, according to the filters defined in clean_matrix().
+ with open(f"{work_dir}bad_regions.txt", "a+") as handler:
- print(f"\tdone in {timedelta(seconds=round(time() - time_per_kk))}.")
+ log = f"{mlobject.region}\t{mlobject.bad_region}\n"
- if len(cutoffs) == 1:
+ handler.seek(0)
- if silent == False:
- print(
- f"\tKamada-Kawai layout of region {mlobject.region} saved"
- f" at {int(cutoff * 100)} % cutoff to file: {mlobject.save_path}"
- )
+ if not any(log in line for line in handler) and mlobject.bad_region != None:
- # Save mlobject.
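# The layout step above, reduced to its essentials: the restraints matrix is
# read as a weighted graph, Kamada-Kawai embeds it in 2D, and the pairwise
# particle distances are kept for the later LMI stage. A toy example with a
# random symmetric matrix standing in for the real restraints (illustrative only):
import networkx as nx
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import distance

rng = np.random.default_rng(0)
upper = np.triu(rng.random((20, 20)), k=1)
restraints = upper + upper.T  # symmetric weights, zero diagonal

kk_graph = nx.from_scipy_sparse_array(csr_matrix(restraints))
kk_nodes = nx.kamada_kawai_layout(kk_graph)  # dict: node -> (x, y)
kk_coords = list(kk_nodes.values())
kk_distances = distance.cdist(kk_coords, kk_coords, "euclidean")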
- with open(mlobject.save_path, "wb") as hamlo_namendle: + handler.write(log) - mlobject.save(hamlo_namendle) + # Save mlobject. + with open(mlobject.save_path, "wb") as hamlo_namendle: - # with open(f"{work_dir}bad_regions.txt", "a+") as handler: + mlobject.save(hamlo_namendle) - # for region, reason in bad_regions.items(): + if progress is not None: - # if not any(f"{region}\t{reason[0]}" in line for line in handler): + progress['value'] += 1 - # handler.write(f"{region}\t{reason[0]}\n") - # handler.flush() + time_spent = time() - progress['timer'] + time_remaining = int(time_spent / progress['value'] * (len(df_regions) - progress['value'])) - # with open(f"{work_dir}bad_regions.txt", "a+") as handler: + print(f"\033[A{' '*int(sp.Popen(['tput','cols'], stdout=sp.PIPE).communicate()[0].strip())}\033[A") + print(f"\t[{progress['value']}/{len(df_regions)}] | Time spent: {timedelta(seconds=round(time_spent))} | " + f"ETR: {timedelta(seconds=round(time_remaining))}", end='\r') - # lines = [line.rstrip("\n") for line in handler] - # for region, reason in bad_regions.items(): +def run(opts): - # if f"{region}\t{reason[0]}" not in lines: - # # inserts on top, elsewise use lines.append(name) to append at the end of the file. - # lines.insert(0, f"{region}\t{reason[0]}\n") + work_dir = opts.work_dir + regions = opts.regions + multiprocess = opts.multiprocess + cores = opts.threads - # handler.seek(0) # move to first position in the file, to overwrite ! - # handler.write("\n".join(lines)) - # handler.flush() + if not work_dir.endswith("/"): - # with open(f"{work_dir}bad_regions.txt", "a+") as handler: + work_dir += "/" - # x = handler.readlines() # reading all the lines in a list, no worry about '\n' + if multiprocess is None: - # for region, reason in bad_regions.items(): + multiprocess = False - # if f"{region}\t{reason[0]}" not in x: - # # if word not in file then write the word to the file - # handler.write(f"{region}\t{reason[0]}\n") - # handler.flush() + if cores is None: + cores = mp.cpu_count() - 2 -# Computing + elif cores > mp.cpu_count(): -start_timer = time() + cores = mp.cpu_count() -if os.path.isfile(regions): + start_timer = time() - df_regions = pd.read_table(regions) + if os.path.isfile(regions): -else: + df_regions = pd.read_table(regions) - df_regions = pd.DataFrame({"coords": [regions], "symbol": ["symbol"], "id": ["id"]}) - -pathlib.Path(os.path.join(work_dir)).mkdir(parents=True, exist_ok=True) + else: -bad_regions = defaultdict(list) + df_regions = pd.DataFrame({"coords": [regions], "symbol": ["symbol"], "id": ["id"]}) -if multiprocess: + pathlib.Path(os.path.join(work_dir)).mkdir(parents=True, exist_ok=True) - if __name__ == "__main__": + if multiprocess: - print(f"{len(df_regions)} regions will be computed.\n") + print(f"\n------> {len(df_regions)} regions will be computed.\n") try: - pool = mp.Pool(processes=cores) - process_map(get_region_layout, [row for _, row in df_regions.iterrows()], max_workers=cores, chunksize=1) - pool.close() - pool.join() + manager = mp.Manager() + progress = manager.dict(value=0, timer=start_timer, done=False, plots=False) + + with mp.Pool(processes=cores) as pool: + + pool.starmap(get_region_layout, [(row, opts, progress) for _, row in df_regions.iterrows()]) - except KeyboardInterrupt: + if progress["done"] == True: - pool.terminate() + print("\tSome regions had already been computed and have been skipped.", end="") -else: + if progress["plots"] == True: - for _, row in df_regions.iterrows(): + print(f"\n\tKamada-Kawai layout plots saved to 
{work_dir}chr/plots/KK.", end="") - get_region_layout(row, silent=False) + pool.close() + pool.join() -# If there is bad regions, write to a file which is the bad region and why, -# but only if that region-reason pair does not already exist in the file. -# bad_regions = pd.DataFrame(bad_regions) + except KeyboardInterrupt: -# if bad_regions.shape[0] > 0: + pool.terminate() -# with open(f"{work_dir}bad_regions.txt", "a+") as handler: + else: -# [ -# handler.write(f"{row.region}\t{row.reason}\n") -# for _, row in bad_regions.iterrows() -# if not any(f"{row.region}\t{row.reason}" in line for line in handler) -# ] + for _, row in df_regions.iterrows(): -print(f"\nTotal time spent: {timedelta(seconds=round(time() - start_timer))}.") -print("\nall done.") + get_region_layout(row, opts, silent=False) + + print(f"\n\nTotal time spent: {timedelta(seconds=round(time() - start_timer))}.") + print("\nall done.") diff --git a/metaloci/tools/ml.py b/metaloci/tools/ml.py index 2d236c5..d76acf3 100644 --- a/metaloci/tools/ml.py +++ b/metaloci/tools/ml.py @@ -3,170 +3,131 @@ """ import multiprocessing as mp import os +import pathlib import pickle import re -import sys -from argparse import SUPPRESS, ArgumentParser, RawDescriptionHelpFormatter +import subprocess as sp +from argparse import HelpFormatter from datetime import timedelta from time import time import pandas as pd -from tqdm.contrib.concurrent import process_map from metaloci.spatial_stats import lmi -description = ( - "This script adds signal data to a Kamada-Kawai layout and calculates Local Moran's I " - "for every bin in the layout." +DESCRIPTION = ( + "Adds signal data to a Kamada-Kawai layout and calculates Local Moran's I for every bin in the layout." ) -parser = ArgumentParser(formatter_class=RawDescriptionHelpFormatter, description=description, add_help=False) - -input_arg = parser.add_argument_group(title="Input arguments") - -input_arg.add_argument( - "-w", - "--work-dir", - dest="work_dir", - metavar="PATH", - type=str, - required=True, - help="Path to working directory.", -) - -input_arg.add_argument( - "-s", - "--signal", - dest="signal_file", - metavar="FILE", - type=str, - required=True, - help="Path to the file with the samples/signals to use.", -) - -region_input = parser.add_argument_group(title="Region arguments", description="Choose one of the following options.") - -region_input.add_argument( - "-g", - "--region", - dest="region_file", - metavar="PATH", - type=str, - help="Region to apply LMI in format chrN:start-end_midpoint or file with the regions of interest.", -) - -signal_arg = parser.add_argument_group(title="Signal arguments", description="Choose one of the following options:") - -signal_arg.add_argument( - "-y", - "--types", - dest="signals", - metavar="STR", - type=str, - nargs="*", - action="extend", - help="Space-separated list of signals to plot.", -) - -signal_arg.add_argument( - "-Y", - "--types-file", - dest="signals", - metavar="PATH", - type=str, - help="Path to the file with the list of signals to plot, one per line.", -) - -optional_arg = parser.add_argument_group(title="Optional arguments") - -optional_arg.add_argument("-f", "--force", dest="force", action="store_true", help="force rewriting existing data.") - -optional_arg.add_argument("-h", "--help", action="help", help="Show this help message and exit.") - -optional_arg.add_argument( - "-c", - "--cores", - dest="num_cores", - default=1, - metavar="INT", - type=int, - help="Number of cores to use in the Local Moran's " "I calculation (default: 
%(default)d).", -) - -optional_arg.add_argument( - "-p", - "--permutations", - dest="perms", - default=9999, - metavar="INT", - type=int, - help="Number of permutations to calculate the Local Moran's I p-value " "(default: %(default)d).", -) - -optional_arg.add_argument( - "-v", - "--pval", - dest="signipval", - default=0.05, - metavar="FLOAT", - type=float, - help="P-value significance threshold (default: %(default)4.2f).", -) - -optional_arg.add_argument( - "-m", - "--mp", - dest="multiprocess", - action="store_true", - help="Flag to set use of multiprocessing.", -) - -optional_arg.add_argument( - "-t", "--threads", dest="threads", type=int, action="store", help="Number of threads to use in multiprocessing." -) - -optional_arg.add_argument("-u", "--debug", dest="debug", action="store_true", help=SUPPRESS) - -args = parser.parse_args(None if sys.argv[1:] else ["-h"]) - -work_dir = args.work_dir -regions = args.region_file -signal_file = args.signal_file -signals = args.signals -n_cores = args.num_cores -n_permutations = args.perms -signipval = args.signipval -multiprocess = args.multiprocess -cores = args.threads -force = args.force -debug = args.debug - -if not work_dir.endswith("/"): - - work_dir += "/" - -if debug: - - table = [ - ["work_dir", work_dir], - ["gene_file", regions], - ["ind_file", signal_file], - ["types", signals], - ["num_cores", n_cores], - ["perms", n_permutations], - ["debug", debug], - ["signipval", signipval], - ] - - print(table) - sys.exit() - -INFLUENCE = 1.5 -BFACT = 2 - - -def get_lmi(region_iter, i=None, silent: bool = True): +def populate_args(parser): + + parser.formatter_class=lambda prog: HelpFormatter(prog, width=120, + max_help_position=60) + + input_arg = parser.add_argument_group(title="Input arguments") + optional_arg = parser.add_argument_group(title="Optional arguments") + + input_arg.add_argument( + "-w", + "--work-dir", + dest="work_dir", + metavar="PATH", + type=str, + required=True, + help="Path to working directory.", + ) + + input_arg.add_argument( + "-s", + "--signal", + dest="signals", + metavar="FILE", + type=str, + required=True, + help="Path to the file with the samples/signals to use.", + ) + + input_arg.add_argument( + "-g", + "--region", + dest="region_file", + metavar="PATH", + type=str, + help="Region to apply LMI in format chrN:start-end_midpoint or file containing the regions of interest.", + ) + + optional_arg.add_argument("-f", "--force", dest="force", action="store_true", help="Force rewriting existing data.") + + optional_arg.add_argument( + "-p", + "--permutations", + dest="perms", + default=9999, + metavar="INT", + type=int, + help="Number of permutations to calculate the Local Moran's I p-value (default: %(default)d).", + ) + + optional_arg.add_argument( + "-v", + "--pval", + dest="signipval", + default=0.05, + metavar="FLOAT", + type=float, + help="P-value significance threshold (default: %(default)4.2f).", + ) + + optional_arg.add_argument( + "-i", + "--info", + dest="info", + action="store_true", + help="Flag to unpickle LMI info.", + ) + + optional_arg.add_argument( + "-m", + "--mp", + dest="multiprocess", + action="store_true", + help="Flag to set use of multiprocessing.", + ) + + optional_arg.add_argument( + "-t", "--threads", dest="threads", default=int(mp.cpu_count() / 3), type=int, action="store", help="Number" + " of threads to use in multiprocessing. Recommended value is one third of your total cpu count, although" + " increasing this number may improve performance in machines with few cores. 
(default: %(default)s)"
+ )
+
+ optional_arg.add_argument("-h", "--help", action="help", help="Show this help message and exit.")
+
+
+def get_lmi(region_iter, opts, signal_data, progress=None, i=None, silent: bool = True):
+
+ INFLUENCE = 1.5
+ BFACT = 2
+
+ work_dir = opts.work_dir
+ signals = opts.signals
+ regions = opts.region_file
+ n_permutations = opts.perms
+ signipval = opts.signipval
+ moran_info = opts.info
+ force = opts.force
+
+ if not work_dir.endswith("/"):
+
+ work_dir += "/"
+
+ if os.path.isfile(regions):
+
+ df_regions = pd.read_table(regions)
+
+ else:
+
+ df_regions = pd.DataFrame({"coords": [regions], "symbol": ["symbol"], "id": ["id"]})
region_timer = time()
@@ -174,7 +135,7 @@ def get_lmi(region_iter, i=None, silent: bool = True):
if silent == False:
- print(f"\n------> Working on region {region} [{i + 1}/{len(df_regions)}]\n")
+ print(f"\n------> Working on region {region} [{i+1}/{len(df_regions)}]\n")
try:
@@ -192,36 +153,64 @@ def get_lmi(region_iter, i=None, silent: bool = True):
return
- # If the pickle file contains the
if mlobject.kk_nodes is None:
if silent == False:
-
- print("Kamada-Kawai layout has not been calculated for this region. \n\tSkipping to next region...")
+ print("\tKamada-Kawai layout has not been calculated for this region. \n\tSkipping to next region...")
return
- if mlobject.lmi_geometry is not None and force == False:
+ # Load only signal for this specific region.
+ mlobject.signals_dict, signal_types = lmi.load_region_signals(mlobject, signal_data, signals)
- if silent == False:
+ # If the signal is not present in the signals folder (has not been processed with prep) but it is in
+ # the list of signals to process, raise an exception and print which signals need to be processed.
+ if mlobject.signals_dict == None and progress is not None:
- print("\tLMI already computed for this region. \n\tSkipping to next region...")
+ progress["missing_signal"] = signal_types
- return
+ raise Exception()
+
+ # This checks if every signal you want to process is already computed. If the user works with a few signals
+ # but decides to add some more later, they can use the same working directory and KK, and the LMI will be
+ # computed again with the new signals. If everything is already computed, nothing is done.
+ # TODO compute only the missing signals instead of all the signals again.
+ if mlobject.lmi_info is not None and force is False:
+
+ if [signal for signal, _ in mlobject.lmi_info.items()] == signal_types:
+
+ if silent == False:
+ print("\tLMI already computed for this region. \n\tSkipping to next region...")
+
+ if progress is not None: progress["done"] = True
+
+ if moran_info:
+
+ for signal, df in mlobject.lmi_info.items():
+
+ moran_log_path = f"{work_dir}{mlobject.chrom}/moran_log/{signal}"
+ pathlib.Path(moran_log_path).mkdir(parents=True, exist_ok=True)
+
+ df.to_csv(f"{moran_log_path}/{mlobject.region}_{mlobject.poi}_{signal}.txt", sep="\t", index=False)
+
+ if silent == False:
+ print(f"\tLog saved to: {moran_log_path}/{mlobject.region}_{mlobject.poi}_{signal}.txt")
+
+ return
# Get average distance between consecutive points to define influence,
# which should be ~2 particles of radius.
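# The influence radius derived just below (mean distance between consecutive
# particles, times INFLUENCE) is what bounds the spatial weights for Local
# Moran's I. A sketch of that step with libpysal/esda, assuming a GeoDataFrame
# gdf with one point per bin and a 'Signal' column (names illustrative; the
# real computation lives in metaloci.spatial_stats.lmi):
import libpysal as lp
from esda.moran import Moran_Local

INFLUENCE = 1.5
BFACT = 2


def local_moran(gdf, mean_distance, permutations=9999):

    buffer = mean_distance * INFLUENCE

    # Row-standardised distance-band weights: neighbours within BFACT x buffer.
    weights = lp.weights.DistanceBand.from_dataframe(gdf, buffer * BFACT)
    weights.transform = "r"

    lmi = Moran_Local(gdf["Signal"], weights, permutations=permutations)

    # Statistic, quadrant (1=HH, 2=LH, 3=LL, 4=HL) and pseudo p-value per bin.
    return lmi.Is, lmi.q, lmi.p_sim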
mean_distance = mlobject.kk_distances.diagonal(1).mean()
buffer = mean_distance * INFLUENCE
- mlobject.lmi_geometry = lmi.construct_voronoi(mlobject, buffer)
+
+ if mlobject.lmi_geometry is None:
+
+ mlobject.lmi_geometry = lmi.construct_voronoi(mlobject, buffer)
if silent == False:
print(f"\tAverage distance between consecutive particles: {mean_distance:6.4f} [{buffer:6.4f}]")
- print(f"\tGeometry information for region {mlobject.region} saved in: {mlobject.save_path}")
-
- # Load only signal for this specific region.
- mlobject.signals_dict, signal_types = lmi.load_region_signals(mlobject, signal_data, signal_file)
+ print(f"\tGeometry information for region {mlobject.region} saved to: {mlobject.save_path}")
all_lmi = {}
@@ -234,51 +223,114 @@ def get_lmi(region_iter, i=None, silent: bool = True):
if silent == False:
print(f"\n\tRegion done in {timedelta(seconds=round(time() - region_timer))}")
- print(f"\tLMI information for region {mlobject.region} will be saved in: {mlobject.save_path}\n")
+ print(f"\tLMI information for region {mlobject.region} will be saved to: {mlobject.save_path}")
with open(mlobject.save_path, "wb") as hamlo_namendle:
mlobject.save(hamlo_namendle)
+ if moran_info:
+
+ for signal, df in mlobject.lmi_info.items():
+
+ moran_log_path = f"{work_dir}{mlobject.chrom}/moran_log/{signal}"
+ pathlib.Path(moran_log_path).mkdir(parents=True, exist_ok=True)
+
+ df.to_csv(f"{moran_log_path}/{mlobject.region}_{mlobject.poi}_{signal}.txt", sep="\t", index=False)
+
+ if silent == False:
+
+ print(f"\tLog saved to: {moran_log_path}/{mlobject.region}_{mlobject.poi}_{signal}.txt")
-timer = time()
+ if progress is not None:
-# Read region list. If its a region as parameter, create a dataframe.
-# If its a path to a file, read that dataframe.
-if os.path.isfile(regions):
+ progress['value'] += 1
- df_regions = pd.read_table(regions)
+ time_spent = time() - progress['timer']
+ time_remaining = int(time_spent / progress['value'] * (len(df_regions) - progress['value']))
-else:
+ print(f"\033[A{' '*int(sp.Popen(['tput','cols'], stdout=sp.PIPE).communicate()[0].strip())}\033[A")
+ print(f"\t[{progress['value']}/{len(df_regions)}] | Time spent: {timedelta(seconds=round(time_spent))} | "
+ f"ETR: {timedelta(seconds=round(time_remaining))}", end='\r')
- df_regions = pd.DataFrame({"coords": [regions], "symbol": ["symbol"], "id": ["id"]})
-# Read the signal of the chromosomes corresponding to the regions of interest,
-# not to load useless data.
-signal_data = lmi.load_signals(df_regions, work_dir=work_dir)
+def run(opts):
-if multiprocess:
+ work_dir = opts.work_dir
+ regions = opts.region_file
+ multiprocess = opts.multiprocess
+ cores = opts.threads
+ moran_info = opts.info
- if __name__ == "__main__":
+ if opts.threads is None:
- print(f"{len(df_regions)} regions will be computed.\n")
+ cores = int(mp.cpu_count() / 3)
+
+ if not work_dir.endswith("/"):
+
+ work_dir += "/"
+
+ start_timer = time()
+
+ # Read region list. If it's a region passed as a parameter, create a dataframe.
+ # If it's a path to a file, read that dataframe.
+ if os.path.isfile(regions):
+
+ df_regions = pd.read_table(regions)
+
+ else:
+
+ df_regions = pd.DataFrame({"coords": [regions], "symbol": ["symbol"], "id": ["id"]})
+
+ # Read the signal of the chromosomes corresponding to the regions of interest,
+ # so as not to load unneeded data.
+ signal_data = lmi.load_signals(df_regions, work_dir=work_dir) + + if multiprocess: + + print(f"\n------> {len(df_regions)} regions will be computed.\n") try: + + manager = mp.Manager() + progress = manager.dict(value=0, timer = start_timer, done = False, missing_signal = None) + + with mp.Pool(processes=cores) as pool: + + try: + + pool.starmap(get_lmi, [(row, opts, signal_data, progress) for _, row in df_regions.iterrows()]) + + if progress["done"] == True: + + print("\tSome regions had already been computed and have been skipped.", end="") + + if moran_info: + + print(f"\n\tLog saved to: '{work_dir}chr/moran_log/'", end="") + + except Exception: + + if progress["missing_signal"] is not None: + + print(f"\tSignal '{progress['missing_signal']}' is in the signal list but has not been processed with prep.\n" + "\tprocess that signal or remove it from the signal list.\nExiting...") + + pool.close() + pool.join() - pool = mp.Pool(processes=cores) - process_map(get_lmi, [row for _, row in df_regions.iterrows()], max_workers=cores, chunksize=1) - pool.close() - pool.join() + pool.close() + pool.join() except KeyboardInterrupt: pool.terminate() -else: + else: - for i, row in df_regions.iterrows(): + for i, row in df_regions.iterrows(): - get_lmi(row, i, silent=False) + get_lmi(row, opts, signal_data, i=i, silent=False) -print(f"\nTotal time spent: {timedelta(seconds=round(time() - timer))}.") -print("\nall done.") + print(f"\n\nTotal time spent: {timedelta(seconds=round(time() - start_timer))}.") + print("\nall done.") diff --git a/metaloci/tools/prep.py b/metaloci/tools/prep.py index 2d9ea88..11623b5 100644 --- a/metaloci/tools/prep.py +++ b/metaloci/tools/prep.py @@ -5,274 +5,269 @@ import pathlib import subprocess as sp -import sys import warnings -from argparse import SUPPRESS, ArgumentParser, RawDescriptionHelpFormatter +from argparse import HelpFormatter +import h5py +import hicstraw import pandas as pd +from numba.core.errors import (NumbaDeprecationWarning, + NumbaPendingDeprecationWarning) +from pybedtools import BedTool from metaloci.misc import misc +warnings.simplefilter('ignore', category=NumbaDeprecationWarning) +warnings.simplefilter('ignore', category=NumbaPendingDeprecationWarning) warnings.filterwarnings("ignore", category=FutureWarning) -description = """ -This script picks a bed file (USCS format) of signals and transforms it into the format -needed for METALoci to work. -Signal names must be in the format CLASS.ID_IND.ID.\n" -\tClass.ID can be the name of the signal mark.\n" -\tInd.ID can be the name of the patient/cell line/experiment.\n""" - -parser = ArgumentParser(formatter_class=RawDescriptionHelpFormatter, description=description, add_help=False) - -input_arg = parser.add_argument_group(title="Input arguments") - -input_arg.add_argument( - "-w", - "--work-dir", - dest="work_dir", - required=True, - metavar="PATH", - type=str, - help="Path to a working directory.", -) - -input_arg.add_argument( - "-c", - "--hic", - dest="hic_file", - metavar="PATH", - type=str, - required=True, - help="Complete path to the cooler file.", -) - -input_arg.add_argument( - "-d", - "--data", - dest="data", - required=True, - metavar="PATH", - type=str, - nargs="*", - action="extend", - help="Path to file to process. The file must contain titles for the columns," - " being the first 3 columns coded as chrom, start, end. The following " - "columns should contain the name of the signal coded as: id1_id2.\n" - "Names of the chromosomes must be the same as the coords_file " - "described below. 
", -) - -input_arg.add_argument("-o", "--name", dest="output", required=True, metavar="STR", type=str, help="Output file name.") - -input_arg.add_argument( - "-r", - "--resolution", - dest="reso", - required=True, - metavar="INT", - type=int, - help="Resolution of the bins, to calculate the signal (in bp).", -) -region_arg = parser.add_argument_group(title="Region arguments") - -region_arg.add_argument( - "-s", - "--coords", - dest="coords", - required=True, - metavar="PATH", - type=str, - help="Full path to a file that contains the name of the chromosomes in the " - "first column and the ending coordinate of the chromosome in the " - "second column. This can be found in UCSC for your species of interest.", -) -optional_arg = parser.add_argument_group(title="Optional arguments") -optional_arg.add_argument("-h", "--help", action="help", help="Show this help message and exit.") -optional_arg.add_argument("-u", "--debug", dest="debug", action="store_true", help=SUPPRESS) - -args = parser.parse_args(None if sys.argv[1:] else ["-h"]) - -if not args.work_dir.endswith("/"): - - args.work_dir += "/" - -work_dir = args.work_dir -data = args.data -hic_path = args.hic_file -out_name = args.output -coords = args.coords -resolution = args.reso -debug = args.debug - -if debug: - - table = [ - ["work_dir", work_dir], - ["data", data], - ["out_name", out_name], - ["centro_and_telo_file", coords], - ["binSize", resolution], - ] +DESCRIPTION = """ +Takes a bed file (UCSC format) of signals and parses it into the format needed for METALoci to work. +""" - print(table) - sys.exit() +def populate_args(parser): -tmp_dir = f"{work_dir}tmp" -pathlib.Path(tmp_dir).mkdir(parents=True, exist_ok=True) + parser.formatter_class=lambda prog: HelpFormatter(prog, width=120, + max_help_position=60) -if hic_path.endswith(".cool") or hic_path.endswith(".mcool"): + input_arg = parser.add_argument_group(title="Input arguments") + optional_arg = parser.add_argument_group(title="Optional arguments") - hic_path = hic_path + "::/resolutions/" + str(resolution) -misc.check_chromosome_names(hic_path, data, coords) + input_arg.add_argument( + "-w", + "--work-dir", + dest="work_dir", + required=True, + metavar="PATH", + type=str, + help="Path to a working directory.", + ) -column_dict = {} + input_arg.add_argument( + "-c", + "--hic", + dest="hic_file", + metavar="PATH", + type=str, + required=True, + help="Complete path to the cooler file.", + ) -# Create a bed file that contains regions of a given resolution and sort it. -sp.call( - f"bedtools makewindows -g {coords} -w {resolution} > {tmp_dir}/{resolution}bp_bin_unsorted.bed", - shell=True, -) + input_arg.add_argument( + "-d", + "--data", + dest="data", + required=True, + metavar="PATH", + type=str, + nargs="*", + action="extend", + help="Path to file to process. The file must contain titles for the columns," + " being the first 3 columns coded as chrom, start, end. The following " + "columns should contain the name of the signal coded as: id1_id2.\n" + "Names of the chromosomes must be the same as the coords_file " + "described below. ", + ) -sp.call( - f"sort {tmp_dir}/{resolution}bp_bin_unsorted.bed -k1,1V -k2,2n -k3,3n | grep -v random | grep -v Un > {tmp_dir}/{resolution}bp_bin.bed", - shell=True, -) + input_arg.add_argument( + "-r", + "--resolution", + dest="reso", + required=True, + metavar="INT", + type=int, + help="Resolution of the bins, to calculate the signal (in bp).", + ) -# Check if the given bed file has a header and sort the file. 
If there is a header, save column
-# names to a dictionary to put it in the next step, then sort the file.
-for f in data:
+ input_arg.add_argument(
+ "-s",
+ "--coords",
+ dest="coords",
+ required=True,
+ metavar="PATH",
+ type=str,
+ help="Full path to a file that contains the name of the chromosomes in the "
+ "first column and the ending coordinate of the chromosome in the "
+ "second column. This can be found in UCSC for your species of interest.",
+ )
- signal_file_name = f.rsplit("/", 1)[1]
- tmp_signal_path = f"{tmp_dir}/tmp_{signal_file_name}"
+ optional_arg.add_argument("-h", "--help", action="help", help="Show this help message and exit.")
- sp.call(f"cp {f} {tmp_signal_path}", shell=True)
+def run(opts):
+
+ if not opts.work_dir.endswith("/"):
- try:
+ opts.work_dir += "/"
- float( # If first row fourth column is convertible to float, it is a signal, so the first row is not a header.
- sp.getoutput(f"head -n 1 {f} | cut -f 4")
- )
+ work_dir = opts.work_dir
+ data = opts.data
+ hic_path = opts.hic_file
+ coords = opts.coords
+ resolution = opts.reso
- sp.call(
- f"sort {tmp_signal_path} -k1,1V -k2,2n -k3,3n | grep -v random | grep -v chrUn > {tmp_dir}/sorted_{f.rsplit('/', 1)[1]}",
- shell=True,
- )
+ tmp_dir = f"{work_dir}tmp"
+ pathlib.Path(tmp_dir).mkdir(parents=True, exist_ok=True)
- header = False
+ if hic_path.endswith(".cool") or hic_path.endswith(".mcool"):
+
+ if resolution not in sorted([int(x) for x in list(h5py.File(hic_path)["resolutions"].keys())]):
- except ValueError:
+ exit("The given resolution is not in the provided cooler file. Exiting...")
- # Saving the corresponding column names in a dict, with only the file name as a key.
- column_dict[signal_file_name] = sp.getoutput(f"head -n 1 {f}").split(sep="\t")
+ hic_path = hic_path + "::/resolutions/" + str(resolution)
- sp.call(
- f"tail -n +1 {tmp_signal_path} | sort -k1,1V -k2,2n -k3,3n | grep -v random | grep -v chrUn > {tmp_dir}/sorted_{f.rsplit('/', 1)[1]}",
- shell=True,
- )
+ misc.check_cooler_names(hic_path, data, coords)
+
+ elif hic_path.endswith(".hic"):
+
+ if resolution not in hicstraw.HiCFile(hic_path).getResolutions():
+
+ exit("The given resolution is not in the provided Hi-C file. Exiting...")
+
+ misc.check_hic_names(hic_path, data, coords)
- header = True
+ column_dict = {}
+
+ # Create a bed file that contains regions of a given resolution and sort it.
+ BedTool().window_maker(g=coords, w=resolution).saveas(f"{work_dir}tmp/{resolution}bp_bin_unsorted.bed")
- # Intersect the sorted files and the binnarized file.
sp.call(
- f"bedtools intersect -a {tmp_dir}/{resolution}bp_bin.bed -b {tmp_dir}/sorted_{f.rsplit('/', 1)[1]} -wo -sorted > {tmp_dir}/intersected_{f.rsplit('/', 1)[1]}",
+ f"sort {tmp_dir}/{resolution}bp_bin_unsorted.bed -k1,1V -k2,2n -k3,3n | grep -v random | grep -v Un | grep -v alt > {tmp_dir}/{resolution}bp_bin.bed",
shell=True,
)
- awk_com = (
- "awk '{print $0\"\t\"$7*($8/($6-$5))}' "
- + f"{tmp_dir}/intersected_{f.rsplit('/', 1)[1]} > {tmp_dir}/intersected_ok_{f.rsplit('/', 1)[1]}"
- )
- sp.call(awk_com, shell=True)
+ # Check if the given bed file has a header and sort the file. If there is a header, save column
+ # names to a dictionary so they can be restored in the next step, then sort the file.
+ for f in data:
-# Create a list of paths to the intersected files.
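# The awk one-liner above rescales each signal by the fraction of the source
# interval that overlaps the bin: value * overlap / (feature_end - feature_start).
# The same step in pandas, assuming the 8-column output that bedtools
# intersect -wo produces for a 4-column signal file (file name illustrative):
import pandas as pd

cols = ["bin_chrom", "bin_start", "bin_end",
        "sig_chrom", "sig_start", "sig_end", "value", "overlap"]

df = pd.read_csv("intersected.bed", sep="\t", header=None, names=cols)

# Weight each signal value by how much of its interval falls inside the bin.
df["weighted"] = df["value"] * (df["overlap"] / (df["sig_end"] - df["sig_start"]))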
-intersected_files_paths = [(f"{tmp_dir}/intersected_ok_" + i.rsplit("/", 1)[1]) for i in data]
+ signal_file_name = f.rsplit("/", 1)[1]
+ tmp_signal_path = f"{tmp_dir}/tmp_{signal_file_name}"
-final_intersect = pd.read_csv(
- intersected_files_paths[0], sep="\t", header=None, low_memory=False
-) # Read the first intersected file,
-final_intersect = final_intersect.drop([3, 4, 5, 7, 8], axis=1) # Drop unnecesary columns,
+ sp.call(f"cp {f} {tmp_signal_path}", shell=True)
-if header == True:
+ try:
- final_intersect.columns = column_dict[
- intersected_files_paths[0].rsplit("/", 1)[1].split("_", 2)[2]
- ] # Input the corresponding column names
+ float( # If first row fourth column is convertible to float, it is a signal, so the first row is not a header.
+ sp.getoutput(f"head -n 1 {f} | cut -f 4")
+ )
-else:
+ sp.call(
+ f"sort {tmp_signal_path} -k1,1V -k2,2n -k3,3n | grep -v random | grep -v Un | grep -v alt > {tmp_dir}/sorted_{f.rsplit('/', 1)[1]}",
+ shell=True,
+ )
- final_intersect.columns = [
- "chrom",
- "start",
- "end",
- f"{intersected_files_paths[0].rsplit('/', 1)[1].split('_', 2)[2]}",
- ]
+ header = False
+
+ except ValueError:
-final_intersect = ( # Calculate the median of all signals of the same bin.
- final_intersect.groupby(["chrom", "start", "end"])[final_intersect.columns[3 : len(final_intersect.columns)]]
- .median()
- .reset_index()
-)
+ # Saving the corresponding column names in a dict, with only the file name as a key.
+ column_dict[signal_file_name] = sp.getoutput(f"head -n 1 {f}").split(sep="\t")
-# Process the rest of the files the same way and merge with next one, until all files are merged.
-if len(intersected_files_paths) > 1:
+ sp.call(
+ f"tail -n +1 {tmp_signal_path} | sort -k1,1V -k2,2n -k3,3n | grep -v random | grep -v Un | grep -v alt > {tmp_dir}/sorted_{f.rsplit('/', 1)[1]}",
+ shell=True,
+ )
- for i in range(1, len(intersected_files_paths)):
+ header = True
- tmp_intersect = pd.read_csv(intersected_files_paths[i], sep="\t", header=None, low_memory=False)
- tmp_intersect = tmp_intersect.drop([3, 4, 5, 7, 8], axis=1)
+ # Intersect the sorted files and the binned file.
+ BedTool(f"{tmp_dir}/{resolution}bp_bin.bed").intersect(BedTool(f"{tmp_dir}/sorted_{f.rsplit('/', 1)[1]}"), wo=True, sorted=True).saveas(f"{tmp_dir}/intersected_{f.rsplit('/', 1)[1]}")
- if header == True:
+ awk_com = (
+ "awk '{print $0\"\t\"$7*($8/($6-$5))}' "
+ + f"{tmp_dir}/intersected_{f.rsplit('/', 1)[1]} > {tmp_dir}/intersected_ok_{f.rsplit('/', 1)[1]}"
+ )
+ sp.call(awk_com, shell=True)
- tmp_intersect.columns = column_dict[
- intersected_files_paths[i].rsplit("/", 1)[1].split("_", 2)[2]
- ] # Input the corresponding column names
+ # Create a list of paths to the intersected files.
+ intersected_files_paths = [(f"{tmp_dir}/intersected_ok_" + i.rsplit("/", 1)[1]) for i in data]
- else:
+ final_intersect = pd.read_csv(
+ intersected_files_paths[0], sep="\t", header=None, low_memory=False
+ ) # Read the first intersected file,
+ final_intersect = final_intersect.drop([3, 4, 5, 7, 8], axis=1) # Drop unnecessary columns.
- tmp_intersect.columns = [
- "chrom",
- "start",
- "end",
- f"{intersected_files_paths[i].rsplit('/', 1)[1].split('_', 2)[2]}",
- ]
+ if header == True:
- tmp_intersect = (
- tmp_intersect.groupby(["chrom", "start", "end"])[tmp_intersect.columns[3 : len(tmp_intersect.columns)]]
- .median()
- .reset_index()
- )
+ final_intersect.columns = column_dict[
+ intersected_files_paths[0].rsplit("/", 1)[1].split("_", 2)[2]
+ ] # Input the corresponding column names
- final_intersect = pd.merge(final_intersect, tmp_intersect, on=["chrom", "start", "end"], how="inner")
+ else:
-# For each chromosome, create a directory and save the information for that chromosome in .csv and
-# .pkl.
-for chrom in sorted(final_intersect["chrom"].unique()):
+ final_intersect.columns = [
+ "chrom",
+ "start",
+ "end",
+ f"{intersected_files_paths[0].rsplit('/', 1)[1].split('_', 2)[2]}",
+ ]
- pathlib.Path(f"{work_dir}signal/{chrom}").mkdir(parents=True, exist_ok=True)
- final_intersect[final_intersect.chrom == f"{chrom}"].to_pickle(
- f"{work_dir}signal/{chrom}/{out_name}_{chrom}_signal.pkl"
- )
- final_intersect[final_intersect.chrom == f"{chrom}"].to_csv(
- f"{work_dir}signal/{chrom}/{out_name}_{chrom}.csv", sep="\t", index=False
+ final_intersect = ( # Calculate the median of all signals of the same bin.
+ final_intersect.groupby(["chrom", "start", "end"])[final_intersect.columns[3 : len(final_intersect.columns)]]
+ .median()
+ .reset_index()
)
-print(f"\nSignal bed files saved to {work_dir}signal/")
+ # Process the rest of the files the same way and merge with the next one, until all files are merged.
+ if len(intersected_files_paths) > 1:
+
+ for i in range(1, len(intersected_files_paths)):
+
+ tmp_intersect = pd.read_csv(intersected_files_paths[i], sep="\t", header=None, low_memory=False)
+ tmp_intersect = tmp_intersect.drop([3, 4, 5, 7, 8], axis=1)
+
+ if header == True:
+
+ tmp_intersect.columns = column_dict[
+ intersected_files_paths[i].rsplit("/", 1)[1].split("_", 2)[2]
+ ] # Input the corresponding column names
+
+ else:
+
+ tmp_intersect.columns = [
+ "chrom",
+ "start",
+ "end",
+ f"{intersected_files_paths[i].rsplit('/', 1)[1].split('_', 2)[2]}",
+ ]
- chroms_no_signal = [
- chrom
- for chrom in sp.getoutput(f"cut -f 1 {tmp_dir}/{resolution}bp_bin.bed | uniq").split("\n")
- if chrom not in final_intersect["chrom"].unique()
-]
+ tmp_intersect = (
+ tmp_intersect.groupby(["chrom", "start", "end"])[tmp_intersect.columns[3 : len(tmp_intersect.columns)]]
+ .median()
+ .reset_index()
+ )
+
+ final_intersect = pd.merge(final_intersect, tmp_intersect, on=["chrom", "start", "end"], how="inner")
+
+ # For each chromosome, create a directory and save the information for that chromosome in .csv and
+ # .pkl.
+ for chrom in sorted(final_intersect["chrom"].unique()):
+
+ pathlib.Path(f"{work_dir}signal/{chrom}").mkdir(parents=True, exist_ok=True)
+ final_intersect[final_intersect.chrom == f"{chrom}"].to_pickle(
+ f"{work_dir}signal/{chrom}/{chrom}_signal.pkl"
+ )
+ final_intersect[final_intersect.chrom == f"{chrom}"].to_csv(
+ f"{work_dir}signal/{chrom}/{chrom}_signal.csv", sep="\t", index=False
+ )
+
+ print(f"\nSignal bed files saved to {work_dir}signal/")
+
+ chroms_no_signal = [
+ chrom
+ for chrom in sp.getoutput(f"cut -f 1 {tmp_dir}/{resolution}bp_bin.bed | uniq").split("\n")
+ if chrom not in final_intersect["chrom"].unique()
+ ]
-if len(chroms_no_signal) == 1:
+ if len(chroms_no_signal) == 1:
- print(f"Omited chromosome {chroms_no_signal[0]} as it did not have any signal.")
+ print(f"Omitted chromosome {chroms_no_signal[0]} as it did not have any signal.")
-elif len(chroms_no_signal) > 1:
+ elif len(chroms_no_signal) > 1:
- print(f"Omited chromosomes {', '.join(chroms_no_signal)} as they did not have any signal.")
+ print(f"Omitted chromosomes {', '.join(chroms_no_signal)} as they did not have any signal.")
-misc.remove_folder(pathlib.Path(tmp_dir))
+ misc.remove_folder(pathlib.Path(tmp_dir))
-print("\ndone.")
+ print("\ndone.")
diff --git a/metaloci/utility_scripts/__init__.py b/metaloci/utility_scripts/__init__.py
index 378c440..ed508cf 100644
--- a/metaloci/utility_scripts/__init__.py
+++ b/metaloci/utility_scripts/__init__.py
@@ -1,5 +1,3 @@
-from .base import BaseClass, base_function
-
-__all__ = ["BaseClass", "base_function"]
+from .mlo import MetalociObject
import os, sys; sys.path.append(os.path.dirname(os.path.realpath(__file__)))
diff --git a/requirements.txt b/requirements.txt
index b05f2a6..395b56f 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,5 +1,118 @@
-# This template is a low-dependency template.
-# By default there is no requirements added here.
-# Add the requirements you need to this file.
-# or run `make init` to create this file automatically based on the template.
-# You can also run `make switch-to-poetry` to use the poetry package manager.
+# Automatically generated by https://github.com/damnever/pigar.
+
+# METALoci/build/lib/metaloci/plot/plot.py: 13
+# METALoci/build/lib/metaloci/tools/figure.py: 15
+# METALoci/metaloci/plot/plot.py: 13
+# METALoci/metaloci/tools/figure.py: 15
+Pillow == 10.0.0
+
+# METALoci/build/lib/metaloci/misc/misc.py: 9
+# METALoci/build/lib/metaloci/tools/layout.py: 12
+# METALoci/metaloci/misc/misc.py: 9
+# METALoci/metaloci/tools/layout.py: 12
+cooler == 0.9.2
+
+# METALoci/build/lib/metaloci/mlo.py: 1
+# METALoci/metaloci/mlo.py: 1
+dill == 0.3.6
+
+# METALoci/build/lib/metaloci/spatial_stats/lmi.py: 11
+# METALoci/metaloci/spatial_stats/lmi.py: 11
+esda == 2.4.3
+
+# METALoci/build/lib/metaloci/plot/plot.py: 3
+# METALoci/build/lib/metaloci/spatial_stats/lmi.py: 7
+# METALoci/build/lib/metaloci/tools/figure.py: 12
+# METALoci/metaloci/plot/plot.py: 3
+# METALoci/metaloci/spatial_stats/lmi.py: 7
+# METALoci/metaloci/tools/figure.py: 12
+geopandas == 0.13.2
+
+# METALoci/build/lib/metaloci/misc/misc.py: 11
+# METALoci/build/lib/metaloci/tools/layout.py: 13
+# METALoci/metaloci/misc/misc.py: 11
+# METALoci/metaloci/tools/layout.py: 13
+hic_straw == 1.3.1
+
+# METALoci/build/lib/metaloci/plot/plot.py: 4
+# METALoci/build/lib/metaloci/spatial_stats/lmi.py: 8,12
+# METALoci/metaloci/plot/plot.py: 4
+# METALoci/metaloci/spatial_stats/lmi.py: 8,12
+libpysal == 4.6.2
+
+# METALoci/build/lib/metaloci/plot/plot.py: 12
+# METALoci/build/lib/metaloci/tools/figure.py: 13
+# METALoci/build/lib/metaloci/tools/layout.py: 14
+# METALoci/metaloci/plot/plot.py: 12
+# METALoci/metaloci/tools/figure.py: 13
+# METALoci/metaloci/tools/layout.py: 14
+matplotlib == 3.7.2
+
+# METALoci/build/lib/metaloci/plot/plot.py: 6
+# METALoci/build/lib/metaloci/tools/layout.py: 15
+# METALoci/metaloci/plot/plot.py: 6
+# METALoci/metaloci/tools/layout.py: 15
+networkx == 3.1
+
+# METALoci/build/lib/metaloci/tools/prep.py: 14
+# METALoci/metaloci/tools/prep.py: 14
+numba == 0.55.1
+
+# METALoci/build/lib/metaloci/graph_layout/kk.py: 1
+# METALoci/build/lib/metaloci/misc/misc.py: 1
+# METALoci/build/lib/metaloci/plot/plot.py: 7
+# METALoci/build/lib/metaloci/spatial_stats/lmi.py: 9
+# METALoci/metaloci/graph_layout/kk.py: 1
+# METALoci/metaloci/misc/misc.py: 1
+# METALoci/metaloci/plot/plot.py: 7
+# METALoci/metaloci/spatial_stats/lmi.py: 9
+numpy == 1.21.6
+
+# METALoci/build/lib/metaloci/misc/misc.py: 2
+# METALoci/build/lib/metaloci/plot/plot.py: 8
+# METALoci/build/lib/metaloci/spatial_stats/lmi.py: 10
+# METALoci/build/lib/metaloci/tools/figure.py: 14
+# METALoci/build/lib/metaloci/tools/layout.py: 16
+# METALoci/build/lib/metaloci/tools/ml.py: 13
+# METALoci/build/lib/metaloci/tools/prep.py: 11
+# METALoci/build/lib/metaloci/utility_scripts/prep_coords_genes.py: 4
+# METALoci/metaloci/misc/misc.py: 2
+# METALoci/metaloci/plot/plot.py: 8
+# METALoci/metaloci/spatial_stats/lmi.py: 10
+# METALoci/metaloci/tools/figure.py: 14
+# METALoci/metaloci/tools/layout.py: 16
+# METALoci/metaloci/tools/ml.py: 13
+# METALoci/metaloci/tools/prep.py: 11
+# METALoci/metaloci/utility_scripts/prep_coords_genes.py: 4
+pandas == 1.3.0
+
+# METALoci/build/lib/metaloci/tools/prep.py: 10
+# METALoci/metaloci/tools/prep.py: 10
+pybedtools == 0.9.0
+
+# METALoci/build/lib/metaloci/plot/plot.py: 15
+# METALoci/build/lib/metaloci/spatial_stats/lmi.py: 13
+# METALoci/build/lib/metaloci/tools/layout.py: 17,18
+# METALoci/metaloci/plot/plot.py: 15
+# METALoci/metaloci/spatial_stats/lmi.py: 13
+# METALoci/metaloci/tools/layout.py: 17,18
+scipy == 1.9.1
+
+# METALoci/build/lib/metaloci/plot/plot.py: 9
+# METALoci/metaloci/plot/plot.py: 9
+seaborn == 0.11.0
+
+# METALoci/setup.py: 5
+setuptools == 59.8.0
+
+# METALoci/build/lib/metaloci/plot/plot.py: 16,17
+# METALoci/build/lib/metaloci/spatial_stats/lmi.py: 15,16
+# METALoci/metaloci/plot/plot.py: 16,17
+# METALoci/metaloci/spatial_stats/lmi.py: 15,16
+shapely == 2.0.1
+
+# METALoci/build/lib/metaloci/misc/misc.py: 3
+# METALoci/build/lib/metaloci/utility_scripts/prep_coords_genes.py: 10
+# METALoci/metaloci/misc/misc.py: 3
+# METALoci/metaloci/utility_scripts/prep_coords_genes.py: 10
+tqdm == 4.65.0
diff --git a/setup.py b/setup.py
index 27c3205..db5cf98 100644
--- a/setup.py
+++ b/setup.py
@@ -1,15 +1,12 @@
 """Python setup.py for metaloci package"""
 import io
 import os
+
 from setuptools import find_packages, setup
 
 
 def read(*paths, **kwargs):
     """Read the contents of a text file safely.
-    >>> read("metaloci", "VERSION")
-    '0.1.0'
-    >>> read("README.md")
-    ...
     """
 
     content = ""
@@ -22,6 +19,7 @@ def read(*paths, **kwargs):
 
 
 def read_requirements(path):
+
     return [
         line.strip()
         for line in read(path).split("\n")
@@ -32,16 +30,20 @@ def read_requirements(path):
 setup(
     name="metaloci",
     version=read("metaloci", "VERSION"),
-    description="Awesome metaloci created by 3DGenomes",
+    description="METALoci: spatially auto-correlated signals in 3D genomes",
     url="https://github.com/3DGenomes/METALoci/",
     long_description=read("README.md"),
     long_description_content_type="text/markdown",
-    author="3DGenomes",
-    # package_dir = {'metaloci': PATH + '/metaloci'},
+    author="Leo Zuber, Iago Maceda, Juan Antonio Rodríguez and Marc Martí-Renom",
+    author_email="martirenom@cnag.eu",
+    classifiers=["Programming Language :: Python :: 3.9",
+                 "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
+                 "Topic :: Scientific/Engineering :: Bio-Informatics",
+                 "Operating System :: POSIX :: Linux",],
     packages=find_packages(exclude=["tests", ".github"]),
     install_requires=read_requirements("requirements.txt"),
     entry_points={
         "console_scripts": ["metaloci = metaloci.__main__:main"]
     },
-    extras_require={"test": read_requirements("requirements-test.txt")},
+    # extras_require={"test": read_requirements("requirements-test.txt")},
 )
diff --git a/tests/conftest.py b/tests/conftest.py
deleted file mode 100644
index ff9a4c7..0000000
--- a/tests/conftest.py
+++ /dev/null
@@ -1,13 +0,0 @@
-import sys
-import pytest
-
-# each test runs on cwd to its temp dir
-@pytest.fixture(autouse=True)
-def go_to_tmpdir(request):
-    # Get the fixture dynamically by its name.
-    tmpdir = request.getfixturevalue("tmpdir")
-    # ensure local test created packages can be imported
-    sys.path.insert(0, str(tmpdir))
-    # Chdir only for the duration of the test.
-    with tmpdir.as_cwd():
-        yield
diff --git a/tests/test_base.py b/tests/test_base.py
deleted file mode 100644
index 69e2b1d..0000000
--- a/tests/test_base.py
+++ /dev/null
@@ -1,18 +0,0 @@
-import pytest
-
-from metaloci import BaseClass, base_function
-
-given = pytest.mark.parametrize
-
-
-@given("fn", [BaseClass(), base_function])
-def test_parameterized(fn):
-    assert "hello from" in fn()
-
-
-def test_base_function():
-    assert base_function() == "hello from base function"
-
-
-def test_base_class():
-    assert BaseClass().base_method() == "hello from BaseClass"
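Taken together, the signal-preparation hunks above reduce to a single pandas pattern: read each intersected bed file, collapse duplicate rows of the same bin with a groupby-median, and inner-merge every signal table onto the accumulated result on the bin coordinates. A minimal sketch of that pattern with hypothetical toy data (`H3K27ac` and `ATAC` are placeholder signal names, not the script's actual inputs):

```python
import pandas as pd

# Two toy "intersected" tables: bin coordinates plus one signal column each.
# The first bin appears twice in h3k27ac to show the median aggregation.
h3k27ac = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [0, 0, 10000],
    "end": [10000, 10000, 20000],
    "H3K27ac": [1.0, 3.0, 5.0],
})
atac = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [0, 10000],
    "end": [10000, 20000],
    "ATAC": [0.5, 0.7],
})

# Median of all signal values that fall into the same bin, as in the script:
# afterwards there is exactly one row per (chrom, start, end).
h3k27ac = (
    h3k27ac.groupby(["chrom", "start", "end"])[h3k27ac.columns[3:]]
    .median()
    .reset_index()
)

# An inner merge on the bin coordinates stacks the signals side by side and
# keeps only bins present in both tables.
merged = pd.merge(h3k27ac, atac, on=["chrom", "start", "end"], how="inner")
print(merged)  # columns: chrom, start, end, H3K27ac (median), ATAC
```

The inner join is also what makes the later per-chromosome split safe: every surviving bin carries a value for every signal, and chromosomes that lose all their bins surface in `chroms_no_signal`.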
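The rewritten `setup.py` feeds `install_requires` straight from the pigar-generated `requirements.txt` above, so `read_requirements` must drop the `# path: line` annotations and blank lines and keep only the pinned specifiers. The hunk elides the exact filter conditions, so the ones below are an assumption (comments, blank lines, and pip option lines); a minimal self-contained sketch:

```python
def read_requirements(path="requirements.txt"):
    """Return pinned requirement specifiers, skipping comments and blanks.

    The skip conditions here are assumed, not taken from the real function.
    """
    with open(path, encoding="utf-8") as handle:
        return [
            line.strip()
            for line in handle
            if line.strip() and not line.strip().startswith(("#", "-"))
        ]

# Against the file above this would yield entries such as
# ["Pillow == 10.0.0", "cooler == 0.9.2", ..., "tqdm == 4.65.0"].
```

Pinning exact versions favors reproducible installs over flexibility, and because `install_requires` is read at build time, loosening a pin only requires editing this one file.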