Push of first non container version

kalininalab · Feb 1, 2024 · c46c16d
1 parent 3203a65
Showing 119 changed files with 1,933,954 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,150 @@
+config.txt
+structman_config.txt
+errorlog.txt
+example/Output/*
+
+*.pdb
+*~
+*.gz
+!structman/scripts/*.gz
+
+.vscode/
+
+# ---> Python
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,13 @@
+include structman/lib/rinerator/reduce
+include structman/lib/rinerator/probe
+include structman/lib/rinerator/reduce_wwPDB_het_dict.txt
+include structman/scripts/pdb-rsync.sh
+include structman/scripts/struct_man_db_uniprot.sql
+include structman/scripts/database_structure.sql
+include structman/scripts/sqlite_database_structure.sql
+include structman/scripts/structguy_project_template.conf
+include structman/resources/Components-smiles-stereo-oe.smi
+include structman/resources/inchi_base.tsv
+include structman/resources/pdbba.fasta.gz
+include structman/resources/human_info_map.tab
+include changelog.txt
diff --git a/changelog.txt b/changelog.txt
@@ -0,0 +1,171 @@
+Version 1.4.0
+    - Major update preparing the public version installable outside of a container
+    - Implemented an installation script to fully set up structman in a conda environment
+    - New outputs that support the direct usage of structguy after using structman
+    - Added support for sqlite3 as a local database option, hence removed the lite version
+    - Integrated MicroMiner into the pipeline (usage requires configuring an executable)
+    - Switched to ray 2.9.1
+    - Many smaller bug fixes
+
+Version 1.3.0
+    - Added an interface hashing system to the database to retrace different interfaces from the same protein
+    - Updated the automated Modelling pipeline, improved the template selection to focus on mutated sites
+    - New functionality: proteins can now be given tags that annotate a gene identifier (for example: "gene:P53"). Multiple protein isoforms with the same gene tag will then lead to an isoform analysis.
+    - switched to ray 2.2.0
+    - Added support for Ensembl transcript identifiers
+
+Version 1.2.2
+    - added the feature to parse HGVS protein mutation identifiers (for example: p.Arg12His)
+
+
+Version 1.2.1
+    - updated Uniprot ID mapping service to REST API
+
+Version 1.2.0
+    - added database support for interfaces, aggregated interface, protein-protein-interaction, position-position-interactions
+    - switched to ray 1.13.0
+
+Version 1.1.1
+    - added aggregated contact matrix calculation
+    - added aggregated interface calculation
+
+Version 1.1.0
+    - database now performs version checks
+    - switched to ray 1.11.0
+    - added model database support (specifically alphafold DB)
+
+Version 1.0.3
+    - database create command now first checks whether an instance already exists
+    - if the database is not properly configured and structman is still called in db mode, it now switches automatically to lite mode
+
+Version 1.0.2
+    - fasta input parsing now checks for irregular characters
+    - improved some bug logging
+    - added a possibility to run the alignment in serial mode without ray for debugging purposes
+    - switched to ray 1.10.0
+
+Version 1.0.1
+    - removed some commented code
+    - slight changes of file structure to prevent circular imports
+    - multiple small bug fixes
+
+Version 1.0.0
+    - first main version
+    - switched to ray 1.9.0
+    - Multiple smaller bug fixes
+
+Version 0.9.2
+    - changed how imports are organized
+    - switched to ray 1.8.0
+
+Version 0.9.1
+    - Various smaller bug fixes
+    - Multiple model structures (not NMR) are now fused internally into one model
+    - obsolete uniprot entries are now supported
+
+Version 0.9.0
+    - Moved Modeller to version 10.1
+    - Fixed multiple modelling bugs
+    - Implemented aggregated indel analysis
+    - Indel analysis results are now a compressed row in the database
+    - Added automated script that creates/updates the mapping/sequence database
+    - mapping/sequence database got a new structure
+    - default mode now doesn't model indel structures. Can be activated by config or flag
+    - Updated disclaimers
+
+Version 0.8.1
+    - The size of the database row for position data and alignment is now bigger
+    - Fixed a bug that prevented disordered region annotation from happening
+    - Upgraded disorder prediction method from IUpred2A to IUpred3
+    - Unpacking of position data in the output creation is now parallelized
+
+Version 0.8.0
+    - Bigger data fields are now stored as binaries in the database
+
+Version 0.7.3
+    - The update script now additionally loads the .../data/status directory
+    - The RINdb update function checks and writes (into meta.txt in the RINdb main directory) the date of the last update
+    - The RINdb compares the date of the last update with the dates of structure modifications in the /data/status directory of the PDB
+    - Added a function to split sequences of structures with unknown portions; to align them individually; and to fuse the alignments afterwards
+    - Configured custom output format of MMseqs2
+    - Use sequence length of the structure to better estimate the cost for an alignment
+
+Version 0.7.2
+    - Added a nested parallelization for the classification. Triggers only when many cores are available and the classification task is large enough.
+    - updated the docker update script
+
+Version 0.7.1
+    - Fixed a bug that kept the program using a local instance of the PDB
+    - Improved the protein splitting scheduling in the alignment parallelization, now works better with few proteins that are mapped to many structures
+    - Improved the performance of the packaging in the parallelized classification
+
+Version 0.7.0
+    - Fixed some bugs regarding the parsing of asymmetric unit structures for cases where they are given as input
+    - Parallelization of filter_structures is now chunked
+    - Improved chunking of alignment tasks
+    - Ray reinit for each major core loop to prevent memory leaks
+    - added custom object serialization functions to utils
+    - Fixed bugs regarding the parsing of PDB-type inputs
+    - added '--mem_limit [any%]' flag to limit maximal memory usage for testing purposes
+    - extensive overhaul of the parallelization of the annotation and the classification
+
+Version 0.6.1
+    - Added output util 'RAS' that Retrieves all Annotated Structures (and their protein interaction partner) from a specific session.
+
+Version 0.6.0:
+    - Added output util `suggest` command that ranks structural annotation templates
+    - Added associated database table `Suggestion` for rankings
+    - Add pandas requirement to structman conda environment file
+
+Version 0.5.1
+    - increased radius for intra chain interface score sphere
+
+Version 0.5.0:
+    - Location (formerly surface/core) is now divided into surface/buried/core and into total/mainchain/sidechain
+    - Classification is based on sidechain location (means also a new class 'buried')
+    - Changed location and surface values had to be stored in the database -> version change to 0.5.x
+    - Added all types of locations and surface values to classification and feature table outputs
+    - Implemented an interface milieu analysis (inspired by core/rim interface distinction, but goes into more detail)
+
+Version 0.4.1:
+    - Unspecified fasta inputs now add all positions as query
+
+Version 0.4.0:
+    * Major Changes
+        - Add input checksums to see if input file has changed, triggering new session if so
+        - Adds new Checksum column to Session table in DB
+        - Separate output.py into multiple modules
+    * Enhancements
+        - Allow C++ style // line and trailing comments in smlf format
+        - Search user home folder for config file when no others found
+        - Added new structman.utils module (less structman specific than structman.lib.sdsc.utils)
+    * Misc Changes/Fixes
+        - Fix Mutated Sequence off-by-1 indexing error
+        - Move several functions from structman.lib.sdsc.utils to structman.utils
+        - Check that database_source_path exists before creating DB to avoid missing tables and ensuing failure of structman database create
+        - Delete duplicated sql script from /lib (resides in /scripts)
+        - Prefer os.path.join to avoid errors with double slashes in path
+        - Use package paths set in settings.py for ray_init()
+        - Use surface_threshold from config in database.getClassWT, database.getClass, database.diffSurfs
+        - Rename class Output_generator to OutputGenerator to conform to pep8
+        - Fix input/output issues when using relative paths
+        - Import fixes
+        - Autopep8 fixes
+        - Updated Readme
+
+Version 0.3.2:
+    - Modelling now builds up a database of models that can be used to save computation costs for later runs
+    - Positions filtered by sanity checks now do not delete the whole protein
+    - Indel analysis can now be used with stored information from the database
+    - Added '--only_wt' option to run the pipeline without the generation of mutated proteins
+    - Multiple smaller bug fixes
+    - Improved remote classification
+
+Version 0.3.1:
+    - Improved the store handling of remote classification
+    - Implemented a modelling output framework. Currently produces one model per input protein and one model for each given mutation.
+    - Multiple smaller bug fixes
+
+Version 0.3.0:
+    - Start of changelog
+    - Acts as first master version of structman as pip package
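
Note on the Version 1.2.2 entry above: the changelog gives p.Arg12His as an example of an HGVS protein mutation identifier. The sketch below shows one way such a simple substitution could be split into wild-type residue, position, and mutant residue. It is an illustration only and not structman's parser: the function name `parse_hgvs_substitution` and the regular expression are assumptions, and only plain three-letter substitutions are handled (no indels, stop codons, or frameshifts).

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches simple substitutions only, e.g. "p.Arg12His".
_HGVS_SUBSTITUTION = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_hgvs_substitution(identifier):
    """Return (wild_type, position, mutant) using one-letter residue codes.

    Hypothetical helper for illustration; raises ValueError on anything
    that is not a plain three-letter substitution.
    """
    match = _HGVS_SUBSTITUTION.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a simple HGVS protein substitution: {identifier!r}")
    wild_type, position, mutant = match.groups()
    try:
        return THREE_TO_ONE[wild_type], int(position), THREE_TO_ONE[mutant]
    except KeyError as err:
        raise ValueError(f"unknown amino acid code: {err.args[0]}") from None


print(parse_hgvs_substitution("p.Arg12His"))  # ('R', 12, 'H')
```

Rejecting unknown input with an exception, rather than returning a partial result, keeps malformed identifiers from silently turning into wrong positions.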
diff --git a/config_template.txt b/config_template.txt
@@ -0,0 +1,24 @@
+[user]
+db_user_name=
+db_password=
+db_name=
+mmseqs_tmp_folder=
+mail=
+
+db_address=
+mapping_db=
+
+verbosity=1
+
+mmseqs2_path=mmseqs
+
+pdb_path=
+path_to_model_db=
+rin_db_path=
+mmseqs2_db_path=
+mmseqs2_model_db_path=
+
+microminer_exe=
+microminer_db=
+
+iupred_path=
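
Note on config_template.txt: the template is a single INI-style `[user]` section in which most keys are left empty. As a rough sketch of how a file in this layout could be read, the snippet below uses Python's standard `configparser`. It is not taken from structman: the function name `load_user_settings` and the choice to map empty values to `None` are assumptions made for illustration.

```python
import configparser

# Copy of the template above, embedded so the example is self-contained.
CONFIG_TEMPLATE = """\
[user]
db_user_name=
db_password=
db_name=
mmseqs_tmp_folder=
mail=

db_address=
mapping_db=

verbosity=1

mmseqs2_path=mmseqs

pdb_path=
path_to_model_db=
rin_db_path=
mmseqs2_db_path=
mmseqs2_model_db_path=

microminer_exe=
microminer_db=

iupred_path=
"""


def load_user_settings(text):
    """Parse the [user] section; unset (empty) keys become None.

    Hypothetical helper, not part of structman.
    """
    # interpolation=None keeps a literal '%' (e.g. in a password) intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {key: (value or None) for key, value in parser["user"].items()}


settings = load_user_settings(CONFIG_TEMPLATE)
print(settings["verbosity"], settings["mmseqs2_path"], settings["db_name"])
```

All values come back as strings, so a caller would still need to convert `verbosity` to an integer before comparing it numerically.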