Design and Architecture

This document is primarily focused on high-level design, architecture and some technicalities of the Molpher project. Consequently, the document does not contain explanation of the overall concept behind chemical space exploration as described in scientific articles (1) and (4). Because Molpher algorithm was already thoroughly described in (1) and experimentally implemented (2) before the actual project begun, it would be redundant to rephrase those ideas in this document. Since the actual project was all about taking the original concept to the whole different level of performance, scalability, usability and productivity, it is essential that the reader of this document understands the starting point of the project (that is the original source code in (2) and mainly the article (1), which should be also attached as a printed copy in the paper version of this documentation). Anybody reading this document is also advised to have a look at the requirements in the specification (3), to get a quick overview of what were the actual targets of the project. If you are familiar with the mentioned prerequisites, you should have no problem understanding the ideas in this document. While it is possible to read the documentation in the begin-to-end manner, especially potential future developers of Molpher are advised to read the documentation alongside with the source code or doxygen documentation to get up-to-speed as quickly as possible. Generated doxygen documentation also contains complex inheritance and call diagrams that can help initial orientation in the project even more.

(1) Exploration of Chemical Space by Molecular Morphing
(2) Molpher Sources
(3) Molpher Specification
(4) Chemical Space Travel

Architecture Overview

Whole project is split into two separate parts, one being responsible for all the background computation (backend) and the second for the interaction with the user (frontend). Components can be also viewed as a server, which is running on some remote high-performance machine, and the GUI client, which is launched on local work station. Of course, it is possible to run both components on the same machine. One backend can serve to multiple frontends at the same time, but each frontend can be connected to at most one backend. It is important to mention that frontends connected to single backend are not isolated from each other, i.e. they share the same context on the server, see the same jobs and are notified about the same events. The communication between backend and frontend is asynchronous in nature, i.e. work requests from frontend are queued by the backend and once the results are computed, frontend is notified. While calls from frontend to backend are point-to-point, the backwards notifications are broadcasted to all connected frontends. As for current implementation, taking into considerations the size of communicated data and lack of channel encryption, the Molpher components are intended to be deployed on fast trusted local area networks. The whole communication concept was designed for the potential chemical laboratory where individual researchers maintain their own backend instances while granting other researchers read-only access to spectate or read-write access to participate on the space exploration.

In more detail, backend stores all the scheduled Molpher input instances (jobs) and dispatches them one-by-one into the highly parallelized computation loop. Once the jobs are scheduled, their definition can be further edited from frontend. Even the currently computed job can be altered (with certain limitations). Concurrently with job calculation, backend can also generate neighborhoods or calculate Euclidean coordinates for molecules selected by user. From the threading point of view, backend consists of 3 important threads - first thread listening on network and dispatching communication handlers invoked by frontend, second thread fetching and calculating the jobs, and finally third thread generating the neighborhoods or calculating the coordinates. Both background computation threads are further forked into as many worker threads as is exploitable by the physical machine.

Regarding frontend, it consists of just 2 threads - first is listening to the backend notifications and second executes the GUI event loop. The GUI itself is composed from widget representing the backend job queue and several dialogs to control the queue and individual jobs. Each job can be opened into dedicated tab where user can inspect and alter the chemical space exploration. Chemical space is depicted as a graph of molecules projected into 2D Euclidean space. Edges of the graph represent Molpher chemical operators. Apart from already mentioned, frontend also contains components allowing communication with online databases of chemical molecules.

A few words should be also said about source code directory layout. There are 3 physical folders, two of them containing NetBeans projects and files specific to either frontend or backend, and then there is a common folder containing shared classes, data structures and communication interface. Note that on the physical level, there is no folder nesting as all the source code files are on the same level. However, when the project folders are opened in NetBeans, all files are organized into virtual folders dedicated to auxiliary files, communication files, chemical files etc.

Molecule Morphing

Toolkit Choice

Original Molpher was based on the OpenBabel chemical toolkit. As it was unclear whether OpenBabel is indeed good choice for Molpher for the long term, extensive research was carried out to pick the most appropriate alternative. Apart from OpenBabel, other promising candidates offering the required features, platform support and license compatibility were Indigo and RDKit. All three toolkits were compared in terms of thread-safety, fingerprints, similarity coefficients and rendering support. While Indigo seemed as a good choice at first sight because of very good documentation, the closer inspection revealed that it is implemented as a state machine and as such is not very friendly to concurrent access from more threads. Another problem with Indigo was its C API, which would not be very comfortable to call from C++ code. As for RDKit, it provides reasonable level of thread-safety, native C++ API and superior supply of fingerprints and similarity coefficients (both in terms of amount and performance) in comparison to OpenBabel and Indigo. Regarding rendering of molecule images, both RDKit and Indigo can only calculate coordinates of atoms and bonds, but the resulting image must be rendered by third-party. OpenBabel does not support rendering at all. In the end, the RDKit was chosen as a best alternative despite its poor documentation.

Molpher Algorithm

While current implementation of the Molpher algorithm is semantically based on the pseudo-code from the conference article and on the original experimental implementation, it was heavily refactored to be more friendly to parallelization and network communication. Therefore, following description focuses only on the specifics of the current implementation and differences from the experimental implementation. If you are interested in the complete description of the algorithm, please refer to the articles mentioned in the introduction of this documentation.

Algorithm works with the following data structures. Note that data structures were designed to support serialization, which was essential requirement for persistent saving of algorithm state and for communication between backend and frontend.

MolpherMolecule stores molecule identification (SMILES string), connections to the candidate tree (parent SMILES, children SMILES), quality attributes (e.g. distance to target molecule) and Euclidean coordinates. Note that MolpherMolecule is intended mainly for transport and storage - it is converted to the RDKit molecule for the actual morphing operations.
MolpherParam is a structure for various algorithm parameters that were directly taken from the original article and implementation of Molpher. To avoid confusion, variable names for the parameters were also kept the same as in the experimental implementation and original article (despite the original naming is not being overly descriptive).
IterationSnapshot represents the single iteration of the algorithm. As such, it stores iteration identification, input arguments for the algorithm (source molecule, target molecule, parameters, chosen methods, etc.) and current state of the algorithm (most notably candidate tree). Given any IterationSnapshot, algorithm can continue its execution from the saved state. To avoid having a special case in the algorithm, the zeroth IterationSnapshot have to be crafted by frontend according to user input.

Regarding the algorithm itself, it very closely follows the original implementation even though it is not apparent at first sight due to TBB usage. The only significant difference between the original and current implementation is the way how molecules are identified. Originally, molecules were identified by generated integer ID which allowed existence of duplications (two or more instances of the same molecule with different IDs). Such duplications in the candidate tree have potential to make certain statistical properties of the algorithm worse. As an example, consider that despite giving the chemical operators equal chance to produce morphs, each chemical operator has different tendency to create duplicates. As a result, operators with high tendency towards duplicates could have dominant impact on the candidate tree population and diversity. Therefore, it was decided to identify molecules by SMILES strings. Apart from avoiding duplicates, SMILES strings also carry all the information required for molecule reconstruction which is important for networking and persistence. Regarding the performance, SMILES identification is surely slower than using integer IDs, but it was considered as acceptable when compared to gained simplicity and flexibility.

Current Molpher implementation also contains support for decoy handling which was not part of the original implementation. Morphs are rated according to their proximity to the connecting line between their closest decoy and the target (i.e. sum of both distances is minimal on the connecting line between decoy and target). When sums for both morphs are equal, it is possible (but not necessary) that both morphs lie on the same connecting line. In that case, morphs are rated only according to their proximity to the target. Such comparison should allow convergence to the target even in the late stages of the algorithm when majority of morphs lie on the connecting line between decoy closest to the target and the target itself.

Morphing Modules

There are three principal classes of morphing modules in the project - fingerprint strategies, similarity coefficient strategies and chemical operators. As was already hinted, all of them are implemented via strategy pattern to allow addition of more modules in the future. While it might seem that morphing modules are very straightforward and isolated part of the project, it was revealed during development that for them to work properly there must be certain invariants and policy restrictions throughout the project. Together with some additional notes, following paragraphs enumerates the important information snippets for each class of modules.

Fingerprint modules:

As of writing this, RDKit is the only chemical toolkit that provides Morgan fingerprints. These fingerprints are of particular importance to Molpher because of their properties and superior performance. The quality of Morgan fingerprints is comparable with other traditional methods, but the algorithm is by several orders of magnitude faster. Because of those advantages, it is the default Molpher fingerprint method. See (1) and (2) for more information about Morgan fingerprints.
As was described in the original article about Molpher algorithm, it is usually beneficial to extend fingerprints by additional statistics. Contrary to the experimental Molpher implementation, current Molpher implementation provides both standard and extended variants of the fingerprints. These extensions are appended to the original fingerprint as an extra bit vector. The bit vector contains the cardinality of each atom type found in the molecule, the cardinality of each bond type (single, double, etc.) and finally the accumulated bond cardinality between the atom types.
While most of the RDKit fingerprints can be extensively customized by providing explicit parameters, Molpher currently uses only default parameters assuming they are wisely chosen by RDKit creators. In case some customization would be needed, it would have to be decided whether it is enough to hardcode those customization statically into the code or whether it should be exposed to the user via GUI.
(1) Circular Fingerprints
(2) ECFP

Similarity coefficient modules:

Those methods usually returns a real number from closed interval [0, 1] or from [-1, 1] - the greater the value is, the more similar molecules are. Instead of raw similarity coefficients, Molpher algorithm uses them as distances between molecules. Distance is a real number from closed interval [0, 1] - the lower the distance is, the closer the molecules are. To ensure this change of semantics, each strategy implements a conversion method, which transforms similarity coefficient to distance.
While most of the similarity coefficients do not need any parameters, Tversky coefficient must be configured by two weighting parameters, one for each compared molecule. Because RDKit does not provide any default parameters, it was decided to provide two special instantiated variants of Tversky coefficient. Setting the weighting of prototype (e.g. target molecule) features to 90% and variant (i.e. morph) features to 10% means that mainly the prototype features are important, i.e., this produces a "superstructure-likeness" measure. In this case, a Tversky similarity value of 1.0 means that almost all prototype features are represented in the variant, 0.0 that almost none are. Conversely, setting the weights to 10% prototype and 90% variant produces a "substructure-likeness" measure, where variants almost completely embedded into prototype have values near 1.0.

Chemical operators:

RDKit uses a very strict sanitization method which carries out a collection of tasks for cleaning up a molecule and ensuring that it makes chemical sense. While sanitization of molecules is not necessarily required by Molpher algorithm and might be considered as unnecessary slowdown, most of RDKit functions cannot work with unsanitized molecules (they tend to crash, loop infinitely or throw exceptions). During development, it was however determined that only a reduced subset of sanitization is necessary to keep things working under the assumption that molecules does not contain information about aromacity. Therefore, the molecules are represented in Kekulized form in the whole project.
While RDKit is not very friendly to direct editing of molecule structure (e.g. add atom, remove bond), it usually works as expected. There is however one problematic situation - after removing a bond from a Kekulized, ring containing molecule, the sanitization fails, although the result molecule has chemical sense. The reason of this behavior is probably damaged or inconsistent private state of the RDKit's molecule after such change (it is hard to say whether it is a bug or whether it is by design). This problem is currently solved by customized copy constructor, which copies the molecule as a graph (set of atoms and bonds between them) and doesn't copy any additional information about the molecule.
While the Molpher algorithm does not care about atom charges, RDKit's sanitization method discards molecules containing substructures that do not make sense without charges (e.g. nitrogen oxide). To avoid such unnecessarily discarded morphs, MolpherAtom data structure is equipped with additional information. Beside the atomic number, MolpherAtom contains also the charge and the mass of the atom. So, if a molecule contains a neutral oxygen and an anion oxygen, the function GetAtomTypesFromMol will return both as different MolpherAtom instances. Because of that, Molpher algorithm can be slower than the original experimental implementation in case, when the target molecule contains ions (the domain of atoms from which the algorithm has to generate the morphs is larger).

Neighborhood Generation

Neighborhood generation is handled by class NeighborhoodGenerator running in its own thread. Reason for having thread dedicated to neighborhood generation is to allow user to generate neighborhoods and recalculate molecule coordinates while the Molpher algorithm is computing iterations in other thread. As such, neighborhoods are not part of IterationSnapshot and therefore do not belong to the algorithm history. However, to present the neighborhood to the user, molecule coordinates have to be calculated with some context (usually the molecules from the most recent Molpher iteration) so the neighborhood relative position makes sense when surrounded by other molecules. Since the calculation of coordinates can be quite expensive, it is up to user whether the context contains whole candidate tree, just a few molecules or is completely empty. Moreover, to allow user improve visualization when he originally chose overly small context, it possible to ask NeighborhoodGenerator to just calculate coordinates for already existing molecules without actually generating any new neighborhood.

Neighborhood generation is defined by members NeighborhoodTask data structure. Each task is identified by a microsecond timestamp of its creation. Neighborhood itself is defined by its origin MolpherMolecule, maximum size in terms of molecule count, maximum radius in terms of similarity coefficient distance and maximum depth in terms of chemical operator distance. NeighborhoodGenerator repeatedly attempts to generate morphs from the origin and decides whether they are part of the neighborhood according to given constraints. Obviously, if the constraints were too strict, the resulting neighborhood may contain nothing but the origin. Whether to generate neighborhood or just recalculate molecule coordinates of the given context is decided by checking whether the SMILES string of the origin is empty or not.

Object Decomposition

As was already described, at the lowest level there are strategy pattern classes implementing various fingerprint and similarity coefficient methods. Whereas fingerprint methods are all inherited from FingerprintStrategy, similarity coefficient methods are inherited from SimCoefStrategy. To simplify calculation of fingerprints and similarity coefficients with the aforementioned methods, there is also convenience wrapper class SimCoefCalculator.

Still at the lowest level of object hierarchy, there is another occurrence of strategy pattern used for chemical operators as all of them are inherited from MorphingStrategy. Chemical operators applied on a particular molecule share the common read-only context MorphingData which is precomputed from morphed molecule just before new morphs are generated. For each operator, MorphingData contains specific candidates for random modification which lead to the new morph that is allowed according to the chemical laws. Data structures for candidate modifications differ for each morphing operator. For example, operator for atom addition needs a vector containing indices of atoms that has at least one free bond.

The described low level primitives are called by parallel functors CalculateMorphs, CalculateDistances and ReturnResults which are in turn called from the function GenerateMorphs. This function is important as it marks the boundary between low level morphing code, which is based on a particular chemical toolkit library, and high level implementation of Molpher algorithm, which is mostly independent and sees molecules just as SMILES strings. Responsibility of GenerateMorphs is to return the specified number of morphs derived from a given molecule so the caller can start to filter them.

Finally, on the high level, there are two main classes - PathFinder, which implements the Molpher algorithm, and NeighborhoodGenerator, which generates neighborhoods upon request. Both of these classes call GenerateMorphs at certain step in their code to retrieve new molecules. Both classes are composed from several parallel functors which are invoked from the operator() method running in the dedicated thread. The two dedicated threads represent the backend work consumers and are further forked into TBB threads. When there is no work, the consumer threads are usually passively blocked on the condition variable of their producer. Regarding the PathFinder, note that it uses PathFinderContext which is basically IterationSnapshot modified to support concurrent access to its internal data structures.

Parallelization

From the very beginning, it was clear that Molpher algorithm has potential for massive parallelization on the cluster of multiprocessors. At the same time, there was a serious concern of reaching such goal in the given time frame. Therefore it was decided that while all the technology considerations and designing will take into account the ultimate cluster parallelization, the implementation would leverage just a single multiprocessor for now.

When choosing technology for parallelization on multiprocessor, it was considered to use either OpenMP or TBB. While OpenMP is embedded directly into compiler and is somewhat light-weight, TBB was chosen for its superior design and support for advanced parallel constructs and concurrent containers. Regarding cluster parallelization, the choice was between the two most popular MPI implementations (MPICH and OpenMPI). Brief investigation revealed that MPICH is supposedly more portable and stable implementation, so it was chosen over OpenMPI. However, because the Molpher currently does not use MPI at all, the technology choice should be reviewed in the future. Important thing to consider is the legacy C API and data types provided by standard MPI, which is not very suitable for C++ implementation of Molpher. Fortunately, there exists C++ wrapper for MPI inside Boost library. The wrapper offers API that supports modern C++ style of programming and contains support for STL and user-defined data types (based on Boost serialization subsystem). Since Boost MPI can run on top of both MPICH and OpenMPI, it should not cause any compatibility problems. For future reference, it is important to note that MPI and TBB are not mutually exclusive - i.e. if Molpher will ever get parallelized on the cluster of multiprocessor machines, the work on each multiprocessor should be still distributed by TBB as in the current single-machine implementation.

Following analysis of Molpher algorithm focuses primarily on data structures and algorithm sections that have high impact on parallelization. The analysis is intended to provide hindsight over the whole parallelization aspect of the project.

While it might not be apparent at the first sight, the Molpher algorithm very often calls random number generator. Since STL random generator is not thread-safe, there would be need for explicit synchronization to protect its internal state against race condition. With explicit locking however, the random number generation would become the very low-level bottle-neck of the whole algorithm. The obvious solution is to have an independent random number generator for each thread. To provide the same comfort as when using the STL generator, current implementation defines the random number generator as a thread-safe global singleton which stores the individual random engines in the thread local storage (TLS) of each thread. All the engines are seeded from the common engine which in turn is seeded by epoch time. Therefore, explicit synchronization is required only once per each thread and after the thread engine is seeded, the thread can generate random numbers concurrently with other threads.

To find parallelization opportunities in the Molpher algorithm, it was helpful to identify the important data structures and describe the algorithm as a few lines of pseudocode. In the pseudocode, green stripe marks sections of the algorithm that are potentially massively parallelizable on the cluster of multiprocessors and red stripe marks sections of the code that are potentially parallelizable on the single multiprocessor with shared memory.

Single iteration of top-level loop:

1. Find leaves in the candidate tree. ** 2. Generate morphs for each leaf. ** 3. Merge and sort all generated morphs. ** 4. Test all generated morphs on identity with target. ** 5. Filter out majority of generated morphs.
1. Insert survived morphs into candidate tree and prune the tree.

Generating morphs from a single leaf:

2-a. Prepare shared data structures to support morph creation.
2-b. Create a specified number of morphs.

Creating a single new morph:

2-b-i. Pick a random morphing operator.
2-b-ii. Apply the morphing operator on the leaf.
2-b-iii. Calculate fingerprint and distance to target.

Following list enumerates all the data structures that were relevant for parallelization considerations:

candidate tree (PathFinderContext::candidates) ** holds morphs that survived across iterations ** technically not a tree (connections are expressed as identification strings, not as pointers) ** global data structure ** tree leaves are used to fork the execution path ** concurrent access only on a single multiprocessor (or on the master of the cluster) ** implemented as a map-like container supporting concurrent access
derived morphs count map (PathFinderContext::morphDerivations) ** stores a counter of derived morphs for each morph (including unsuccessful attempts or filtered morphs) ** global data structure ** concurrent access (both read/write) required on the whole cluster ** implemented as a map-like container supporting concurrent access ** expected to be copied to each cluster node and then merged back to the master node
set of derived survivor morphs (MolpherMolecule::historicDescendants) ** holds the identifications of morphs that were successfully derived from the particular morph and survived at least into the next iteration (where they might get pruned) ** independent set for each morph in the candidate tree ** does not have to be accessed concurrently (there is almost always enough morphs to saturate all computational units so there is no need to provide parallelism even on the sub-morph level)
list of generated morphs in the single iteration (PathFinder::!MoleculeVector) ** product of step 2. of the pseudocode above ** data structure is created and destroyed every iteration ** concurrent access only on a single multiprocessor ** implemented as a vector-like container supporting concurrent access ** expected to be used for scatter-gather strategy on the whole cluster throughout the step 3., 4. and 5. of the pseudocode above
auxiliary data structures for morphing operators (MorphingData) ** a few collections of pre-computed data to support 2-b. step of the pseudocode ** data structures are created and destroyed every iteration ** read-only concurrent access on a single multiprocessor ** containers were updated after each new morph in the original algorithm to avoid duplications, but in the current implementation, duplicates have to be filtered on a higher level when integrating new morphs from all computational units

Overall idea of splitting the computation on the cluster of multiprocessors is apparent from the previous section about data structures. The reason for which the cluster idea seems viable is that step 3., 4. and 5. of the pseudocode contains sufficiently large number of sufficiently large portions of work to feed the whole cluster with relatively small communication overhead (depending on the algorithm parameters). On the other hand, steps 1., 2a., 2b. and 6. do not seem suitable for the whole cluster, but rather for a single multiprocessor with shared memory, as computation is carried out on relatively small shared data structures.

It is obvious that both levels of parallelization could be combined together in the Molpher. However, cluster level is significantly more challenging than multiprocessor level due to lack of hardware and more complex debugging. Therefore, it is reasonable to split the parallelization effort into several phases according to overall cost and benefits. As a first step, the algorithm was parallelized on a multiprocessor because majority of algorithm steps benefit from it. If the project development will continue in the future, parallelization could be expanded for the whole cluster (however only certain parts of the algorithm would benefit from it).

Dimension Reduction

Dimension reduction is a mechanism how to project molecules whose relationships are expressed in terms of a similarity coefficient into two-dimensional Euclidean space. It is necessary to reduce dimension, because similarity coefficient cannot be directly used as a 2D distance (e.g. it does not maintain triangle inequality). Dimension reduction is offered by both core Molpher components, PathFinder and NeighborhoodGenerator. While PathFinder calculates dimension reduction automatically for the whole candidate tree after each iteration, NeighborhoodGenerator must be explicitly requested by user who also specifies which part of visualization should be updated. It is possible to reduce dimension just for the newly generated neighborhood, for the neighborhood together with some context consisting of already existing molecules, or just for a group of existing molecules. Because dimension reduction could be computationally very expensive and is relevant just for visualization, there are two trivial methods which could be used to quickly skip the dimension reduction when it is not needed - dummy method does nothing and random method assign random coordinates to each molecule. All this gives user the flexibility to decide when and how the dimension reduction should be calculated (be it proper but expensive reduction after each iteration, or just an explicit request to reduce some/all molecules before the user is going to investigate particular iteration or neighborhood).

Modularization

Similarly as with already mentioned morphing methods, even the dimension reduction could be extended via strategy design pattern. Each dimension reduction method inherits and implements the DimensionReducer interface which can be instantiated by ReducerFactory class. While it is not necessary to provide parallel implementation of the new dimension reduction method, it is highly desirable because the reduction would become bottle-neck otherwise.

Algorithm Choice

Regarding dimension reduction algorithm, there was already some work done (3) before the start of Molpher project. Document (1) enumerates various dimension reduction methods and their properties. There are basically three algorithm categories based either on principal component analysis (PCA), self-organizing maps (SOM) or multi-dimensional scaling (MDS). The proof-of-concept implementation (2) was available for PCA and MDS.

While all the mentioned methods would be beneficial for the Molpher, it was decided to choose just one due to time constraints of the project. It is expected that rest of the methods will be implemented in the future by various individuals leveraging the strategy pattern described above. As for the actual choice, preliminary investigation revealed that while PCA is very fast, it does not yield results of desirable quality. It seemed that PCA is more suitable as pre-processing for other methods rather than stand-alone algorithm. Both SOM and MDS are much more computationally expensive methods, but should produce more intuitive results for the user. While SOM is based on artificial neural networks and their learning, MDS usually leverages simulation of physical systems, stress maps and gradient descent. Because application of SOM method was not very well investigated in the available materials and there was concern about conflict between interactivity and neural network learning, it was ultimately decided to choose some MDS method for the pilot implementation. More specifically, the chosen method is based on Kamada-Kawai algorithm, which is formally described in (4) (where you can also find bibliographic links to the original articles of the respective researchers). Informal description of Kamada-Kawai method and specifics of its application to Molpher is elaborated in the section below.

Kamada-Kawai

First of all, let's describe how the algorithm actually works (for more formal description, see ![4] from the previous section). Molecules for which the dimension reduction should be calculated are represented as nodes of the complete graph, where each edge is assigned weight equal to the similarity coefficient distance between the two end nodes of the edge. This complete graph is then considered as a physical system, where edges are springs whose stiffness corresponds to the edge weights (similarity coefficient) and elongations correspond to the Euclidean distances. Such system can be described by energetic function mapping molecule positions to the overall energy of the system. The goal of the algorithm is to gradually decrease the energy of the system until it reaches equilibrium. In order to do that, algorithm fixes positions of all molecules but one (so the resulting function has only two real variables) and calculates the gradient of the resulting function. During each iteration, algorithm picks a molecule with the greatest gradient and moves it to the local minimum. This goes as long as there is some molecule whose gradient is not sufficiently small. Since each spring tries to reach its own equilibrium, which correspond to the situation when similarity coefficient is proportional to the Euclidean distance, the resulting layout should reflect very good trade-off between the influences of all springs.

For MDS methods, it is generally simple to express computational complexity of the single iteration - e.g. for Kamada-Kawai, it is quadratic with respect to molecule count. It is however nearly impossible to analyze how many iterations would algorithm need to converge into equilibrium on the given input if it converges at all. According to literature, MDS methods usually scale well for the graphs with several hundred or thousand of nodes. So for the purpose of having at least some control over iteration count, algorithm is equipped with various parameters to specify what is already considered as equilibrium and how many iterations should be executed at most with respect to the molecule count.

Because the Kamada-Kawai dimension reduction is computationally very expensive, various means to make it faster were considered. While current implementation is completely parallel and scales well for multi-core processors, there is still room for future optimizations - the performance could be further improved by using single instead of double precision and by using SIMD vectorization offered by modern CPUs. It could be also interesting to introduce new parameter limiting the time that could be devoted to dimension reduction. With such parameter, the resulting layout would be sub-optimal but the algorithm would become practical even for execution after each Molpher iteration. Currently it might not be practical, especially when the Molpher algorithm is configured to have only short iterations or when the candidate tree tends to become very large, because in such cases majority of execution time would be consumed by dimension reduction. Note however, that it is not easy to decide the value of such parameter - it should be either provided by user or somehow computed from the input data, it should definitely not be declared as universal constant. Another opportunity for experimentation is the graph representation of molecules - while it must be represented as a complete graph, the edges of candidate tree could be emphasized by increasing their weight by a few orders of magnitude. Although, by intuition, such change could increase chances and speed of algorithm convergence, initial experiments have shown decrease in speed and layout quality, so it was not further investigated due to more important things.

Workflow Management

As mentioned earlier, in backend there are two main executive objects with separate responsibilities - the PathFinder responsibility is to search the path between target and source molecule. The NeighborhoodGenerator responsibility is to generate neighborhood of the given molecule and coordinates of the given set of molecules. Both those executive classes have their dedicated threads where the particular algorithms are executed. Because Molpher is built on server-client architecture, it is quite common scenario that new requests arise from the user while the server is in the middle of execution. To maintain consistency, user requests can influence the computation only at certain phases of the algorithm (e.g. when the previous iteration of the algorithm just completed and the new one has not started yet). Therefore some way to store the user requests until they can be fulfilled is needed. User requests for the PathFinder are stored at the JobManager. Moreover, the PathFinder uses the JobManager to publish results of the iteration. Similarly, the NeighborhoodGenerator uses NeighborhoodTaskQueue to store requests and publish results.

Job Manager

In Molpher, the term job is used for a single user task to find a path between source and target molecule in the chemical space. All jobs are stored in the queue which splits jobs into 3 categories: * Live jobs are those that are one-by-one consumed by PathFinder depending on their position in the queue. First live job in the queue is implicitly considered as running. * Sleeping jobs are not queued, but can be woken up to become live job, in which case the newly woken job is queued behind all the live jobs. Job can be put to sleep by request from frontend or by PathFinder if such job consumed assigned number of iterations or seconds without finding the target. * Terminated jobs are those that were successfully finished (i.e. target found, path established). Once the job terminates, it cannot be revived.

The JobManager is responsible for maintenance of jobs and their states. It also publishes the current states of the jobs to the clients. To store job states, the manager uses the JobGroup object. The JobGroup contains four important structures. First of all, there is mJobMap which maps job IDs to the job definitions (i.e. the most recent IterationSnapshot of the job containing all information necessary to fully describe the job state). Then there are three queues, one for every job category - the mLiveJobQueue, mSleepingJobQueue and mTerminatedJobQueue. Each queue contains job IDs of the jobs that are in the corresponding lifecycle state. Note that the order in which IDs are stored in mLiveJobQueue corresponds to the order in which the jobs are dispatched into PathFinder. Since JobGroup is serializable, it can be directly sent to clients in order to announce the current state of jobs in the backend.

The requirement to store user requests until they can be processed by the PathFinder is met via deferred actions. Whenever a user request comes from the client while the PathFinder is in the middle of iteration, the request cannot be handled immediately and must be stored into corresponding deferred action structure. Also the corresponding deferred action flag is set to true. When the PathFinder finishes the current iteration, it checks the flags and satisfies all deferred actions (if possible) before starting new iteration.

As an example, consider following scenario. A user request to change the fingerprint method comes. If the job whose method the user wants to change is not currently being executed, the method is changed directly. If the job is executed, the new fingerprint method identifier is stored into mDeferredFingerprintSelector structure and the mDeferredFingerprintSelectorIsSet flag is set to true. When the PathFinder finishes the current iteration and publishes the result, it scans all the deferred actions flags. It discovers, that the mDeferredFingerprintSelectorIsSet is enabled, so it changes the fingerprint method of current job to the one stored in mDeferredFingerprintSelector and clears mDeferredFingerprintSelectorIsSet flag.

Task Queue

A single user request for the NeighborhoodGenerator is called a task. There is a structure NeighborhoodTask containing all information needed to describe a single task. Among others, it contains two important members - the neighborhood origin (the molecule whose neighborhood will be searched) and the visualization context (the set of molecules that are considered when calculating coordinates for the new neighborhood). Depending on what members of NeighborhoodTask are filled and what are kept empty, there are three task types: - Neighborhood generation - The neighborhood origin is not empty, visualization context is empty. The generator will produce new neighborhood and calculate coordinates only for new molecules. - Coordinates calculation - The neighborhood origin is empty, visualization context is not empty. The generator does not compute any neighborhood, it only recalculates coordinates of the molecules in the given context. - Neighborhood generation with context - Both the context and origin are defined. First the neighborhood is generated, then the context is merged into the result and finally the dimension reduction is performed on the merged set of molecules.

Similarly to the PathFinder, the NeighborhoodGenerator needs a data structure to store user requests before they can be processed. This role is fulfilled by the NeighborhoodTaskQueue. When a request from the client comes, it is inserted into the queue. The generator then withdraws the task definitions one-by-one from the queue, calculates them and publishes the corresponding results back to clients through the NeighborhoodTaskQueue functions.

Thread Synchronization

Because JobManager and NeighborhoodTaskQueue classes both serve as an interface layer between corresponding calculation threads and communication threads that are delivering data from frontend, their code have to be properly synchronized when accessing any shared data, especially queues. In both classes there are three types of methods according to by whom they are called: - Most of the methods are communication handlers invoked by communication thread. Those functions are usually quite simple as they just lock and trivially modify data. One exception is the method delivering new work to the background calculation thread, because it must signal a condition variable on which the calculation thread might be blocked because there was no more work in the queue. Another thing to remember is that data are currently protected by non-recursive mutex, so if communication handlers share some private convenience methods to avoid code duplication, these must not lock the data as it is expected to be already locked by caller. - Then there are two methods used by calculation thread to fetch new work from the queue and to publish finished work to the frontends. While the publishing method is straightforward, the function to fetch work contains an intricate mechanism to atomically check the queue and sleep on condition variable if the queue is empty. Notice that blocking on condition variable yields the lock protecting the shared data. - Last group are maintenance methods called from main backend thread to control lifecycle of calculation threads. Currently, there is only one such method called Halt, which by flushing the queue and signalling the condition variable causes calculation thread to break his otherwise endless loop and terminate.

Communication

Technology Choice

Since Molpher is based on the client-server architecture, it was necessary to design and implement an infrastructure to allow the communication of both separated parts. In order to do that, it was first necessary to choose appropriate middleware technology that meets following requirements: - distributed under GPL compatible license - stable, tested implementation - usable directly from C++, with good support for serialization of complex C++ data structures

The most important question was, whether to use a robust and generic technology like CORBA, or whether to employ less generic (and even perhaps language-dependent) technology. Since CORBA is unnecessarily complex (due to its support of different languages and various data types), the Remote Call Framework (RCF) was chosen for its simplicity, straightforward usage and native C++ usage. Moreover, because RCF internally uses Boost serialization, it is possible for data structures to travel without any explicit conversions between frontend and backend, or even between backend master and backend slaves in the case of planned MPI support (assuming it would be based on Boost MPI wrapper).

Interface

In contrast to CORBA, RCF does not need language independent IDL (Interface Description Language) to declare interfaces. Instead, ordinary C/C++ macros are used to specify interfaces. The Molpher communication infrastructure consists of two - server-side and client-side - interfaces. Both are defined in the trunk/common/molpher_interface.idl file, and describe mandatory methods, that must be implemented in backend and frontend.

Both interfaces use arguments consisting of primitive data types, STL data types and custom data types, all of which have to be serializable. Among the custom data types that are serialized and sent between frontend and backend are MolpherMolecule<./i>, MolpherParam, IterationSnapsho, NeighborhoodTask, NeighborhoodTaskResult and JobGroup. All custom data types have disabled pointer tracking and versioning to avoid performance degradation and to keep the interface simple. While the pointer tracking would be useless in the current architecture, the versioning feature might be required in the future if the declarations are going to be extended by additional data members. However, then it would be also required to provide some conversion utility to transform persistent IterationSnapshots saved by older versions of executables.

Backend Interface (BackendIfc)

void InitClient(string &backendId) Notifies the server about the existence of new client and asks for backend timestamp, which is used as unique identifier for persistent storage of snapshots for one particular backend run.

uint32_t CreateJob(IterationSnapshot snp, string password) Asks the JobManager to create a new job with the snapshot snp as a source for this job. The created job is protected by the given password. Method returns the unique ID of the created job.

void WakeJob(uint32_t jobId, string password) Asks the JobManager to wake a sleeping job identified by the given jobId and password. If the password does not match, the JobManager ignores the request.

void SleepJob(uint32_t jobId, string password) Asks the JobManager to put a live job identified by the given jobId and password to the sleeping state. If the password does not match, the JobManager ignores the request.

void RemoveJob(uint32_t jobId, string password) Asks the JobManager to remove a job identified by the given jobId and password. There is an important restriction - the current running job cannot be removed. If desired, the running job must be put to sleep first. If the password does not match, the JobManager ignores the request.

void ChangeJobOrder(uint32_t jobId, int32_t queuePosDiff, string password) Asks the JobManager to reorder the job queue and move the job identified by the given jobId and password. If the queuePosDiff is positive, the job is moved by queuePosDiff to the end of the queue , otherwise the job is moved by queuePosDiff to the front of the queue. If the password does not match, the JobManager ignores the request.

bool ValidateJobPassword(uint32_t jobId, string password) Asks the JobManager to validate a password for a job identified by the given jobId. If the given password matches the real password of the job, returns true. If passwords do not match, returns false.

IterationSnapshot GetJobHistory(uint32_t jobId, uint32_t iterIdx, bool &loaded) Asks the JobManager to retrieve a result of iterIdx iteration of the job identified by the given jobId. The loaded flag informs the caller, whether the requested iteration was found.

void SetFingerprintSelector(uint32_t jobId, int32_t selector, string password) Asks the JobManager to assign fingerprint specified by the selector method to the job identified by the given jobId and password. If the password does not match, the JobManager ignores the request.

SetSimCoeffSelector and SetDimRedSelector have similar semantics as SetFingerprintSelector, but they set the similarity coefficient method or dimension reduction method.

void SetChemOperSelectors(uint32_t jobId, vector<int32_t> selectors, string password) Asks the JobManager to assign chemical operators that can be used for generating morphs to the job identified by the given jobId and password. The set of chemical operators is specified by the vector of selectors. If the password does not match, the JobManager ignores the request.

void SetParams(uint32_t jobId, MolpherParam params, string password) Asks the JobManager to assign given Molpher algorithm params to the job identified by the given jobId and password. If the password does not match, the JobManager ignores the request.

void SetDecoys(uint32_t jobId, vector decoys, string password) Asks the JobManager to assign decoys to the job identified by the given jobId and password. The original decoys (if there are any) are removed and replaced by the given molecules. If the password does not match, the JobManager ignores the request.

void AddPruned(uint32_t jobId, vector pruned, string password) Asks the JobManager to mark the given molecules for pruning inside the job identified by the given jobId and password. The Molpher algorithm will try to prune those molecules in the next iteration. If the password does not match, the JobManager ignores the request.

void EnqueueNeighborhoodTask(NeighborhoodTask task) Asks the NeighborhoodTaskQueue to enqueue a new task for the NeighborhoodGenerator. The NeighborhoodTask data structure contains a task creation timestamp, which can serve as the task identifier.

void SkipNeighborhoodTask(posix_time::ptime timestamp) Asks the NeighborhoodTaskQueue to skip a task created at the time equal to the given timestamp.

Frontend Interface (FrontendIfc)

void AcceptJobs(JobGroup jobs) Accepts the state of the server job queues (the JobGroup) sent from the server.

void AcceptIteration(IterationSnapshot snp) Accepts the result of the single Molpher algorithm iteration (the IterationSnapshot) from the server.

void AcceptNeighborhoodTaskResult(NeighborhoodTaskResult res) Accepts the result of a neighborhood query from the server.

Communicators

The important aspect of communication is to understand responsibilities of application threads and their mutual communication when propagating messages. In the frontend, there are two important threads - the RCF listening thread, and the GUI event-loop thread. The GUI runs in the event-loop thread, all painting and handling of user actions is done there. Also, whenever user performs an action that posts some request to the server, this request is sent from the GUI thread. Whenever a message from the server arrives to the client, the message is accepted by the RCF thread. To avoid implementing explicit synchronization for access to data structures, it is convenient to propagate the message to the GUI thread event-loop. For this purpose, FrontendCommunicator is derived from QObject and is connected to the rest of the client objects via Qt signals and slots. The cross-thread communication is done by employing the Qt::QueuedConnection. The backend has its RCF listening thread as well. Apart from that, both executive components in backend (PathFinder and NeighborhoodGenerator) run in their dedicated threads. Whenever a server method is invoked by the client, this call is handled by the RCF thread. Any details of the user requests are stored either in JobManager or NeighborhoodTaskQueue (depends on the nature of the request), from where they are later withdrawn by PathFinder or NeighborhoodGenerator, respectively. Both JobManager and NeighborhoodTaskQueue are explicitly synchronized to allow concurrent access from RCF thread and executive thread. Messages sent from the server back to clients can either have their origin in the backend executive threads or are created as a direct response to client request in the RCF thread.

Both backend and frontend have their part of the communication infrastructure encapsulated in dedicated classes - the BackendCommunicator implements the backend interface and the FrontendCommunicator implements the frontend interface. The model of communication from the server to clients is based on the publish-subscribe schema - the server creates a publishing service on a fixed port 50001, and uses it to publish messages describing the state of server and the results of the computation. Clients can subscribe for this service to receive published messages. Every message is delivered to all currently subscribed clients, so it is the client's concern to decide whether a particular message is relevant to him or not. The subscription is started by calling FrontendCommunicator::ConnectToBackend(ip) method in the client. The reverse direction - from the client to the server - is based on the direct unicast calls. FrontendCommunicator calls remote method of the BackendCommunicator, who executes its body and then returns. Such approach allows the client to get feedback via return values from the methods called on the server (for example, the CreateJob method creates a new job from the given snapshot and returns the ID of the new job to the client). The server listens for those direct unicast calls on the fixed port 50002. However, this model creates a restriction for implementation of backend interface methods. To keep things simple, GUI thread in the client usually synchronously waits until the execution of the remote method is completed. Therefore, those remote backend methods should be fast to avoid unresponsive GUI. Generally, if handling of the request requires long time to handle, the interface methods should just register the client's request and return. When finished, the result of the request should be then propagated back to the client via publish-subscribe mechanism.

Because the data structures traveling between server and client can be quite large (the size entirely depends on the algorithm parameters, but to get an idea, the size of Molpher iteration with default parameters is usually between hundreds of kilobytes and units of megabytes), some kind of compression to reduce the network load seems appropriate. The RCF provides a nice way to define a compression filter, which is applied on all messages. The Molpher uses the zlib library for the stateless compression in both server and client.

Job Control

Job Queue

In both backend and frontend, the job queue is represented by already described JobGroup data structure. JobGroup stores order in which jobs are dispatched and also the most recent IterationSnapshot for each job, which stores the current state of the job, its parameters etc. In backend, the job queue is maintained by JobManager. Whenever the queue itself or the job in the queue is changed, backend broadcasts the updated JobGroup to all connected frontends. In frontend, the JobGroup is delivered to the JobQueue widget which handles visual representation of the queue. Upon each update, the current visual representation is destroyed on completely rebuild by JobsQueue::OnUpdateJobs function. Apart from these complete rebuilds, there is also partial update caused by PathFinder finishing the iteration calculation, JobManager sending the new IterationSnapshot to frontend and finally JobQueue widget storing the new snapshot into JobGroup and updating the visual representation of the queue.

Job Protection

Because single backend can serve to multiple frontends, and therefore to multiple users, there is a motivation to provide a mechanism to isolate users from each other in certain situations. To understand the reasoning behind the implemented design, let's again image the target environment for Molpher - chemical laboratory where multiple researchers access shared high performance server from their work stations. To allow possible cooperation between users, it is desirable for them to see common job queue and to have read-only access to any job in the queue. While it is allowed for two or more users to interact with the same job concurrently (write access), it is expected that in most of the scenarios the user would probably prefer to protect his job because he is either doing some solo work or he simply wants to avoid accidental damage to the job from colleagues. For this purpose, Molpher offers optional password protection for each job. To avoid any confusion, it is important to say, that this feature does not use any advanced security - there is no encryption of communication channel or job data structures, there is no protection from password guessing etc. It is really intended only for friendly, cooperative environment on some local area network to avoid accidental damage. It does not provide any protection against adversaries with bad intentions and should not be used on Internet (until at least the communication channel is encrypted).

Now, how it actually works. User can assign the password to the job during its creation. Backend maintains the mapping from job ID to job password and authorizes any incoming call against it. All job related calls from frontend to backend are equipped with a string argument reserved for the password. To avoid asking user to retype password for each action, frontend caches known mappings and fill password arguments automatically. If frontend does not know the mapping, user has only read access to the job until he provides the mapping through the dialog for privilege elevation. If a job is not protected by password (i.e. empty password), all users have full read-write access because frontend automatically tries to elevate privileges with empty password for convenience.

While most of the password protected actions are straight-forward (i.e. either it is executed or not), altering the order of jobs in the queue needs special considerations. Obviously, user is allowed to change order of a particular job only if he knows the password (or if the password is empty). If such job is then moved down in the queue, there are no limitations. However, if the job is moved up in the queue, it cannot jump over a job for which the password is not known to ensure fair sharing of resources among users.

Job History

Job history, or in other words all the IterationSnapshot instances created for a particular job so far, is stored by backend so it can be broadcasted whenever new frontend connects. Because snapshots, depending on user-specified parameters, tend to be quite large to be stored in memory all the time, they are serialized and persistently stored in ./Results directory. Under the Results directory, there are subdirectories named by timestamp of each backend process instance. Each subdirectory then contains directories named by job ID which ultimately contain snapshot files named by iteration ID. Filepath can be therefore expressed by generic schema ./Results/backend_timestamp/job_id/snapshot_id.snp.

Similarly, the frontend also saves received IterationSnapshots to persistent storage and works only with light-weight representations called IterationSnapshotProxy. Proxy objects contain only that much information so it is possible to reconstruct filepath and load full snapshot from the storage. Snapshots are saved into ./Temp directory and generic schema for their filepath looks like ./Temp/frontend_timestamp/backend_timestamp/job_id/snapshot_id.snp. Notice that there is need for additional level of per-backend subdirectories to allow frontend to connect to multiple backends during its lifetime.

When frontend connects to backend, it is immediately informed about the current JobGroup so it is possible to construct job queue representation for user to interact with. Because JobGroup contains most recent snapshot for each job, it is even possible to investigate the current state of the job via visualization described further below. Once the frontend maintains connection to backend, it is also informed about all newly calculated iterations. For performance reasons, it is however unreasonable to send full history of each job from backend to the newly connected frontend (consider hundreds of megabytes transferred over local area network upon connection). Instead, history retrieval must be explicitly requested by user.

History Retrieval

History retrieval works in the following way. Frontend gradually requests backend to send snapshots one-by-one, first more recent snapshots, then older snapshots down to the zeroth snapshot. JobManager loads snapshot from persistent storage and send it to frontend via BackendCommunicator. In frontend, snapshot is received by FrontendCommunicator, who saves snapshot to persistent storage and creates the proxy object which is then sent to the upper levels of GUI. This mechanism works reasonably fast when the communication between frontend and backend is handled by localhost IPC. It is however much more slower over LAN, which results in the following problems.

First of all, RCF framework expect programmer to explicitly set timeouts for the entire remote call to finish. Because the required timeout depends on the size of transferred snapshot and the current throughput of the network, both of which are unknown at compile time and hard to find out at run time, it is hard to set the right value. Currently, it is done naively by setting big enough constant timeout and attempt count to accommodate most expected usage scenarios. Still, it might happen that user will have to initiate the interrupted history retrieval several times until the history is fully acquired. Hopefully, the future version of RCF would infer big enough timeouts internally. If not, the proper solutions would be to split snapshots into pieces of some constant size.

Another deficiency of the current implementation is that the history retrieval is handled by GUI thread to keep frontend architecture and communication infrastructure simple. Although the GUI event loop is dispatched during history retrieval, there is a noticeable increase in GUI latency. There are two possible proper solutions to this, both of which has non-trivial consequences. Either the history retrieval would have to be driven by backend instead of frontend or there would have to be dedicated background worker thread in frontend. The first solution would require communication infrastructure to allow point-to-point communication originated from backend to frontend (currently backend is only able to broadcast to all frontends). Second solution would require synchronization between GUI thread, background worker thread and communication threads. The potential impact of both solutions was considered too big in comparison to the importance of history retrieval feature, so the naive approach was chosen for the initial implementation.

Tabbed GUI

Apart from JobQueue, another important frontend component is Tab which provides user with means to explore the job state. For this purpose, Tab stores IterationSnapshotProxy objects for a particular job and serves as a container for ChemicalSpaceView visualization component. There are 3 types of tab: * Live tab reflects the current state of the job and is gradually updated according to backend progress. Overall, this type of tab is intended as a sandbox for user interaction with the running Molpher algorithm in the backend. Therefore, it is available only when frontend is connected to backend. To maintain the sandbox state (mainly generated neighborhoods) and integrate asynchronous updates, user is not allowed to change iterations in this tab. Live tab is the only type of tab which is allowed to retrieve job history from backend and have to be opened for the feature to work. At most one live tab can be opened for a particular job at once. When frontend disconnects from backend, all live tabs becomes detached tabs (see below). * Detached tab represents the job state at the time the tab was opened. It is not further updated afterwards. User can freely change the iterations and explore the job history, but the generated neighborhoods are always lost after the iteration change. There is no limit to how many tabs can be detached from the particular job during its lifetime. * Adhoc tab is used to explore one or more snapshots that are not necessarily part of the same job. More specifically, while characteristics of adhoc tab are similar to detached tab, additionally there are no assumptions about job IDs, iteration IDs and causality of the snapshots loaded into adhoc tab. User can use adhoc tabs to load group of snapshots from hard drive or to peek into a particular iteration without actually changing the iteration in source tab (e.g. double-click on some historic iteration in live tab results in opened adhoc tab for that particular iteration).

Job Termination

Whenever some job terminates (i.e. the path from source to target molecule is found), following files are generated into the job's subdirectory in the Results directory: * path.txt contains textual representation of the path that should be easily parsed by some external utility, in case it is needed. The path is represented by molecule SMILES strings on odd lines of the file, and chemical operator descriptors on even lines (in the form operator_id:operator_acronym). File begins by source molecule SMILES on the first line and is followed by an operator which transforms it into second molecule of the path and so no. * final_mols.sdf contains the candidate tree molecules from the last iteration. While this file might not be particularly useful, it contains all molecules on the path. Also, some of the molecules might be interesting because they survived all or most of the iterations. * all_mols.sdf contains the union of candidate trees from all iterations (not to be confused with all explored/generated molecules during algorithm execution). This is the most important output of Molpher algorithm, because it basically represents the relevant neighborhood of the resulting path in the vast chemical space. In a sense, this file can be considered as a resulting targeted chemical library that is expected to be further processed by some other utility in the drug discovery pipeline.

Visualization

Graphic Items

The visualization of chemical space is implemented in class ChemicalSpaceView which is derived from Qt class QGraphicsView. There are two objects used to represent the chemical space entities, both of them derived from QGraphicsViewItem: * VisualizedMolecule class is visual representation of molecule. To allow easy navigation in chemical space, each molecule is characterized by color and shape. Color differentiation is used to distinguish source molecule, target molecule, inner node of candidate tree, leaf of candidate tree, neighborhood molecule, decoy molecule. Shape differentiation is used to distinguish ordinary molecule, new candidate, neighborhood origin, molecule found in online database. For particular colors or shapes, refer to source code or user documentation. To allow highlighting of paths in the candidate tree, each molecule contains pointers to edges that are connected to parent and descendant molecules in the tree. * Edge class is visual representation of chemical operator. It has two states - either it is highlighted or isn't. Then the edge contains pointers to molecules that it connects. Moreover this class contains instance of QGraphicsSimpleTextItem to visually represent chemical operator type. This label is painted near the edge only when the edge is highlighted.

Molecule Priorities

ChemicalSpaceView identifies molecules by their SMILES, so there can be at most one instance of VisualizedMolecule for a particular SMILES. Because ChemicalSpaceView must visualize different groups of molecules (candidate tree, decoys, multiple neighborhoods) that can have non-zero mutual overlap in regard to molecule identities (SMILES strings), there must be some priority rules about how paint molecules that are to be found in more than one group. This is also convenient to allow duplicates in the mentioned groups, although it is currently not needed (for example candidate tree is guaranteed not to contain any duplicates). Highest priority is assigned to source and target molecule, which also means that they are always visible. Then the priority is, in decreasing order, assigned to decoys, candidates (inner nodes or leaves of candidate tree) and neighborhood members. It means that if there is decoy and candidate (or neighbor) with the same SMILES, corresponding VisualizedMolecule will have color and shape of decoy. Similarly, if there is candidate and neighbor with the same SMILES, molecule is painted as candidate. Note that this priority of visual depiction does not prevent VisualizedMolecule to represent for example decoy and candidate molecule at the same time - it is perfectly fine to have decoy molecule with edges connected to the rest of the tree.

Path Highlighting

In candidate tree, it is possible to highlight path from any molecule to the source molecule. ChemicalSpaceView supports two modes of path highlighting: * In hover mode, path to source will become highlighted when mouse cursor enters the molecule. When cursor leaves the molecule, highlighting will disappear. * In fixed mode, currently highlighted path to the source molecule is permanently displayed, regardless of mouse position. Highlight of the path can be enabled/disabled by clicking on the endpoint molecule of the path (second endpoint of the path is always source molecule). If there is currently some path highlighted, the highlight can be transferred to different path by simply clicking on the new endpoint molecule - highlight of the old path will disappear and the new path will become highlighted. In this mode, highlighted path is preserved during iteration switching in detached tab as long as all molecules on the path exist.

Mouse events are detected by VisualizedMolecule and propagated to ChemicalSpaceView via two signals. First signal is saying that highlighting mode has changed or that the end molecule of highlighted path has changed (ChangeHighlightingRule). Second signal is the request to revert currently highlighted path (RevertHighlightedPath).

By calling VisualizedMolecule::SetHighlightingOfPathToSource() of some particular molecule, the path to the source molecule is highlighted (if possible). First, method checks whether the molecule has edge to the parent. If so, color and width of the molecule is changed. Also, z-index of molecules connected by the edge is increased to ensure visibility of the molecules. Edges are iterated and highlighted one-by-one through the parent pointers until there is no parent, which means that the source molecule was found and the path is therefore completely highlighted. For this to work, there cannot be any circle in the graph of molecules connected by edges, otherwise the highlighting algorithm would end up in the infinite loop. Currently such thing cannot occur because edges connect only molecules from the candidate tree which does not contain any circles by definition.

Ergonomic Features

During the project development, visualization was gradually improved to be more ergonomic. First of all, color and shapes of molecules types were chosen to have high mutual contrast and to be easily distinguishable. In order to ensure that color and shapes are easily visible even when the visualization is zoomed out, all items on the screen are maintained at constant predetermined scale and only the distances between items are changed during zooming. To provide seamless transition between Molpher algorithm iterations allowing user to see the most obvious differences at first sight, the source molecule is always placed to the center of the screen and whole visualization is rotated in such a way that the target molecule is on the right hand side and at the same vertical level as source molecule. Lastly, the zooming transformation is always applied with its vector space origin aligned to the current mouse pointer location, which is generally perceived by a user as more intuitive way of zooming.

Molecule Thumbnails

By hovering mouse over the molecule, Molpher is requested to render thumbnail of 2D structural schema of the molecule. The thumbnail itself is created in the VisualizedMolecule::hoverEnterEvent method. There are two ways how the Molpher generates molecule thumbnail: * The primary thumbnail is created using external utility indigo-depict from the Indigo chemical toolkit (http://ggasoftware.com/opensource/indigo). By invoking indigo-depict with appropriate command line parameters, it renders a given molecule thumbnail to a file, which is then immediately loaded and displayed by Molpher. Since this mechanism includes spawning of external process and interacting with filesystem, there can be visible lag between mouse hover and appearance of the molecule thumbnail, especially under heavy load of CPU or file storage. It is however outweighed by superior visual quality provided by indigo-depict. In order to remove the overhead, it would be possible to include whole Indigo as another project dependency link against it, but it seems as an overkill if it would be used just for depiction. * Should the indigo-depict visualization fail for any reason (e.g. the binary was removed by the user), there is a secondary thumbnail, created by the MoleculePainter::DrawMolecule method. This approach uses RDKit to calculate coordinates of molecule's atoms and bonds, which are then painted into QPixmap via the QPainter. The painting code itself is based on the visualization tutorial provided by the RDKit (dependencies/rdkit/Code/Demos/RDKit/Draw) - only instead of Cairo graphic library, the code was ported to use QPainter. This direct thumbnail drawing is used only as fallback, because the resulting quality is not as good as with indigo-depict.

Neighborhood Generation

Molpher can generate two types of molecule neighborhoods, depending on from where they are originated: * Offline neighborhood is generated by NeighborhoodGenerator in the backend as was already described above. Since those neighborhoods are generated by the same underlying RDKit-based code as is used by the Molpher algorithm, it is not guaranteed that the resulting molecules in the neighborhood exist in real world. * Online neighborhood is generated by querying online database that support search based on similarity. Currently, Molpher uses only PubChem database for this purpose (http://pubchem.ncbi.nlm.nih.gov/). Before including result of the query into visualization, it must be sent to NeighborhoodGenerator to calculate coordinates for the new molecules.

Because backend broadcasts calculated neighborhoods to all connected frontends, it is necessary to distinguish what neighborhood results are relevant for each client. To accomplish this, the NeighborhoodTask contains microsecond timestamp of its creation as an identifier, and this timestamp is also included in the corresponding NeighbrohoodTaskResult. When sending a task to the backend, frontend stores the timestamp to be able to recognize the result. Whenever a task result is received, the client compares the stored timestamps to determine whether the result is relevant and should be displayed.

Because dimension reduction is computationally very expensive, it is usually reasonable to deactivate automatic dimension reduction after each Molpher iteration. Instead, user can explicitly request dimension reduction only for iteration he is interested in. In order to do that, user is expected to select all molecules in the chemical space (or whole candidate tree), create NeighborhoodTask without any origin molecule and then request NeighborhoodGenerator to recalculate coordinates for the selected visualization context.

Remote Databases

Online databases of molecules provide web-based API to allow either similarity or identity queries. Molpher leverages this to allow user to test selected group of molecules on identity match with a given database, or to create an online neighborhood based on a similarity search. When only individual molecules are queried by the VisualizedMolecule context menu, the results of the query are shown externally in the web-browser window to display more detailed information. While the group queries require implementation of quite complicated infrastructure on the Molpher side (which is described in the next two sections), individual queries are implemented in a very straightforward manner just by constructing a query string and spawning web-browser process.

Modularization

To allow uniform querying of the different databases, Molpher employs the strategy design pattern - algorithms for accessing particular databases are encapsulated in classes inherited from a common base class defining a common interface for both similarity and identity queries (the SimilarityTester class). The access to various similarity testers is allowed through the DbTestingManager. Using the methods of the manager object, the caller is able to specify the type of query, target database and the consumer of the results. The manager then creates a particular tester for the specified database, connects its signals to the slots of the consumer and notifies the tester to send a query. After finishing the search, results are not returned through the manager. Instead, results are delivered directly from the tester to the consumer via signal and slot mechanism. Moreover, the manager provides a way to dispose finished testers - after emitting the result, tester calls DeleteTester method of the manager class, which ensures the lazy deallocation of the tester once it returns from all of its methods.

Pubchem Tester

PubchemSimmilarityTester, a particular implementation of SimilarityTester interface, provides communication with the PubChem database (http://pubchem.ncbi.nlm.nih.gov/). Both the similarity query and the identity query are asynchronous in PubChem and work in a similar way. The database is accessible via web API (the API itself is described at (http://pubchem.ncbi.nlm.nih.gov/pug_rest/). The URL query generally consists of four parts - the PubChem prefix (which is constant), the input specification, the operation specification and the output specification. Optionally, there can be fifth part - the operation options. Following paragraphs describe the approach how the individual molecules from the given group are queried.

First the tester assembles the initial query, specifying compound identified by SMILES as the input, similarity as the operation and XML as the output specification. The tester also specifies some operation options. Since the similarity search is a time-consuming operation, the PubChem web API works asynchronously. After submitting the query to PubChem, the tester receives a reply. This reply however does not contain search results (because they are probably not computed yet). Instead, it contains request identifier called ListKey.

The ListKey identifies the sent query and allows further questioning the database, whether the original query is completed. In the PubchemSimilarityTester the ListKey is retrieved from the reply and stored for further usage. The tester has its internal timer, which produces the timeout event every 2 seconds. The handler for this timeout event (WaitingTimeout method) iterates through a list of incomplete queries and uses their ListKeys to ask PubChem whether the query execution have already finished.

When the query is completed, the reply contains the list of results. Unfortunately, those results contains only CIDs of the found molecules (CID is an internal PubChem identifier of the molecule). But the Molpher uses SMILES and CIDs can be translated to smiles only by using PubChem again.

The tester assembles a new query for every CID retrieved by the original query, specifying compound identified by CID as input, property retrieval of the canonical SMILES as the operation and XML as the output. This query is sent to the PubChem, and since property retrieval is a synchronous operation, the reply contains requested canonical SMILES, which is then parsed and stored. After all CIDs are translated into SMILES, the tester emits the whole result via Qt signal PublishSimilarityResult.

Since any network connection is unreliable, there has to be some error handling in the tester. Moreover, the PubChem web API returns error message when querying a molecule the database does not contain. In the PubchemSimilarityTester, every query has error counter. When an error is received during execution of the query, the error counter is incremented. There is also an error threshold, currently set to three errors. If the error counter exceeds the threshold, it means either that database is unreachable or the database does not contain the source molecule of the query and the query is pronounced invalid and removed from the querying process.

Molecule Queries

Molpher offers user an easy way to search online chemical databases for molecules similar or identical to an individual molecule visualized in the chemical space. After right-clicking on the molecule, context menu that contains options "Find similar molecules in" and "Find identical molecules in" appears. When a user requests a database search, the handler for the given database assembles a query and posts it to the database using a web-browser to display results.

The search itself uses web-based API of the particular database. There is no standard upheld among the databases, every API is quite unique. Some provides a detailed interface to specify various parameters of the search, some offers only a simple search. Molpher therefore does not allow user to set advanced parameters, for it would make the search less intuitive and it would be necessary to have database-specific interface for every database.

Currently, Molpher supports querying following chemical databases:

PubChem, API description at http://pubchem.ncbi.nlm.nih.gov/search/help_search.html#UrlSch
ChemBl, API description at https://www.ebi.ac.uk/chembldb/ws
ZINC, API description at http://wiki.bkslab.org/index.php/ZINCCL

Should the need to add support for another database arise, Molpher is well prepared. The only actions necessary is to create handler that assembles a query (in the VisualizedMolecule class), add a new item to the context menu the user issues the search from and connect this item to the handler.

Design and Architecture

Architecture Overview

Molecule Morphing

Toolkit Choice

Molpher Algorithm

Morphing Modules

Neighborhood Generation

Object Decomposition

Parallelization

Dimension Reduction

Modularization

Algorithm Choice

Kamada-Kawai

Workflow Management

Job Manager

Task Queue

Thread Synchronization

Communication

Technology Choice

Interface

Communicators

Job Control

Job Queue

Job Protection

Job History

History Retrieval

Tabbed GUI

Job Termination

Visualization

Graphic Items

Molecule Priorities

Path Highlighting

Ergonomic Features

Molecule Thumbnails

Neighborhood Generation

Remote Databases

Modularization

Pubchem Tester

Molecule Queries

Clone this wiki locally