Skip to content

Latest commit

 

History

History
38 lines (22 loc) · 9.29 KB

icde2023_instructions.md

File metadata and controls

38 lines (22 loc) · 9.29 KB

Overview

As the BLANT directory is large, this file will describe where in the codebase BLANT-seed is stored as well as many details which will hopefully save the reviewer's time.

The source code of the index creation algorithm (Algorithm 1), can be found in various files in BLANT/src. The source code of the seed mining algorithm (Algorithm 2), can be found in various files in BLANT/seed_mining. After, we will discuss how we generated ODV seeds. Finally, we will go over how to run all the different parts of the algorithm.

Index Creation Algorithm (Algorithm 1)

All parameters can be found in blant.c. The variable G corresponds to the parameter G, and the variable _k corresponds to the parameter k. The "-l" option with the "DEG" suboption sets the variable _numWindowRepLimit, which corresponds to the parameter D. While the "-l" option does many other things, such as setting _windowRep_limit_method, the reviewer can ignore all of these effects; the only relevant thing it does is set _numWindowRepLimit. The variable heuristicValues[] corresponds to the parameter f applied to all nodes in G.

Lines 346-389 in blant.c correspond to the CreateIndex() function in the pseudocode. Lines 369-378 sort the list of nodes based on f, breaking ties alphabetically by the name of the node. While breaking ties alphabetically is using information outside of topology, we have found that its effects are far too small to be meaningful. The only time this affects the algorithm is when creating the patched index as it will very slightly change the order of lines in the file. We have found that aligning two alphabetic indexes produces results within 0.1% of aligning one alphabetic and one reverse alphabetic index, so we make all indexes alphabetic for simplicity. nwhn stands for "node with heur + name", and it is simply a struct we use to sort nodes based on their heuristic value and name. The rest of the code should be fairly straightforward.

Next, the function SampleGraphletIndexAndPrint() in blant-sampling.c corresponds to the GetNodeEntries() function in the pseudocode. The code is well commented, but we will describe idiosyncransies here. ProcessGraphlet() (in blant-output.c) takes care of ambiguity checking, printing the graphletID + root node orbit, and printing the nodes in a canonical order. ProcessGraphlet() will be discussed in the next paragraph. Lines 941-982 correspond to the GetExpandNeighbors() function in the pseudocode. _topThousandth on line 998 is an unused parameter and can be ignored. The commented out line 999 correspond to different experiments with "stairs-like" skipping algorithms, and "algo=stairs" is the one discussed in "Initial Exploration into Sudden Volume Drops". Finally, lines 1001-1029 correspond to the for loop through Vexp in the pseudocode.

The final function to discuss is ProcessGraphlet(), which is in blant-output.c. This function is by far the most complex because it is used by so many parts of BLANT, so there is no need to understand it in its entirety. The only important part of this function is lines 199-206, which is the path BLANT-seed takes. G is the graph, k is k, and Varray is the array of nodes in the order they were expanded to. NodeSetSeenRecently() simply helps filter out some duplicate graphlets. It has no function except for making the algorithm faster because it has to print less often. _windowRep_allowed_ambig_set is the set of graphlets which are "unambiguous enough". In our algorithm, we only allow completely unambiguous graphlets. PrintIndexEntry() prints the graphletID/canonicalID (l143), the orbit of the first node in Varray (l146-148), and the nodes in Varray in a canonical order (l150-152).

Seed Mining Algorithm (Algorithm 2)

Next, we will discuss Algorithm 2. As a reminder, the source code is in various files in BLANT/seed_mining. As this code is not native to the BLANT repo, we were able to delete most irrelevant pieces.

The full_run() function in full_algorithm_helpers.py is in charge of creating patched indexes and finding seeds from these patched indexes. The only slightly confusing part of this algorithm is the term "ortholog", which simply means a pair of nodes in two networks which are considered "correct". gtag1 (graph tag 1) and gtag2 are simply tags to refer to different networks, and their only use is in get_g1_to_g2_orthologs(), which returns an ortholog dictionary based on the provided tags.

The patched index is created by get_patched_index() in patch_helpers.py. This function first reads the entries in the BLANT index file in order and stores it in entry_list. get_matching_poses_list() then returns pairs of lines numbers which should be patched together based on both C and P. The variable prox corresponds to P, and the variable target_num_matching corresponds to C. Together, lines 113-119 correspond to the first four lines of PatchIndex() in the pseudocode (two for loops, a statement, and an if statement). Constructing the PatchedIndexEntry() object creates the patched ID and a longer list of nodes in a canonical order, corresponding to the "key =..." and "p =..." lines in the pseudocode. The rest of the function is pretty self explanatory.

After the patched index is created, seeds are found by find_seeds() in seeding_algorithm_core.py. Lines 11-21 correspond to the for loop and if statement in FindSeeds() in the pseudocode. Lines 23-35 correspond to the "S = S U..." line. Overall, this function is fairly straightforward.

After calling get_patched_index() and find_seeds(), full_run() then calls get_all_metrics(). get_all_metrics() calls extract_node_pairs() in node_pair_extraction_helpers.py, which corresponds to ExtractNodePairs() in the pseudocode. create_node_pair_voting(), create_node_favorite_pairs(), and create_output_pairs() respectively correspond to the first for loop, the second for loop, and the return statement in the pseudocode. The only interesting thing about this file is seeds_to_m2m. We first convert the list of seeds to a list of node pairs with duplicates, and extract pairs from this new list. This simply improves modularity.

ODV Seeds (Benchmark)

ODV seeds were generated with get_odv_seeds() in odv_helpers. The entire file is extremely straightforward so we will omit explaining the code in detail. The ODV class simply implements the "Graphlet degree vectors and signature similarities" section in the original ODV paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2901631/). The ODV class's correctness was verified with validate_sim_function() (unrunnable but code was left in). Selecting k and n are both described in the code's comments. ODVDirectory simply provides a wrapper around ODV. get_odv_seeds() then uses ODVDirectory and a simple heap algorithm to find the node pairs with the highest similarities between two networks. no1 is our own method described in the paper, and alpha is the parameter described in GRAAL's Supplementary Information (http://www0.cs.ucl.ac.uk/staff/natasa/GRAAL/GRAAL_SI.pdf) in Section 1.2.

How to Run

The website did not specify whether the reviewer would run the algorithm. We will go over the process here, but not in extreme depth.

To make the blant executable, go to BLANT and run "PAUSE=0 NO7=0 NO8=0 ./regression-test-all.sh -make". This will take about an hour. To run blant the same way as in the paper, type "./blant -k8 -lDEG2 -sINDEX -mr -a1 graph.el". -k8 sets k to be 8, and -lDEG2 sets D to be 2. -sINDEX runs the indexing mode of blant, since blant is a tool with many functionalities. Finally, -mr activates the root node orbit technique discussed in Section IIIb. -a1 is optional and simply breaks ties alphabetically. This can be left out while barely changing the results. ./blant will output the index to stdout and will need to be redirected into a file. ./blant does not perfectly deduplicate its output, so you will also need to run ./dedup.sh on the output file to remove duplicates while maintaining the order of lines. The reviewer may try to replicate the example indexes we have provided: examples/syeast0.index and examples/syeast05.index. Note that ./blant must be run from the BLANT/ directory. More information about how to make ./blant and environment information can be found in README.md.

Once indexes are created for two networks, they can be aligned with full_algorithm_helpers.py. You will need to pass in the index files and graph files for both networks, as well as the "gtags" (graph tags) of both networks. The graph tags are only used in get_g1_to_g2_orthologs() in order to return a mapping from nodes in one network to their "correct" counterparts in the other. The reviewer should modify get_g1_to_g2_orthologs() if they want to use gtags which are not provided. An example is provided in examples/syeast0-syeast05.seeds. Note that full_algorithm_helpers.py must be run from the BLANT/seed_mining/ directory. Python 3.7.4 is the version we used for development.

To create ODV seeds, you will need to use orca.sh to first create odv files for a network. Examples are provided in examples/syeast0-k5.odv and examples/syeast05-k5.odv. Then, simply pass in the el files and odv files, as well as k and n (k is the parameter you gave to orca.sh and n is the number of ODV seeds you want) to odv_helpers.py. An example is given in examples/syeast0-syeast05-k5-n1004.odv.

Examples for all these commands can also be found in regression_tests/blantSeed and regression_tests/odvSeeds.