diff --git a/assets/TetianaBlogFigs/insert_animation.gif b/assets/TetianaBlogFigs/insert_animation.gif deleted file mode 100644 index a6e1b9e9..00000000 Binary files a/assets/TetianaBlogFigs/insert_animation.gif and /dev/null differ diff --git a/assets/TetianaBlogFigs/maps_init.png b/assets/TetianaBlogFigs/maps_init.png new file mode 100644 index 00000000..e8c2d9ea Binary files /dev/null and b/assets/TetianaBlogFigs/maps_init.png differ diff --git a/assets/TetianaBlogFigs/maps_init_cut.png b/assets/TetianaBlogFigs/maps_init_cut.png deleted file mode 100644 index 35f76c6f..00000000 Binary files a/assets/TetianaBlogFigs/maps_init_cut.png and /dev/null differ diff --git a/assets/TetianaBlogFigs/remove_animation_1.gif b/assets/TetianaBlogFigs/remove_animation_1.gif deleted file mode 100644 index 111144c1..00000000 Binary files a/assets/TetianaBlogFigs/remove_animation_1.gif and /dev/null differ diff --git a/assets/TetianaBlogFigs/remove_animation_1_small.gif b/assets/TetianaBlogFigs/remove_animation_1_small.gif new file mode 100644 index 00000000..99fea69d Binary files /dev/null and b/assets/TetianaBlogFigs/remove_animation_1_small.gif differ diff --git a/assets/TetianaBlogFigs/remove_animation_2.gif b/assets/TetianaBlogFigs/remove_animation_2.gif deleted file mode 100644 index 44205cec..00000000 Binary files a/assets/TetianaBlogFigs/remove_animation_2.gif and /dev/null differ diff --git a/assets/TetianaBlogFigs/remove_animation_2_small.gif b/assets/TetianaBlogFigs/remove_animation_2_small.gif new file mode 100644 index 00000000..173e2aa6 Binary files /dev/null and b/assets/TetianaBlogFigs/remove_animation_2_small.gif differ diff --git a/blog/_posts/2020-07-31-change-log-genome.md b/blog/_posts/2020-07-31-change-log-genome.md index 103ec14b..83b444a5 100644 --- a/blog/_posts/2020-07-31-change-log-genome.md +++ b/blog/_posts/2020-07-31-change-log-genome.md @@ -1,49 +1,54 @@ --- layout: post -title: "Investigating use of change log for genome class" +title: "Performance 
of Genome Class when Using Change Log" date: 2020-07-31 author: Tetiana D --- # Introduction -For more information about MABE (Modular Agent-Based Evolution platform) as well as other approaches for solving this problem see [this introduction](https://mmore500.com/waves/blog/Team-MABE.html). +For a short introduction about MABE (Modular Agent-Based Evolution platform) as well as the description and performance analysis of other approaches see [this post](https://mmore500.com/waves/blog/Team-MABE.html). ## Genome Class -### Naive Implementation +Genome is a sequence of sites that contain heritable and mutable data. +Genome is used by other MABE modules, for example as a source of data necessary to construct a brain in MABE. + +In biology, a genome is a sequence of four types of nucleotides (A, C, G, T). In MABE, the genome data could be of any type, and for this project the data in our genome will be of `std::byte` type. -Genome is a list of sites with specific values: +The genome class interface provides several mutation methods, which are used to create an offspring genome from a parent genome, specifically: +* Overwrite mutation - the value at one or more sites is overwritten by a different value +* Insert mutation - one or more sites are inserted into the genome +* Remove mutation - one or more sites are removed from the genome -![genome example]({{ site.baseurl }}/assets/TetianaBlogFigs/GenomeExample.png){:style="width: 60%; display: block; margin-left: auto; margin-right: auto;"} +### Naive Implementation -Genome can be naively implemented as a `std::vector` data structure from the standard library. 
+Genome is a sequence of sites with specific values, e.g.: -When the offspring is created from the parent genome, several mutations could take place: -* **Overwrite** - the value at one or more sites is overwritten by a different value -* **Insert** - one or more sites are inserted into the genome -* **Remove** - one or more sites are removed from the genome +![example genome]({{ site.baseurl }}/assets/TetianaBlogFigs/GenomeExample.png){:style="width: 60%; display: block; margin-left: auto; margin-right: auto;"} -In this naive implementation, there mutations can be implemented using the standard library algorithms on `std::vector`, specifically: +It can be naively implemented as a `std::vector` data structure from the standard library. +In this naive design, all mutations can be implemented using the standard library algorithms on `std::vector`, specifically: ```cpp -std::vector sites{std::byte(1), std::byte(2), std::byte(3)}; // this genome consists of three sites with values 1, 2 and 3 +// A genome that consists of three sites with values 1, 2 and 3 +std::vector sites{std::byte(1), std::byte(2), std::byte(3)}; -// Overwrite sites starting at index with values from segment +// Overwrite mutation: overwrite using values from segment starting at index void overwrite(size_t index, const std::vector& segment) { - for (size_t i(0); i < segment.size(); i++) { - sites[index + i] = segment[i]; - } +for (size_t i(0); i < segment.size(); i++) { +sites[index + i] = segment[i]; +} } // Insert mutation: a segment is inserted at index void insert(size_t index, const std::vector& segment) { - sites.insert(sites.begin() + index, segment.begin(), segment.end()); +sites.insert(sites.begin() + index, segment.begin(), segment.end()); } // Remove mutation: segmentSize sites are removed starting at index void remove(size_t index, size_t segmentSize) { - sites.erase(sites.begin() + index, sites.begin() + index + segmentSize); +sites.erase(sites.begin() + index, sites.begin() + index + 
segmentSize); } ``` @@ -51,12 +56,12 @@ void remove(size_t index, size_t segmentSize) { The advantages of this approach include: * All the sites are in contiguous memory -> fast operations due to cache-friendliness (e.g. random access or iterations) * Use of C++ standard library data structures -> code is simple, readable, expressive and optimized for performance -
-However, there are also disadvantages: -* Every generation, the whole genome (the whole `sites` vector) is copied and then the mutations are applied to it -> In a common situation of large genome and low mutation rates, it means copying a lot of values that didn't change + +However, there are some disadvantages: +* Every generation, the whole genome (the whole `sites` vector) is copied and then the mutations are applied to it. In a common situation of a large genome and low mutation rates, it means copying a lot of values that do not change between the parent and the offspring genomes * The `insert()` and `erase()` algorithms have linear time complexity -> inefficient time - -**The goal of this project was to investigate if storing only the mutations (as opposed to storing the whole genome) would provide a better time and memory performance.** + +**The goal of this project was to investigate if storing only the mutations (as opposed to storing the whole genome) for each offspring would provide a better time and memory performance.** ### Optimized Implementation Using Change Log @@ -64,54 +69,56 @@ One of the ways to improve the time complexity as well as optimize for memory us The change log will keep track of the mutations that occurred between the parent and the offsprings over generations. This means only storing the differences between parent genome and it's offsprings as opposed to storing the whole genome for every offspring. -As we've seen above, the algorithm has to support the following mutations: +As we've seen above, the genome class has to support the following mutations: * Overwrite * Insert * Remove My implementation consists of two maps, which, for each offspring genome, store all the necessary information on how this genome is different from the parent: -* **change_log** is implemented as `std::map` and contains the information about the **number of inserted and removed sites**. 
It is used to calculate the relationship between a particular site in the offspring genome and either the parent genome or the newly inserted values stored in the segments_log (see next) -* **segments_log** is implemented as `std::unordered_map` and stores the segments that were inserted into the map during mutation +1. **change_log** is implemented as a `std::map` and contains the information about the **number of inserted and removed sites** (a shift in sites compared to the parent genome). It is used to calculate the relationship between a particular site in the offspring genome and the parent genome +2. **segments_log** is implemented as a `std::unordered_map` and stores the segments that were inserted into the genome during mutations +![map initialization]({{ site.baseurl }}/assets/TetianaBlogFigs/maps_init.png){:style="width: 75%; display: block; margin-left: auto; margin-right: auto;"} -![Schematics of change_log and segments_log]({{ site.baseurl }}/assets/TetianaBlogFigs/maps_init_cut.png){:style="width: 75%; display: block; margin-left: auto; margin-right: auto; "} - -Each genome will have it's own change_log and segments_log, which in combination with the parent genome will allow the random access to any value in the offspring genome as well as the reconstruction of complete offspring genome or a part of it as a contiguous memory block of necessary sites. +Each genome will have its own change_log and segments_log, which in combination with the parent genome will allow random access to any value in the offspring genome as well as the reconstruction of the complete offspring genome (or a part of it) as a contiguous memory block of the necessary sites. -One important detail of the change_log is that it doesn't store every removed or inserted index. Instead, to optimize for memory use, it stores only one index for each range of a particular shift in indices due to insertion of removal (see example below). I.e.
each key in the change_log represents all the keys in the range from the current key until the next key. +One important detail about the change_log is that it doesn't store every removed or inserted index, instead each key in the change_log represents all the keys in the range from the current key until the next key (see example below). {% raw %} -For example, a change_log with entries `{{3 : -2}, {5 : 3}}` corresponds to the following mapping: +For example, a change_log with entries `{{0 : 0}, {3 : -2}, {5 : 3}}` corresponds to the following mapping: {% endraw %} ![range map]({{ site.baseurl }}/assets/TetianaBlogFigs/range_map.png){:style="width: 75%; display: block; margin-left: auto; margin-right: auto;"} To access any index, the following code can be used: ```cpp ---map.upper_bound(index); // "upper_bound" returns the key, which is higher than the index, "--" moves to the previous key +--change_log.upper_bound(index); // "upper_bound" returns the key, which is higher than the index, "--" moves to the previous key ``` - -In the case of the change_log above, `--map.upper_bound(7);` will return 5. -For each key change_log stores how many sites were removed and inserted up until this key. +In the case of the change_log above, `--change_log.upper_bound(7);` will return the iterator to the key = 5. + + +#### Remove mutation :hammer: + +When one or more sites are removed from the genome, a new element `{key : value}` is added in the change_log map: the `key` corresponds to the index, at which the remove mutation starts and the `value` corresponds to the number of sites that were removed. 
- -#### Remove mutaiton :hammer: +The following animation shows how the remove mutation is stored in the change_log and how the change_log is then used to reconstruct the offspring genome: -When one or more sites are removed from the genome, a new element is added in the change_log map: the map key corresponds to the index, at which the remove mutation starts and the map value corresponds to the number of sites that were removed. The following animation shows how the remove mutation is stored int eh change_log and how the change_log is then used to reconstruct the offspring genome: +![removal animation]({{ site.baseurl }}/assets/TetianaBlogFigs/remove_animation_1_small.gif){:style="width: 100%; display: block; margin-left: auto; margin-right: auto;"} -![remove animation]({{ site.baseurl }}/assets/TetianaBlogFigs/remove_animation_1.gif){:style="width: 100%; display: block; margin-left: auto; margin-right: auto;"} +In the animation above, the `remove(3, 2)` method is called, which corresponds to removing two sites at index 3. The index and the number of removed sites are stored in the change_log as a `{key : value}` pair. -In the animation above, the remove(3, 2) method is called, which corresponds to removing two sites at index 3. The index and the number of removed sites are stored int eh change_log as map key and value. +The change_log can then be used to reconstruct the offspring genome, specifically, the values at indices < 3 in the offspring genome will be the same as in the parent genome. +And the sites at the indices >= 3 will be shifted to the left by two sites. -The change_log can then be used to reconstruct the offspring genome, specifically, the values at indices < 3 are the same as in parent genome. 
And the sites at the indices >= 3 were shifted to the left by two sited, therefore in order to access the corresponding values, we need to shift two sites to the right in the parent genome: +Therefore, in order to access the corresponding values, we need to shift two sites to the right in the parent genome: ``` offspring[index] = parent[index + 2] ``` -In the change_log map, each value is the the accumulation of all the changes up to corresponding key, for example, if **two** elements were removed at index 3 and then **three** elements were removed at index 5, the accumulated shift at index >= 5 will be -5: +In the change_log map, each `value` is the accumulation of all the changes up to the corresponding `key`, for example, if **two** elements were removed at index 3 and then **three** elements were removed at index 5, the accumulated shift at index >= 5 will be -5: -![remove animation]({{ site.baseurl }}/assets/TetianaBlogFigs/remove_animation_2.gif){:style="width: 100%; display: block; margin-left: auto; margin-right: auto;"} +![removal animation]({{ site.baseurl }}/assets/TetianaBlogFigs/remove_animation_2_small.gif){:style="width: 100%; display: block; margin-left: auto; margin-right: auto;"} Using this change_log and the parent genome, it is possible to reconstruct the offspring genome by calculating a specific index in the offspring genome in relationship to the parent genome, i.e. for indices: * < 3: same value as in the parent genome @@ -120,36 +127,28 @@ Using this change_log and the parent genome, it is possible to reconstruct the o #### Insert mutation :wrench: -Each value in the change_log map corresponds to the index shift relative to the parent genome. The values of the newly inserted sites do not have any relation to the parent genome, therefore, the values for such keys is set to zero.
In order to not confuse it with zero sites shift, an additional variable is added to the map: a boolean, which specifies whether sites were inserted at this key: +Up to now each value in the change_log corresponded to the shift of genome sites relative to its parent genome. +The values of the newly inserted sites do not have any relation to the parent genome, therefore, the values for such keys are set to zero. +In order to not confuse it with zero sites shift, an additional variable is added to the map: a boolean, which specifies whether sites were inserted at this key: ``` -{key : {val, insert}} // insert = true if there are sites inserted at this key +{key : {val : insert}} // insert == true if sites were inserted at this key ``` -The animation shows an example, where our previous change_log is updated with an insertion of 3 elements {20, 21, 22} at index 6: +The animation shows an example, where our previous change_log is updated with an insertion of three elements {20, 21, 22} at index 6: -![insert animation]({{ site.baseurl }}/assets/TetianaBlogFigs/insert_animation.gif){:style="width: 100%; display: block; margin-left: auto; margin-right: auto;"} -![small insert animation]({{ site.baseurl }}/assets/TetianaBlogFigs/insert_animation_small.gif){:style="width: 100%; display: block; margin-left: auto; margin-right: auto;"} +![insertion animation]({{ site.baseurl }}/assets/TetianaBlogFigs/insert_animation_small.gif){:style="width: 100%; display: block; margin-left: auto; margin-right: auto;"} +In addition to change_log, we now also use segments_log to store the inserted segment. The `std::unordered_map` allows constant time access by key. -In addition to change_log, we use `std::unordered_map`, called segments_log to store the inserted segments. The `std::unordered_map` allows constant time access by key. 
- -After the insert mutation is added to the change_log `{6, {0, true}}` and segments_log `{6, {20, 21, 22}}`, we also add an additional element to know where the inserted segment ends. Specifically, in this case the element is `{9, {-2, false}}`, where key corresponds to the (insert index + segment size): `9 = 6 + 3` and value corresponds to the sites shift up to now (remove 2 sites && remove 3 sites && insert 3 sites): `-2 = -2 + (-3) + 3`. - - -For index 10 and above, there had been accumulated 2 sites removals (two sites removal at index 3 and 3 sites removal at index 5 and 2 sites insertion ar index 10 => (-2) + (-3) + 3 = -2). To reconstruct the offspring genome now using the parent genome and change_log + segments_log: +After the insert mutation is added to the change_log: `{6, {0, true}}` and segments_log: `{6, {20, 21, 22}}`, we also add an additional element to know where the inserted segment ends. +Specifically, in this case the element is `{9, {-2, false}}`, where the `key` corresponds to the (insert index + segment size): `9 = 6 + 3` and `value` corresponds to the sites shift up to now (remove 2 sites && remove 3 sites && insert 3 sites): `-2 = -2 + (-3) + 3`. Using this change_log and the parent genome, it is possible to reconstruct the offspring genome by calculating a specific index in the offspring genome in relationship to the parent genome, i.e. 
for indices: * < 3: same value as in the parent genome -* \>= 3 && <5: `offspring[index] = parent[index + 2]` -* \>= 5: `offspring[index] = parent[index + 5]` - -parent genome: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} - -ind < 3: same value as in the parent genome -ind >= 3 && ind < 5: `offspring[index] = parent[index + 2]` -ind >= 5 && ind < 6: `offspring[index] = parent[index + 5]` -ind >= 6 && ind < 9: segment from `segments_log.at(6)` -ind > 9: `offspring[index] = parent[index + 2]` +* \>= 3 && < 5: `offspring[index] = parent[index + 2]` +* \>= 5 && < 6: `offspring[index] = parent[index + 5]` +* \>= 6 && < 9: segment from `segments_log.at(6)` +* \>= 9: `offspring[index] = parent[index + 2]`
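
The index rules above can be checked with a minimal sketch of the lookup. This is not the MABE implementation: the `Change` struct, `siteAt()` and `exampleOffspring()` are hypothetical names introduced here, sites are stored as `int` instead of `std::byte` to keep the example short, and the change_log is assumed to always contain an entry for key 0. The parent genome `{0, 1, ..., 12}` is the one used in the worked example.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <unordered_map>
#include <vector>

// Hypothetical entry layout for {key : {val, insert}}.
struct Change {
    std::ptrdiff_t shift;  // accumulated site shift relative to the parent
    bool inserted;         // true if the range starting at this key is an inserted segment
};

using ChangeLog = std::map<std::size_t, Change>;
using SegmentsLog = std::unordered_map<std::size_t, std::vector<int>>;

// Return the offspring site at `index` using the parent genome and both logs.
int siteAt(std::size_t index, const std::vector<int>& parent,
           const ChangeLog& change_log, const SegmentsLog& segments_log) {
    // Last key <= index (assumes the change_log contains key 0).
    auto it = std::prev(change_log.upper_bound(index));
    const std::size_t key = it->first;
    if (it->second.inserted) {
        // Inserted sites are read from the stored segment, not from the parent.
        return segments_log.at(key)[index - key];
    }
    // A negative shift means sites were removed, so we read further right in the parent.
    const auto parentIndex = static_cast<std::ptrdiff_t>(index) - it->second.shift;
    return parent[static_cast<std::size_t>(parentIndex)];
}

// Rebuild the offspring from the example: remove(3, 2), remove(5, 3),
// then insert {20, 21, 22} at index 6.
std::vector<int> exampleOffspring() {
    const std::vector<int> parent{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    const ChangeLog change_log{
        {0, {0, false}}, {3, {-2, false}}, {5, {-5, false}},
        {6, {0, true}},  {9, {-2, false}}};
    const SegmentsLog segments_log{{6, {20, 21, 22}}};

    std::vector<int> offspring;
    const std::size_t offspringSize = 11;  // 13 - 2 - 3 + 3
    for (std::size_t i = 0; i < offspringSize; ++i) {
        offspring.push_back(siteAt(i, parent, change_log, segments_log));
    }
    return offspring;
}
```

Calling `exampleOffspring()` yields `{0, 1, 2, 5, 6, 10, 20, 21, 22, 11, 12}`. Index 9 is the first site after the inserted segment and maps to `parent[9 + 2]`, which is why the last range starts at `>= 9`.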