-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabling additional equalities for generic sequences #152
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @masri2019, but after almost completing the review of your PR, I just realized something that you will not be happy with I am afraid -> I was looking that test that you adjusted, because due to transitivity it basically had just one big equality test, and realized that that we actually can't assume transitivity :(. I am so sorry about this, I can't believe I haven't realized this before, I completely forgot about that use case. The use case is the one from the test, when we want "N" to match any other character -> A, C, T or G, but we of course don't want A to match G and so on, which we can't achieve if we assume transitivity! So to generalize, if we assume transitivity, we can't use additional equalities to describe the "joker" character that matches multiple other characters.
Please take my apologies again, I am really sorry I didn't figure this out before.
I am afraid we will have to drop this PR and go with non-transitive approach. I don't think we have much choice there, we most likely need to construct unordered_map, or unordered_set, and use it inside of "isEqual()". That will not be so fast as the solution we had for transitivity, but I don't think we can go around that.
So if we have (A, N), (C, N), (G, H), then we would have unordered set of [(A, N), (N, A), (C, N), (N, C), (G, H), (H, G)] -> it probably makes sense to have these duplicates that don't care about order of elements.
Or we could go with unordered_map of unordered sets: { A -> [N], C -> [N], G -> [H], N -> [A, C] }.
I think unordered set is a better match, in theory, than unordered_map of unordered sets, but I am not sure how complicated it is to ensure that EqualityPair can be the key for unordered_set, so if that is a problem, it might be easier to go with map of sets.
Let me know if you have some better ideas how to approach this!
Oh oh, one idea -> what if we let them define the equality function themselves? For example, let's say they have big alphabet, and only one equality: A == B. I am not sure how we would let them provide such function. Would it be implementation of interface that we specified? They could overload operator == on Element, but I am afraid that might be problematic in some other places if we don't expect it (or maybe not?). |
@Martinsos I thought about the two approaches you said (I mean using a set of pairs or using a hash table of sets). I wrote a script to implement both and compare their performance. My test scenario was that I made an array of About allowing the user to define the equality function, I don't have any idea on how to use lambda in python and define a function based on that in C++. If you know any technique please tell me so maybe we can go that way. I guess this step is critical because this function is going to be called many times in edlib and if the user cannot provide a well-defined function it will harm the performance. I think using |
And one more thing is that I had to use The declaration of the set is something like this: We have to replace |
Great job on doing the tests @masri2019! Regarding the function in Python - I also don't know an easy way, so I agree, let's drop that for now, that can be an optimization for the future at some moment if we decide it is worth bothering with it. Regarding hashing of pair -> yes, I thought that might be a problem hm! Using boost normally makes sense, but, in Edlib we have no external dependencies, and adding Boost for something small like this feels like an overkill. I found this link https://stackoverflow.com/a/9729747/1509394 where they describe how to easily implement the same hashing as boost has -> so how about we do it on our own? I checked Boost source and implementation in Boost is still the same as in this SO answer. Btw, as a final thought -> we could have specialization of AdditionalEqualities for char, where we build the table, like we did before. Or also some other kinds of optimizations. But I would also leave that for the future, I just wanted to mention it. TLDR; -> Great, let's do set of pairs, however let's not use Boost, instead let's copy paste their hashing logic (it should be function hash_combine and then defining the hash struct for pair -> this struct uses hash_combine and then we give it to the unordered_set as second template argument I believe). |
@Martinsos |
I took a look and looks good! Ok, let's go for it! I think this is good
first effort, and if needed we can try optimizing this further in the
future.
Btw. one thing I was thinking about for some time is setting up some really
nice performance tests, that we could use in the future to optimize parts
of Edlib as needed, without having to manually check the speed -> it would
be great to have a suite of perpformance tests that target very specific
things and that measure how much is specific part of edlib taking, and then
making optimization would just be trying couple of approaches and checking
out what are the numbers. Ok, certainly not something we should be doing
now, but just wanted to share the idea, it would be really cool to have
that one day. I did write some perf tests in the past, but that was to
compare with other tools, so not really the same thing. I will open an
issue for it.
…On Sat, Jun 27, 2020 at 10:39 AM Mobin Asri ***@***.***> wrote:
@Martinsos <https://github.com/Martinsos>
Great! I used the hash function you sent in my test code. Let me share my
code and the result.
Here is the test code:
https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp
Here is the result which shows that using a set of pairs is nearly 4 times
faster: (for 20% expected hit rate)
#std::unordered_map -> Finished in 0.051417 seconds
#std::unordered_map -> Hit Count / all: 19853 / 100000
#std::unordered_set -> Finished in 0.012612 seconds
#std::unordered_set -> Hit Count / all: 19919 / 100000
I'm going to do the similar implementation for edlib. Please tell me if
you have any fundamental comments especially on how I implemented the set
of pairs. If you think it is OK you can close this PR and I'll come back
with a new one.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#152 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AALXFB7IGJQVBHPRATYRQ5LRYWV47ANCNFSM4NJXI77Q>
.
|
Ok, created an issue here: #153 .
…On Sat, Jun 27, 2020 at 1:16 PM Martin Šošić ***@***.***> wrote:
I took a look and looks good! Ok, let's go for it! I think this is good
first effort, and if needed we can try optimizing this further in the
future.
Btw. one thing I was thinking about for some time is setting up some
really nice performance tests, that we could use in the future to optimize
parts of Edlib as needed, without having to manually check the speed -> it
would be great to have a suite of perpformance tests that target very
specific things and that measure how much is specific part of edlib taking,
and then making optimization would just be trying couple of approaches and
checking out what are the numbers. Ok, certainly not something we should be
doing now, but just wanted to share the idea, it would be really cool to
have that one day. I did write some perf tests in the past, but that was to
compare with other tools, so not really the same thing. I will open an
issue for it.
On Sat, Jun 27, 2020 at 10:39 AM Mobin Asri ***@***.***>
wrote:
> @Martinsos <https://github.com/Martinsos>
> Great! I used the hash function you sent in my test code. Let me share my
> code and the result.
> Here is the test code:
>
> https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp
> Here is the result which shows that using a set of pairs is nearly 4
> times faster: (for 20% expected hit rate)
> #std::unordered_map -> Finished in 0.051417 seconds
> #std::unordered_map -> Hit Count / all: 19853 / 100000
> #std::unordered_set -> Finished in 0.012612 seconds
> #std::unordered_set -> Hit Count / all: 19919 / 100000
> I'm going to do the similar implementation for edlib. Please tell me if
> you have any fundamental comments especially on how I implemented the set
> of pairs. If you think it is OK you can close this PR and I'll come back
> with a new one.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#152 (comment)>, or
> unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AALXFB7IGJQVBHPRATYRQ5LRYWV47ANCNFSM4NJXI77Q>
> .
>
|
This PR contains the implementation of a disjoint-set data structure and using it to activate additional equalities for generic sequences.
The main change is based on the transitive relation, which means that if A is equivalent to B (A ~ B) and B ~ C then A ~ C. If we have such an assumption we can use a disjoint-set structure to map all the elements in the same equivalency set to just one element.
Here is the list of main changes:
transformSequences()
.EqualityDefintion
is removed and wherever we were callingareEqual
it is now replaced by equality operator==
. It is going to be faster than before.