-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Match merging against obfuscation #1202
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very exhaustive review, be aware. One underlying issue I see is the anti-pattern of relying too much on fields instead of passing via parameters. All your private methods are void, thus, do not return a result. Instead, they read and write into the shared state, mainly the global matches list. I think we should refactor this so the global matches are passed as a parameter instead, and it is no longer a field. Then, the return values of the private methods should be used. computeNeighbors
should return the neighbors. removeToken
should return the list of altered matches (thus solving a problem described in one of the comments below). Similar for the others.
@tsaglam can you take a look at the code smell? My idea would be to make removeBuffer a void function but that's contrary to one of your requests changes. |
This is caused by the habit in the code to re-use input collections of a method as output collection. This should be avoided because it can lead to side-effects that were not considered. E.g. someone changing the returned list can modify the input list. |
For the sake of unambiguity, I'd like to apply the following naming conventions to MatchMerging:
This adresses the ambiguity of first and second. Please give me feedback on this and I'll make the changes. |
As discussed with Timur: |
…for MergingParamters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just 4 minor things, and then we can merge! 👍
@Kr0nox, alright from your side? |
@tsaglam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I reviewed it early as I missed that the file end behavior was not in there. Now everything is in, right? One minor thing regarding duplication.
[JPlag Plagiarism Detector] Kudos, SonarCloud Quality Gate passed! |
I don't have merging rights, so you have to click the button |
Yes, that is intentional. 👍 |
This PR proposes Match Merging, a new defence mechanism against obfuscations like insertions, alterations and swapping.
How does it work:
The greedy string tiling algorithm at the core of JPlag calculates matches between submissions.
In order to not raise similarity with very short matches the MinimumTokenMatch (MTM) defines the minimal length of a token-match in order to be considered for the similarity calculation. In Java it is 9 and in CPP 12.
Tools like Mossad or JPlag-GEN use this for plagiarism obfuscations by applying changes to submissions which keep the semantics identically but break up large matches into multiple smaller ones.
When done often enough most matches are below MTM and therefore ignored, resulting in a very low similarity score between the original and obfuscated plagiarism.
My approach aims to revert these changes on submissions by merging neighbouring matches and is based on two parameters mergeBuffer and seperatingThreshold.
mergeBuffer defines how lower the length of a match can be than the MTM.
seperatingThreshold defines how many tokens can be between two neighbouring matches.
My approach considers previously ignored matches and regular ones and merges them based on the parameters.
Merged-over tokens are removed from the submission.
Currently, both parameters default to 0 which disables my logic.
The default values will be changed in a later PR.
Which classes are affected and why: