Corrupt lines in pair file #43

Open
malinkallen opened this issue Jul 11, 2020 · 1 comment

@malinkallen

I have run the SourcererCC clone detector on slightly more than 35,000,000 files. The resulting clone pair file consists of >18,000,000,000 lines. The expected format is 4 comma-separated numbers per line, but 5 of these lines contain more than 4 numbers:

263694,263710,455981,41668,70616
591916,1015368,508215,591934,1015376,192522,333749
14702,100025479,527866,914862,100025719,706877,1213095
502505,200858502537,200858458,1527027,102616237
1454158,2021454205,202495178,785203,101352033
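
For anyone who wants to run the same check on their own pair file, here is a minimal sketch of how such lines could be located. It assumes, as described above, that every valid line holds exactly 4 comma-separated numbers. The function name `find_malformed_lines` is hypothetical and not part of SourcererCC. Note that it flags any line whose field count differs from 4, so it would also catch lines that are too short.

```python
import io

EXPECTED_FIELDS = 4  # assumption taken from this issue: 4 numbers per clone pair line


def find_malformed_lines(lines, expected_fields=EXPECTED_FIELDS):
    """Yield (line_number, text) for non-empty lines with an unexpected field count.

    `lines` can be any iterable of strings, e.g. an open file object, so the
    input is streamed and never loaded into memory as a whole.
    """
    for line_number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text:
            continue  # ignore blank lines
        if text.count(",") + 1 != expected_fields:
            yield line_number, text


# Small in-memory stand-in for the real pair file.
sample = io.StringIO("1,2,3,4\n263694,263710,455981,41668,70616\n5,6,7,8\n")
malformed = list(find_malformed_lines(sample))
print(malformed)
```

On a real run, the `io.StringIO` object would be replaced by an open file handle (or `gzip.open(path, "rt")` for a compressed part).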

The first one is on line 1604224 of query_3clones_index_WITH_FILTER.txt, which is attached in gzipped form, split into 3 parts because I cannot upload files larger than 10 MB: query_3clones_index_WITH_FILTER_1.txt.gz, query_3clones_index_WITH_FILTER_2.txt.gz, query_3clones_index_WITH_FILTER_3.txt.gz

The server that I ran on went down a couple of times, so one could imagine that 263694,<parts of an ID> was written before a crash, and the next clone pair was then written on the same line. However, I don't think that is the case. Since SourcererCC resumes from the last line logged in recovery.txt, I see two possibilities:

  1. The last line logged in recovery.txt is the line just before the one that was being processed when the server went down. Then the second number of the corrupt line should end with the first number of that line, which is not the case.
  2. The last line processed (and giving rise to an output line) before the crash is not the last one logged in recovery.txt. Then the first line processed after recovery would already have been processed before the crash, so there should be another line ending with 455981,41668,70616, but I cannot find one.
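
The two checks above can be sketched roughly as follows. This is only an illustration of the reasoning, not SourcererCC code: both helper names are hypothetical, and the sample lines stand in for the real pair file.

```python
def second_field_ends_with_first(line):
    """Possibility 1: a partial second ID fused with the start of a rewritten line.

    If the same query was reprocessed after the crash, the rewritten line starts
    with the same first number, so the fused second field should end with it.
    """
    fields = line.strip().split(",")
    return len(fields) >= 2 and fields[1].endswith(fields[0])


def other_lines_with_suffix(lines, suffix, skip_line_number):
    """Possibility 2: line numbers of other lines that end with the same suffix."""
    return [
        line_number
        for line_number, raw in enumerate(lines, start=1)
        if line_number != skip_line_number and raw.rstrip("\n").endswith(suffix)
    ]


# Stand-in data; line 2 is the first corrupt line reported in this issue.
sample_lines = ["1,2,3,4\n", "263694,263710,455981,41668,70616\n", "5,6,7,8\n"]

fused = second_field_ends_with_first(sample_lines[1])
duplicates = other_lines_with_suffix(sample_lines, "455981,41668,70616", skip_line_number=2)
print(fused, duplicates)
```

For the full file, `sample_lines` would be replaced by an open file handle so that the lines are streamed rather than held in memory.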

My blocks file is 7.9 GB, so I have not attached it, but let me know if you need more information!

@zoubaihan

Hi, I am also running this tool. When I ran python controller.py, the following exception was raised:

search will be carried out with 2 nodes
loading previous run state
previous run state 1
current state: 1
flushing current state 1
running new command /mnt/hgfs/G/SourcererCC-master/clone-detector/restore-gtpm.sh
running new command /mnt/hgfs/G/SourcererCC-master/clone-detector/runnodes.sh init 1
Traceback (most recent call last):
  File "controller.py", line 180, in <module>
    controller.execute()
  File "controller.py", line 144, in execute
    raise ScriptControllerException("error during init.")
__main__.ScriptControllerException: error during init.

How can I resolve this error? Could you help me?
