
improve performance by using 2 hashes to avoid reading file multiple times #11

Open
GoogleCodeExporter opened this issue Nov 1, 2015 · 2 comments

Comments

@GoogleCodeExporter

Mem:   3087880k total,  3066536k used,    21344k free,    83468k buffers
Swap:  2104476k total,    29812k used,  2074664k free,   147640k cached
10518 xuefer    20   0 2387m 2.3g 2052 D    8 77.9   2:47.64 hardlink

CPU is not the problem; as you can see, CPU is at 8% while hardlink's memory usage is at 77.9% and keeps going up.
Do you know why it uses so much memory? I'm running it against 679433 files.
Maybe the filecmp module is caching the files being read?
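(Side note: filecmp does keep a module-level cache, but it stores comparison outcomes keyed by the two paths and their stat signatures, not file contents, so each entry is small. A minimal sketch of how to rule it out by clearing it periodically, assuming Python 3.4+ where filecmp.clear_cache() exists; older versions only expose the undocumented filecmp._cache dict.)

import filecmp

# ... after a batch of filecmp.cmp() calls ...
filecmp.clear_cache()      # Python 3.4+: drop cached comparison results
# filecmp._cache.clear()   # older Pythons: same effect via the private dict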

Anyway, it takes a very long time to complete; I don't think it's optimal.
I don't think sorting all files together by content is a good idea, because one file may be read multiple times due to the sorting algorithm, and the system-level file cache may get flushed when the data is bigger than the cache.

I would suggest that hardlink compute an MD5 or SHA-1 hash, like other de-duplication tools do. An MD5 digest takes 32 bytes as a hex string, or 16 bytes in binary, so 679433 * 32 / 1024 / 1024 ≈ 20 MB (plus dictionary and Python object overhead). Just before comparing two files byte-by-byte, it would compare their MD5 hashes:
import filecmp, hashlib, os
from collections import defaultdict

size_map = defaultdict(list)  # file size -> paths already seen
md5s = {}                     # path -> 16-byte digest, computed lazily ("JIT")

def md5(path):                # hash each file at most once, then reuse the digest
    if path not in md5s:
        md5s[path] = hashlib.md5(open(path, 'rb').read()).digest()
    return md5s[path]

def compare(file1, file2):    # hardlink() and regular_files as in the original sketch
    if md5(file1) != md5(file2):  # cheap reject before the byte-by-byte compare
        return
    if filecmp.cmp(file1, file2, shallow=False):
        hardlink(file1, file2)

for path in regular_files:
    size = os.path.getsize(path)
    if size_map[size]:            # compare only against the first file of this size, as in the original
        compare(path, size_map[size][0])
    size_map[size].append(path)
Let's see if it's faster when disk I/O is the bottleneck.

Original issue reported on code.google.com by [email protected] on 7 Feb 2012 at 3:48

@GoogleCodeExporter (Author)

btw, can you please handle the "too many links" error?
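(For illustration only, not the project's actual code: a minimal sketch of what handling this could look like, assuming the linking step uses os.link, which raises OSError with errno EMLINK when the target inode already has the maximum number of links. try_hardlink is a hypothetical helper name.)

import errno, os

def try_hardlink(src, dst):
    # hypothetical helper: report failure instead of crashing when the
    # filesystem's per-inode link limit ("Too many links") is reached
    try:
        os.link(src, dst)
        return True
    except OSError as e:
        if e.errno == errno.EMLINK:
            return False
        raise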

Original comment by [email protected] on 8 Feb 2012 at 12:21

@GoogleCodeExporter (Author)

I've written a patch for this; it's a significant performance improvement on my 
main use case (an rsnapshot-like backup partition).

John, please review and pull:

https://code.google.com/r/peterkolbus-betterperformance/source/detail?r=ca7d95fc0350453c22c5c7c017c6d54e51d83789

Original comment by [email protected] on 8 Jun 2013 at 5:30

Beurt added a commit to Beurt/hardlinkpy that referenced this issue Feb 12, 2017
The commit includes several changes meant to deal with directories containing huge numbers of files, including very large ones.

The commit includes patches that were proposed but never pulled into the project hosted on Google Code:
- dealing with the maximum hardlink count: JohnVillalovos#14 (included, with some reporting)
- using hashes for comparisons, which greatly (!!) improves performance, included from the patch in JohnVillalovos#11
- added a --min-size option (also greatly improves performance), including the patch from JohnVillalovos#13 (see the sketch after this list)
- raises an exception when out of memory instead of crashing
- added some more logging (in verbose mode 3)
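(Not part of the commit itself: a minimal sketch of the idea behind a --min-size filter, using hypothetical names (MIN_SIZE, eligible) and assuming sizes come from os.path.getsize during the directory walk.)

import os

MIN_SIZE = 1024  # e.g. --min-size=1024: ignore files smaller than 1 KiB

def eligible(path, min_size=MIN_SIZE):
    # hypothetical filter: files below the threshold are skipped up front,
    # so they are never hashed or byte-compared at all
    return os.path.getsize(path) >= min_size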