
improve performance by using 2 hashes to avoid reading file multiple times #11

Open
GoogleCodeExporter opened this issue Nov 1, 2015 · 2 comments

Comments

@GoogleCodeExporter

Mem:   3087880k total,  3066536k used,    21344k free,    83468k buffers
Swap:  2104476k total,    29812k used,  2074664k free,   147640k cached
10518 xuefer    20   0 2387m 2.3g 2052 D    8 77.9   2:47.64 hardlink

CPU is not the problem; as you can see, CPU is at 8% while hardlink's memory usage is at 77.9% and keeps going up.
Do you know why it uses so much memory? I'm running it against 679433 files.
Maybe the filecmp module is caching the files being read?
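(Side note: filecmp does keep a module-level cache, but it stores comparison outcomes keyed by the two paths and their stat signatures, not file contents, so each entry is small. A minimal sketch of how to rule it out by clearing it periodically, assuming Python 3.4+ where filecmp.clear_cache() exists; older versions only expose the undocumented filecmp._cache dict.)

import filecmp

# ... after a batch of filecmp.cmp() calls ...
filecmp.clear_cache()      # Python 3.4+: drop cached comparison results
# filecmp._cache.clear()   # older Pythons: same effect via the private dict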

Anyway, it takes a very long time to complete; I don't think it's optimal.
I don't think sorting all files together by content is a good idea, because one file may be read multiple times due to the sorting algorithm, and the system-level file cache may get flushed when the data is bigger than the cache.

I would suggest that hardlink compute an MD5 or SHA-1 hash, like other de-duplication tools do. An MD5 digest takes 32 bytes as a hex string, or 16 bytes in binary, so 679433 * 32 / 1024 / 1024 ≈ 20 MB (plus dictionary and Python object overhead). Just before comparing two files byte-by-byte, it would compare their MD5 hashes:
import filecmp, hashlib, os
from collections import defaultdict

size_map = defaultdict(list)  # file size -> paths already seen
md5s = {}                     # path -> 16-byte digest, computed lazily ("JIT")

def md5(path):                # hash each file at most once, then reuse the digest
    if path not in md5s:
        md5s[path] = hashlib.md5(open(path, 'rb').read()).digest()
    return md5s[path]

def compare(file1, file2):    # hardlink() and regular_files as in the original sketch
    if md5(file1) != md5(file2):  # cheap reject before the byte-by-byte compare
        return
    if filecmp.cmp(file1, file2, shallow=False):
        hardlink(file1, file2)

for path in regular_files:
    size = os.path.getsize(path)
    if size_map[size]:            # compare only against the first file of this size, as in the original
        compare(path, size_map[size][0])
    size_map[size].append(path)
Let's see if it's faster when disk I/O is the bottleneck.

Original issue reported on code.google.com by [email protected] on 7 Feb 2012 at 3:48

@GoogleCodeExporter (Author)

btw, can you please handle the "too many links" error?
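(For illustration only, not the project's actual code: a minimal sketch of what handling this could look like, assuming the linking step uses os.link, which raises OSError with errno EMLINK when the target inode already has the maximum number of links. try_hardlink is a hypothetical helper name.)

import errno, os

def try_hardlink(src, dst):
    # hypothetical helper: report failure instead of crashing when the
    # filesystem's per-inode link limit ("Too many links") is reached
    try:
        os.link(src, dst)
        return True
    except OSError as e:
        if e.errno == errno.EMLINK:
            return False
        raise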

Original comment by [email protected] on 8 Feb 2012 at 12:21

@GoogleCodeExporter (Author)

I've written a patch for this; it's a significant performance improvement on my 
main use case (an rsnapshot-like backup partition).

John, please review and pull:

https://code.google.com/r/peterkolbus-betterperformance/source/detail?r=ca7d95fc0350453c22c5c7c017c6d54e51d83789

Original comment by [email protected] on 8 Jun 2013 at 5:30

Beurt added a commit to Beurt/hardlinkpy that referenced this issue Feb 12, 2017
The commit includes several changes meant to deal with directories containing huge numbers of files, including very large ones.

The commit includes patches that were proposed but never pulled into the project hosted on Google Code:
- dealing with the maximum hardlink count: JohnVillalovos#14 (included, with some reporting)
- using hashes for comparisons, which greatly (!!) improves performance, included from the patch in JohnVillalovos#11
- added a --min-size option (also greatly improves performance), including the patch from JohnVillalovos#13 (see the sketch after this list)
- raises an exception when out of memory instead of crashing
- added some more logging (in verbose mode 3)
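(Not part of the commit itself: a minimal sketch of the idea behind a --min-size filter, using hypothetical names (MIN_SIZE, eligible) and assuming sizes come from os.path.getsize during the directory walk.)

import os

MIN_SIZE = 1024  # e.g. --min-size=1024: ignore files smaller than 1 KiB

def eligible(path, min_size=MIN_SIZE):
    # hypothetical filter: files below the threshold are skipped up front,
    # so they are never hashed or byte-compared at all
    return os.path.getsize(path) >= min_size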