Better parallelisation of exiftool for faster report generation #20

Open
fbuchinger opened this issue Sep 4, 2015 · 1 comment

@fbuchinger

In https://github.com/mattburns/exiftool.js-test/blob/master/test.js#L66 you invoke a new instance of exiftool for every image found. This is inefficient: starting exiftool carries a large overhead (Perl interpreter warm-up, module loading, ...), and we pay it again for every sample image.

Better approaches would be:
a) invoke exiftool once and let it do the batch processing (e.g. exiftool *.jpg -w .jpg.json); this might require some refactoring in the report generation
b) use exiftool's -stay_open option together with an ARGFILE to which we write the commands to run on each image. exiftool then stays in memory and executes the commands written to the ARGFILE until we write a terminate command there.
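
Approach a) could look roughly like the following in Python. This is only a sketch: it uses exiftool's -json output (one JSON array for all files) instead of the -w option from the example command, and the function names are made up for illustration.

```python
import json
import subprocess


def build_batch_command(paths, exiftool="exiftool"):
    """Build a single exiftool command line covering every file in `paths`."""
    # -json makes exiftool print one JSON array with one object per file.
    return [exiftool, "-json", *paths]


def index_by_source_file(json_text):
    """Map each file path to its tag dictionary."""
    # Each object in exiftool's JSON output carries a "SourceFile" key.
    return {entry["SourceFile"]: entry for entry in json.loads(json_text)}


def read_metadata_batch(paths, exiftool="exiftool"):
    """Run exiftool once for all files and return {path: tags}."""
    result = subprocess.run(
        build_batch_command(paths, exiftool),
        capture_output=True,
        text=True,
        check=False,  # assumption: a non-zero exit should not discard partial output
    )
    if not result.stdout.strip():
        return {}
    return index_by_source_file(result.stdout)
```

The report generation would then look up each image in the returned dictionary instead of spawning a process per image.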

Both approaches can bring speedups of up to 60 times compared to single-command invocation. Approach b) could perform even better, since we can prefork multiple "daemonized" instances of exiftool and share the work between them.
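
To make approach b) more concrete, here is a minimal Python sketch of the stay_open protocol, using stdin as the ARGFILE (-@ -). The class and method names are hypothetical; the exiftool arguments (-stay_open, -@, -execute, -json) and the {ready} marker are the documented ones. The `command` parameter exists only so that a different executable path can be supplied.

```python
import json
import subprocess

READY = "{ready}"  # exiftool prints this line after finishing each -execute


class StayOpenExifTool:
    """Keeps one exiftool process alive and feeds it commands over stdin."""

    def __init__(self, command=("exiftool",)):
        # "-@ -" tells exiftool to read its arguments from stdin, one per line.
        self.command = [*command, "-stay_open", "True", "-@", "-"]
        self.proc = None

    def __enter__(self):
        self.proc = subprocess.Popen(
            self.command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        return self

    def __exit__(self, *exc_info):
        # Writing "-stay_open" / "False" is the terminate command.
        self.proc.communicate("-stay_open\nFalse\n", timeout=30)

    def execute(self, *args):
        """Send one command and return everything printed before {ready}."""
        self.proc.stdin.write("\n".join(args) + "\n-execute\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line:
                raise RuntimeError("exiftool exited before signalling {ready}")
            if line.strip() == READY:
                return "".join(lines)
            lines.append(line)

    def metadata(self, path):
        """Return the tag dictionary for a single file."""
        return json.loads(self.execute("-json", path))[0]
```

Inside a `with StayOpenExifTool() as et:` block, calling `et.metadata(path)` for each image reuses the same Perl process, so the startup cost is paid only once.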

@fbuchinger
Author

I evaluated the performance of the different exiftool invocation options using pyexiftool, since it already has built-in support for exiftool's faster stay_open invocation.
I compared the following scenarios:

  • invoking one exiftool instance per image
  • exiftool's internal batch execution
  • "external" batch execution using stay_open mode
  • "external" batch execution with preforking multiple exiftool instances (multiprocessing.Pool in Python)
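
The last scenario can be sketched as follows. This is not the original test script: for brevity each worker here runs a single batch call rather than a persistent stay_open instance, the function names are hypothetical, and exiftool is assumed to be on the PATH.

```python
import json
import subprocess
from multiprocessing import Pool


def split_into_chunks(paths, n_chunks):
    """Deal paths round-robin into at most n_chunks non-empty lists."""
    chunks = [paths[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def extract_chunk(paths):
    """Worker: one exiftool batch call for the files in this chunk."""
    out = subprocess.run(
        ["exiftool", "-json", *paths], capture_output=True, text=True
    ).stdout
    return json.loads(out) if out.strip() else []


def extract_parallel(paths, workers=4):
    """Spread the files over a pool of worker processes and merge the results."""
    chunks = split_into_chunks(list(paths), workers)
    if not chunks:
        return []
    with Pool(processes=len(chunks)) as pool:
        results = pool.map(extract_chunk, chunks)
    return [entry for chunk in results for entry in chunk]
```

On platforms that start worker processes with spawn, `extract_parallel` has to be called from under an `if __name__ == "__main__":` guard.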

My results for the 20 sample images from the Acer directory:

Exiftool no batch took 6.37464756469 sec 
Exiftool internal batch took 0.590772722123 sec
Exiftool Stay Open/External batch took 0.575033621959 sec
Exiftool multiprocessing batch took 0.64755278114 sec

For the more complex sample images (more tags to decode) from the Nikon directory:

Exiftool no batch took 80.8621684399 sec
Exiftool internal batch took 3.93503120808 sec
Exiftool Stay Open/External batch took 4.23961249768 sec
Exiftool multiprocessing batch took 4.3239334162 sec

It turns out that using one of the batch modes brings a 10-20 times speedup, while multiprocessing is actually a bit slower (maybe because exiftool is mostly IO-bound). Note that these numbers might differ for node.js, since it is asynchronous by default, while Python is synchronous.
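
The speedup factors follow directly from the timings above; a small Python snippet to reproduce the arithmetic (the dictionary layout and function name are just for illustration):

```python
# Timings in seconds, copied from the measurements above.
timings = {
    "Acer": {
        "no batch": 6.37464756469,
        "internal batch": 0.590772722123,
        "stay_open": 0.575033621959,
        "multiprocessing": 0.64755278114,
    },
    "Nikon": {
        "no batch": 80.8621684399,
        "internal batch": 3.93503120808,
        "stay_open": 4.23961249768,
        "multiprocessing": 4.3239334162,
    },
}


def speedups(runs, baseline="no batch"):
    """Return how many times faster each mode is than the baseline."""
    base = runs[baseline]
    return {name: base / secs for name, secs in runs.items() if name != baseline}


for directory, runs in timings.items():
    for name, factor in speedups(runs).items():
        print(f"{directory:5} {name:15} {factor:5.1f}x")
```

For the Acer images the batch modes come out at roughly 10-11 times faster than one instance per image, for the Nikon images at roughly 19-21 times.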

Conclusion: it definitely makes sense to use the exiftool stay_open mode in the node.js test scripts instead of firing up one instance per image.

See my Python test script and this pyexiftool issue for more context.
