`cfb_add` and write performance issues #2
On performance: none of our tests or workflows deal with hundreds of files, let alone thousands of files. The function that sees the most write activity reshapes XLS VBA blobs to XLSB form, and the most extreme case I've seen involved about 250 extremely small files, so it's not at all surprising that there are ways to improve performance when adding a large number of files. When this was built, node was still in the
Contributions would be awesome :). To start, we'd accept a PR that just removed the call to

P.S.: Here's a throwback issue about node function performance nodejs/node-v0.x-archive#7809 affecting the
Great, thank you for the quick response and background info. I'm sure that my use case is outside of what was originally intended / tested, so no fault to the library for not being optimized just for me :). Am I correct in thinking that `cfb.flow.js` is the primary source file, and the other `.js` files are derived from it in a build step?
The primary source files are the
- `unsafe` option to `cfb_add` for bulk write (see #2)
- use `lastIndexOf` to save operations in BFP queue
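The `lastIndexOf` changelog item can be illustrated abstractly. This is a hypothetical sketch of the general technique, not the actual js-cfb internals: when de-duplicating pushes into a breadth-first path queue, newly seen paths are most likely to duplicate recently pushed ones, so scanning from the end with `lastIndexOf` tends to terminate sooner than `indexOf` scanning from the front.

```javascript
// Hypothetical sketch (not the actual js-cfb code): de-duplicate pushes
// into a breadth-first path queue. lastIndexOf scans from the end, where
// a duplicate of a recently pushed path is found almost immediately.
function enqueue(queue, path) {
  // only push the path if it is not already queued
  if (queue.lastIndexOf(path) === -1) queue.push(path);
}

var queue = [];
["/a/", "/a/b/", "/a/", "/a/b/c/", "/a/b/"].forEach(function (p) {
  enqueue(queue, p);
});
console.log(queue); // each path appears exactly once
```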
@rossj we just pushed 1.0.6 with the first part guarded behind the option:

```js
CFB.utils.cfb_add(cfb, path, content, {unsafe:true});
```

RunKit unfortunately imposes a memory limit, but https://runkit.com/5acb0cf21599f20012a3e001/5acb0cf2aeee9400120ba682 should demonstrate 4000 files. It uses a small test script that adds 5000-byte files to the FAT and 500-byte files to the mini FAT:

```js
var CFB = require('./');
var cfb = CFB.utils.cfb_new();
var cnt = 20000;

console.log("alloc", new Date());
var bufs = [];
for(var i = 0; i < cnt; ++i) bufs[i] = [Buffer.alloc(500, i), Buffer.alloc(5000, i)];

console.log("start", new Date());
for(var i = 0; i < cnt; ++i) {
	if(!(i%100)) console.log(i, new Date());
	CFB.utils.cfb_add(cfb, "/short/" + i.toString(16), bufs[i][0], {unsafe:true});
	CFB.utils.cfb_add(cfb, "/long/" + i.toString(16), bufs[i][1], {unsafe:true});
}

console.log("prewrite", new Date());
CFB.utils.cfb_gc(cfb);
CFB.writeFile(cfb, "out.bin");
console.log("done", new Date());

var cfb2 = CFB.read("out.bin", {type:"file"});
console.log("read", new Date());
for(var i = 0; i < cnt; i += 100) {
	var file = CFB.find(cfb2, "/short/" + i.toString(16));
	if(0 != Buffer.compare(file.content, bufs[i][0])) throw new Error("short " + i);
	file = CFB.find(cfb2, "/long/" + i.toString(16));
	if(0 != Buffer.compare(file.content, bufs[i][1])) throw new Error("long " + i);
}
```
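The reason `unsafe:true` pays off can be sketched with an abstract cost model (this is an illustration of the asymptotics, not js-cfb code): consolidating after every add touches every existing entry each time, which is quadratic in the number of files, while deferring the consolidation to a single pass at the end is linear.

```javascript
// Abstract cost model (not the actual js-cfb implementation): count how
// many directory entries are "touched" when adding n files, comparing a
// rebuild-after-every-add strategy against a single deferred rebuild.
function cost(n, rebuildEveryAdd) {
  var touched = 0;
  for (var i = 1; i <= n; ++i) {
    touched += 1;                      // the add itself
    if (rebuildEveryAdd) touched += i; // rebuild scans all i entries so far
  }
  if (!rebuildEveryAdd) touched += n;  // one final rebuild scans n entries
  return touched;
}

console.log(cost(4000, true));  // ~8 million entries touched (O(n^2))
console.log(cost(4000, false)); // 8000 entries touched (O(n))
```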
@rossj Before a new release is cut, is there any other change you recommend?
First, sorry for being slow with my PRs. I just submitted a change that uses
Hi there,
I'm working on a program that converts .pst files to .msg files, primarily in Node but also in the browser, and it uses this library in a very write-heavy way when saving the .msg files. Through testing and profiling, I've noticed a couple of write-related performance issues that I wanted to share.

With some modifications, I've been able to reduce the output generation time of my primary "large" test case (4300 .msg files from 1 .pst) by a factor of 8, from about 16 minutes to 2 minutes (running on Node).

The 1st issue, which may just be a matter of documentation, is that using `cfb_add` repeatedly to add all streams to a new doc is very slow, as it calls `cfb_gc` and `cfb_rebuild` every time. We switched from using `cfb_add` to directly pushing to `cfb.FileIndex` and `cfb.FullPaths` (and then calling `cfb_rebuild` once at the end), which reduced the output time from 16 minutes to 3.5 minutes.

The 2nd issue is that the `_write` and `WriteShift` functions do not utilize `Buffer` capabilities when it is available. By using `Buffer.alloc()` for the initial creation, which guarantees a 0-filled initialization, along with `Buffer.copy` for content streams, `Buffer.write` for hex / utf16le strings, and `Buffer`'s various write int / uint methods, we were able to further reduce the output time from 3.5 minutes to 2 minutes.

If you wish, I would be happy to share my changes, or to work on a pull request which uses `Buffer` functions when available. My current changes don't do any feature detection and instead rely on `Buffer` being available (even in the browser we use feross/buffer), so it would need some more work to maintain functionality in non-`Buffer` environments.

Thanks
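The Buffer techniques described in the 2nd issue can be shown in a small standalone sketch. The function name and record layout here are hypothetical illustrations, not the library's actual `_write`/`WriteShift` code: each step uses a native `Buffer` method (`alloc`, `writeUInt32LE`, `write`, `copy`) in place of byte-by-byte loops over a plain array.

```javascript
// Hypothetical illustration of the Buffer techniques described above
// (not the actual js-cfb internals). Builds a tiny record: a 4-byte
// little-endian length, a utf16le-encoded name, then a raw content blob.
function writeRecord(name, content) {
  var nameBytes = Buffer.byteLength(name, "utf16le");
  // Buffer.alloc zero-fills the backing memory in one native call,
  // instead of looping to zero each byte of a plain array.
  var out = Buffer.alloc(4 + nameBytes + content.length);
  var pos = 0;
  out.writeUInt32LE(nameBytes, pos); pos += 4;  // native uint write
  pos += out.write(name, pos, "utf16le");       // native string encode
  content.copy(out, pos);                       // native memcpy for blobs
  return out;
}

var rec = writeRecord("hi", Buffer.from([1, 2, 3]));
console.log(rec.length);          // 4 + 4 + 3 = 11
console.log(rec.readUInt32LE(0)); // 4
```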