Sometimes you need to parse a large XML file, extract a repeated element, and process each occurrence sequentially as a NodeJS stream. Existing solutions are either too complex, depend on native libraries, or are too low-level, so I made this package for those who need simple stream-based XML parsing. For now, it is based on the sax module and lets you specify the path to the elements you want to extract using simple dot notation (not XPath!).
Use it if you don't need maximum performance and don't want to deal with the low-level details of the sax library.
If you need something more complicated, consider using sax or another library directly.
- Based on the NodeJS Transform stream, so it is easy to use with existing solutions
- The stream emits chunks of text, each containing the XML of one matched element
- Extremely simple, so you can use your favorite XML library, such as xmldoc, to parse each element
- Only one external dependency without nested dependencies - sax
- TypeScript types out of the box
Imagine you have a large XML file with the following structure and you want to process each elt element sequentially:
<root>
  <elt foo="1">First element</elt>
  <elt foo="2">Second element</elt>
  <elt foo="3">Third element</elt>
  <!-- Several million more rows -->
</root>
const fs = require('fs');
const { XmlStream } = require('xmlkit');

(async function () {
  // Create the stream pipeline
  const input = fs.createReadStream('./10gbfile.xml');
  const parser = new XmlStream('root.elt');
  const xml = input.pipe(parser);

  // The stream's data is consumed only once, so pick ONE of the options below.

  // Option 1: iterate using the async iterable interface
  for await (const chunk of xml) {
    // Each chunk contains an XML string representing the current elt node
    console.log(chunk);
  }

  // Option 2: use the stream event interface
  // xml
  //   .on('data', (chunk) => console.log(chunk))
  //   .on('end', () => console.log('all xml loaded'));

  // Option 3: pipe the extracted elements to another file
  // xml.pipe(fs.createWriteStream('./result.xml'));
})();
The XmlStream constructor creates a parser instance.
It takes a selector string as its first parameter, which specifies the elements
to extract in dot notation, for example root.elt, where root and elt are element names.
ONLY this simple format is supported. It is NOT XPath, so only element
selectors are supported.
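To make the dot notation concrete, here is a minimal sketch of how such a selector could be matched against the stack of currently open elements. The helpers parseSelector and matchesSelector are hypothetical names used for illustration only; they are not part of this package's API, and the sketch assumes the selector is an absolute path starting at the document root.

```javascript
// Hypothetical helpers, NOT the package's API: they only illustrate how a
// dot-notation selector can be compared with the stack of open elements.

// Split "root.elt" into ["root", "elt"], rejecting empty segments.
function parseSelector(selector) {
  const names = selector.split('.');
  if (names.some((name) => name.length === 0)) {
    throw new Error(`Invalid selector: "${selector}"`);
  }
  return names;
}

// Assumption: the selector is anchored at the document root, so it matches
// only when the open-element stack equals the selector exactly.
function matchesSelector(selectorNames, openElements) {
  return (
    selectorNames.length === openElements.length &&
    selectorNames.every((name, i) => name === openElements[i])
  );
}

const selector = parseSelector('root.elt');
console.log(matchesSelector(selector, ['root', 'elt']));        // true
console.log(matchesSelector(selector, ['root']));               // false
console.log(matchesSelector(selector, ['root', 'elt', 'sub'])); // false
```

A SAX-style parser could push an element name on every open tag and pop it on every close tag, calling a check like this to decide when to start and stop collecting output.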
The data event is emitted for each extracted node matched by the selector passed to the constructor. Each chunk is just a string containing the XML of the matched element, so you can use any XML parser you like to process it further.
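Because every chunk is a complete element, per-chunk processing can be an ordinary function. The sketch below turns chunks from the sample document into plain objects. The eltToRecord helper is a hypothetical name, and its regular expression handles only the exact shape of the sample elt elements; for real data a proper XML parser is the safer choice. The chunks array stands in for what the stream would emit.

```javascript
// Hypothetical helper for the sample document only. The regular expression
// matches the exact <elt foo="...">text</elt> shape and is not a general
// XML parser.
function eltToRecord(chunk) {
  // String() also covers the case where a chunk arrives as a Buffer.
  const text = String(chunk).trim();
  const match = /^<elt\s+foo="([^"]*)"\s*>([^<]*)<\/elt>$/.exec(text);
  if (match === null) {
    throw new Error(`Unexpected chunk: ${text}`);
  }
  return { foo: Number(match[1]), text: match[2] };
}

// Stand-in for the chunks the stream would emit for the sample document.
const chunks = [
  '<elt foo="1">First element</elt>',
  '<elt foo="2">Second element</elt>',
];

for (const chunk of chunks) {
  console.log(eltToRecord(chunk));
}
```

With the real stream, the same function could be called inside the for await loop or the data handler.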
Working with XML in NodeJS is a pain, so if this package gains any substantial popularity, I will consider adding other useful XML tools here, since the package name is flexible enough to fit anything related to XML.
Feel free to open a discussion or file a PR.