Parsing RSS in an event-driven fashion #177

erayerdin · 2023-08-16T05:29:23Z

I'm parsing RSS feeds, which can get huge at times. I've seen that this library can also parse XML lazily, which I would expect to have a smaller memory and CPU footprint.

An RSS feed has multiple item elements. Here is one example:

<item>
    <title><![CDATA[ Asit Yağmuru Nedir? Zararlı Bir Doğa Olayı Olan Asit Yağmuru Nasıl Önlenebilir? ]]></title>
    <link>https://evrimagaci.org/asit-yagmuru-nedir-zararli-bir-doga-olayi-olan-asit-yagmuru-nasil-onlenebilir-14960</link>
    <guid isPermaLink="true">https://evrimagaci.org/s/14960</guid>
    <description><![CDATA[ En zararlı doğa olaylarından biri olan asit yağmurları, asidik sera gazlarının su ile oluşturduğu zararlı bileşiklerin yağış yoluyla yeryüzüne ulaşmasıdır.[1] Asit yağmurları, pH'ı genellikle 4,2 ila 4,4 arasında değişen yağmurlardır.[2]Normal bir yağmur suyunun pH'ı 5,6 civarındadır. Havada oluşan… ]]></description>
    <dc:creator>Yusuf Taha Yılmaz</dc:creator>
    <pubDate>Fri, 30 Jun 2023 23:20:05 +0300</pubDate>
    <atom:updated>2023-06-30T23:20:05.000+03:00</atom:updated>
    <media:content url="https://evrimagaci.org/public/content_media/b13b8afb52d4d4a901235883fdc14633.jpeg"/>
</item>

So my reasoning is that when I hit an item while iterating lazily, I need to somehow convert it to an XmlNode so that I can get its children, such as title or link.

So far, my humble implementation (lol) is this:

final events = parseEvents(raw ?? '', withBuffer: true);
final items = events.whereType<XmlStartElementEvent>().where((element) => element.name == 'item');

...which I'm not sure is correct.

I'm assuming I'm hitting an item element lazily here. Technically it's an XmlStartElementEvent, I guess? That's what the name suggests.

After this point, I'm not sure how to proceed. How can I convert this to an XmlNode so that I can read its children?

Thanks in advance.

Answered by renggli

renggli · 2023-08-16T06:25:17Z

renggli
Aug 16, 2023
Maintainer

Your assumptions are good, and the code would give you the <item> start event of the input, but nothing else. To make it work, you would need to manually iterate over the events, detect the start of <item>, and then build the XmlNode until you reach the matching </item>. Since you need to remember state during the iteration, this unfortunately doesn't work well with the built-in iterator methods such as where, but you can get it to work with something along the lines of:

var started = false;
for (final event in parseEvents(input)) {
   if (event is XmlStartElementEvent && event.name == 'item') {
      started = true;
   } else if (event is XmlEndElementEvent && event.name == 'item') {
      started = false;
   } else if (started) {
      print(event);  // an event that is part of `<item>...</item>`
   }
}
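
To go one step further and actually build nodes inside that loop, here is a minimal sketch. It assumes that `XmlNodeDecoder` from `package:xml/xml_events.dart` can be used as a plain synchronous converter on a list of buffered events. `parseItems` is a hypothetical helper name and the sample feed is made up; nested or self-closing `<item>` elements are not handled.

```dart
import 'package:xml/xml.dart';
import 'package:xml/xml_events.dart';

/// Lazily yields one XmlElement per `<item>` found in [input].
/// Hypothetical helper, not part of the library.
Iterable<XmlElement> parseItems(String input) sync* {
  const decoder = XmlNodeDecoder();
  List<XmlEvent>? buffer; // non-null only while inside an <item>
  for (final event in parseEvents(input)) {
    if (event is XmlStartElementEvent &&
        event.name == 'item' &&
        !event.isSelfClosing) {
      // Start buffering, including the start event itself.
      buffer = [event];
    } else if (buffer != null) {
      buffer.add(event);
      if (event is XmlEndElementEvent && event.name == 'item') {
        // The buffer now holds one complete subtree; turn it into nodes.
        yield* decoder.convert(buffer).whereType<XmlElement>();
        buffer = null;
      }
    }
  }
}

void main() {
  const input = '<rss><channel>'
      '<item><title><![CDATA[First]]></title>'
      '<link>https://example.com/1</link></item>'
      '<item><title>Second</title>'
      '<link>https://example.com/2</link></item>'
      '</channel></rss>';
  for (final item in parseItems(input)) {
    final title = item.getElement('title')?.innerText;
    final link = item.getElement('link')?.innerText;
    print('$title -> $link');
  }
}
```

Only the events of the current item are held in memory at any time, but the raw input string is still fully loaded.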

Since this is such a common operation, there are helpers that do exactly that. However, they work on Stream instances, which lets you asynchronously download, parse, and process the input without ever keeping the full raw input in memory:

final source = Stream.value(input);   // could be an async download or file-read, see documentation for examples
await source.toXmlEvents()
  .normalizeEvents()
  .selectSubtreeEvents((event) => event.name == 'item')   // only select the events within <item> 
  .toXmlNodes()   // convert the remaining events to XmlNode
  .expand((nodes) => nodes)   // convert `Stream<List<XmlNode>>` to `Stream<XmlNode>`
  .forEach((node) => print(node));   // print each `XmlNode`

The documentation is here; the sitemap.xml example is very similar to what you want to do.
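
For the RSS case specifically, the same pipeline could be fed from a file and used to read each item's children. This is only a sketch: `streamItems` is a hypothetical helper name, `feed.xml` is a placeholder path, and the input is assumed to be UTF-8 encoded.

```dart
import 'dart:convert';
import 'dart:io';

import 'package:xml/xml.dart';
import 'package:xml/xml_events.dart';

/// Turns a stream of decoded text chunks into a stream of `<item>` elements.
/// Hypothetical helper, not part of the library.
Stream<XmlElement> streamItems(Stream<String> chunks) => chunks
    .toXmlEvents()
    .normalizeEvents()
    .selectSubtreeEvents((event) => event.name == 'item')
    .toXmlNodes()
    // Flatten `Stream<List<XmlNode>>` and keep only the element nodes.
    .expand((nodes) => nodes.whereType<XmlElement>());

Future<void> main() async {
  // Placeholder path; an HTTP response body stream would work the same way.
  final chunks = File('feed.xml').openRead().transform(utf8.decoder);
  await for (final item in streamItems(chunks)) {
    // CDATA content is included in innerText; trim the padding around it.
    final title = item.getElement('title')?.innerText.trim();
    final link = item.getElement('link')?.innerText;
    print('$title -> $link');
  }
}
```

Each item arrives as a regular XmlElement, so the usual DOM accessors such as getElement and innerText apply from there.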

1 reply

erayerdin Aug 16, 2023
Author

Thank you for the quick reply. Now I understand it a little better.
