How to handle HTML-encoded fields with a semicolon delimiter #2028

MithrilMan · 2022-09-12T09:59:29Z

I have to import a feed that uses a semicolon as the delimiter, and unfortunately its data is very messy.
For instance, one record contains HTML-encoded content:

54166;PERFILISTATION SONNINO;Agip Eni;Stradale;PERFILISTATION SONNINO;VIA S.S. 699 "DELL&#039;ABBAZIA DI FOSSANOVA" KM 9+850 SNC 04010, SONNINO (LT) snc 04010;SONNINO;LT;41.40569808543889;13.203876614570618

The single field `VIA S.S. 699 "DELL&#039;ABBAZIA DI FOSSANOVA" KM 9+850 SNC 04010, SONNINO (LT) snc 04010` contains an HTML-encoded apostrophe, `&#039;`, and the semicolon that ends the entity makes the parser cut the field off at that point, which corrupts the rest of the record.

How should I handle this case?
I'd like a way to "pre-parse" each raw row so I can HTML-decode it before the parser kicks in, but I can't find a way to do so. Is it possible?

AltruCoder · 2022-09-12T18:15:18Z

About the only thing I can think of is to use CsvParser as a base and create your own Parser that would handle the HTML entities.

JoshClose (Maintainer) · 2022-09-12T18:58:03Z

Process the HTML entities before running the data through CsvHelper. If an entity decodes to a `;` or a `"`, you'll need to replace it with something else and handle that during the CSV parsing. You can probably use https://anglesharp.github.io/ to process the HTML.
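As a rough illustration of that pre-processing step, here is a minimal sketch. It uses `System.Net.WebUtility.HtmlDecode` from the .NET base library rather than AngleSharp, and the `EntityPreprocessor` class, its method names, and the two placeholder characters are all made up for this example. It assumes you read the feed line by line and that the private-use characters U+E000 and U+E001 never occur in the real data.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of CsvHelper.
public static class EntityPreprocessor
{
    // Assumed-unused private-use characters standing in for a decoded ';' or '"'.
    public const char DelimiterPlaceholder = '\uE000';
    public const char QuotePlaceholder = '\uE001';

    // Matches decimal (&#039;), hexadecimal (&#x27;) and named (&amp;) entities.
    private static readonly Regex Entity = new Regex(
        @"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);",
        RegexOptions.Compiled);

    // Run on each raw line before handing it to the CSV parser.
    public static string DecodeLine(string line)
    {
        return Entity.Replace(line, match =>
        {
            // Unknown names come back unchanged from HtmlDecode.
            string decoded = WebUtility.HtmlDecode(match.Value);
            switch (decoded)
            {
                case ";": return DelimiterPlaceholder.ToString();
                case "\"": return QuotePlaceholder.ToString();
                default: return decoded;
            }
        });
    }

    // Run on each field value after parsing to put ';' and '"' back.
    public static string RestoreField(string field)
    {
        return field
            .Replace(DelimiterPlaceholder, ';')
            .Replace(QuotePlaceholder, '"');
    }
}
```

You would pass every line through `DecodeLine`, feed the result to CsvHelper, and call `RestoreField` on the string fields afterwards. This sketch does not deal with entities that decode to a line break, so treat it as a starting point only.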

AltruCoder · 2022-09-13T20:56:06Z

I'm sure @JoshClose will tell me this is not the best way to do it and it is rife with gotchas, but I modified the Parser to skip ';' as a delimiter when it is part of an HTML entity. You also have to make a copy of the FieldCache class since it is an internal class to CsvHelper.
Here is the code I added to the Parser. It checks for numeric and hex entities as well as a few of the most common named entities.

        // Requires: using System; using System.Globalization; using System.Linq;
        private bool startHtmlEntity = false;
        private bool startNumericHtmlEntity = false;
        private bool skipDelimiter = false;
        private char[] htmlEntitybuffer = new char[10];
        private int htmlEntityPosition = 0;
        private readonly string[] htmlEntities = new string[]
        {
            "nbsp",
            "lt",
            "gt",
            "amp",
            "quot",
            "apos",
            "cent",
            "pound",
            "yen",
            "euro",
            "copy",
            "reg"
        };

        // Tracks a candidate entity ("&name;", "&#39;", "&#x27;") and flags its
        // terminating ';' so that ReadLine does not treat it as a delimiter.
        private void CheckForHtmlEntity(ref char c, ref char cPrev)
        {
            if (delimiter != ";")
            {
                return;
            }

            if (c == ';')
            {
                // Recompute on every ';' so a stale flag cannot swallow a real delimiter.
                skipDelimiter = startHtmlEntity && IsHtmlEntity();
                ResetStartHtmlEntity();
            }
            else if (c == '&')
            {
                // A new '&' always restarts the candidate entity.
                ResetStartHtmlEntity();
                startHtmlEntity = true;
            }
            else if (startHtmlEntity && c == '#' && cPrev == '&')
            {
                startNumericHtmlEntity = true;
            }
            else if (startHtmlEntity && char.IsLetterOrDigit(c))
            {
                AddToHtmlEntityBuffer(c);
            }
            else if (startHtmlEntity)
            {
                // A space, quote, etc. cannot be part of an entity.
                ResetStartHtmlEntity();
            }
        }

        private void ResetStartHtmlEntity()
        {
            startHtmlEntity = false;
            startNumericHtmlEntity = false;
            htmlEntityPosition = 0;
        }

        private void AddToHtmlEntityBuffer(char c)
        {
            if (htmlEntityPosition >= htmlEntitybuffer.Length)
            {
                var newSize = htmlEntitybuffer.Length * 2;
                Array.Resize(ref htmlEntitybuffer, newSize);
            }

            htmlEntitybuffer[htmlEntityPosition] = c;

            htmlEntityPosition++;
        }

        private bool IsHtmlEntity()
        {
            var possiblyEntity = new string(htmlEntitybuffer, 0, htmlEntityPosition);

            if (!startNumericHtmlEntity)
            {
                return htmlEntities.Contains(possiblyEntity);
            }

            // Decimal form, e.g. &#039;
            if (int.TryParse(possiblyEntity, NumberStyles.None, CultureInfo.InvariantCulture, out _))
            {
                return true;
            }

            // Hexadecimal form, e.g. &#x27;
            return possiblyEntity.StartsWith("x", StringComparison.OrdinalIgnoreCase)
                && int.TryParse(possiblyEntity[1..], NumberStyles.HexNumber, CultureInfo.InvariantCulture, out _);
        }

You add CheckForHtmlEntity in the ReadLine method right after changing c to the next character in the buffer.

cPrev = c;
c = buffer[bufferPosition];
bufferPosition++;
charCount++;
if (countBytes)
{
    byteCount += encoding.GetByteCount(new char[] { c });
}

CheckForHtmlEntity(ref c, ref cPrev);  // Add Here!

Then you check skipDelimiter further down in the ReadLine method, where it tests for c == delimiterFirstChar:

if (c == delimiterFirstChar)
{
    if (skipDelimiter)   // Add Here!
    {
        skipDelimiter = false;
        continue;
    }

    state = ParserState.Delimiter;

You would use it like this.

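// Requires: using System.Globalization; using System.IO;
//           using CsvHelper; using CsvHelper.Configuration;
// HtmlEntityParser is the modified copy of CsvParser described above.
// Foo is assumed to be a class with Id and Name properties.
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    Delimiter = ";",
};

using var reader = new StringReader("Id;Name\n1;Record&#039;1");
using var parser = new HtmlEntityParser(reader, config);
using var csv = new CsvReader(parser);

// GetRecords is lazy, so enumerate it before the reader is disposed.
var result = csv.GetRecords<Foo>();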
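
If you want to try the entity-skipping idea without patching CsvHelper, the same logic can be sketched as a small stand-alone splitter. Everything here (`EntityAwareSplitter`, `Split`, `IsEntityName`) is a hypothetical name for illustration only. The sketch ignores quoting and escaping entirely, so it is not a replacement for the parser, just a way to test which semicolons would be skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical stand-alone splitter, not part of CsvHelper.
public static class EntityAwareSplitter
{
    // Same short list of named entities as in the parser modification.
    private static readonly HashSet<string> NamedEntities = new HashSet<string>
    {
        "nbsp", "lt", "gt", "amp", "quot", "apos",
        "cent", "pound", "yen", "euro", "copy", "reg"
    };

    // Splits on ';' except where the ';' terminates an HTML entity.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        int entityStart = -1; // index in 'field' of the pending '&', or -1

        foreach (char c in line)
        {
            if (c == ';')
            {
                bool isEntity = entityStart >= 0 && IsEntityName(
                    field.ToString(entityStart + 1, field.Length - entityStart - 1));
                entityStart = -1;
                if (!isEntity)
                {
                    // A real delimiter: close the current field.
                    fields.Add(field.ToString());
                    field.Clear();
                    continue;
                }
            }
            else if (c == '&')
            {
                // A new '&' always restarts the candidate entity.
                entityStart = field.Length;
            }
            else if (entityStart >= 0 && c != '#' && !char.IsLetterOrDigit(c))
            {
                // A space, quote, etc. cannot be part of an entity.
                entityStart = -1;
            }

            field.Append(c);
        }

        fields.Add(field.ToString());
        return fields;
    }

    // 'name' is the text between '&' and ';', e.g. "amp", "#039" or "#x27".
    private static bool IsEntityName(string name)
    {
        if (name.StartsWith("#x", StringComparison.OrdinalIgnoreCase))
        {
            return int.TryParse(name.Substring(2), NumberStyles.AllowHexSpecifier,
                CultureInfo.InvariantCulture, out _);
        }

        if (name.StartsWith("#", StringComparison.Ordinal))
        {
            return int.TryParse(name.Substring(1), NumberStyles.None,
                CultureInfo.InvariantCulture, out _);
        }

        return NamedEntities.Contains(name);
    }
}
```

Calling `EntityAwareSplitter.Split("1;Record&#039;1")` should give two fields, with the entity left encoded in the second one, so you would still HTML-decode each field afterwards.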
