How to handle HTML-encoded fields with a semicolon delimiter #2028
Replies: 3 comments
-
About the only thing I can think of is to use CsvParser as a base and create your own parser that handles the HTML entities.
-
Process the HTML encodings before running it through CsvHelper. If it comes out as a
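A minimal sketch of that pre-processing step, assuming the whole feed fits in memory. `EntityPreprocessor` and `Decode` are hypothetical names for illustration, not part of CsvHelper; `WebUtility.HtmlDecode` is the standard .NET helper. Note the catch: an entity that decodes to `;` or `"` (for example `&#59;` or `&quot;`) will afterwards look like a delimiter or a quote to the parser, so this only helps when the feed does not contain such entities.

```csharp
using System;
using System.Net;

public static class EntityPreprocessor
{
    // Decodes HTML entities in the raw CSV text before any parsing happens.
    // Hypothetical helper: the name and shape are assumptions for this sketch.
    public static string Decode(string rawCsv)
    {
        // HtmlDecode handles named (&amp;), decimal (&#39;) and hex (&#x27;) entities
        // and leaves unknown names untouched.
        return WebUtility.HtmlDecode(rawCsv);
    }
}
```

The decoded text could then be wrapped in a `StringReader` and handed to `CsvReader` with `Delimiter = ";"`, for example `new StringReader(EntityPreprocessor.Decode(rawText))`.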
-
I'm sure @JoshClose will tell me this is not the best way to do it and that it is rife with gotchas, but I modified the parser to skip ';' as a delimiter when it is part of an HTML entity. You also have to make a copy of the

    // Needs: using System; using System.Globalization; using System.Linq;
    private bool startHtmlEntity = false;
    private bool startNumericHtmlEntity = false;
    private bool skipDelimiter = false;
    private char[] htmlEntitybuffer = new char[10];
    private int htmlEntityPosition = 0;

    // Named entities recognized by the parser (without the leading '&' and trailing ';').
    private readonly string[] htmlEntities = new string[]
    {
        "nbsp",
        "lt",
        "gt",
        "amp",
        "quot",
        "apos",
        "cent",
        "pound",
        "yen",
        "euro",
        "copy",
        "reg"
    };

    private void CheckForHtmlEntity(ref char c, ref char cPrev)
    {
        if (startHtmlEntity && c == ';')
        {
            // ';' ends the candidate: skip it as a delimiter only for a known entity.
            if (IsHtmlEntity())
            {
                skipDelimiter = true;
            }
            ResetStartHtmlEntity();
        }
        else if (startHtmlEntity && c == '#' && cPrev == '&')
        {
            startNumericHtmlEntity = true;
        }
        else if (startHtmlEntity && (c == '&' || c == ' '))
        {
            // Not an entity after all; a fresh '&' starts a new candidate.
            ResetStartHtmlEntity();
            startHtmlEntity = c == '&';
        }
        else if (startHtmlEntity)
        {
            AddToHtmlEntityBuffer(c);
        }
        else if (c == '&' && delimiter == ";")
        {
            // Only track entities when ';' is the configured delimiter.
            startHtmlEntity = true;
        }
    }

    private void ResetStartHtmlEntity()
    {
        startHtmlEntity = false;
        startNumericHtmlEntity = false; // otherwise the numeric flag leaks into the next entity
        htmlEntityPosition = 0;
        htmlEntitybuffer = new char[10];
    }

    private void AddToHtmlEntityBuffer(char c)
    {
        if (htmlEntityPosition >= htmlEntitybuffer.Length)
        {
            var newSize = htmlEntitybuffer.Length * 2;
            Array.Resize(ref htmlEntitybuffer, newSize);
        }
        htmlEntitybuffer[htmlEntityPosition] = c;
        htmlEntityPosition++;
    }

    private bool IsHtmlEntity()
    {
        var possiblyEntity = new string(htmlEntitybuffer, 0, htmlEntityPosition);
        if (htmlEntities.Contains(possiblyEntity))
            return true;
        if (startNumericHtmlEntity)
        {
            var isInteger = int.TryParse(possiblyEntity, out _);
            if (isInteger)
                return true;
            if (possiblyEntity.StartsWith("x", StringComparison.OrdinalIgnoreCase))
            {
                var isHexadecimal = int.TryParse(possiblyEntity[1..], NumberStyles.HexNumber, null, out _);
                if (isHexadecimal)
                    return true;
            }
        }
        return false;
    }

You add the CheckForHtmlEntity call right after the parser reads the next character:

    cPrev = c;
    c = buffer[bufferPosition];
    bufferPosition++;
    charCount++;
    if (countBytes)
    {
        byteCount += encoding.GetByteCount(new char[] { c });
    }
    CheckForHtmlEntity(ref c, ref cPrev); // Add here!

Then you check skipDelimiter where the parser detects the delimiter:

    if (c == delimiterFirstChar)
    {
        if (skipDelimiter) // Add here!
        {
            skipDelimiter = false;
            continue;
        }
        state = ParserState.Delimiter;
        // ...

You would use it like this.

    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        Delimiter = ";",
    };
    using var reader = new StringReader("Id;Name\n1;Record'1");
    using var parser = new HtmlEntityParser(reader, config);
    using var csv = new CsvReader(parser);
    var result = csv.GetRecords<Foo>();
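
For anyone who wants to try the idea without copying the parser, here is a rough standalone sketch of the same technique: split on ';' unless that ';' terminates something that looks like an HTML entity. `EntityAwareSplitter` and `Split` are hypothetical names, not CsvHelper APIs. Unlike the parser change above, it accepts any alphanumeric entity name rather than a fixed list, so a field that really ends in something like `&abc` would be joined with the next one. It also ignores quoting and re-scans the current field on every ';', so treat it as an illustration rather than a replacement for a real parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class EntityAwareSplitter
{
    // True when the text collected so far ends with "&name", "&#123" or "&#x1F",
    // i.e. the next ';' would close an entity instead of ending the field.
    private static readonly Regex EntityTail =
        new Regex(@"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)\z");

    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        foreach (var c in line)
        {
            if (c == ';' && !EntityTail.IsMatch(current.ToString()))
            {
                // A real delimiter: close the current field.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                // Ordinary character, or the ';' that terminates an entity.
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

For example, `EntityAwareSplitter.Split("1;Record&apos;1;x")` yields the three fields `1`, `Record&apos;1` and `x`; the entities are kept as-is and can be decoded per field afterwards.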
-
I have to import a feed that uses a semicolon as the delimiter, and unfortunately its data is very bad.
For instance, a record contains HTML-encoded content
The single field
and that semicolon causes the parser to end the field at that semicolon, screwing up the rest of the record.
How should I handle that case?
I'd like a way to "pre-parse" a raw row in order to HTML-decode it before the parser kicks in, but I can't find a way to do so. Is it possible?