How to use DSL to delete non-UTF-8 hex characters from a string? #985
-
I have some CSV data containing strings with embedded non-UTF-8 hex characters, specifically the ISO 8859-1 superscripts 1, 2, and 3 (x'b9', x'b2', and x'b3'), without the UTF-8 lead byte (x'c2') before each one. They are just naked ISO 8859-1 one-byte characters with those hex values. To further process these string-valued CSV fields in some DSL programs I am writing, I need to be able to delete these characters from the field. It seems I can't use the string escape "\xb9", etc. in the regexp string (gsub argument 2, for instance), and pasting a copy of the superscript character into the regexp string results in Miller complaining that it found invalid UTF-8 characters during regexp parsing. REPL example:
How can I set up a gsub or other replacement function to remove all of these characters from a string-valued CSV field, please?

Peter

[Edit: I found the doc on the ssub function, and that works using the naked hex values, but it only replaces the first instance of the value. How would I code it to replace ALL such values in a string? There does not seem to be an sgsub function that would do that.]
-
@pjfarleyiii could you attach a sample file?
-
One-record test file uploaded. It has only two CSV columns, Date and Name (header included). The Name field has the superscript "1" character ("\xb9") in it.
-
I was able to code a while loop to solve the issue, pasted below for future readers.
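The code from that post did not survive in this copy of the thread. What follows is not the original Miller DSL; it is a minimal Python sketch of the same idea, assuming a first-occurrence literal replace (analogous to ssub) applied in a while loop until no stray byte remains. The names `STRAY_BYTES`, `replace_first`, and `delete_all` are hypothetical and exist only for this illustration.

```python
# Hypothetical helper names; a sketch of the loop idea, not the code from the post.
STRAY_BYTES = (b"\xb9", b"\xb2", b"\xb3")  # ISO 8859-1 superscripts 1, 2, 3


def replace_first(data: bytes, target: bytes, replacement: bytes) -> bytes:
    """Literal (non-regex) replacement of the first occurrence only."""
    return data.replace(target, replacement, 1)


def delete_all(data: bytes, targets=STRAY_BYTES) -> bytes:
    """Call replace_first repeatedly until none of the target bytes remain."""
    for target in targets:
        while target in data:
            data = replace_first(data, target, b"")
    return data


if __name__ == "__main__":
    field = b"Acme\xb9 Corp\xb2\xb3 Ltd\xb9"
    print(delete_all(field))  # b'Acme Corp Ltd'
```

This works on raw bytes, so it assumes the field contains no valid multi-byte UTF-8 sequences that include x'b9', x'b2', or x'b3'; if it might, the loop would need to skip bytes that follow a UTF-8 lead byte.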
-
Also @pjfarleyiii #997