FR: support for extended character class escapes in patterns #33

bpj · 2020-12-12T13:36:02Z

Just an idea. I need to match a sequence of letters and non-spacing marks, which can't be expressed in Lua patterns, even with the extended meaning this module gives to escapes like %a. It occurs to me that a possible solution would be for this module to support some extended character class escapes. I would love to do a PR, but I don't do C.

Perhaps the most straightforward would be if %x{hhh} and the other escapes from utf8.escape could be used in patterns, including inside character classes, so that [%a%x{300}-%x{36f}]+ would match a run of letters and characters from the Combining Diacritical Marks block (although there are many non-spacing marks outside that block!)
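To make the %x{hhh} idea concrete, here is a minimal sketch of a preprocessing step that expands such escapes into literal UTF-8 characters before the pattern is handed to a matcher. It is not part of this module: `expand_hex_escapes` is a hypothetical helper, and it uses only the standard `utf8.char` from Lua 5.3 or later. Whether the resulting set range then matches by code point depends on the matcher that receives the pattern, which this sketch does not check.

```lua
-- Hypothetical helper, not part of any library: expand %x{hhh} escapes in a
-- pattern string into literal UTF-8 characters. Assumes Lua 5.3+ (utf8.char).
local function expand_hex_escapes(pat)
  local out, i = {}, 1
  while i <= #pat do
    local c = pat:sub(i, i)
    if c ~= '%' then
      out[#out + 1] = c
      i = i + 1
    else
      -- '^' anchors the match at position i
      local hex = pat:match('^%%x{(%x+)}', i)
      if hex then
        local ch = utf8.char(tonumber(hex, 16))
        -- keep ASCII punctuation escaped so it stays literal in the pattern
        if ch:match('^%p$') then ch = '%' .. ch end
        out[#out + 1] = ch
        i = i + #hex + 4            -- skip "%x{", the digits, and "}"
      else
        -- any other escape (%a, %%, ...) is copied through untouched
        out[#out + 1] = pat:sub(i, i + 1)
        i = i + 2
      end
    end
  end
  return table.concat(out)
end

local expanded = expand_hex_escapes('[%a%x{300}-%x{36f}]+')
-- expanded now holds "[%a", U+0300, "-", U+036F, "]+" as UTF-8 bytes
```

Because `%%` is copied through as a pair, an escaped percent sign followed by `x{41}` is left alone rather than being expanded.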

A somewhat more keyhole-surgery solution would be a character class escape %m, which matches any character with General Category M, and its complement %M.

More generally, perhaps an escape %g{Gc} (with complement %G{Gc}), where Gc is a one- or two-letter General Category abbreviation like L, Lu, Lo, M, Mn, P, Ps, Pe, matching any character which does/doesn't belong to that General Category. The curlies would of course have to be required so that one can still use the regular character class %g, including %g%{ when a literal curly follows. Or perhaps %k{Gc}, as if "Kategory"!

The use case is a function for titlecasing words:

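-- This module; it shadows the standard utf8 library, which has no upper/lower/gsub
local utf8 = require 'lua-utf8'

-- Helper function: uppercase the first letter, lowercase the rest of the word
local function ul (u, l)
  return utf8.upper(u) .. utf8.lower(l)
end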

local function title_case (s)
  -- Add flanking non-word chars so frontier assertion works at start/end
  s = '(' .. s .. ')'
  s = utf8.gsub(s,'%f[%w](%a)([^%s%d%p%c]*)%f[%W]', ul)
  -- Remove dummy parens
  return utf8.sub(s, 2, -2)
end

That [^%s%d%p%c]* has worked so far for my data, but it's ugly and works by accident, and there may be things it matches which it shouldn't, although it seems this module includes GC S in %p.

starwing · 2020-12-24T03:15:52Z

The logic of Lua patterns is one letter per function, so supporting multi-letter escapes may be difficult; maybe another matching library is needed. Pattern matching in this library is just meant to be compatible with Lua's.

So maybe it's worth considering whether there are any alternatives for pattern matching that fully support Unicode?

bpj · 2020-12-29T16:19:37Z

I guess I could use lrexlib, but then patterns would be entirely incompatible with Lua pattern syntax, when my idea is that programs using my (MoonScript) class can supply functions with similar semantics for the library's methods to use instead of string.match etc., with a pure Lua "mode" still possible. I guess I could write (regex-based) code to translate a superset of Lua pattern syntax into PCRE, but that may easily get bigger than the host library itself.
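For a sense of scale, here is a rough sketch of the smallest piece of such a translator: rewriting only the class escapes, including the proposed %k{Gc} form, into PCRE property escapes. `translate_classes` and `class_map` are hypothetical names, and the chosen PCRE equivalents (for example %a as \p{L}) are assumptions, not this module's definitions. Everything else that differs between the two syntaxes (the lazy `-` quantifier, %f, %b, anchors) is deliberately left out, which is exactly where the size would grow.

```lua
-- Assumed equivalents for a few Lua class letters; uppercase means complement.
local class_map = {
  a = '\\p{L}',  A = '\\P{L}',
  l = '\\p{Ll}', L = '\\P{Ll}',
  u = '\\p{Lu}', U = '\\P{Lu}',
  d = '\\p{Nd}', D = '\\P{Nd}',
  s = '\\s',     S = '\\S',
}

-- Hypothetical translator for class escapes only; quantifiers and anchors
-- are copied unchanged and may not mean the same thing in PCRE.
local function translate_classes(pat)
  local out, i = {}, 1
  while i <= #pat do
    local c = pat:sub(i, i)
    if c ~= '%' then
      out[#out + 1] = c
      i = i + 1
    else
      local kind, gc = pat:match('^%%([kK]){(%a%a?)}', i)
      if gc then
        -- proposed %k{Gc} / %K{Gc} becomes \p{Gc} / \P{Gc}
        out[#out + 1] = (kind == 'k' and '\\p{' or '\\P{') .. gc .. '}'
        i = i + #gc + 4
      else
        local e = pat:sub(i + 1, i + 1)
        if class_map[e] then
          out[#out + 1] = class_map[e]
        elseif e:match('^%W$') then
          out[#out + 1] = '\\' .. e    -- escaped punctuation stays literal
        else
          error('unsupported escape %' .. e)
        end
        i = i + 2
      end
    end
  end
  return table.concat(out)
end

local pcre = translate_classes('[%a%k{Mn}]+')   -- "[\p{L}\p{Mn}]+"
```

Unsupported escapes such as %f raise an error instead of being silently passed through, since `\f` and `\b` mean something else in PCRE.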

starwing · 2021-01-02T12:14:50Z

Maybe you could generate tables (using the scripts from this project) and make a new module for checking Unicode categories.
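A minimal sketch of what such a category module could look like, assuming Lua 5.3 or later and sorted, non-overlapping range tables. The names `ranges`, `in_category` and `count_in_category` are hypothetical, and the table below is a deliberately tiny sample (two Mn ranges), not generated data; a real module would fill it from the Unicode Character Database.

```lua
-- SAMPLE DATA ONLY: two known Mn ranges, sorted by first code point.
-- A real table would be generated from the Unicode data files.
local ranges = {
  Mn = {
    { 0x0300, 0x036F },  -- Combining Diacritical Marks block
    { 0x0483, 0x0487 },  -- Cyrillic combining marks
  },
}

-- Binary search over the {first, last} ranges of one category.
local function in_category(cp, gc)
  local list = ranges[gc]
  if not list then return false end
  local lo, hi = 1, #list
  while lo <= hi do
    local mid = (lo + hi) // 2
    local r = list[mid]
    if cp < r[1] then
      hi = mid - 1
    elseif cp > r[2] then
      lo = mid + 1
    else
      return true
    end
  end
  return false
end

-- Count the code points of s that fall in category gc.
local function count_in_category(s, gc)
  local n = 0
  for _, cp in utf8.codes(s) do
    if in_category(cp, gc) then n = n + 1 end
  end
  return n
end

-- "e" followed by U+0301 COMBINING ACUTE ACCENT contains one Mn character
local marks = count_in_category('e' .. utf8.char(0x301), 'Mn')
```

A predicate like this could be passed to code that walks a string by code point, though it would not by itself make the categories usable inside a pattern.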

starwing added the enhancement label Dec 24, 2020
