FR: support for extended character class escapes in patterns #33

Open
bpj opened this issue Dec 12, 2020 · 3 comments

@bpj
bpj commented Dec 12, 2020

Just an idea. I need to match a sequence of letters and non-spacing marks, which can't be expressed in Lua patterns even with this module's extended meaning of escapes like %a. It occurs to me that a possible solution would be for this module to support some extended character class escapes. I would love to do a PR but I don't do C.

Perhaps the most straightforward would be if %x{hhh} and the other escapes from utf8.escape could be used in patterns, including inside character classes, so that [%a%x{300}-%x{36f}]+ would match a run of letters and characters from the Combining Diacritical Marks block (although there are many non-spacing marks outside that block!).

A perhaps somewhat more keyhole-surgery solution would be a character class escape %m which matches any character with General Category M, and its complement %M.

Somewhat more generally, perhaps an escape pattern %g{Gc} (and complement %G{Gc}), where Gc is a one- or two-letter General Category abbreviation like L, Lu, Lo, M, Mn, P, Ps, Pe, matching any character which does/doesn't belong to that General Category. The curlies would of course have to be required so that one can still use the regular character class %g, including %g%{ for %g followed by a literal curly. Alternatively the escape could be %k{Gc}, as in "Kategory"!

The use case is a function for titlecasing words:

local utf8 = require 'lua-utf8'

-- Helper function: uppercase the first letter, lowercase the rest
local function ul (u, l)
  return utf8.upper(u) .. utf8.lower(l)
end

local function title_case (s)
  -- Add flanking non-word chars so frontier assertion works at start/end
  s = '(' .. s .. ')'
  s = utf8.gsub(s,'%f[%w](%a)([^%s%d%p%c]*)%f[%W]', ul)
  -- Remove dummy parens
  return utf8.sub(s, 2, -2)
end

That [^%s%d%p%c]* has worked so far for my data, but it's ugly and it works by accident. There may be things it matches which it shouldn't, although it seems this module includes GC S in %p.

@starwing
Owner

The logic of Lua patterns is one letter per function, so supporting multi-letter escapes may be difficult; maybe another matching library is needed. Pattern matching in this library exists only for compatibility with Lua's.

So maybe it's worth considering whether there are any alternatives for pattern matching that fully support Unicode?

@bpj
Author

bpj commented Dec 29, 2020

I guess I could use lrexlib, but then patterns would be entirely incompatible with Lua pattern syntax. My idea is that programs using my (MoonScript) class can supply functions with similar semantics for the library's methods to use instead of string.match etc., with a pure Lua "mode" still possible. I guess I could write (regex-based) code to translate a superset of Lua pattern syntax into PCRE, but that could easily get bigger than the host library itself.
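
To give a rough idea of what such a translator might look like, here is a minimal sketch in plain Lua. Everything in it is an assumption for illustration: the function name to_pcre is hypothetical, the %g{Gc} escape is the one proposed above (it does not exist in any library), and the mapping of %a, %l, %u and %d onto \p{L}, \p{Ll}, \p{Lu} and \p{Nd} is only a guess at the intended semantics. It deliberately rejects everything it does not understand (%b, %f, %w, %p, ...) and does not handle the lazy "-" quantifier at all.

```lua
-- Hypothetical sketch: translate a small superset of Lua pattern syntax
-- into PCRE syntax. Only a handful of escapes are supported.

-- Assumed mapping from Lua class letters to Unicode General Categories.
local CLASS = { a = 'L', l = 'Ll', u = 'Lu', d = 'Nd' }

local function to_pcre(pat)
  local out, i, n = {}, 1, #pat
  while i <= n do
    local c = pat:sub(i, i)
    if c == '%' then
      local e = pat:sub(i + 1, i + 1)
      -- Proposed extension: %g{Gc} / %G{Gc} with a 1-2 letter category.
      local gc = pat:match('^{(%a%a?)}', i + 2)
      if (e == 'g' or e == 'G') and gc then
        out[#out + 1] = (e == 'g' and '\\p{' or '\\P{') .. gc .. '}'
        i = i + 4 + #gc            -- skip "%g{", the category and "}"
      elseif CLASS[e:lower()] then
        -- Lowercase letter = class, uppercase letter = its complement.
        local open = (e == e:lower()) and '\\p{' or '\\P{'
        out[#out + 1] = open .. CLASS[e:lower()] .. '}'
        i = i + 2
      elseif e == 's' or e == 'S' then
        out[#out + 1] = '\\' .. e  -- \s and \S exist in PCRE as well
        i = i + 2
      elseif e:match('^%W$') then
        out[#out + 1] = '\\' .. e  -- escaped punctuation stays literal
        i = i + 2
      else
        error('unsupported escape %' .. e)
      end
    elseif c == '\\' or c == '|' or c == '{' or c == '}' then
      -- Literal in Lua patterns, but special in PCRE: escape them.
      out[#out + 1] = '\\' .. c
      i = i + 1
    else
      out[#out + 1] = c            -- passed through unchanged
      i = i + 1
    end
  end
  return table.concat(out)
end

print(to_pcre('[%a%g{Mn}]+'))
```

The resulting string could then be handed to whatever PCRE binding is in use. Even this toy version shows where the size would come from: frontier patterns, balanced matches, the "-" quantifier and escapes inside character classes would each need their own handling.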

@starwing
Owner

starwing commented Jan 2, 2021

Maybe you could generate tables (using the scripts from this project) and write a new module for checking the Unicode categories.
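
A minimal sketch of what such a module could look like, assuming a sorted table of code point ranges and a binary search over it. The names MN_RANGES, in_ranges and is_nonspacing_mark are hypothetical, and the table below is hand-written and deliberately incomplete: it covers only three blocks of General Category Mn, whereas a real table would be generated from the Unicode Character Database.

```lua
-- Hypothetical sketch: category lookup via a sorted range table.
-- INCOMPLETE on purpose; a real table would be generated, not typed.
local MN_RANGES = {
  { 0x0300, 0x036F },  -- Combining Diacritical Marks
  { 0x20D0, 0x20DC },  -- start of Combining Diacritical Marks for Symbols
  { 0xFE20, 0xFE2F },  -- Combining Half Marks
}

-- Binary search: is code point cp inside any of the sorted,
-- non-overlapping { first, last } ranges?
local function in_ranges(ranges, cp)
  local lo, hi = 1, #ranges
  while lo <= hi do
    local mid = math.floor((lo + hi) / 2)
    local r = ranges[mid]
    if cp < r[1] then
      hi = mid - 1
    elseif cp > r[2] then
      lo = mid + 1
    else
      return true
    end
  end
  return false
end

local function is_nonspacing_mark(cp)
  return in_ranges(MN_RANGES, cp)
end

print(is_nonspacing_mark(0x0301))  -- U+0301 COMBINING ACUTE ACCENT
print(is_nonspacing_mark(0x0061))  -- U+0061 LATIN SMALL LETTER A
```

The lookup works on code points rather than strings, so the caller would decode the input first, for example with utf8.codes in Lua 5.3 or later. One table per category (or per top-level category such as M) would keep each search small.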
