Just an idea. I need to match a sequence of letters and non-spacing marks, which can't be expressed in Lua patterns even with this module's extended meaning of escapes like %a. It occurs to me that a possible solution would be for this module to support some extended character class escapes. I would love to do a PR but I don't do C.
Perhaps the most straightforward would be if %x{hhh} and the other escapes from utf8.escape could be used in patterns, including inside character classes, so that [%a%x{300}-%x{36f}]+ would match a run of letters and characters from the Combining Diacritical Marks block (although there are many non-spacing marks outside that block!).
A perhaps somewhat more key-hole-surgery solution would be a character class escape %m, which matches any character with General Category M, and its complement %M.
Somewhat more generally, perhaps an escape pattern %g{Gc} (and complement %G{Gc}), where Gc is a one- or two-letter General Category abbreviation like L, Lu, Lo, M, Mn, P, Ps, Pe, matching any character which does/doesn't belong to that General Category. The curlies would of course have to be required so that one can still use the regular character class %g, including %g%{ with a following curly. Or perhaps %k{Gc}, as if "Kategory"!
The use case is a function for titlecasing words:

    -- `utf8` here is this module, not Lua's built-in utf8 library
    -- Helper function
    local function ul (u, l)
      return utf8.upper(u) .. utf8.lower(l)
    end

    local function title_case (s)
      -- Add flanking non-word chars so frontier assertion works at start/end
      s = '(' .. s .. ')'
      s = utf8.gsub(s, '%f[%w](%a)([^%s%d%p%c]*)%f[%W]', ul)
      -- Remove dummy parens
      return utf8.sub(s, 2, -2)
    end

That [^%s%d%p%c]* has worked so far for my data, but it's ugly and it works by accident. It may match things which it shouldn't, although it seems this module includes GC S in %p.
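
For comparison, here is a minimal sketch of what a class like [%a%x{300}-%x{36f}]+ would have to do, written by hand in pure Lua 5.3+ with only the built-in utf8 library. The helper names (in_marks_block, is_ascii_letter, match_run) are hypothetical and not part of this module. As a loud simplification, only ASCII letters count as letters and only the Combining Diacritical Marks block counts as marks, so it is an illustration of the idea rather than a usable replacement.

```lua
-- Sketch only: the helper names are hypothetical, not part of any module.

-- True if cp lies in the Combining Diacritical Marks block (U+0300..U+036F).
local function in_marks_block(cp)
  return cp >= 0x300 and cp <= 0x36F
end

-- Simplifying assumption: only ASCII letters are treated as letters.
local function is_ascii_letter(cp)
  return (cp >= 0x41 and cp <= 0x5A) or (cp >= 0x61 and cp <= 0x7A)
end

-- Return the longest run of letters and marks starting at byte position
-- `init` (default 1), or nil if the character there is neither.
-- Assumes `s` is valid UTF-8 and `init` is a character boundary.
local function match_run(s, init)
  init = init or 1
  local stop = init - 1
  for pos, cp in utf8.codes(s) do
    if pos >= init then
      if not (is_ascii_letter(cp) or in_marks_block(cp)) then break end
      -- Last byte of the character that starts at `pos`.
      stop = pos + #utf8.char(cp) - 1
    end
  end
  if stop < init then return nil end
  return s:sub(init, stop)
end
```

For example, match_run("e\u{301}tude 1") would return the first word including its combining acute accent, which is the part the [^%s%d%p%c]* workaround only gets right by accident.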
The logic of Lua patterns is one letter per function, so supporting multi-letter escapes may be difficult; another matching library may be needed. Pattern matching in this library exists only for compatibility with Lua's.
So it may be worth considering whether there are any alternatives for pattern matching that fully support Unicode.
I guess I could use lrexlib, but then patterns would be entirely incompatible with Lua pattern syntax. My idea is that programs using my (MoonScript) class can supply functions with similar semantics for the library's methods to use instead of string.match etc., with a pure Lua "mode" still possible. I guess I could write (regex-based) code to translate a superset of Lua pattern syntax into PCRE, but that may easily get bigger than the host library itself.