[FEA] Optionally support titlecase for capitalize #14144
Comments
For reference here is a list of the suspect characters: https://www.compart.com/en/unicode/category/Lt |
Great to hear that CUDF will do it by default. I am a little concerned because ß is the one that bit us in our testing, but it does not show up in https://www.compart.com/en/unicode/category/Lt |
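As a rough way to check the point above locally, the sketch below lists the characters in the Unicode "Lt" (titlecase letter) category using Python's standard `unicodedata` module, and confirms that ß is not one of them. The variable name `titlecase_letters` is only illustrative, and the exact set depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Collect every code point whose general category is "Lt" (titlecase letter).
# The result depends on the Unicode database shipped with this Python build.
titlecase_letters = [
    chr(cp)
    for cp in range(0x110000)
    if unicodedata.category(chr(cp)) == "Lt"
]

# ß is a lowercase letter ("Ll"), so it does not appear in the Lt list,
# even though its uppercase and titlecase mappings differ.
sharp_s_category = unicodedata.category("ß")
sharp_s_is_lt = "ß" in titlecase_letters
```

So membership in the Lt category is not enough to find every character whose title case differs from its upper case; the case mappings themselves have to be compared.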
So the
But it looks like when capitalizing
I've not been able to find documentation on this behavior so I would be curious to know what is expected by Spark when capitalizing |
I hope that this helps. Strings in Spark are kind of special as they wrote their own UTF8String implementation |
The I found a few more characters that are not part of the titlecase Unicode definition and behave like
The Python (and Pandas) output for But all of these pass through unchanged with Regardless, the libcudf result matches neither and so the inclination is to fix it to match the Python/Pandas result. |
Sorry, I have not been following this as closely as I should. @davidwendt so the proposal is to make the CUDF code match Python/Pandas, but not Spark? @sameerz if that is true then we will need to write a custom kernel for initcap for Spark. |
Just FYI: From a Spark perspective I found 265 characters that produce different values between the CPU implementation and the GPU one. Their code points are. (223, 304, 329, 452, 454, 455, 457, 458, 460, 496, 497, 499, 604, 609, 618, 620, 642, 647, 669, 670, 912, 944, 1011, 1012, 1321, 1323, 1325, 1327, 1415, 4304, 4305, 4306, 4307, 4308, 4309, 4310, 4311, 4312, 4313, 4314, 4315, 4316, 4317, 4318, 4319, 4320, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4330, 4331, 4332, 4333, 4334, 4335, 4336, 4337, 4338, 4339, 4340, 4341, 4342, 4343, 4344, 4345, 4346, 4349, 4350, 4351, 5112, 5113, 5114, 5115, 5116, 5117, 7296, 7297, 7298, 7299, 7300, 7301, 7302, 7303, 7304, 7566, 7830, 7831, 7832, 7833, 7834, 7838, 8016, 8018, 8020, 8022, 8064, 8065, 8066, 8067, 8068, 8069, 8070, 8071, 8080, 8081, 8082, 8083, 8084, 8085, 8086, 8087, 8096, 8097, 8098, 8099, 8100, 8101, 8102, 8103, 8114, 8115, 8116, 8118, 8119, 8130, 8131, 8132, 8134, 8135, 8146, 8147, 8150, 8151, 8162, 8163, 8164, 8166, 8167, 8178, 8179, 8180, 8182, 8183, 8486, 8490, 8491, 42649, 42651, 42900, 42903, 42905, 42907, 42909, 42911, 42933, 42935, 42937, 42939, 42941, 42943, 42947, 43859, 43888, 43889, 43890, 43891, 43892, 43893, 43894, 43895, 43896, 43897, 43898, 43899, 43900, 43901, 43902, 43903, 43904, 43905, 43906, 43907, 43908, 43909, 43910, 43911, 43912, 43913, 43914, 43915, 43916, 43917, 43918, 43919, 43920, 43921, 43922, 43923, 43924, 43925, 43926, 43927, 43928, 43929, 43930, 43931, 43932, 43933, 43934, 43935, 43936, 43937, 43938, 43939, 43940, 43941, 43942, 43943, 43944, 43945, 43946, 43947, 43948, 43949, 43950, 43951, 43952, 43953, 43954, 43955, 43956, 43957, 43958, 43959, 43960, 43961, 43962, 43963, 43964, 43965, 43966, 43967, 64256, 64257, 64258, 64259, 64260, 64261, 64262, 64265, 64266, 64267, 64268, 64269, 64275, 64276, 64277, 64278, 64279) |
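A list like the one above can be produced by running two case-mapping implementations over every code point and recording where they disagree. The sketch below is not the Spark CPU vs. GPU comparison itself; as a stand-in it compares Python's own `str.upper` and `str.title` over the Basic Multilingual Plane. The helper name `differing_code_points` and the choice of range are assumptions made for illustration.

```python
def differing_code_points(map_a, map_b, limit=0x10000):
    """Return code points below `limit` where the two mappings disagree."""
    result = []
    for cp in range(limit):
        # Skip surrogates: they are not valid standalone characters.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        if map_a(ch) != map_b(ch):
            result.append(cp)
    return result

# Stand-in comparison: Python upper case vs. Python title case.
# To reproduce the list in this thread, the two callables would instead
# wrap the Spark CPU and the GPU implementations (not shown here).
upper_vs_title = differing_code_points(str.upper, str.title)
```

The resulting set will not match the 265 code points reported here exactly, since it compares different implementations, but the same loop applies to any pair of mappings.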
Is your feature request related to a problem? Please describe.
Spark has a method called initcap. We implemented this using strings::capitalize, but recently ran into some problems because the first letter Spark produces is not an uppercase letter; it is a title case letter.
https://unicode.org/faq/casemap_charprop.html#4
Most of the time they are the same, but there are a few cases where they are not, and ß is one of them. I would love an option for capitalize that uses title case instead of upper case. A separate initcap function that uses title case would also be great.
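To make the requested behavior concrete, here is a minimal Python sketch of an initcap-style function that title-cases the first character of each space-separated word and lower-cases the rest. The function name `initcap_titlecase` and the split-on-single-space rule are assumptions for illustration; this is neither the libcudf API nor Spark's implementation, and Python's full case mappings may differ from Spark's for some characters.

```python
def initcap_titlecase(text):
    """Title-case the first character of each space-separated word.

    Hypothetical helper: uses Python's str.title() on the first character
    (title case, not upper case) and str.lower() on the remainder.
    """
    words = text.split(" ")
    return " ".join(w[:1].title() + w[1:].lower() for w in words)

# U+01C6 (ǆ) has distinct upper case (U+01C4) and title case (U+01C5) forms.
dz_upper = "\u01C6".upper()
dz_title = "\u01C6".title()

example = initcap_titlecase("\u01C6abc DEF")
```

For ordinary letters such as `d`, the two mappings coincide, which is why the difference only shows up on a small set of characters.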