Add support for other numeral systems. #18
Comments
Here ("Using neither" section): https://en.wikipedia.org/wiki/Long_and_short_scales#Using_neither there is a list of different numeral systems. |
BTW, this resource has a lot of useful information: https://www.languagesandnumbers.com/site-map/en/ |
@noviluni I have decided to take this for my GSoC proposal. I understand how Roman numerals are written in the English alphabet and how one would implement parsing them. I wanted to ask about the Chinese/Japanese/Korean/Vietnamese numeral system. My understanding is that it would parse numbers that are input as symbols from the respective languages; or should I do it with the help of Unicode? |
I did not understand this question. To me “symbols form” and “Unicode” are basically the same here, the input would be the symbols as a Python string. |
What I meant to ask was whether I should be using "零" or "U+96F6", because I tried some test code in Python and it printed "零" as is. |
Hi @AmPhIbIaN26 thanks for showing interest in fixing this. Yes, I think this could be feasible for a GSoC proposal. I'm not sure about your last question. In Python 3, strings are Unicode by default, so you don't need to handle it. "零" is Unicode just as much as "U+96F6" is; both refer to the same character. I don't think you need to convert anything. |
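As a quick illustration of the point about Unicode, the following minimal check (standard Python 3 only, no number-parser code involved) shows that the literal character and its code point escape are the same string, so no conversion step is needed before parsing:

```python
literal = "零"
escaped = "\u96f6"  # the same character, written as a code point escape

# Both spellings produce the identical one-character string.
print(literal == escaped)            # True
print(len(literal))                  # 1
print(f"U+{ord(literal):04X}")       # U+96F6
```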
Ohh ok thanks, I'll work on it |
@noviluni I am doing a bit of research on this topic first. Do you only want conversion of numbers to int, or to a readable string? Could you give me examples of the kinds of input you want it to take and what it should return? |
I looked up how to parse Roman numerals; it can be added to the current parser. But for these other numerals, for Chinese, Japanese and other languages, should I create a new parser? |
If you mean whether the goal is to support new numerical systems in That said, maybe it makes sense to prioritize
It’s hard to give specific examples without some knowledge of those other numeral systems, but for Roman numbers I imagine something like:
>>> parse('Built in MDCCLXXVI')
'Built in 1776'
>>> parse_number('MDCCLXXVI')
1776
I’m not very familiar with the internals of number-parser, but I would aim for it to be possible for users to limit which numeral systems are considered by number-parser in a given call. For example, allow users to call number-parser functions limited to decimal numbers, Roman numbers, or any combination of numeral systems. So I would say that, ideally, the parsers should be as independent as possible. |
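To make the idea of independent, selectable parsers concrete, here is a rough sketch. It is not number-parser's actual API: the `systems` parameter, the `PARSERS` registry and the helper names are all hypothetical, and the Roman converter is deliberately naive (it does not reject malformed numerals).

```python
import re
from typing import Callable, Dict, Iterable, Optional

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def _decimal(token: str) -> Optional[int]:
    """Plain ASCII digits, e.g. '1776'."""
    return int(token) if token.isascii() and token.isdigit() else None


def _roman(token: str) -> Optional[int]:
    """Naive Roman conversion; accepts any sequence of Roman symbols."""
    if not token or any(ch not in ROMAN_VALUES for ch in token):
        return None
    values = [ROMAN_VALUES[ch] for ch in token]
    # A symbol smaller than its right-hand neighbour is subtracted.
    return sum(
        -v if i + 1 < len(values) and v < values[i + 1] else v
        for i, v in enumerate(values)
    )


# Hypothetical registry: one independent parser per numeral system.
PARSERS: Dict[str, Callable[[str], Optional[int]]] = {
    "decimal": _decimal,
    "roman": _roman,
}


def parse_number(token: str, systems: Optional[Iterable[str]] = None) -> Optional[int]:
    """Try only the requested systems (all registered ones by default)."""
    for name in (systems if systems is not None else PARSERS):
        value = PARSERS[name](token)
        if value is not None:
            return value
    return None


def parse(text: str, systems: Optional[Iterable[str]] = None) -> str:
    """Replace every word that one of the selected systems recognises."""
    selected = list(systems) if systems is not None else list(PARSERS)

    def replace(match: "re.Match[str]") -> str:
        value = parse_number(match.group(0), selected)
        return match.group(0) if value is None else str(value)

    return re.sub(r"\w+", replace, text)


print(parse("Built in MDCCLXXVI"))                       # Built in 1776
print(parse("Built in MDCCLXXVI", systems=["decimal"]))  # Built in MDCCLXXVI
```

Note that this toy version would also convert ordinary capitalised words made of Roman symbols, such as "I" or "MIX", which is one reason a real implementation would need stricter detection.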
So I add roman to parser and then make a new parser for the other languages, right? |
That’s the opposite of what I meant, but I’m also starting to think that I may be misunderstanding what you mean here by “parser”. What would each of the options (extending the existing parser with Roman number support vs adding a separate parser for Roman number support) look like for users, API-wise? Are you talking about creating separate user-level functions for other numeral systems, as opposed to having the existing functions like |
Parsing Roman numerals is not a big or complex task and can be added directly to parser.py. What I meant by a different parser for other languages was creating a new file for them. Since I am new to this whole concept of parsers, and to making a Python library in general, I might be asking the wrong question. I am not confused about how to integrate Roman numerals; that part is done and I will make a pull request for it soon. What I am confused about is how I would integrate other languages. Do you want it to be
>>> parse('百四十五')
'145'
or do you want it to be like
>>> parse_japanese('百四十五')
'145'
So I can either add a way to detect the language and then parse accordingly, or let the user define which language they are parsing. I asked about creating a new parser because of this comment by @noviluni
|
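Whichever entry point ends up exposing it, the conversion behind an input like '百四十五' could look roughly like the sketch below. The function name `parse_kanji_number` is hypothetical, only the most common characters are covered, and the function returns an int, whereas the `parse` examples in this thread return strings.

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10_000, "億": 100_000_000}


def parse_kanji_number(text: str) -> int:
    """Convert a number written with kanji digits and units to an int."""
    if not text:
        raise ValueError("empty input")
    if text in ("零", "〇"):
        return 0
    total = 0    # sum of completed 万/億 groups
    section = 0  # value of the current group below 10,000
    digit = 0    # pending digit waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A unit without a leading digit counts once: 百 alone is 100.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            total += ((section + digit) or 1) * LARGE_UNITS[ch]
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


print(parse_kanji_number("百四十五"))  # 145
```

The digits multiply the unit that follows them and the resulting groups are added, which is a different structure from both the word-based decimal parser and Roman numerals.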
I personally have no strong feeling either way on how to distribute the new code in the code base. Having separate files for the code that supports each numeral system (what we call parsers) would make sense to me, but I would not worry too much about that. As for integration of other languages, my suggestion would be to aim for reuse of the existing functions, which is best when you don’t know which numeral system the input uses. So, |
I'll look into it. I will make a pull request for Roman numerals by tomorrow, maybe, and I was thinking of taking the other numerals in my proposal. How does that sound? |
Sounds great! |
Thanks a lot, I am excited to work on this!! |
I worked on a way for parse_number, where you have to set So the method I used to parse Roman numerals is:
>>> parse("XVIII", language='rom')
'18'
As for
>>> parse('Built in MDCCLXXVI', language='rom')
'Built in 1776' |
It sounds good as a temporary workaround, but we should not treat this as a language, since “rom” is not an ISO language code. There’s Latin ( I take it you went the “rom” route for now due to implementation limitations. But to discuss how best to address those, and how to refactor the code so that it does not require “rom”, it may be best for you to create a pull request with your work so far, so that we can discuss this over actual code. |
OK, I'll look for a workaround. |
Hey @Gallaecio hope you're doing well!
>>> parse_number('MMCDXX')
'2420'
for the case for
>>> parse('Built in MDCCLXXVI', language='rom')
'Built in 1776'
But this doesn't,
>>> parse('Built in MDCCLXXVI')
'Built in 1776'
I have to make changes to the For now I will be working on my proposal and will submit a draft by tomorrow; I would be obliged if you could go through it. |
Hi @AmPhIbIaN26 and @Gallaecio! Thank you both for going through this interesting conversation. I checked the PR and you did a really good job looking at the code and understanding how it works. So now that you have some practical knowledge, let's see why the chosen approach (reusing the existing parser) doesn't work. The "parser" you are using is for decimal numbers. That means that numbers are built in the following way:
[This is easier to understand with higher numbers like thousands, hundreds, etc., because for the small numbers every language has evolved differently and doesn't follow the rules. That's the reason why we have the "DIRECT_NUMBERS".] The Roman Numeral System doesn't work like that. It has a limited set of symbols (I, V, X, L, C, D, M) and the numbers are written by adding and subtracting:
The rule is basically that you shouldn't repeat the same symbol more than three times. In the PR you submitted you have been reusing the existing structures for decimal systems:
And of course, they don't fit well. So if you want to continue with this, I would suggest that you:
|
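The additive and subtractive rule described above, together with the limit of three repetitions, can be sketched in a few lines. This is only an illustration under assumptions, not the code from the PR: the name `roman_to_int` is hypothetical, and the regular expression accepts only the standard form of numerals from 1 to 3999.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Standard form: thousands, hundreds, tens, ones. The {0,3} quantifiers
# encode "never repeat the same symbol more than three times".
_STANDARD_FORM = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def roman_to_int(numeral: str) -> int:
    """Convert a standard-form Roman numeral (1 to 3999) to an int."""
    if not numeral or not _STANDARD_FORM.fullmatch(numeral):
        raise ValueError(f"not a standard Roman numeral: {numeral!r}")
    total = 0
    previous = 0
    # Scan right to left: a symbol smaller than the one to its right
    # is subtracted (the I in IV), otherwise it is added.
    for symbol in reversed(numeral):
        value = ROMAN_VALUES[symbol]
        if value < previous:
            total -= value
        else:
            total += value
        previous = value
    return total


print(roman_to_int("MDCCLXXVI"))  # 1776
print(roman_to_int("MMCDXX"))     # 2420
```

Nothing here depends on multipliers or word lists, which is why the structures of the decimal parser are a poor fit for this system.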
Thanks a lot @noviluni for this review of my work. I will look into it. I have started working on my proposal and it is almost complete. I was thinking that alongside it I could also work on this issue. I have done a fair bit of research on the Suzhou numeral system; it involves a substantial amount of work, but having more issues/ideas in my proposal will give more weight to my work. For now, this is what I have come up with in my research on the Suzhou numeral system. If you could take a look at it, that would be great. |
Hi @AmPhIbIaN26, For Suzhou, I'm not an expert, but the research looks good to me. Probably a good starting point would be writing some tests with that data and trying to develop the parsing function by doing some TDD. I would focus only on one variation (maybe Chinese) and then continue with the other variants (Japanese, Korean, Vietnamese). In case we need it, I know people from France, Germany, and China, so I could probably ask them to review and provide some feedback if we don't know how to continue or if we have doubts. |
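A first TDD step for Suzhou numerals might look like the sketch below: a few test cases plus the smallest function that satisfies them. The function name `parse_suzhou_digits` is hypothetical, and the sketch only reads the positional digit line; the second line of full Suzhou notation, which gives the order of magnitude and the unit, is ignored here.

```python
# Suzhou digits 1-9 are encoded at U+3021..U+3029; zero uses U+3007.
SUZHOU_DIGITS = {chr(0x3021 + i): i + 1 for i in range(9)}
SUZHOU_DIGITS["\u3007"] = 0
# 1, 2 and 3 also have horizontal forms, written with the ordinary CJK
# digits, used so that adjacent vertical strokes do not run together.
SUZHOU_DIGITS.update({"一": 1, "二": 2, "三": 3})


def parse_suzhou_digits(text: str) -> int:
    """Read a line of Suzhou digits as a positional decimal number."""
    if not text:
        raise ValueError("empty input")
    value = 0
    for ch in text:
        if ch not in SUZHOU_DIGITS:
            raise ValueError(f"not a Suzhou digit: {ch!r}")
        value = value * 10 + SUZHOU_DIGITS[ch]
    return value


def test_parse_suzhou_digits() -> None:
    assert parse_suzhou_digits("\u3029") == 9
    assert parse_suzhou_digits("\u3024\u3007\u3022二") == 4022
    assert parse_suzhou_digits("\u3021二\u3023") == 123


test_parse_suzhou_digits()
```

Further cases from the research data can then be added to the test function one at a time, extending the parser only when a test fails.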
Thanks @noviluni for the support. I will take up the German issue (revert numbers) alongside the Roman and Suzhou numerals. |
Hi @noviluni and @Gallaecio , I would be obliged if you could take a look at my draft proposal and suggest changes. |
@AmPhIbIaN26 The technical parts of the proposal look good to me, and as @noviluni said we can already see that you’ve gotten familiar with the code base. But the timeline looks wrong: Google Summer of Code 2021 will only last 10 weeks. Please update your proposal with a new timeline. I haven’t had a detailed look at your timeline because of this issue, but do remember that you do not need to fit all your goals into the timeline; it’s better to be pessimistic with the time estimates and set some stretch goals in case you work faster than estimated, than to reach the mid-project evaluation behind schedule. And just in case, remember that the deadline is April 13, 2021 20:00 (Central European Summer Time), in less than 24h, so try to fix and submit the application as soon as possible. |
Thanks for the suggestion. I have changed the timeline and also made changes to the deliverables. I have also added a precedence study. |
@noviluni I will now work on creating the new |
Hi @noviluni and @Gallaecio hope you both are doing well and are safe.
>>> parse_roman('CDXX')
'420'
>>> parse_roman('Built in MMLXXVII.')
'Built in 2077.'
I have a pull request for it. I have also added test cases to it. |
At this moment, the main goal of number-parser is to return the numeric equivalents of words in different languages, but only when those words represent the number using the "decimal numeral system" (https://en.wikipedia.org/wiki/Decimal). However, there are some numeral systems that don't rely on the decimal numeral system and use other structures. That's the case of the Roman Numeral System (https://en.wikipedia.org/wiki/Numeral_system) or the Chinese/Japanese/Korean/Vietnamese Numeral System (https://en.wikipedia.org/wiki/Chinese_numerals and https://en.wikipedia.org/wiki/Suzhou_numerals).
We could probably add support for them in a future version, as they will probably need another kind of parser.
For more on this, you can also check this: https://en.wikipedia.org/wiki/Numeral_system