feat/country_utils #26
base: dev
Conversation
Codecov Report

```
@@      Coverage Diff       @@
##        dev    #26  +/-  ##
=====================================
  Coverage    ?   0.00%
=====================================
  Files       ?      65
  Lines       ?   16409
  Branches    ?       0
=====================================
  Hits        ?       0
  Misses      ?   16409
  Partials    ?       0
```

Continue to review full report at Codecov.
```json
"official_name_ar": "",
"official_name_es": "",
"official_name_cn": "",
"official_name_en": "",
```
Taiwan
This was missing in the source data for some reason; I didn't validate it:
https://github.com/datasets/country-codes/blob/master/data/country-codes.csv
```diff
@@ -0,0 +1,3227 @@
+[
```
Should this be duplicated across languages instead of having all translated names in one file to match the structure of other resources?
I thought about this, but that requires a whole lot of duplicated keys with redundant info... not sure what's best?
I think separate files are easier for people to extend and less prone to typo errors in keys (`official_name_xx`). Maybe a common resource for country code and native language name?
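The split being discussed can be sketched in a few lines; the record fields follow the snippets in this PR, while the regrouping helper itself is hypothetical:

```python
# One combined record, shaped like the proposed countries.json
combined = [{
    "ISO3166-1-Alpha-2": "PT",
    "official_name_en": "Portugal",
    "official_name_es": "Portugal",
    "official_name_ru": "Португалия",
}]

def split_by_language(records):
    """Regroup one multilingual file into per-language dicts keyed by country code."""
    per_lang = {}
    for rec in records:
        code = rec["ISO3166-1-Alpha-2"]
        for key, value in rec.items():
            if key.startswith("official_name_"):
                lang = key.rsplit("_", 1)[-1]
                per_lang.setdefault(lang, {})[code] = value
    return per_lang

files = split_by_language(combined)
# e.g. files["ru"] == {"PT": "Португалия"}
```

Each value in `files` could then be dumped to its own per-language resource file, avoiding the duplicated `official_name_xx` keys.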
```json
"official_name_es": "Antártida",
"official_name_cn": "南极洲",
"official_name_en": "Antarctica",
"official_name_ru": "Антарктике"
```
No language... should it be `None` or an empty string, to prevent reference errors?
This simply said "world" in the source data.

Not sure if adding an empty string or `None` makes sense... are we making a distinction between undefined and unknown? I usually look at this stuff as the responsibility of whatever is reading the file: data provides datapoints about stuff it knows about, not about what is missing. I.e., empty string means undefined, missing key means unknown?
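The undefined/unknown convention described above can be sketched from the consumer's side (a hypothetical reader, not code from this PR):

```python
UNKNOWN = object()  # sentinel: the datum was never recorded in the source

def read_language(country: dict):
    """Empty string means 'known to have no value'; a missing key means 'unknown'."""
    if "Language" not in country:
        return UNKNOWN
    return country["Language"]  # may be "" (defined as none) or a real code

antarctica = {"official_name_en": "Antarctica", "Language": ""}
world = {"official_name_en": "World"}  # no Language key at all

read_language(antarctica)  # -> "" (undefined / known-none)
read_language(world)       # -> UNKNOWN sentinel
```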
Hmm. I guess this will be up to LF to handle, so I'm not sure what will be easier to parse... I think an empty value will make it more obvious at a glance if something is missing, though I'm not sure it makes much difference unless there's an instance where there is no language (I don't think that should ever be the case?). Empty string would probably be the easiest way to specify the language is known to be none, since we could just use a type check to validate (`if isinstance(lang, str)` -> validated, lang is specified).
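The type check suggested above could look like this (a minimal sketch, assuming the value is fetched with `dict.get`):

```python
def language_is_specified(country: dict) -> bool:
    """True iff the record carries a Language value at all (empty string included)."""
    lang = country.get("Language")  # None when the key is missing entirely
    return isinstance(lang, str)

language_is_specified({"Language": "pt-PT"})  # True
language_is_specified({"Language": ""})       # True  (known to be none)
language_is_specified({})                     # False (unknown)
```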
```json
"official_name_cn": "",
"official_name_en": "",
"official_name_ru": "",
"Language": "zh-TW"
```
Should this key be `BCP47-lang` to match the ISO-spec'd params? I notice some values are `ISO639-1` codes, so maybe including one or both and letting the parser decide what to return for `language` would make sense?
This makes sense. The source data for this was noisy and wasn't always valid; I filtered results and only kept langs that were 2- or 4-letter codes.
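That filtering step could be sketched with a regex (hypothetical; the PR's actual cleanup script isn't shown here): keep bare two-letter codes (`pt`) and region-qualified ones (`pt-PT`), i.e. 2 or 4 letters.

```python
import re

# a 2-letter code, optionally followed by a 2-letter region -> 2 or 4 letters total
VALID_LANG = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")

def clean_language(raw: str) -> str:
    """Return the code if it looks like a valid lang tag, else an empty string."""
    raw = raw.strip()
    return raw if VALID_LANG.match(raw) else ""

clean_language("pt-PT")   # kept
clean_language("zh-TW")   # kept
clean_language("World")   # dropped -> ""
```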
@coderabbitai review

Actions performed: Review triggered.
Walkthrough

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Parse as lingua_franca.parse
    participant Util as lingua_franca.util
    User ->> Parse: Call extract_currencycode(text)
    Parse ->> Util: Call fuzzy_match for currency matching
    Util -->> Parse: Return matched currency
    Parse -->> User: Return currency code
    User ->> Parse: Call extract_countrycode(text, iso3, lang)
    Parse ->> Util: Call match_one for country matching
    Util -->> Parse: Return matched country
    Parse -->> User: Return country code
    User ->> Parse: Call extract_langcode(text, lang)
    Parse ->> Util: Call match_one for language matching
    Util -->> Parse: Return matched language
    Parse -->> User: Return language code
```
Actionable comments posted: 6
Outside diff range, codebase verification and nitpick comments (1)

test/unittests/test_parse_ru.py (1)

Line range hint 1-24: General code review of unit tests.

The unit tests appear to be comprehensive and well-structured. The use of `setUpModule` and `tearDownModule` for setting up and tearing down the language settings is appropriate. The tests cover a wide range of scenarios, which is good for ensuring robustness. However, there are some commented-out tests and TODOs scattered throughout the file. It would be beneficial to address these or remove them if they are no longer relevant.

Consider either implementing or cleaning up the commented-out code and TODOs to maintain clean and maintainable code.

Also applies to: 26-1000
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- lingua_franca/parse.py (2 hunks)
- test/unittests/test_parse.py (3 hunks)
- test/unittests/test_parse_az.py (1 hunks)
- test/unittests/test_parse_cs.py (1 hunks)
- test/unittests/test_parse_en.py (2 hunks)
- test/unittests/test_parse_ru.py (1 hunks)
Additional context used
Ruff
lingua_franca/parse.py

45-45: Ambiguous variable name: `l` (E741)

58-61: Use ternary operator `name = c[k] if k in c else c["official_name_en"]` instead of `if`-`else`-block (SIM108)

81-83: Use a single `if` statement instead of nested `if` statements (SIM102)

107-110: Use ternary operator `name = c[k] if k in c else c["official_name_en"]` instead of `if`-`else`-block (SIM108)

129-129: Ambiguous variable name: `l` (E741)

141-144: Use ternary operator `name = c[k] if k in c else c["official_name_en"]` instead of `if`-`else`-block (SIM108)
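The E741 and SIM108 findings above amount to a rename and a ternary; a sketch of the fixed pattern (the helper name `pick_name` is my own, not the PR's):

```python
def pick_name(c: dict, lang: str) -> str:
    """SIM108 fix: the if/else block collapses into a single ternary."""
    key = f"official_name_{lang.split('-')[0]}"
    return c[key] if key in c else c["official_name_en"]

# E741 fix: rename the ambiguous single-letter variable 'l'
lang_code = {"Language": "pt-PT"}.get("Language", "").lower()  # was: l = ...

record = {"official_name_pt": "Portugal", "official_name_en": "Portugal"}
pick_name(record, "pt-PT")  # -> "Portugal"
pick_name(record, "ru")     # falls back to official_name_en
```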
Additional comments not posted (10)

test/unittests/test_parse.py (2)

24-24: Review new imports for consistency and correctness.

The new imports for `extract_langcode`, `extract_currencycode`, and `extract_countrycode` are consistent with the changes described in the PR summary. Ensure these functions are used appropriately in the test cases.

26-26: Check relocated utility function imports.

The import of `fuzzy_match` and `match_one` has been moved to `lingua_franca.util`, which is a logical organization for utility functions. This should be reflected in all files that use these functions.

test/unittests/test_parse_az.py (1)

26-26: Review updated imports for consistency and correctness.

The imports have been updated to reflect the changes in utility function locations. This is a good practice as it keeps utility functions organized and accessible from a common module.
test/unittests/test_parse_cs.py (1)

24-24: Updated import statements.

The change in import statements from `lingua_franca.parse` to `lingua_franca.util` for `fuzzy_match`, and the addition of `match_one`, is noted. This alteration should be verified to ensure that it does not break any existing functionality.

Verification successful. The change is correct: the functions are used appropriately in the codebase, and the changes do not break any existing functionality.

- test/unittests/test_parse_cs.py
- test/unittests/test_parse_ru.py
- test/unittests/test_parse.py
- test/unittests/test_parse_az.py

Scripts executed for the analysis:

```shell
#!/bin/bash
# Description: Verify if `fuzzy_match` and `match_one` are being used correctly in the new import context.
# Test: Search for usage of `fuzzy_match` and `match_one` in the entire codebase.
rg --type python -A 5 $'fuzzy_match|match_one'
```

Length of output: 78

```shell
#!/bin/bash
# Same search, with the correct rg type name.
rg --type py -A 5 'fuzzy_match|match_one'
```

Length of output: 12729
test/unittests/test_parse_ru.py (1)

24-24: Import modifications approved.

The imports for `fuzzy_match` and `match_one` have been moved from `lingua_franca.parse` to `lingua_franca.util` as per the summary. This change is consistent with the rest of the file and no issues are apparent.

test/unittests/test_parse_en.py (5)

29-29: Imports updated to include new parsing functions.

The import statement has been expanded to include `extract_currencycode` and `extract_countrycode` in addition to the existing `extract_langcode`. This aligns with the PR's objectives to enhance parsing capabilities for country and currency codes.

1697-1711: Country code extraction tests are well-implemented.

The tests for the `extract_countrycode` method are comprehensive, covering a variety of country names. This ensures that the country code extraction is robust and reliable. Consider adding more diverse test cases: it might be beneficial to include tests for countries with common names or alternative names to ensure broader coverage.

1712-1737: Currency code extraction tests are comprehensive.

The tests for the `extract_currencycode` method cover a range of scenarios, including special cases for the Euro and common currencies. This ensures that the currency code extraction is effective and adheres to the PR's enhancement goals. Consider expanding test cases for additional currencies: adding tests for less common currencies, or those with multiple countries using the same currency, could further enhance robustness.

1673-1674: Language code extraction tests are well-implemented.

The tests for the `extract_langcode` method are comprehensive, covering a variety of language names. This ensures that the language code extraction is robust and reliable. Consider adding more diverse test cases, e.g. languages with dialects or regional variations, to ensure broader coverage.

Also applies to: 1682-1691

1683-1691: Country-specific language code tests are effective.

The tests for the `extract_langcode` method with country-specific variants are well-implemented, covering a variety of scenarios. This enhances the utility of the language parsing functionality in handling internationalization scenarios. Consider expanding test cases for additional regional dialects or less commonly used language variants.
```diff
@@ -21,9 +21,8 @@
 from lingua_franca.parse import extract_datetime
 from lingua_franca.parse import extract_duration
 from lingua_franca.parse import extract_number, extract_numbers
-from lingua_franca.parse import fuzzy_match
+from lingua_franca.util import fuzzy_match, match_one
```
General observation: Commented-out tests and TODOs.
Several tests are commented out, and there are multiple TODO comments throughout the file. These should be addressed to ensure full test coverage and to resolve any pending tasks or bugs.
```python
def test_parse_country_code(self):
    def test_with_conf(text, expected_lang, min_conf=0.6):
        lang, conf = extract_countrycode(text, lang="unk")
        self.assertEqual(lang, expected_lang)
        self.assertGreaterEqual(conf, min_conf)

    # test fallback to english and fuzzy match
    test_with_conf("Português", 'PT')

def test_parse_currency_code_garbage(self):
    def test_with_conf(text, expected_lang, min_conf=0.5):
        lang, conf = extract_currencycode(text, lang="unk")
        self.assertEqual(lang, expected_lang)
        self.assertGreaterEqual(conf, min_conf)

    # simple fuzzy match
    test_with_conf("Brazilian Portuguese", 'BRL')
    test_with_conf("Brazilian", 'BRL')
    test_with_conf("Português", 'EUR')

    # euro special cases
    test_with_conf("dglkm euro jbhjkbadksznlkjn,m", 'EUR')
    test_with_conf("jhgkkjlmfauclh wçio kglbjkhad ph €", 'EUR')
```
Review new test cases for `extract_countrycode`.

The test cases for `extract_countrycode` seem to cover basic scenarios well, but consider adding more diverse test inputs to fully evaluate the function's robustness, especially with edge cases and different country formats.
```diff
+ # Suggested additional test cases
+ test_with_conf("United States", 'US')
+ test_with_conf("USA", 'US')
+ test_with_conf("United Kingdom", 'GB')
+ test_with_conf("Great Britain", 'GB')
```
```python
def test_parse_currency_code_garbage(self):
    def test_with_conf(text, expected_lang, min_conf=0.5):
        lang, conf = extract_currencycode(text, lang="unk")
        self.assertEqual(lang, expected_lang)
        self.assertGreaterEqual(conf, min_conf)

    # simple fuzzy match
    test_with_conf("Brazilian Portuguese", 'BRL')
    test_with_conf("Brazilian", 'BRL')
    test_with_conf("Português", 'EUR')

    # euro special cases
    test_with_conf("dglkm euro jbhjkbadksznlkjn,m", 'EUR')
    test_with_conf("jhgkkjlmfauclh wçio kglbjkhad ph €", 'EUR')
```
Review new test cases for `extract_currencycode`.

The tests for `extract_currencycode` are good for basic validation. However, adding tests for non-standard inputs, and ensuring that the function correctly handles unexpected or malformed inputs, would enhance the test suite's coverage.
```diff
+ # Suggested additional test cases
+ test_with_conf("US Dollar", 'USD')
+ test_with_conf("Euro", 'EUR')
+ test_with_conf("Japanese Yen", 'JPY')
```
```python
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_langcode(text, lang=""):
    resource_file = resolve_resource_file(f"text/{lang}/langs.json") or \
        resolve_resource_file("text/en-us/langs.json")
    with open(resource_file) as f:
        LANGUAGES = {v: k for k, v in json.load(f).items()}

    best_lang, best_score = match_one(text, LANGUAGES, strategy=MatchStrategy.TOKEN_SET_RATIO)

    # match country names
    if best_score < 0.7:
        resource_file = resolve_resource_file("countries.json")
        with open(resource_file) as f:
            countries = json.load(f)
        for c in countries:
            if "Language" not in c:
                continue
            k = f"official_name_{lang.split('-')[0]}"
            if k in c:
                name = c[k]
            else:
                name = c["official_name_en"]
            score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
            if score >= best_score:
                best_lang, best_score = c["Language"].lower(), score

    return best_lang, best_score
```
Review the improvements to `extract_langcode`.

The function now handles matching country names, which is a significant improvement for localization. Ensure that the function's performance is not adversely affected by these changes, especially with a potentially large number of country names to match.

```diff
+ # Consider optimizing the country name matching process
+ # Maybe use a precompiled regex or a more efficient data structure
```

Committable suggestion was skipped due to low confidence.
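One easy win for the performance concern above is to parse and cache `countries.json` once instead of on every call; a sketch using `functools.lru_cache` (the loader name and stand-in file are illustrative, not the PR's code):

```python
import json
from functools import lru_cache
from pathlib import Path
from tempfile import NamedTemporaryFile

@lru_cache(maxsize=1)
def load_countries(path: str):
    """Parse the resource a single time; later calls with the same path hit the cache."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

# demo with a stand-in resource file
with NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"official_name_en": "Portugal", "Language": "pt-PT"}], f)
    sample = f.name

countries = load_countries(sample)
countries_again = load_countries(sample)  # served from cache, no second disk read
```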
```python
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_countrycode(text, iso3=False, lang=""):

    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_country = None

    for c in countries:
        # if text is a langcode, return parent country
        l = c.get("Language", "").lower()
        if not l:
            lang_score = 0
        elif l == f'{c["ISO3166-1-Alpha-2"]}-{c["ISO3166-1-Alpha-2"]}'.lower() and l in text:
            lang_score = 1.0
        else:
            lang_score = fuzzy_match(text, l, strategy=MatchStrategy.TOKEN_SET_RATIO) * 0.8
            if c["ISO3166-1-Alpha-2"].lower() in l:
                lang_score += 0.05

        # match country name to text
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]

        name_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)

        if name_score < 0.7 <= lang_score:
            score = lang_score
        elif lang_score < 0.7 <= name_score:
            score = name_score
        else:
            score = 0.5 * name_score + 0.5 * lang_score

        if score >= best_score:
            if iso3:
                best_country, best_score = c["ISO3166-1-Alpha-3"], score
            else:
                best_country, best_score = c["ISO3166-1-Alpha-2"], score

    return best_country, best_score
```
Review the implementation of `extract_countrycode`.

This function's implementation is robust, but ensure that it handles cases where the country code or name might be ambiguous or represented in multiple ways.

```diff
+ # Consider adding more sophisticated handling for ambiguous country names
+ if 'UK' in text or 'GB' in text or 'Britain' in text:
+     return 'GB', 1.0
+ if 'USA' in text or 'US' in text or 'America' in text:
+     return 'US', 1.0
```
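Note that bare substring checks like `'US' in text` also fire on words such as "RUSSIA"; a token-based alias table avoids that false positive (a sketch, not part of the PR):

```python
# Hypothetical alias table; extend as needed
ALIASES = {
    "uk": "GB", "britain": "GB", "great britain": "GB",
    "usa": "US", "america": "US", "united states": "US",
}

def resolve_alias(text: str):
    """Match aliases on whole tokens (or whole phrases) instead of raw substrings."""
    lowered = text.lower()
    tokens = set(lowered.split())
    for alias, code in ALIASES.items():
        if alias in tokens or (" " in alias and alias in lowered):
            return code, 1.0
    return None, 0.0

resolve_alias("flights to the USA")  # ('US', 1.0)
resolve_alias("RUSSIA")              # (None, 0.0) — no false positive on 'US'
```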
```python
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_currencycode(text, lang=""):
    # this method tries to be lang agnostic and use mainly fuzzy matching
    # it should be considered a fallback for unimplemented languages
    # dedicated per language implementations wanted!

    # match lang
    l, s = extract_langcode(text, lang=lang)

    # match country data
    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_currency = None

    for c in countries:
        if not c["ISO4217-currency_alphabetic_code"]:
            continue
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]

        # match currency name + country name + country lang
        currency_score = fuzzy_match(text, c["ISO4217-currency_name"], strategy=MatchStrategy.TOKEN_SET_RATIO)
        country_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
        lang_score = 0
        if l in c.get("Language", "").lower():
            # bonus if language is spoken in this country
            lang_score = s * 0.6
            # bonus if country code is part of language code
            if c['ISO3166-1-Alpha-2'].lower() in l:
                lang_score = s

        score = max([currency_score, country_score]) * 0.8 + 0.2 * lang_score

        if score > best_score:
            best_score = score
            best_currency = c["ISO4217-currency_alphabetic_code"]

    # special corner cases
    if best_score < 0.55:
        # european union
        if "euro" in text.lower() or "€" in text:
            return "EUR", 0.5

    return best_currency, best_score
```
Review the implementation of `extract_currencycode`.

The function uses fuzzy matching to determine currency codes, which is appropriate given the variability of text input. However, consider handling edge cases where the currency might not be directly mentioned or is abbreviated differently.

```diff
+ # Consider adding handling for common abbreviations and edge cases
+ if 'USD' in text or 'dollar' in text:
+     return 'USD', 1.0
+ if 'GBP' in text or 'pound' in text:
+     return 'GBP', 1.0
```
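A small lookup over currency symbols and ISO 4217 codes could back that suggestion as a pre-pass before fuzzy matching; the table below is illustrative, not exhaustive:

```python
# Hypothetical pre-pass: explicit symbols/abbreviations beat fuzzy matching
CURRENCY_HINTS = {
    "$": "USD", "usd": "USD", "dollar": "USD",
    "£": "GBP", "gbp": "GBP", "pound": "GBP",
    "€": "EUR", "eur": "EUR", "euro": "EUR",
    "¥": "JPY", "jpy": "JPY", "yen": "JPY",
}

def currency_hint(text: str):
    """Return (code, confidence) when an explicit symbol or abbreviation appears."""
    lowered = text.lower()
    for hint, code in CURRENCY_HINTS.items():
        if hint in lowered:
            return code, 1.0
    return None, 0.0

currency_hint("that costs 20 £")      # ('GBP', 1.0)
currency_hint("price in yen please")  # ('JPY', 1.0)
```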
adds a .json with per country data -> closes #24

adds `extract_currency` util

Summary by CodeRabbit

New Features

- `extract_langcode` function.

Tests

- `extract_currencycode` and `extract_countrycode` functions.

Refactor