feat/country_utils #26

Open
wants to merge 7 commits into
base: dev
from

Conversation

NeonJarbas

@NeonJarbas NeonJarbas commented May 25, 2022

adds a .json with per country data -> closes #24
adds extract_currency util

Summary by CodeRabbit

  • New Features

    • Added functionality to extract currency codes and country codes from text input.
    • Enhanced language processing capabilities with improved extract_langcode function.
  • Tests

    • Added unit tests for the new extract_currencycode and extract_countrycode functions.
    • Updated test cases to validate country and currency code extraction in various languages.
  • Refactor

    • Reorganized import statements for better modularity and readability.

@JarbasAl JarbasAl requested a review from NeonDaniel May 25, 2022 15:18
@JarbasAl JarbasAl added the enhancement New feature or request label May 25, 2022
@NeonJarbas NeonJarbas marked this pull request as draft May 25, 2022 15:22
@codecov

codecov bot commented May 25, 2022

Codecov Report

❗ No coverage uploaded for pull request base (dev@392cc37).
The diff coverage is n/a.

@@          Coverage Diff          @@
##             dev     #26   +/-   ##
=====================================
  Coverage       ?   0.00%           
=====================================
  Files          ?      65           
  Lines          ?   16409           
  Branches       ?       0           
=====================================
  Hits           ?       0           
  Misses         ?   16409           
  Partials       ?       0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 392cc37...d29c7bb.

@JarbasAl JarbasAl changed the title feat/extract_currency feat/country_utils May 25, 2022
"official_name_ar": "",
"official_name_es": "",
"official_name_cn": "",
"official_name_en": "",

Taiwan


this was missing in source data for some reason, i didn't validate it

https://github.com/datasets/country-codes/blob/master/data/country-codes.csv

@@ -0,0 +1,3227 @@
[

Should this be duplicated across languages instead of having all translated names in one file to match the structure of other resources?


i thought about this, but that requires a whole lot of duplicated keys with redundant info... not sure what's best?


I think separate files are easier for people to extend and less prone to typo errors in keys (official_name_xx). Maybe a common resource for country code, native lang name?
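
To make the proposal above concrete, here is a minimal sketch of how the combined records could be split into one common table (keyed by country code) plus one name table per language. The helper name `split_by_language` and the sample records are hypothetical; only the `official_name_xx`, `ISO3166-1-Alpha-2` and `Language` keys come from the countries.json in this PR.

```python
from collections import defaultdict

NAME_PREFIX = "official_name_"


def split_by_language(countries):
    """Split combined records into shared data and per-language name tables."""
    common = {}                # country code -> language-independent fields
    names = defaultdict(dict)  # language -> {country code: localized name}
    for record in countries:
        code = record["ISO3166-1-Alpha-2"]
        common[code] = {k: v for k, v in record.items()
                        if not k.startswith(NAME_PREFIX)}
        for key, value in record.items():
            # skip empty names so each per-language table lists only known values
            if key.startswith(NAME_PREFIX) and value:
                names[key[len(NAME_PREFIX):]][code] = value
    return common, dict(names)


# Hypothetical sample records, shaped like entries in countries.json
sample = [
    {"ISO3166-1-Alpha-2": "AQ", "official_name_en": "Antarctica",
     "official_name_es": "Antártida"},
    {"ISO3166-1-Alpha-2": "TW", "official_name_en": "", "Language": "zh-TW"},
]
common, names = split_by_language(sample)
```

With this layout a translator would only touch the table for their own language, and the `official_name_xx` key would never be typed by hand.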

lingua_franca/res/countries.json
"official_name_es": "Antártida",
"official_name_cn": "南极洲",
"official_name_en": "Antarctica",
"official_name_ru": "Антарктике"

No language.. should it be None or an empty string to prevent reference errors?

@JarbasAl JarbasAl May 25, 2022


this simply said "world" in source data

not sure if adding empty string or None makes sense... are we making a distinction between undefined/unknown?

i usually look at this stuff as the responsibility of whatever is reading the file; the data provides datapoints about stuff it knows about, not about what is missing

ie, empty string means undefined, missing key means unknown ?


Hmm. I guess this will be up to LF to handle, so I'm not sure what will be easier to parse.. I think an empty value will make it more obvious at a glance if something is missing, though I'm not sure it makes much difference unless there's an instance where there is no language (I don't think that should ever be the case?). Empty string would probably be the easiest way to specify the language is known to be none, since we could just use a type check to validate (if isinstance(lang, str) -> validated lang is specified)
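
The convention discussed in this thread (missing key means unknown, empty string means known to be none) could be read on the parser side roughly as follows. This is only a sketch: `describe_language` is a hypothetical helper, not part of lingua_franca.

```python
def describe_language(record):
    """Classify the "Language" field of a country record."""
    lang = record.get("Language")  # None when the key is absent
    if not isinstance(lang, str):
        return "unknown"           # missing key: the data says nothing
    if lang == "":
        return "none"              # empty string: known to have no language
    return lang                    # a specified language code


# Hypothetical records for illustration
print(describe_language({"official_name_en": "Antarctica"}))
print(describe_language({"Language": ""}))
print(describe_language({"Language": "zh-TW"}))
```

The `isinstance` check is the type check mentioned above; it also treats an explicit `null` in the JSON the same way as a missing key.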

lingua_franca/res/countries.json
"official_name_cn": "",
"official_name_en": "",
"official_name_ru": "",
"Language": "zh-TW"

should this be BCP47-lang to match the ISO spec'd params? I notice some are ISO639-1 codes, so maybe including one or both and letting the parser decide what to return for language would make sense?


this makes sense, the source data for this was noisy and wasn't always valid, i filtered results and only left langs that were 2 or 4 letter codes
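
A rough sketch of the filtering described here, returning both forms so the parser can decide which one to use. It assumes that a "4 letter" code means a 2-letter language plus a 2-letter region (such as zh-TW); `normalize_langcode` is a hypothetical name, and the pattern is deliberately narrower than full BCP 47.

```python
import re

# Assumption: a "4 letter" code is a 2-letter language plus a 2-letter region
LANG_RE = re.compile(r"^[A-Za-z]{2}(?:-[A-Za-z]{2})?$")


def normalize_langcode(code):
    """Return (bcp47_like, iso639_1) for a 2 or 4 letter code, else None."""
    if not LANG_RE.match(code):
        return None  # noisy source value, e.g. "world"
    parts = code.split("-")
    lang = parts[0].lower()
    if len(parts) == 2:
        return f"{lang}-{parts[1].upper()}", lang
    return lang, lang


print(normalize_langcode("zh-TW"))
print(normalize_langcode("world"))
```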

@JarbasAl JarbasAl marked this pull request as ready for review May 25, 2022 17:43
@JarbasAl JarbasAl marked this pull request as draft May 25, 2022 18:16
@JarbasAl JarbasAl marked this pull request as ready for review May 25, 2022 23:17
@JarbasAl JarbasAl requested a review from NeonDaniel May 25, 2022 23:17
@JarbasAl

@coderabbitai review


coderabbitai bot commented Jul 17, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Jul 17, 2024

Walkthrough

The changes in the lingua_franca module enhance text parsing capabilities by introducing functions for extracting country codes and currency codes, in addition to improving language code extraction. These enhancements improve the module's localization features and extend its utility in handling various internationalization scenarios.

Changes

Files | Change Summary
lingua_franca/parse.py | Reorganized imports, added extract_currencycode and extract_countrycode, and updated extract_langcode.
test/unittests/test_parse.py | Added tests for extract_currencycode and extract_countrycode, moved fuzzy_match and match_one to util.
test/unittests/test_parse_az.py | Changed imports for fuzzy_match and match_one to lingua_franca.util.
test/unittests/test_parse_cs.py | Changed imports for fuzzy_match and match_one to lingua_franca.util.
test/unittests/test_parse_en.py | Added tests for parsing country codes, currency codes, and their respective languages.
test/unittests/test_parse_ru.py | Changed imports for fuzzy_match and match_one to lingua_franca.util.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Parse as lingua_franca.parse
    participant Util as lingua_franca.util

    User ->> Parse: Call extract_currencycode(text)
    Parse ->> Util: Call fuzzy_match for currency matching
    Util -->> Parse: Return matched currency
    Parse -->> User: Return currency code

    User ->> Parse: Call extract_countrycode(text, iso3, lang)
    Parse ->> Util: Call match_one for country matching
    Util -->> Parse: Return matched country
    Parse -->> User: Return country code

    User ->> Parse: Call extract_langcode(text, lang)
    Parse ->> Util: Call match_one for language matching
    Util -->> Parse: Return matched language
    Parse -->> User: Return language code

Assessment against linked issues

Objective Addressed Explanation
Localized language code extraction (#24)
Parsing locales for default units, currencies, and locations

Poem

In code we weave a global thread,
Currency and country, easily read.
With languages parsed, near and far,
Our module shines like a guiding star.
Through tests and trials, robust and bright,
We celebrate this linguistic flight! 🌍💸



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Outside diff range, codebase verification and nitpick comments (1)
test/unittests/test_parse_ru.py (1)

Line range hint 1-24: General code review of unit tests.

The unit tests appear to be comprehensive and well-structured. The use of setUpModule and tearDownModule for setting up and tearing down the language settings is appropriate. The tests cover a wide range of scenarios, which is good for ensuring robustness.

However, there are some commented-out tests and TODOs scattered throughout the file. It would be beneficial to address these or remove them if they are no longer relevant.

Consider either implementing or cleaning up the commented-out code and TODOs to maintain clean and maintainable code.

Also applies to: 26-1000

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 392cc37 and d29c7bb.

Files selected for processing (6)
  • lingua_franca/parse.py (2 hunks)
  • test/unittests/test_parse.py (3 hunks)
  • test/unittests/test_parse_az.py (1 hunks)
  • test/unittests/test_parse_cs.py (1 hunks)
  • test/unittests/test_parse_en.py (2 hunks)
  • test/unittests/test_parse_ru.py (1 hunks)
Additional context used
Ruff
lingua_franca/parse.py

45-45: Ambiguous variable name: l

(E741)


58-61: Use ternary operator name = c[k] if k in c else c["official_name_en"] instead of if-else-block

Replace if-else-block with name = c[k] if k in c else c["official_name_en"]

(SIM108)


81-83: Use a single if statement instead of nested if statements

(SIM102)


107-110: Use ternary operator name = c[k] if k in c else c["official_name_en"] instead of if-else-block

Replace if-else-block with name = c[k] if k in c else c["official_name_en"]

(SIM108)


129-129: Ambiguous variable name: l

(E741)


141-144: Use ternary operator name = c[k] if k in c else c["official_name_en"] instead of if-else-block

Replace if-else-block with name = c[k] if k in c else c["official_name_en"]

(SIM108)
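
Since the same name lookup is flagged three times, one option is to pull it into a small helper. The sketch below is an assumption, not the PR's code: `localized_name` is a hypothetical function, and it intentionally goes one step beyond Ruff's literal suggestion by also falling back when the key exists but holds an empty string (as some entries in countries.json do).

```python
def localized_name(country, lang):
    """Pick the localized official name, falling back to English."""
    key = f"official_name_{lang.split('-')[0]}"
    # `or` also falls back when the key exists but holds an empty string
    return country.get(key) or country["official_name_en"]


# Hypothetical record for illustration
antarctica = {
    "official_name_en": "Antarctica",
    "official_name_es": "Antártida",
    "official_name_cn": "",
}
print(localized_name(antarctica, "es-es"))
print(localized_name(antarctica, "cn"))
```

Renaming `l` to something like `langcode` at the flagged lines would address E741 without changing behavior.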

Additional comments not posted (10)
test/unittests/test_parse.py (2)

24-24: Review new imports for consistency and correctness.

The new imports for extract_langcode, extract_currencycode, and extract_countrycode are consistent with the changes described in the PR summary. Ensure these functions are used appropriately in the test cases.


26-26: Check relocated utility functions imports.

The import of fuzzy_match and match_one has been moved to lingua_franca.util which is a logical organization for utility functions. This should be reflected in all files that use these functions.

test/unittests/test_parse_az.py (1)

26-26: Review updated imports for consistency and correctness.

The imports have been updated to reflect the changes in utility function locations. This is a good practice as it keeps utility functions organized and accessible from a common module.

test/unittests/test_parse_cs.py (1)

24-24: Updated import statements.

The change in import statements from lingua_franca.parse to lingua_franca.util for fuzzy_match and the addition of match_one is noted. This alteration should be verified to ensure that it does not break any existing functionality.

Verification successful

Updated import statements.

The change in import statements from lingua_franca.parse to lingua_franca.util for fuzzy_match and the addition of match_one is correct. The functions are used appropriately in the codebase, and the changes do not break any existing functionality.

  • test/unittests/test_parse_cs.py
  • test/unittests/test_parse_ru.py
  • test/unittests/test_parse.py
  • test/unittests/test_parse_az.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify if `fuzzy_match` and `match_one` are being used correctly in the new import context.

# Test: Search for usage of `fuzzy_match` and `match_one` in the entire codebase.
rg --type python -A 5 $'fuzzy_match|match_one'

Length of output: 78


Script:

#!/bin/bash
# Description: Verify if `fuzzy_match` and `match_one` are being used correctly in the new import context.

# Test: Search for usage of `fuzzy_match` and `match_one` in the entire codebase.
rg --type py -A 5 'fuzzy_match|match_one'

Length of output: 12729

test/unittests/test_parse_ru.py (1)

24-24: Import modifications approved.

The imports for fuzzy_match and match_one have been moved from lingua_franca.parse to lingua_franca.util as per the summary. This change is consistent with the rest of the file and no issues are apparent.

test/unittests/test_parse_en.py (5)

29-29: Imports updated to include new parsing functions.

The import statement has been expanded to include extract_currencycode and extract_countrycode in addition to the existing extract_langcode. This aligns with the PR's objectives to enhance parsing capabilities for country and currency codes.


1697-1711: Country code extraction tests are well-implemented.

The tests for extract_countrycode method are comprehensive, covering a variety of country names. This ensures that the country code extraction is robust and reliable.

Consider adding more diverse test cases.

It might be beneficial to include tests for countries with common names or alternative names to ensure broader coverage.
[suggestion]


1712-1737: Currency code extraction tests are comprehensive.

The tests for extract_currencycode method cover a range of scenarios, including special cases for the Euro and common currencies. This ensures that the currency code extraction is effective and adheres to the PR's enhancement goals.

Consider expanding test cases for additional currencies.

Adding tests for less common currencies or those with multiple countries using the same currency could further enhance the robustness of the currency parsing functionality.
[suggestion]


1673-1674: Language code extraction tests are well-implemented.

The tests for extract_langcode method are comprehensive, covering a variety of language names. This ensures that the language code extraction is robust and reliable.

Consider adding more diverse test cases.

It might be beneficial to include tests for languages with dialects or regional variations to ensure broader coverage.
[suggestion]

Also applies to: 1682-1691


1683-1691: Country-specific language code tests are effective.

The tests for extract_langcode method with country-specific variants are well-implemented, covering a variety of scenarios. This enhances the utility of the language parsing functionality in handling internationalization scenarios.

Consider expanding test cases for additional regional dialects.

Adding tests for more regional dialects or less commonly used language variants could further enhance the robustness of the language parsing functionality.
[suggestion]

@@ -21,9 +21,8 @@
 from lingua_franca.parse import extract_datetime
 from lingua_franca.parse import extract_duration
 from lingua_franca.parse import extract_number, extract_numbers
-from lingua_franca.parse import fuzzy_match
+from lingua_franca.util import fuzzy_match, match_one

General observation: Commented-out tests and TODOs.

Several tests are commented out, and there are multiple TODO comments throughout the file. These should be addressed to ensure full test coverage and to resolve any pending tasks or bugs.

Comment on lines +135 to +158
    def test_parse_country_code(self):
        def test_with_conf(text, expected_lang, min_conf=0.6):
            lang, conf = extract_countrycode(text, lang="unk")
            self.assertEqual(lang, expected_lang)
            self.assertGreaterEqual(conf, min_conf)

        # test fallback to english and fuzzy match
        test_with_conf("Português", 'PT')

    def test_parse_currency_code_garbage(self):
        def test_with_conf(text, expected_lang, min_conf=0.5):
            lang, conf = extract_currencycode(text, lang="unk")
            self.assertEqual(lang, expected_lang)
            self.assertGreaterEqual(conf, min_conf)

        # simple fuzzy match
        test_with_conf("Brazilian Portuguese", 'BRL')
        test_with_conf("Brazilian", 'BRL')
        test_with_conf("Português", 'EUR')

        # euro special cases
        test_with_conf("dglkm euro jbhjkbadksznlkjn,m", 'EUR')
        test_with_conf("jhgkkjlmfauclh wçio kglbjkhad ph €", 'EUR')


Review new test cases for extract_countrycode.

The test cases for extract_countrycode seem to cover basic scenarios well, but consider adding more diverse test inputs to fully evaluate the function's robustness, especially with edge cases and different country formats.

+ # Suggested additional test cases
+ test_with_conf("United States", 'US')
+ test_with_conf("USA", 'US')
+ test_with_conf("United Kingdom", 'GB')
+ test_with_conf("Great Britain", 'GB')
Suggested change
    def test_parse_country_code(self):
        def test_with_conf(text, expected_lang, min_conf=0.6):
            lang, conf = extract_countrycode(text, lang="unk")
            self.assertEqual(lang, expected_lang)
            self.assertGreaterEqual(conf, min_conf)

        # test fallback to english and fuzzy match
        test_with_conf("Português", 'PT')
        # Suggested additional test cases
        test_with_conf("United States", 'US')
        test_with_conf("USA", 'US')
        test_with_conf("United Kingdom", 'GB')
        test_with_conf("Great Britain", 'GB')

    def test_parse_currency_code_garbage(self):
        def test_with_conf(text, expected_lang, min_conf=0.5):
            lang, conf = extract_currencycode(text, lang="unk")
            self.assertEqual(lang, expected_lang)
            self.assertGreaterEqual(conf, min_conf)

        # simple fuzzy match
        test_with_conf("Brazilian Portuguese", 'BRL')
        test_with_conf("Brazilian", 'BRL')
        test_with_conf("Português", 'EUR')

        # euro special cases
        test_with_conf("dglkm euro jbhjkbadksznlkjn,m", 'EUR')
        test_with_conf("jhgkkjlmfauclh wçio kglbjkhad ph €", 'EUR')

Comment on lines +144 to +158
    def test_parse_currency_code_garbage(self):
        def test_with_conf(text, expected_lang, min_conf=0.5):
            lang, conf = extract_currencycode(text, lang="unk")
            self.assertEqual(lang, expected_lang)
            self.assertGreaterEqual(conf, min_conf)

        # simple fuzzy match
        test_with_conf("Brazilian Portuguese", 'BRL')
        test_with_conf("Brazilian", 'BRL')
        test_with_conf("Português", 'EUR')

        # euro special cases
        test_with_conf("dglkm euro jbhjkbadksznlkjn,m", 'EUR')
        test_with_conf("jhgkkjlmfauclh wçio kglbjkhad ph €", 'EUR')


Review new test cases for extract_currencycode.

The tests for extract_currencycode are good for basic validation. However, adding tests for non-standard inputs and ensuring that the function correctly handles unexpected or malformed inputs would enhance the test suite's coverage.

+ # Suggested additional test cases
+ test_with_conf("US Dollar", 'USD')
+ test_with_conf("Euro", 'EUR')
+ test_with_conf("Japanese Yen", 'JPY')
Suggested change
    def test_parse_currency_code_garbage(self):
        def test_with_conf(text, expected_lang, min_conf=0.5):
            lang, conf = extract_currencycode(text, lang="unk")
            self.assertEqual(lang, expected_lang)
            self.assertGreaterEqual(conf, min_conf)

        # simple fuzzy match
        test_with_conf("Brazilian Portuguese", 'BRL')
        test_with_conf("Brazilian", 'BRL')
        test_with_conf("Português", 'EUR')

        # euro special cases
        test_with_conf("dglkm euro jbhjkbadksznlkjn,m", 'EUR')
        test_with_conf("jhgkkjlmfauclh wçio kglbjkhad ph €", 'EUR')
        # Suggested additional test cases
        test_with_conf("US Dollar", 'USD')
        test_with_conf("Euro", 'EUR')
        test_with_conf("Japanese Yen", 'JPY')

Comment on lines +88 to +115

 @localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
 def extract_langcode(text, lang=""):
     resource_file = resolve_resource_file(f"text/{lang}/langs.json") or \
                     resolve_resource_file("text/en-us/langs.json")
     with open(resource_file) as f:
         LANGUAGES = {v: k for k, v in json.load(f).items()}
-    return match_one(text, LANGUAGES, strategy=MatchStrategy.TOKEN_SET_RATIO)
+
+    best_lang, best_score = match_one(text, LANGUAGES, strategy=MatchStrategy.TOKEN_SET_RATIO)
+
+    # match country names
+    if best_score < 0.7:
+        resource_file = resolve_resource_file("countries.json")
+        with open(resource_file) as f:
+            countries = json.load(f)
+        for c in countries:
+            if "Language" not in c:
+                continue
+            k = f"official_name_{lang.split('-')[0]}"
+            if k in c:
+                name = c[k]
+            else:
+                name = c["official_name_en"]
+            score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
+            if score >= best_score:
+                best_lang, best_score = c["Language"].lower(), score
+
+    return best_lang, best_score

Review the improvements to extract_langcode.

The function now handles matching country names, which is a significant improvement for localization. Ensure that the function's performance is not adversely affected by these changes, especially with a potentially large number of country names to match.

+ # Consider optimizing the country name matching process
+ # Maybe use a precompiled regex or a more efficient data structure

Committable suggestion was skipped due to low confidence.
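
One inexpensive way to address the performance concern, offered here as an alternative to the regex idea above, is to stop re-reading and re-parsing countries.json on every call. The sketch below uses `functools.lru_cache` from the standard library; `load_countries` is a hypothetical helper, and the temporary file merely stands in for the path that `resolve_resource_file("countries.json")` would return.

```python
import json
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_countries(path):
    """Read and parse the JSON file once per path; later calls hit the cache."""
    with open(path, encoding="utf-8") as f:
        # tuple so callers cannot add or remove entries in the cached result
        return tuple(json.load(f))


# Demo with a throwaway file standing in for countries.json
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    json.dump([{"official_name_en": "Portugal", "Language": "pt-PT"}], tmp)

first = load_countries(tmp.name)
os.remove(tmp.name)                # the second call never touches the disk
second = load_countries(tmp.name)
```

This does not reduce the number of fuzzy comparisons per call, only the file I/O and JSON parsing; whether that is the actual bottleneck would need measuring.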

Tools
Ruff

107-110: Use ternary operator name = c[k] if k in c else c["official_name_en"] instead of if-else-block

Replace if-else-block with name = c[k] if k in c else c["official_name_en"]

(SIM108)

Comment on lines +117 to +161

@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_countrycode(text, iso3=False, lang=""):

    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_country = None

    for c in countries:
        # if text is a langcode, return parent country
        l = c.get("Language", "").lower()
        if not l:
            lang_score = 0
        elif l == f'{c["ISO3166-1-Alpha-2"]}-{c["ISO3166-1-Alpha-2"]}'.lower() and l in text:
            lang_score = 1.0
        else:
            lang_score = fuzzy_match(text, l, strategy=MatchStrategy.TOKEN_SET_RATIO) * 0.8
            if c["ISO3166-1-Alpha-2"].lower() in l:
                lang_score += 0.05

        # match country name to text
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]

        name_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)

        if name_score < 0.7 <= lang_score:
            score = lang_score
        elif lang_score < 0.7 <= name_score:
            score = name_score
        else:
            score = 0.5 * name_score + 0.5 * lang_score

        if score >= best_score:
            if iso3:
                best_country, best_score = c["ISO3166-1-Alpha-3"], score
            else:
                best_country, best_score = c["ISO3166-1-Alpha-2"], score

    return best_country, best_score
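
The if/elif/else that merges `name_score` and `lang_score` in the function above can be isolated as a pure function, which makes the rule easier to read and to unit test. This is a sketch that mirrors the PR's logic; `combine_scores` and its `threshold` parameter are hypothetical names.

```python
def combine_scores(name_score, lang_score, threshold=0.7):
    """Trust a single strong signal on its own; otherwise average the two."""
    if name_score < threshold <= lang_score:
        return lang_score   # only the language code matched well
    if lang_score < threshold <= name_score:
        return name_score   # only the country name matched well
    # both strong or both weak: plain average
    return 0.5 * name_score + 0.5 * lang_score


print(combine_scores(0.9, 0.2))  # lone strong name match is kept as-is
print(combine_scores(1.0, 0.7))  # both at or above the threshold: averaged
```

One consequence worth noting: when both scores clear the threshold they are averaged, so a perfect name match paired with a weaker language match ends up below 1.0.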

Review the implementation of extract_countrycode.

This function's implementation is robust, but ensure that it handles cases where the country code or name might be ambiguous or represented in multiple ways.

+ # Consider adding more sophisticated handling for ambiguous country names
+ if 'UK' in text or 'GB' in text or 'Britain' in text:
+     return 'GB', 1.0
+ if 'USA' in text or 'US' in text or 'America' in text:
+     return 'US', 1.0
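
A caution on the snippet above: a plain substring test such as `'US' in text` also fires on uppercase input like "RUSSIA" or "BONUS". If alias shortcuts are wanted, a whole-word lookup is safer. The following is a sketch under assumptions: the `ALIASES` table and `match_alias` are hypothetical and not part of this PR.

```python
import re

# Hypothetical alias table; the entries are illustrative, not from the PR
ALIASES = {
    "uk": "GB", "gb": "GB", "britain": "GB", "great britain": "GB",
    "us": "US", "usa": "US", "america": "US",
}
# Longest aliases first so "great britain" wins over "britain"
_ALIAS_RE = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, ALIASES), key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def match_alias(text):
    """Return (country_code, 1.0) for a whole-word alias, else None."""
    m = _ALIAS_RE.search(text)
    if m is None:
        return None
    return ALIASES[m.group(1).lower()], 1.0


print(match_alias("made in the USA"))
print(match_alias("RUSSIA"))
```

Because the match is case-insensitive, the English pronoun "us" would still be taken for the country, so a table like this probably belongs in per-language resources rather than in the language-agnostic fallback.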
Suggested change
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_countrycode(text, iso3=False, lang=""):
    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_country = None
    # Consider adding more sophisticated handling for ambiguous country names
    if 'UK' in text or 'GB' in text or 'Britain' in text:
        return 'GB', 1.0
    if 'USA' in text or 'US' in text or 'America' in text:
        return 'US', 1.0
    for c in countries:
        # if text is a langcode, return parent country
        l = c.get("Language", "").lower()
        if not l:
            lang_score = 0
        elif l == f'{c["ISO3166-1-Alpha-2"]}-{c["ISO3166-1-Alpha-2"]}'.lower() and l in text:
            lang_score = 1.0
        else:
            lang_score = fuzzy_match(text, l, strategy=MatchStrategy.TOKEN_SET_RATIO) * 0.8
            if c["ISO3166-1-Alpha-2"].lower() in l:
                lang_score += 0.05
        # match country name to text
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]
        name_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
        if name_score < 0.7 <= lang_score:
            score = lang_score
        elif lang_score < 0.7 <= name_score:
            score = name_score
        else:
            score = 0.5 * name_score + 0.5 * lang_score
        if score >= best_score:
            if iso3:
                best_country, best_score = c["ISO3166-1-Alpha-3"], score
            else:
                best_country, best_score = c["ISO3166-1-Alpha-2"], score
    return best_country, best_score
Tools
Ruff

129-129: Ambiguous variable name: l

(E741)


141-144: Use ternary operator name = c[k] if k in c else c["official_name_en"] instead of if-else-block

Replace if-else-block with name = c[k] if k in c else c["official_name_en"]

(SIM108)
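
As a rough illustration of the two Ruff hints above, the name lookup can be written as a conditional expression with descriptive variable names. This is only a sketch: the helper name localized_country_name and the sample entry are made up for the example, and only the official_name_<lang> key pattern comes from the code under review.

```python
def localized_country_name(country, lang=""):
    """Return the country name for the base language, falling back to English.

    `country` is assumed to be one entry of countries.json, i.e. a dict with
    an "official_name_en" key and optional "official_name_<lang>" keys.
    """
    # "pt-br" -> "pt"; an empty lang yields a key that will not be present
    name_key = f"official_name_{lang.split('-')[0]}"
    # SIM108: a conditional expression instead of a four-line if/else block
    return country[name_key] if name_key in country else country["official_name_en"]


# Made-up sample entry, not copied from countries.json
sample_entry = {"official_name_en": "Germany", "official_name_de": "Deutschland"}
german_name = localized_country_name(sample_entry, lang="de-DE")
fallback_name = localized_country_name(sample_entry, lang="fr-FR")
```

Renaming l to something like lang_code in the surrounding loop would address E741 in the same spirit.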

Comment on lines +38 to +86
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_currencycode(text, lang=""):
    # this method tries to be lang agnostic and use mainly fuzzy matching
    # it should be considered a fallback for unimplemented languages
    # dedicated per language implementations wanted!

    # match lang
    l, s = extract_langcode(text, lang=lang)

    # match country data
    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_currency = None

    for c in countries:
        if not c["ISO4217-currency_alphabetic_code"]:
            continue
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]

        # match currency name + country name + country lang
        currency_score = fuzzy_match(text, c["ISO4217-currency_name"], strategy=MatchStrategy.TOKEN_SET_RATIO)
        country_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
        lang_score = 0
        if l in c.get("Language", "").lower():
            # bonus if language is spoken in this country
            lang_score = s * 0.6
            # bonus if country code is part of language code
            if c['ISO3166-1-Alpha-2'].lower() in l:
                lang_score = s

        score = max([currency_score, country_score]) * 0.8 + 0.2 * lang_score

        if score > best_score:
            best_score = score
            best_currency = c["ISO4217-currency_alphabetic_code"]

    # special corner cases
    if best_score < 0.55:
        # european union
        if "euro" in text.lower() or "€" in text:
            return "EUR", 0.5

    return best_currency, best_score
Review the implementation of extract_currencycode.

The function uses fuzzy matching to determine currency codes, which suits free-form text input. However, it has no explicit handling for text that names a currency only by its ISO code or by a common short name (for example "USD" or "pound"). Consider adding checks for those cases.

+ # Consider adding handling for common abbreviations and edge cases
+ if 'USD' in text or 'dollar' in text:
+     return 'USD', 1.0
+ if 'GBP' in text or 'pound' in text:
+     return 'GBP', 1.0
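
One caveat with plain substring checks like the ones above: 'pound' also occurs inside words such as "compound". A minimal sketch of a whole-word variant follows, assuming a small hand-written alias table. The table CURRENCY_ALIASES and the helper match_currency_alias are hypothetical and not part of this PR.

```python
import re

# Hypothetical alias table for illustration only; a real one would need
# per-language entries and a decision on ambiguous names such as "dollar".
CURRENCY_ALIASES = {
    "USD": ("usd", "dollar", "dollars"),
    "GBP": ("gbp", "pound", "pounds"),
}


def match_currency_alias(text):
    """Return an ISO 4217 code if a whole-word alias occurs in text, else None."""
    lowered = text.lower()
    for code, aliases in CURRENCY_ALIASES.items():
        for alias in aliases:
            # \b anchors the alias at word boundaries, so "compound" is not a hit
            if re.search(rf"\b{re.escape(alias)}\b", lowered):
                return code
    return None
```

Note that "dollar" alone does not identify USD (several countries use a dollar), so returning a confidence of 1.0 for it may be too strong.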
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_currencycode(text, lang=""):
    # this method tries to be lang agnostic and use mainly fuzzy matching
    # it should be considered a fallback for unimplemented languages
    # dedicated per language implementations wanted!
    # match lang
    l, s = extract_langcode(text, lang=lang)
    # match country data
    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_currency = None
    for c in countries:
        if not c["ISO4217-currency_alphabetic_code"]:
            continue
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]
        # match currency name + country name + country lang
        currency_score = fuzzy_match(text, c["ISO4217-currency_name"], strategy=MatchStrategy.TOKEN_SET_RATIO)
        country_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
        lang_score = 0
        if l in c.get("Language", "").lower():
            # bonus if language is spoken in this country
            lang_score = s * 0.6
            # bonus if country code is part of language code
            if c['ISO3166-1-Alpha-2'].lower() in l:
                lang_score = s
        score = max([currency_score, country_score]) * 0.8 + 0.2 * lang_score
        if score > best_score:
            best_score = score
            best_currency = c["ISO4217-currency_alphabetic_code"]
    # special corner cases
    if best_score < 0.55:
        # european union
        if "euro" in text.lower() or "€" in text:
            return "EUR", 0.5
    return best_currency, best_score
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_currencycode(text, lang=""):
    # this method tries to be lang agnostic and use mainly fuzzy matching
    # it should be considered a fallback for unimplemented languages
    # dedicated per language implementations wanted!
    # match lang
    l, s = extract_langcode(text, lang=lang)
    # match country data
    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_currency = None
    for c in countries:
        if not c["ISO4217-currency_alphabetic_code"]:
            continue
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]
        # match currency name + country name + country lang
        currency_score = fuzzy_match(text, c["ISO4217-currency_name"], strategy=MatchStrategy.TOKEN_SET_RATIO)
        country_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
        lang_score = 0
        if l in c.get("Language", "").lower():
            # bonus if language is spoken in this country
            lang_score = s * 0.6
            # bonus if country code is part of language code
            if c['ISO3166-1-Alpha-2'].lower() in l:
                lang_score = s
        score = max([currency_score, country_score]) * 0.8 + 0.2 * lang_score
        if score > best_score:
            best_score = score
            best_currency = c["ISO4217-currency_alphabetic_code"]
    # special corner cases
    if best_score < 0.55:
        # european union
        if "euro" in text.lower() or "€" in text:
            return "EUR", 0.5
    # Consider adding handling for common abbreviations and edge cases
    if 'USD' in text or 'dollar' in text:
        return 'USD', 1.0
    if 'GBP' in text or 'pound' in text:
        return 'GBP', 1.0
    return best_currency, best_score
Tools
Ruff

45-45: Ambiguous variable name: l

(E741)


58-61: Use ternary operator name = c[k] if k in c else c["official_name_en"] instead of if-else-block

Replace if-else-block with name = c[k] if k in c else c["official_name_en"]

(SIM108)


81-83: Use a single if statement instead of nested if statements

(SIM102)
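
A minimal sketch of what the SIM102 hint asks for, assuming the euro fallback is pulled out into a small helper. The name apply_euro_fallback and the threshold parameter are hypothetical; the 0.55 cutoff and the 0.5 confidence are taken from the code above.

```python
def apply_euro_fallback(text, best_currency, best_score, threshold=0.55):
    """Return ("EUR", 0.5) for weak matches that mention the euro.

    Otherwise the fuzzy-matching result is passed through unchanged.
    """
    # SIM102: the two nested ifs collapse into one condition joined with `and`
    if best_score < threshold and ("euro" in text.lower() or "€" in text):
        return "EUR", 0.5
    return best_currency, best_score
```

The behaviour is the same as the nested form: the euro check only runs when the fuzzy score falls below the threshold.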

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Localized language code extraction
3 participants