feat/country_utils #26
base: dev
Conversation
Codecov Report

```
@@      Coverage Diff       @@
##        dev    #26  +/-  ##
=====================================
  Coverage    ?   0.00%
=====================================
  Files       ?      65
  Lines       ?   16409
  Branches    ?       0
=====================================
  Hits        ?       0
  Misses      ?   16409
  Partials    ?       0
```

Continue to review full report at Codecov.
```json
"official_name_ar": "",
"official_name_es": "",
"official_name_cn": "",
"official_name_en": "",
```
Taiwan
This was missing in the source data for some reason; I didn't validate it:
https://github.com/datasets/country-codes/blob/master/data/country-codes.csv
```diff
@@ -0,0 +1,3227 @@
+[
```
Should this be duplicated across languages instead of having all translated names in one file to match the structure of other resources?
I thought about this, but that requires a whole lot of duplicated keys with redundant info... not sure what's best?
I think separate files are easier for people to extend and less prone to typo errors in keys (`official_name_xx`). Maybe a common resource for country code and native language name?
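The split being discussed can be sketched in a few lines; the record fields follow the snippets in this PR, while the regrouping helper itself is hypothetical:

```python
# One combined record, shaped like the proposed countries.json
combined = [{
    "ISO3166-1-Alpha-2": "PT",
    "official_name_en": "Portugal",
    "official_name_es": "Portugal",
    "official_name_ru": "Португалия",
}]

def split_by_language(records):
    """Regroup one multilingual file into per-language dicts keyed by country code."""
    per_lang = {}
    for rec in records:
        code = rec["ISO3166-1-Alpha-2"]
        for key, value in rec.items():
            if key.startswith("official_name_"):
                lang = key.rsplit("_", 1)[-1]
                per_lang.setdefault(lang, {})[code] = value
    return per_lang

files = split_by_language(combined)
# e.g. files["ru"] == {"PT": "Португалия"}
```

Each value in `files` could then be dumped to its own per-language resource file, avoiding the duplicated `official_name_xx` keys.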
```json
"official_name_es": "Antártida",
"official_name_cn": "南极洲",
"official_name_en": "Antarctica",
"official_name_ru": "Антарктике"
```
No language... should it be `None` or an empty string, to prevent reference errors?
This simply said "world" in the source data.

Not sure if adding an empty string or `None` makes sense... are we making a distinction between undefined and unknown? I usually look at this stuff as the responsibility of whatever is reading the file: data provides datapoints about stuff it knows about, not about what is missing. I.e., empty string means undefined, missing key means unknown?
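The undefined/unknown convention described above can be sketched from the consumer's side (a hypothetical reader, not code from this PR):

```python
UNKNOWN = object()  # sentinel: the datum was never recorded in the source

def read_language(country: dict):
    """Empty string means 'known to have no value'; a missing key means 'unknown'."""
    if "Language" not in country:
        return UNKNOWN
    return country["Language"]  # may be "" (defined as none) or a real code

antarctica = {"official_name_en": "Antarctica", "Language": ""}
world = {"official_name_en": "World"}  # no Language key at all

read_language(antarctica)  # -> "" (undefined / known-none)
read_language(world)       # -> UNKNOWN sentinel
```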
Hmm. I guess this will be up to LF to handle, so I'm not sure what will be easier to parse... I think an empty value will make it more obvious at a glance if something is missing, though I'm not sure it makes much difference unless there's an instance where there is no language (I don't think that should ever be the case?). Empty string would probably be the easiest way to specify the language is known to be none, since we could just use a type check to validate (`if isinstance(lang, str)` -> validated, lang is specified).
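The type check suggested above could look like this (a minimal sketch, assuming the value is fetched with `dict.get`):

```python
def language_is_specified(country: dict) -> bool:
    """True iff the record carries a Language value at all (empty string included)."""
    lang = country.get("Language")  # None when the key is missing entirely
    return isinstance(lang, str)

language_is_specified({"Language": "pt-PT"})  # True
language_is_specified({"Language": ""})       # True  (known to be none)
language_is_specified({})                     # False (unknown)
```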
```json
"official_name_cn": "",
"official_name_en": "",
"official_name_ru": "",
"Language": "zh-TW"
```
Should this key be `BCP47-lang` to match the ISO-spec'd params? I notice some values are `ISO639-1` codes, so maybe including one or both and letting the parser decide what to return for `language` would make sense?
This makes sense. The source data for this was noisy and wasn't always valid; I filtered results and only kept langs that were 2- or 4-letter codes.
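That filtering step could be sketched with a regex (hypothetical; the PR's actual cleanup script isn't shown here): keep bare two-letter codes (`pt`) and region-qualified ones (`pt-PT`), i.e. 2 or 4 letters.

```python
import re

# a 2-letter code, optionally followed by a 2-letter region -> 2 or 4 letters total
VALID_LANG = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")

def clean_language(raw: str) -> str:
    """Return the code if it looks like a valid lang tag, else an empty string."""
    raw = raw.strip()
    return raw if VALID_LANG.match(raw) else ""

clean_language("pt-PT")   # kept
clean_language("zh-TW")   # kept
clean_language("World")   # dropped -> ""
```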
@coderabbitai review

Actions performed: Review triggered.
Walkthrough

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Parse as lingua_franca.parse
    participant Util as lingua_franca.util
    User ->> Parse: Call extract_currencycode(text)
    Parse ->> Util: Call fuzzy_match for currency matching
    Util -->> Parse: Return matched currency
    Parse -->> User: Return currency code
    User ->> Parse: Call extract_countrycode(text, iso3, lang)
    Parse ->> Util: Call match_one for country matching
    Util -->> Parse: Return matched country
    Parse -->> User: Return country code
    User ->> Parse: Call extract_langcode(text, lang)
    Parse ->> Util: Call match_one for language matching
    Util -->> Parse: Return matched language
    Parse -->> User: Return language code
```
Actionable comments posted: 6
Outside diff range, codebase verification and nitpick comments (1)

test/unittests/test_parse_ru.py (1)

Line range hint 1-24: General code review of unit tests.

The unit tests appear to be comprehensive and well-structured. The use of `setUpModule` and `tearDownModule` for setting up and tearing down the language settings is appropriate. The tests cover a wide range of scenarios, which is good for ensuring robustness. However, there are some commented-out tests and TODOs scattered throughout the file. It would be beneficial to address these or remove them if they are no longer relevant.

Consider either implementing or cleaning up the commented-out code and TODOs to maintain clean and maintainable code.

Also applies to: 26-1000
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- lingua_franca/parse.py (2 hunks)
- test/unittests/test_parse.py (3 hunks)
- test/unittests/test_parse_az.py (1 hunks)
- test/unittests/test_parse_cs.py (1 hunks)
- test/unittests/test_parse_en.py (2 hunks)
- test/unittests/test_parse_ru.py (1 hunks)
Additional context used
Ruff
lingua_franca/parse.py

45-45: Ambiguous variable name: `l` (E741)

58-61: Use ternary operator `name = c[k] if k in c else c["official_name_en"]` instead of `if`-`else`-block (SIM108)

81-83: Use a single `if` statement instead of nested `if` statements (SIM102)

107-110: Use ternary operator `name = c[k] if k in c else c["official_name_en"]` instead of `if`-`else`-block (SIM108)

129-129: Ambiguous variable name: `l` (E741)

141-144: Use ternary operator `name = c[k] if k in c else c["official_name_en"]` instead of `if`-`else`-block (SIM108)
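The E741 and SIM108 findings above amount to a rename and a ternary; a sketch of the fixed pattern (the helper name `pick_name` is my own, not the PR's):

```python
def pick_name(c: dict, lang: str) -> str:
    """SIM108 fix: the if/else block collapses into a single ternary."""
    key = f"official_name_{lang.split('-')[0]}"
    return c[key] if key in c else c["official_name_en"]

# E741 fix: rename the ambiguous single-letter variable 'l'
lang_code = {"Language": "pt-PT"}.get("Language", "").lower()  # was: l = ...

record = {"official_name_pt": "Portugal", "official_name_en": "Portugal"}
pick_name(record, "pt-PT")  # -> "Portugal"
pick_name(record, "ru")     # falls back to official_name_en
```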
Additional comments not posted (10)

test/unittests/test_parse.py (2)

24-24: Review new imports for consistency and correctness.

The new imports for `extract_langcode`, `extract_currencycode`, and `extract_countrycode` are consistent with the changes described in the PR summary. Ensure these functions are used appropriately in the test cases.

26-26: Check relocated utility function imports.

The import of `fuzzy_match` and `match_one` has been moved to `lingua_franca.util`, which is a logical organization for utility functions. This should be reflected in all files that use these functions.

test/unittests/test_parse_az.py (1)

26-26: Review updated imports for consistency and correctness.

The imports have been updated to reflect the changes in utility function locations. This is a good practice as it keeps utility functions organized and accessible from a common module.
test/unittests/test_parse_cs.py (1)

24-24: Updated import statements.

The change in import statements from `lingua_franca.parse` to `lingua_franca.util` for `fuzzy_match`, and the addition of `match_one`, is noted. This alteration should be verified to ensure that it does not break any existing functionality.

Verification successful. The change is correct: the functions are used appropriately in the codebase, and the changes do not break any existing functionality.

- test/unittests/test_parse_cs.py
- test/unittests/test_parse_ru.py
- test/unittests/test_parse.py
- test/unittests/test_parse_az.py

Scripts executed for the analysis:

```shell
#!/bin/bash
# Description: Verify if `fuzzy_match` and `match_one` are being used correctly in the new import context.
# Test: Search for usage of `fuzzy_match` and `match_one` in the entire codebase.
rg --type python -A 5 $'fuzzy_match|match_one'
```

Length of output: 78

```shell
#!/bin/bash
# Same search, with the correct rg type name.
rg --type py -A 5 'fuzzy_match|match_one'
```

Length of output: 12729
test/unittests/test_parse_ru.py (1)

24-24: Import modifications approved.

The imports for `fuzzy_match` and `match_one` have been moved from `lingua_franca.parse` to `lingua_franca.util` as per the summary. This change is consistent with the rest of the file and no issues are apparent.

test/unittests/test_parse_en.py (5)

29-29: Imports updated to include new parsing functions.

The import statement has been expanded to include `extract_currencycode` and `extract_countrycode` in addition to the existing `extract_langcode`. This aligns with the PR's objectives to enhance parsing capabilities for country and currency codes.

1697-1711: Country code extraction tests are well-implemented.

The tests for the `extract_countrycode` method are comprehensive, covering a variety of country names. This ensures that the country code extraction is robust and reliable. Consider adding more diverse test cases: it might be beneficial to include tests for countries with common names or alternative names to ensure broader coverage.

1712-1737: Currency code extraction tests are comprehensive.

The tests for the `extract_currencycode` method cover a range of scenarios, including special cases for the Euro and common currencies. This ensures that the currency code extraction is effective and adheres to the PR's enhancement goals. Consider expanding test cases for additional currencies: adding tests for less common currencies, or those with multiple countries using the same currency, could further enhance robustness.

1673-1674: Language code extraction tests are well-implemented.

The tests for the `extract_langcode` method are comprehensive, covering a variety of language names. This ensures that the language code extraction is robust and reliable. Consider adding more diverse test cases, e.g. languages with dialects or regional variations, to ensure broader coverage.

Also applies to: 1682-1691

1683-1691: Country-specific language code tests are effective.

The tests for the `extract_langcode` method with country-specific variants are well-implemented, covering a variety of scenarios. This enhances the utility of the language parsing functionality in handling internationalization scenarios. Consider expanding test cases for additional regional dialects or less commonly used language variants.
```diff
@@ -21,9 +21,8 @@
 from lingua_franca.parse import extract_datetime
 from lingua_franca.parse import extract_duration
 from lingua_franca.parse import extract_number, extract_numbers
-from lingua_franca.parse import fuzzy_match
+from lingua_franca.util import fuzzy_match, match_one
```
General observation: Commented-out tests and TODOs.
Several tests are commented out, and there are multiple TODO comments throughout the file. These should be addressed to ensure full test coverage and to resolve any pending tasks or bugs.
```python
def test_parse_country_code(self):
    def test_with_conf(text, expected_lang, min_conf=0.6):
        lang, conf = extract_countrycode(text, lang="unk")
        self.assertEqual(lang, expected_lang)
        self.assertGreaterEqual(conf, min_conf)

    # test fallback to english and fuzzy match
    test_with_conf("Português", 'PT')

def test_parse_currency_code_garbage(self):
    def test_with_conf(text, expected_lang, min_conf=0.5):
        lang, conf = extract_currencycode(text, lang="unk")
        self.assertEqual(lang, expected_lang)
        self.assertGreaterEqual(conf, min_conf)

    # simple fuzzy match
    test_with_conf("Brazilian Portuguese", 'BRL')
    test_with_conf("Brazilian", 'BRL')
    test_with_conf("Português", 'EUR')

    # euro special cases
    test_with_conf("dglkm euro jbhjkbadksznlkjn,m", 'EUR')
    test_with_conf("jhgkkjlmfauclh wçio kglbjkhad ph €", 'EUR')
```
Review new test cases for `extract_countrycode`.

The test cases for `extract_countrycode` seem to cover basic scenarios well, but consider adding more diverse test inputs to fully evaluate the function's robustness, especially with edge cases and different country formats.
```diff
+ # Suggested additional test cases
+ test_with_conf("United States", 'US')
+ test_with_conf("USA", 'US')
+ test_with_conf("United Kingdom", 'GB')
+ test_with_conf("Great Britain", 'GB')
```
```python
def test_parse_currency_code_garbage(self):
    def test_with_conf(text, expected_lang, min_conf=0.5):
        lang, conf = extract_currencycode(text, lang="unk")
        self.assertEqual(lang, expected_lang)
        self.assertGreaterEqual(conf, min_conf)

    # simple fuzzy match
    test_with_conf("Brazilian Portuguese", 'BRL')
    test_with_conf("Brazilian", 'BRL')
    test_with_conf("Português", 'EUR')

    # euro special cases
    test_with_conf("dglkm euro jbhjkbadksznlkjn,m", 'EUR')
    test_with_conf("jhgkkjlmfauclh wçio kglbjkhad ph €", 'EUR')
```
Review new test cases for `extract_currencycode`.

The tests for `extract_currencycode` are good for basic validation. However, adding tests for non-standard inputs, and ensuring that the function correctly handles unexpected or malformed inputs, would enhance the test suite's coverage.
```diff
+ # Suggested additional test cases
+ test_with_conf("US Dollar", 'USD')
+ test_with_conf("Euro", 'EUR')
+ test_with_conf("Japanese Yen", 'JPY')
```
```python
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_langcode(text, lang=""):
    resource_file = resolve_resource_file(f"text/{lang}/langs.json") or \
        resolve_resource_file("text/en-us/langs.json")
    with open(resource_file) as f:
        LANGUAGES = {v: k for k, v in json.load(f).items()}

    best_lang, best_score = match_one(text, LANGUAGES, strategy=MatchStrategy.TOKEN_SET_RATIO)

    # match country names
    if best_score < 0.7:
        resource_file = resolve_resource_file("countries.json")
        with open(resource_file) as f:
            countries = json.load(f)
        for c in countries:
            if "Language" not in c:
                continue
            k = f"official_name_{lang.split('-')[0]}"
            if k in c:
                name = c[k]
            else:
                name = c["official_name_en"]
            score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
            if score >= best_score:
                best_lang, best_score = c["Language"].lower(), score

    return best_lang, best_score
```
Review the improvements to `extract_langcode`.

The function now handles matching country names, which is a significant improvement for localization. Ensure that the function's performance is not adversely affected by these changes, especially with a potentially large number of country names to match.

```diff
+ # Consider optimizing the country name matching process
+ # Maybe use a precompiled regex or a more efficient data structure
```

Committable suggestion was skipped due to low confidence.
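One easy win for the performance concern above is to parse and cache `countries.json` once instead of on every call; a sketch using `functools.lru_cache` (the loader name and stand-in file are illustrative, not the PR's code):

```python
import json
from functools import lru_cache
from pathlib import Path
from tempfile import NamedTemporaryFile

@lru_cache(maxsize=1)
def load_countries(path: str):
    """Parse the resource a single time; later calls with the same path hit the cache."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

# demo with a stand-in resource file
with NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"official_name_en": "Portugal", "Language": "pt-PT"}], f)
    sample = f.name

countries = load_countries(sample)
countries_again = load_countries(sample)  # served from cache, no second disk read
```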
```python
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_countrycode(text, iso3=False, lang=""):

    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_country = None

    for c in countries:
        # if text is a langcode, return parent country
        l = c.get("Language", "").lower()
        if not l:
            lang_score = 0
        elif l == f'{c["ISO3166-1-Alpha-2"]}-{c["ISO3166-1-Alpha-2"]}'.lower() and l in text:
            lang_score = 1.0
        else:
            lang_score = fuzzy_match(text, l, strategy=MatchStrategy.TOKEN_SET_RATIO) * 0.8
            if c["ISO3166-1-Alpha-2"].lower() in l:
                lang_score += 0.05

        # match country name to text
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]

        name_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)

        if name_score < 0.7 <= lang_score:
            score = lang_score
        elif lang_score < 0.7 <= name_score:
            score = name_score
        else:
            score = 0.5 * name_score + 0.5 * lang_score

        if score >= best_score:
            if iso3:
                best_country, best_score = c["ISO3166-1-Alpha-3"], score
            else:
                best_country, best_score = c["ISO3166-1-Alpha-2"], score

    return best_country, best_score
```
Review the implementation of `extract_countrycode`.

This function's implementation is robust, but ensure that it handles cases where the country code or name might be ambiguous or represented in multiple ways.

```diff
+ # Consider adding more sophisticated handling for ambiguous country names
+ if 'UK' in text or 'GB' in text or 'Britain' in text:
+     return 'GB', 1.0
+ if 'USA' in text or 'US' in text or 'America' in text:
+     return 'US', 1.0
```
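Note that bare substring checks like `'US' in text` also fire on words such as "RUSSIA"; a token-based alias table avoids that false positive (a sketch, not part of the PR):

```python
# Hypothetical alias table; extend as needed
ALIASES = {
    "uk": "GB", "britain": "GB", "great britain": "GB",
    "usa": "US", "america": "US", "united states": "US",
}

def resolve_alias(text: str):
    """Match aliases on whole tokens (or whole phrases) instead of raw substrings."""
    lowered = text.lower()
    tokens = set(lowered.split())
    for alias, code in ALIASES.items():
        if alias in tokens or (" " in alias and alias in lowered):
            return code, 1.0
    return None, 0.0

resolve_alias("flights to the USA")  # ('US', 1.0)
resolve_alias("RUSSIA")              # (None, 0.0) — no false positive on 'US'
```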
```python
@localized_function(run_own_code_on=[UnsupportedLanguageError, FunctionNotLocalizedError])
def extract_currencycode(text, lang=""):
    # this method tries to be lang agnostic and use mainly fuzzy matching
    # it should be considered a fallback for unimplemented languages
    # dedicated per language implementations wanted!

    # match lang
    l, s = extract_langcode(text, lang=lang)

    # match country data
    resource_file = resolve_resource_file("countries.json")
    with open(resource_file) as f:
        countries = json.load(f)
    best_score = 0
    best_currency = None

    for c in countries:
        if not c["ISO4217-currency_alphabetic_code"]:
            continue
        k = f"official_name_{lang.split('-')[0]}"
        if k in c:
            name = c[k]
        else:
            name = c["official_name_en"]

        # match currency name + country name + country lang
        currency_score = fuzzy_match(text, c["ISO4217-currency_name"], strategy=MatchStrategy.TOKEN_SET_RATIO)
        country_score = fuzzy_match(text, name, strategy=MatchStrategy.TOKEN_SET_RATIO)
        lang_score = 0
        if l in c.get("Language", "").lower():
            # bonus if language is spoken in this country
            lang_score = s * 0.6
            # bonus if country code is part of language code
            if c['ISO3166-1-Alpha-2'].lower() in l:
                lang_score = s

        score = max([currency_score, country_score]) * 0.8 + 0.2 * lang_score

        if score > best_score:
            best_score = score
            best_currency = c["ISO4217-currency_alphabetic_code"]

    # special corner cases
    if best_score < 0.55:
        # european union
        if "euro" in text.lower() or "€" in text:
            return "EUR", 0.5

    return best_currency, best_score
```
Review the implementation of `extract_currencycode`.

The function uses fuzzy matching to determine currency codes, which is appropriate given the variability of text input. However, consider handling edge cases where the currency might not be directly mentioned or is abbreviated differently.

```diff
+ # Consider adding handling for common abbreviations and edge cases
+ if 'USD' in text or 'dollar' in text:
+     return 'USD', 1.0
+ if 'GBP' in text or 'pound' in text:
+     return 'GBP', 1.0
```
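A small lookup over currency symbols and ISO 4217 codes could back that suggestion as a pre-pass before fuzzy matching; the table below is illustrative, not exhaustive:

```python
# Hypothetical pre-pass: explicit symbols/abbreviations beat fuzzy matching
CURRENCY_HINTS = {
    "$": "USD", "usd": "USD", "dollar": "USD",
    "£": "GBP", "gbp": "GBP", "pound": "GBP",
    "€": "EUR", "eur": "EUR", "euro": "EUR",
    "¥": "JPY", "jpy": "JPY", "yen": "JPY",
}

def currency_hint(text: str):
    """Return (code, confidence) when an explicit symbol or abbreviation appears."""
    lowered = text.lower()
    for hint, code in CURRENCY_HINTS.items():
        if hint in lowered:
            return code, 1.0
    return None, 0.0

currency_hint("that costs 20 £")      # ('GBP', 1.0)
currency_hint("price in yen please")  # ('JPY', 1.0)
```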
adds a .json with per country data -> closes #24

adds `extract_currency` util

Summary by CodeRabbit

New Features

- `extract_langcode` function.

Tests

- `extract_currencycode` and `extract_countrycode` functions.

Refactor