NamSor Tools V2
Welcome to the namsor-tools-v2 wiki!
In scientific papers, please indicate software version (NamSorAPIv2.X.YY) and date of data retrieval.
NamSorAPIv2.0.29 (2023-12-03)
- Improvements for Albanian and Kosovo Albanian Diaspora Mapping
- Created a new API for Community Engagement (this option requires a specific licence)
- Added full name classification for Origin and Diaspora
- Added first/last name classification for Country
- Differentiated between regionStat, religionStatAlt and religionStatSynthetic
NamSorAPIv2.0.28 (2023-10-08)
- Added Filipino ethnicity to Diaspora
- Fixed typo in ethnicity Belarusian
- Improvements on names in Cyrillic
NamSorAPIv2.0.27 (2023-07-16)
- added India endpoints to classify first/last names by Religion/Caste/Castegroup/Indian State
- India caste group General was split as General and General/High Caste
- Added a finer grained classification by detailed caste
- Fixed smallish issue on admin CreditAPI
- further improved free account abuse/spam detection
NamSorAPIv2.0.26 (2023-06-18)
- added option to return country religion statistics for taxonomies Origin/Country/Diaspora, with header X-OPTION-RELIGION-STATS=True
- improved AI explainability for Enterprise users (API Key should be set to Explainability=True and API queries with header X-OPTION-EXPLAINABILITY=True)
- replaced JNBC with gotyai-java
- further improved free account abuse/spam detection
NamSorAPIv2.0.25 (2023-05-20)
- fixed Italian names issue ex Andrea/Rossini (https://github.com/namsor/namsor-tools-v2/issues/23)
- added AI explainability for Enterprise users (API Key should be set to Explainability=True and API queries with header X-OPTION-EXPLAINABILITY=True)
- improved some logging features as well as other internal services (free account abuse/spam detection)
NamSorAPIv2.0.24 (2023-03-12)
- Added specific endpoints for Indian names : Indian State subclassification, Religion (Hindu, Muslim, Jain, Christian), Caste Category (General, ST, SC, OST)
- Other improvements on Latin American countries, Portuguese and Spanish names
NamSorAPIv2.0.23 (2023-01-15)
- Improvements for Indian names sub-classification
- New taxonomies for Indian names : Religion, Caste Category (General, ST, SC, OST)
NamSorAPIv2.0.22 (2022-12-17)
- Improvements on names in ARABIC (gender, origin, country)
- Improvements for Indian names sub-classification
- Other improvements : Malaysia, Indonesia, Brasil/Portugal, Spain/LatAm
NamSorAPIv2.0.21 (2022-09-25)
- Added mandatory email verification before activation of API Key due to abuse
- Improvements for Indian names gender classification
- Improvements for classification of Italian names with US "Race"/Ethnicity (issue not fully resolved; the current recommendation is to combine with the Diaspora model)
NamSorAPIv2.0.20 (2022-07-28)
- Diaspora model enhancements for 7 European countries: Italy (IT), France (FR), Germany (DE), Ireland (IE), Netherlands (NL), Belgium (BE), Spain (ES).
- Added an open endpoint to query the list of countries/regions
- Minor back-end changes to support features of the new front-end version namsor.app
- Added OPTIONS to CORS for ethnicity-estimate.com
NamSorAPIv2.0.19 (2022-05-08)
- Fixed rare issue with NaN score/probability
- Improvements for Minorities/Diversity Analytics in Italy
- No major change to API (still at v2.namsor.com)
- Added OPTIONS to CORS for gender-guesser.com
NamSorAPIv2.0.18 (2022-01-16)
- Second batch of improvements on CYRILLIC : Diaspora
- No major change to API (still at v2.namsor.com), but we're now redirecting namsor.com front-end to the new version namsor.app
- Added a gender endpoint for just given names (defaulting to 'US' local context)
- Fixed Stripe redirect
- Fixed ParseName issue when not trimmed()
- Added X-OPTION-USRACEETHNICITY-TAXONOMY to CORS
NamSorAPIv2.0.17 (2021-12-05)
- First batch of improvements on CYRILLIC : Origin, Country and Gender
NamSorAPIv2.0.16 (2021-09-26)
- Diaspora model now has calibratedProbability/calibratedProbabilityAlt as the other models (Gender, Country, Origin, US 'Race'/Ethnicity) based on ability to predict either (A) the Diaspora country of birth or (B) the name country of Origin (ie. consistency with NamSor Origin model).
- Various admin API enhancements to support the new Website and CSV tool
- [BugFix] Slightly negative scores were causing a Python lib error https://github.com/namsor/namsor-tools-v2/issues/17
NamSorAPIv2.0.15 (2021-07-18)
- US 'Race'/Ethnicity : Optionally add header X-OPTION-USRACEETHNICITY-TAXONOMY:
- USRACEETHNICITY-4CLASSES is a new classifier compatible with prior version, but trained using a combination of US data and non-US data (ex. international names of sub-Saharan africa are classified as B_NL; international names of East Asia are classified as A) in alignment with https://www.census.gov/topics/population/race/about.html
- USRACEETHNICITY-4CLASSES-CLASSIC for the classic US'Race'/Ethnicity classifier (pre-version 2.0.15) which has 4 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino) purely trained on US data.
- USRACEETHNICITY-6CLASSES for two additional classes, AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander). With this option, classifier has 6 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander) purely trained on US data.
- general improvements to gender / country / origin models across all countries
- specific improvements to better classify names of : NG (Nigeria), BD (Bangladesh), ZA (South Africa), AF (Afghanistan), IR (Iran).
NamSorAPIv2.0.14 (2021-04-11)
- US 'Race'/Ethnicity : Optionally add header X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-6CLASSES for two additional classes, AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander). With this option, classifier has 6 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander).
- [BugFix] Regression on parsing Spanish names without context ES https://github.com/namsor/namsor-tools-v2/issues/16
NamSorAPIv2.0.13 (2021-03-14)
- Japanese names : improvements for gender classification (LATIN, HAN / Kanji, KATAKANA) ; translation LATIN->HAN / Kanji and back.
- Improvements on US 'Race'/Ethnicity model, Diaspora Model
- UI : added an online CSV tool to process files from JavaScript client, append gender, origin, country, diaspora or US 'race'/ethnicity to a list of names in Excel/CSV format.
- [Known Issue] Regression on parsing Spanish names without context ES (pls specify country code ES as a workaround)
NamSorAPIv2.0.12 (2021-01-31)
- Improvements for gender classification of full names
- Split Diaspora taxonomy classes Irish,British -> Irish,English,Scottish,Welsh (British remains as first/second best alternative for now)
- Improve Diaspora classification with non LATIN names
- Added Corridor API for classifying names in cross-border contexts (relevant for : diaspora remittances, international travel, foreign direct investment, crowdfunding etc.)
- Added a general name classification API (nameType), accuracy in range 90-95%
- [BETA] UI : added an online CSV tool to process files from JavaScript client, append gender, origin, country, diaspora or US 'race'/ethnicity to a list of names in Excel/CSV format.
NamSorAPIv2.0.11 (2020-10-31)
- (Infrastructure) SSL Certificates updated
- Improvements for names in Brazil, Pakistan, Indonesia
- Improvements for gender classification of full names in Pakistan, Indonesia
- [BETA] added a general name classification API (nameType)
- added a new SDK for JavaScript
NamSorAPIv2.0.10 (2020-06-08)
- Gender, Origin, Country improvements for NON-LATIN Scripts (CYRILLIC, HAN, ARABIC, KATAKANA, HANGUL, GREEK, BENGALI, ARMENIAN, DEVANAGARI, TAMIL, GEORGIAN, TELUGU, ORIYA, ...)
- Gender for parsed names with only initials (ex. J. Smith) now returns a probability close to 0.5 https://github.com/namsor/namsor-tools-v2/issues/10
- Prepared a specific API for translating Japanese Names (not active yet)
- Other bug fixes, https://github.com/namsor/namsor-tools-v2/issues/9 https://github.com/namsor/namsor-tools-v2/issues/8
NamSorAPIv2.0.9 (2020-03-15)
- Diaspora API improvements for US, FR
NamSorAPIv2.0.8 (2020-01-04)
- Updated Naive Bayes Classifier library to refactored JNBC (v2.0.4)
- Diaspora API : Fix bias towards classifying Eastern European and some Middle Eastern names to Jewish https://github.com/namsor/namsor-tools-v2/issues/4
NamSorAPIv2.0.7 (2019-11-24)
- The probability calibration is no longer based on the Score, but based on the probability estimates.
NamSorAPIv2.0.6 (2019-10-25)
- Gender API : Fix issue where probability could be between 0.33 and 0.5; with a low score, the probability should be 0.5 (corresponding to randomly choosing Male/Female). https://github.com/namsor/namsor-tools-v2/issues/6
NamSorAPIv2.0.5 (2019-07-21)
- Added a calibrated probability based on the Score and a validation set
NamSorAPIv2.0.4 (2019-06-30)
- Gender API : Improvements on Chinese names
- Chinese API : Specific API end-points for https://chinese-names.app/
NamSor V2 uses Naive Bayes, a family of probabilistic algorithms well suited to classification. Each classifier outputs a SCORE, which is based on the probability of the predicted value relative to the other alternatives.
For most classifiers, we also use a validation dataset to calibrate the probability estimates with actual precision / recall, then return calibratedProbability that can directly be read as a probability. The calibratedProbabilityAlt corresponds to getting the first choice OR the best alternative right.
For example, to determine Gender of names in the US, with score>0, we have a 95% precision and 100% recall. By filtering score>1, you can exclude ambiguous names and increase the precision (at the cost of reducing the recall).
Mapping of rounded Gender Score to Precision and Recall :
Score | Precision | Recall |
---|---|---|
0 | 95% | 100% |
1 | 96% | 98% |
2 | 97% | 93% |
3 | 97% | 86% |
4 | 98% | 73% |
5 | 99% | 56% |
6 | 99% | 40% |
7 | 99% | 25% |
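The trade-off in the table above can be applied when post-processing results. The following is a minimal sketch, not part of the NamSor tools: the function names are hypothetical, the figures are copied from the table (US names), and the strict `score > threshold` comparison follows the wording used in the text.

```python
# Precision and recall by score threshold, copied from the table above (US names).
PRECISION_RECALL_BY_THRESHOLD = {
    0: (0.95, 1.00),
    1: (0.96, 0.98),
    2: (0.97, 0.93),
    3: (0.97, 0.86),
    4: (0.98, 0.73),
    5: (0.99, 0.56),
    6: (0.99, 0.40),
    7: (0.99, 0.25),
}


def threshold_for_precision(target_precision):
    """Return the lowest tabulated threshold reaching the target precision, or None."""
    for threshold in sorted(PRECISION_RECALL_BY_THRESHOLD):
        precision, _recall = PRECISION_RECALL_BY_THRESHOLD[threshold]
        if precision >= target_precision:
            return threshold
    return None


def filter_by_score(rows, threshold):
    """Keep only the rows whose likelyGenderScore is strictly above the threshold."""
    return [row for row in rows if row["likelyGenderScore"] > threshold]


if __name__ == "__main__":
    threshold = threshold_for_precision(0.97)
    print(threshold, PRECISION_RECALL_BY_THRESHOLD[threshold])
```

Choosing the lowest threshold that meets the precision target keeps recall as high as the table allows.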
Several classifiers take contextual geographic information as input, or return an ISO 3166-1 alpha-2 (ISO2) country code as output. The list of ISO2 country codes is available at https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv and https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
NamSor V2 provides several different classifiers : gender, origin, diaspora, US 'race'/ethnicity ... The classifiers learn from each other's input, and each one outputs a classification according to its own specific taxonomy.
NamSor aims to offer the best accuracy on predicting likely gender from names on a global scale : not just for US and European names, but also Asian, African ... in all languages and alphabets.
The taxonomy classes are https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_gender This is a binary classifier, but the results are probability estimates. Non-binary genders are not covered by this classifier and should be accounted for using other methods (ex. a survey).
If you append gender to a simple list of first and last names (ex. John|Smith), without any geographic context, the software will try to automatically detect the geographic context from the last name.
Input :
John W.|Smith
Mary|Smith
Elena|Rossi
Robert|Durieux
Output :
#uid|firstName|lastName|likelyGender|likelyGenderScore|genderScale|rowId
uid2|Elena|Rossi|female|6.053053604522956|1.0|0
uid3|Robert|Durieux|male|7.339173503361962|-1.0|1
uid0|John W.|Smith|male|8.436845363426315|-1.0|2
uid1|Mary|Smith|female|4.349891771969253|1.0|3
The recommended input format is to specify a unique ID and a geographic context (if known) as a countryIso2 code.
Input :
id12|John W.|Smith|US
id13|Mary|Smith|GB
id14|Elena|Rossi|IT
id15|Robert|Durieux|FR
Output :
#uid|firstName|lastName|countryIso2|likelyGender|likelyGenderScore|genderScale|rowId
id13|Mary|Smith|GB|female|4.164040354303551|1.0|0
id12|John W.|Smith|US|male|8.436845363426315|-1.0|1
id15|Robert|Durieux|FR|male|7.162120388463375|-1.0|2
id14|Elena|Rossi|IT|female|5.555580235429088|1.0|3
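Note that the output rows above are not returned in input order, so results are best matched back to the input by uid. The following is a minimal sketch of reading such pipe-delimited output with the Python standard library; the function names are hypothetical and the sample text is copied from the output above.

```python
import csv
import io

# Sample output copied from the example above.
SAMPLE_OUTPUT = """#uid|firstName|lastName|countryIso2|likelyGender|likelyGenderScore|genderScale|rowId
id13|Mary|Smith|GB|female|4.164040354303551|1.0|0
id12|John W.|Smith|US|male|8.436845363426315|-1.0|1
id15|Robert|Durieux|FR|male|7.162120388463375|-1.0|2
id14|Elena|Rossi|IT|female|5.555580235429088|1.0|3
"""


def read_gender_output(text):
    """Parse pipe-delimited gender output into a list of dicts with typed numeric fields."""
    reader = csv.reader(io.StringIO(text.strip()), delimiter="|", quoting=csv.QUOTE_NONE)
    header = next(reader)
    header[0] = header[0].lstrip("#")  # the header line starts with '#uid'
    rows = []
    for values in reader:
        row = dict(zip(header, values))
        row["likelyGenderScore"] = float(row["likelyGenderScore"])
        row["genderScale"] = float(row["genderScale"])
        row["rowId"] = int(row["rowId"])
        rows.append(row)
    return rows


def index_by_uid(rows):
    """Index rows by uid so they can be joined back to the input file."""
    return {row["uid"]: row for row in rows}


if __name__ == "__main__":
    by_uid = index_by_uid(read_gender_output(SAMPLE_OUTPUT))
    for uid in ("id12", "id13", "id14", "id15"):
        print(uid, by_uid[uid]["likelyGender"], by_uid[uid]["likelyGenderScore"])
```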
{ "id": null, "firstName": "John", "lastName": "Smith", "likelyGender": "male", "genderScale": -0.9918105205926329, "score": 41.11285807293116, "probabilityCalibrated": 0.9959052602963164 }
Field | Example | Description |
---|---|---|
id | ref12315 | The input identifier |
firstName | John | The input given name / firstName |
lastName | Smith | The input family name / surname / lastName |
likelyGender | male | The likely gender : male or female |
probabilityCalibrated | 0.99 | The calibrated probability : 0.5 is Unknown, +1 is sure |
genderScale | -0.99 | The scale is -1..0..+1 and is based on the probability (Probability = 0.5 -> Scale = 0; Gender = Male & Probability = 1 -> Scale = -1; Gender = Female & Probability = 1 -> Scale = +1) |
score | 41 | A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100 |
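The relationship between genderScale and the calibrated probability can be sketched as follows. The table only gives three anchor points (0.5 -> 0, male at 1 -> -1, female at 1 -> +1); the linear interpolation between them is an assumption, although it agrees with the sample response above (probability 0.9959... gives scale -0.9918...). The function names are hypothetical.

```python
import math


def gender_scale(likely_gender, probability):
    """Map a gender and a probability in [0.5, 1] to a scale in [-1, +1].

    Assumes linear interpolation between the anchor points given in the table.
    """
    magnitude = 2.0 * probability - 1.0
    return -magnitude if likely_gender == "male" else magnitude


def from_gender_scale(scale):
    """Inverse mapping: return (gender, probability); gender is None when scale is 0."""
    probability = (abs(scale) + 1.0) / 2.0
    if scale == 0:
        return None, probability
    return ("male" if scale < 0 else "female"), probability


if __name__ == "__main__":
    # Values taken from the sample JSON response above.
    scale = gender_scale("male", 0.9959052602963164)
    print(scale, math.isclose(scale, -0.9918105205926329))
```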
This classification model infers the likely country of residence, based on the full name alone. The taxonomy classes are https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalfullname_country
{ "id": null, "name": "Jing Cao", "score": 33.88839357879743, "country": "CN", "countryAlt": "TW", "region": "Asia", "topRegion": "Asia", "subRegion": "Eastern Asia", "countriesTop": [ "CN", "TW", "HK", "SG", "KR", "PH", "MO", "VN", "KH", "AU" ], "probabilityCalibrated": 0.8966946013357358, "probabilityAltCalibrated": 0.9205811403508772 }
Field | Example | Description |
---|---|---|
id | ref12315 | The input identifier |
name | Jing Cao | The input full name |
country | CN | The likely residence country ISO2 code, which CAN include melting-pot countries |
countryAlt | TW | The best alternative residence country |
region | Asia | An arbitrary grouping of countries by topRegion/Region/subRegion |
topRegion | Asia | An arbitrary grouping of countries by topRegion/Region/subRegion |
subRegion | Eastern Asia | An arbitrary grouping of countries by topRegion/Region/subRegion |
countriesTop | CN, TW, HK... | The top 10 likely residence country ISO2 codes |
probabilityCalibrated | 0.89 | The calibrated probability of having guessed right the country of residence (CN) |
probabilityAltCalibrated | 0.92 | The calibrated probability of having guessed right the country of residence as either CN or TW. |
score | 33 | A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100 |
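The two calibrated probabilities can drive a simple decision rule: accept the first choice when it is confident enough, otherwise fall back to the pair formed by the first choice and the best alternative. This is only a sketch; the function name and the 0.9 threshold are hypothetical, and the response is a trimmed copy of the sample above.

```python
import json

# Trimmed copy of the sample response above.
SAMPLE_RESPONSE = json.loads("""
{ "id": null, "name": "Jing Cao", "score": 33.88839357879743,
  "country": "CN", "countryAlt": "TW",
  "probabilityCalibrated": 0.8966946013357358,
  "probabilityAltCalibrated": 0.9205811403508772 }
""")


def confident_countries(response, min_probability=0.9):
    """Return [country], [country, countryAlt] or [] depending on calibrated confidence."""
    if response["probabilityCalibrated"] >= min_probability:
        return [response["country"]]
    if response["probabilityAltCalibrated"] >= min_probability:
        return [response["country"], response["countryAlt"]]
    return []


if __name__ == "__main__":
    print(confident_countries(SAMPLE_RESPONSE))
```

With the sample values, the first choice alone falls just short of 0.9 while the pair exceeds it, so the rule keeps both CN and TW.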
This classification model infers the likely country of origin from a name, based on how the name appears in the country of origin. This classifier doesn't attempt to classify to any of the melting-pot countries (US, CA, etc.) but would recognize a French, Italian, British, Japanese name etc. as they appear in France, Italy, Great-Britain, Japan. The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_origin_country
Input :
John W.|Smith
Mary|Smith
Elena|Rossi
Robert|Durieux
Output :
#uid|firstName|lastName|countryOrigin|countryOriginAlt|countryOriginScore|rowId
uid2|Elena|Rossi|IT|FR|14.848086484203032|0
uid3|Robert|Durieux|FR|BE|39.63483415843564|1
uid0|John W.|Smith|GB|IE|21.09482904145537|2
uid1|Mary|Smith|GB|IE|12.87667003646059|3
{ "id": null, "firstName": "Jing", "lastName": "Cao", "countryOrigin": "CN", "countryOriginAlt": "TW", "countriesOriginTop": [ "CN", "TW", "HK", "VN", "KR", "MY", "KH", "ID", "DK", "CM" ], "score": 25.613603655787934, "regionOrigin": "Asia", "topRegionOrigin": "Asia", "subRegionOrigin": "Eastern Asia", "probabilityCalibrated": 0.9092268352804216, "probabilityAltCalibrated": 0.9883173013909269 }
Field | Example | Description |
---|---|---|
id | ref12315 | The input identifier |
firstName | Jing | The input first name / given name |
lastName | Cao | The input last name / surname |
countryOrigin | CN | The likely country of origin (ISO2 code) |
countryOriginAlt | TW | The best alternative country of origin (ISO2 code) |
regionOrigin | Asia | An arbitrary grouping of countries by topRegion/Region/subRegion |
topRegionOrigin | Asia | An arbitrary grouping of countries by topRegion/Region/subRegion |
subRegionOrigin | Eastern Asia | An arbitrary grouping of countries by topRegion/Region/subRegion |
countriesOriginTop | CN, TW, HK... | The top 10 likely countries of origin (ISO2 code) |
probabilityCalibrated | 0.90 | The calibrated probability of having guessed right the country of origin (CN) |
probabilityAltCalibrated | 0.98 | The calibrated probability of having guessed right the country of origin as either CN or TW. |
score | 25 | A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100 |
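The score formula quoted in the tables can be sketched in Python. `Math.log` in Java is the natural logarithm, so `math.log` is used here. What `getProbaFirst()` and `getProbaNotFirst()` return internally is not documented on this page, so the sketch takes both values as plain inputs; the function name and the handling of zero probabilities are assumptions.

```python
import math

MAX_SCORE = 100.0  # the tables state that the score is "maxed to 100"


def uncalibrated_score(proba_first, proba_not_first):
    """Natural log of the ratio between the two probabilities, capped at 100."""
    if proba_first <= 0:
        raise ValueError("proba_first must be positive")
    if proba_not_first <= 0:
        return MAX_SCORE  # assumption: a zero denominator saturates at the cap
    return min(math.log(proba_first / proba_not_first), MAX_SCORE)


if __name__ == "__main__":
    print(uncalibrated_score(0.9, 0.1))   # clear first choice: positive score
    print(uncalibrated_score(0.5, 0.5))   # tie: score of zero
    print(uncalibrated_score(0.49, 0.51)) # first choice less likely: slightly negative
```

A ratio below 1 gives a negative score, which is consistent with the slightly negative scores mentioned in the v2.0.16 release notes.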
This classification model infers the ethnicity or likely diaspora from a name, given a geographic context (ex. US, CA, ...). The model attempts to recognize French, Italian, British, Japanese names etc. both as they appear in France, Italy, Great-Britain, Japan, and as people of the French, Italian, British, Japanese diasporas would be named in the United States, for example. From v2.0.16, the Diaspora model has calibratedProbability/calibratedProbabilityAlt like the other models (Gender, Country, Origin, US 'Race'/Ethnicity), based on the ability to predict either (A) the Diaspora country of birth or (B) the name's country of Origin (ie. consistency with the NamSor Origin model). The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_country_diaspora
Input :
id12|John W.|Smith|US
id13|Mary|Smith|GB
id14|Elena|Rossi|IT
id15|Robert|Durieux|FR
Output :
#uid|firstName|lastName|countryIso2|ethnicity|ethnicityAlt|ethnicityScore|rowId
id13|Mary|Smith|GB|British|Irish|12.348217311566847|0
id12|John W.|Smith|US|British|Irish|27.307137947726286|1
id15|Robert|Durieux|FR|French|Jewish|75.65330992570755|2
id14|Elena|Rossi|IT|Italian|Portuguese|46.084654576433834|3
{ "id": null, "firstName": "Mary", "lastName": "Cao", "score": 12.163977377279767, "ethnicityAlt": "Vietnamese", "ethnicity": "Chinese", "lifted": false, "countryIso2": "US", "ethnicitiesTop": [ "Chinese", "Vietnamese", "NativeHawaiian", "HispanoLatino", "Portuguese", "Cambodian", "Italian", "Malays", "Jewish", "Hispanic" ] }
Field | Example | Description |
---|---|---|
id | ref12315 | The input identifier |
firstName | Mary | The input first name / given name |
lastName | Cao | The input last name / surname |
countryIso2 | US | The country of residence, the host country (ex. US, CA, NZ, GB) |
ethnicity | Chinese | The likely ethnicity |
ethnicityAlt | Vietnamese | The best alternative ethnicity |
ethnicitiesTop | Chinese, Vietnamese, NativeHawaiian ... | The top 10 likely ethnicities |
score | 12 | A non calibrated Score : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100 ; NB: calibrated probabilities are available for Diaspora from v2.0.16 |
lifted | false | Some classifications are 'lifted' by a dictionary rule, instead of the machine learning |
This classification model infers the US 'race' / ethnicity from a US name. The geographic context HAS TO BE 'US', or the model will fail. This model outputs race/ethnicity according to US Census taxonomy W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino). The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_us_race_ethnicity
This is an independent assessment of the model's accuracy provided by ResearchDone.com : https://www.dropbox.com/s/xkfll1nswqjwdn1/Race%20Classification%20Results.txt
From NamSorAPIv2.0.14, it is possible to adjust the taxonomy using a header parameter, X-OPTION-USRACEETHNICITY-TAXONOMY
- X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-4CLASSES (the default) returns 4 classes W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino).
- X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-6CLASSES returns 6 classes W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander).
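The header option above could be set as in the following sketch, which only builds the URL and headers and sends nothing. The base URL prefix and the header name and values come from this page (the CLASSIC value is listed in the v2.0.15 release notes); the `usRaceEthnicity/{firstName}/{lastName}` path, the `X-API-KEY` header and the function name are assumptions to be checked against the API reference.

```python
from urllib.parse import quote

# Prefix shared by the taxonomyClasses URLs on this page.
BASE_URL = "https://v2.namsor.com/NamSorAPIv2/api2/json"

# Header values listed on this page.
TAXONOMIES = {
    "USRACEETHNICITY-4CLASSES",
    "USRACEETHNICITY-4CLASSES-CLASSIC",
    "USRACEETHNICITY-6CLASSES",
}


def build_us_race_ethnicity_request(first_name, last_name, api_key,
                                    taxonomy="USRACEETHNICITY-4CLASSES"):
    """Return (url, headers) for a US 'race'/ethnicity query with the chosen taxonomy."""
    if taxonomy not in TAXONOMIES:
        raise ValueError(f"unknown taxonomy: {taxonomy}")
    # Assumed endpoint path; names are percent-encoded as path segments.
    url = (f"{BASE_URL}/usRaceEthnicity/"
           f"{quote(first_name, safe='')}/{quote(last_name, safe='')}")
    headers = {
        "X-API-KEY": api_key,  # assumed name of the authentication header
        "X-OPTION-USRACEETHNICITY-TAXONOMY": taxonomy,
    }
    return url, headers


if __name__ == "__main__":
    url, headers = build_us_race_ethnicity_request(
        "Mary", "Cao", "my-api-key", "USRACEETHNICITY-6CLASSES")
    print(url)
    print(headers)
```

The returned pair can be passed to any HTTP client; validating the taxonomy value locally avoids spending API credits on a request that would use an unintended taxonomy.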
Input :
id12|John W.|Smith|US
id15|Robert|Durieux|US
id16|Jordan|Jackson|US
id17|Carmen|Garcia|US
Output :
#uid|firstName|lastName|countryIso2|raceEthnicity|raceEthnicityAlt|raceEthnicityScore|rowId
id17|Carmen|Garcia|US|HL|A|10.32374080384995|0
id16|Jordan|Jackson|US|B_NL|W_NL|1.9105209599712982|1
id12|John W.|Smith|US|W_NL|B_NL|2.783278508661135|2
id15|Robert|Durieux|US|W_NL|B_NL|1.8889062776993453|3
{ "id": null, "firstName": "Mary", "lastName": "Cao", "raceEthnicityAlt": "W_NL", "raceEthnicity": "A", "score": 27.341640697082248, "raceEthnicitiesTop": [ "A", "W_NL", "HL", "B_NL" ], "probabilityCalibrated": 0.9104267920103436, "probabilityAltCalibrated": 0.954264449825495 }
Field | Example | Description |
---|---|---|
id | ref12315 | The input identifier |
firstName | Mary | The input first name / given name |
lastName | Cao | The input last name / surname |
countryIso2 | US | The country of residence, the host country (ex. US, CA, NZ, GB) |
raceEthnicity | A | The likely 'race'/ethnicity : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino) |
raceEthnicityAlt | W_NL | The best alternative 'race'/ethnicity |
raceEthnicitiesTop | A, W_NL, ... | The likely 'race'/ethnicities |
probabilityCalibrated | 0.91 | The calibrated probability of having guessed right the 'race'/ethnicity as A (Asian) |
probabilityAltCalibrated | 0.95 | The calibrated probability of having guessed right the 'race'/ethnicity as either A or W_NL (White Non Latino) |
score | 27 | A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100 |
This classification model is a utility for parsing full names (ex. John Smith or Smith, John) into the first and last name components. The system will detect which part is more likely a given name or a family name, and decide where to split in complex cases (such as aristocratic names, composed names, etc.)
Input :
John W. Smith
Mary Smith
Elena Rossi
Robert Durieux
Durieux Robert
Smith Mary
Output :
#uid|fullName|firstNameParsed|lastNameParsed|nameParserType|nameParserTypeAlt|nameParserTypeScore|rowId
uid4|Durieux Robert|Robert|Durieux|LN1FN1|null|8.984422928615022|0
uid5|Smith Mary|Mary|Smith|LN1FN1|null|8.313637255008238|1
uid2|Elena Rossi|Elena|Rossi|FN1LN1|null|7.534662622281973|2
uid3|Robert Durieux|Robert|Durieux|FN1LN1|null|8.429672146707018|3
uid0|John W. Smith|John W.|Smith|FN2LN1|null|16.30796669777909|4
uid1|Mary Smith|Mary|Smith|FN1LN1|null|7.758738464551846|5
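The nameParserType codes in the output above appear to encode the order and number of tokens: FN2LN1 reads as two first-name tokens followed by one last-name token, and LN1FN1 as one last-name token followed by one first-name token. That reading is inferred from the sample rows, not from a documented specification. Under that assumption, a code can be applied to a whitespace-tokenized name as follows (the function name is hypothetical, and the real service chooses the code itself):

```python
import re


def split_by_parser_type(full_name, parser_type):
    """Split a full name into (firstName, lastName) following a code such as FN2LN1."""
    tokens = full_name.split()
    groups = re.findall(r"(FN|LN)(\d+)", parser_type)
    parts = {"FN": [], "LN": []}
    position = 0
    for kind, count in groups:
        size = int(count)
        parts[kind].extend(tokens[position:position + size])
        position += size
    if not groups or position != len(tokens):
        raise ValueError(f"{parser_type!r} does not match {full_name!r}")
    return " ".join(parts["FN"]), " ".join(parts["LN"])


if __name__ == "__main__":
    # Rows taken from the sample output above.
    print(split_by_parser_type("John W. Smith", "FN2LN1"))
    print(split_by_parser_type("Durieux Robert", "LN1FN1"))
```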