Incorrect assignment of inventor IDs #2

Open
doolin opened this issue Dec 29, 2012 · 1 comment

doolin commented Dec 29, 2012

I've been working with the Harvard Patent Dataverse 2010 datasets for quite a while now and have stumbled across an issue with unique inventor identification for records whose assignee numbers start with A or H (e.g. H000000000158 for Shell Oil Company instead of the regular 10266734). The algorithm seems to incorrectly assign different inventor IDs to records with such assignee numbers, even though the other characteristics of these records are very similar or exactly the same as those of the records listing the 'regular' assignee number.

Here's an example for one of Shell's key inventors:

HAROLD J VINEGAR BELLAIRE US 7631690 SHELL OIL COMPANY 10266734 166 04359687-1 2009
HAROLD J VINEGAR BELLAIRE US 7635023 SHELL OIL COMPANY 10266734 166 04359687-1 2009
HAROLD J VINEGAR BELLAIRE US 7635025 SHELL OIL COMPANY 10266734 166 04359687-1 2009

As you can see, these are OK. The inventor is correctly assigned Invnum 04359687-1. However, the following records each receive a different Invnum, although the other data fields clearly indicate the same inventor:

HAROLD J VINEGAR BELLAIRE US 7640980 SHELL OIL COMPANY H000000000158 166-268/166-302/166-369/405-52 07640980-0 2010
HAROLD J VINEGAR BELLAIRE US 7735935 SHELL OIL COMPANY H000000000158 299-5/166-2721/166-302/299-4 07735935-0 2010
HAROLD J VINEGAR BELLAIRE US 7681647 SHELL OIL COMPANY H000000000158 166-302/166-369 07681647-2 2010

For larger selections of data, this leads to many missing connections and to networks that appear less connected or dense than they actually are. So far, I've manually corrected the Invnums for these records, but of course this is not the way to go for selections containing thousands of records ;-)
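
In the meantime, a consistency check along the following lines could at least flag the affected records in a larger selection. This is only a minimal sketch, not the disambiguation algorithm itself: it assumes the records are already parsed into dicts, and the field names (`firstname`, `lastname`, `city`, `country`, `assignee`, `invnum`) are hypothetical and would need to match the actual column names. It groups records on the fields that agree in the example above and reports every group carrying more than one Invnum.

```python
from collections import defaultdict

# The six example records from this issue. Field names are hypothetical.
RECORDS = [
    {"firstname": "HAROLD J", "lastname": "VINEGAR", "city": "BELLAIRE", "country": "US",
     "patent": "7631690", "assignee": "SHELL OIL COMPANY", "asgnum": "10266734", "invnum": "04359687-1"},
    {"firstname": "HAROLD J", "lastname": "VINEGAR", "city": "BELLAIRE", "country": "US",
     "patent": "7635023", "assignee": "SHELL OIL COMPANY", "asgnum": "10266734", "invnum": "04359687-1"},
    {"firstname": "HAROLD J", "lastname": "VINEGAR", "city": "BELLAIRE", "country": "US",
     "patent": "7635025", "assignee": "SHELL OIL COMPANY", "asgnum": "10266734", "invnum": "04359687-1"},
    {"firstname": "HAROLD J", "lastname": "VINEGAR", "city": "BELLAIRE", "country": "US",
     "patent": "7640980", "assignee": "SHELL OIL COMPANY", "asgnum": "H000000000158", "invnum": "07640980-0"},
    {"firstname": "HAROLD J", "lastname": "VINEGAR", "city": "BELLAIRE", "country": "US",
     "patent": "7735935", "assignee": "SHELL OIL COMPANY", "asgnum": "H000000000158", "invnum": "07735935-0"},
    {"firstname": "HAROLD J", "lastname": "VINEGAR", "city": "BELLAIRE", "country": "US",
     "patent": "7681647", "assignee": "SHELL OIL COMPANY", "asgnum": "H000000000158", "invnum": "07681647-2"},
]


def find_split_inventors(records):
    """Return {group key: sorted Invnums} for groups with more than one Invnum.

    The group key deliberately ignores the assignee number, since that is
    the field that differs between the 'regular' and the A/H records.
    """
    groups = defaultdict(set)
    for rec in records:
        key = (rec["firstname"], rec["lastname"], rec["city"],
               rec["country"], rec["assignee"])
        groups[key].add(rec["invnum"])
    return {key: sorted(nums) for key, nums in groups.items() if len(nums) > 1}


split = find_split_inventors(RECORDS)
for key, invnums in split.items():
    print(key, invnums)
```

Exact matching on name, city and assignee name will also lump together genuinely different people who share a common name, so the hits are best treated as candidates for review rather than as automatic merges.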

Would it be possible to address this issue in the next release of the datasets? Please let me know if there's any other info I can provide to further clarify this issue.

Thanks,

André

doolin commented Dec 29, 2012

Cross-listed with funginstitute/disambiguator#2 because it's not perfectly clear where the problem arises.
