This repository has been archived by the owner on Jan 22, 2019. It is now read-only.

Is there a cluster type function for matching full and abbreviated things? #16

Open
pjhatch125 opened this issue Dec 1, 2015 · 2 comments

Comments

@pjhatch125

Hi Owen,

I'm looking at some OA data and have some publisher names in full and some abbreviated.

Is there a way of clustering and merging them so all publisher names are in full?

Thanks,

Philippa

@ostephens
Contributor

@pjhatch125 the answer is probably 'it depends' :)

You could do a text facet on the Publisher names and edit the abbreviations to the full name in the facet. This would be OK if the numbers are low, but it's not going to be effective if you have lots of publishers and lots of variation.

In some cases the 'Cluster' functionality may help you merge names together, but generally abbreviations are so different from the full name that this isn't going to be effective (e.g. T&F vs Taylor and Francis).
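
To illustrate why key-based clustering struggles with abbreviations, here is a minimal Python sketch of a fingerprint-style key. It is a simplified approximation of the idea behind the 'fingerprint' keying method, not OpenRefine's actual implementation; the function name and sample values are illustrative.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Simplified fingerprint-style clustering key (an approximation only)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Trim surrounding whitespace and lowercase.
    text = text.strip().lower()
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, de-duplicate, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


# Illustrative values: two variants of the full name and one abbreviation.
for name in ["Taylor and Francis", "Francis and Taylor.", "T&F"]:
    print(repr(name), "->", repr(fingerprint(name)))
```

The two full-name variants share a key and would cluster, while the abbreviation gets a completely different key, so no amount of key normalisation brings it into the same cluster.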

Because of these challenges, the other option I'd consider is finding a mechanism to look up publisher information from another (external or local) source. There are two approaches here:

  • Look up a publisher name/abbreviation and get back an 'authorised' form
  • Look up the publication on a service that knows about publications and publishers and has consistent publisher metadata
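
As a rough sketch of the first approach, the Python below builds a lookup from every known variant to an 'authorised' form. The table contents, `normalise` and `authorised_form` are all hypothetical and only for illustration; in practice the variants would come from an external or local list.

```python
from typing import Optional

# Hypothetical variant list: authorised name -> known variants/abbreviations.
AUTHORISED = {
    "Taylor and Francis": ["T&F", "Taylor & Francis"],
    "Oxford University Press": ["OUP"],
}


def normalise(name: str) -> str:
    """Case-fold and collapse whitespace so trivial differences still match."""
    return " ".join(name.casefold().split())


# Invert the table: every variant (and the full name itself) maps to the full name.
LOOKUP = {
    normalise(variant): full
    for full, variants in AUTHORISED.items()
    for variant in [full, *variants]
}


def authorised_form(name: str) -> Optional[str]:
    """Return the authorised form, or None when the name is not in the table."""
    return LOOKUP.get(normalise(name))


print(authorised_form("  t&f "))
print(authorised_form("Unknown Press"))
```

Returning `None` for unknown names, rather than guessing, leaves the unmatched values easy to review by hand.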

Some examples:

  • NCSU have a list of publishing organisations with alternative names - you could download it (https://www.lib.ncsu.edu/ld/onld/downloads/ONLD.txt), upload it into OpenRefine, then use a 'cross' query to look up values between the two projects (see Best way to merge two spreadsheets with similar information? #10 for more on 'cross')
    • Having had a look, the NCSU list has lots of variant names but is perhaps not so good on abbreviations
  • Crossref Journal API - in the session we saw how this worked for journals and that included a Publisher
    • I don't know how consistent the publisher name is
  • Crossref DOI API - if you have DOIs but not ISSNs, this may be an option
  • Sherpa API - again you can look up by ISSN. API details at http://www.sherpa.ac.uk/romeo/api.html
    • I'd expect publisher to be consistent from Sherpa
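
For the Crossref options above, a minimal Python sketch might look like the following. It assumes the public Crossref REST API at `https://api.crossref.org`, with `/journals/{issn}` and `/works/{doi}` routes that return JSON carrying a `publisher` field under `message`; the helper names are hypothetical, and it is worth checking the current Crossref documentation before relying on this.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

# Assumed base URL of the public Crossref REST API.
CROSSREF_API = "https://api.crossref.org"


def journal_url(issn: str) -> str:
    """Build the journal lookup URL for an ISSN."""
    return f"{CROSSREF_API}/journals/{quote(issn)}"


def work_url(doi: str) -> str:
    """Build the work lookup URL for a DOI (the '/' in the DOI is kept)."""
    return f"{CROSSREF_API}/works/{quote(doi)}"


def publisher_from_response(payload: dict) -> Optional[str]:
    """Pull the publisher name out of a decoded response, if present."""
    return payload.get("message", {}).get("publisher")


def fetch_publisher(url: str) -> Optional[str]:
    """Fetch a URL and return its publisher field (makes a network call)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return publisher_from_response(json.load(response))
```

For example, `fetch_publisher(journal_url(issn))` would look up by ISSN and `fetch_publisher(work_url(doi))` by DOI. Whether the names that come back are consistent is exactly the open question noted above, so it would be worth faceting on the results before trusting them.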

It would be interesting to see if any of these prove effective!

@pjhatch125
Author

Thank you. Lots of ideas to try!
