Skip to content
This repository has been archived by the owner on Jan 22, 2019. It is now read-only.

Combine CSV files, remove duplicates #11

Open
libchris opened this issue Nov 30, 2015 · 4 comments
Open

Combine CSV files, remove duplicates #11

libchris opened this issue Nov 30, 2015 · 4 comments

Comments

@libchris
Copy link

Have two CSV files, never borrowed and never browsed. want to create a file that shows those books that are never borrowed or browsed. So ... Presumably I can combine using functions shown in wk2 ...
Then use open refine to find the duplicates. Then a little unsure how I then remove one of the duplilcates, so all titles are shown, but one of duplicates is removed

@ostephens
Copy link
Contributor

@libchris I think you've got the right approach here - but first a question - is there a common value in both files (e.g. a system number, or even just the title?)

@libchris
Copy link
Author

libchris commented Dec 1, 2015

If true duplicates should have same accession number which is recorded in the files

@libchris
Copy link
Author

libchris commented Dec 1, 2015

Safer than title, as potentially two unique books could have same title

@ostephens
Copy link
Contributor

Thanks @libchris.

You could take the approach you describe - combine the two files outside OpenRefine, import the combined file, sort by the Accession number column and (this is important for later) 'Reorder permanently' (an option that you access from the 'Sort' drop down menu which will display once you have applied a sort).

You could do a 'Facet by Duplicate' on the Accession number column - this will give you all the lines in the file that are repeated in the file. You then have a few options:

To combine data from multiple rows which have the same accession number into a single row you need to convert your rows into 'records' (which allows you to have multiple rows linked as a 'record'). The way the 'records' mode works is slightly odd, but basically it relies on the first column having the 'key' to the record in it, with the first row of the record containing the key value, and the next X rows (that belong to the same record) being blank. This is probably easier to see in practice than through words:

  • In the Duplicates facet view, filter down to duplicates and 'Star' (or Flag) all rows (under the All menu), then remove the facet.
  • Move the Accession number column so it is the first column in the project
  • On the Accession column dropdown find 'Edit cells->Blank down' - this will remove any duplicate accession numbers and bring together the rows into a single 'record'
  • Switch to Record view
  • Then, on each of the other columns use the 'Edit cells->Join multi-valued cells' option to merge together the data that appears across rows in a single record
  • When you'd done this for all columns, all the data from your 'records' should be only in the top row of the record.
  • Switch back to Row mode, do a 'Facet by blank' on the Accession Number col, filter down to the blank ones, and remove these rows (under the 'All' menu at the head of the first col)
  • Finally do a 'Facet by Star' - this will display the rows that you previously found by the 'Duplicates' facet

This is clearly a bit complicated, but it is fairly mechanical and once you get used to the records/rows modes in OpenRefine relatively straightforward - there is a good introduction to this at http://kb.refinepro.com/2012/03/difference-between-record-and-row.html

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants