How to do case insensitive counts of hashtags or other texts...
In digital methods research you may sometimes end up with a list of hashtags where small variations in case and capitalisation are counted as separate items, such as in the following table:
date | item | value |
---|---|---|
2021 | ESEAHM2021 | 179 |
2021 | ESEAhm2021 | 46 |
2021 | eseahm2021 | 44 |
2021 | ESEA | 16 |
2021 | Earth2Air | 13 |
2021 | ESEAHM | 13 |
2021 | ESEAHeritageMonth | 11 |
2021 | NIW2021 | 10 |
2021 | TheMSGpod | 8 |
2021 | MoongateMix | 8 |
This recipe explores simple approaches to changing cases.
- A dataset of hashtags where case variations are counted separately (as above)
There are many ways of changing cases of items in datasets with different tools and scripts.
Here's one simple approach using spreadsheets...
- Import your dataset into your spreadsheet software
- Insert new column in which to add hashtags with their cases all changed
- Give the column a meaningful header (e.g.
lower
if you will change hashtags to lowercase) - In the row next to your first item, type
=LOWER()
and add the reference for the first item between the brackets.- So if you are adding the new cell in
C2
and your orginal case sensitive hashtag isB2
then you can type=LOWER(B2)
in cellC1
. - This syntax should work both in Excel as well as in Google Sheets. While lowercase may be easier to read, you could use uppercase instead if you preferred - the main thing is to get all the hashtags in the same case.
- So if you are adding the new cell in
- Press the enter key to complete and you should see case adjusted hashtag in the new column.
- Fill down this formula for the rest of the column
- For example by using fill down in Excel or autofill in Google Sheets.
Another simple alternative to the approach above you can use OpenRefine...
- Download and install OpenRefine. After installation it should open up a browser window.
- Click "choose files" and find the CSV file that you have downloaded from TCAT and then click "Next". (CSV stands for "comma-separated values" and this is a simple, widely used format for storing tabular data and moving it between different applications and services.)
- Check that the data has been "parsed" correctly and then click "Create Project".
- Once the data has loaded you can click the small down arrow next to the column containing your hashtags and click "Edit cells" > "Common transforms" > "To lowercase".
- Finally you can click "Export" in the top right hand corner and select "Comma separated value" to re-export the data as a CSV file that you can then work with in a spreadsheet.
There are many other ways to do this (e.g. using pandas).
Now you should have a table something like the following:
date | item | lower | value |
---|---|---|---|
2021 | ESEAHM2021 | eseahm2021 | 179 |
2021 | ESEAhm2021 | eseahm2021 | 46 |
2021 | eseahm2021 | eseahm2021 | 44 |
2021 | ESEA | esea | 16 |
2021 | Earth2Air | earth2air | 13 |
2021 | ESEAHM | eseahm | 13 |
2021 | ESEAHeritageMonth | eseaheritagemonth | 11 |
2021 | NIW2021 | niw2021 | 10 |
2021 | TheMSGpod | themsgpod | 8 |
2021 | MoongateMix | moongatemix | 8 |
You'll see that the first three hashtags are still separate. How can we combine these?
To recount the new column you can:
- Create a pivot table
- There are slightly different ways of doing this in Excel and Google Sheets.
- Select your new column (in this case
lower
) for the rows - Select the column with the counts (in this case
value
) for the values - Summarise and sort by the
SUM
of the column with counts in (in this casevalue
)
If you have a date
column and multiple different years for the hashtags, you can also add the date to the pivot table to ensure that multiple years are not merged together.
You should now have a table with case insensitive counts of your hashtags, which you can export into a new csv file for further analysis. 🎊
Note the difference in counts between this new table and the tables above (which is why case insensitive counts can matter for analysis)! 😲
date | lower | SUM of value |
---|---|---|
2021 | eseahm2021 | 270 |
2021 | esea | 18 |
2021 | eseahm | 17 |
2021 | earth2air | 13 |
2021 | eseaheritagemonth | 12 |
2021 | niw2021 | 10 |
2021 | beseakidlit | 9 |
2021 | themsgpod | 8 |
2021 | moongatemix | 8 |
2021 | londonpodfest | 7 |