Cillian Berragan [@cjberragan]1*, Alex Singleton [@alexsingleton]1, Alessia Calafiore [@alel_domi]2 & Jeremy Morley [@jeremy_morley]3
1 Geographic Data Science Lab, University of Liverpool, Liverpool, United Kingdom
2 Edinburgh College of Art, University of Edinburgh, United Kingdom
3 Ordnance Survey, Southampton, United Kingdom
*Correspondence: [email protected]
Observed regional variation in geotagged social media text is often attributed to dialects, where features in language are assumed to exhibit region-specific properties. While dialects are seen as a key component in defining the identity of regions, many other geographic properties may be captured within natural language text. In our work, we consider locational mentions that are directly embedded within comments on the social media website Reddit, which provide a range of associated semantic information and enable deeper representations of the relationships between locations to be captured. Using a large corpus of Reddit comments from UK-related local discussion subreddits, we identify place names using a transformer-based named entity recognition model. Embedded semantic information is then generated from these comments and aggregated into local authority districts, representing the semantic footprint of these regions. These footprints broadly exhibit spatial autocorrelation, with clusters that conform to the national borders of Wales and Scotland. London, Wales, and Scotland demonstrate notably different semantic footprints compared with the rest of the UK, which may be explained by the perception of national identity associated with these regions.
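As a rough illustration of the aggregation step described above, the sketch below mean-pools comment embeddings by district to form one "footprint" vector per district. Mean pooling is an assumption made for illustration; the aggregation used in the paper may differ. The function name `aggregate_embeddings`, the district codes, and the toy two-dimensional vectors are all hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
from collections import defaultdict


def aggregate_embeddings(records):
    """Mean-pool comment embeddings per district (illustrative only).

    `records` is an iterable of (district_code, embedding) pairs, where
    every embedding is a sequence of floats of the same length.
    """
    sums = {}
    counts = defaultdict(int)
    for district, embedding in records:
        if district not in sums:
            sums[district] = [0.0] * len(embedding)
        if len(embedding) != len(sums[district]):
            raise ValueError("all embeddings must share one dimensionality")
        # Accumulate an element-wise running sum for this district.
        sums[district] = [s + x for s, x in zip(sums[district], embedding)]
        counts[district] += 1
    # Divide each summed vector by the number of comments in the district.
    return {d: [s / counts[d] for s in total] for d, total in sums.items()}


# Toy 2-dimensional embeddings with made-up district codes.
records = [
    ("district_a", [1.0, 0.0]),
    ("district_a", [0.0, 1.0]),
    ("district_b", [0.5, 0.5]),
]
footprints = aggregate_embeddings(records)
```

Real sentence embeddings typically have several hundred dimensions, so an array library would likely be a better fit in practice.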
The NER model used as part of this work is available on the HuggingFace model hub. Instructions for using this model are included on the model card.
https://huggingface.co/cjber/reddit-ner-place_names
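The model card remains the authoritative source for usage. As a minimal sketch, assuming the model loads through the standard `transformers` token-classification pipeline, it could be called as follows. The helper names `load_place_name_tagger` and `entity_spans` and the `min_score` threshold of 0.5 are hypothetical choices for this example, not part of the released code.

```python
def load_place_name_tagger(model_name="cjber/reddit-ner-place_names"):
    """Build a token-classification pipeline for the published model.

    Assumes the `transformers` package is installed; the import is kept
    inside the function so `entity_spans` can be used without it.
    """
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entities.
    return pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",
    )


def entity_spans(entities, min_score=0.5):
    """Keep (text, start, end) for entities at or above `min_score`.

    `entities` is the list of dicts returned by an aggregated
    token-classification pipeline. The 0.5 threshold is an arbitrary
    illustrative default, not a recommended value.
    """
    return [
        (e["word"], e["start"], e["end"])
        for e in entities
        if e["score"] >= min_score
    ]
```

Calling `entity_spans(load_place_name_tagger()("I moved from Liverpool to Edinburgh last year."))` downloads the model on first use and returns the detected spans with their character offsets.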
src
├── common
│ └── utils.py # various utility functions and constants
├── preprocessing.py # process comments with identified place names
├── embeddings.py # generate sentence embeddings
└── zero_shot.py # generate identities using zero shot