This codebase contains code for scraping obituaries from Legacy.com. The workflow has three steps:
- scrape URLs of obituary listings
- scrape obituary text from URLs
- process the text to compute age, gender, and race for the deceased
In brief, our goal is to evaluate how well these obituaries track official death records, using Washington, DC, as a test case.
This study aims (1) to evaluate the feasibility and accuracy of using open-source data for monitoring COVID-19 and (2) to estimate demographic-specific excess mortality from all causes in 2020 and 2021 using official death records and obituary data. Automated text mining of openly available online obituaries could yield quick, cost-effective estimates of the age, sex, and race distribution of deaths by location. This is not currently possible because federally available datasets lack the granularity and timeliness needed for monitoring efforts that can inform policy. The approaches pursued in this study will also help prepare tools to monitor future outbreaks and to understand other causes of death, e.g., AIDS or opioid overdose.
First, follow the kickstart guide for setup and installation.
- Note that there are a few special setup steps for SUTime (which require Maven) and SpaCy (namely `python3 -m spacy download en_core_web_sm`). Everything else can be handled by creating a virtual environment and running `pip3 install -r requirements.txt`.
Then execute the code for each step by following the run instructions. The three steps are URL scraping, obituary scraping, and postprocessing.
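As a rough illustration of what the postprocessing step involves, the sketch below pulls an age and a gender guess out of raw obituary text. It is not the code in this repository: the actual postprocessing relies on SUTime and SpaCy, while this example swaps in plain regular expressions and pronoun counting. The names `extract_age`, `guess_gender`, and `AGE_PATTERNS` are hypothetical, and race inference is left out entirely.

```python
import re
from typing import Optional

# Hypothetical patterns for common age phrasings in obituaries.
# Assumption: the real pipeline uses SUTime/SpaCy rather than these regexes.
AGE_PATTERNS = [
    re.compile(r"\baged?\s+(\d{1,3})\b", re.IGNORECASE),
    re.compile(r"\b(\d{1,3})\s+years?\s+old\b", re.IGNORECASE),
    re.compile(r"\bat\s+the\s+age\s+of\s+(\d{1,3})\b", re.IGNORECASE),
]

FEMALE_PRONOUNS = {"she", "her", "hers"}
MALE_PRONOUNS = {"he", "him", "his"}


def extract_age(text: str) -> Optional[int]:
    """Return the first plausible age mentioned in the text, or None."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            age = int(match.group(1))
            if 0 <= age <= 120:  # discard implausible values
                return age
    return None


def guess_gender(text: str) -> Optional[str]:
    """Guess gender from pronoun counts; return None when counts are tied."""
    tokens = re.findall(r"[a-z]+", text.lower())
    female = sum(token in FEMALE_PRONOUNS for token in tokens)
    male = sum(token in MALE_PRONOUNS for token in tokens)
    if female == male:
        return None
    return "female" if female > male else "male"


if __name__ == "__main__":
    sample = ("Jane Doe, age 84, of Washington, DC, passed away peacefully. "
              "She is survived by her two sons.")
    print(extract_age(sample), guess_gender(sample))
```

A heuristic like this misses ages that are only implied by birth and death dates, which is one reason a temporal tagger such as SUTime is useful for the real pipeline.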