Weather Underground stores data from over 250,000 personal weather stations across the world. Unfortunately, historical data are not easy to access. It's possible to view tables of 5-min data for a single day (see this example from a station outside Crested Butte, Colorado), but if you try to scrape the HTML using something like Python's requests library, the tables appear blank.
Weather Underground has a security policy that blocks automated requests from viewing data stored in each table. This is where Selenium WebDriver comes in. WebDriver is a toolbox for natively running web browsers, so when you render a page with WebDriver, Weather Underground thinks a regular user is accessing their website and you can access the full page source, including the populated tables.
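As a rough illustration of that idea, here is a minimal sketch (not the actual contents of scrape_wunderground.py) that renders a page in Chrome through Selenium and hands the resulting HTML to BeautifulSoup. The function names, the URL pattern in `build_daily_url`, and the `timeout` parameter are assumptions made for this example; it also assumes Selenium 4 and that the data appear in ordinary `<table>` elements.

```python
from urllib.parse import quote


def build_daily_url(station_id: str, date: str) -> str:
    """Build the URL of a station's daily table page.

    ASSUMPTION: this URL pattern is illustrative and may not match the
    pattern used by the script or the current Weather Underground site.
    `date` is expected as 'YYYY-MM-DD'.
    """
    station = quote(station_id, safe="")
    return (
        "https://www.wunderground.com/dashboard/pws/"
        f"{station}/table/{date}/{date}/daily"
    )


def fetch_rendered_tables(url: str, chromedriver_path: str, timeout: int = 30):
    """Render `url` in Chrome and return every <table> found in the page.

    Hypothetical helper, assuming Selenium 4 and beautifulsoup4.
    """
    # Imported here so build_daily_url() works without Selenium installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome(service=Service(chromedriver_path))
    try:
        driver.get(url)
        # Wait until at least one table element is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        html = driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()

    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("table")
```

A call such as `fetch_rendered_tables(build_daily_url("STATION_ID", "2021-07-01"), "/path/to/chromedriver")` would then return a list of table tags to parse further; `STATION_ID` is a placeholder, not a real station.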
To run the script, the first thing to do is ensure that ChromeDriver is installed. Note that you have to match the ChromeDriver version to whichever version of Chrome is installed on your machine. It's also possible to use something other than Chrome, for example geckodriver for Firefox or safaridriver for Safari.
Next, update the path to chromedriver in scrape_wunderground.py:
# Set the absolute path to chromedriver
chromedriver_path = '/path/to/chromedriver'
As long as BeautifulSoup and Selenium are installed, the script should work fine after that. However, there are a few important points to note about processing the data once it's downloaded:
- All data is listed in local time, so summer data is in daylight saving time and winter data is in standard time.
- Depending on the quality of the station,
- All pressure data is reported as sea-level pressure. Depending on the weather station, it may be possible to back-calculate to absolute pressure; some manufacturers (e.g., Ambient Weather WS-2902) use a constant offset whereas others (e.g., Davis Vantage Pro2) perform a more complicated barometric pressure reduction using the station's 12-hr temperature and humidity history.
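The local-time and constant-offset points above can be handled with a few lines of post-processing. The sketch below is one possible approach, not part of the script: `local_to_utc` and `station_pressure` are hypothetical helpers, the time zone name must be chosen for the station in question (for example `America/Denver` for a Colorado station), and the pressure offset is a value you would have to determine for your own station. It uses the standard-library `zoneinfo` module, so it assumes Python 3.9 or later with time zone data available.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_to_utc(naive_local: datetime, tz_name: str) -> datetime:
    """Interpret a naive timestamp as local time in `tz_name`, return UTC.

    Daylight saving time is resolved by the time zone database. During the
    ambiguous hour when clocks fall back, the earlier (DST) reading is
    assumed, because datetime defaults to fold=0.
    """
    aware = naive_local.replace(tzinfo=ZoneInfo(tz_name))
    return aware.astimezone(timezone.utc)


def station_pressure(sea_level_pressure: float, offset: float) -> float:
    """Back-calculate absolute pressure for constant-offset stations.

    ASSUMPTION: the station reports sea-level pressure as
    absolute pressure + offset, with both values in the same units.
    This does not apply to stations that use a temperature- and
    humidity-dependent reduction.
    """
    return sea_level_pressure - offset


if __name__ == "__main__":
    summer = local_to_utc(datetime(2021, 7, 1, 12, 0), "America/Denver")
    winter = local_to_utc(datetime(2021, 1, 1, 12, 0), "America/Denver")
    print(summer.isoformat(), winter.isoformat())
```

With `America/Denver`, local noon maps to 18:00 UTC in July and 19:00 UTC in January, which is the one-hour shift the first bullet warns about.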
time
pandas
numpy
selenium
beautifulsoup4
Pull requests are welcome. Please let me know if you have suggestions for improvement.
Zach Perzan
[email protected]
ORCiD #0000-0003-2676-2452