A simple objectified crawler to acquire the air quality data in China from
空气质量历史数据查询
An objectfied crawler enables users to acquire AQI data in a free way and couple the crawler into users' own codes
requests-html
Install requests-html
by using
pip install requests-html
Import the crawler class
from crawler import AQIseeker
Acquire the data from the website
this_page = AQIseeker('南京', '201703', 5) # '5' maximum request attempts (default=5)
this_page.getData()
this_page.metadict # access the dict that holds the data
# Note that one crawler instance can only retrieve the data of ONE specified city in ONE given month
The crawler class accepts any valid input and attempts to get the data from the website. Though the user is totally free to call the class in their own ways, a simple parser is provided to handle such demand
Create a text file 'some_cities.txt' to hold some contents like below
南京 201701-201706
上海 201609-201703
北京 201610-201705
the format should be
city_name yyyymm-yyyymm
Import a parser
from setting_parser import getCityTime
city_time_dict = getCityTime('some_cities.txt') # return a dict
The parser will return a dictionary containing the city names (as indice) and the month list. Please refer to 'example_front.py' for the usage of the parser and the crawler
- Base class of the crawler
- A parser for formatted txt file
- Language support of city names in English
- Provide an alternative method to acquire data from multiple cities and months from a dict/str
- Improve the performance by introducing parellel operation
---古老语言的分割线---
一个简单的对象化爬虫,用于从以下网页爬取空气质量数据
空气质量历史数据查询
对象化的爬虫允许用户更自由地获取特定城市和时间的空气质量数据,并且更方便插入用户自己的代码
requests-html
安装 requests-html
pip install requests-html
import爬虫的类
from crawler import AQIseeker
从网站获取数据
this_page = AQIseeker('南京', '201703', 5) # '5' 最大请求次数 (默认=5)
this_page.getData()
this_page.metadict # 获取的数据会存放在字典metadict中
只要是符合格式的城市名和时间表达式,该爬虫都可以处理。用户可以根据自己的需求请求多个数据,也可以使用此处提供的固定文本格式和文本处理工具一次性定义多个城市和时间
创建一个txt文件'some_cities.txt'(文件名可以随意),文件中的内容如下
南京 201701-201706
上海 201609-201703
北京 201610-201705
文本格式应为
city_name yyyymm-yyyymm
import提供的文本处理工具
from setting_parser import getCityTime
city_time_dict = getCityTime('some_cities.txt') # 返回一个字典
该工具会返回一个包含城市名(作为keys)和月份列表(作为values)的字典,之后可以使用爬虫来获取数据。可以参考'example_front.py'
- 本爬虫的基础类
- 格式化文本的处理工具, 用于批量获取数据
- 对中文城市名的英语支持
- 允许从字典/字符串获取多个城市/月份的数据
- 使用平行操作改善爬虫性能