A Lagou.com spider built on Scrapy that crawls information for a configurable set of job positions and stores it in a MySQL database.
- Configure which positions to crawl
- Store the results in a MySQL database
- Deploy on a server, run on a daily schedule, and send an email report when the crawl finishes
Python 2.7.10
Install the dependencies from requirements.txt:

```
pip install -r requirements.txt
```
Set the positions to crawl; each entry must exactly match the slug used in the Lagou URL,
e.g. https://www.lagou.com/zhaopin/**ziranyuyanchuli**/2/

```python
JOBS = {"Java", "Python", "PHP", "C++", "shujuwajue", "HTML5", "Android", "iOS", "webqianduan"}
```
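For reference, here is a sketch of how the JOBS set could be expanded into start URLs. The spider class, the 30-page cap, and the empty parse stub are illustrative assumptions, not the repository's actual code:

```python
# Hypothetical sketch: expand JOBS slugs into Lagou listing URLs.
# The page range (30) is an assumed upper bound for illustration.
import scrapy


class LagouSpider(scrapy.Spider):
    name = 'lagou'
    JOBS = {"Java", "Python", "PHP", "C++", "shujuwajue",
            "HTML5", "Android", "iOS", "webqianduan"}

    def start_requests(self):
        for job in self.JOBS:
            for page in range(1, 31):
                url = 'https://www.lagou.com/zhaopin/{0}/{1}/'.format(job, page)
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Field extraction lives in the real spider.
        pass
```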
Run create_table.sql to create the database tables, then fill in the database settings:
```python
# Database configuration
MYSQL_HOST = 'xxx.xx.xx.xx'
MYSQL_DBNAME = 'Spider'
MYSQL_USER = 'xx'
MYSQL_PASSWD = 'xx'
MYSQL_PORT = 3306  # MySQL's default port
```
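A minimal sketch of the item pipeline these settings would feed, assuming a pymysql connection and a `position` table created by create_table.sql; the table name, column names, and item fields are assumptions:

```python
# Illustrative MySQL pipeline; the table/column names and the use of
# pymysql are assumptions, not necessarily the repository's implementation.
import pymysql


class MySQLPipeline(object):
    def open_spider(self, spider):
        s = spider.settings
        self.conn = pymysql.connect(
            host=s.get('MYSQL_HOST'),
            port=s.getint('MYSQL_PORT'),
            user=s.get('MYSQL_USER'),
            password=s.get('MYSQL_PASSWD'),
            db=s.get('MYSQL_DBNAME'),
            charset='utf8mb4')

    def process_item(self, item, spider):
        # Assumed columns; adjust to match create_table.sql.
        with self.conn.cursor() as cursor:
            cursor.execute(
                'INSERT INTO position (title, salary, company) VALUES (%s, %s, %s)',
                (item.get('title'), item.get('salary'), item.get('company')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```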
```python
# Email configuration
From_ADDR = '[email protected]'
TO_ADDR = '[email protected]'
PASSWORD = 'xxxx'
SMTP = 'smtp.163.com'
```
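The completion report can be sent with the standard library along these lines; SMTP over SSL on port 465 matches smtp.163.com, while the settings import path, subject, and body are assumptions:

```python
# Hedged sketch of the post-crawl email report (not the repo's exact code).
import smtplib
from email.header import Header
from email.mime.text import MIMEText

from settings import From_ADDR, TO_ADDR, PASSWORD, SMTP  # assumed module path


def send_report(body):
    msg = MIMEText(body, 'plain', 'utf-8')
    msg['Subject'] = Header('Lagou crawl report', 'utf-8')
    msg['From'] = From_ADDR
    msg['To'] = TO_ADDR
    server = smtplib.SMTP_SSL(SMTP, 465)  # 163.com uses SSL on port 465
    server.login(From_ADDR, PASSWORD)
    server.sendmail(From_ADDR, [TO_ADDR], msg.as_string())
    server.quit()
```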
Run the spider with either of:

```
scrapy crawl lagou
```

or

```
python main.py
```
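main.py is presumably a thin wrapper around the Scrapy command line; a minimal equivalent looks like this:

```python
# Minimal main.py sketch: equivalent to `scrapy crawl lagou`.
from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(['scrapy', 'crawl', 'lagou'])
```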
Create a virtual environment on the server with virtualenv.

Deployment uses scrapyd + SpiderKeeper.
Start the scrapyd service in a new screen session:

```
scrapyd
```
From the Scrapy project root, deploy to the `name` target (defined in scrapy.cfg; see the sketch below):

```
scrapyd-deploy name
```
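scrapyd-deploy resolves its target from scrapy.cfg in the project root; a minimal deploy section might look like this (the scrapyd URL and project name are assumptions):

```ini
[deploy:name]
url = http://localhost:6800/
project = LagouSpider
```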
See the SpiderKeeper documentation for the remaining steps, then schedule the job to run periodically from its dashboard.
In progress.