GitHub - alephpi/Mathematical-Genealogy: A brief scraper project for self-study, using scrapy framework and sqlite3

Mathematical Genealogy Project scraper

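This is a project for teaching myself web scraping; the end goal is to scrape the database of the Mathematical Genealogy Project. The crawler is built on the scrapy framework and writes the data to a database with sqlite3.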
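To make the scrapy-plus-sqlite3 combination concrete, here is a minimal sketch of an item pipeline that writes scraped records to SQLite. It is not the code in mgpSpider. The class name `SqlitePipeline`, the file name `mgp.db`, the table `mathematician`, and its columns are all assumptions made for illustration, and a single `advisor_id` column is a simplification, since a person can have more than one advisor. The method names follow Scrapy's documented item pipeline interface (`open_spider`, `process_item`, `close_spider`); the class itself does not need to import scrapy.

```python
import sqlite3


class SqlitePipeline:
    """Sketch of a Scrapy-style item pipeline that stores items in SQLite."""

    def __init__(self, db_path="mgp.db"):
        # Hypothetical database file name; pass ":memory:" for experiments.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the crawl starts: open the database, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS mathematician ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL, advisor_id INTEGER)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE avoids an error if the same id is crawled twice.
        self.conn.execute(
            "INSERT OR REPLACE INTO mathematician (id, name, advisor_id) "
            "VALUES (?, ?, ?)",
            (item["id"], item["name"], item.get("advisor_id")),
        )
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends: persist everything and clean up.
        self.conn.commit()
        self.conn.close()
```

In a real Scrapy project a class like this would be enabled through the `ITEM_PIPELINES` setting. Committing once at the end keeps the crawl fast but loses data if the process is killed, so committing every few hundred items may be a safer choice for a long crawl.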

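References:

Scraper projects

The Beijing Institute of Technology (BIT) open course on Python web scraping. It mainly covers the requests and bs4 libraries, with a little of the scrapy framework. It suits Python beginners who would rather not read English, but it is fairly dated and the later examples no longer work. The subfolder learn_crawler holds some messy class notes.
A scraper for QQ Zone (Qzone) posts, which uses the selenium library and the Chrome browser driver to simulate filling in the login form for QQ Zone. The code still works; copy and paste it.
A similar MGP scraper; I followed its approach of storing the data in a database.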

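Books on scraping

Web Scraping with Python (Chinese title: Python 网络数据采集) by Ryan Mitchell. It covers somewhat more than the BIT course (databases, for example). The accompanying project code is on GitHub. The first edition has a Chinese translation (one of the few technical-book translations with some personality); the second edition adds a scrapy chapter, which only scratches the surface and is mostly excerpts from the official documentation.
Learning Scrapy. Mainly covers the scrapy library. Someone translated the first edition into Chinese and published it on Jianshu, but the scrapy version it describes is old, so watch for keywords whose usage has since changed. The part that uses Chrome's Inspect feature to find XPath expressions is great fun. It also introduces the scrapy shell, which can open an IPython kernel for interactive use and is well suited to debugging before writing the scrapy project itself.
Getting Started with SQL. Mainly covers the SQL language and some basic database operations. Very short, handy for quick reference.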

Official documentation

scrapy, an excellent open-source scraping framework
sqlite3, a lightweight database interface

Other

I also picked up a little about the HTML syntax tree, the XPath language, and regular expressions. Fun!
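As a small illustration of how XPath and regular expressions work together in a scraper, the sketch below selects links with an XPath expression and then pulls a numeric id out of each `href` with a regex. Everything in it is an assumption made for the example: the HTML fragment, the names, the id value, and the `id.php?id=...` URL shape are placeholders, not data taken from the site. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; scrapy's own selectors are more forgiving.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a scraped page.
SAMPLE = """
<div id="mainContent">
  <h2>Example Student</h2>
  <p>Advisor: <a href="id.php?id=101">Example Advisor</a></p>
  <p><a href="about.php">About</a></p>
</div>
"""

# Assumed URL shape: a numeric id passed as a query parameter.
ID_PATTERN = re.compile(r"id\.php\?id=(\d+)")


def extract_person_links(html):
    """Return (id, link text) pairs for every link that matches ID_PATTERN."""
    root = ET.fromstring(html.strip())
    results = []
    # XPath: every <a> element at any depth that has an href attribute.
    for anchor in root.findall(".//a[@href]"):
        match = ID_PATTERN.search(anchor.get("href"))
        if match:
            results.append((int(match.group(1)), (anchor.text or "").strip()))
    return results


print(extract_person_links(SAMPLE))
```

The XPath step narrows the tree down to candidate elements, and the regex step handles the string-level detail inside an attribute, which is a division of labour that tends to keep both expressions short.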

Screenshot for the record

Nearly 300,000 pages; the crawl took nearly 10 hours on a server.
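Those two figures imply an average throughput, which the following back-of-the-envelope sketch computes. Both inputs are rounded ("nearly" 300,000 pages, "nearly" 10 hours), so the result is only an estimate; the function name `pages_per_second` is made up for this example.

```python
def pages_per_second(pages, hours):
    """Average crawl rate, given a page count and a duration in hours."""
    return pages / (hours * 3600)


rate = pages_per_second(300_000, 10)
print(f"about {rate:.1f} pages per second")
print(f"about {1 / rate:.2f} seconds per page")
```

That works out to roughly 8 pages per second, or a little over a tenth of a second per page on average.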

