🍻 [feat] Cookie-free strategy for Taobao (applicable to all of Alibaba's self-built platforms)
Cookie-free strategy for Taobao; it can be used to crawl all of Alibaba's self-built platforms
卜俊杰 committed Feb 27, 2020
1 parent fb207f4 commit 5835247
Showing 16 changed files with 1,674 additions and 19 deletions.
Empty file.
39 changes: 21 additions & 18 deletions README.md
@@ -9,9 +9,9 @@
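For Python developers who are already proficient with crawlers, this is a good set of examples that spares you from reinventing the wheel. The project is updated and maintained frequently so that it works out of the box and cuts down crawling time.

For beginners, the ✍️ hands-on projects show how a crawler is built from scratch. To build up crawler knowledge, see the [project wiki](https://github.com/DropsDevopsOrg/ECommerceCrawlers/wiki/%E7%88%AC%E8%99%AB%E5%88%B0%E5%BA%95%E8%BF%9D%E6%B3%95%E5%90%97%3F). Crawling can look complex and technically demanding, but with the right approach it is quite easy to reach the point of crawling data from mainstream websites in a short time. We do recommend having a concrete goal from the very start.

Driven by a goal, your learning becomes more focused and efficient. All the prerequisite knowledge you think you need can be picked up along the way to that goal 😁😁😁.

Corrections to this project's shortcomings are welcome via ⭕️ Issues or 🔔 PRs.

> Large files uploaded earlier run through three quarters of the commits, and each clone reached 100 MB, which goes against our original intent. We cannot efficiently delete every such file (too lazy), so we will re-initialize the repository's commit history. From now on we will not upload crawled data, and we will optimize the repository structure.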
@@ -31,23 +31,25 @@
<summary>Revenue table</summary>


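|Project|Revenue|Notes|
|:--|--:|:-:|
|DianpingCrawler|200|
|TaobaoCrawler|2000|
|SohuNewCrawler|2500|
|WechatCrawler|TBD|No concrete revenue yet|
|A provincial drug administration|80|
|fofa|700|
|baidu|1000|
|Spider pan-directory|1000|
|More…|…|Some other programs were not approved by their clients for open-sourcing|
| Project | Revenue | Notes |
| :-------------- | ---: | :--------------------------: |
| DianpingCrawler | 200 |
| TaobaoCrawler | 2000 |
| SohuNewCrawler | 2500 |
| WechatCrawler | TBD | No concrete revenue yet |
| A provincial drug administration | 80 |
| fofa | 700 |
| baidu | 1000 |
| Spider pan-directory | 1000 |
| More… | … | Some other programs were not approved by their clients for open-sourcing |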

</details>

## CrawlerDemo

- [x] [DianpingCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/DianpingCrawler): Dianping crawler
- [x] [East_money](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/East_money): crawling East Money with scrapy
- [x] [📛TaobaoCrawler(new)](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/TaobaoCrawler(new)): cookie-free crawling of Alibaba's self-built platforms (Taobao, Tmall, Xianyu, Cainiao Guoguo, etc.); in theory it is not blocked by anti-crawling mechanisms (only the Taobao program is provided; the other platforms follow the same approach and use the same signing scheme)
- [x] [📛TaobaoCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/TaobaoCrawler): Taobao product crawler
- [x] [📛ZhaopinCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/ZhaopinCrawler): crawler for the major job-recruitment sites
- [x] [ShicimingjuCrawleAndDisplay](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/ShicimingjuCrawleAndDisplay): crawling and displaying the Shicimingju poetry site
@@ -73,11 +75,12 @@
- [x] [0x13 Douban movie review analysis](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x13douban_yingping)
- [x] [0x14 Ctrip review crawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x14ctrip_crawler)
- [x] [0x15 Xiaomi app store crawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x15xiaomiappshop)

## Contribution👏

|<a href="https://github.com/Joynice"><img class="avatar" src="https://avatars0.githubusercontent.com/u/22851022?s=96&amp;v=4" width="48" height="48" alt="@Joynice"></a>|<a href="https://github.com/liangweiyang"><img class="avatar" src="https://avatars0.githubusercontent.com/u/37971213?s=96&amp;v=4" width="48" height="48" alt="@liangweiyang"></a>|<a href="https://github.com/Hatcat123"><img class="avatar" src="https://avatars0.githubusercontent.com/u/28727970?s=96&amp;v=4" width="48" height="48" alt="@Hatcat123"></a>|<a href="https://github.com/jihu9"><img class="avatar" src="https://avatars0.githubusercontent.com/u/17663102?s=96&amp;v=4" width="48" height="48" alt="@jihu9"></a>|<a href="https://github.com/ctycode"><img class="avatar" src="https://avatars3.githubusercontent.com/u/56985178?s=96&amp;v=4" width="48" height="48" alt="@ctycode"></a>|
|:-:|:-:|:-:|:-:|:-:|
|[Joynice](https://github.com/Joynice)|[liangweiyang](https://github.com/liangweiyang)|[Hatcat123](https://github.com/Hatcat123)|[jihu9](https://github.com/jihu9)|[ctycode](https://github.com/ctycode)|
|<a href="https://gitee.com/joseph31"><img class="avatar" src="https://avatars3.githubusercontent.com/u/47005658?s=460&v=4" width="48" height="48" alt="@joseph31"></a>|<a href="https://github.com/Joynice"><img class="avatar" src="https://avatars0.githubusercontent.com/u/22851022?s=96&amp;v=4" width="48" height="48" alt="@Joynice"></a>|<a href="https://github.com/liangweiyang"><img class="avatar" src="https://avatars0.githubusercontent.com/u/37971213?s=96&amp;v=4" width="48" height="48" alt="@liangweiyang"></a>|<a href="https://github.com/Hatcat123"><img class="avatar" src="https://avatars0.githubusercontent.com/u/28727970?s=96&amp;v=4" width="48" height="48" alt="@Hatcat123"></a>|<a href="https://github.com/jihu9"><img class="avatar" src="https://avatars0.githubusercontent.com/u/17663102?s=96&amp;v=4" width="48" height="48" alt="@jihu9"></a>|<a href="https://github.com/ctycode"><img class="avatar" src="https://avatars3.githubusercontent.com/u/56985178?s=96&amp;v=4" width="48" height="48" alt="@ctycode"></a>|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[joseph31](https://gitee.com/joseph31)|[Joynice](https://github.com/Joynice)|[liangweiyang](https://github.com/liangweiyang)|[Hatcat123](https://github.com/Hatcat123)|[jihu9](https://github.com/jihu9)|[ctycode](https://github.com/ctycode)|

> wait for you
@@ -108,7 +111,7 @@
- [x] txt file
- [x] csv
- [x] excel
- [ ] mysql
- [x] mysql
- [x] redis
- [x] mongodb
- Anti-crawling verification
@@ -123,7 +126,7 @@
- [x] Multiprocessing
- [x] Asynchronous coroutines
- [x] Producer-consumer multithreading
- [ ] Distributed crawler system
- [x] Distributed crawler system

> *Links point to official documentation or recommended examples*
74 changes: 74 additions & 0 deletions TaobaoCrawler(new)/README.md
@@ -0,0 +1,74 @@
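# Overview

## Progress

- The design of this program works for Alibaba's self-built platforms (e.g. Taobao, Tmall, Xianyu, Cainiao Guoguo); only the Taobao program is provided here

- Done

1. Overall framework design
2. CSV storage for search pages
3. Multithreading

- Not done
1. CSV storage for detail pages
2. MySQL storage for search pages
3. MySQL storage for detail pages
4. Crawling search pages and detail pages at the same time, with MySQL + Redis storage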

## Usage

1. Adjust the settings in `config.py` as needed
2. Run `python3 main.py`

## Approach

1. Producer-consumer pattern
2. Each feature lives in its own file
3. Multithreading
4. Storage: csv \ redis \ mysql
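
The producer-consumer idea listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: it assumes a `queue.Queue` hands item IDs from one producer to several consumer threads, and the names `producer`, `consumer`, and `run_pipeline` are hypothetical.

```python
import queue
import threading


def producer(pages, item_queue, n_consumers):
    """Stand-in for the search-page thread: enqueue every item ID it finds."""
    for page in pages:
        for item_id in page:
            item_queue.put(item_id)
    # One sentinel per consumer tells each of them to stop.
    for _ in range(n_consumers):
        item_queue.put(None)


def consumer(item_queue, results, lock):
    """Stand-in for a detail-page thread: take item IDs until the sentinel arrives."""
    while True:
        item_id = item_queue.get()
        if item_id is None:
            break
        with lock:
            results.append(item_id)  # a real worker would fetch the detail page here


def run_pipeline(pages, n_consumers=3):
    item_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(item_queue, results, lock))
        for _ in range(n_consumers)
    ]
    for worker in workers:
        worker.start()
    producer(pages, item_queue, n_consumers)
    for worker in workers:
        worker.join()
    return results
```

Using sentinels lets the consumers stop deterministically, instead of polling an empty list and giving up after a few failed attempts.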

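## Automatic cookie strategy for Alibaba's self-built (non-acquired) platforms

1. Send the first request without a cookie; the response returns a cookie
2. Extract the token from the returned cookie, compute sign(token, timestamp, appKey, data), and build the new URL
3. Send the second request to that URL with the returned cookie to get the results

- Notes:
1. The cookie and token are valid for 30 days; the sign is valid for 1 hour
2. In theory, it is enough to refresh the timestamp and recompute the sign once an hour, then keep repeating the second request
3. In practice, refreshing only once an hour risks triggering anti-crawling measures. Instead, use a tunnel proxy that rotates every 30 seconds and repeat step 1 each time to obtain a fresh cookie (this is very fast and the time cost is negligible); in theory this carries no anti-crawling risk
4. The URL of the first request is hard-coded in the program (this does not affect its operation). If its generation mechanism can later be worked out from the JS files, it could be generated automatically each time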
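
Step 2 above can be sketched in a few lines. The sketch follows what `main.py` in this commit does: the token is the part of the `_m_h5_tk` cookie before the first underscore, and the sign is the MD5 hex digest of `token&timestamp&appKey&data`. The helper names and the sample cookie value in the usage lines are made up for illustration.

```python
import hashlib
import time


def extract_token(m_h5_tk):
    """Return the token, i.e. the part of the _m_h5_tk cookie before the first '_'."""
    return m_h5_tk.split('_')[0]


def current_timestamp_ms():
    """Return the current time as a millisecond timestamp string."""
    return str(int(time.time() * 1000))


def compute_sign(token, timestamp_ms, app_key, data):
    """MD5 hex digest of 'token&timestamp&appKey&data' (data is the raw JSON string)."""
    raw = '&'.join([token, timestamp_ms, app_key, data])
    return hashlib.md5(raw.encode('utf-8')).hexdigest()


# Illustrative usage with a made-up cookie value:
token = extract_token('0123456789abcdef_1582738149318')
sign = compute_sign(token, current_timestamp_ms(), '12574478', '{"keyword":"iPhone Xs"}')
```

Note that the sign is computed over the unquoted `data` string; URL-encoding is applied only when the string is placed into the request URL.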

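## Taobao entry point

http://uland.taobao.com/sem/tbsearch?keyword=XXX

Replace the trailing XXX with whatever you want to search for

(This is used for the first request, to obtain the real request URL. It is already configured in the program, so you can ignore it.)

## Tmall entry point

http://www.tmall.com/

(This is used for the first request, to obtain the real request URL. It is already configured in the program, so you can ignore it.)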
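
As a small illustration of filling in the keyword for the Taobao entry point, the sketch below percent-encodes the search term with `urllib.parse.quote`. The function name `build_entry_url` is hypothetical and is not part of this commit.

```python
from urllib.parse import quote

ENTRY_URL = 'http://uland.taobao.com/sem/tbsearch?keyword='


def build_entry_url(keyword):
    """Append the percent-encoded (UTF-8) search keyword to the entry URL."""
    return ENTRY_URL + quote(keyword)
```

Non-ASCII keywords are encoded as UTF-8 bytes, which is the same form the keyword takes inside the hard-coded first-request URL in `main.py`.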

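## MySQL table structure

## About the author

I work in `big data` and `data analysis`; experts are welcome to get in touch~

- github : [https://github.com/SoliDeoGloria31](https://github.com/SoliDeoGloria31)

- Gitee : [https://gitee.com/joseph31](https://gitee.com/joseph31)

- WeChat : mortaltiger

<img src="https://gitee.com/joseph31/picture_bed/raw/master/mortaltiger.jpg" width="15%">

- Personal WeChat official account: JosephNest (Joseph's Nest)
New features are tested there frequently, so the server can be unstable and may fail. Implemented features: `automatic recommendation system`, `auto-reply`, `feature-request messages`, `AI integration (image recognition)`, `other custom features`

<img src="https://gitee.com/joseph31/picture_bed/raw/master/JosephNest.jpg" width="15%">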
51 changes: 51 additions & 0 deletions TaobaoCrawler(new)/config.py
@@ -0,0 +1,51 @@
# Search settings
# Search keyword
str_searchContent = 'iPhone Xs'
# Number of results per page
num_pageSize = 100
# Crawl from page 1 up to this page (in theory the Alibaba servers can be exhausted); 1-100 is recommended, beyond that the results match the keyword poorly
num_page = 2
# Alibaba service ID; 12574478 is fixed and must not be changed (Cainiao Guoguo, for example, also uses the fixed value 12574478)
appKey = '12574478'  # Do not change!

###########################################
# Crawl settings

# Number of threads
threads_num_get_pages = 1  # threads for crawling search pages, default 1
threads_num_get_comments = 3  # threads for crawling comment pages; when 0, detail pages (comments) are not crawled

###########################################
# Storage
switch_save = 0  # local csv storage
# switch_save = 1 # mysql storage
# switch_save = 2 # mysql + redis storage

# redis
redis_host = '127.0.0.1'
redis_port = 6379

# mysql
mysql_host = '127.0.0.1'
mysql_port = 3306
mysql_user = 'root'
mysql_passwd = '123456'
mysql_db = 'taobao'
mysql_charset = 'utf8'

###########################################
# Proxy settings
# Tunnel server
_tunnel_host = "tps189.kdlapi.com"
_tunnel_port = "15818"

# Tunnel username and password
_tid = "t17888082960619"
_password = "gid72p4o"

proxies = {
    "http": "http://%s:%s@%s:%s/" % (_tid, _password, _tunnel_host, _tunnel_port),
    "https": "https://%s:%s@%s:%s/" % (_tid, _password, _tunnel_host, _tunnel_port)
}

###########################################
156 changes: 156 additions & 0 deletions TaobaoCrawler(new)/main.py
@@ -0,0 +1,156 @@
# encoding: utf-8

import hashlib
import json
import threading
import time
from urllib.parse import quote

import requests

from config import *


class TaoBao:
    def __init__(self, str_searchContent, num_pageSize, num_page, appKey, threads_num_get_pages, threads_num_get_comments, switch_save, proxies):
        self.str_searchContent = str_searchContent
        self.num_pageSize = num_pageSize
        self.num_page = num_page
        self.appKey = appKey
        self.threads_num_get_pages = threads_num_get_pages
        self.threads_num_get_comments = threads_num_get_comments
        self.switch_save = switch_save
        self.proxies = proxies
        self.cookie = ''
        self.token = ''
        self.file_name = ''
        self.L_itemId = []

        self.run()

    def first_requests(self):
        # First request: sent without a cookie in order to obtain one
        base_url = 'https://h5api.m.taobao.com/h5/mtop.alimama.union.sem.landing.pc.items/1.0/?jsv=2.4.0&appKey=12574478&t=1582738149318&sign=fe2cf689bdac8258a1d12507a06bd289&api=mtop.alimama.union.sem.landing.pc.items&v=1.0&AntiCreep=true&dataType=jsonp&type=jsonp&ecode=0&callback=mtopjsonp1&data=%7B%22keyword%22%3A%22%E8%8B%B9%E6%9E%9C%E6%89%8B%E6%9C%BA%22%2C%22ppath%22%3A%22%22%2C%22loc%22%3A%22%22%2C%22minPrice%22%3A%22%22%2C%22maxPrice%22%3A%22%22%2C%22ismall%22%3A%22%22%2C%22ship%22%3A%22%22%2C%22itemAssurance%22%3A%22%22%2C%22exchange7%22%3A%22%22%2C%22custAssurance%22%3A%22%22%2C%22b%22%3A%22%22%2C%22clk1%22%3A%22%22%2C%22pvoff%22%3A%22%22%2C%22pageSize%22%3A%22100%22%2C%22page%22%3A%22%22%2C%22elemtid%22%3A%221%22%2C%22refpid%22%3A%22%22%2C%22pid%22%3A%22430673_1006%22%2C%22featureNames%22%3A%22spGoldMedal%2CdsrDescribe%2CdsrDescribeGap%2CdsrService%2CdsrServiceGap%2CdsrDeliver%2C%20dsrDeliverGap%22%2C%22ac%22%3A%22%22%2C%22wangwangid%22%3A%22%22%2C%22catId%22%3A%22%22%7D'
        try:
            with requests.get(base_url) as response:
                get_cookies = requests.utils.dict_from_cookiejar(
                    response.cookies)
            _m_h5_tk = get_cookies['_m_h5_tk']
            _m_h5_tk_enc = get_cookies['_m_h5_tk_enc']
            # The token is the part of _m_h5_tk before the first underscore
            self.token = _m_h5_tk.split('_')[0]
            self.cookie = '_m_h5_tk={}; _m_h5_tk_enc={}'.format(
                _m_h5_tk, _m_h5_tk_enc)
        except Exception as e:
            print('first_requests failed: ', e)

    def sign(self, token, tme, appKey, data):
        # sign = md5("token&timestamp&appKey&data")
        st = token + "&" + tme + "&" + appKey + "&" + data
        return hashlib.md5(st.encode(encoding='utf-8')).hexdigest()

    def second_requests(self, page):
        # Second request: sent with the cookie; parses the returned data and stores it
        searchContent = '"' + self.str_searchContent + '"'
        pageSize = '"' + str(self.num_pageSize) + '"'  # results per page
        page_str = '"' + str(page) + '"'  # page number to fetch

        str_data = '{"keyword":'+searchContent+',"ppath":"","loc":"","minPrice":"","maxPrice":"","ismall":"","ship":"","itemAssurance":"","exchange7":"","custAssurance":"","b":"","clk1":"","pvoff":"","pageSize":'+pageSize+',"page":' + \
            page_str+',"elemtid":"1","refpid":"","pid":"430673_1006","featureNames":"spGoldMedal,dsrDescribe,dsrDescribeGap,dsrService,dsrServiceGap,dsrDeliver, dsrDeliverGap","ac":"","wangwangid":"","catId":""}'
        data = quote(str_data, safe='')

        # 13-digit millisecond timestamp
        tme = str(int(time.time() * 1000))

        # The sign is computed over the unquoted data string
        sgn = self.sign(self.token, tme, self.appKey, str_data)

        url = 'https://h5api.m.taobao.com/h5/mtop.alimama.union.sem.landing.pc.items/1.0/?jsv=2.4.0&appKey={}&t={}&sign={}&api=mtop.alimama.union.sem.landing.pc.items&v=1.0&AntiCreep=true&dataType=jsonp&type=jsonp&ecode=0&callback=mtopjsonp2&data={}'.format(
            self.appKey, tme, sgn, data)

        headers = {'cookie': self.cookie}  # proxies are not used here
        try:
            with requests.get(url, headers=headers) as res:
                html = res.text

            # Cut the "mainItems" JSON array out of the JSONP response and
            # parse it with json instead of eval (never eval remote content)
            res_str = html.split(
                '"mainItems":')[-1].split('},"ret":')[0]
            res_list = json.loads(res_str)
            if self.switch_save == 0:
                self.switch_save_0(res_list)
            elif self.switch_save == 1:
                self.switch_save_1(res_list)
            elif self.switch_save == 2:
                self.switch_save_2(res_list)
            else:
                print('The storage setting in config.py is invalid!')
        except Exception as e:
            print('second_requests failed: ', e)

    def switch_save_0(self, res_list):
        from save_csv import save_csv
        csv_file_name = self.file_name + '.csv'
        # save_csv returns every itemId on the page; collect them in L_itemId
        self.L_itemId += save_csv(res_list, csv_file_name)
        print('\nItems crawled: ', len(self.L_itemId))

    def switch_save_1(self, res_list):
        from save_mysql import save_mysql
        save_mysql(res_list)

    def switch_save_2(self, res_list):
        from save_mysql_redis import save_mysql_redis
        save_mysql_redis(res_list)

    def get_search_page(self):
        print('Search-page thread started: ', threading.current_thread().name)
        for i in range(1, self.num_page + 1):
            self.first_requests()  # the cookie refresh frequency can be adjusted here
            self.second_requests(i)  # fetch page i, not always the last page
            print('Finished crawling page {}\n====================\n'.format(i))

    def get_comments_page(self):
        print('Comment-page thread started: ', threading.current_thread().name)
        time.sleep(5)
        n = 3  # after three failed attempts to take an id from self.L_itemId, assume all data has been crawled
        while True:
            if n == 0:
                break
            try:
                itemId = self.L_itemId.pop(0)
                self.get_comments(itemId)
                n = 3
            except Exception as e:
                n -= 1
                time.sleep(5)

    def get_comments(self, itemId):
        # Not implemented yet
        pass

    def run(self):
        self.file_name = 'search_page' + '_' + self.str_searchContent + '_' + str(self.num_pageSize) + '_' + str(
            self.num_page) + '_' + time.strftime("%Y-%m-%d_%H_%M_%S", time.localtime())

        threads = []
        # A single thread crawls the search pages
        if self.threads_num_get_pages != 1:
            print('Please set threads_num_get_pages = 1 in config.py')
        thread0 = threading.Thread(target=self.get_search_page, args=())
        threads.append(thread0)

        # Create the threads that crawl the detail pages
        if self.threads_num_get_comments:
            for i in range(self.threads_num_get_comments):
                thread = threading.Thread(
                    target=self.get_comments_page, args=())
                threads.append(thread)

        # Start all threads
        for t in threads:
            t.start()

        for t in threads:
            t.join()
            print('Thread finished: ', t.name)

        print('Main thread finished!', threading.current_thread().name)


if __name__ == "__main__":
    TaoBao(str_searchContent, num_pageSize, num_page, appKey,
           threads_num_get_pages, threads_num_get_comments, switch_save, proxies)
Binary file added TaobaoCrawler(new)/pictures/JosephNest.jpg
Binary file added TaobaoCrawler(new)/pictures/mortaltiger.jpg
30 changes: 30 additions & 0 deletions TaobaoCrawler(new)/save_csv.py
@@ -0,0 +1,30 @@
import csv
import os


def save_csv(res_list, csv_file_name):
    L_itemId = []

    path = './csv/'
    # Create the output folder if it does not exist
    if not os.path.exists(path):
        os.makedirs(path)
        print(path, ' folder created')
    file_name = path + csv_file_name
    header = ["dsrDeliver", "dsrDeliverGap", "dsrDescribe", "dsrDescribeGap", "dsrService", "dsrServiceGap", "imgUrl", "ismall",
              "itemId", "loc", "price", "promoPrice", "redkeys", "sellCount", "sellerPayPostfee", "spGoldMedal", "title", "wangwangId"]
    # Write the header row only when the file is being created
    write_header = not os.path.exists(file_name)
    # Open the file once and append one row per item
    with open(file_name, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        for item in res_list:
            L_itemId.append(item["itemId"])
            row = [item[key] for key in header]
            # imgUrl is protocol-relative in the response; prefix the scheme
            row[header.index("imgUrl")] = 'https:' + item["imgUrl"]
            writer.writerow(row)

    return L_itemId