# Analyzing 大乐透 Lottery Data with Elasticsearch

Use scrapy to crawl historical 大乐透 (Super Lotto) draw data, load it into Elasticsearch, and then visualize the number trends with Kibana.

The historical draw results page for 大乐透: http://www.lottery.gov.cn/historykj/history.jspx?_ltype=dlt

After selecting the range of draws to query, save the page to local disk.
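If you prefer the command line, something like the following may work for the default page (a sketch only; selecting a different draw range likely requires submitting the site's query form in a browser):

```bash
$ wget -O daletou.html "http://www.lottery.gov.cn/historykj/history.jspx?_ltype=dlt"
```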

## Spider

Generate the spider scaffolding:

```bash
$ pip3 install scrapy
$ scrapy startproject letou_spider
$ cd letou_spider/
$ scrapy genspider letou www.lottery.gov.cn
```

Install the Elasticsearch client dependency:

```bash
$ pip3 install elasticsearch
```
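Before wiring the client into the spider, it can be worth confirming it can reach the cluster. A minimal sketch (the host `fueltank-4:9200` matches the spider below; adjust for your environment):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["http://fueltank-4:9200/"])
# ping() returns True if the cluster answers, False otherwise.
print(es.ping())
```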

Write the spider code:

```python
# -*- coding: utf-8 -*-
import scrapy
from elasticsearch import Elasticsearch
from elasticsearch import helpers


def gendata(trs):
    # Convert each table row into a bulk-index action for Elasticsearch.
    for tr in trs:
        tds = tr.css("td")
        stage = tds[0].css("::text").get()
        # Front zone: five numbers (01-35).
        front_one = tds[1].css("::text").get()
        front_two = tds[2].css("::text").get()
        front_three = tds[3].css("::text").get()
        front_four = tds[4].css("::text").get()
        front_five = tds[5].css("::text").get()
        # Back zone: two numbers (01-12).
        after_one = tds[6].css("::text").get()
        after_two = tds[7].css("::text").get()
        # Prize statistics for the first and second tiers.
        first_prize_count = tds[8].css("::text").get()
        first_prize_money = tds[9].css("::text").get()
        first_prize_append_count = tds[10].css("::text").get()
        first_prize_append_money = tds[11].css("::text").get()
        second_prize_count = tds[12].css("::text").get()
        second_prize_money = tds[13].css("::text").get()
        second_prize_append_count = tds[14].css("::text").get()
        second_prize_append_money = tds[15].css("::text").get()
        # tds[16] is skipped to match the column layout of the page's table.
        sales_volume = tds[17].css("::text").get()
        prize_pool = tds[18].css("::text").get()
        date = tds[19].css("::text").get()

        letou_map = {"stage": stage, "front_one": front_one, "front_two": front_two, "front_three": front_three,
                     "front_four": front_four, "front_five": front_five, "after_one": after_one, "after_two": after_two,
                     "first_prize_count": first_prize_count, "first_prize_money": first_prize_money,
                     "first_prize_append_count": first_prize_append_count,
                     "first_prize_append_money": first_prize_append_money,
                     "second_prize_count": second_prize_count, "second_prize_money": second_prize_money,
                     "second_prize_append_count": second_prize_append_count,
                     "second_prize_append_money": second_prize_append_money,
                     "sales_volume": sales_volume, "prize_pool": prize_pool, "date": date}
        # A bulk action: "_index" names the target index, "_source" the document body.
        index_map = {
            "_index": "daletou",
            "_source": letou_map
        }
        yield index_map


class LetouSpider(scrapy.Spider):
    es = Elasticsearch(
        hosts=["http://fueltank-4:9200/"]
    )
    name = 'letou'
    allowed_domains = ['www.lottery.gov.cn']
    # Crawl the locally saved copy of the history page.
    start_urls = ['file:///Users/jiyouxu/Downloads/%E4%B8%AD%E5%9B%BD%E4%BD%93%E5%BD%A9%E7%BD%91%20-%20%E5%BC%80%E5%A5%96%E5%8E%86%E5%8F%B2%E9%A1%B5-%E5%A4%A7%E4%B9%90%E9%80%8F.htm']

    def parse(self, response):
        trs = response.css("tbody tr")
        # Stream the generated actions into ES in a single bulk call.
        helpers.bulk(self.es, gendata(trs))
```
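Run the spider from the project root:

```bash
$ scrapy crawl letou
```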

Done. A single bulk operation inserts all the data into ES, roughly 1,900 records in total.
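As a quick sanity check (a minimal sketch reusing the same host as the spider), the document count can be read back via the `_count` API:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["http://fueltank-4:9200/"])
# count() wraps the _count API and returns a response containing "count".
print(es.count(index="daletou")["count"])
```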