Web scraping in Python

Using Scrapy

  1. Install Scrapy by entering pip install scrapy in your Terminal.

  2. Navigate to the directory where you would like to create your Scrapy project.

  3. Enter scrapy startproject myproject. This will create a project with the following directory structure:

    myproject/
        scrapy.cfg
        myproject/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
    
  4. Navigate to the directory that contains the project's spiders by entering cd myproject/myproject/spiders.

  5. Create a new spider by entering touch myspider.py, then open it in your default Python code editor by entering open myspider.py.

  6. Enter the following code (the XPath expressions can be tried out beforehand with scrapy shell; see the sketches after this list):

    import scrapy
    from myproject.items import MyItem
    
    class MySpider(scrapy.Spider):
        name = 'myspider'
        allowed_domains = ['bwog.com']
        start_urls = ['http://bwog.com']
    
        def parse(self, response):
            # Follow the link to each blog entry on the index page.
            for section in response.xpath('//div[@class="blog-section"]'):
                link = section.xpath('.//a/@href').extract_first()
                if link:
                    yield scrapy.Request(response.urljoin(link), callback=self.parse_entry)
            # Follow the link to the next page, if there is one, and parse it the same way.
            next_page = response.xpath('//div[@class="comnt-btn"]//@href').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
    
        def parse_entry(self, response):
            # Each comment on the entry page becomes one item.
            for comment in response.xpath('//div[contains(@class, " comment-body")]'):
                item = MyItem()
                
                item['author'] = comment.xpath('./div[@class="comment-author vcard"]/cite/text()').extract_first()
                
                # Vote counts and the timestamp live in the comment's metadata block.
                metadata = comment.xpath('./div[@class="comment-meta datetime"]')
                item['up'] = int(metadata.xpath('./span[@data-voting-direction="up"]/span/text()').extract_first())
                item['down'] = int(metadata.xpath('./span[@data-voting-direction="down"]/span/text()').extract_first())
                item['datetime'] = metadata.xpath('./a/text()').extract_first().strip()
                
                # Join the comment's paragraphs into a single text field.
                paragraphs = comment.xpath('./div[contains(@class, "reg-comment-body")]/p/text()').extract()
                item['content'] = '\n'.join(paragraphs)
                
                yield item
  7. Edit the project items file in your default Python code editor by entering open ../items.py.

  8. Enter the following code:

    import scrapy
    
    # One MyItem holds the fields scraped for a single comment.
    class MyItem(scrapy.Item):
        author = scrapy.Field()
        up = scrapy.Field()
        down = scrapy.Field()
        datetime = scrapy.Field()
        content = scrapy.Field()
  9. Return to the top-level directory of your project (the one containing scrapy.cfg) by entering cd ../..

  10. Run the spider you created and store its output in a comments.json file by entering scrapy crawl myspider -o comments.json. View the stored comments by entering open comments.json, or load them back into Python as sketched after this list.
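
Going further

The XPath expressions in myspider.py can be tried against a live page with Scrapy's interactive shell before running a full crawl. The session below is only a sketch: it uses the tutorial's start URL, and what each selector returns depends on the site's current markup.

    # Start an interactive session (run this in your Terminal):
    #   scrapy shell 'http://bwog.com'
    # Inside the shell, `response` holds the downloaded page, so the
    # spider's selectors can be tried directly:
    response.xpath('//div[@class="blog-section"]//a/@href').extract_first()
    response.xpath('//div[contains(@class, " comment-body")]')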
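
The pipelines.py file created in step 3 is not needed for this tutorial, but it is the usual place for per-item post-processing. The class below is a hypothetical sketch, not part of the project above: it just collapses runs of whitespace in each comment's content field.

    # pipelines.py -- hypothetical example, not required for the crawl to work.
    class CleanCommentPipeline(object):
        def process_item(self, item, spider):
            # Called once for every item the spider yields.
            item['content'] = ' '.join(item['content'].split())
            return item

A pipeline only runs if it is registered in settings.py, for example ITEM_PIPELINES = {'myproject.pipelines.CleanCommentPipeline': 300}, where the number sets its order relative to any other pipelines.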
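
The comments.json file written in step 10 is a JSON array of the scraped items, so it can be loaded back into Python for a quick look at the results. A minimal sketch, assuming the crawl collected at least one comment and that every item has the fields defined in items.py:

    import json
    
    with open('comments.json') as f:
        comments = json.load(f)
    
    print(len(comments), 'comments scraped')
    
    # Example: the most upvoted comment the crawl collected.
    top = max(comments, key=lambda c: c['up'])
    print(top['author'], top['up'], top['down'])
    print(top['content'])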
