PyScraping is a general-purpose web-scraping utility for Python, built with simplicity in mind.
Start by installing the package:
pip install pyscraping
All scraping functionality can be accessed either as a function call or a property call. For example, the title can be accessed as follows:
from pyscraping.PyScraping import PyScraping
pyscraping = PyScraping("https://google.com")
print(pyscraping.title())
Scraping the title from a website is simple.
pyscraping.title()
To access the defined charset, you can use the following method:
pyscraping.charset()
In some cases, such as the viewport and the meta keywords, the string represents a list of values and is returned as such:
pyscraping.viewport()
If you need to access the original viewport string, you can use viewportString:
pyscraping.viewportString()
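To make the difference between the two forms concrete, here is a minimal standard-library sketch of how a viewport string could be split into a list. The helper name `split_viewport` is hypothetical and not part of PyScraping, and splitting on commas is an assumption about how the library behaves internally.

```python
def split_viewport(viewport_string):
    """Split a viewport string on commas and trim whitespace around each entry."""
    # Hypothetical helper for illustration only; not PyScraping's own code.
    return [part.strip() for part in viewport_string.split(",") if part.strip()]


raw = "width=device-width, initial-scale=1"
print(split_viewport(raw))
```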
The canonical URL, if given, can be accessed as shown in the example below:
pyscraping.canonical()
To access the content type you can use the following functionality:
pyscraping.contentType()
The CSRF token method assumes that the token is stored in a meta tag with the name "csrf-token". This is the default for Laravel. You can access it using the following code:
pyscraping.csrfToken()
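For readers who want to see what such a lookup involves, the sketch below finds a meta tag named "csrf-token" using only Python's standard library. The class name `CsrfMetaParser` and the sample HTML are made up for illustration; this is not necessarily how PyScraping implements csrfToken.

```python
from html.parser import HTMLParser


class CsrfMetaParser(HTMLParser):
    """Record the content of the first <meta name="csrf-token"> tag."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Only look at meta tags, and stop once a token has been found.
        if tag != "meta" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrf-token":
            self.token = attributes.get("content")


# Sample markup (an assumption for this sketch).
html = '<html><head><meta name="csrf-token" content="abc123"></head></html>'
parser = CsrfMetaParser()
parser.feed(html)
print(parser.token)
```

If the page has no such meta tag, `parser.token` stays `None`.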
The following example shows the extraction of three attributes:
- the Meta Author,
- the Meta Description and
- the Meta Image URL
pyscraping.author()
pyscraping.description()
pyscraping.image()
The keywords meta-tag is naturally an array and will be split for your convenience:
pyscraping.keywords()
Alternatively, you can access the original keyword string:
pyscraping.keywordString()
Open Graph data can be fetched for the following properties:
- og:site_name
- og:type
- og:title
- og:description
- og:url
- og:image
# Example
pyscraping.openGraph("og:title")
# All
pyscraping.openGraph()
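Open Graph values live in meta tags of the form `<meta property="og:title" content="...">`. The following standard-library sketch mirrors the two call styles above (one key, or all keys). The names `OpenGraphParser` and `open_graph` and the sample HTML are hypothetical, and the return types of PyScraping's own openGraph method may differ.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect every <meta property="og:..."> tag into a dictionary."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if name.startswith("og:") and "content" in attributes:
            self.properties[name] = attributes["content"]


def open_graph(html, key=None):
    """Return one Open Graph value, or all of them when no key is given."""
    parser = OpenGraphParser()
    parser.feed(html)
    if key is None:
        return parser.properties
    return parser.properties.get(key)


# Sample markup (an assumption for this sketch).
html = (
    '<head><meta property="og:title" content="Example Page">'
    '<meta property="og:type" content="website"></head>'
)
print(open_graph(html, "og:title"))
print(open_graph(html))
```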
Parsing the Twitter Card works similarly:
- twitter:card
- twitter:title
- twitter:description
- twitter:url
- twitter:image
# Example
pyscraping.twitterCard("twitter:title")
# All
pyscraping.twitterCard()
In some cases, all headings of a particular level need to be retrieved. The example below shows how to do so:
pyscraping.h1()
pyscraping.h2()
pyscraping.h3()
pyscraping.h4()
pyscraping.h5()
pyscraping.h6()
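As a rough picture of what collecting headings by level involves, the sketch below gathers the text of every heading at one level with the standard library. The class name `HeadingParser` and the sample HTML are hypothetical; PyScraping may return its results in a different shape.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect the text content of all headings at a single level (1 to 6)."""

    def __init__(self, level):
        super().__init__()
        self.tag = f"h{level}"
        self.headings = []
        self._buffer = None  # text pieces of the heading currently open

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <em> is collected as well.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.headings.append("".join(self._buffer).strip())
            self._buffer = None


# Sample markup (an assumption for this sketch).
html = "<h1>Title</h1><h2>Sub <em>one</em></h2><h2>Sub two</h2>"
parser = HeadingParser(2)
parser.feed(html)
print(parser.headings)
```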
The following example will return a list of all paragraphs (<p> tags) on the website:
pyscraping.p()
The following example will return a list of all unordered lists (<ul> tags) on the website:
pyscraping.ul()
The following example will return a list of all ordered lists (<ol> tags) on the website:
pyscraping.ol()
The following example parses a web-page for images and returns absolute image URLs as an array.
pyscraping.images()
If you need more details, the following call gives you access to the attributes of each image tag:
pyscraping.imagesDetails()
The following example parses a web-page for any links and returns an array of absolute URLs:
pyscraping.links()
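Returning absolute URLs means that relative href values have to be resolved against the page address. The sketch below shows one way to do that with `urllib.parse.urljoin` from the standard library. The class name `LinkParser`, the base URL, and the sample HTML are assumptions for illustration, not PyScraping internals.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin leaves absolute URLs untouched and resolves relative ones.
            self.links.append(urljoin(self.base_url, href))


# Sample markup and base URL (assumptions for this sketch).
html = (
    '<a href="/about">About</a>'
    '<a href="post-1">Post</a>'
    '<a href="https://other.example/x">External</a>'
)
parser = LinkParser("https://example.com/blog/")
parser.feed(html)
print(parser.links)
```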
If you need more details, you can access them in the same way as for images. Below is an example that accesses the detailed link data for the page:
pyscraping.linksDetails()
The following examples of custom selectors should be seen as a starting point for any custom information you need to scrape.
pyscraping.filter(element, attribute)
Example
pyscraping.filter('div', 'class="container"')
Contact me via email: [email protected]. I look forward to your input or suggestions.