Parsel Selector API


An API for selecting part of a document on the web based on a path to the content.


Quick Examples

Select these links for cool information about the world, powered by this API:

How it works: Users pass the API a URL to a document on the web, and a path to particular content on that page. The page is scraped, and the requested data is returned!


Inspiration

In Scrapy, a Python framework for web scraping, the Parsel library is used to parse content scraped from the internet and extract the data the scraper wants. Getting page content looks something like this:

>>> fetch("https://old.reddit.com/")
>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'

The above example shows how a developer might work out the correct XPath for the content they want. This typically involves a lot of trial and error.

In another project I created a GUI to do this task within a browser which works quite nicely. The goal of this project is to create an API that does the same thing and more.

This API serves two purposes:

  1. A standalone API where a user can get specific page content with a path, a useful tool for all sorts of projects.
  2. A backend for a static website built as a tool to assist Scrapy users.

Features

  • Parse HTML with XPath or CSS selectors as you would in Scrapy/Parsel.
  • Parse JSON and XML with a path similar to an XPath.
  • Parse any text content on the internet with a Regex pattern.
  • Test out how the site you're working on reacts to different User-Agents.
  • Built with FastAPI, which provides Swagger and ReDoc documentation.
  • Responses are cached per unique url/user_agent combination when the request's status_code is 200, which keeps the API from calling the same endpoint too frequently.
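The caching rule in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `ResponseCache`, `fake_fetch`, and the `ttl_seconds` expiry are all hypothetical names and assumptions, and the fetch function is a stand-in so the sketch runs without network access.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical types and names: the real project may structure its cache differently.
FetchResult = Tuple[int, str]  # (status_code, body)


class ResponseCache:
    """Cache responses per (url, user_agent) pair, storing only 200 responses."""

    def __init__(self, fetch: Callable[[str, str], FetchResult], ttl_seconds: float = 60.0):
        self._fetch = fetch
        self._ttl = ttl_seconds  # assumed expiry; the README does not state one
        self._store: Dict[Tuple[str, str], Tuple[float, FetchResult]] = {}

    def get(self, url: str, user_agent: str) -> FetchResult:
        key = (url, user_agent)
        cached = self._store.get(key)
        if cached is not None and time.monotonic() - cached[0] < self._ttl:
            return cached[1]  # fresh hit: skip the outbound call
        result = self._fetch(url, user_agent)
        if result[0] == 200:  # only successful responses are cached
            self._store[key] = (time.monotonic(), result)
        return result


calls = []


def fake_fetch(url: str, user_agent: str) -> FetchResult:
    """Stand-in for a real HTTP request so the sketch runs offline."""
    calls.append((url, user_agent))
    return (200, "<html><title>example</title></html>")


cache = ResponseCache(fake_fetch)
cache.get("https://example.com", "agent-a")
cache.get("https://example.com", "agent-a")  # served from the cache
cache.get("https://example.com", "agent-b")  # different User-Agent: fetched again
print(len(calls))  # 2
```

Keying on both values means a page can be fetched once per User-Agent, so comparing how a site responds to different agents still works while repeated identical requests are absorbed.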

Installation

You can clone this repo for your own hosted version, or you can use the hosted version at https://parsel-selector-api.herokuapp.com/docs

# Clone repo
git clone https://github.com/avi-perl/Parsel-Selector-API.git
cd Parsel-Selector-API

# Install requirements 
pip install -r requirements.txt

# Run the app
uvicorn app.main:app --reload

Usage

Additional examples can be found in the examples folder.

import requests

# Example using the default BASIC return style
params = {
    "url": "https://parsel-selector-api.herokuapp.com/examples/html",
    "path": "/html/body/div/span[3]/text()",
    "path_type": "XPATH"
}

r = requests.get("https://parsel-selector-api.herokuapp.com/parsel", params=params)
print(r.json())
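For clients that do not use requests, the same query can be assembled with the standard library. The sketch below only builds the request URL and sends nothing; `build_request_url` is a hypothetical helper, and the set of accepted `path_type` values is assumed from the types listed under Parsing Content.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://parsel-selector-api.herokuapp.com/parsel"

# Assumed from the types listed under "Parsing Content".
PATH_TYPES = {"XPATH", "CSS", "REGEX", "JSON", "XML"}


def build_request_url(url: str, path: str, path_type: str = "XPATH") -> str:
    """Return the full GET URL for a selector query (hypothetical helper)."""
    if path_type not in PATH_TYPES:
        raise ValueError(f"unsupported path_type: {path_type!r}")
    # urlencode percent-escapes the target URL and the path for us.
    query = urlencode({"url": url, "path": path, "path_type": path_type})
    return f"{API_ENDPOINT}?{query}"


request_url = build_request_url(
    "https://parsel-selector-api.herokuapp.com/examples/html",
    "/html/body/div/span[3]/text()",
)
print(request_url)

# Decoding the query string recovers the original parameters unchanged.
print(parse_qs(urlsplit(request_url).query)["path"][0])
```

Letting `urlencode` do the escaping matters here because both the target URL and XPath expressions contain characters such as `/`, `[` and `(` that are not safe to paste into a query string by hand.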

Parsing Content

Select the links below for documentation on how to structure your path for each type, based on the library used to power it.

| Type | Library Used | Notes |
| --- | --- | --- |
| XPATH | Parsel | Currently only supporting the .get() method. |
| CSS | Parsel | Currently only supporting the .get() method. |
| REGEX | Parsel | |
| JSON | dpath | |
| XML | xmltodict, dpath | XML is converted to a dictionary by xmltodict, then parsed with dpath the same way JSON is. |
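To give a feel for what a path into JSON (or XML converted to a dictionary) looks like, here is a rough standard-library stand-in. It is not dpath itself and does not reproduce dpath's features such as globbing; `get_by_path` is a hypothetical helper that only walks slash-separated keys and list indices, and the sample document is made up for illustration.

```python
import json
from typing import Any


def get_by_path(data: Any, path: str, separator: str = "/") -> Any:
    """Walk nested dicts and lists along a slash-separated path (hypothetical helper)."""
    current = data
    for segment in path.strip(separator).split(separator):
        if isinstance(current, list):
            current = current[int(segment)]  # list items are addressed by index
        else:
            current = current[segment]  # dict items are addressed by key
    return current


# Made-up sample data, not taken from the API's example endpoints.
document = json.loads('{"store": {"books": [{"title": "Dune"}, {"title": "Emma"}]}}')
print(get_by_path(document, "store/books/1/title"))  # Emma
```

Once XML has been turned into nested dictionaries and lists, the same kind of lookup applies to it, which is why the XML type can reuse the JSON parsing step.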

Contributing

This project has been mostly about learning, so your pull requests and comments would be super appreciated!

TODO:

  • Add request cache so that the same URL is not called frequently.
  • Add more tests on basic functionality.
  • Create a front-end as a GUI for this tool.
  • Add path parsing errors to the response for types other than XML.

License

MIT