A simple library for interacting with the web from Python
activesoup combines familiar Python web capabilities for convenient headless "browsing" functionality:
- Modern HTTP support with requests - connection pooling, sessions, ...
- Easy access to page content with an interface inspired by beautifulsoup - convenient HTML navigation.
- Robust HTML parsing with html5lib - parse the web like browsers do.
Full documentation can be found at https://activesoup.dev.
activesoup aims to provide just enough functionality for basic web automation and crawling tasks. Consider using activesoup when:
- You've already checked out requests-html
- You need to actively interact with a web page from Python (e.g. submitting forms, downloading files).
- You don't control the site you need to interact with (if you do, just make an API).
- You don't need JavaScript support (for that, you'll need selenium or phantomjs).
In the example below, we'll load a page with a simple form, enumerate the fields, and make a submission:
>>> import activesoup
>>> # Start a session
>>> d = activesoup.Driver()
>>> page = d.get("https://httpbin.org/forms/post")
>>> # conveniently access elements, inspired by BeautifulSoup
>>> form = page.form
>>> # get the power of raw xpath search too
>>> form.find('.//input[@name="size"]')
BoundTag<input>
>>> # any element, searching by attribute
>>> form.find('.//*', name="size")
BoundTag<input>
>>> # or just search by attribute
>>> form.find(name="size")
BoundTag<input>
>>> # inspect element attributes
>>> print([i['name'] for i in form.find_all('input')])
['custname', 'custtel', 'custemail', 'size', 'size', 'size', 'topping', 'topping', 'topping', 'topping', 'delivery']
>>> # work actively with objects on the page
>>> r = form.submit({"custname": "john", "size": "small"})
>>> # responses parsed and ready based on content type
>>> r.keys()
dict_keys(['args', 'data', 'files', 'form', 'headers', 'json', 'origin', 'url'])
>>> r['form']
{'custname': 'john', 'size': 'small', 'topping': 'mushroom'}
>>> # access the underlying requests.Session too
>>> d.session
<requests.sessions.Session object at 0x7f283dc95700>
>>> # log in with cookie support
>>> d.get('https://httpbin.org/cookies/set/foo/bar')
>>> d.session.cookies['foo']
'bar'
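
To get a feel for what XPath queries like the ones in the example select, the following standalone sketch runs similar expressions with the standard library's `xml.etree.ElementTree`. It does not use activesoup and makes no claim about how activesoup implements `find` or `find_all`. The `FORM_HTML` snippet is a hypothetical, well-formed stand-in for a small order form, not the real httpbin page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a small order form (not fetched from the web).
FORM_HTML = """
<form method="post" action="/post">
  <input name="custname" />
  <input type="radio" name="size" value="small" />
  <input type="radio" name="size" value="medium" />
  <input type="checkbox" name="topping" value="bacon" />
  <button>Submit order</button>
</form>
"""

form = ET.fromstring(FORM_HTML)

# Path-only search: the first <input> whose name attribute is "size".
first_size = form.find('.//input[@name="size"]')

# Any element with a matching attribute, regardless of tag name.
any_size = form.find('.//*[@name="size"]')

# Collect the name attribute of every <input>, in document order.
input_names = [i.get("name") for i in form.findall(".//input")]
```

Here both `find` calls return the same element (the first `size` radio button), since `find` stops at the first match in document order. Real-world HTML is rarely well-formed XML, which is why a tolerant parser such as html5lib is needed for actual pages.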