This repository has been archived by the owner on Mar 5, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
commit 19d8d8e
Showing 69 changed files with 7,630 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
name: Deploy index to GitHub Pages | ||
|
||
on: | ||
push: | ||
branches: [ master ] | ||
|
||
# Allows you to run this workflow manually from the Actions tab | ||
workflow_dispatch: | ||
|
||
# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages | ||
permissions: | ||
contents: read | ||
pages: write | ||
id-token: write | ||
|
||
jobs: | ||
build: | ||
runs-on: ubuntu-22.04 | ||
steps: | ||
- name: Checkout master | ||
uses: actions/checkout@v2 | ||
with: | ||
path: master | ||
ref: master | ||
fetch-depth: '0' | ||
- run: | | ||
cd master | ||
./build_site.sh ../_site/ | ||
- uses: actions/upload-pages-artifact@v2 | ||
|
||
deploy: | ||
environment: | ||
name: github-pages | ||
url: ${{ steps.deployment.outputs.page_url }} | ||
runs-on: ubuntu-22.04 | ||
needs: build | ||
steps: | ||
- name: Deploy to GitHub Pages | ||
id: deployment | ||
uses: actions/deploy-pages@v2 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# Scraper generated files | ||
*.json | ||
|
||
# Index build artifact | ||
/_site |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
# The Aylo API scraper | ||
|
||
This is arguably the biggest scraper in the repo and covers a _lot_ of networks and studios. It needs testing! | ||
|
||
![scraper-source](installation.png) | ||
|
||
| Field | Value | | ||
| ---------- | ----------------------------------------------------- | | ||
| Name | `AyloAPI Beta` | | ||
| Source URL | `https://maista6969.github.io/AyloAPI-beta/index.yml` | | ||
| Local Path | `AyloAPI-beta` | | ||
|
||
## Design goals: | ||
|
||
- Easy to modify and understand: documentation, examples, tests? | ||
- Split scrapers that can handle the individual complexities of subnetworks without overcomplicating the main scraper | ||
|
||
## Development | ||
|
||
The scraper is composed of one [main file](scrapers/AyloAPI/scrape.py) that contains the functions necessary to scrape scenes, movies and performers | ||
from the Aylo API along with a few supporting files with functions that handle things like [constructing URL slugs](scrapers/AyloAPI/slugger.py) and [caching instance tokens](scrapers/AyloAPI/domains.py). | ||
|
||
These functions are designed to be open for extension but closed to modification. What does this mean in practice? | ||
The networks and studios in the Aylo API differ in how they construct their URLs and even in how their | ||
parent/child studio relationships are expressed, so these functions could easily become very complex | ||
if they had to handle every special case. Instead, the scraping functions return their results in a standard format | ||
that works for most studios and optionally accept a postprocessing function that callers can supply to handle their special requirements. | ||
|
||
This postprocessing function can be specific to every sub-network in the Aylo API and encapsulate their quirks. | ||
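
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scrape.py`: the function names `to_scene` and `fix_bigstr_domain`, the raw field names (`brand`, `id`, `slug`), and the `czechhunter` target domain are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical aliases for this sketch; the real scraper's types may differ.
Scene = Dict[str, Any]
PostProcess = Callable[[Scene, Dict[str, Any]], Scene]


def to_scene(api_object: Dict[str, Any], postprocess: Optional[PostProcess] = None) -> Scene:
    """Convert a raw API object into a scene in the standard format."""
    scene: Scene = {
        "title": api_object["title"],
        "url": "https://www.{brand}.com/scene/{id}/{slug}".format(**api_object),
    }
    # The caller's hook runs last, so it can override any standard field
    # without the core function knowing about that network's quirks.
    return postprocess(scene, api_object) if postprocess else scene


def fix_bigstr_domain(scene: Scene, api_object: Dict[str, Any]) -> Scene:
    """Example hook for one sub-network (the target domain is an assumption)."""
    scene["url"] = scene["url"].replace("www.bigstr.com", "www.czechhunter.com")
    return scene


raw = {"title": "Example", "brand": "bigstr", "id": 1, "slug": "example"}
print(to_scene(raw)["url"])
print(to_scene(raw, fix_bigstr_domain)["url"])
```

Keeping the hook optional means a network with no quirks needs no Python of its own, while a network with unusual URLs supplies only the small function that differs.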
|
||
The standard URL formats the scraper returns look like this: | ||
|
||
- scenes: `https://www.<brand-domain>.com/scene/<scene-id>/<scene-title-slug>` | ||
- movies: `https://www.<brand-domain>.com/movie/<movie-id>/<movie-title-slug>` | ||
- performers: `https://www.<brand-domain>.com/model/<performer-id>/<performer-name-slug>` | ||
|
||
`brand-domain` is based on the parent studio: `bangbros` for Bang Bros, `gaywire` for Gay Wire, | ||
`bigstr` for BigStr (which has since consolidated under the Czech Hunter name, so those URLs are wrong!). | ||
|
||
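
As a rough sketch of how the standard URL formats above fit together, the snippet below builds one from its parts. The `slugify` rule here is an assumption for illustration only; the actual rules live in `slugger.py` and may differ, and `standard_url` is a hypothetical helper name.

```python
import re


def slugify(text: str) -> str:
    """Assumed slug rule: lowercase, with runs of other characters collapsed to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def standard_url(brand_domain: str, kind: str, item_id: int, title: str) -> str:
    """Build a URL in the standard format; kind is 'scene', 'movie' or 'model'."""
    if kind not in ("scene", "movie", "model"):
        raise ValueError(f"unknown kind: {kind!r}")
    return f"https://www.{brand_domain}.com/{kind}/{item_id}/{slugify(title)}"


print(standard_url("babes", "scene", 4474211, "Forbidden Fruit"))
```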
The scraper uses the `parse_args` helper from [py_common](scrapers/py_common/util.py) | ||
and is designed to be ergonomic for testing and for integration into other Python scripts: | ||
|
||
```shell | ||
$ python AyloAPI/scrape.py scene-by-url --url "https://www.babes.com/scene/4474211/forbidden-fruit" | ||
d Scene ID: 4327711 | ||
t Sending GET request to https://site-api.project1service.com/v2/releases/4327711 | ||
d This scene has 13 markers but scraping markers hasn't been implemented yet | ||
{"title": "Forbidden Fruit", <omitted for brevity>} | ||
``` | ||
The simplest case is exemplified by the Babes network: they use the standard URL formats and their | ||
parent studio domain `www.babes.com` is correct for all substudios. Their scraper does not need | ||
to make any changes to the results returned by the API, so their scraper is fully defined in [Babes.yml](scrapers/Babes/Babes.yml). | ||
The only thing it needs to do is specify which domains it should use for search, which can be done inline. | ||
```shell | ||
$ python AyloAPI/scrape.py babes performer-by-name --name "Zazie Skymm" | ||
d Searching for 'Zazie Skymm' on 1 sites | ||
d Searching babes | ||
t Sending GET request to https://site-api.project1service.com/v1/actors?search=Zazie Skymm&limit=10 | ||
d Search finished, found 1 candidates | ||
[{"name": "Zazie Skymm", <omitted for brevity>}] | ||
``` | ||
## Testing | ||
The scrapers have all been split and tested manually based on the notes in [the overview](overview.md). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
#!/bin/bash | ||
|
||
# builds a repository of scrapers | ||
# outputs to _site with the following structure: | ||
# index.yml | ||
# <scraper_id>.zip | ||
# Each zip file contains the scraper.yml file and any other files in the same directory | ||
|
||
outdir="$1" | ||
if [ -z "$outdir" ]; then | ||
outdir="_site" | ||
fi | ||
|
||
rm -rf "$outdir" | ||
mkdir -p "$outdir" | ||
|
||
buildScraper() | ||
{ | ||
f=$1 | ||
dir=$(dirname "$f") | ||
|
||
# get the scraper id from the filename | ||
scraper_id=$(basename "$f" .yml) | ||
versionFile=$f | ||
if [ "$scraper_id" == "package" ]; then | ||
scraper_id=$(basename "$dir") | ||
fi | ||
|
||
if [ "$dir" != "./scrapers" ]; then | ||
versionFile="$dir" | ||
fi | ||
|
||
echo "Processing $scraper_id" | ||
|
||
# get the short hash and UTC date of the last commit that touched this scraper | ||
version=$(git log -n 1 --pretty=format:%h -- "$versionFile") | ||
updated=$(TZ=UTC0 git log -n 1 --date="format-local:%F %T" --pretty=format:%ad -- "$versionFile") | ||
|
||
# create the zip file | ||
# copy other files | ||
zipfile=$(realpath "$outdir/$scraper_id.zip") | ||
|
||
name=$(grep "^name:" "$f" | cut -d' ' -f2- | sed -e 's/\r//' -e 's/^"\(.*\)"$/\1/') | ||
ignore=$(grep "^# ignore:" "$f" | cut -c 10- | sed -e 's/\r//') | ||
dep=$(grep "^# requires:" "$f" | cut -c 12- | sed -e 's/\r//') | ||
|
||
# always ignore package file | ||
ignore="-x $ignore package" | ||
|
||
pushd "$dir" > /dev/null | ||
if [ "$dir" != "./scrapers" ]; then | ||
zip -r "$zipfile" . ${ignore} > /dev/null | ||
else | ||
zip "$zipfile" "$scraper_id.yml" > /dev/null | ||
fi | ||
popd > /dev/null | ||
|
||
# write to spec index | ||
echo "- id: $scraper_id | ||
name: $name | ||
version: $version | ||
date: $updated | ||
path: $scraper_id.zip | ||
sha256: $(sha256sum "$zipfile" | cut -d' ' -f1)" >> "$outdir"/index.yml | ||
|
||
# handle dependencies | ||
if [ -n "$dep" ]; then | ||
echo " requires:" >> "$outdir"/index.yml | ||
for d in ${dep//,/ }; do | ||
echo " - $d" >> "$outdir"/index.yml | ||
done | ||
fi | ||
|
||
echo "" >> "$outdir"/index.yml | ||
} | ||
|
||
# find all yml files in ./scrapers - these are packages individually | ||
for f in ./scrapers/*.yml; do | ||
buildScraper "$f" | ||
done | ||
|
||
find ./scrapers/ -mindepth 2 -name '*.yml' -print0 | while IFS= read -r -d '' f; do | ||
buildScraper "$f" | ||
done | ||
|
||
# handle dependency packages | ||
find ./scrapers/ -mindepth 2 -name package -print0 | while IFS= read -r -d '' f; do | ||
buildScraper "$f" | ||
done |
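
Each index entry written by the script above records a `sha256` for its zip file, so a consumer can check a downloaded package against that value. The following is a minimal sketch of such a check, assuming the expected hash has already been read from `index.yml`; the helper names `sha256_of` and `verify_package` are hypothetical and not part of this commit.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks, yielding the same hex digest that `sha256sum` prints."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(zip_path: Path, expected_sha256: str) -> bool:
    """Compare a package on disk with the sha256 field from its index entry."""
    return sha256_of(zip_path) == expected_sha256.strip().lower()
```

A caller would pass the path of a downloaded `<scraper_id>.zip` and the `sha256` value from the matching entry, and treat a `False` result as a corrupted or outdated download.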