First draft of AyloAPI
Maista6969 committed Jan 15, 2024
0 parents commit 19d8d8e
Showing 69 changed files with 7,630 additions and 0 deletions.
40 changes: 40 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,40 @@
name: Deploy index to Github Pages

on:
  push:
    branches: [ master ]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build:
    runs-on: ubuntu-22.04
    steps:
      - name: Checkout master
        uses: actions/checkout@v2
        with:
          path: master
          ref: master
          fetch-depth: '0'
      - run: |
          cd master
          ./build_site.sh ../_site/
      - uses: actions/upload-pages-artifact@v2

  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-22.04
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
# Scraper generated files
*.json

# Index build artifact
/_site
67 changes: 67 additions & 0 deletions README.md
@@ -0,0 +1,67 @@
# The Aylo API scraper

This is arguably the biggest scraper in the repo and covers a _lot_ of networks and studios. It needs testing!

![scraper-source](installation.png)

| Field | Value |
| ---------- | ----------------------------------------------------- |
| Name | `AyloAPI Beta` |
| Source URL | `https://maista6969.github.io/AyloAPI-beta/index.yml` |
| Local Path | `AyloAPI-beta` |

## Design goals

- Easy to modify and understand, supported by documentation, examples, and tests
- Split scrapers that handle the individual complexities of each subnetwork without overcomplicating the main scraper

## Development

The scraper is composed of one [main file](scrapers/AyloAPI/scrape.py) containing the functions necessary to scrape scenes, movies, and performers
from the Aylo API, along with a few supporting files whose functions handle things like [constructing URL slugs](scrapers/AyloAPI/slugger.py) and [caching instance tokens](scrapers/AyloAPI/domains.py).

These functions are designed to be open for extension but closed to modification. What does this mean in practice?
The networks and studios in the Aylo API differ in how they construct their URLs and even in how their
parent/child studio relationships are expressed, so these functions could easily become very complex
if they had to handle every special case. Instead, the scraping functions return their results in a standard format
that works for most studios, while optionally accepting a postprocessing function that callers can supply to handle their special requirements.

This postprocessing function can be specific to each sub-network in the Aylo API and encapsulate its quirks.
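
To make this concrete, here is a minimal sketch of what a sub-network wrapper could look like. It assumes a `scene_from_url` entry point in `scrape.py` that accepts a `postprocess` callback taking the scraped scene and the raw API result; the names, signature, and URL are illustrative, not authoritative:

```python
from AyloAPI.scrape import scene_from_url  # assumed entry point

def czechhunter_fixup(scene: dict, api_result: dict) -> dict:
    """Illustrative postprocessor: BigStr has consolidated under the
    Czech Hunter name, so rewrite the brand domain the generic scraper produced."""
    if scene.get("url"):
        scene["url"] = scene["url"].replace("www.bigstr.com", "www.czechhunter.com")
    return scene

if __name__ == "__main__":
    scene = scene_from_url(
        "https://www.bigstr.com/scene/1234567/example-title",  # hypothetical URL
        postprocess=czechhunter_fixup,
    )
    print(scene)
```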

The standard URL formats the scraper returns look like this:

- scenes: `https://www.<brand-domain>.com/scene/<scene-id>/<scene-title-slug>`
- movies: `https://www.<brand-domain>.com/movie/<movie-id>/<movie-title-slug>`
- performers: `https://www.<brand-domain>.com/model/<performer-id>/<performer-name-slug>`

`brand-domain` is based on the parent studio: `bangbros` for Bang Bros, `gaywire` for Gay Wire,
and `bigstr` for BigStr (which has since consolidated under the Czech Hunter name, so those URLs are wrong!).
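
For illustration, a title slug is typically the lowercased title with punctuation stripped and spaces turned into hyphens. A self-contained sketch (not the exact implementation in slugger.py) could look like this:

```python
import re

def slugify(title: str) -> str:
    # Lowercase, drop anything that is not a letter, digit, space, or hyphen,
    # then collapse runs of whitespace into single hyphens
    slug = re.sub(r"[^a-z0-9 -]", "", title.lower())
    return re.sub(r"\s+", "-", slug).strip("-")

print(slugify("Forbidden Fruit"))  # -> forbidden-fruit
```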

The scraper uses the `parse_args` helper from [py_common](scrapers/py_common/util.py) and was
developed to be ergonomic for testing and for integration into other Python scripts:

```shell
$ python AyloAPI/scrape.py scene-by-url --url "https://www.babes.com/scene/4474211/forbidden-fruit"
d Scene ID: 4327711
t Sending GET request to https://site-api.project1service.com/v2/releases/4327711
d This scene has 13 markers but scraping markers hasn't been implemented yet
{"title": "Forbidden Fruit", <omitted for brevity>}
```
The simplest case is exemplified by the Babes network: it uses the standard URL formats, and its
parent studio domain `www.babes.com` is correct for all substudios. Its scraper does not need
to make any changes to the results returned by the API, so it is fully defined in [Babes.yml](scrapers/Babes/Babes.yml).
The only thing it needs to do is specify which domains to search, which can be done inline, as the example below and the sketch that follows it show.
```shell
$ python AyloAPI/scrape.py babes performer-by-name --name "Zazie Skymm"
d Searching for 'Zazie Skymm' on 1 sites
d Searching babes
t Sending GET request to https://site-api.project1service.com/v1/actors?search=Zazie Skymm&limit=10
d Search finished, found 1 candidates
[{"name": "Zazie Skymm", <omitted for brevity>}]
```
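
In Python terms, such a wrapper reduces to a single call. This sketch assumes a `performer_search` function in `scrape.py` that accepts the search domains as a parameter; the actual signature may differ:

```python
from AyloAPI.scrape import performer_search  # assumed entry point

# Restrict the search to the babes domain; no postprocessing is needed
performers = performer_search("Zazie Skymm", search_domains=["babes"])
```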
## Testing

The split scrapers have all been tested manually, based on the notes in [the overview](overview.md).
89 changes: 89 additions & 0 deletions build_site.sh
@@ -0,0 +1,89 @@
#!/bin/bash

# builds a repository of scrapers
# outputs to _site with the following structure:
# index.yml
# <scraper_id>.zip
# Each zip file contains the scraper.yml file and any other files in the same directory

outdir="$1"
if [ -z "$outdir" ]; then
outdir="_site"
fi

rm -rf "$outdir"
mkdir -p "$outdir"

buildScraper()
{
f=$1
dir=$(dirname "$f")

# get the scraper id from the filename
scraper_id=$(basename "$f" .yml)
versionFile=$f
if [ "$scraper_id" == "package" ]; then
scraper_id=$(basename "$dir")
fi

if [ "$dir" != "./scrapers" ]; then
versionFile="$dir"
fi

echo "Processing $scraper_id"

# determine the version (latest commit hash) and last-updated date for this scraper
version=$(git log -n 1 --pretty=format:%h -- "$versionFile")
updated=$(TZ=UTC0 git log -n 1 --date="format-local:%F %T" --pretty=format:%ad -- "$versionFile")

# resolve the absolute path of the zip file for this scraper
zipfile=$(realpath "$outdir/$scraper_id.zip")

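# extract the scraper name and the optional "# ignore:" and "# requires:" directives from the yml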
name=$(grep "^name:" "$f" | cut -d' ' -f2- | sed -e 's/\r//' -e 's/^"\(.*\)"$/\1/')
ignore=$(grep "^# ignore:" "$f" | cut -c 10- | sed -e 's/\r//')
dep=$(grep "^# requires:" "$f" | cut -c 12- | sed -e 's/\r//')

# always ignore package file
ignore="-x $ignore package"

pushd "$dir" > /dev/null
if [ "$dir" != "./scrapers" ]; then
zip -r "$zipfile" . ${ignore} > /dev/null
else
zip "$zipfile" "$scraper_id.yml" > /dev/null
fi
popd > /dev/null

# write to spec index
echo "- id: $scraper_id
name: $name
version: $version
date: $updated
path: $scraper_id.zip
sha256: $(sha256sum "$zipfile" | cut -d' ' -f1)" >> "$outdir"/index.yml

# handle dependencies
if [ ! -z "$dep" ]; then
echo " requires:" >> "$outdir"/index.yml
for d in ${dep//,/ }; do
echo " - $d" >> "$outdir"/index.yml
done
fi

echo "" >> "$outdir"/index.yml
}

# find all yml files in the root of ./scrapers - these are packaged individually
for f in ./scrapers/*.yml; do
buildScraper "$f"
done

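# find yml files nested in subdirectories - each is zipped together with its sibling files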
find ./scrapers/ -mindepth 2 -name '*.yml' -print0 | while read -r -d $'\0' f; do
buildScraper "$f"
done

# handle dependency packages
find ./scrapers/ -mindepth 2 -name package -print0 | while read -r -d $'\0' f; do
buildScraper "$f"
done
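
For reference, a generated `index.yml` entry might look like the following; the values shown here are invented placeholders, not real output:

```shell
$ cat _site/index.yml
- id: Babes
  name: Babes
  version: 19d8d8e
  date: 2024-01-15 10:30:00
  path: Babes.zip
  sha256: <64-character hex digest of Babes.zip>
  requires:
    - py_common
```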
Binary file added installation.png
