Scraper is a web service that retrieves the HTTP response status code for a given URL.
- Custom listen ports for scraper service and metrics service
- Limit of concurrent scrapes
- Custom timeout
- Settings via command line arguments or environment variables (SCRAPER_ prefix)
A client makes an HTTP POST request to the Scraper Service's listening port (default: 8080). The target URL is sent in the POST body:
{"url": "http://phaidra.ai"}
The Scraper tries to fulfill the request by issuing an HTTP GET request to {target}.
If {target} is reachable, Scraper receives the response Status Code from {target}.
The following Prometheus metrics are updated:
- http_requests_total{code}
- How many HTTP requests were received, partitioned by response code.
- http_get{url,code}
- How many HTTP GET scrapes were issued, partitioned by url and response code.
- wait_available_worker
- Histogram. Time spent waiting for an available worker, in milliseconds.
Client receives the response Status Code:
- 200 OK: Target exists and replied with a Status Code
- 408 RequestTimeout: No worker available in the timeout period
- 501 NotImplemented: Scraper received an HTTP method other than POST
- 400 BadRequest: Request POST body is malformed
- 500 InternalServerError: Unexpected internal error
Metrics are served by a Prometheus (prometheus-client) endpoint on port 9095 (by default) under the path /metrics.
Requirements:
- Go 1.16
Compile:
$ go build .
Run Unit Tests:
$ go test .
Start Scraper Service:
$ ./scraper_service
Scraper Service settings:
$ ./scraper_service --help
Usage: scraper_service [FLAG]...
Flags:
--listen Service listen address. (type: string; env: SCRAPER_Listen; default: :8080)
--workers Number of serving workers. (type: uint8; env: SCRAPER_Workers; default: 2)
--timeout Maximum time (in milliseconds) to wait for a worker. (type: uint64; env: SCRAPER_Timeout; default: 1000)
--metrics Metrics listen address. (type: string; env: SCRAPER_MetricsListen; default: :9095)
-h, --help show help (type: bool)
Build the docker image scraper_service/scraper:0.1.0 using a multi-stage build:
docker build -t scraper_service/scraper:0.1.0 .
Launch a container:
docker run -p 8080:8080 -p 9095:9095 scraper_service/scraper:0.1.0
Send scrape requests with the helper script:
./tools/request.sh <address-to-scrape> <scraper-service-address> <count>
Examples:
./tools/request.sh https://google.com localhost:8080 1
./tools/request.sh https://phaidra.ai localhost:8080 1
./tools/request.sh https://phaidra.ai/trackrecord localhost:8080 1
Request rate (per second), averaged over the last 10s:
sum(rate(http_requests_total[10s]))
Number of requests received over the last 10s, partitioned by Status Code:
sum by(code) (delta(http_requests_total[10s]))
Number of requests received over the last 1m where the Status Code was not 200 OK:
delta(http_requests_total{code!="200"}[1m])
Number of requests that resulted in a timeout, over the last 1m:
delta(http_requests_total{code="408"}[1m])
Number of requests that waited for a worker, per bucket of wait time, over the last 1m:
sum by (le) (rate(wait_available_worker_bucket[1m]))
Average wait time for a worker, over the last 1m:
sum(rate(wait_available_worker_sum[1m])) / sum(rate(wait_available_worker_count[1m]))
The image must be available in your cluster.
Kind example:
kind load docker-image scraper_service/scraper:0.1.0
Deploying the Scraper Service:
kubectl apply -f deployment/scraper.yaml
Deploying Prometheus with service discovery:
kubectl apply -f deployment/prometheus-rbac.yaml
kubectl apply -f deployment/prometheus.yaml