Support for adding Referer and User-agent #33

shtrom · 2023-06-28T12:52:10Z

When dealing with the ACM website (e.g., https://github.com/shtrom/ftr-site-config/blob/shtrom-s-master/cacm.acm.org.txt), the login URL only works if the HTTP Referer header is from an acm.org URL.

In this particular instance, it sets a cookie, and serves a redirect to the original page.

For example, this works:

$ curl -D - 'https://cacm.acm.org/login' -X POST -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'  -H 'referer: https://cacm.acm.org/oeuu'  --data-raw 'current_member%5Buser%5D=USER&current_member%5Bpasswd%5D=PASSWORD'
HTTP/2 302 
date: Wed, 28 Jun 2023 12:32:28 GMT
content-type: text/html; charset=utf-8
location: https://cacm.acm.org/oeuu
cf-ray: XXX-MEL
cf-cache-status: DYNAMIC
cache-control: no-cache
set-cookie: format=full; domain=acm.org; path=/
set-cookie: INDIV_CLIENT=XXX; domain=acm.org; path=/
set-cookie: _cacm_acm_session=XXX; domain=acm.org; path=/
status: 302 Found
x-powered-by: Phusion Passenger
x-runtime: 0.06599
server: cloudflare

<html><script src="/cdn-cgi/apps/head/nLYIPopMPWKseIlIthEH-UJkbT0.js"></script><body>You are being <a href="https://cacm.acm.org/oeuu">redirected</a>.</body></html>

But simply removing the `Referer` or the `User-Agent` header leads to failures:

$ curl -D - 'https://cacm.acm.org/login' -X POST   --data-raw 'current_member%5Buser%5D=USER&current_member%5Bpasswd%5D=PASSWORD'              
HTTP/2 403 
date: Wed, 28 Jun 2023 12:35:50 GMT
content-type: text/plain; charset=UTF-8
content-length: 16
x-frame-options: SAMEORIGIN
referrer-policy: same-origin
cache-control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
expires: Thu, 01 Jan 1970 00:00:01 GMT
server: cloudflare
cf-ray: 7de5f8d01f702ea6-MEL

error code: 1020%

$ curl -D - 'https://cacm.acm.org/login' -X POST -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'  --data-raw 'current_member%5Buser%5D=USER&current_member%5Bpasswd%5D=PASSWORD'
HTTP/2 500
date: Wed, 28 Jun 2023 12:34:47 GMT
content-type: text/html
cf-ray: 7de5f73e49741f64-MEL
cf-cache-status: DYNAMIC
server: cloudflare

This means that despite the login_* variables in the site-config, fetching full articles fails, as those two headers are missing.

I think this can be solved by:

- letting guzzle-site-authenticator pass headers on demand
- making graby and/or wallabag pass the User-Agent override from the site-config, if any
- making graby and/or wallabag pass the original URL to be fetched as the Referer
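
To make the proposal above concrete, here is a minimal sketch of the header logic. It is written in Python purely for illustration and is not code from wallabag, graby, or guzzle-site-authenticator (which are PHP); `build_login_request` and its parameters are hypothetical names. It only builds the login request and never sends it.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(
    login_url: str,
    article_url: str,
    credentials: dict,
    user_agent: Optional[str] = None,
) -> Request:
    """Build (but do not send) a login POST carrying the two extra headers."""
    headers = {
        # Proposal: the Referer is the original URL being fetched.
        "Referer": article_url,
    }
    if user_agent is not None:
        # Proposal: reuse the User-Agent override from the site-config, if any.
        headers["User-Agent"] = user_agent
    # Form-encode the login fields, as curl's --data-raw payload does.
    body = urlencode(credentials).encode("ascii")
    return Request(login_url, data=body, headers=headers, method="POST")


request = build_login_request(
    "https://cacm.acm.org/login",
    "https://cacm.acm.org/oeuu",
    {"current_member[user]": "USER", "current_member[passwd]": "PASSWORD"},
    user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0",
)
print(request.get_method(), request.full_url)
print(request.header_items())
```

When no User-Agent override is given, the sketch leaves that header out, so the HTTP client's default would apply.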

This should fix the ACM issue, and I think it is sufficiently generic to be equally helpful (or at least not detrimental) on other sites. If this turns out to break things, we'd need additional site-config options to specify whether additional login_* headers should be included, and their values.
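
If such extra options turn out to be needed, one possible shape is a per-header directive. The sketch below assumes a hypothetical `login_header(name): value` line; this directive does not exist in site-config files today, and `parse_login_headers` is an illustrative name. It is written in Python only to show the idea.

```python
import re

# Hypothetical directive syntax, "login_header(name): value";
# this is NOT an existing site-config option.
LOGIN_HEADER = re.compile(r"^login_header\(([^)]+)\):\s*(.+)$")


def parse_login_headers(site_config: str) -> dict:
    """Collect header names and values from hypothetical login_header() lines."""
    headers = {}
    for line in site_config.splitlines():
        match = LOGIN_HEADER.match(line.strip())
        if match:
            headers[match.group(1).strip()] = match.group(2).strip()
    return headers


example = """\
# hypothetical additions to a site-config file
login_header(Referer): https://cacm.acm.org/
login_header(User-Agent): Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0
"""
print(parse_login_headers(example))
```

Lines that do not match the hypothetical directive are ignored, so a site-config without it would behave exactly as before.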

Now, this is all conjecture, as I haven't been able to successfully hack my wallabag instance to behave as described. I got lost jumping between wallabag, graby, and guzzle-site-authenticator.

I'm willing to keep going on this, but I would welcome pointers as to:

- where I can send headers from guzzle-site-authenticator (I unsuccessfully tried in `LoginFormAuthenticator::login` https://github.com/wallabag/guzzle-site-authenticator/blob/master/lib/Authenticator/LoginFormAuthenticator.php#L36-L37 by adding a headers array, but maybe I did it wrong)
- how to see debug messages from the Authenticator about the requests it is sending (at the moment, I see graby and wallabag determining that a login is needed, and then a failure from the login page, but no debug output in between)
- how and where I could change or update the HttpClient that, I think, gets injected by wallabag or graby
- whether there is any simpler way to achieve all this


shtrom mentioned this issue Jun 28, 2023

Wrong display in wallabag (acm.org) wallabag/wallabag#6677

