Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Redirect Chains #1082

Open
benibaeumle opened this issue Jan 17, 2024 · 5 comments
Open

Tracking Redirect Chains #1082

benibaeumle opened this issue Jan 17, 2024 · 5 comments

Comments

@benibaeumle
Copy link

Helloo,
I was involved in a project couple of years ago where we analyzed redirect chains. Back then, we could use old_channel_id and new_channel_id in order to track the redirect chains.
I have some free time currently and wanted to catch up on some work I did back then. Now, I realized that the database changed and you are supposed to track down the chains via old_request_id and new_request_id. But, the feature seems not to be implemented currently (see here).
There is also a reference in the comments (see here) to a quite old stack overflow post how you are supposed to implement the feature. Now, I was wondering if there is a chance that this will be implemented in the near future?

Currently, I also have some time and could maybe invest some days to try and fix it. But, I am not sure if I am able to or if it is possible to implement the feature again currently.

Can you provide me with any information on this?

Any help is appreciated. :)

And great work btw. :)

Best,
Beni

@benibaeumle
Copy link
Author

Maybe I was to quick with my request above. Am I right that I can track down the redirect chains as follows:
Each redirect chain can be identified by visit_id and old_request_id. So, for a redirect chain I get all http_requests by the visit_id and request_id. Then, I can order the http_redirects and http_requests by event_ordinal.
To verify the chain is correct, I can test that:

  • event_ordinal increase monotonically and between two http_requests there should be a http_redirectwith event_ordinal between the two http_requests
  • old_request_url of this http_redirect matches the url of the first http_request and new_request_url matches the url of the second http_request

@vringar
Copy link
Contributor

vringar commented Feb 7, 2024

Thanks for coming back and documenting this.
Did you check the documentation beforehand if so where would you expect this information?

@benibaeumle
Copy link
Author

benibaeumle commented Feb 7, 2024

Thanks a lot for the reply.

Unfortunately, I couldn't find the infomation in the docs. Could you please point out to me where to find how to extract redirect chains from the docs? Apologies, if it is written somewhere I couldn't find it.

But, I figured out that throughout triggered redirects, every new request triggered by a redirect seems to keep the same request_id. So, currently, I work with the following query to extract what I need:

-- Now, just group by old_request_id and aggregate result.
-- The resulting string of each row is supposed to contain the redirect chain originating from old_request_id. 
SELECT browser_id, visit_id, old_request_id, string_agg(concat(id, '|||', event_ordinal, '|||', url), '-->' order by event_ordinal)
FROM
  (
   -- Join http_redirects with http_requests and sort by http_redirect.event_ordinal.
   -- Sort by http_redirect.event so that redirects originating from the same request
   -- are in the correct order
   SELECT hr.browser_id, hr.visit_id, old_request_id, id, event_ordinal, url
   FROM
     -- Select unique request_ids from http_redirects table.
     (
      SELECT DISTINCT browser_id, visit_id, old_request_id
      FROM http_redirects
      ) AS distinct_redirects
   JOIN http_requests hr ON distinct_redirects.old_request_id = hr.request_id and distinct_redirects.browser_id = hr.browser_id and distinct_redirects.visit_id = hr.visit_id
   ORDER BY old_request_id, hr.event_ordinal ASC
   ) AS ordered_redirects
GROUP BY ordered_redirects.browser_id, ordered_redirects.visit_id, ordered_redirects.old_request_id

Inside columnn redirect_chain of the result, the query is supposed to return tuples of (http_request.id, http_requests.event_ordinal, http_requests.url) separated by an arrows '-->' containing all redirects originating from the first tuple.

I plan to write an extra query which recursively constructs the chains from old_request_url and new_request_url of the http_redirects table to verify the above query.

But, in case you can confirm that my query above extracts redirect chains it would give me some extra security.

Many thanks,
Beni

@vringar
Copy link
Contributor

vringar commented Feb 7, 2024

Unfortunately, I couldn't find the infomation in the docs. Could you please point out to me where to find how to extract redirect chains from the docs? Apologies, if it is written somewhere I couldn't find it.

Sorry for being too terse. It's not documented anywhere, I want to add your comment to an appropriate place and was wondering where you as a user woud look first.

Iirc the methodology you proposed in your second comment is the one we used but unfortunately some of our analysis remained proprietary and I can't find the code to confirm it in OpenWPM-utils. We used Spark and would transform the redirect chain into a list similar to what you proposed with the recursive query.

I can't verify the SQL query right now but if this implements the logic previously described I'd give it tentative approval.

@benibaeumle
Copy link
Author

No worries, thanks a lot.

From the current documentation, I would expect the information to be somewhere in the "Schema Documentation" part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants