feat(scrapers.admin): create materialized view and admin page #4662

Open
wants to merge 3 commits into
base: main

grossir
Contributor

@grossir grossir commented Nov 6, 2024

Concept for #3950

For this to work, the materialized view must be created directly on the DB by running the "query" attribute

For my local database, it looks like this
image

But if connected to the live DB, it would have these rows:
image

@mlissner
Member

mlissner commented Nov 6, 2024

We haven't used MVs before, so I have questions:

  1. I think I asked before, but what do we get from this that a query doesn't get us? I.e., why bother materializing it?

  2. How do we run the periodic refresh?

@grossir
Contributor Author

grossir commented Nov 7, 2024

To run this kind of query, we have 3 options:

  • run the query in a console, or hardcode it in a python string and run it from a postgres interface
  • give the query a name, but don't store the results. That's a VIEW. You could then do SELECT * FROM view_name and it will re-execute the query.
  • give the query a name, and store the results. That's a MATERIALIZED VIEW
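The difference between the last two options can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: SQLite has no MATERIALIZED VIEW, so a table created with CREATE TABLE ... AS stands in for one, and the table and column names (opinion, v_latest, mv_latest) are made up for the example rather than taken from this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table, only for illustration.
    CREATE TABLE opinion (court_id TEXT, date_created TEXT);
    INSERT INTO opinion VALUES ('ca1', '2024-11-01'), ('ca2', '2024-10-15');

    -- Plain VIEW: a named query; it is re-executed on every SELECT.
    CREATE VIEW v_latest AS
        SELECT court_id, MAX(date_created) AS latest
        FROM opinion GROUP BY court_id;

    -- Stand-in for a MATERIALIZED VIEW: the results are stored once.
    CREATE TABLE mv_latest AS SELECT * FROM v_latest;
    """
)

# New data arrives after the snapshot was taken.
conn.execute("INSERT INTO opinion VALUES ('ca1', '2024-11-20')")

# The view sees the new row; the stored snapshot does not.
view_rows = dict(conn.execute("SELECT court_id, latest FROM v_latest"))
stored_rows = dict(conn.execute("SELECT court_id, latest FROM mv_latest"))


def refresh_snapshot(conn):
    """Emulate REFRESH MATERIALIZED VIEW by recomputing the stored rows."""
    with conn:  # commit both statements together
        conn.execute("DELETE FROM mv_latest")
        conn.execute("INSERT INTO mv_latest SELECT * FROM v_latest")
```

Reading from the stored copy is cheap because the join and aggregation are not repeated, at the cost of the rows being stale until the next refresh.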

In the case of this query, I don't think a new index would help. I think the slowness comes from joining big tables. Some query optimization may help reduce it more. The raw version takes 10 minutes; prefiltering the docket table by courts that have scrapers takes it down to 3 minutes.

Why bother materializing it?

  • admin timeout
    If we want to use the admin to look at the results of this query, we risk timing out, as happened with the Pghistory admin. I am not sure what the server timeout is, but from Pghistory I think it's about 60 seconds. Since the query takes around 3 minutes to complete, I would expect a timeout.
  • waiting (in case it didn't time out)
    Even if it didn't time out, waiting 3 minutes each time would make for a bad user experience.
    We could try to cache the query, too, for a couple of days at least. That would alleviate some problems. I expect a few rows (fewer than 50) to be returned each time.
  • django admin hacking
    To take advantage of the Django admin HTML templates and ease of navigation, I am using the view as the concrete table for the Django model. This works with both plain views, which are basically queries, and materialized views.

How to refresh it
I expect to check this table a couple of times a week, so we could refresh it every 2 or 3 days. We could use a plain cron job; the command is simply REFRESH MATERIALIZED VIEW <name>.
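As a rough sketch of what a scheduled job could call, the helper below builds the refresh statement and runs it on any DB-API style cursor. The function names are hypothetical, and the view name is the one used in the local testing steps later in this thread; how the cursor is obtained (for example from Django's connection) is left out on purpose.

```python
import re

# View name as it appears in the testing steps of this PR.
VIEW_NAME = "scrapers_mv_latest_opinion"


def build_refresh_sql(view_name: str) -> str:
    """Return the REFRESH statement for a view, rejecting unsafe names.

    Identifiers cannot be passed as query parameters, so the name is
    validated against a conservative pattern before interpolation.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", view_name):
        raise ValueError(f"unsafe view name: {view_name!r}")
    return f"REFRESH MATERIALIZED VIEW {view_name}"


def refresh_view(cursor, view_name: str = VIEW_NAME) -> None:
    """Run the refresh on a cursor that exposes execute(sql)."""
    cursor.execute(build_refresh_sql(view_name))
```

A cron entry or a management command could then call refresh_view on whatever schedule is chosen; the every 2 or 3 days figure above is only a suggestion from the discussion.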

Or we could implement the refresh logic as an async call when accessing the admin page, and send a Django admin message (those banners when editing an object) advising the user to refresh the page in a few minutes.
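The async variant could look roughly like the following. It is a framework-free sketch, not the PR's implementation: trigger_refresh is a hypothetical helper, refresh_fn stands for whatever actually runs the REFRESH statement, and the returned string is what an admin view would hand to the messages framework. A thread lock only guards a single process, so a deployment with several workers would more likely need a task queue or a database-level guard.

```python
import threading

# Guards against starting a second refresh while one is still running.
_refresh_lock = threading.Lock()


def trigger_refresh(refresh_fn):
    """Start refresh_fn in a background thread unless one is already running.

    Returns (thread_or_None, message); the message is meant to be shown
    to the user as a banner.
    """
    if not _refresh_lock.acquire(blocking=False):
        return None, "A refresh is already running; reload this page in a few minutes."

    def _run():
        try:
            refresh_fn()
        finally:
            # A plain Lock may be released from a different thread.
            _refresh_lock.release()

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread, "Refresh started; reload this page in a few minutes."
```

The request returns immediately with the message while the slow query runs in the background, which sidesteps the admin timeout discussed above.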

@mlissner
Member

mlissner commented Nov 7, 2024

Hm, in the past we had a script that would send an email with this kind of information, and it ran on a cron job, so I guess this is basically the same thing, but instead of storing the results of the script in an email, this stores the results in the materialized view. Seems fine, I suppose. :)

- Includes a migration file for the materialized view
- MV will have to be refreshed manually or via a cronjob
- MV considers only courts that have an active scraper, and that have no updates in a week
@grossir grossir marked this pull request as ready for review November 25, 2024 17:22
@grossir
Contributor Author

grossir commented Nov 25, 2024

To test this on a local environment

  • clone some clusters using clone_from_cl
docker exec -it cl-django python /opt/courtlistener/manage.py clone_from_cl --type search.OpinionCluster --id 10034763 
  • override the opinion's date_created and refresh the view
update search_opinion set date_created = '2024-11-01';
refresh materialized view scrapers_mv_latest_opinion ;
  • check out the admin page
http://0.0.0.0:8000/admin/scrapers/mvlatestopinion/

Member

@mlissner mlissner left a comment


One comment so far. We also need an admin task to update the view, right? Want to include that?

Member


We wouldn't need this for the replicas.
