
Query for existing issues might fail silently and a new issue be created for every issue detected by the task #2270

Open · Tracked by #2273
Archaeopteryx opened this issue Jul 23, 2024 · 4 comments
Labels: teklia-2024 (Issue for Teklia work in 2024)

@Archaeopteryx (Contributor)

Log demonstrating the issue:

2024-07-23 06:00:42.000212 [INFO    ] [info     ] Checking for existing issues in the backend base_revision_changeset=ac4a1f84adfa69b77ccec3589f2a28ec7089fe10
2024-07-23 06:17:48.000023 [INFO    ] [info     ] Found 780 new issues (over 780 total detected issues) task=ZIJe3nlqQ4CvIivkOYtMNg

It took 17 min 06 s to query for the known issues, yet the task did not detect a single known issue.

The elapsed time is close to 16 min 40 s, i.e. 1000 s, which looks like a timeout being hit.

If a performance issue causes the retrieval of the known issues to fail, the subsequent creation of a ticket for every detected issue will further degrade the performance of the code review server.

Should the bulk of the known issues be served from a downloaded artifact, with only the newest known issues retrieved through an incremental query? A rough sketch of that split follows below.

@La0 @marco-c
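A rough sketch of that split, assuming a hypothetical artifact of known issue hashes and a hypothetical since query parameter (the artifact URL, its layout, and the parameter are all made up for illustration; none of them exist in the current backend):

import requests

# Hypothetical: the artifact URL, its layout (a JSON object with a "date"
# and a list of "hashes"), and the `since` parameter are assumptions
# illustrating the proposed split, not existing backend APIs.
ARTIFACT_URL = "https://example.org/artifacts/known-issue-hashes.json"
BACKEND_URL = "http://localhost:8000"


def load_known_hashes():
    # Bulk of the known issues: downloaded once per task run as a static artifact.
    snapshot = requests.get(ARTIFACT_URL, timeout=60).json()
    return set(snapshot["hashes"]), snapshot["date"]


def fetch_incremental_hashes(repo, since):
    # Only the issues recorded after the artifact snapshot are queried live.
    resp = requests.get(
        f"{BACKEND_URL}/v1/{repo}/issues/",
        params={"since": since},
        timeout=60,
    )
    resp.raise_for_status()
    return {issue["hash"] for issue in resp.json()}


known_hashes, snapshot_date = load_known_hashes()
known_hashes |= fetch_incremental_hashes("mozilla-central", snapshot_date)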

@Archaeopteryx (Contributor, Author)

Creating a ticket for a new issue takes between 0.5 and 5 seconds per ticket.

@marco-c marco-c added the teklia-2024 Issue for Teklia work in 2024 label Aug 2, 2024
@La0 (Collaborator) commented Sep 5, 2024

The bot code only iterates over all issue paths and queries the list_repo_issues endpoint.

We could look into performance, and even into whether the whole output is needed (the bot only consumes the hashes).

Or even build a new endpoint that directly checks whether a hash for a specific path + repo is known: it may be much faster to query in the DB. A minimal sketch follows.
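A minimal sketch of such an endpoint as a plain Django view; the import path, the URL routing, and the relation names (issue_links__revision__base_repository__slug) are assumptions based on this thread, not the backend's actual schema:

from django.http import JsonResponse

from code_review_backend.issues.models import Issue  # assumed import path


def issue_known(request, repo_slug):
    # Return whether an issue hash is already known for a given path + repo.
    path = request.GET.get("path")
    issue_hash = request.GET.get("hash")
    # .exists() runs a cheap SELECT ... LIMIT 1 instead of serializing rows.
    known = Issue.objects.filter(
        path=path,
        hash=issue_hash,
        issue_links__revision__base_repository__slug=repo_slug,
    ).exists()
    return JsonResponse({"known": known})

With indexes on path and hash, each check should then be a single indexed lookup instead of paging through the full issue list.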

@La0 La0 self-assigned this Sep 5, 2024
@La0 (Collaborator) commented Sep 9, 2024

I just started a manual backup on Heroku so we can test locally for performance issues.

@La0 (Collaborator) commented Sep 10, 2024

I was able to restore the backup and test API queries. The list issues endpoint is indeed very slow (taking several seconds per hit...).

I noticed a few immediate issues:

  • no index on Revision.head_changeset and Issue.path, which are used to filter the endpoint
  • we only need to serialize the issue id and hash (so we only need to load those fields in the queryset)
  • the main slow query joins twice on IssueLink simply because of multiple chained .filter ORM calls: by aggregating all filters into a dict and calling .filter once, the ORM becomes smarter and only makes a single join (see the sketch after this list)
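To illustrate the last two points, a sketch of the aggregated-filter pattern, with relation names assumed from this thread. In Django, each chained .filter() call that spans a multi-valued relation such as IssueLink can add its own join, while conditions passed in a single call share one join:

from code_review_backend.issues.models import Issue  # assumed import path


def known_issues(head_changeset, path):
    # All conditions go through a single .filter() call so the ORM
    # produces one join on IssueLink instead of one join per chained call.
    filters = {
        "issue_links__revision__head_changeset": head_changeset,
        "path": path,
    }
    # Load only what the bot consumes: the issue id and hash.
    return Issue.objects.filter(**filters).only("id", "hash")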

I used the following test code and payload, but you can also simply hit the corresponding URL directly.

import json
from datetime import datetime

from code_review_bot import taskcluster
from code_review_bot.backend import BackendAPI

# Point the bot at the local backend restored from the Heroku backup.
taskcluster.secrets = {
    "backend": {
        "url": "http://localhost:8000",
        "username": "bot",
        "password": "Teklia12345",
    }
}

current_date = datetime.now().strftime("%Y-%m-%d")
api = BackendAPI()

with open("payload.json") as f:
    payload = json.load(f)

# Query the list_repo_issues endpoint once per path, as the bot does.
for path in payload["paths"]:
    print(path)
    out = api.list_repo_issues(
        "mozilla-central",
        date=current_date,
        revision_changeset=payload["revision_changeset"],
        path=path,
    )
    print(out)

payload.json (attached)
