Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add repository metrics #177

Open
wants to merge 20 commits into
base: master
Choose a base branch
from
Open

[WIP] Add repository metrics #177

wants to merge 20 commits into from

Conversation

steffen
Copy link

@steffen steffen commented Mar 1, 2019

Description

Earlier this week I presented the idea of a modified Hubble to display repository metrics to a few peers (including @larsxschneider) which received very positive feedback.
The proof of concept can be seen here: https://steffen.github.io/rosat/

I called the tool Rosat as that was the next satellite that has been deployed to space after Hubble and fittingly it was a German satellite, built in collaboration with the US and UK that used X-ray imaging to study its orbiting plane (earth).

@larsxschneider suggested that it would simplify maintenance for users if the repository metrics tool could be integrated directly into Hubble, which received a 👍 from @pluehne to my knowledge. 🙂

Therefore I'm creating this PR as a proof of concept to explore how an integration could look like.

Demo

  1. Open Hubble:
    https://steffen.github.io/hubble/
  2. Go to Repositories -> Monitored:
    https://steffen.github.io/hubble/repos-monitored
  3. Click on a repository:
    https://steffen.github.io/hubble/repository/?nwo=test-org/large-repo
    See Overview with the repository name and size (fetched via a SQL query)
  4. Click on GitHub Metrics:
    https://steffen.github.io/hubble/repository/github-api-requests?nwo=test-org/large-repo
    See API Request Types by Count (gathered from the log files)

Screenshots

image

image

image

Comments

Configuration

Repositories to be monitored can be configured through a new configuration option in config.py:
https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/config.py.example#L75-L81

Color

I made the header background blue to make it obvious to the user that they are in "repository view mode". I found the blue fitting as it represents the sky which becomes bluer as one gets closer to planet earth (to follow-up on that Rosat analogy). 🙂

Monitored repositories page

Right now I added the list of monitored repositories and the links to the repository view mode to the Repositories section. I believe it would be handier to the user if we'd display the monitored repositories on Hubble's overview dashboard page, which does not exist yet. I would find it convenient to have an overview dashboard page similar to Rosat's overview dashboard page that displays a few key metrics of the instance. We can discuss this in a separate issue though.

"Back to instance statistics" link

I'm open for style and position suggestions of the "Back to instance statistics" link. For now I found the top left position within the navigation to be the best place.

Question about passing repository name with owner to bash scripts

The repository name with owner has to be passed to the bash scripts, such as to this one_
https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/scripts/repository/github-api-request-types-by-count.sh

I'm not sure yet what the best way is to accomplish that via the executeScript method. Any ideas?

The code

I programmed with Python for the first time this week. 😁 As I have a background in Ruby and JavaScript, it was not hard and I rather enjoyed it. Though, don't expect the best Python code just yet. 🙂 I'm open for suggestions!
Also, obviously it needs more testing, but I think the changes are a good base for discussing the integration.
Having said that, huge kudos to the current Hubble code base, it's fun building on it! ✨

Looking forward to your thoughts!

@pluehne
Copy link
Contributor

pluehne commented Mar 1, 2019

@steffen: Thank you so much for this huge contribution 👍! I’ll want to look at this in detail, but I’ll be off the next days unfortunately. It will take a week or two until I can come back to your pull request. I hope that’s okay 🙂.

@pluehne pluehne self-requested a review March 1, 2019 22:22
@pluehne pluehne self-assigned this Mar 1, 2019
@steffen
Copy link
Author

steffen commented Mar 1, 2019

@pluehne No worries! Thanks for looking into it! Enjoy your time off. 🙂

@steffen
Copy link
Author

steffen commented Mar 13, 2019

Hi @pluehne 👋,

Just a "quick" question about my question about passing repository name with owner to bash scripts from the original post.
Do you have any pointers or idea of how to go about it? Here's the question:

The repository name with owner has to be passed to the bash scripts, such as to this one:
https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/scripts/repository/github-api-request-types-by-count.sh

I'm not sure yet what the best way is to accomplish that via the executeScript method.

Basically I want to pass this.repository (which is a string with "owner/repo") to repository/github-api-request-types-by-count.sh via executeScript here: https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/reports/repository/ReportGitHubApiRequestTypesByCount.py#L15

Thanks!!

@pluehne
Copy link
Contributor

pluehne commented Mar 21, 2019

@steffen: The best way to achieve that would be to extend the executeScript method to take a list of environment variables that it then, it turn, forwards to the executed script. That should make it possible to pass arguments to the script, what do you think?

And sorry that I still haven’t had the time to look over your pull request!

@steffen
Copy link
Author

steffen commented Mar 21, 2019

@pluehne I've thought about using environment variables as well. Any pointers on how executeScript could be extended to support that? 😬Thanks!! 🙂

@pluehne
Copy link
Contributor

pluehne commented Mar 21, 2019

@steffen: executeScript uses executeCommand under the hood, which makes a call to subprocess.Popen. According to the Python documentation, subprocess.Popen optionally takes a named parameter called env through which one can set the environment variables passed to the command to be invoked. There’s an answer on StackOverflow that suggests copying the preexisting environment as a basis to add new variables to be sure nothing important is removed by setting a new environment. That’s how I would do it I guess 🙂. If you want to implement this, I’d suggest you open a separate pull request so that I can review and merge it more quickly.

@steffen
Copy link
Author

steffen commented Mar 21, 2019

@pluehne Great, thanks! Speaking of reviewing and merging the PR. I thought about keeping this one here as small as possible, but big enough to get the whole picture of how I imagine the integration with repository specific metrics. I do have a few changes locally I want to push here which include the functionality of having working directories of the actual repositories in a temporary folder next to hubble-data. Do you think that's best to keep it all in this PR or in what way do you mean I should open a separate pull request for the executeScript extension? Right now the API Request Types by Count chart data gathering is not working correctly because of the missing executeScript functionality. Also, this PR is based off a branch on my fork so not sure against which branch you want me to open the PR. Thanks!! 🙂

"trees",
"blobs",
"tags",
"refs"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These series can have dramatically different ranges. Should we put them into individual charts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I separated counts and bytes into different charts. The charts allow to view selected series by clicking on the legend items, which I thought is better than too many charts. But yeah, worth thinking about splitting it up. Which series would you suggest to separate, e.g. tags and refs into a separate chart?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if refs include tags? If they do, then I would even drop tags.

Plus, I am not sure if the total "number of trees" and/or "number of blobs" are actionable metrics. Maybe number of trees/files of HEAD in the default branch represents the current performance characteristics of the repo for the user better?

In general, I believe we should track all this data but only visualize the most important/actionable subset.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larsxschneider Great feedback! I changed the git-sizer report to include all values and now we are flexible in picking what to display in the charts.

The refs count includes the tags, so I agree that we only need to show the number of refs.
I tweaked the charts to show more actionable metrics and included some comments from your article:
https://steffen.github.io/hubble/repository/git-repository-size?nwo=test-org/large-repo

Great point on the number of trees/files of HEAD in the default branch. git-sizer doesn't include that metric, but we can just get it directly from the repo.

Thank you! And Happy Eastern! 🙂🐰

return round(bytes / 1024 ** 2, 3)

def updateData(self):
stdout = self.executeScript(["ghe-repo", self.repository, "-c", self.configuration["gitSizerPath"] + " --json --json-version=2"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is nice! But don't we run git-sizer here twice per repo? Here and there: https://github.com/Autodesk/hubble/pull/177/files#diff-6b012ceb81adedd4a22e2089a0102dd3R10

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'm aware and thought it's ok to optimize later. 🙂It would need a bit of refactoring I think, as the common design in the updater is to have one report file and class per report. But it shouldn't be too hard to find a solution for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two options:

(1) All reports are executed sequentially. That means we could pass all reports a "cache dictionary" which is used to use data that a previous report calculated.

(2) We add a cache flag to executeScript. If this flag is set, then the output is cached and reused in a subsequent call with the same command.

@pluehne/@steffen Do you see another option? If no, which one would you prefer? I am leaning towards (2).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larsxschneider Thanks for these ideas! I consolidated the git-sizer reports into one: 7008ec9#diff-66c529dd9266003f0fc5892af8ac0cb8

All quantity values are now tracked in one report. This gives us more flexibility on what to show while having all the data as discussed here. I didn't really think of this idea of using the same report for different charts until I worked on the git-download reports which does exactly that. 🙂

uniq -ic |
sort -rn |
head -20 |
awk '{ printf("%s\t%s\n", $2, $1) }'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks pretty similar to https://github.com/Autodesk/hubble/blob/master/updater/scripts/api-requests-by-user.sh ... should we add a repo option to the original script to avoid code duplication?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! There are other reports I wanted to add for repo metrics that are pretty much the same as well. I'll explore adding a repo option to the original scripts.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larsxschneider I added a repo option for git-download.sh. Let me know if you think that's a good solution and if it's ok to use for other reports/scripts as well.

@toddocon
Copy link
Contributor

toddocon commented Apr 5, 2019

I ran the same greps by hand against my top repo as reported in Hubble:API requests per repo. The results were oddly formatted.

"/repositories/:repository_id/contents/?" 79534
"/repositories/:repository_id/pulls/:id" 12816
"/repositories/:repository_id/collaborators/:collab/permission" 7321
"/repositories/:repository_id/pulls" 1181
"/repositories/:repository_id/branches" 870
"/repositories/:repository_id" 762
"/repositories/:repository_id/statuses/:sha" 45
/user 26
git-server-23205134-da01-11e6-a478-005056849626 25
"/repositories/:repository_id/pulls/:id/commits" 19
"/repositories/:repository_id/git/refs/
" 13
/ 7
"/repositories/:repository_id/issues" 6
"/user/:user_id" 4
/user/teams 4
/user/repos 3
"/repositories/:repository_id/pulls/:id/comments" 3
"/repositories/:repository_id/issues/:id" 3
"/repositories/:repository_id/status/*" 2
"/repositories/:repository_id/git" 2

Whereas on another repo, the results looked OK.

"/repositories/:repository_id/pulls/:id/comments" 25860
"/repositories/:repository_id/pulls" 19176
"/repositories/:repository_id/pulls/:id" 10798
"/repositories/:repository_id/collaborators" 9576
"/repositories/:repository_id" 9476
"/repositories/:repository_id/statuses/:sha" 4523
"/repositories/:repository_id/branches" 3977
"/repositories/:repository_id/contents/?" 3158
"/repositories/:repository_id/issues/:id/comments" 1832
"/repositories/:repository_id/git/refs/
" 1696
"/repositories/:repository_id/statuses/" 1251
"/repositories/:repository_id/status/
" 1196
"/repositories/:repository_id/commits/" 1171
"/repositories/:repository_id/branches/
" 988
"/repositories/:repository_id/pulls/:id/files" 513
"/repositories/:repository_id/issues/comments/:id" 293
"/repositories/:repository_id/git" 287
"/repositories/:repository_id/pulls/:id/commits" 178
"/repositories/:repository_id/commits" 160
"/repositories/:repository_id/issues" 62

This was the api-request-types

zcat -f /var/log/github/unicorn.log.1* |
grep -F 'request_category=api' |
grep -Fv 'remote_address=127.0.0.1' |
grep -Fv "controller="Api::Internal" |
grep "repo=OWNER/REPO" |
grep -oP 'route=\K\S+' |
grep -Fvx 'nil' |
sort |
uniq -ic |
sort -rn |
head -20 |
awk '{ printf("%s\t%s\n", $2, $1) }'

@toddocon
Copy link
Contributor

toddocon commented Apr 5, 2019

Here's the raw data up to the uniq -ic command if it helps.

$ zcat -f /var/log/github/unicorn.log.1* |
grep -F 'request_category=api' |
grep -Fv 'remote_address=127.0.0.1' |
grep -Fv "controller="Api::Internal" |
grep "repo=OWNER/REPO" |
grep -oP 'route=\K\S+' |
grep -Fvx 'nil' |
sort |
uniq -ic
7 /
25 git-server-23205134-da01-11e6-a478-005056849626
2 "/organizations/:organization_id"
1 "/organizations/:organization_id/repos"
1 /rate_limit
762 "/repositories/:repository_id"
870 "/repositories/:repository_id/branches"
7321 "/repositories/:repository_id/collaborators/:collab/permission"
1 "/repositories/:repository_id/commits/"
79534 "/repositories/:repository_id/contents/?
"
1 "/repositories/:repository_id/events"
2 "/repositories/:repository_id/git"
13 "/repositories/:repository_id/git/refs/"
1 "/repositories/:repository_id/hooks"
6 "/repositories/:repository_id/issues"
3 "/repositories/:repository_id/issues/:id"
1 "/repositories/:repository_id/issues/:id/comments"
1181 "/repositories/:repository_id/pulls"
12816 "/repositories/:repository_id/pulls/:id"
3 "/repositories/:repository_id/pulls/:id/comments"
19 "/repositories/:repository_id/pulls/:id/commits"
2 "/repositories/:repository_id/status/
"
1 "/repositories/:repository_id/statuses/*"
45 "/repositories/:repository_id/statuses/:sha"
1 /search/code
26 /user
1 /user/emails
3 /user/repos
4 /user/teams
4 "/user/:user_id"

@steffen
Copy link
Author

steffen commented Apr 8, 2019

Hi @toddocon, thank you for running the API queries!!
I'm trying to distinguish what exactly you see as oddly formatted.
I see some routes without quotes, and then I see git-server-*, which fall out of line a little bit. Are these two type of entries the ones that you see as oddly formatted?

Thanks again for your feedback!

@steffen steffen changed the title Proof of concept for repository metrics [WIP] Add repository metrics Apr 17, 2019
@codecov-io
Copy link

Codecov Report

Merging #177 into master will decrease coverage by 0.19%.
The diff coverage is n/a.

steffen added 2 commits April 18, 2019 17:01
This allows running `python3 update-stats.py` on the GitHub Enterprise appliance
@codecov-io
Copy link

Codecov Report

Merging #177 into master will decrease coverage by 0.19%.
The diff coverage is n/a.

Which aligns with Hubble install instructions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants