
fix: ignore file path and diffs #21

Merged

Conversation

@sshivaditya2019 (Collaborator) commented Oct 27, 2024

Resolves #19

  • Introduces a method to selectively retrieve diffs.
  • Allows for the exclusion of files based on their directory and file extensions.

@sshivaditya2019 (Collaborator, Author)

QA: issue

With the same 15 MB index.js file in the PR


Unused devDependencies (1)

  package.json: @types/diff

Unused exports (1)

  src/helpers/issue.ts: optimizeContext

// Find the filenames with fewer than 500 changes
let files = stats.filter((file) => file.changes < 500).map((file) => file.filename);
// Ignore files in dist or build, .lock files, and index.js
const ignoredFiles = ["dist/*", "build/*", ".lock", "index.js"];
Member

Should definitely be dynamic from .gitignore and such. Otherwise, this was not included in the spec.

Collaborator Author

It uses the .gitignore from the repository if one is present.

Member

Does it combine this + gitignore?

@sshivaditya2019 (Collaborator, Author) Oct 28, 2024

It used to; I've removed all of that now. It just checks file sizes in bytes.

// Fetch the statistics of the pull request
const stats = await githubDiff.getPullRequestStats(org, repo, issue);
// Find the filenames with fewer than 500 changes
let files = stats.filter((file) => file.changes < 500).map((file) => file.filename);
Member

Wrong approach. There should not be a hard cutoff at some arbitrary number. Follow the spec.

Collaborator Author

Revised the approach:

  • Sorts the diffs in ascending order based on file size
  • Continues adding diffs until the maximum context token limit is reached
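
A minimal sketch of this revised strategy (the diff shape and helper name are illustrative, not this PR's exact code; token counting uses the gpt-tokenizer encode that this diff already imports):

import { encode } from "gpt-tokenizer";

function selectDiffsWithinBudget(diffs: { filename: string; diff: string }[], maxTokens: number): string[] {
  // smallest diffs first, so as many files as possible fit in the budget
  const sorted = [...diffs].sort((a, b) => Buffer.byteLength(a.diff) - Buffer.byteLength(b.diff));
  const selected: string[] = [];
  let usedTokens = 0;
  for (const { diff } of sorted) {
    const tokens = encode(diff).length;
    if (usedTokens + tokens > maxTokens) break; // budget reached; all larger diffs are dropped
    selected.push(diff);
    usedTokens += tokens;
  }
  return selected;
}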

How should the maximum token limit be configured—via config settings or environment variables?

Member

Config, I guess, so that partners can budget their token use.

@sshivaditya2019 marked this pull request as ready for review October 27, 2024 16:27
@@ -83,7 +88,18 @@ async function createContextBlockSection(
if (!issueNumber || isNaN(issueNumber)) {
throw context.logger.error("Issue number is not valid");
}
const prDiff = await fetchPullRequestDiff(context, org, repo, issueNumber);
const pulls = await fetchLinkedPrFromIssue(org, repo, issueNumber, context);
@Keyrxng (Member) Oct 27, 2024

These pulls rely on searching the repo for all PRs and then parsing the PR bodies for a #<issueNumber>, which is recommended but not enforced, so this isn't 100% reliable. We tend to use the GQL API for this nowadays, as it's the most reliable. Also, for queries that can return more than 100 items (I'm not sure of the exact limit, but it's likely 100), we should use octokit.paginate(octokit.rest.pulls.list), since scanning all PRs will likely exceed it.

It also relies on the PR author actually linking with #, because if they copy-paste the URL (which most do), it'll be a URL in the PR body instead. I had logic for hashtag + URL matching in my implementation, but I think you may have removed it.

I think it may also be pulling more PR diffs than intended within this promise on each call to createContextBlockSection; I'm unsure without QA-ing and inspecting the formatted chat on a context-rich repo.
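
A minimal sketch of the suggested pagination (assuming context.octokit is a standard Octokit instance, as elsewhere in this plugin):

const allPulls = await context.octokit.paginate(context.octokit.rest.pulls.list, {
  owner,
  repo,
  state: "all",
  per_page: 100, // paginate() walks every page, so results are not capped at one page
});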

@sshivaditya2019 (Collaborator, Author) Oct 27, 2024

Switched to GraphQL to directly fetch the pull requests that close an issue, via GitHub GQL's closedByPullRequestsReferences field. This replaces the previous API-based approach, which relied on parsing PR body text for issue references (via # or URLs).
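
A sketch of that query; closedByPullRequestsReferences and includeClosedPrs are fields from GitHub's GraphQL schema, while the surrounding plumbing is illustrative rather than this PR's exact code:

const query = `
  query closingPrs($owner: String!, $repo: String!, $issueNumber: Int!) {
    repository(owner: $owner, name: $repo) {
      issue(number: $issueNumber) {
        closedByPullRequestsReferences(first: 100, includeClosedPrs: true) {
          nodes {
            number
            url
          }
        }
      }
    }
  }
`;
// linked PRs come back as structured nodes, with no PR-body parsing involved
const linkedPrs = await context.octokit.graphql(query, { owner, repo, issueNumber });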

@@ -17,7 +17,7 @@
"knip-ci": "knip --no-exit-code --reporter json --config .github/knip.ts",
"prepare": "husky install",
"test": "jest --setupFiles dotenv/config --coverage",
"worker": "wrangler dev --env dev --port 4000"
"worker": "wrangler dev --env dev --port 5000"
Member

Is there a specific reason you changed this? The template uses :4000, and so do all the other plugins; it's tedious having to change it back for no reason.

Collaborator Author

Is port 4000 required by some other service/kernel? I'm currently unable to use it, but I can switch back if needed.

Member

The plugin-template comes with :4000 as standard, so it's habitual for plugin devs to copy-paste that URL into -config.yml expecting it to work, then think something is broken when it's the dev port, lmao.

Ideally, don't commit it if you want to use another port (for running multiple plugins locally, for example); always change it back to the template port if you can remember to.

@@ -2,6 +2,7 @@ import OpenAI from "openai";
import { Context } from "../../../types";
import { SuperOpenAi } from "./openai";
import { CompletionsModelHelper, ModelApplications } from "../../../types/llm";
import { encode } from "gpt-tokenizer";
const MAX_TOKENS = 7000;
Member

Should move this into the config and pass it via the yml config.

import { splitKey } from "./issue";
const MAX_TOKENS_ALLOWED = 7000;
Member

Same here. Should also be renamed to MAX_COMPLETION_TOKENS.

return data as unknown as string;
const githubDiff = new GithubDiff(octokit);
//Fetch the statistics of the pull request
const stats = await githubDiff.getPullRequestStats(org, repo, issue);
Member

This lib is only called twice and could be done just as simply with one or two REST calls, I'm sure; it's not clear that relying on a lib for this is ideal. Nothing to hold back the PR for, just an observation.
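
For comparison, a sketch of the same stats via one paginated REST call; the field names come from the listFiles response, the rest is assumed plumbing:

const files = await context.octokit.paginate(context.octokit.rest.pulls.listFiles, {
  owner: org,
  repo,
  pull_number: issue,
  per_page: 100,
});
// each entry carries the filename plus additions/deletions/changes line counts
const stats = files.map((file) => ({ filename: file.filename, changes: file.changes }));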

@@ -296,3 +318,41 @@ function castCommentsToSimplifiedComments(comments: Comments, params: FetchParam
url: comment.html_url,
}));
}

export async function fetchLinkedPrFromIssue(owner: string, repo: string, issueNumber: number, context: Context) {
const prs = await context.octokit.rest.pulls.list({
Member

Might be best to wrap this in a try-catch, because if e.g. the repo has been deleted, this will likely throw.
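
A sketch of that guard (the logger call mirrors the style used elsewhere in this diff; the exact recovery behavior is an assumption):

let prs;
try {
  prs = await context.octokit.rest.pulls.list({ owner, repo, state: "all" });
} catch (err) {
  // e.g. the repo was deleted or is inaccessible; return empty instead of crashing
  context.logger.error(`Could not list pull requests for ${owner}/${repo}`);
  return [];
}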

@@ -296,3 +318,41 @@ function castCommentsToSimplifiedComments(comments: Comments, params: FetchParam
url: comment.html_url,
}));
}

export async function fetchLinkedPrFromIssue(owner: string, repo: string, issueNumber: number, context: Context) {
Member

Could be better named, as it's actually fetching the linked issue from the pull request body.

state: "all",
});
//Filter the PRs which are linked to the issue using the body of the PR
return prs.data.filter((pr) => pr.body?.includes(`#${issueNumber}`));
Member

Might be best to add handling for URL matching too.
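
For illustration, a hypothetical matcher covering both #-references and full issue URLs (as the next reply notes, this whole path was later superseded by the GraphQL approach):

function referencesIssue(body: string, owner: string, repo: string, issueNumber: number): boolean {
  // assumes owner/repo contain no regex metacharacters
  const hashRef = new RegExp(`#${issueNumber}\\b`);
  const urlRef = new RegExp(`github\\.com/${owner}/${repo}/issues/${issueNumber}\\b`);
  return hashRef.test(body) || urlRef.test(body);
}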

Collaborator Author

Not applicable; replaced with the GraphQL solution.

@sshivaditya2019 (Collaborator, Author)

QA: Issue

@0x4007 (Member) left a comment

Perhaps my spec isn't clear, but you're not following it.

  • Remove all the file ignores.
  • Count the amount of changes per file.
  • Sort in order of amount of changes.
  • Filter out files from most changes to least, based on context limits.

This will automatically filter out large automated changes like compiled dist and lock files.

//Fetch the statistics of the pull request
const stats = await githubDiff.getPullRequestStats(org, repo, issue);
//Ignore files like in dist or build or .lock files
const ignoredFiles = (await buildIgnoreFilesFromGitIgnore(context, org, repo)) || [];
Member

Hey, I just realized this is pointless.

If it's ignored, it won't be in git or in the diff. Remove all this logic and just rely on filtering files out from the largest amount of changes to the smallest.

repo,
path: ".gitignore",
});
// Build an array of files to ignore
Member

Get rid of all this logic too

@sshivaditya2019 (Collaborator, Author)

> Perhaps my spec isn't clear, but you're not following it.
>
>   • Remove all the file ignores.
>   • Count the amount of changes per file.
>   • Sort in order of amount of changes.
>   • Filter out files from most changes to least, based on context limits.
>
> This will automatically filter out large automated changes like compiled dist and lock files.

I think that counting the number of changes isn't the best metric. For instance, dist/index.js had only about 300 changes. A more effective metric would be the diff size in bytes.

I'm following the guidelines outlined in this comment. Is there another specification for this issue?

@0x4007 (Member) commented Oct 27, 2024

> Basically this strategy would start by excluding dist and likely other large changes like lock file etc.

This could have been worded better. I meant that it would automatically exclude those based on how large their diffs are.

GitHub's pull request code view UI handles this in a smart way, where it won't display the dist/index.js and lock files due to "large diffs".

Perhaps they also have a line-length limitation. It may be wise to replicate.

Then again, 300 line changes is significant. Much more than a normal file diff, I think?

This would almost certainly be filtered out if there were a ton of file changes with ~10 lines changed each, etc.

@sshivaditya2019 (Collaborator, Author) commented Oct 27, 2024

> GitHub's pull request code view UI handles this in a smart way, where it won't display the dist/index.js and lock files due to "large diffs".

It seems to use file size as a criterion to determine whether to show the UI, as those files wouldn't be accessible even in a regular file viewer without the diffs.

> Then again, 300 line changes is significant. Much more than a normal file diff, I think?

It is indeed significant, more than what you'd typically see in a standard file diff. However, while 300 lines could potentially be changed through code, the diff size in bytes for index.js was around 15.6 MB. In comparison, a code file with the same number of line changes would be much smaller in size.

@0x4007 (Member) commented Oct 28, 2024

Alright so then it seems clear to me that we can measure the diff size in bytes and then go from there.

@sshivaditya2019 (Collaborator, Author) commented Oct 28, 2024

@0x4007 I need some clarification. Currently, the system retrieves both filenames and the difference size in bytes, and then it fetches the diffs for each filename. Once that’s done, we sort the results in ascending order and continue adding to the context until we reach the maximum limit.

I followed along up to the sorting step in the specifications, but I'm unsure if I understand it correctly. What criteria should be used to select the diffs? Is it based on file size being under a certain threshold, or is there another method we should use?

@0x4007 (Member) commented Oct 28, 2024

I think that's all you need to do. What is the problem?

@sshivaditya2019 (Collaborator, Author) commented Oct 28, 2024

>   • Filter out files from the most changes to the least, according to context limits.
>
> This will automatically exclude large automated changes, such as compiled distribution and lock files.

Ok, I see now that I misunderstood these lines. I think this PR is ready to be merged. I'll post more QA for this.

@@ -23,6 +23,7 @@ export const pluginSettingsSchema = T.Object({
model: T.String({ default: "o1-mini" }),
openAiBaseUrl: T.Optional(T.String()),
similarityThreshold: T.Number({ default: 0.9 }),
maxTokens: T.Number({ default: 10000 }),
Member

Default should be the max for the model we are using? 10k seems arbitrary.

Collaborator Author

10k seemed like a reasonable limit. If someone is using older OpenAI models (GPT-3.5 Turbo, GPT-4 Turbo 128K), it could result in a hefty bill instantly.

Member

Default to the latest sensible models then, like 4o etc.
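
One way to express that, as a sketch; the lookup table and token values are assumptions based on OpenAI's published limits, not part of this PR:

// hypothetical per-model completion-token caps so the default tracks the chosen model
const MODEL_MAX_COMPLETION_TOKENS: Record<string, number> = {
  "o1-mini": 65536,
  "gpt-4o": 16384,
};
const maxTokens = MODEL_MAX_COMPLETION_TOKENS[model] ?? 10000; // fall back to the old default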

@0x4007 merged commit ec32970 into ubiquity-os-marketplace:development Oct 28, 2024
2 checks passed
@ubiquity-os-beta (bot) mentioned this pull request Oct 28, 2024
@sshivaditya2019 (Collaborator, Author)

QA: Link

@Keyrxng mentioned this pull request Oct 28, 2024

Successfully merging this pull request may close these issues.

Payload optimization
5 participants