
fix: ignore file path and diffs #21

Merged

Conversation

@sshivaditya2019 (Collaborator) commented Oct 27, 2024

Resolves #19

  • Introduces a method to selectively retrieve diffs.
  • Allows for the exclusion of files based on their directory and file extensions.

@sshivaditya2019 (Collaborator, Author)

QA: issue

With the same 15 MB index.js file in the PR


Unused devDependencies (1)

  package.json: @types/diff

Unused exports (1)

  src/helpers/issue.ts: optimizeContext

// Find the filenames with fewer than 500 changes
let files = stats.filter((file) => file.changes < 500).map((file) => file.filename);
// Ignore files in dist or build, .lock files, and index.js
const ignoredFiles = ["dist/*", "build/*", ".lock", "index.js"];
Member

Should definitely be dynamic from .gitignore and such. Otherwise, this was not included in the spec.

Collaborator Author

It uses the .gitignore from the repository if one is present.

Member

Does it combine this + gitignore?

@sshivaditya2019 (Collaborator, Author) Oct 28, 2024

It used to; I've removed all of that now. It just checks file sizes in bytes.

// Fetch the statistics of the pull request
const stats = await githubDiff.getPullRequestStats(org, repo, issue);
// Find the filenames with fewer than 500 changes
let files = stats.filter((file) => file.changes < 500).map((file) => file.filename);
Member

Wrong approach. There should not be a hard cutoff at some arbitrary number. Follow the spec.

Collaborator Author

Revised the approach:

  • Sorts the diffs in ascending order based on file size
  • Continues adding diffs until the maximum context token limit is reached
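
A minimal sketch of this revised strategy (the diff shape and helper name are illustrative, not this PR's exact code; token counting uses the gpt-tokenizer encode that this diff already imports):

import { encode } from "gpt-tokenizer";

function selectDiffsWithinBudget(diffs: { filename: string; diff: string }[], maxTokens: number): string[] {
  // smallest diffs first, so as many files as possible fit in the budget
  const sorted = [...diffs].sort((a, b) => Buffer.byteLength(a.diff) - Buffer.byteLength(b.diff));
  const selected: string[] = [];
  let usedTokens = 0;
  for (const { diff } of sorted) {
    const tokens = encode(diff).length;
    if (usedTokens + tokens > maxTokens) break; // budget reached; all larger diffs are dropped
    selected.push(diff);
    usedTokens += tokens;
  }
  return selected;
}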

How should the maximum token limit be configured—via config settings or environment variables?

Member

Config, I guess, so that partners can budget their token use.

@sshivaditya2019 marked this pull request as ready for review October 27, 2024 16:27
@@ -83,7 +88,18 @@ async function createContextBlockSection(
if (!issueNumber || isNaN(issueNumber)) {
throw context.logger.error("Issue number is not valid");
}
const prDiff = await fetchPullRequestDiff(context, org, repo, issueNumber);
const pulls = await fetchLinkedPrFromIssue(org, repo, issueNumber, context);
@Keyrxng (Member) Oct 27, 2024

These pulls rely on searching the repo for all PRs and then parsing the PR bodies for a #<issueNumber>, which is recommended but not enforced, so this isn't 100% reliable. We tend to use the GQL API for this nowadays, as it's the most reliable. Also, for queries that can return more than 100 items (I'm not sure of the exact limit, but it's likely 100), we should use octokit.paginate(octokit.rest.pulls.list), since scanning all PRs will likely exceed it.

It also relies on the PR author actually linking with #, because if they copy-paste the URL (which most do), it'll be a URL in the PR body instead. I had logic for hashtag + URL matching in my implementation, but I think you may have removed it.

I think it may also be pulling more PR diffs than intended within this promise on each call to createContextBlockSection; I'm unsure without QA-ing and inspecting the formatted chat on a context-rich repo.
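
A minimal sketch of the suggested pagination (assuming context.octokit is a standard Octokit instance, as elsewhere in this plugin):

const allPulls = await context.octokit.paginate(context.octokit.rest.pulls.list, {
  owner,
  repo,
  state: "all",
  per_page: 100, // paginate() walks every page, so results are not capped at one page
});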

@sshivaditya2019 (Collaborator, Author) Oct 27, 2024

Switched to GraphQL to directly fetch the pull requests that close an issue, via GitHub GQL's closedByPullRequestsReferences field. This replaces the previous API-based approach, which relied on parsing PR body text for issue references (via # or URLs).
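
A sketch of that query; closedByPullRequestsReferences and includeClosedPrs are fields from GitHub's GraphQL schema, while the surrounding plumbing is illustrative rather than this PR's exact code:

const query = `
  query closingPrs($owner: String!, $repo: String!, $issueNumber: Int!) {
    repository(owner: $owner, name: $repo) {
      issue(number: $issueNumber) {
        closedByPullRequestsReferences(first: 100, includeClosedPrs: true) {
          nodes {
            number
            url
          }
        }
      }
    }
  }
`;
// linked PRs come back as structured nodes, with no PR-body parsing involved
const linkedPrs = await context.octokit.graphql(query, { owner, repo, issueNumber });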

@@ -17,7 +17,7 @@
"knip-ci": "knip --no-exit-code --reporter json --config .github/knip.ts",
"prepare": "husky install",
"test": "jest --setupFiles dotenv/config --coverage",
"worker": "wrangler dev --env dev --port 4000"
"worker": "wrangler dev --env dev --port 5000"
Member

Is there a specific reason you changed this? The template uses :4000, and so do all the other plugins; it's tedious having to change it back for no reason.

Collaborator Author

Is port 4000 required by some other service/kernel? I'm currently unable to use it, but I can switch back if needed.

Member

The plugin-template comes with :4000 as standard, so it's habitual for plugin devs to copy-paste that URL into -config.yml expecting it to work, then think something is broken when it's the dev port, lmao.

Ideally, don't commit it if you want to use another port (for running multiple plugins locally, for example); always change it back to the template port if you can remember to.

@@ -2,6 +2,7 @@ import OpenAI from "openai";
import { Context } from "../../../types";
import { SuperOpenAi } from "./openai";
import { CompletionsModelHelper, ModelApplications } from "../../../types/llm";
import { encode } from "gpt-tokenizer";
const MAX_TOKENS = 7000;
Member

Should move this into the config and pass it via the yml config.

import { splitKey } from "./issue";
const MAX_TOKENS_ALLOWED = 7000;
Member

Same here. Should also be renamed to MAX_COMPLETION_TOKENS.

return data as unknown as string;
const githubDiff = new GithubDiff(octokit);
//Fetch the statistics of the pull request
const stats = await githubDiff.getPullRequestStats(org, repo, issue);
Member

This lib is only called twice and could be done just as simply with one or two REST calls, I'm sure; it's not clear that relying on a lib for this is ideal. Nothing to hold back the PR for, just an observation.
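
For comparison, a sketch of the same stats via one paginated REST call; the field names come from the listFiles response, the rest is assumed plumbing:

const files = await context.octokit.paginate(context.octokit.rest.pulls.listFiles, {
  owner: org,
  repo,
  pull_number: issue,
  per_page: 100,
});
// each entry carries the filename plus additions/deletions/changes line counts
const stats = files.map((file) => ({ filename: file.filename, changes: file.changes }));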

@@ -296,3 +318,41 @@ function castCommentsToSimplifiedComments(comments: Comments, params: FetchParam
url: comment.html_url,
}));
}

export async function fetchLinkedPrFromIssue(owner: string, repo: string, issueNumber: number, context: Context) {
const prs = await context.octokit.rest.pulls.list({
Member

Might be best to wrap this in a try-catch, because if e.g. the repo has been deleted, this will likely throw.
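
A sketch of that guard (the logger call mirrors the style used elsewhere in this diff; the exact recovery behavior is an assumption):

let prs;
try {
  prs = await context.octokit.rest.pulls.list({ owner, repo, state: "all" });
} catch (err) {
  // e.g. the repo was deleted or is inaccessible; return empty instead of crashing
  context.logger.error(`Could not list pull requests for ${owner}/${repo}`);
  return [];
}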

@@ -296,3 +318,41 @@ function castCommentsToSimplifiedComments(comments: Comments, params: FetchParam
url: comment.html_url,
}));
}

export async function fetchLinkedPrFromIssue(owner: string, repo: string, issueNumber: number, context: Context) {
Member

Could be better named, as it's actually fetching the linked issue from the pull request body.

state: "all",
});
//Filter the PRs which are linked to the issue using the body of the PR
return prs.data.filter((pr) => pr.body?.includes(`#${issueNumber}`));
Member

Might be best to add handling for URL matching too.
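
For illustration, a hypothetical matcher covering both #-references and full issue URLs (as the next reply notes, this whole path was later superseded by the GraphQL approach):

function referencesIssue(body: string, owner: string, repo: string, issueNumber: number): boolean {
  // assumes owner/repo contain no regex metacharacters
  const hashRef = new RegExp(`#${issueNumber}\\b`);
  const urlRef = new RegExp(`github\\.com/${owner}/${repo}/issues/${issueNumber}\\b`);
  return hashRef.test(body) || urlRef.test(body);
}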

Collaborator Author

Not applicable; replaced with the GraphQL solution.

@sshivaditya2019 (Collaborator, Author)

QA: Issue

@0x4007 (Member) left a comment

Perhaps my spec isn't clear, but you're not following it.

  • Remove all the file ignores.
  • Count the amount of changes per file.
  • Sort in order of amount of changes.
  • Filter out files from most changes to least, based on context limits.

This will automatically filter out large automated changes like compiled dist and lock files.

//Fetch the statistics of the pull request
const stats = await githubDiff.getPullRequestStats(org, repo, issue);
//Ignore files like in dist or build or .lock files
const ignoredFiles = (await buildIgnoreFilesFromGitIgnore(context, org, repo)) || [];
Member

Hey, I just realized this is pointless.

If it's ignored, it won't be in git or in the diff. Remove all this logic and just rely on filtering files out from the largest amount of changes to the smallest.

repo,
path: ".gitignore",
});
// Build an array of files to ignore
Member

Get rid of all this logic too

@sshivaditya2019 (Collaborator, Author)

> Perhaps my spec isn't clear, but you're not following it.
>
>   • Remove all the file ignores.
>   • Count the amount of changes per file.
>   • Sort in order of amount of changes.
>   • Filter out files from most changes to least, based on context limits.
>
> This will automatically filter out large automated changes like compiled dist and lock files.

I think that counting the number of changes isn't the best metric. For instance, dist/index.js had only about 300 changes. A more effective metric would be the diff size in bytes.

I'm following the guidelines outlined in this comment. Is there another specification for this issue?

@0x4007 (Member) commented Oct 27, 2024

> Basically this strategy would start by excluding dist and likely other large changes like lock file etc.

This could have been worded better. I meant that it would automatically exclude those based on how large their diffs are.

GitHub's pull request code view UI handles this in a smart way, where it won't display the dist/index.js and lock files due to "large diffs".

Perhaps they also have a line-length limitation. It may be wise to replicate.

Then again, 300 line changes is significant. Much more than a normal file diff, I think?

This would almost certainly be filtered out if there were a ton of file changes with ~10 lines changed each, etc.

@sshivaditya2019 (Collaborator, Author) commented Oct 27, 2024

> GitHub's pull request code view UI handles this in a smart way, where it won't display the dist/index.js and lock files due to "large diffs".

It seems to use file size as a criterion to determine whether to show the UI, as those files wouldn't be accessible even in a regular file viewer without the diffs.

> Then again, 300 line changes is significant. Much more than a normal file diff, I think?

It is indeed significant, more than what you'd typically see in a standard file diff. However, while 300 lines could potentially be changed through code, the diff size in bytes for index.js was around 15.6 MB. In comparison, a code file with the same number of line changes would be much smaller in size.

@0x4007 (Member) commented Oct 28, 2024

Alright so then it seems clear to me that we can measure the diff size in bytes and then go from there.

@sshivaditya2019 (Collaborator, Author) commented Oct 28, 2024

@0x4007 I need some clarification. Currently, the system retrieves both filenames and the difference size in bytes, and then it fetches the diffs for each filename. Once that’s done, we sort the results in ascending order and continue adding to the context until we reach the maximum limit.

I followed along up to the sorting step in the specifications, but I'm unsure if I understand it correctly. What criteria should be used to select the diffs? Is it based on file size being under a certain threshold, or is there another method we should use?

@0x4007 (Member) commented Oct 28, 2024

I think that's all you need to do. What is the problem?

@sshivaditya2019 (Collaborator, Author) commented Oct 28, 2024

>   • Filter out files from the most changes to the least, according to context limits.
>
> This will automatically exclude large automated changes, such as compiled distribution and lock files.

Ok, I see now that I misunderstood these lines. I think this PR is ready to be merged. I'll post more QA for this.

@@ -23,6 +23,7 @@ export const pluginSettingsSchema = T.Object({
model: T.String({ default: "o1-mini" }),
openAiBaseUrl: T.Optional(T.String()),
similarityThreshold: T.Number({ default: 0.9 }),
maxTokens: T.Number({ default: 10000 }),
Member

Default should be the max for the model we are using? 10k seems arbitrary.

Collaborator Author

10k seemed like a reasonable limit. If someone is using older OpenAI models (GPT-3.5 Turbo, GPT-4 Turbo 128K), it could result in a hefty bill instantly.

Member

Default to the latest sensible models then, like 4o etc.
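
One way to express that, as a sketch; the lookup table and token values are assumptions based on OpenAI's published limits, not part of this PR:

// hypothetical per-model completion-token caps so the default tracks the chosen model
const MODEL_MAX_COMPLETION_TOKENS: Record<string, number> = {
  "o1-mini": 65536,
  "gpt-4o": 16384,
};
const maxTokens = MODEL_MAX_COMPLETION_TOKENS[model] ?? 10000; // fall back to the old default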

@0x4007 merged commit ec32970 into ubiquity-os-marketplace:development Oct 28, 2024
2 checks passed
@ubiquity-os-beta (bot) mentioned this pull request Oct 28, 2024
@sshivaditya2019 (Collaborator, Author)

QA: Link

@Keyrxng mentioned this pull request Oct 28, 2024

Successfully merging this pull request may close these issues.

Payload optimization
5 participants