Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Truncate diff of very long texts if not in verbose mode (fix #12406) #12634

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

devdanzin
Copy link

The _diff_text_ function might try to calculate diffs of huge texts, taking a very long time, even if that diff is going to be truncated in non-verbose mode. This PR adds truncation of the texts before calculating the diff in that case. New tests added, old tests pass.

I'm not sure the code is correct for when equal trailing characters aren't skipped.

Happy to address any reviews or suggestions.

Closes #12406.

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided (automation) changelog entry is part of PR label Jul 20, 2024
@Pierre-Sassoulas Pierre-Sassoulas added the type: performance performance or memory problem/improvement label Jul 20, 2024
@devdanzin
Copy link
Author

The logic for when equal leading characters aren't skipped is now sound.

Copy link
Member

@nicoddemus nicoddemus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @devdanzin!

Really sorry about the delay on this one, it fell through the cracks.

I'm afraid it needs further updates since #12766 landed, see my comments.

Also it is understandable if you would prefer to drop this instead, as it has not been given the attention it deserved and the implementation will probably be a bit more complicated to handle the new limit options.

@@ -308,6 +311,21 @@ def _diff_text(left: str, right: str, verbose: int = 0) -> list[str]:
]
left = left[:-i]
right = right[:-i]
shortest = min(left, right, key=lambda x: len(x))
lines = j = start = 0
if shortest.count("\n") >= DEFAULT_MAX_LINES:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this PR got outdated after #12766.

I think the path forward would be for _diff_text() to receive both "max lines" and "max chars" by parameter instead (they being int | None, with None meaning "no limits").

@@ -308,6 +311,21 @@ def _diff_text(left: str, right: str, verbose: int = 0) -> list[str]:
]
left = left[:-i]
right = right[:-i]
shortest = min(left, right, key=lambda x: len(x))
lines = j = start = 0
if shortest.count("\n") >= DEFAULT_MAX_LINES:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm under the impression the code can be made simpler by skipping max chars first, then dealing with max lines... 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bot:chronographer:provided (automation) changelog entry is part of PR type: performance performance or memory problem/improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

text diffs on huge files are slow
3 participants