Tokenization Poor Performance With Large Number of Token Instances #115

Open
aolszowka opened this issue Jun 9, 2023 · 0 comments

The Tokenizer appears to perform very poorly when you have a large number of replacement token instances.

For example, running this line against a file:

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value }

returns 8600 matches on a file I am trying to run the replacement on.

Based on the logic of this loop:

ForEach ($match in $matches) {

This will attempt to perform the operation 8600 times. If you look at the code, it loops through the file row by row for each match, attempting a replacement of the variable wherever it is found.

This is inefficient. Instead, the line above should gather distinct values, like so (the following tries to follow the PowerShell-isms and is not 100% efficient):

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } | Sort-Object | Get-Unique

Note that the documentation states the list must be sorted for Get-Unique to work, which is why Sort-Object is called beforehand.

Running this on the same file returns a mere 38 distinct tokens to replace, which is more than two orders of magnitude fewer than before.
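To illustrate the difference between counting every occurrence and counting distinct tokens, here is a minimal sketch on an in-memory sample. The `__Name__` token syntax, the sample text, and the `$regex` value are assumptions made for illustration and are not taken from the task's actual source.

```powershell
# Hypothetical sample text and token pattern (assumptions for illustration).
$text = @'
server=__Host__ port=__Port__
backup=__Host__ port=__Port__
user=__User__ host=__Host__
'@
$regex = '__[A-Za-z]+__'

# Every occurrence, duplicates included.
$all = [regex]::Matches($text, $regex) | ForEach-Object { $_.Value }

# Distinct tokens only. Get-Unique collapses adjacent duplicates,
# so the values are sorted first.
$distinct = $all | Sort-Object | Get-Unique

"{0} occurrences, {1} distinct tokens" -f @($all).Count, @($distinct).Count
# prints: 6 occurrences, 3 distinct tokens
```

The loop only needs to run once per distinct token, so the iteration count drops from the number of occurrences to the number of unique names.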

There are still other performance issues, for example the row-by-row replacement per variable as seen here:

(Get-Content $tempFile -Encoding $encoding) |
    Foreach-Object {
        $_ -replace $match, $variableValue
    } |
    Set-Content $tempFile -Encoding $encoding -Force

This becomes painful as the number of lines in the file increases. However, the simple fix above (gathering distinct values first) would resolve the most obvious performance issue.
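As a possible follow-up, the per-variable read and write of the file could be avoided by loading the content once, replacing each distinct token in memory, and writing once. The following is only a sketch under stated assumptions: `$tempFile`, `$encoding`, and `$regex` mirror names used above, while the `__Name__` token syntax and the `$values` lookup table are hypothetical stand-ins for however the task actually resolves variable values.

```powershell
# Sketch only: the token syntax and the $values table are hypothetical.
$tempFile = Join-Path ([System.IO.Path]::GetTempPath()) 'tokenize-sketch.txt'
$encoding = 'UTF8'
$regex    = '__[A-Za-z]+__'
$values   = @{ '__Host__' = 'example.local'; '__Port__' = '8080' }

# Create a small sample file to operate on.
Set-Content -Path $tempFile -Encoding $encoding -Value @(
    'server=__Host__ port=__Port__'
    'backup=__Host__ port=__Port__'
)

# Read the whole file once as a single string.
$content = Get-Content -Path $tempFile -Encoding $encoding -Raw

# Collect each distinct token once.
$tokens = [regex]::Matches($content, $regex) |
    ForEach-Object { $_.Value } |
    Sort-Object |
    Get-Unique

# Replace every occurrence of each distinct token in memory.
foreach ($token in $tokens) {
    if ($values.ContainsKey($token)) {
        # String.Replace is literal, so neither side is treated as a regex.
        $content = $content.Replace($token, $values[$token])
    }
}

# Write the file once. -NoNewline avoids appending an extra trailing newline.
Set-Content -Path $tempFile -Encoding $encoding -Value $content -NoNewline -Force
```

Note that this swaps the regex-based `-replace` for a literal `String.Replace`, which is a design choice of the sketch and may not match the behaviour the task relies on.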
