I noticed that `get_token_stream()` occasionally returns token streams which are longer than expected when used with the `subset` argument. As an example, I have the following subcorpus:
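A subcorpus along these lines can be created as follows. This is a hypothetical reproduction: the report does not name the corpus or the subsetting criteria, so the GERMAPARLMINI sample data shipped with polmineR and a speaker-based subset are assumptions.

```r
library(polmineR)
use("polmineR")  # activate the sample corpora bundled with the package

# Hypothetical subcorpus -- any subcorpus object should work for the
# comparison that follows.
sc <- subset(corpus("GERMAPARLMINI"), speaker == "Angela Dorothea Merkel")
```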
### Without `subset`

I retrieve the token stream for the subcorpus without subsetting it:
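The call might look like this, with `sc` standing for the subcorpus; `collapse = " "` is an assumption, but the character count reported below suggests a single collapsed string was returned.

```r
library(polmineR)

# sc: a polmineR subcorpus object, e.g. created with
# subset(corpus(...), ...)
ts <- get_token_stream(sc, p_attribute = "word", collapse = " ")
nchar(ts)
```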
The returned character vector has a length of 185 characters.
### With `subset`

If I repeat the same process but include a `subset` argument to remove stop words and punctuation, the return value gets longer instead of shorter.
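A sketch of such a call, assuming the tm package is available for its stop word list; the concrete stop words and punctuation marks are illustrative, not taken from the report.

```r
library(polmineR)

# sc: the subcorpus from the example; the subset expression is evaluated
# against the "word" p-attribute.
ts_sub <- get_token_stream(
  sc,
  p_attribute = "word",
  collapse = " ",
  subset = {!word %in% c(tm::stopwords("de"), ",", ".")}
)
nchar(ts_sub)  # reported to come out longer than the unsubsetted stream
```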
### Issue

Looking at `get_token_stream()`, I think that the issue is the combination of `subset`, `collapse` and `beautify` (which is `TRUE` by default). With these arguments, the following line essentially causes the issue:

polmineR/R/token_stream.R, line 150 in 650c75f
The issue is that when removing tokens via the subset, the length of the input object does not correspond to the number of whitespace characters actually needed here. Then, in the final line

polmineR/R/token_stream.R, line 154 in 650c75f

`whitespace` is longer than `tokens`. The remaining tokens are then simply recycled until the length of `whitespace` is reached.
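The surplus output is ordinary R vector recycling, which can be demonstrated in isolation (illustrative values, not the actual code from token_stream.R):

```r
tokens <- c("Dogs", "bark")              # tokens that survived the subset
whitespace <- c("", " ", " ", " ", " ")  # sized from the unmodified input (5 tokens)

# paste() silently recycles the shorter vector to the length of the longer
# one, so the two remaining tokens are repeated to fill five whitespace slots.
out <- paste(paste(whitespace, tokens, sep = ""), collapse = "")
out  # "Dogs bark Dogs bark Dogs"
```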
### Potential Fix

If there is no reason to use the length of the unmodified input object here, I think that changing `.Object` to `tokens` in the first chunk I quoted should be sufficient to address this.