I'm writing a Python script that mimics the behavior of lmplz.
When I tested it out on a large corpus, I found the estimated probabilities differed slightly from lmplz's output.
By shrinking the corpus, I found the bug would manifest itself when training a bigram LM on the following minimum example:
I added a debugging statement after this line, https://github.com/kpu/kenlm/blob/master/lm/builder/adjust_counts.cc#L45, to print `s.n[1]` and `s.n[2]` (the number of 1-grams with adjusted count 1 and adjusted count 2), and found that `s.n[1] == 0` and `s.n[2] == 5`. The correct values should be `s.n[1] == 1` and `s.n[2] == 4`:
The 1-grams `r`, `o`, `t`, and `</s>` each follow two distinct tokens and therefore have an adjusted count of 2, but the 1-gram `a` occurs only after `o`, so its adjusted count should be 1. Apparently lmplz treats the adjusted count of `a` as 2 as well.
The wrong values of `s.n` can affect the discounts, and that is what caused the discrepancy on the full corpus. In this minimal example the discounts happen to be unaffected, because the counts are so small that lmplz falls back to the default discounts anyway.
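To illustrate how `s.n` feeds into the discounts: as I understand it, lmplz uses the Chen–Goodman modified Kneser–Ney estimates, where the discount for count k is derived from the count-of-counts. A minimal sketch, with made-up illustrative count-of-counts (not the values from this corpus):

```python
def kneser_ney_discounts(n):
    # Chen & Goodman modified Kneser-Ney discount estimates:
    #   Y   = n[1] / (n[1] + 2*n[2])
    #   D_k = k - (k+1) * Y * n[k+1] / n[k]   for k = 1, 2, 3
    # n maps an adjusted-count value k to the number of n-grams
    # whose adjusted count is exactly k.
    y = n[1] / (n[1] + 2 * n[2])
    return {k: k - (k + 1) * y * n[k + 1] / n[k] for k in (1, 2, 3)}

# Made-up count-of-counts, large enough that the discounts come out
# in a sensible range (unlike the tiny minimal example above):
discounts = kneser_ney_discounts({1: 100, 2: 50, 3: 30, 4: 20})
```

Note that with `s.n[1] == 0` the leading factor Y is 0 and the formula degenerates, so on a tiny corpus the discounts fail lmplz's sanity checks and it falls back to the defaults, which is presumably why this minimal example hides the damage.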
Miraculously, the final LM is still correct: `a` gets a lower 1-gram probability than the other tokens.
I've also noticed that if I change the order of the lines in the corpus to:

```
o a r
o a t
r o t
```

the values of `s.n[1]` and `s.n[2]` become correct.
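For reference, here is a small sketch of the adjusted counts as I understand them: the adjusted count of a 1-gram w is the number of distinct tokens (counting `<s>`, which matches the counts described above) that appear immediately before w anywhere in the corpus. Run on the reordered corpus, it produces the expected values:

```python
from collections import Counter, defaultdict

def unigram_adjusted_counts(sentences):
    # Adjusted count of a 1-gram w = number of distinct tokens
    # (including <s>) that appear immediately before w in the corpus.
    preceders = defaultdict(set)
    for line in sentences:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            preceders[word].add(prev)
    return {w: len(p) for w, p in preceders.items()}

# The reordered corpus from above, which lmplz handles correctly:
corpus = ["o a r", "o a t", "r o t"]
adjusted = unigram_adjusted_counts(corpus)

# s.n[k] = number of 1-grams whose adjusted count is exactly k
n = Counter(adjusted.values())
# n == Counter({2: 4, 1: 1}), i.e. s.n[1] == 1 and s.n[2] == 4
```

Note that the preceder sets, and hence these adjusted counts, do not depend on the order of the corpus lines at all, which suggests the order sensitivity is specific to lmplz's streaming computation.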
I haven't been able to debug the code because it looks rather complicated.
Could someone look into this?
Another example where KenLM miscalculates `s.n[1]` and `s.n[2]` can be found in #405. In that example, the error does affect the discounts and the probabilities in the final LM.