Deinflection of 見とれていたら #1966

Tomalak · 2024-09-04T03:55:27Z

This is currently (v1.20.0) deinflected as (< humble or Kansai dialect < continuous < potential < continuous < -tara) from 見る, which seems a little weird, not least because of the duplicated "continuous".

Deinflecting as (< continuous < -tara) from 見とれる makes more sense. That is only the second hit in the result list, even when looking up 見とれる itself.

birtles · 2024-09-04T04:08:39Z

@enellis Any ideas what went wrong here?

We certainly should fix 見とれる too. I thought we had a rule that prioritized longer matches so I'm surprised that 見る comes before 見とれる.

Tomalak · 2024-09-04T06:58:20Z

It would probably be useful to take the number of deinflection steps into account for result sorting - in cases where a verb form can be deinflected in more than one way, the simpler deinflection chains should sort higher.
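
As a rough illustration of the idea, here is a minimal sketch that sorts results by the number of deinflection steps. The `LookupResult` shape and the `sortByChainLength` name are assumptions made for this example, not the extension's actual types.

```typescript
// Hypothetical result shape; the field names are assumptions for illustration.
interface LookupResult {
  word: string;
  reasonChain: string[]; // inflections removed to reach `word`
}

// Shorter deinflection chains sort first. Array.prototype.sort is stable in
// modern engines, so results with equal chain length keep their prior order.
function sortByChainLength(results: LookupResult[]): LookupResult[] {
  return results
    .slice()
    .sort((a, b) => a.reasonChain.length - b.reasonChain.length);
}

// The two readings of 見とれていたら from the report above.
const sortedByChain = sortByChainLength([
  {
    word: '見る',
    reasonChain: [
      'humble or Kansai dialect',
      'continuous',
      'potential',
      'continuous',
      '-tara',
    ],
  },
  { word: '見とれる', reasonChain: ['continuous', '-tara'] },
]);
```

With this ordering the two-step chain from 見とれる is listed ahead of the five-step chain from 見る.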

enellis · 2024-09-04T07:50:01Z

Also, recognizing the continuous form twice doesn't make any sense grammatically, right? So we should probably add a rule to prevent this.
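
One way such a rule could look, sketched under the assumption that a reason chain is simply an array of reason names. How the extension actually represents and prunes chains is not shown in this thread, so `hasDuplicateReason` and `dropInvalidChains` are hypothetical helpers.

```typescript
// Returns true when the same reason appears more than once in a chain.
function hasDuplicateReason(chain: ReadonlyArray<string>): boolean {
  return chain.some((reason, index) => chain.indexOf(reason) !== index);
}

// Keeps only the chains in which every reason is unique.
function dropInvalidChains(chains: string[][]): string[][] {
  return chains.filter((chain) => !hasDuplicateReason(chain));
}

// The chain from 見る repeats "continuous", so only the 見とれる chain survives.
const keptChains = dropInvalidChains([
  ['humble or Kansai dialect', 'continuous', 'potential', 'continuous', '-tara'],
  ['continuous', '-tara'],
]);
```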

enellis · 2024-09-28T10:45:33Z

After preventing duplicate reasons in the reason chains, the results for 見とれていたら are now as expected:

However, when scanning 見とれる, the result for 見る somehow still appears before the result for 見とれる.

EDIT: Hmm, something seems off with the sorting in general. When matching なる, I would expect 成る to be the top result, but it only appears fifth.

birtles · 2024-09-30T01:54:58Z

> EDIT: Hmm, something seems off with the sorting in general. When matching なる, I would expect 成る to be the top result, but it only appears fifth.

Yeah, there's something off there. I added some logic to prioritize kana-only entries when searching with kana and it seems to be throwing the result off here. I have an alternative implementation that gets this case right so I'll have to look into exactly why. Unfortunately I'm mostly unavailable this week so it might be a few days.

birtles · 2024-10-04T06:20:05Z

> However, when scanning 見とれる, the result for 見る somehow still appears before the result for 見とれる.

I had a look into this and here's what I worked out so far.

Firstly, the 見とれる/見る case is different to the なる/成る case.

The 見とれる/見る case

This comes about due to the fix for #1722. That is, after deinflecting, we sort within the results by priority.

For the input 見とれる we end up producing deinflection candidates "見とれる, 見とれるる, 見とる, 見る". We look each up to see what matching results we find and then sort the results.

For the issue in #1722, that allows us to sort 進む before 進ぶ when deinflecting 進んでいます。

For this case, however, it will mean we'll sort the result for 見る before the result for 見とれる.

I think it's legitimate to address that by first sorting by the length of the matched text.

I've implemented that locally and so far it seems to work.

(Edit: Now pushed to the fix-sorting branch.)
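
A minimal sketch of that two-level sort, assuming "length of the matched text" means the length of the deinflected candidate that matched a dictionary entry. The `CandidateResult` shape and the priority numbers are made up for illustration.

```typescript
interface CandidateResult {
  candidate: string; // deinflected form that matched a dictionary entry
  priority: number; // hypothetical score, higher means more common
}

// Longer candidates first; candidates of equal length fall back to priority.
function sortCandidates(results: CandidateResult[]): CandidateResult[] {
  return results
    .slice()
    .sort(
      (a, b) =>
        b.candidate.length - a.candidate.length || b.priority - a.priority
    );
}

// Scanning 見とれる: the longer candidate wins despite its lower priority.
const scannedMitoreru = sortCandidates([
  { candidate: '見る', priority: 100 },
  { candidate: '見とれる', priority: 20 },
]);

// The #1722 situation: equal lengths, so priority still decides.
const scannedSusunde = sortCandidates([
  { candidate: '進ぶ', priority: 1 },
  { candidate: '進む', priority: 50 },
]);
```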

The なる/成る case

This comes about because of #1610 and #1657. That is, when you look up a kana result, we'll favor words that have kana-only headwords.

In that sense, it's sort-of doing the right thing. The first result is 生る whose only sense is marked as "usually kana" so it gets the special "kana match" treatment.

It's unfortunate, however, because 成る has 11 senses, 10 of which are marked as "usually kana". If all 11 senses were marked as "usually kana" it would have gotten the special treatment.

So I suspect the sorting heuristic needs some tweaking so that if some portion of the matched senses are marked as "usually kana" we give it the special "kana match" ranking.
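
To make the proposed tweak concrete, here is a small sketch that grants the "kana match" ranking once a given share of senses is marked "usually kana". The `Sense` shape, the `isKanaMatch` name, and the 0.5 threshold are all assumptions; the thread does not say what portion is used.

```typescript
interface Sense {
  usuallyKana: boolean;
}

// True when at least `minShare` (0..1) of the senses are "usually kana".
function isKanaMatch(senses: Sense[], minShare: number): boolean {
  if (senses.length === 0) {
    return false;
  }
  const kanaCount = senses.filter((sense) => sense.usuallyKana).length;
  return kanaCount / senses.length >= minShare;
}

// 成る as described above: 11 senses, 10 of them "usually kana".
// Which sense lacks the annotation is arbitrary here.
const naruSenses: Sense[] = [];
for (let i = 0; i < 11; i++) {
  naruSenses.push({ usuallyKana: i !== 0 });
}

const strictRule = isKanaMatch(naruSenses, 1); // every sense must be marked
const relaxedRule = isKanaMatch(naruSenses, 0.5); // hypothetical threshold
```

Under the strict rule 成る misses out because of a single unmarked sense, while the relaxed rule accepts it.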

Tomalak · 2024-10-04T07:48:10Z

> This comes about due to the fix for #1722

Ironically, that was reported by me as well...

> I think it's legitimate to address that by first sorting by the length of the matched text.

I.e. longer matches sort first? Wouldn't this still lead to ties within candidate matches of equal length? That's where my idea came from to take into account the length of the deinflection chain as well.

…king up by kana Without this, when we look up "なる" we'll fail to prioritize the entry for 成る because 1 of its 11 senses is not marked as "usually kana". This also ensures we don't consider non-English senses (which don't have "usually kana" annotations) or unmatched senses. See #1966.

birtles · 2024-10-05T10:49:03Z

> > I think it's legitimate to address that by first sorting by the length of the matched text.
>
> I.e. longer matches sort first? Wouldn't this still lead to ties within candidate matches of equal length? That's where my idea came from to take into account the length of the deinflection chain as well.

If there are ties within candidate matches of equal length, they will be sorted by priority.

I suspect there are some cases where the length of the deinflection chain makes sense but I haven't found any yet.

enellis · 2024-10-21T13:29:39Z

I just stumbled upon this case: 同じ.

I would expect 同じ to be the first result as it is more common, and then 同じる < masu-stem. Somehow it is the other way around.

birtles · 2024-10-22T02:30:20Z

> I just stumbled upon this case: 同じ.
>
> I would expect 同じ to be the first result as it is more common, and then 同じる < masu-stem. Somehow it is the other way around.

Oh, yes, that's definitely a regression. I'll have to investigate. Maybe we do need to sort by the length of the deinflection chain as suggested by @Tomalak after all.

Konstantin-Glukhov · 2024-11-26T03:21:52Z

Shouldn't the match be determined by a word boundary first and then the frequency (commonness)?

enellis · 2024-12-02T17:23:54Z

> Shouldn't the match be determined by a word boundary first and then the frequency (commonness)?

Could you explain what you mean by word boundary exactly?

Konstantin-Glukhov · 2024-12-03T01:49:50Z

I mean a regex word boundary. E.g. when "同じ." is moused over in the examples above, the definition is shown for "同じる" first, though "同じ." is clearly bounded by a non-word character, the dot ".".

birtles · 2024-12-03T01:53:43Z

> I mean a regex word boundary. E.g. when "同じ." is moused over in the examples above, the definition is shown for "同じる" first, though "同じ." is clearly bounded by a non-word character, the dot ".".

Yes, we don't read past the '.' as part of the regex that fetches text to look up.

The 同じ case has been fixed on main using the length of deinflection chain approach (f125f91).

Konstantin-Glukhov · 2024-12-03T02:05:10Z

I did not review the code, but judging by this thread, I assume you still do de-inflection before the dictionary lookup of the match?

birtles · 2024-12-03T02:10:28Z

Yes, we generate all the deinflection candidates along with their word types, then we check which ones exist in the dictionary with the corresponding word type.
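
The flow described here can be sketched roughly as follows. The two rules, the three dictionary entries, and every identifier are toy assumptions for illustration; the extension's real rule table and dictionary access are far larger and are not shown in this thread.

```typescript
type WordType = 'ichidan' | 'godan' | 'other';

interface Rule {
  from: string; // inflected ending
  to: string; // dictionary-form ending
  type: WordType; // word type the result must have
  reason: string;
}

interface Candidate {
  word: string;
  type: WordType | 'any'; // 'any' marks the undeinflected input itself
  reasons: string[];
}

// Toy rule table and dictionary, both hypothetical.
const rules: Rule[] = [
  { from: 'ちゃった', to: 'る', type: 'ichidan', reason: '-chau, past' },
  { from: 'り', to: 'る', type: 'godan', reason: 'masu stem' },
];
const dictionary: Record<string, WordType[] | undefined> = {
  食べる: ['ichidan'],
  預かる: ['godan'],
  預かり: ['other'],
};

// Step 1: generate the input itself plus every single-step deinflection.
function deinflect(input: string): Candidate[] {
  const candidates: Candidate[] = [{ word: input, type: 'any', reasons: [] }];
  for (const rule of rules) {
    if (input.length > rule.from.length && input.slice(-rule.from.length) === rule.from) {
      const stem = input.slice(0, input.length - rule.from.length);
      candidates.push({ word: stem + rule.to, type: rule.type, reasons: [rule.reason] });
    }
  }
  return candidates;
}

// Step 2: keep candidates that exist in the dictionary with a matching type.
function lookup(input: string): Candidate[] {
  return deinflect(input).filter((candidate) => {
    const types = dictionary[candidate.word];
    if (types === undefined) {
      return false;
    }
    return candidate.type === 'any' || types.indexOf(candidate.type) !== -1;
  });
}

const tabechatta = lookup('食べちゃった').map((c) => c.word);
const azukari = lookup('預かり').map((c) => c.word);
```

In this toy setup 食べちゃった yields only 食べる, while 預かり yields both the direct match and the root 預かる.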

Konstantin-Glukhov · 2024-12-03T02:24:51Z

Shouldn't it be the other way around? Looking in the dictionary first, then doing the frequency sort, and then doing de-inflection?

birtles · 2024-12-03T02:29:21Z

I might be misunderstanding the suggestion but the trouble is you have to de-inflect first in order to have something to look up in the dictionary.

For example, if the input text is "食べちゃった", you have to de-inflect to 食べる first so you can look it up. (You could progressively shorten the input until you get to "食べ*" but then you'd have hundreds of matches to sort through.)

Konstantin-Glukhov · 2024-12-03T02:33:19Z

I think it is more logical to look up "食べちゃった" first and, if no match is found, then de-inflect to 食べる and look up again.

birtles · 2024-12-03T03:00:11Z

I think it's good to look up both so that in cases like 預かり, for example, you'd get both the direct match and the root 預かる. (The current version will get those in the wrong order, but in the next version it will put 預かり before 預かる.)

There are a lot of cases in JMdict where there is an entry for an inflected form so this comes up quite a bit.

Konstantin-Glukhov · 2024-12-03T03:26:11Z

Yes, I agree, though the match lookup should be first, de-inflection is just additional info.

enellis · 2024-12-03T07:27:32Z

> Yes, I agree, though the match lookup should be first, de-inflection is just additional info.

I disagree with this one. Deinflected results are equally valid. This is especially true for the masu stem of verbs, which is used very often as a conjunction.
Take this sentence for example: 国内には行政区分として47の都道府県があり、日本人と外国人が居住し、日本語を通用する。 Both あり and 居住し are conjunctional.
I would even argue that, in the case of あり, we should think about keeping ある < masu stem as the first result, as it is so much more commonly used in this context than the word 蟻 / ant.

Konstantin-Glukhov · 2024-12-03T08:14:28Z

You are correct: if you do de-inflection anyway, it does not matter in what order you do the lookups, the match or the de-inflection results. With two competing entries, your decision on the display order based on frequency is correct. Probably you already have pre-computed de-inflection results for "to be" as the DB index with the order display weight?

enellis · 2024-12-03T12:52:34Z

> Maybe we do need to sort by the length of the deinflection chain as suggested by @Tomalak after all.

@birtles I'm so sorry for bringing this up again, but after reconsidering, particularly regarding the あり case, I believe the initial approach of sorting by the length of the matched text was actually the right choice. The issue with 同じ arises from the unique behavior of Ichidan verbs, which effectively drop their final kana during inflections. I suggest we revert to sorting by matched text length, with an adjustment to subtract 1 from this value if the result is a deinflection and an Ichidan verb.
This should lead to correct results for all aforementioned edge cases.
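
A minimal sketch of this adjusted-length proposal, assuming each result knows whether it came from a deinflection and whether it is an Ichidan verb. The `ScoredResult` shape and all priority values are invented for the example.

```typescript
interface ScoredResult {
  word: string;
  isDeinflected: boolean;
  isIchidan: boolean;
  priority: number; // hypothetical score, higher means more common
}

// Matched length, minus 1 for deinflected Ichidan verbs (whose final る
// was added back during deinflection rather than read from the text).
function effectiveLength(result: ScoredResult): number {
  const penalty = result.isDeinflected && result.isIchidan ? 1 : 0;
  return result.word.length - penalty;
}

function sortResults(results: ScoredResult[]): ScoredResult[] {
  return results
    .slice()
    .sort(
      (a, b) =>
        effectiveLength(b) - effectiveLength(a) || b.priority - a.priority
    );
}

// 同じ: both have effective length 2, so the more common 同じ wins.
const onaji = sortResults([
  { word: '同じる', isDeinflected: true, isIchidan: true, priority: 5 },
  { word: '同じ', isDeinflected: false, isIchidan: false, priority: 50 },
]);

// 見とれる: effective length 4 beats 見る at 1, regardless of priority.
const mitoreru = sortResults([
  { word: '見る', isDeinflected: true, isIchidan: true, priority: 100 },
  { word: '見とれる', isDeinflected: false, isIchidan: true, priority: 20 },
]);

// あり: ある is not Ichidan, so lengths tie at 2 and priority decides.
const ari = sortResults([
  { word: 'あり', isDeinflected: false, isIchidan: false, priority: 30 },
  { word: 'ある', isDeinflected: true, isIchidan: false, priority: 90 },
]);
```

Note that under these assumed priorities ある is listed before あり, which is the outcome this proposal aims for and the point of disagreement in the discussion.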

birtles · 2024-12-05T02:53:10Z

> > Maybe we do need to sort by the length of the deinflection chain as suggested by @Tomalak after all.
>
> @birtles I'm so sorry for bringing this up again, but after reconsidering, particularly regarding the あり case, I believe the initial approach of sorting by the length of the matched text was actually the right choice. The issue with 同じ arises from the unique behavior of Ichidan verbs, which effectively drop their final kana during inflections. I suggest we revert to sorting by matched text length, with an adjustment to subtract 1 from this value if the result is a deinflection and an Ichidan verb. This should lead to correct results for all aforementioned edge cases.

Thanks for looking into this. I'm not sure I'm quite convinced. あり seems like more of an edge case to me. Without any context, if you look up あり I think the first result should be 蟻. For example, if you're looking up the words for おつかいありさん then I think it would be odd if it presented you with 有る first.

enellis mentioned this issue Sep 30, 2024

fix: prevent duplicate reasons in deinflection reason chains #2013

Merged

birtles added a commit that referenced this issue Oct 4, 2024

fix: sort deinflected results by length

210b383

Without this, when deinflecting 見とれる we'll end up sorting the results for 見る before the results for 見とれる itself. See #1966.

birtles added a commit that referenced this issue Oct 5, 2024

fix: sort deinflected results by length

3668cb4

Without this, when deinflecting 見とれる we'll end up sorting the results for 見る before the results for 見とれる itself. See #1966.

birtles added a commit that referenced this issue Oct 5, 2024

fix: sort deinflected results by length

2599e65

Without this, when deinflecting 見とれる we'll end up sorting the results for 見る before the results for 見とれる itself. See #1966.
