Multi Language Support #6

Open: 23 commits to be merged into base: master

simonegli8

Motivation:

When developing software with a UI in a language other than English, the spell checker needs to be able to check both English and the UI language,
because many programmers still use English for comments, error messages, etc.

Implemented Features:

  • Support for spell checking multiple languages.
    The checker checks for correct spelling in multiple configurable languages in the same text. Languages are
    determined on a per-sentence basis. Foreign-language sequences of fewer than 3 words are treated as misspellings.
  • Support for custom dictionaries in dic & lex format.
  • Configuration dialog to configure languages & custom dictionaries.
  • Exclusion of email addresses.

Simon Egli added 2 commits October 4, 2012 21:09
Motivation:

When developing software with a UI in a language other than English, the spell checker needs to be able to check both English and the UI language,
because many programmers still use English for comments, error messages, etc.

Implemented Features:

- Support for spell checking multiple languages.
   The checker checks for correct spelling in multiple configurable languages in the same text.
- Exclusion of Email addresses.

Todo:
Currently the languages are statically configured in the SpellingTagger.Langauge string.
I've never implemented a Visual Studio extension, so I don't know how to create a configuration dialog in the Visual Studio settings.
@NoahRic
Owner

NoahRic commented Oct 4, 2012

I don't like this approach, sorry. The WPF bits of spellchecking are the most resource intensive, and I've had tons of complaints about it. This makes it at least twice as expensive (more, if you add other languages). It has the net effect of using the union of both dictionaries as valid words; that works for mixed-language cases, but will cause false negatives in non-mixed cases (e.g. words that are misspelled in one language but properly spelled in the other; if Spanish were supported, then "si" wouldn't be a misspelling, even if the file was English and that was a misspelling of "is").

I think a better approach is to:

  1. Write a file on disk with the currently selected language (if it doesn't exist, it defaults to en-US)
  2. Add a getter/setter to ISpellingDictionaryService to get/set the current language. Set writes to the file from step 1 and refreshes the spell checker in all open files (this may want a new event)
  3. In the smart tag provider, provide a new smart action that has a collection of actions for each language.
  4. When the user selects a language in that smart tag sub-menu, the default language is changed via the setter from step 2.

Downsides include that being more, only being triggerable if there are misspellings (likely, if the language is wrong), being really expensive to switch (hopefully only once), and not supporting mixed-language projects.

If you are interested in trying that out, go for it :) If not, I may get to it in the next couple of weeks.
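
Steps 1 and 2 above could look roughly like the following. This is a minimal Python sketch rather than the extension's C# code; the `LanguageSetting` class, the `subscribe` callback list (standing in for the "new event"), and the one-line settings file are all assumptions made for illustration and are not part of ISpellingDictionaryService.

```python
from pathlib import Path

DEFAULT_LANGUAGE = "en-US"


class LanguageSetting:
    """Hypothetical stand-in for the proposed current-language getter/setter."""

    def __init__(self, path):
        self._path = Path(path)
        self._listeners = []

    def subscribe(self, callback):
        # Callbacks play the role of the "language changed" event.
        self._listeners.append(callback)

    @property
    def language(self):
        # Step 1: a missing (or empty) file means the default language.
        try:
            stored = self._path.read_text(encoding="utf-8").strip()
        except FileNotFoundError:
            return DEFAULT_LANGUAGE
        return stored or DEFAULT_LANGUAGE

    @language.setter
    def language(self, value):
        # Step 2: persist the choice, then tell every open file to re-check.
        self._path.write_text(value, encoding="utf-8")
        for callback in list(self._listeners):
            callback(value)
```

Reading the file on every access keeps the sketch short; a real service would probably cache the value.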

@simonegli8
Author

Hi,

The problem is that, as a German developer, I can say that mixed languages in text are very common. In a .cs file, comments and strings are in different languages most of the time. Even strings, if one does not use Resources/localization, are in English for error messages and German for the UI. A spell checker supporting only one language is barely usable in this scenario.
Also, you usually won’t need more than 2 languages.
If the languages are configurable in an easy way, as you proposed with the smart tag provider, then this feature can easily be turned off by specifying only one language.
As for performance, I think the approach is not twice as expensive, because it first checks spelling with the current language, and only on a spelling error does it need to check the other languages.
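
The lazy fallback described above can be sketched as follows. This is a toy Python illustration with hand-made word sets, not the PR's C# code; `is_misspelled` and the sample vocabularies are hypothetical. It returns the number of dictionary lookups so the cost argument is visible: a word found in the primary language costs one lookup, and only the remaining words fall through to the other languages.

```python
ENGLISH = {"the", "file", "is", "missing"}
GERMAN = {"die", "datei", "fehlt"}


def is_misspelled(word, dictionaries):
    """Return (misspelled, lookups); `dictionaries` lists the primary language first."""
    lowered = word.lower()
    lookups = 0
    for vocabulary in dictionaries:
        lookups += 1
        if lowered in vocabulary:
            # Accepted by this language, so later dictionaries are never consulted.
            return False, lookups
    return True, lookups


print(is_misspelled("file", [ENGLISH, GERMAN]))   # found in the primary language
print(is_misspelled("Datei", [ENGLISH, GERMAN]))  # needs the fallback lookup
print(is_misspelled("flie", [ENGLISH, GERMAN]))   # unknown in every language
```

Note that this still accepts the union of all dictionaries, which is the source of the false negatives raised earlier in the thread.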

You’re right about the false negatives. Maybe there could be a way to set the language for individual strings or comments, and for whole HTML files, as HTML usually has only one language. It could work like this: when the language changes within a string, the minority-language words are checked again in the majority language. This way the checker could determine the language automatically.
Is there a way to identify individual strings & comments? Is it correct that a string corresponds to a single span?

As I see it, you always pass one word to the TextBox for spell checking. Would it improve performance to pass the whole text at once?

The approach you propose might have better performance, but I don’t think the need to switch languages is acceptable if one has different files with different languages open. This should be an automatic process.

Greetings & Blessings,

Simon Egli

Different language TextBoxes are only assigned text when really needed.
@simonegli8
Author

Pushed another version, with performance improvements.

@NoahRic
Owner

NoahRic commented Oct 4, 2012

Thanks Simon, inlined:

The problem is that, as a German developer, I can say that mixed languages in text are very common. In a .cs file, comments and strings are in different languages most of the time. Even strings, if one does not use Resources/localization, are in English for error messages and German for the UI. A spell checker supporting only one language is barely usable in this scenario. Also, you usually won’t need more than 2 languages.

Fair enough, though this was never a use case I was aiming for. I can see how single-language spellchecking would be frustrating.

If the languages are configurable in an easy way, as you proposed with the smart tag provider, then this feature can easily be turned off by specifying only one language.

I'd prefer it to be, in this order:

  1. Default is English (people have been asking for that for a long time)
  2. User can specify single other default language
  3. User can specify multiple languages

As for performance, I think the approach is not twice as expensive, because it first checks spelling with the current language, and only on a spelling error does it need to check the other languages.

Ah, my apologies. I believe I was confused by the code, which may be wrong(?). It looks like, on a misspelling, it checks the other languages to see if they have misspellings, but it doesn't seem to be populating them first with the text to check. In any event, as you say, it is only twice as expensive for the misspelled words.

You’re right about the false negatives. Maybe there could be a way to set the language for individual strings or comments, and for whole HTML files, as HTML usually has only one language. It could work like this: when the language changes within a string, the minority-language words are checked again in the majority language. This way the checker could determine the language automatically.
Is there a way to identify individual strings & comments? Is it correct that a string corresponds to a single span?

That starts to sound really complicated. I'm fine with the language union case, as long as the user explicitly picks it.

As I see it, you always pass one word to the TextBox for spell checking. Would it improve performance to pass the whole text at once?

Sadly, exactly the opposite. WPF's natural language processing is incredibly processor intensive, and I got a 3-4x performance improvement by pre-splitting words (and removing ignored words).
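
Pre-splitting as described above might look like this. It is a hedged Python sketch: the tokenizer regex, the `words_to_check` name, and the idea of an `ignored` set of lowercase words are assumptions, since the extension's actual splitting and ignore rules are not shown in this thread.

```python
import re

# Runs of Unicode letters, optionally joined by apostrophes (e.g. "don't").
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def words_to_check(text, ignored):
    """Yield (offset, word) only for words that still need the expensive check."""
    for match in WORD.finditer(text):
        word = match.group()
        if word.lower() in ignored:
            continue  # filtered out before the costly checker ever sees it
        yield match.start(), word


for offset, word in words_to_check("TODO: fix teh parser", {"todo"}):
    print(offset, word)
```

Keeping the offset alongside each word lets the caller map a misspelling back to its position in the original text.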

The approach you propose might have better performance, but I don’t think the need to switch languages is acceptable if one has different files with different languages open. This should be an automatic process.

Agreed, setting the default for multi-language scenarios would be a pain. If you have a solution that offers the three things I listed above (default English, allow selecting a single other language, allow selecting multiple languages), that would be ideal. I'm thinking, then:

  1. Smart tags have a menu called "Language settings"
  2. First tag in that set is a disabled action with the title: "Currently active language(s): blah"
  3. The rest of the tags in that set are the languages you can pick. In the single language case, you can make the currently selected language inactive. In the multiple languages case, clicking will add/remove the language from the set, unless it is the last one (I guess).
  4. There is a second set which is called either "Use multiple languages" / "Use single language" (depending on state)

That's still kinda ugly, but it would let a user (slowly) right-click to pick multiple languages and/or which languages are used. Thoughts?
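
The menu behaviour in the four points above can be modelled as a small state machine. The Python sketch below is only an illustration: `LanguageMenu` and its method names are hypothetical, and the rule for which language survives when switching back to single-language mode is an assumption, since the proposal does not specify it.

```python
class LanguageMenu:
    """Toy model of the proposed "Language settings" smart tag menu."""

    def __init__(self, available, active="en-US"):
        self.available = list(available)
        self.active = [active]   # English is the default
        self.multiple = False    # start in single-language mode

    def title(self):
        # Point 2: the disabled header entry.
        return "Currently active language(s): " + ", ".join(self.active)

    def mode_action(self):
        # Point 4: the label names the mode a click would switch to.
        return "Use single language" if self.multiple else "Use multiple languages"

    def is_enabled(self, language):
        # Single mode: the current language is listed but cannot be clicked.
        # Multiple mode: the last remaining language cannot be removed.
        if language not in self.active:
            return True
        return self.multiple and len(self.active) > 1

    def click(self, language):
        if language not in self.available or not self.is_enabled(language):
            return
        if not self.multiple:
            self.active = [language]
        elif language in self.active:
            self.active.remove(language)
        else:
            self.active.append(language)

    def toggle_mode(self):
        self.multiple = not self.multiple
        if not self.multiple:
            # Assumption: going back to single mode keeps the first active language.
            self.active = self.active[:1]
```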

… window.

Implemented Features:

- Support for spell checking multiple languages.
   The checker checks for correct spelling in multiple configurable languages in the same text.
- Configurable language support through smarttags.
- Configuration of custom Dictionaries.

The spell checker now determines the language on a per-sentence basis. It tolerates foreign-language text in a sentence if it forms a group of at least SpellingTagger.MinForeignWordSequence words, which is currently set to 3.
If only one language is enabled, performance is identical to the old algorithm; with more languages, it decreases in proportion to the number of misspellings.
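
One way to read the per-sentence rule is sketched below in Python. This is not the algorithm from the PR (which is not shown in this thread); it is a simplified guess that picks the language knowing the most words as the sentence language and then tolerates runs of at least 3 consecutive words from one other language. The function name, the tie-breaking rule, and the toy vocabularies are all assumptions.

```python
MIN_FOREIGN_WORD_SEQUENCE = 3  # mirrors SpellingTagger.MinForeignWordSequence


def misspelled_words(sentence_words, dictionaries):
    """Return (sentence_language, flagged_words).

    `dictionaries` maps language name to a set of lowercase words and must not
    be empty; on a tie, the language listed first becomes the sentence language.
    """
    words = [w.lower() for w in sentence_words]
    main = max(dictionaries, key=lambda lang: sum(w in dictionaries[lang] for w in words))
    known = dictionaries[main]
    flagged = []
    i = 0
    while i < len(words):
        if words[i] in known:
            i += 1
            continue
        # Longest run starting here that a single other language accepts.
        best = 0
        for lang, vocabulary in dictionaries.items():
            if lang == main:
                continue
            j = i
            while j < len(words) and words[j] not in known and words[j] in vocabulary:
                j += 1
            best = max(best, j - i)
        if best >= MIN_FOREIGN_WORD_SEQUENCE:
            i += best  # long enough: tolerated as foreign-language text
        else:
            flagged.append(sentence_words[i])
            i += 1
    return main, flagged


DICTIONARIES = {
    "en-US": {"the", "dialog", "shows", "file", "is", "missing"},
    "de-DE": {"die", "datei", "nicht", "gefunden", "fehlt"},
}
print(misspelled_words(["The", "dialog", "shows", "Datei", "nicht", "gefunden"], DICTIONARIES))
print(misspelled_words(["The", "Datei", "is", "missing"], DICTIONARIES))
```

In the first sentence the three German words form a long enough run and are accepted; in the second, the lone German word is flagged, which is how short foreign sequences are kept from hiding misspellings.
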
@simonegli8
Author

Hi,

I pushed a new, totally revised version. It still has some minor bugs, so the final version will take another half day of work. This version has Language Settings support in the SmartTags. One can enable/disable any language, use custom dictionaries (not fully tested), and supply default custom dictionaries for languages in the VSIX; it supports the .lex & .dic formats (rudimentarily).

The spell checker is smart and determines the language on a per-sentence basis. It allows foreign-language text in a sentence, but only in sequences of a minimum length, currently set to 3 words, to avoid false negatives.

Admittedly, with many misspellings it is rather slow, but this is all configurable; with only en-US enabled (which is the default) the speed is as before.

I’ll be on holiday next week, so I can’t finish it until 16. September.

Greetings & Blessings,

Simon Egli

No bugs found so far. Ready for testing, or even RTM.

Implements multiple language feature. Languages are checked on a per sentence basis. Foreign language sequences of at least MinForeignWordSequence (3) words are allowed in a sentence, smaller sequences are treated as misspelled.
Custom Dictionaries are also supported either in the ISpell dic or in the WPF lex format.
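
As a rough illustration of what reading those two formats involves, here is a Python sketch. It assumes a .lex file is plain text with one word per line and an optional `#LID` locale header on the first line, and that a .dic file lists one word per line with optional `/flags` suffixes and possibly a leading entry count (as Hunspell-style files have). Affix expansion is ignored entirely, and `load_word_list` is a hypothetical name, not the loader in this PR.

```python
def load_word_list(lines):
    """Collect the plain words from the lines of a .lex or .dic file."""
    words = set()
    for number, raw in enumerate(lines):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a "#LID 1033"-style header
        if number == 0 and line.isdigit():
            continue  # leading entry count in Hunspell-style .dic files
        # Drop affix flags such as "hello/MS"; the flags are not expanded.
        words.add(line.split("/", 1)[0])
    return words


print(sorted(load_word_list(["#LID 1031", "Datei", "Fenster"])))
print(sorted(load_word_list(["2", "hello/MS", "world"])))
```
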
@simonegli8
Author

Hello Noah,

Sorry, I couldn’t reach you via mail and thought you weren’t answering, but I have checked this now, and I think the mails never left my mail server, as I run my own and it’s still in an experimental stage, ugh.

As for Spellchecker, the thing has been finished for a long time, apart maybe from some bugs and additional custom dictionaries for languages the WPF spell checker doesn’t support, which could be included in the VSIX. I use it in my VS every day.

A short summary:
Features:

  • Smart tag context menu that has two new things: a list of active languages and a menu entry Language Options. The Options entry leads to a WPF dialog where one can configure languages.
  • Custom and default dictionaries in .dic and .lex formats. The default dictionaries can be included in the VSIX, so Spellchecker could support more languages than the WPF checker supports out of the box.
  • Multi-language support on a per-sentence basis. Foreign-language text in a sentence is limited to sequences of at least 3 words to avoid false negatives.

Lacking features:
Re-checking of open files after language configuration changes, which I don’t know how to implement myself.

Unfortunately, the algorithm for determining the language became a bit complicated and not very pretty. But apart from being ugly, it is smart, and determines the language on a per-sentence basis, choosing from the active languages. It allows foreign-language text in a sentence, but only in sequences of a minimum length, currently set to 3 words, to avoid false negatives.

Performance drops in proportion to the number of misspellings times the number of active languages for single-language sentences, and for mixed-language sentences there is additional overhead when the checker chooses a language and possibly has to check some words again in that language. With one active language, performance is approximately the same as it was with your solution.

A small bug:
Once, when I configured a language, it didn’t appear in the list in the SmartTag menu, but I couldn’t reproduce that and did not investigate further. In everyday use I never change language settings anymore, so I would have to test a little to track down the bug.

TODO:

  • Implement Re-Checking on language option modifications.
  • Include default .dic or .lex dictionaries in the VSIX for languages that WPF doesn’t support out of the box.
  • Fix the above bug.
  • Conversion to VS 2012 ?

I didn’t receive any mails from you, so if I can’t get through to you, I wonder if I should publish my fork in the VS Gallery myself? But I would want to avoid this since re-checking is lacking.

Greetings & Blessings,

Simon Egli

@NoahRic
Owner

NoahRic commented Nov 27, 2012

Hey Simon, good to hear from you :)

I'll take another look at the changes (
https://github.com/simonegli8/Spellchecker/compare/master). Since it's a
pretty huge overall change, it may take a little while. It would be a lot
easier if you could fix the indentation to at least be consistent; you can
do that pretty easily by opening the files and hitting ctrl-k ctrl-d
(Format Document). You would have to switch over your VS settings

Let's see, a few answers/thoughts:

  1. To the TODO about including other .dic/.lex dictionaries, you have to be
    pretty careful about the distribution of those things. I know a lot of open
    source apps have separate install steps for things like aspell and
    dictionaries, and I imagine some of it may be licensing/rights related.
    That was the big reason that I stuck with using WPF for spellchecking in
    the first place (can't depend on Office being installed, couldn't ship
    dictionaries with the extension). Also keep in mind that the VS update
    manager doesn't do binary diffs for updates, so users would have to
    redownload anything that ships with the extension.

  2. Publishing your own fork - regardless of what happens with this
    extension, I think you should publish your own fork. You won't have to wait
    on me to implement new features and fix bugs, you'll get more
    credit/recognition for what you are doing, and it's good to have multiple
    solutions (at least until VS includes this as a feature, I hope).

  3. "Re-checking of open files after language configuration changes, what I
    don’t know how to implement myself." - I can probably help with this
    if/when we get the change merged in.

I'll go back through the comparison this week. I hope to have some time at
night to do so.

Thanks!

-Noah

On Mon, Nov 26, 2012 at 9:14 AM, simonegli8 [email protected]:

Hello Noah,

Sorry, I couldn’t reach you via mail and tought you’re not answering, but
I checked this now, I think the mails never leaved my mailserver or
whatever, as I have my own, and its still in its experimental stage, ugh.

For Spellchecker, the thing is finished long ago, maybe apart from some
bugs and additional custom dictionaries for langauges the WPF Spellchecker
doesn’t support, that could be included in the VSIX. I use it in my VS
everyday.

A short summay:
Features:

  • Smart tag contextmenu that has two new things, first a list of active
    languages and a menu entry Language Options. The Options entry leads to a
    WPF dialog where one can configure
    languages.
  • Custom and Default dictionaries in .dic and .lex formats. The default
    dictionaries can be included in the VSIX, so Spellchecker could support
    more languages than the WPF checker supports out of the box.
  • Multi language on a per sentence basis. Foreign language in a sentence
    is limited to sequences longer than 3 words to avoid false negatives.

Lacking features:
Re-checking of open files after language configuration changes, what I
don’t know how to implement myself.

Unfortunately, the algorithm for determining the language became a bit
complicated and not very handsome. But apart from being ugly, it is smart,
and determines the language on a per sentence base, choosing from the
active languages. It allows foreign language in a sentence, but only
sequences of a minimal length, what is currently set to 3 words, to avoid
false negatives.

Performace drops proportinal to the misspellings times active languages
for single language sentences, and for mixed language sentences there is
additional overhead when the checker chooses a language and possibly has to
check some words again in that language. For one active language
performance is approximately the same as it was with your solution.

A small bug:
Once when I configured a language, it didn’t appear in the list in the
SmartTag menu, but I coulnd’t reproduce that and did not investigate
further. In everyday use I never change language settings anymore, so I
would have to test a little and track down the bug.

TODO:

  • Implement Re-Checking on language option modifications.
  • Include default .dic or .lex dictionaries in the VSIX for languages that
    WPF doesn’t support out of the box.
  • Fix the above bug.
  • Conversion to VS 2012 ?

I didn’t receive any mails from you, so if I can’t get through to you, I
wonder if I should publish my fork in the VS Gallery myself? But I would
want to avoid this since re-checking is lacking.

Greetings & Blessings,

Simon Egli

From: Noah Richards
Sent: Friday, October 5, 2012 12:59 AM
To: NoahRic/Spellchecker
Cc: simonegli8
Subject: Re: [Spellchecker] Multi Language Support (#6)

Thanks Simon, inlined:

The problem is, as a german developer, I can say that mixed languages in
text are very common. In a .cs file comments and strings most of the time
are in different languages. Even strings, if one does not use
Resources/localization, are in english for error messages, and german for
the UI. A spell checker only supporting one language is barely usable in
this scenario. Also usually you won’t need more than 2 languages.

Fair enough, though this was never a use case I was aiming for. I can see
how single-language spellchecking would be frustrating.

If the langauges are configurable in a easy way, as you proposed in the
smart tag provider, then this feature can easily be turned off, by only
specifying one language.

I'd prefer it to be, in this order:

  1. Default is english (people have been asking for that for a long time)
  2. User can specify single other default language
  3. User can specify multiple languages

For performance, I think the approach is not twice as expensive, because
it first checks spelling with the current language, and only on a spelling
error it needs to check the other langages.

Ah, my apologies. I believe I was confused at the code, which may be
wrong(?). It looks like, on a misspelling, it checks the other languages to
see if they have misspellings, but it doesn't seem to be populating them
first with the text to check. In any event, as you say, it is only twice as
expensive for the number of misspelled words.

You’re right about the false negatives. Maybe there would be a way to set
the language for individual strings or comments, and for whole HTML files,
as HTML usually has only one language. This could work such that, when the
current language changes within a string, the minority-language words are
checked again in the majority language. This way the checker could
determine the language automatically.
Is there a way to identify individual strings & comments? Is it correct
that a string corresponds to a single span?

That starts to sound really complicated. I'm fine with the language union
case, as long as the user explicitly picks it.

As I see it, you always pass one word to the TextBox for spell checking.
Would it improve performance to pass the whole text to check?

Sadly, exactly the opposite. WPF's natural language processing is
incredibly processor intensive, and I got a 3-4x performance improvement by
pre-splitting words (and removing ignored words).
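
As a rough illustration of that pre-splitting step, the Python sketch below splits text into candidate words and drops ignored ones before anything reaches the expensive checker. The `words_to_check` helper, both regular expressions, and the ignore list are assumptions, not the extension's actual rules; the email exclusion mirrors the feature listed in the pull request description.

```python
import re
from typing import AbstractSet, List, Tuple

# Simplified patterns, assumed for this sketch only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")  # letters, optional inner apostrophes


def words_to_check(
    text: str, ignored: AbstractSet[str] = frozenset()
) -> List[Tuple[int, str]]:
    """Return (offset, word) pairs that still need a spelling check."""
    # Blank out email addresses with spaces so later offsets stay valid.
    text = EMAIL.sub(lambda m: " " * len(m.group()), text)
    result = []
    for match in WORD.finditer(text):
        word = match.group()
        if word.lower() in ignored:
            continue  # skip words the user chose to ignore
        result.append((match.start(), word))
    return result
```

Keeping the offsets lets the caller map a flagged word back to its position in the original text.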

The approach you propose might have better performance, but I don’t think
the need to switch languages is acceptable if one has different files with
different languages open. This should be an automatic process.

Agreed, setting the default for multi-language scenarios would be a pain.
If you have a solution that offers the three things I listed above (default
English, allow selecting a single other language, allow selecting multiple
languages), that would be ideal. I'm thinking, then:

  1. Smart tags have a menu called "Language settings"
  2. First tag in that set is a disabled action with the title: "Currently
    active language(s): blah"
  3. The rest of the tags in that set are the languages you can pick. In the
    single language case, you can make the currently selected language
    inactive. In the multiple languages case, clicking will add/remove the
    language from the set, unless it is the last one (I guess).
  4. There is a second set which is called either "Use multiple languages" /
    "Use single language" (depending on state)

That's still kinda ugly, but it would let a user (slowly) right-click to
pick multiple languages and/or which languages are used. Thoughts?
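
The state behind such a menu could be modelled roughly as in the Python sketch below. The `LanguageSettings` class, its method names, and the language codes are all hypothetical; the sketch only captures the rules listed above (English default, single versus multiple mode, never removing the last language) and says nothing about the Visual Studio smart tag API.

```python
from typing import Iterable, List


class LanguageSettings:
    """Hypothetical model of the proposed "Language settings" menu state."""

    def __init__(self, available: Iterable[str], default: str = "en") -> None:
        self.available: List[str] = list(available)
        self.active: List[str] = [default]  # default is English
        self.multiple = False  # start in single-language mode

    def title(self) -> str:
        """Text for the disabled first entry of the menu."""
        return "Currently active language(s): " + ", ".join(self.active)

    def mode_label(self) -> str:
        """Text for the entry that switches between the two modes."""
        return "Use single language" if self.multiple else "Use multiple languages"

    def pick(self, lang: str) -> None:
        """Handle a click on a language entry."""
        if lang not in self.available:
            raise ValueError(f"unknown language: {lang}")
        if not self.multiple:
            self.active = [lang]  # single mode: replace the selection
        elif lang in self.active:
            if len(self.active) > 1:  # never remove the last language
                self.active.remove(lang)
        else:
            self.active.append(lang)

    def toggle_mode(self) -> None:
        """Switch modes; going back to single mode keeps the first language."""
        self.multiple = not self.multiple
        if not self.multiple:
            self.active = self.active[:1]
```

Keeping the first active language when switching back to single mode is one possible choice; the thread does not say which language should survive.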



@simonegli8
Author

Hi,

I’ve pushed some minor changes to Spellchecker: I changed all formatting of my sources to the standard style and replaced the nations' flags globe with an image I hold the copyright to.
I also contacted the author of NetSpell to ask whether we can distribute his dictionaries. Maybe NetSpell would also be an alternative to the WPF spellchecker.

Greetings & Blessings,

Simon Egli

P.S.
I googled for Midori yesterday and saw an article on the internet and a job advertisement from Microsoft for a programmer. For a moment I was tempted to apply, as I love this kind of stuff; I was also into managed OS development back in 1996. I ported the ETH Zurich Oberon System to the Atari-ST, and since then there has still been no cool commercial managed OS available. Oberon was lightning fast and very secure, as everything ran in the same address space and security was enforced by managed code.
Do you know if Microsoft will release something related to Midori sometime?

From: Noah Richards
Sent: Wednesday, November 28, 2012 12:35 AM
To: NoahRic/Spellchecker
Cc: simonegli8
Subject: Re: [Spellchecker] Multi Language Support (#6)

Hey Simon, good to hear from you :)

I'll take another look at the changes (
https://github.com/simonegli8/Spellchecker/compare/master). Since it's a
pretty huge overall change, it may take a little while. It would be a lot
easier if you could fix the indentation to at least be consistent; you can
do that pretty easily by opening the files and hitting ctrl-k ctrl-d
(Format Document). You would have to switch over your VS settings

Let's see, a few answers/thoughts:

  1. To the TODO about including other .dic/.lex dictionaries, you have to be
    pretty careful about the distribution of those things. I know a lot of open
    source apps have separate install steps for things like aspell and
    dictionaries, and I imagine some of it may be licensing/rights related.
    That was the big reason that I stuck with using WPF for spellchecking in
    the first place (can't depend on Office being installed, couldn't ship
    dictionaries with the extension). Also keep in mind that the VS update
    manager doesn't do binary diffs for updates, so users would have to
    redownload anything that ships with the extension.

  2. Publishing your own fork - regardless of what happens with this
    extension, I think you should publish your own fork. You won't have to wait
    on me to implement new features and fix bugs, you'll get more
    credit/recognition for what you are doing, and it's good to have multiple
    solutions (at least until VS includes this as a feature, I hope).

  3. "Re-checking of open files after language configuration changes, which I
    don’t know how to implement myself." - I can probably help with this
    if/when we get the change merged in.

I'll go back through the comparison this week. I hope to have some time at
night to do so.

Thanks!

-Noah

On Mon, Nov 26, 2012 at 9:14 AM, simonegli8 wrote:

Hello Noah,

Sorry, I couldn’t reach you via mail and thought you weren’t answering, but
I checked this now, and I think the mails never left my mail server, as I
run my own and it’s still in its experimental stage, ugh.

As for Spellchecker, the thing has been finished for a long time, apart
from maybe some bugs and additional custom dictionaries for languages the
WPF Spellchecker doesn’t support, which could be included in the VSIX. I use
it in my VS every day.

A short summary:
Features:

  • Smart tag context menu that has two new things: first, a list of active
    languages, and second, a menu entry Language Options. The Options entry
    leads to a WPF dialog where one can configure languages.
  • Custom and Default dictionaries in .dic and .lex formats. The default
    dictionaries can be included in the VSIX, so Spellchecker could support
    more languages than the WPF checker supports out of the box.
  • Multi language on a per-sentence basis. Foreign language in a sentence
    is limited to sequences of at least 3 words to avoid false negatives.

Lacking features:
Re-checking of open files after language configuration changes, which I
don’t know how to implement myself.

Unfortunately, the algorithm for determining the language became a bit
complicated and not very pretty. But apart from being ugly, it is smart:
it determines the language on a per-sentence basis, choosing from the
active languages. It allows foreign language in a sentence, but only in
sequences of a minimum length, which is currently set to 3 words, to avoid
false negatives.
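
The idea described above might look roughly like the following Python sketch. It is only a guess at the general shape, not the algorithm in the pull request, which is not shown in this thread: `check_sentence`, the majority-vote choice of sentence language, and the plain-set dictionaries are all assumptions.

```python
from typing import Dict, List, Set, Tuple

MIN_FOREIGN_RUN = 3  # minimum length of an accepted foreign-language sequence


def check_sentence(
    words: List[str],
    dictionaries: Dict[str, Set[str]],
    min_run: int = MIN_FOREIGN_RUN,
) -> Tuple[str, List[str]]:
    """Pick the sentence language, then flag words it rejects unless they
    start a run of at least `min_run` words valid in one other language."""
    keys = [w.lower() for w in words]
    # Sentence language: the active language that accepts the most words
    # (ties go to the language listed first).
    main = max(dictionaries, key=lambda lang: sum(k in dictionaries[lang] for k in keys))
    misspelled = []
    i = 0
    while i < len(keys):
        if keys[i] in dictionaries[main]:
            i += 1
            continue
        # Longest run starting here that a single foreign language accepts.
        best = 0
        for lang, vocab in dictionaries.items():
            if lang == main:
                continue
            j = i
            while j < len(keys) and keys[j] in vocab:
                j += 1
            best = max(best, j - i)
        if best >= min_run:
            i += best  # long enough: treat as a legitimate foreign phrase
        else:
            misspelled.append(words[i])
            i += 1
    return main, misspelled
```

With English and German active, a lone "ist" inside an English sentence is flagged, while a three-word German phrase inside the same kind of sentence is accepted.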

Performance drops in proportion to the number of misspellings times the
number of active languages for single-language sentences, and for
mixed-language sentences there is additional overhead when the checker
chooses a language and possibly has to check some words again in that
language. With one active language, performance is approximately the same
as it was with your solution.

A small bug:
Once, when I configured a language, it didn’t appear in the list in the
SmartTag menu, but I couldn’t reproduce that and did not investigate
further. In everyday use I never change language settings anymore, so I
would have to test a little to track down the bug.

TODO:

  • Implement Re-Checking on language option modifications.
  • Include default .dic or .lex dictionaries in the VSIX for languages that
    WPF doesn’t support out of the box.
  • Fix the above bug.
  • Conversion to VS 2012 ?

I didn’t receive any mails from you, so if I can’t get through to you, I
wonder if I should publish my fork in the VS Gallery myself? But I would
want to avoid this since re-checking is lacking.

Greetings & Blessings,

Simon Egli

