Single character highlights within a word are not being evaluated correctly #189

sscheib · 2024-03-10T21:01:09Z

Sorry for the noise! This might again be just a misunderstanding, but I cannot figure it out.

I have portions in my markdown where I highlight single characters to explain an abbreviation, e.g.:

Name | Abbreviation
Partition Table | Partition Table (with the "P" and the "T" rendered in bold)

In markdown this looks something like this:

| Name            | Abbreviation               |
| :-------------- | :------------------------- |
| Partition Table | **P**artition **T**able    |

Now pyspelling flags "artition" as a misspelled word. I would have expected that, when I load either markdown.extensions.legacy_em or pymdownx.betterem, pyspelling would be able to retrieve the complete word instead of only a portion of it.

This also happens outside of tables, e.g. simply:

**P**artition **T**able

With the following simple markdown file, this can be easily reproduced:

| Name            | Abbreviation               |
| :-------------- | :------------------------- |
| Partition Table | **P**artition **T**able    |

**P**artition **T**able

Currently I am using the following config:

---
spellchecker: 'aspell'
jobs: 4
matrix:
  - name: 'markdown'
    default_encoding: 'utf-8'
    expect_match: true
    sources:
      - '_posts/*kickstart*'

    dictionary:
      wordlists:
        - '.github/spellcheck/wordlist.txt'

      output: '.github/spellcheck/spellcheck.dic'

    aspell:
      lang: 'en'
      d: 'en_US'
      mode: 'markdown'
      ignore-case: true

    pipeline:
      - pyspelling.filters.markdown:
          markdown_extensions:
            - pymdownx.superfences: {}
            - pymdownx.betterem: {}
            - pymdownx.details: {}
            - pymdownx.emoji: {}
            - pymdownx.betterem: {}
            - markdown.extensions.legacy_em: {}
            - markdown.extensions.tables: {}

      - pyspelling.filters.html:
          comments: false
          attributes:
            - 'title'
            - 'alt'
          ignores:
            - ':matches(code, pre)'

      - pyspelling.filters.url: {}
      - pyspelling.filters.context:
          context_visible_first: true
          escapes: '\\[\\`~]'
          delimiters:
            # ignore liquid code blocks
            - open: '{% highlight [A-Za-z]+ %}'
              close: '{% endhighlight %}'

            # ignore gists
            - open: '{% gist (?:[0-9a-f]){32}'
              close: '%}'

            # ignore text between inline back ticks
            - open: '(?P<open>`+)'
              close: '(?P=open)'

            # ignore title
            - open: '(^title:)'
              close: '(.+$)'

            # ignore author
            - open: '(^author:)'
              close: '(.+$)'

            # ignore liquid tags
            - open: '({% [a-z]+ )'
              close: '%}'
...

I have to say that I played around a lot with the markdown extensions; initially I only had pymdownx.superfences: {} enabled.

I greatly appreciate any hints!

Thanks, and all the best,
Steffen

facelessuser (Maintainer) · 2024-03-10T22:16:00Z

I'll have to take a look, but I am pretty sure that when we gather the content under a block, the text inside each tag is separated by spaces. This ensures that we don't accidentally smoosh together things that we do not mean to. So, the following gets evaluated as:

<p><strong>P</strong>artition <strong>T</strong>able</p>

P artition T able
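
To illustrate the effect described above, here is a minimal sketch using Python's standard html.parser. It is not PySpelling's actual implementation; the TextCollector class and the extract_text helper are hypothetical names used only for this example. It simply shows how joining the text of each tag with a space splits a word that has mid-word emphasis, while joining without a separator keeps it whole.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node in the order it appears (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between two tags.
        self.chunks.append(data)


def extract_text(html, sep):
    """Return all text nodes of `html` joined with `sep`."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return sep.join(parser.chunks)


html = "<p><strong>P</strong>artition <strong>T</strong>able</p>"

# Separating text nodes with spaces breaks the words apart.
spaced_words = extract_text(html, " ").split()

# Joining without a separator keeps the words intact, but could also
# glue unrelated neighbours together in other documents.
joined_text = extract_text(html, "")

print(spaced_words)
print(joined_text)
```

The trade-off is the one mentioned in this reply: a space separator is the safe default, and joining without one would only be safe for a small set of inline tags.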

To be fair, mid-word emphasis is not that common, but it is obviously not impossible to run into. I'm curious what Aspell does in its own HTML filter.

If we were to change this behavior and group span tags together, it would require a lot of testing. I can imagine all sorts of undesirable situations. Maybe <sup> and <sub> tags, which are inline, should be ignored. Most likely, if we did allow this, we would add it as an optional feature switch. Maybe it makes sense to have a small set of known tags that we don't separate by spaces, maybe bold and italic specifically.

Anyway, currently, I don't think there is a way to avoid this except to not break a word up by tags.

sscheib (Author) · 2024-03-10T22:27:09Z

Damn, you're quick!

I've just tried it with aspell itself (on the markdown file, not the HTML file), and indeed, it also complains about the same words.

I guess this is just a limitation of spell checking with aspell, then.

I might be able to work around this by selectively disabling pyspelling for certain lines. But since I couldn't find a 'native' way, I guess I need to use the context filter and play around with the regular expressions - or do I miss something?

How do you disable pyspelling for certain content in your projects (if that's even a use case for you, of course)?

facelessuser (Maintainer) · 2024-03-10T22:43:24Z

Well, aspell has its own HTML filter I believe. I'll take a look at some point and compare. It is at least a limitation of using our HTML filter with aspell.

> I guess I need to use the context filter and play around with the regular expressions - or do I miss something?

If you are using the HTML filter, it will split those parts, so context filter won't really help.

> How do you disable pyspelling for certain content in your projects (if that's even a use case for you, of course)?

What do you mean by disable? You mean disable using PySpelling filters specifically? I think that is covered in the documentation.

sscheib (Author) · 2024-03-10T23:07:59Z

> What do you mean by disable? You mean disable using PySpelling filters specifically? I think that is covered in the documentation.

Sorry for my poor wording. I meant something similar to noqa or comparable flags used by linters, which basically 'disable' linting for a specific line or block - in the sense of skipping over it.

I know, this is not a linter in that sense, so this might go beyond the scope of pyspelling.

I thought of introducing something like an HTML comment <!-- pyspelling-disable --> (to 'disable' spell checking) and, later down the file, a matching <!-- pyspelling-enable -->.

I know you cannot skip lines, but I - theoretically at least - could filter out the specific content in between those markers. I thought of something like:

- pyspelling.filters.context:
    context_visible_first: true
    escapes: '\\[\\`~]'
    delimiters:
      - open: '(?m)^(\s+)?\s+?pyspelling-disable\s+?'
        content: '[\S\s]+'
        close: '\s+?pyspelling-enable\s+?'

I wasn't yet able to make this work - it might not even work at all; it's just an idea I had in order to keep my mid-word emphasis and not add a bunch of nonsense words to my dictionary.
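
A rough sketch of the marker idea in plain Python, using only the standard re module. The HTML comment markers <!-- pyspelling-disable --> and <!-- pyspelling-enable --> are assumed names for illustration, not something PySpelling defines, and strip_disabled is a hypothetical helper. This only demonstrates the regular expression itself, not how the context filter would apply it.

```python
import re

# Assumed marker names; PySpelling does not define these.
DISABLED_REGION = re.compile(
    r"<!--\s*pyspelling-disable\s*-->"  # opening marker
    r".*?"                              # non-greedy: stop at the nearest closing marker
    r"<!--\s*pyspelling-enable\s*-->",  # closing marker
    re.DOTALL,                          # let '.' match newlines too
)


def strip_disabled(text):
    """Remove every region enclosed by the disable/enable markers."""
    return DISABLED_REGION.sub("", text)


sample = (
    "Checked text.\n"
    "<!-- pyspelling-disable -->\n"
    "**P**artition **T**able\n"
    "<!-- pyspelling-enable -->\n"
    "More checked text.\n"
    "<!-- pyspelling-disable -->\n"
    "**Sn**ippe**t**\n"
    "<!-- pyspelling-enable -->\n"
)

remaining = strip_disabled(sample)
print(remaining)
```

The content part is deliberately non-greedy. A greedy pattern such as [\S\s]+ would run from the first opening marker to the last closing marker and swallow the text between two separate regions.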

facelessuser (Maintainer) · 2024-03-10T23:12:13Z

You can disable it however you want. You can add an HTML class to force a disable.

sscheib (Author) · 2024-03-10T23:17:58Z

You are absolutely correct - once again 💯.
I was thinking about it in a waaay too complicated way.

That does the trick:

<div class=nospell>
| Prefix            | Meaning                        |
| :---------------- | :----------------------------- |
| `pt-`             | **P**artition **T**able        |
| `pvt-`            | **P**ro**v**ision **T**emplate |
| `snt-`            | **Sn**ippe**t**                |
</div>

      - pyspelling.filters.html:
          comments: false
          attributes:
            - 'title'
            - 'alt'
          ignores:
            - ':matches(code, pre, details)'
            - '.nospell'
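
As a rough picture of what ignoring the .nospell class amounts to, here is a small sketch built on Python's standard html.parser. It is an illustration under simplifying assumptions, not PySpelling's HTML filter: NoSpellText and visible_words are hypothetical names, and the depth counter assumes every opened tag is also closed (unclosed void tags such as <br> inside an ignored block would throw it off).

```python
from html.parser import HTMLParser


class NoSpellText(HTMLParser):
    """Collect text, skipping everything inside elements with a given class."""

    def __init__(self, ignored_class="nospell"):
        super().__init__()
        self.ignored_class = ignored_class
        self.skip_depth = 0  # > 0 while inside an ignored element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at an ignored element and track nesting below it.
        if self.skip_depth or self.ignored_class in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def visible_words(html):
    """Return the words that would still be handed to a spell checker."""
    parser = NoSpellText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


html = (
    "<p>Prefixes</p>"
    "<div class=nospell><p><strong>P</strong>artition <strong>T</strong>able</p></div>"
    "<p>Done</p>"
)
words = visible_words(html)
print(words)
```

Everything inside the wrapped block is dropped before the words are checked, which is why the split fragments no longer show up.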

facelessuser (Maintainer) · 2024-03-10T23:34:22Z

Yep, that is what I usually do.

sscheib (Author) · 2024-03-11T21:02:20Z

Just in case somebody finds this discussion and also wants to use pyspelling with a Jekyll blog, here's my config:

---
spellchecker: 'aspell'
jobs: 4
matrix:
  - name: 'markdown'
    default_encoding: 'utf-8'
    expect_match: true
    sources:
      - '_posts/*.md'

    dictionary:
      wordlists:
        - '.github/spellcheck/wordlist.txt'

      output: '.github/spellcheck/spellcheck.dic'

    aspell:
      lang: 'en'
      d: 'en_US'
      mode: 'markdown'
      ignore-case: true

    pipeline:
      - pyspelling.filters.context:
          context_visible_first: true
          escapes: '\\[\\`~]'
          delimiters:

            # ignore liquid nospell blocks
            #
            # example:
            #
            # {% comment %} begin nospell {% endcomment %}
            # [..]
            # {% comment %} end nospell {% endcomment %}
            #
            - open: '(?m)(\s{0,}?){%(\s+)?comment\2?%}\2?begin\2?nospell\2?{%\2?endcomment\2?%}'
              content: '[\S\s]+'
              close: '\1{%\2?comment\2?%}\2?end\2?nospell\2?{%\2?endcomment\2?%}'

            # ignore liquid highlight blocks
            #
            # example:
            #
            # {% highlight yaml %}
            # [..]
            # {% endhighlight %}
            #
            - open: '(?m)^(\s{0,}?){%(\s+)?highlight\2[A-Za-z0-9]+\2?%}'
              content: '[\S\s]+'
              close: '\1{%\2?endhighlight\2?%}$'

            # ignore any liquid tags
            #
            # examples:
            #
            # - {% raw %}
            # - {% endhighlight %}
            # - {% gist somerandomid %}
            #
            - open: '(?s)^\s{0,}?{%\s+[A-Za-z0-9]+\s+'
              close: '%}$'

            # ignore title and author in the header
            #
            # example:
            #
            # ---
            # title: My blog post title
            # author: John and Jane Doe
            # ---
            #
            - open: '(?s)^(?:title|author):'
              content: '[^\n]+'
              close: '$'

      - pyspelling.filters.markdown:
          markdown_extensions:
            - pymdownx.superfences: {}

      - pyspelling.filters.html:
          comments: false
          attributes:
            - 'title'
            - 'alt'
          ignores:
            - ':matches(code, pre)'

      - pyspelling.filters.url: {}

      - pyspelling.filters.context:
          context_visible_first: true
          escapes: '\\[\\`~]'
          delimiters:
            # ignore text between inline back ticks
            - open: '(?P<open>`+)'
              close: '(?P=open)'
...
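
A side note on character classes in delimiter patterns like the ones above, checked here with Python's standard re module. In ASCII, the range A-z is wider than A-Za-z because it also covers the six characters between "Z" and "a" ([, \, ], ^, _ and the back tick), so spelling out both ranges is the stricter choice for something like a highlight language name. The variable names below are illustrative only.

```python
import re

# 'A-z' spans the ASCII codes 65..122, which includes the punctuation
# characters that sit between 'Z' (90) and 'a' (97).
loose = re.compile(r"[A-z0-9]+")
strict = re.compile(r"[A-Za-z0-9]+")

candidate = "ya_ml^"

loose_matches = loose.fullmatch(candidate) is not None
strict_matches = strict.fullmatch(candidate) is not None
plain_matches = strict.fullmatch("yaml") is not None

print(loose_matches, strict_matches, plain_matches)
```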
