Cursor issue in Markdown files #5

Open
Azeirah opened this issue Sep 3, 2017 · 8 comments

Comments

@Azeirah

Azeirah commented Sep 3, 2017

I'm trying to get breadcrumbs to work in Markdown files, to show the nesting of headers (#, ##, ###, ####, #####, ######) in succession.

I came up with the following RegExp

^#{1,6}\\s*(?P<name>.*)

I tested it with Sublime's find function after slightly modifying it to ^#{1,6}\s*(.*), i.e. replacing \\s with \s and (?P<name>...) with a plain group, because the find function does not support those, unlike the regexes in Sublime's config files.
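
For reference, a minimal sketch of how the original pattern behaves under Python's `re` module, assuming that is what the plugin uses to compile the setting. The sample text is made up for illustration. The doubled backslash in `\\s` is only needed inside the JSON settings file; a Python raw string takes a single one.

```python
import re

# Same pattern as in the settings file; JSON needs "\\s", a raw string needs "\s".
HEADING = re.compile(r"^#{1,6}\s*(?P<name>.*)")

# Made-up sample document, for illustration only.
sample = """# Title
Some text
## Section
More text
"""

names = []
for line in sample.splitlines():
    match = HEADING.match(line)
    if match:
        # The named group holds the heading text without the leading #'s.
        names.append(match.group("name"))

print(names)
```

So the pattern itself finds every heading line; whatever goes wrong happens after the match.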

[Screenshot: sublime-find]

But if I insert the RegExp in the breadcrumbs setting for markdown specific syntax, I only get a breadcrumb for the line that is directly above my cursor, see the below images

When I move my cursor right below the heading line, it shows up fine

[Screenshot: crumb-1]

But when I move my cursor one line down, no more breadcrumbs :(

[Screenshot: crumbs-2]

Is something wrong with my RegExp, or is this a plugin issue?

@qbolec
Owner

qbolec commented Sep 3, 2017

I believe this is because Markdown does not use indentation to convey the nesting of contexts, and the Breadcrumbs plugin assumes that you can find breadcrumbs simply by looking for lines with particular indentation. I was well aware that this is a limitation, but it is one that lets me cover many languages at once without having to parse them.
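
A small illustration of why indentation carries no information here. This is not the plugin's code, just a sketch with made-up sample lines and a hypothetical helper that counts leading whitespace characters.

```python
def leading_whitespace_width(line):
    """Count leading spaces and tabs (each counted as one column, for simplicity)."""
    return len(line) - len(line.lstrip(" \t"))

# Made-up samples for illustration.
markdown_lines = ["# Title", "## Section", "Some paragraph text", "### Subsection"]
python_lines = ["class A:", "    def f(self):", "        return 1"]

md_widths = [leading_whitespace_width(line) for line in markdown_lines]
py_widths = [leading_whitespace_width(line) for line in python_lines]

print(md_widths)  # every Markdown line starts at column 0
print(py_widths)  # Python lines get deeper with nesting
```

With every Markdown line at the same width, an indentation-based search has nothing to distinguish a heading from the text below it.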

I believe that it would be hard to use indentation in a Markdown file just to fit this plugin needs. So, perhaps we should add a way to define a language specific config file which would hint how the hierarchy of concepts works for a particular language.

Do you see some easy way to specify such rules?

@Azeirah
Author

Azeirah commented Sep 3, 2017

Ah, that makes sense. Yeah, I thought of a possible solution for this: a nestlevel regex selector.

For JSON files, hierarchy is based on number of <TAB>'s (or spaces)

{
  "hello": "hi", // nestlevel-1
  "nest": {
    "hello2": "hi-again" // nestlevel-2
  }
}

For Markdown, hierarchy is based on amount of #'s.

If you could define a regex group (?P<nestlevel>) that captures a string, you could then define the nesting level as the length of that string. E.g., in Python

nestlevelgroup = match.group('nestlevel')
nestlevel = len(nestlevelgroup)

For Markdown, '##' would return a nest level of 2, '###' a nest level of 3.

It would look something like (?P<nestlevel>\\s*) for JSON, and (?P<nestlevel>#*) for Markdown.
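
Put together, the two patterns above could be tried out like this. A minimal sketch: the constant and function names are made up for illustration, and the patterns use a single backslash because they are Python raw strings rather than JSON settings.

```python
import re

# The two patterns proposed above.
MARKDOWN_LEVEL = re.compile(r"^(?P<nestlevel>#*)")
JSON_LEVEL = re.compile(r"^(?P<nestlevel>\s*)")

def nest_level(pattern, line):
    """Nesting level = length of whatever the nestlevel group captured."""
    # Both patterns can match the empty string, so match() never returns None.
    match = pattern.match(line)
    return len(match.group("nestlevel"))

print(nest_level(MARKDOWN_LEVEL, "## Section"))
print(nest_level(JSON_LEVEL, '    "hello2": "hi-again"'))
```

A line without any # gets level 0 under the Markdown pattern, so plain text and headings would still need to be told apart by some other rule.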

(this paragraph is probably kind of confusing, it's not essential atm)
For JSON, two spaces would return a nest level of 2 and four spaces a nest level of 4. I can see a potential issue with JSON if you use different indentation sizes: with 2-space indentation the series of nesting levels would be [2, 4, 6, 8, 10, ...], and with 4-space indentation it would be [4, 8, 12, 16, ...]. I guess you could work around that by not requiring successive nest levels to differ by exactly 1 (i.e. not assuming the set looks like [1, 2, 3, 4, 5]) and instead only requiring previous < next: 4 < 8, 8 < 12.
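
The "previous < next" idea can be sketched as an upward walk that only compares levels and never looks at their step size. This is an illustration with a hypothetical `ancestors` function, not code from the plugin.

```python
def ancestors(levels, cursor_index):
    """Return the indices of the lines that enclose levels[cursor_index].

    Walks upward and keeps each line whose level is strictly smaller than
    the smallest level kept so far, so only the ordering of levels matters.
    """
    chain = []
    current = levels[cursor_index]
    for i in range(cursor_index - 1, -1, -1):
        if levels[i] < current:
            chain.append(i)
            current = levels[i]
    chain.reverse()  # outermost ancestor first
    return chain

# The JSON example from above, once with 2-space and once with 4-space indentation.
two_space = [0, 2, 2, 4, 2, 0]
four_space = [0, 4, 4, 8, 4, 0]

print(ancestors(two_space, 3))
print(ancestors(four_space, 3))
```

Both calls give the same chain of enclosing lines, because the comparison is relative.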

@qbolec
Owner

qbolec commented Sep 3, 2017

I like the simplicity of this idea.
But is it general enough, or Markdown specific?
By "general" I mean something which could handle other languages without indentation, like LaTeX (where you have \chapter \section \subsection ..) or HTML's h1,h2,h3,..

The issue with whitespace is even more complicated than you have already noticed: one can mix tabs with spaces in a way where it is not simple to tell the final amount of indentation without carefully checking the remainder modulo 4 of the current column number before each tab. You can see the code for that, which is copied from an example plugin for reindenting.
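
To make the tab problem concrete, here is a small sketch comparing the raw number of whitespace characters with the column they end at. It uses Python's built-in `str.expandtabs` instead of the plugin's own loop, and assumes a tab size of 4; the function names are made up.

```python
def raw_length(line):
    """Number of leading whitespace characters, tabs counted as one."""
    return len(line) - len(line.lstrip(" \t"))

def visual_column(line, tab_size=4):
    """Column where the first non-whitespace character ends up."""
    expanded = line.expandtabs(tab_size)  # each tab jumps to the next tab stop
    return len(expanded) - len(expanded.lstrip(" "))

for line in ["\tx", "  \tx", "    \tx"]:
    print(repr(line), raw_length(line), visual_column(line))
```

The first two lines have different raw lengths but land on the same column, which is why `len` of the captured whitespace is not a reliable level once tabs and spaces are mixed.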

Perhaps, a more general solution would be to allow people to define a hierarchy as a tree in which each node has a regex. Or instead of using regexes, which is always a hack, just tap into the Sublime's token info API.

@Azeirah
Author

Azeirah commented Sep 3, 2017

You could solve this problem with two functions, a hierarchical element extractor, and a numerical function:

  1. A function written in the regex language that extracts hierarchical elements, hel's, from a text file; (?P<hel> ...)
  2. A function written in the Python language that converts a hel into a number lambda hel: ..., we'll call this function, helnum. The number returned by helnum represents a level of nestedness, or depth.

I'm not saying the Python function should be a lambda, it's just for easier notation purposes.
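
As a rough sketch, the two primitives could be combined like this. The function name `hierarchy_level` and the choice to return `None` for lines without a hel are assumptions made for illustration.

```python
import re

def hierarchy_level(line, hel_regex, helnum):
    """Return the depth of `line`, or None if it contains no hierarchical element."""
    match = hel_regex.search(line)
    if match is None:
        return None
    return helnum(match.group("hel"))

# Markdown: the hel is the run of #'s, the helnum is simply its length.
markdown_hel = re.compile(r"^(?P<hel>#{1,6})")
markdown_helnum = len

print(hierarchy_level("# bla", markdown_hel, markdown_helnum))
print(hierarchy_level("## bla", markdown_hel, markdown_helnum))
print(hierarchy_level("plain text", markdown_hel, markdown_helnum))
```

Supporting another language then only means supplying a different regex and a different function.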

Examples

1. Markdown

hel ^(?P<hel>#{1,6}), or, extract up to 6 #'s at the beginning of a line.

helnum lambda hel: len(hel)

# bla                        helnum = 1
## bla                       helnum = 2

2. Python

hel ^(?P<hel>\\s*) == extract all whitespace at the beginning of a line

(select to see spaces)
def bla():                    helnum = 0
  i = 0                       helnum = 1
  while i < 10:               helnum = 1
    i += 1                    helnum = 2

helnum lambda hel: len(hel) // 2

Note that the regex doesn't take tabs into account. For a real implementation, the author of the hel for Python can use alternation (|) in the regex to deal with various styles of indentation.

3. HTML

This one is more interesting, there are different kinds of hierarchies that the end-user might be interested in. When you're building a website, you want to see what the parent element of the element you're currently writing is. When you're writing a blog post, you want to see in what context you're writing in. There are probably more kinds of hierarchies that are useful to extract.

The good thing is that this approach supports either. If you want indentation-based hierarchy, write a helnum function that retrieves Sublime's indentation setting (i.e. are you using two-space tabs, three-space tabs, four-space tabs ...?).

If you want the other, write a hel that extracts heading tags, \<h(?P<hel>[1-6])> and give it a helnum like this lambda hel: int(hel)
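
For the blog-post case, a sketch of how heading levels could be turned into a trail of enclosing headings. The `title` group and the closing-tag part of the regex are additions for this example, not part of the proposal above, and the pattern deliberately ignores headings that carry attributes.

```python
import re

# <h1>..</h1> to <h6>..</h6> on a single line; "title" is an extra group for display.
HEADING = re.compile(r"<h(?P<hel>[1-6])>(?P<title>.*?)</h[1-6]>")

def heading_trail(lines, cursor_index):
    """Return the titles of the headings that enclose lines[cursor_index]."""
    trail = []  # stack of (level, title), outermost first
    for line in lines[:cursor_index + 1]:
        match = HEADING.search(line)
        if not match:
            continue
        level = int(match.group("hel"))  # helnum: lambda hel: int(hel)
        # A new heading closes every open heading at the same or a deeper level.
        while trail and trail[-1][0] >= level:
            trail.pop()
        trail.append((level, match.group("title")))
    return [title for _, title in trail]

# Made-up post for illustration.
post = [
    "<h1>Blog</h1>", "<p>intro</p>",
    "<h2>Setup</h2>", "<p>a</p>",
    "<h2>Usage</h2>", "<h3>Flags</h3>", "<p>cursor here</p>",
]
print(heading_trail(post, 6))
```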

4. LaTeX

LaTeX is a complex beast. It requires a case study of its own, to be honest. I'm fairly certain you could use the approach shown above to extract hierarchical elements out of LaTeX source code. The big question for LaTeX is only: what hierarchical elements are significant to the user? I think only the user knows. Easy things should be easy, hard things should be possible; as long as the above two primitives, hel's and helnum's, permit you to extract the exact hierarchical patterns you're interested in, I think it's ok.
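
One possible reading for the simple sectioning commands, where the helnum is a lookup table rather than a length. Which commands count, and how they rank, is an assumption here; a user could list different ones.

```python
import re

# Assumed ranking of sectioning commands; purely illustrative.
LATEX_LEVELS = {"chapter": 0, "section": 1, "subsection": 2, "subsubsection": 3}

# hel: the command name, optionally starred, at the start of a line.
LATEX_HEL = re.compile(r"^\\(?P<hel>chapter|section|subsection|subsubsection)\*?\{")

def latex_level(line):
    """Return the assumed depth of a sectioning command, or None for other lines."""
    match = LATEX_HEL.match(line)
    if match is None:
        return None
    return LATEX_LEVELS[match.group("hel")]  # helnum as a table lookup

print(latex_level(r"\chapter{Intro}"))
print(latex_level(r"\subsection{Details}"))
print(latex_level("Some running text."))
```

Anything beyond this, such as environments or custom macros, would need its own hel and helnum.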

one can mix tabs with spaces

I understand the concern, but I don't think it's worth focusing on this issue in particular, this plugin's purpose is to show relevant hierarchical context. If you're working in a file that has a messed up hierarchy, I'd say too bad. Yes, there are programs out there that can deal with complex messes like these, the HTML parsers embedded in modern browsers are good examples, they deal with many weird off-spec html messes, all of these render just fine as complete documents: <h1> hello, <h1> hello </; the result of dealing with these complex messes? The program itself becomes a complex mess.

Sublime itself also has an answer for this same issue, try opening a JSON file with inconsistent indentation, tabs mixed with spaces, sometimes 3-space indentation, sometimes 1-space, etc... Sublime's indentation guessing feature suddenly doesn't work anymore, it's just not worth it.

The only downside I see right now is that someone who likes this plugin loads a document from the internet, the indentation is messed up, and this plugin won't work. Well, too bad? :S

It only needs to be correct when the user wants it to be correct.

Anyhow

Despite this big post, another important question to ask is, do you need to be general? Do you think that this plugin will be so popular that it must support the most complicated languages? An alternative would be to roll with a simple solution like the one I proposed, and deal with insane complexities of the likes of LaTeX when the need arises, if ever.

@qbolec
Owner

qbolec commented Sep 4, 2017

I like your idea.
It's worth investigating.
Are you willing to try implementing it yourself?
In general what you want is to change the implementation of get_row_indentation, which currently is:

def get_row_indentation(line_start, view, tab_size, indentation_limit=None):
  if indentation_limit is None:
    row = view.rowcol(line_start)[0]
    indentation_limit = view.text_point(row + 1, 0) - line_start

  pos = 0
  for ch in view.substr(sublime.Region(line_start, line_start + indentation_limit)):
    if ch == '\t':
      pos += tab_size - (pos % tab_size)

    elif ch == ' ':
      pos += 1

    else:
      break

    if indentation_limit <= pos:
      break

  return pos

to something you've described.
You might want to change the name to get_hierarchy_level, and remove tab_size from the list of arguments to make it less "indentation specific".
You might want to use the get_setting utility function, which is defined as

def get_setting(view_settings, key, default_value):
  defaults = sublime.load_settings('Breadcrumbs.sublime-settings')
  return view_settings.get(key, defaults.get(key, default_value))

to extract the hel regexp, and helnum function body. I guess, you will have to eval it somehow to get a Python function from it.
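
A sketch of what the eval step might look like. The setting keys `breadcrumbs_hel` and `breadcrumbs_helnum` are hypothetical, and a plain dict stands in for the values that get_setting would return, so the snippet runs outside Sublime.

```python
import re

# Hypothetical setting names and values; not existing plugin settings.
settings = {
    "breadcrumbs_hel": r"^(?P<hel>#{1,6})",
    "breadcrumbs_helnum": "lambda hel: len(hel)",
}

def compile_helnum(source):
    """Turn a lambda expression given as text into a callable.

    eval runs arbitrary code, so this is only acceptable for settings the
    user wrote themselves. Narrowing the builtins limits accidents; it is
    not a security sandbox.
    """
    allowed = {"__builtins__": {"len": len, "int": int, "min": min, "max": max}}
    func = eval(source, allowed)
    if not callable(func):
        raise ValueError("helnum setting must evaluate to a function")
    return func

# Compile both once, up front, rather than on every line.
hel_regex = re.compile(settings["breadcrumbs_hel"])
helnum = compile_helnum(settings["breadcrumbs_helnum"])

match = hel_regex.match("### Subsection")
print(helnum(match.group("hel")))
```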

The main concern here is that this must work extremely fast. This function will be called for each of preceding lines. I see a potential problem here, as running hel regexp against each line could be too slow - perhaps one could safely trim the line to say, 100 chars before executing the regexp, but still, this needs some performance testing.
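
A rough way to check whether trimming matters, as a sketch only. The 100-character cap comes from the paragraph above; the test data is made up, the function names are hypothetical, and no particular timing result is claimed.

```python
import re
import timeit

# Pattern with a trailing .* so the regex has to scan the rest of the line.
HEL = re.compile(r"^(?P<hel>#{1,6})\s*(?P<name>.*)")
MAX_PREFIX = 100  # cap suggested above

def level_full(line):
    match = HEL.match(line)
    return len(match.group("hel")) if match else None

def level_trimmed(line):
    # Only the first MAX_PREFIX characters are handed to the regex.
    match = HEL.match(line[:MAX_PREFIX])
    return len(match.group("hel")) if match else None

# Made-up data: very long heading lines and very long plain lines.
lines = ["## heading " + "x" * 5000, "plain " + "y" * 5000] * 500

full = timeit.timeit(lambda: [level_full(l) for l in lines], number=5)
trimmed = timeit.timeit(lambda: [level_trimmed(l) for l in lines], number=5)
print("full: %.4fs  trimmed: %.4fs" % (full, trimmed))
```

Because the hel is anchored at the start of the line, trimming does not change the computed level, only the amount of text the regex sees.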

I was using this file:
big_example.txt
to test performance when navigating final lines of this file.
For "profiling" I've used a rather simple approach:

import time
...
  last_time = [time.time()]
  def stamp(text):
    now = time.time()
    print(text, now-last_time[0])
    last_time[0] = now
...
    stamp('start')
...
    stamp('finished stage 1')
...
    stamp('done')

And looked into the console (which you open with ctrl+backtick) for the output.

@Azeirah
Author

Azeirah commented Sep 5, 2017

Are you willing to try implementing it yourself?

Unfortunately, no. I have (too) many projects I'm perpetually working on for myself, taking on another one is not high on my priority list.

However, I'm fully willing to help you test future updates, or discuss issues with you.

@Azeirah
Author

Azeirah commented Sep 5, 2017

The main concern here is that this must work extremely fast. This function will be called for each of preceding lines. I see a potential problem here, as running hel regexp against each line could be too slow - perhaps one could safely trim the line to say, 100 chars before executing the regexp, but still, this needs some performance testing.

For this case, I think it's really easy to solve with hash-based memoization. Since a lot of the text won't change very often (especially true if the file is very large), you could compute a short hash for each line, cache the results of the regexes, and only recompute a given line if it's not in the cache.
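
A minimal sketch of the caching idea. Instead of computing a separate short hash, it uses `functools.lru_cache`, which keys the cache on the line text itself (Python hashes the string internally). The cache size and the call counter are there only for illustration.

```python
import re
from functools import lru_cache

HEL = re.compile(r"^(?P<hel>#{1,6})")
calls = {"regex": 0}  # counts how often the regex actually runs

@lru_cache(maxsize=65536)  # arbitrary size, chosen for the example
def cached_level(line):
    calls["regex"] += 1
    match = HEL.match(line)
    return len(match.group("hel")) if match else None

# Made-up document; identical lines share one cache entry.
document = ["# Title", "text", "## Section", "text", "text"]

first_pass = [cached_level(line) for line in document]
second_pass = [cached_level(line) for line in document]

print(first_pass)
print(calls["regex"])  # one regex run per distinct line
```

An edited line has different text, so it misses the cache and gets recomputed without any explicit invalidation.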

@qbolec
Owner

qbolec commented Sep 6, 2017

Nice idea! Thank you.
