[Discussion] Advancement #1
Comments
A quick summary of my findings since the last time we discussed this (more to come later)
With regard to the format of the expected output for tests: I currently use laundry's intermediate representation in partially expanded s-expression form as the target. It is still evolving, so some names could change, though it is slowly approaching stability. I think that talking about org parsing semantics in terms of the s-expression form that they should parse to is a reasonably good idea. It also happens to be a format that is easy for this implementation to test against. An alternative would be to test via a round-trip, but that is something tree-sitter is unlikely to be able to do easily. I'll have to look into what might be possible for TS.
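As a rough illustration of testing against an s-expression target, here is a minimal Python sketch that reads an s-expression into nested lists and compares an actual parse result with the expected one. The node names (`org-file`, `headline`, `stars`, `title`) are placeholders chosen for the example; they are not laundry's actual IR names, and this is not how laundry's own tests are written.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare atoms.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()]+')


def read_sexp(text):
    """Parse a single s-expression into nested Python lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume ")"
            return items
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first expression")
    return tree


def matches_expected(actual_text, expected_text):
    """Compare two s-expressions structurally, ignoring whitespace layout."""
    return read_sexp(actual_text) == read_sexp(expected_text)
```

Comparing parsed trees rather than raw strings means that indentation and line breaks in the expected output do not cause spurious failures.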
Cool, that will allow us to put more focus on the hard part, which is definitely going to be that external scanner.
Do you have specific examples in mind? I'm not too familiar with language theory. I agree that an external scanner will be mandatory anyway, and TS allows somewhat mixing external scanner types with the "classic"
lex-abbrev is probably going to be most of the base types for the laundry/laundry/lex-abbrev.rkt Line 105 in 5a396be
Currently my "wild" idea would be to support it by
The only way I see this solution working is if it's implemented entirely in the external scanner, since it's the only place where we have some state we can manipulate at runtime for TS parsers as far as I understand.
Definitely, very muchy, can't wait-y looking forward to this
Looking at that
👍
Makes sense yes. Anyway I am not close to good enough to make it happen now, and I'll soon go back to working, so I'll have less time to give to this effort.
Round-trip seems too costly, because it means I'd also have to work on ways to write org files from a tree, which is another challenge as well. It'd be nice to have that for sure, but the cost seems really high. As I hinted towards the end of my message, I kickstarted a simple converter from Currently I target the tree-sitter S-expressions, mostly because I didn't look enough into laundry to find the tests, but I'm very open to suggestions on the format anyway. For the time being I'll incrementally raise the coverage of For the record, I added ERT tests to show the current format of S-expressions that I'm targeting:
Of significant interest: https://github.com/milisims/tree-sitter-org. @milisims for awareness.
This has a much better chance of reaching maturity than what I'm doing; happy to see it. Hopefully I'll reach at least parity on the types of elements parsed between my EmacsLisp generator and the current corpus, and then try to get incrementally more coverage.
I think that the corpus generator is of great interest for all the projects. Being able to interconvert so that everyone has access to the same test cases and expected results should allow us to communicate more clearly about any divergence we see on certain tests.
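To make the interconversion idea concrete, here is a small Python sketch that renders one shared test case (name, org source, expected tree) as a tree-sitter corpus entry. The corpus layout (a name between two `=` rules, the source, a `---` separator, then the expected S-expression) is tree-sitter's test format; the function name `render_corpus_entry` and the sample headline are made up for the example, and the node names are the ones quoted elsewhere in this thread rather than a settled grammar.

```python
def render_corpus_entry(name, source, expected_sexp):
    """Render one test case in tree-sitter's corpus test layout."""
    rule = "=" * 18
    lines = [
        rule,
        name,
        rule,
        "",
        source.rstrip("\n"),   # the org snippet under test
        "",
        "---",                 # separator between input and expected tree
        "",
        expected_sexp.strip(),
        "",                    # ensures the entry ends with a newline
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_corpus_entry(
        "Headline with tags",
        "* Title :tag:\n",
        "(org_data (headline (stars) (title) (tags)))",
    ))
```

Keeping the test cases in a neutral structure and rendering them per project would let each parser consume the same inputs while keeping its own expected-output syntax.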
Hey cool! I don't really know anything about Racket, but I'm happy to pitch in where possible.
One thing to note when considering tree-sitter parsing is that the design of TS is largely meant to facilitate permissive grammars, which can be made more specific via queries at runtime. For example, rather than using a messy external scanner (which was my first attempt), in the grammar the first word in a headline is assigned an anonymous node called
As for my current implementation, I definitely made a few Decisions to ignore or change a few details of the Emacs orgmode spec, which I didn't document :(, and those were chosen to simplify the grammar where I could. A simple example is that sections include the headline, and the orgmode syntax's section is closer to the
Another consideration for generating the parser might be that conflicting tokens can be hard to wrap your head around; in my experience a lot of trial and error was necessary. Taking a peek at conflicting tokens, in particular #4, it was necessary for me to awkwardly hand-write some conflicts so that the parser would consider that the currently matched characters could be just text or some other syntax element, and that it's not necessarily an error if the text element isn't matched. It could just be text.
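The permissive-grammar idea mentioned above (parse the first word of a headline generically, then decide at runtime whether it is a keyword) can be sketched in plain Python. This is only an analogy: it does not use tree-sitter or its query API, and the names `parse_headline`, `classify_first_word` and the keyword sets are hypothetical.

```python
import re

# Permissive step: stars, whitespace, then whatever text follows.
HEADLINE = re.compile(r"^(\*+)\s+(.*)$")


def parse_headline(line):
    """Split a headline into stars, first word and rest, without
    deciding what the first word means."""
    m = HEADLINE.match(line)
    if m is None:
        return None  # not a headline (e.g. "*bold*" has no space after the star)
    stars, rest = m.groups()
    first, _, remainder = rest.partition(" ")
    return {"stars": stars, "first_word": first, "rest": remainder.strip()}


def classify_first_word(node, todo_keywords):
    """Runtime step: refine the generic node using user configuration."""
    if node["first_word"] in todo_keywords:
        return "todo_keyword"
    return "text"


if __name__ == "__main__":
    node = parse_headline("** TODO Write tests")
    print(classify_first_word(node, {"TODO", "DONE"}))
    print(classify_first_word(node, {"NEXT"}))
```

The point of the split is that the set of keywords can change per file or per user without regenerating the parser, since the grammar itself never needs to know them.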
This would definitely be super cool!
Hi, Corpus
Sure it looks pretty cool:
Basically, from what I gather, running
would give (org-data (headline)) whereas I want the more precise (org_data (headline (stars) (title) (tags))) because if Tree-Sitter is to use such a grammar, we need all the tokens for structural editing. My goal with ottsc is to be the easiest way to create a set of corpus tests for Tree-Sitter, directly from the emacs-lisp implementation, since the spec is a little hard to navigate currently. For example, if there’s an issue coming in
Other stuff
LSP
I’m not a huge fan of the lsp-server for org-mode, just because if you need to run an Emacs as the server, I might as well just edit the org-mode file in Emacs. Having an LSP-server that is not an emacs process is basically the same problem as writing a parser that’s not elisp, so I’m assuming that "LSP server" means "make an Emacs package that follows the server-side spec of LSP, and plug it to org-mode files". @tecosaur doesn’t think so though and might have a different opinion.
Separating parsing from editing
I don’t think I agree with the idea that "consumers" and "producers" of syntax should be different. It is true that while editing, the buffer is almost always in an erroneous state with regard to the expected grammar. That’s one of the reasons tree-sitter got popular and is performant, in my opinion: you give tree-sitter the actual grammar, and it’s TS’ role to recover from grammar errors and produce an AST taking this into account. It doesn’t mean that the editor should ignore the syntax because the buffer is almost always bad. I think that the separation between org-readers and org-editors mostly comes from the difficulty of manipulating the tree once it has been parsed. But there are examples of tools that do both, like python parsers, so I don’t think that still treating both operations as separate is really good. I really really like the structural editing approach that tree-sitter can provide, and it’s also part of the appeal of org in Emacs: how can you
I think we all agree that Org-mode not having an easy EBNF grammar, and being mostly specced from
I'm having a hard time following your explanation. Mostly my point is that there need not be a tight coupling between the parser implemented in emacs and users of the file format. I'm assuming you want to use the format outside of emacs, because there really is no point otherwise. The various org mode parsers out there in various languages are basically trying to make use of the format outside of emacs. All this speaks to the need for either a looser coupling or for emacs itself to use a more portable spec with an EBNF. A looser coupling means we could have the best of both worlds while we wait for the emacs implementation to use a more well-defined spec.
More simply, we don't need emacs to enforce a stricter grammar. We can let a linter enforce the grammar. People who need their files to conform to a stricter grammar would have to use the linter. This can happen at the same time that the emacs folks do whatever they want with
More specifically, outside of emacs they can use any of these tools which parse the format (which have been written with a strict grammar), but when they work inside of emacs, they can use a linter to make sure it's compatible with the tools used outside of emacs.
I'm not sure what you see producing this corpus. You say "directly from emacs-lisp implementation", but I don't know what that means. The org mode code doesn't generate files, it only parses them. Then you talk about "extracting" from an issue in a Lua implementation of org-mode. My guess is that you mean that someone has produced an org file to show a difference between that implementation and the Lua implementation. The important point is that someone must write that file.
In any event, if you're not interested in collaborating that's fine. It sounds like you're not, though it's not clear to me that you've understood the suggestions.
I guess I was unclear once again; I can never explain things in GitHub issues, it seems. Anyway, this is getting off-topic as far as I understand, so it doesn’t really matter now. The point of producing corpus from
I see. That description says nothing about avoiding writing tests manually. I don’t think there is any way to avoid writing tests manually. If you take a look you’ll see my project provides a collection of samples which could be used for this purpose. If you look at the description, it describes a very similar goal to the one you describe. The suggestion to provide a “linter” for org-mode is secondary, but I think it fits in with the overall aim as I described above. Waiting for org-mode itself to be built on an EBNF may take a very long time. A linter would ensure compatibility in the meantime for users who choose to use it.
Hello Tom,
Sorry for barging in, but I didn’t really find another way to get news on that front (at least org-mode ML didn’t bring any results, and to be honest I’d rather keep it discreet).
Last time we interacted, I remember that you proposed to try to make a converter from your parser to a tree-sitter grammar, and it looked like the best way to go, since it would take me literal months (that I don’t have) to try to follow the changes in your parser.
Current state of the parser
I see that the scanning/lexing part (that’s how I see the raw parser.rkt file) is a lot leaner than last time I tried to translate it to TS. Do you think it’s going to be enough to grasp all the elements we’d like before doing actual parsing? I’d be happy if it were the case, I really don’t know. (I’m currently reading the beginning of Crafting Interpreters, hoping to have "aha" moments and a better understanding of what I’m trying.)
Current state of the tests
Simple: I saw that you added tests in the repo. What kind of "expected results" format do you think would be best to formalize/centralize the spec-as-tests you’re currently building?
Feasibility of tree-sitter-org based on laundry
Even if I’m not the one ending up doing a TS parser (looking at what TS support for Markdown is, I actually have shivers thinking about the work to be done), I think this approach of
bool tree_sitter_org_external_scanner_scan(void *, TSLexer*, const bool*)
) is currently the safest way to go. But if you have any comments on that I’d be glad to hear them.
At least if that’s the case, my next step would "only" be to translate your tests into a full-sized TS test/corpus, which is both less daunting to me and probably the best use of my time to try to get org-mode support in TS.
Have a nice day