Tree-sitter integration #83

larssondaniel · 2024-04-12T13:04:54Z

Thanks for an awesome product 🫡

Has a Tree-sitter integration been up for discussion? If a syntax tree is generated at the start of a new plan, that opens multiple ways to validate syntax and make refactoring safer. If including it in the context isn't feasible due to size, we could perform syntax validation and linting (depending on language) and pass those results to a qa/reviewer agent that can take action.
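A minimal sketch of the validation step described above. It uses Python's built-in `ast` module as a stand-in for Tree-sitter, so it only handles Python source. The `validate_syntax` name and the report shape are hypothetical, not part of Plandex or Tree-sitter. Tree-sitter also behaves differently on bad input: it recovers from errors and marks them as error nodes in the tree instead of raising, so a real integration would walk the tree for those nodes.

```python
import ast


def validate_syntax(path: str, source: str) -> dict:
    """Parse `source` and return a small report a reviewer agent could consume.

    The report shape is an assumption made for this sketch.
    """
    try:
        ast.parse(source, filename=path)
    except SyntaxError as err:
        return {
            "path": path,
            "ok": False,
            "line": err.lineno,
            "column": err.offset,
            "message": err.msg,
        }
    return {"path": path, "ok": True}


# A well-formed file and one with an unclosed parenthesis.
good = validate_syntax("good.py", "def add(a, b):\n    return a + b\n")
bad = validate_syntax("bad.py", "def add(a, b:\n    return a + b\n")
```

The failing report could then be handed to the reviewer agent in place of the full syntax tree, which keeps the added context small.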

When I have time, I'll try converting a Tree-sitter concrete syntax tree (CST) into a relevant abstract syntax tree (AST) and see how many tokens it takes compared to the source code. In the best case, the complete syntax tree could be included in the context, so the LLM could access things like function signatures for source code that is not itself in the context. I'm also interested in seeing what kind of latency hit we would get if we let the LLM access Tree-sitter through a function call.
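One rough way to run that size comparison, sketched with Python's built-in `ast` module instead of Tree-sitter (an assumption made to keep the example self-contained). It compares character counts only; a real measurement would serialize the Tree-sitter tree and count tokens with the model's own tokenizer.

```python
import ast

SOURCE = '''
def area(width: float, height: float) -> float:
    return width * height
'''

# Serialize the full syntax tree to text, as it might be placed in the context.
tree = ast.parse(SOURCE)
dumped = ast.dump(tree)

# Character counts as a crude proxy for token counts.
source_size = len(SOURCE)
tree_size = len(dumped)
ratio = tree_size / source_size

print(f"source: {source_size} chars, tree: {tree_size} chars, ratio: {ratio:.1f}x")
```

For this snippet the serialized tree is larger than the source, which is the cost that would need to be weighed against the benefit of having the tree in context.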

There may be a case for using an actual compiler instead, but I think the complexities around environments, non-uniform compiler interfaces, and partial compilation suggest a Tree-sitter solution is more scalable.

I appreciate your work, and please point me in the right direction if similar concepts have been discussed before!

danenania (Maintainer) · 2024-04-12T17:16:33Z

@larssondaniel Thanks for your post. It's funny that you mention it, because I just had a discussion about this on HN yesterday: https://news.ycombinator.com/threads?id=danenania#40002284

TL;DR: I think incorporating tree-sitter is a very promising idea, but my first instinct wouldn't be to automatically pull in a map of the whole repo (as Aider does). I think I'd prefer an option to load in definitions by file or directory, similar to how directory layouts can be loaded now with `plandex load some-dir --tree`, or with `plandex load . --tree` for the whole project's layout. I like the sound of a `--defs` flag that works like `--tree` and can pull in defs either for a file or list of files, or recursively for a whole directory.

This would give the user the option of pulling in a whole repo map with `plandex load . --defs`, but would also let the user be more selective when appropriate, or avoid adding the extra tokens when a map isn't needed.
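To make the idea of a definitions map concrete, here is a minimal sketch of what loading defs for a single file might produce. It is not Plandex's implementation: `load_defs` is a hypothetical helper, and Python's built-in `ast` module stands in for tree-sitter, so it only covers Python source.

```python
import ast


def load_defs(source: str) -> list[str]:
    """Return one signature line per top-level function, class, and method."""
    defs = []
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            defs.append(f"def {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            defs.append(f"class {node.name}")
            # Methods are indented under their class; bodies are dropped.
            for item in node.body:
                if isinstance(item, func_types):
                    defs.append(f"  def {item.name}({ast.unparse(item.args)})")
    return defs


SAMPLE = (
    "class Plan:\n"
    "    def apply(self, path):\n"
    "        pass\n"
    "\n"
    "def load(path, recursive):\n"
    "    pass\n"
)

defs = load_defs(SAMPLE)
print("\n".join(defs))
```

Because only names and parameters are kept, the map stays much smaller than the source, which is what makes loading it selectively per file or directory attractive.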

I think I also prefer the above approach to exposing tree-sitter functionality as a function that the LLM can call, at least for now.

I came across this repo in some initial explorations of the idea yesterday: https://github.com/smacker/go-tree-sitter

larssondaniel (Author) · 2024-04-12T23:28:54Z

@danenania That HN thread was a good read, thanks for sharing.

I wasn't aware that's how Aider does it, but I completely agree with not including the entire tree preemptively.

I've been trying to find a way for the LLM to do retrieval using a combo of semantic embeddings/vectors and a read-only interface to the AST. That way, it could use regular similarity search for semantic retrieval and then use the AST for functional retrieval.
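A toy sketch of that two-step retrieval, under heavy assumptions: a bag-of-words vector stands in for a real embedding model, Python's built-in `ast` module stands in for a read-only Tree-sitter interface, and the `embed`, `cosine`, and `retrieve` names are hypothetical.

```python
import ast
import math
from collections import Counter

SOURCE = '''
def parse_config(path):
    """Read settings from a configuration file."""

def send_email(recipient, subject, body):
    """Deliver an email message to a recipient."""
'''


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Read-only view of the syntax tree: the top-level function nodes.
functions = [n for n in ast.parse(SOURCE).body if isinstance(n, ast.FunctionDef)]


def retrieve(query: str) -> str:
    """Semantic step picks the function; the AST step returns its signature."""
    q = embed(query)
    best = max(functions, key=lambda fn: cosine(q, embed(ast.get_docstring(fn) or "")))
    return f"def {best.name}({ast.unparse(best.args)})"


print(retrieve("deliver an email"))
```

The similarity search decides which definition is relevant, and the tree supplies the exact signature, so the model gets a precise answer without the full file being in context.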

Your proposal is the way to go, at least for now. I'll post back here if I see meaningful results from an AST approach. Now that I'm thinking about it, I remember reading about Sourcegraph abandoning the AST idea due to the monotony of the source tree tokens.
