Tree-sitter integration #83
Replies: 2 comments
-
@larssondaniel Thanks for your post. It's funny that you mention it, because I just had a discussion about this on HN yesterday: https://news.ycombinator.com/threads?id=danenania#40002284

TLDR: I think incorporating tree-sitter is a very promising idea, but my first instinct wouldn't be to automatically pull in a map of the whole repo (as Aider does). I'd prefer an option to load in definitions by file or directory, similar to how directory layouts can be loaded in now. This would give the user the option of pulling in a whole repo map when they want one.

I think I also prefer the above approach to exposing tree-sitter functionality as a function that the LLM can call, at least for now.

I came across this repo in some initial explorations of the idea yesterday: https://github.com/smacker/go-tree-sitter
-
@danenania That HN thread was a good read, thanks for sharing. I wasn't aware that's how Aider does it, but I completely agree with not including the entire tree preemptively. I've been trying to find a way for the LLM to do retrieval using a combo of semantic embeddings/vectors and a read-only interface to the AST. That way, it could use regular similarity search for semantic retrieval and then use the AST for functional retrieval. Your proposal is the way to go, at least for now. I'll post back here if I see meaningful results from an AST approach. Now that I'm thinking about it, I remember reading about Sourcegraph abandoning the AST idea due to the monotony of the source tree tokens.
-
Thanks for an awesome product 🫡
Has a Tree-sitter integration been up for discussion? If a syntax tree is generated at the start of a new plan, it opens up multiple ways to validate syntax and make refactoring safer. If including the tree in the context isn't feasible due to size, we could perform syntax validation and linting (depending on the language) and pass those results to a QA/reviewer agent that can take action.
When I have time, I'll try to convert a Tree-sitter Concrete Syntax Tree into a relevant Abstract Syntax Tree and see how many tokens it ends up being compared to the source code. In a best-case scenario, the complete syntax tree could be included in the context, meaning the LLM could access things like function signatures for source code that is not included in the context. I'm also interested in seeing what kind of latency hit we would get if we let the LLM access Tree-sitter through a function.
There may be a case to be made for opting for an actual compiler instead, but I think the complexities around environments, non-uniform compiler interfaces, and partial compilation suggest a Tree-sitter solution is more scalable.
I appreciate your work, and please point me in the right direction if similar concepts have been discussed before!