
Gitteh needs a primary maintainer! #68

Open
samcday opened this issue Aug 31, 2013 · 61 comments

Comments

@samcday
Contributor

samcday commented Aug 31, 2013

I had an epiphany this week: I've been selfish when it comes to Gitteh.

I've acknowledged several times in issues / emails that I don't really have enough time for Gitteh. But then, like most developers, I'm presented with a problem of some kind and want to solve it. Then suddenly Gitteh seems interesting again, and I feel all inspired and start making promises that I generally don't keep.

Today, I think node-gitteh holds the most promise out of all the git libraries for Node. Here's why:

  • It's under the official libgit2 umbrella and has compatible licencing.
  • The latest version is stable and has a decent suite of tests.
  • It has a decent amount of interest (260 stars, 37 forks, 10 watchers).
  • It's had a lot of blood, sweat, and tears poured into it :)

I think the design choices made for gitteh are sound. There's been some quibbling over implementation choices (such as CoffeeScript), but I think these can be addressed easily enough in the short term.

SO. What I need is for someone to step up to the plate. I need someone who wants to become the primary maintainer for gitteh.

Please mention your interest here, and we'll hash it out. The person I pick to be the primary maintainer will ideally be someone who's shown a healthy amount of interest in gitteh over the past few months and has some GitHub projects demonstrating an understanding of Node and C++ (read: the best candidate would be someone who's written a Node native module before). Once I find the right person, I will ensure they are given contributor access to this repository.

@samcday
Contributor Author

samcday commented Aug 31, 2013

@FrozenCow @lushzero @jmendeth @ben @iamwilhelm - Who's it gonna be fellas? :)

@mildsunrise
Contributor

Does it need to be just one?

[...] has some GitHub projects demonstrating an understanding of Node and C++ [...]

I have some overall experience with that: I wrote Robotskirt (a Node.js wrapper for the now-frozen Sundown library), as you can see in the contributors page; from it I derived the V8U set of micro-utilities for writing V8 addons. And some other small modules, like parport.js (high-level access to parallel ports from Node).

But I wouldn't like to hold that responsibility alone.

PS: @ben is already a maintainer. :)

@FrozenCow

I have experience with C++ and I have experience with Node/JS. However, I don't have any experience writing C++ addons for Node (apart from the PR I did here), so I don't think I'm fit for the job.

Does it need to be just one?

I agree, 2 or more would be best.

@mildsunrise
Contributor

However, I don't have any experience writing C++ addons for Node (apart from the PR I did here)

you'll get experience from gitteh! :D

@iamwilhelm
Contributor

@samcday That's good to hear. I'd hate to see it languish. I have a vested interest in making sure gitteh is up to date and up to par, as I use it in cubehero.com. However, I haven't written a Node native module before, though I'm not against spending the time to learn. I think as one of the maintainers, I'd probably start off doing things like making sure documentation is up to date, and making sure PRs and issues don't just sit there.

@mildsunrise
Contributor

I think as one of the maintainers, I'd probably start off doing things like making sure documentation is up to date, and making sure PRs and issues don't just sit there.

That's the spirit! It's clear that we need several maintainers. 😃

@ben
Member

ben commented Aug 31, 2013

I'm afraid I have to bow out for the maintainer role. I'm a bit overcommitted just now, and more projects are coming down the pipeline. I'll keep watching this, and feel free to summon me for libgit2 questions.

@mildsunrise
Contributor

I'm afraid I have to bow out for the maintainer role. I'm a bit overcommitted just now, and more projects are coming down the pipeline. I'll keep watching this, and feel free to summon me for libgit2 questions.

I understand. But don't worry, we'll still @ summon you. ;)

@lushzero

lushzero commented Sep 1, 2013

I'm pretty much out on gitteh at this point and working on sgit. To me, a lot of the fundamental technical decisions made in gitteh were sound at the time they were made, but not now. Libgit2 is very different these days, so they don't make sense anymore, e.g. the C++ async, thread handling, and thread locking that make up 70% of gitteh's code.

@samcday
Contributor Author

samcday commented Sep 1, 2013

@lushzero Fair enough. I'm not sure I'd want someone working on gitteh if they make statements like that without backing them up with properly reasoned arguments ;)

@jmendeth @FrozenCow @iamwilhelm Thanks for indicating interest. I wrote up the initial issue pretty fast, and didn't mean for it to sound like it precludes more than one active maintainer. What I meant by it is I need someone (or someones, hehe) to be involved in gitteh enough so that it keeps moving along, as it's not really something I realistically have time for. That said, I plan on remaining involved, and may be able to contribute some more features over the coming weeks / months. We'll see.

@iamwilhelm I didn't actually realise you were using gitteh in anger somewhere like that. That's pretty cool! Gitteh is actually something I wrote for a project I was working on a while ago, but I've since scrapped that project and I think that's one of the biggest reasons why I haven't had much time for gitteh.

@ben Understandable!

Okay so the guys who are interested, sounds like we need to divvy up some responsibilities. Here's a quick rough cut of what might work:

  • @jmendeth + @FrozenCow work on getting bindings up to date with latest libgit2.
  • @iamwilhelm Get new version of documentation up, implement some higher level methods in gitteh userland.

Whatcha think?

@mildsunrise
Contributor

Yeah, that sounds great to me.

But I think that, before trying to start working, we should set up a simpler build system (i.e. remove CoffeeScript), so that we (and future contributors) can focus better on it.

There are also things we can simplify in gitteh's code, now that LibUV is standard in every Node.

@FrozenCow

Since @lushzero was working on a fork, out of interest I thought I'd look around to see whether there were other git libraries for Node based on libgit2. I found https://github.com/nodegit/nodegit/tree/wip . Does anyone know anything about that project? They seem to do things very similar to what node-gitteh does, and it's still active. I'm not sure how far along they are, but they do have treebuilder stuff in there, which was something I wanted to put in gitteh too.

With sgit, there are 3 projects doing the exact same thing, with slight differences in their APIs and the way they work. I'm a bit torn about what to do here. Focus our efforts on one project? Keep developing multiple projects side by side? Collaborate a bit? I'm still interested in the new methods @lushzero is suggesting, because the current way to implement methods in gitteh is quite hard for a beginner. With the code in a better state, it would be easier for people to contribute.

That being said (back on-topic), I don't mind being a maintainer, though I don't know how long it will hold my interest. I'm working on a project of my own that uses gitteh, and I'm adding the features it needs, but I don't know how long that project will last. I'll likely lose interest/become less active once that project dies. So, short-term: yes; long-term: I'm not sure.

@mildsunrise
Contributor

@lushzero @FrozenCow @samcday

My thoughts:

  1. @lushzero has interesting views (which I don't understand yet; I want to talk about them in a dedicated place) which can help improve Gitteh as well as SGit.

  2. I understand that @lushzero didn't back up his arguments in his comment; that's natural, since this is not the adequate place to talk about implementation issues, so we'd better open a dedicated issue to discuss them.

  3. Nodegit was a quick & low-level binding. Lately it has improved a lot, but Gitteh has been a serious high-level approach since the beginning. So we shouldn't focus much on nodegit.

  4. @FrozenCow About the interest issue, I have to say: this is what happens to everyone. 😄 Even @samcday admits he once lost interest (and now Gitteh seems interesting again). I also wanted to do a rewrite and actually started it (and developed it for some time), and it has some functionality, but now it sits there, abandoned.

    So that "interest" problem is normal for everyone; what (I think) is important here is that we inform others about it before leaving. Other than that, I don't see any problem in being a maintainer.

PS. Sorry for my ugly English.

@nkallen

nkallen commented Sep 1, 2013

I thought I might introduce myself as I'm now making active contributions to nodegit. I'm the author of the wip branch https://github.com/nodegit/nodegit/tree/wip

The wip branch is a rewrite of nodegit from the ground up. It uses codegen to generate the Node bindings automatically from a JSON description of the API. The wip branch is, I believe, already more comprehensive than both nodegit/master and node-gitteh. IMO, codegen is a better approach than writing bindings by hand because it requires less manual work to add support for a new function -- and libgit2's API is huge! Other advantages include: a bug fixed in one place is a bug fixed in all places, etc.
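
To give a flavor of the idea, here is a toy sketch of what such a generator looks like in spirit. The descriptor fields below are invented for illustration; the real JSON description of the API is far richer.

// Toy illustration of the codegen idea -- not nodegit's actual generator.
// A JSON-ish description of one libgit2 function...
var api = [
  { cFunction: 'git_repository_open', jsMethod: 'open', args: ['path'], async: true }
];

// ...and a generator that stamps out a C++ binding stub per entry. Fix a bug
// in this template once and it is fixed in every generated binding.
api.forEach(function(fn) {
  console.log(
    'Handle<Value> ' + fn.jsMethod + '(const Arguments& args) {\n' +
    '  // marshal ' + fn.args.join(', ') + ', then call ' + fn.cFunction + '\n' +
    '}'
  );
});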

This wip version isn't exactly quick and low-level, although it follows this simple[istic] design philosophy: it aims to implement something like a 1:1 mapping between the node API and the C API, with a layer added in pure Javascript that is more "convenient". For example, there is a wrapper around the treebuilder to make it simple to make deep (nested) changes to trees https://github.com/nodegit/nodegit/blob/wip/example/new-commit.js and there are event emitters for walking history https://github.com/nodegit/nodegit/blob/wip/example/walk-history.js.

All of that "convenience" is implemented in Javascript so people can make nodegit as friendly as possible without having to write C++ -- and at the same time, since nearly 100% of the libgit2 functionality is available to them, people are no longer blocked waiting for someone else to implement some function (like clone or treebuilder) in C++.

Even the 1:1 "raw" API is not as bad as it sounds. It's still object-oriented in style, and is asynchronous exactly and only where libgit2 (potentially) does blocking IO:

  git.Repo.open(path.resolve(__dirname, '../.git'), function(error, repo) {
    if (error) throw error;

    repo.getCommit('59b20b8d5c6ff8d09518454d4dd8b7b30f095ab5', function(error, commit) {
      if (error) throw error;

      commit.getEntry('README.md', function(error, entry) {
        if (error) throw error;

        entry.getBlob(function(error, blob) {
          if (error) throw error;

          console.log(entry.name(), entry.sha(), blob.size());
        });
      });
    });
  });

Note that in the above code, nothing is part of the "convenience" API -- everything is 1:1 and automatically generated. Preliminary documentation of this raw API is here: http://www.nodegit.org/nodegit/ . Documentation of the convenience layer is forthcoming (but check the examples at https://github.com/nodegit/nodegit/tree/wip/example )

I'd like to explore the idea of whether we can merge these two projects. That is, whether we can just take the codegen rewrite of nodegit, and once it is stable and well documented, call it libgit2/WHATEVER and deprecate the previous two projects. One project has the benefit of: one implementation, more maintainers, fewer bugs, better documentation.

So that's my hope.

@mildsunrise
Contributor

Hi @nkallen,

that explains many things (I remember nodegit being rather hackish, with a raw API, and the results with codegen really impressed me). However, going from manual to auto is a huge change, and has to be studied accordingly (it's also a big philosophy change).

You see, Gitteh now wants to remodel itself and needs some people to care for it (the reason for this thread). So I think the best approach would be to let @samcday decide, then wait a bit so we can simplify and get things back to stability, and then we would all debate the fusion you're proposing (I'm serious about getting a single project with all our efforts combined).

Conclusion: we need some time to get Gitteh itself to normality, and then we can talk about that change. :)

My thoughts.

@lushzero

lushzero commented Sep 1, 2013

@nkallen I find the codegen approach interesting. I guess it sort of leads into the main goals of a binding project and what individual developers want to use it for. I have been making pretty swift progress on node-sgit, https://github.com/DerpKitten/node-sgit, and ended up with something relatively high-level. My needs are to be able to create repositories, commit files to them, and then get history and revisions of files. What particular codegen tool are you using?

I am really curious as to what everyone else's needs are, and whether a higher-level or lower-level approach would be a better fit for them.

I do concur that finding someone to combine efforts would make sense if it can be worked out. Although I could definitely see a situation that results in 2 binding modules, one low level and one higher level.

@mildsunrise
Contributor

@lushzero Ooh, I googled node-sgit and looked at your account and couldn't find it. I've now caught you! >:)

I am really curious as to what everyone else's needs are, and whether a higher-level or lower-level approach would be a better fit for them.

I'm also curious as to what benefits node-sgit gives and how much precision it offers over the generated bindings. Performance is also an important matter.

I do concur that finding someone to combine efforts would make sense if it can be worked out. Although I could definitely see a situation that results in 2 binding modules, one low level and one higher level.

Me too.

@samcday
Contributor Author

samcday commented Sep 1, 2013

@jmendeth I don't think this is something that needs me to make a decision on, actually! For one, I'm putting gitteh out there for some new people to take ownership of, so my vote means a whole lot less anyway. Second of all, I think @nkallen's approach sounds very promising, as maintaining C++ bindings is tedious and error-prone. If the general consensus from those whose opinions I value (basically everyone in this thread) is that a codegen approach is better, then I think we should go for it. If we can deprecate gitteh and nodegit in favour of a unified codebase, vision and group of contributors, I think that's the best possible outcome for the Node community that just wants to load a couple of commits / blobs and walk a reflog or two :P

@mildsunrise
Contributor

If we can deprecate gitteh and nodegit in favour of a unified codebase, vision and group of contributors, I think that's the best possible outcome for the Node community that just wants to load a couple of commits / blobs and walk a reflog or two :P

Don't get me wrong, I just would like to know, e.g., the performance impact, modularity change, API compatibility, etc. of such a big change before jumping into it. But if you guys see it clearly, let's go on! :)
So what do you suggest?

@FrozenCow

EDIT: replies are going fast; this was in response to @samcday ;) I agree with what @jmendeth just said.

Sounds good. We could also decide to start in a separate project (like nodegit's wip branch) and move it over under the libgit2 umbrella once we think things are stable/settled. We could also decide to stay api-compatible with gitteh and keep everything under the same name. People who were using gitteh can keep using gitteh that way.

Anyway, this is a pretty substantial decision. I still need to try nodegit's wip branch by running it in an application. That should give me a better idea what to do next.

@mildsunrise
Contributor

Sounds good. We could also decide to start in a separate project (like nodegit's wip branch) and move it over under the libgit2 umbrella once we think things are stable/settled. We could also decide to stay api-compatible with gitteh and keep everything under the same name. People who were using gitteh can keep using gitteh that way.

That's one option. The other is to just deprecate this and focus our efforts on nodegit as @samcday suggested.

If we were to deprecate gitteh, we could write one last gitteh version: a module that adapts nodegit's API to Gitteh's
(that is, a simple JS module that depends on nodegit and uses it to provide Gitteh's API, plus new features).
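
Something along the lines of this minimal sketch (openRepository and getCommit stand in for gitteh's actual method names; the nodegit calls are the ones shown elsewhere in this thread):

var git = require('nodegit');

// Sketch of a compatibility shim: a gitteh-style openRepository() backed by
// nodegit's Repo.open(). Only one method is adapted here; a real shim would
// cover the rest of gitteh's API surface the same way.
exports.openRepository = function(path, callback) {
  git.Repo.open(path, function(error, repo) {
    if (error) return callback(error);
    callback(null, {
      // expose a gitteh-flavoured method backed by the nodegit object
      getCommit: function(sha, cb) { repo.getCommit(sha, cb); }
    });
  });
};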

Anyway, this is a pretty substantial decision. I still need to try nodegit's wip branch by running it in an application. That should give me a better idea what to do next.

👍

@lushzero

lushzero commented Sep 2, 2013

Could anyone else chime in with their requirements? Mine are pretty simple, but had little reliable support in any of the existing modules I tested. I settled on a very simple API that is probably closer to the command-line git than anything else. Super simple, no nesting, calls are self-contained. The result is easy to learn and use, arguably fast, and light on memory. https://github.com/DerpKitten/node-sgit/blob/master/example.js . I'm not sure if my use cases are anything like anyone else's though.

This isn't a reflection of the qualitative value of nodegit, just the fact that it is much lower-level, but the "log" case in nodegit is roughly 38 lines (see the API example, https://github.com/nodegit/nodegit) compared to only 6 in sgit. In both cases the timing is quite fast. Across ten runs against the same .git repo (sgit) with 479 commit log entries, I get a median of 27,541,637 nanoseconds for sgit and 107,501,468 nanoseconds for nodegit. Sgit is marginally faster (by about 80 million nanoseconds, i.e. 80 ms), but that's negligible for any real-world use, and the benchmark methodology is imperfect (process.hrtime). Mostly the difference is in the approach, I think: nodegit loops over each commit, sgit gets them in a large batch.
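
For reference, the timing was along these lines; it's process.hrtime-based, hence imperfect, and runLogOnce is a placeholder for the sgit/nodegit log call under test:

var start = process.hrtime();
runLogOnce(function(error, entries) {   // placeholder for the log call under test
  var diff = process.hrtime(start);     // [seconds, nanoseconds] since start
  console.log(diff[0] * 1e9 + diff[1], 'nanoseconds');
});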

If someone can provide a working example of the "log" case with gitteh I can time that as well.

To me it's sort of meaningless to talk about performance in the abstract without talking about actual numbers and applicable use cases. From my reading, everything anyone here is talking about is pretty damn fast; there's not all that much road between the JS call, the C++ wrapping, and the underlying C library no matter how you cut it. I don't think performance is going to be a big issue for anyone, but please explain how that thinking is wrong.

My suggestion is that if we can all agree on a rough sketch of what the Node API should look like, we can then decide what backend configuration makes the most sense for most of the people most of the time.

@nkallen

nkallen commented Sep 2, 2013

Hi guys,

I'm excited you are going to consider the codegen approach as an option. I agree with the commenters above ( @jmendeth , @FrozenCow) that you guys should evaluate it thoroughly before making a commitment. It's still a work in progress, and an outside evaluation is exactly what I need to get the quality to a high level. I'm sure you'll find bugs and have suggestions for improvements, but I'm confident that after I incorporate your feedback, you'll find the functionality comprehensive, stable, and performant.

Whoever is willing to volunteer to beta-test it, I'm available to help. The documentation is a bit sparse, as I said, but you can read the examples linked above and also the source code of this application I built, http://gitdb.elasticbeanstalk.com/repos/gitdb/refs/heads/master/tree/ , which uses it.

Making the decision to adopt codegen won't be easy. The API will break. Yes, a compatibility layer in Javascript can be written -- and it's "easy" to do so, if incredibly tedious.

If I may "chime in" with some requirements, I think the goal first and foremost should be to support as large a portion of libgit2's functionality as possible. That supports the widest variety of use cases, obviously. A good reason not to have 3 separate projects is that libgit2's API is large, and getting decent coverage is a lot of effort.

For example, @lushzero, here is a pretty terse way to do a git log in nodegit/wip:

git.Repo.open(path.resolve(__dirname, '../.git'), function(error, repo) {
  repo.getMaster(function(error, branch) {
    var history = branch.history();
    history.on('commit', function(commit) {
      console.log('commit ' + commit.sha());
      console.log('Author:', commit.author().name() + ' <' + commit.author().email() + '>');
      console.log('Date:', commit.date());
      console.log('\n    ' + commit.message());
    }).start();
  });
});

Again, following the same approach: nearly all of libgit2's functionality is directly available in javascript-land, and the API can be made as friendly as you want by writing wrappers in javascript.

The reason libgit2 doesn't JUST have a simple git log command is because you might want to do git log a..b, abort after 100 entries, change the sorting to topological, or any manner of crazy combinations. Thus libgit2 provides git_revwalk_new, git_revwalk_push, and git_revwalk_next. And similarly, nodegit/wip now provides all of that to you in javascript land too -- and ALSO the simple, friendlier branch.history().on(), which is implemented in pure javascript. That is surely easier than writing a bunch of friendly code in C++, for example as you have here: https://github.com/DerpKitten/node-sgit/blob/master/src/sgit.cc#L130 .
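
To make that concrete, here is a rough sketch of how such a pure-javascript convenience layer can sit on top of the raw bindings. The raw method names (createRevWalk, push, next) are illustrative stand-ins for the generated 1:1 wrappers, not the actual generated names:

var EventEmitter = require('events').EventEmitter;

// A friendly history() emitter built purely in JS on top of a raw revwalk.
// createRevWalk/push/next stand in for 1:1 wrappers over git_revwalk_new,
// git_revwalk_push and git_revwalk_next.
function history(repo, branch) {
  var emitter = new EventEmitter();
  emitter.start = function() {
    var walk = repo.createRevWalk();
    walk.push(branch.sha());
    (function step() {
      walk.next(function(error, commit) {
        if (error || !commit) return emitter.emit('end', error);
        emitter.emit('commit', commit);
        step();                         // one commit per tick: no hogged thread
      });
    })();
  };
  return emitter;
}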

That said, your approach has, at least in theory, higher throughput, because it crosses the runtime boundary less often and does less context switching (because fewer items are put in the libuv event queue). On the other hand, your approach doesn't interleave well. It hogs the libuv thread while it looks up all of the commits -- each of which is potentially a random access on disk -- which means that a concurrent application (like a web server) will have extreme latency variance as disproportionately large jobs clog up your thread pool.

In summary, I think libgit2's API strikes the right balance of small, compose-able tools that provide maximum flexibility. Making it directly available in Javascript (in an OO way) provides the greatest functionality and flexibility. The only downside of the 1:1 mapping is that it means crossing the runtime barrier and context switching more often -- potentially a performance hit. However, I'm confident I can address performance issues as they arise, and my preliminary benchmarks suggest that preloading data/minimizing runtime boundary crossings provides a minimal benefit:

https://gist.github.com/nkallen/6406128

@mildsunrise
Contributor

It hogs the libuv thread while it looks up all of the commits -- each of which is potentially a random access on disk -- which means that a concurrent application (like a web server) will have extreme latency variance as disproportionately large jobs clog up your thread pool.

Exactly. That's the big drawback I see in node-sgit.

@mildsunrise
Contributor

@nkallen Looking at the generated (because they're autogenerated?) nodegit C files, I see it uses Boost to escape values, join them into JSON, and pass that to the JS side. It'll be interesting to see the performance of that against a manually written module such as Gitteh. :)

I must have some benchmarker around, will try to adapt it to nodegit.

@nkallen

nkallen commented Sep 2, 2013

The generated code doesn't use Boost. See https://github.com/nodegit/nodegit/blob/wip/src/revwalk.cc

This uses a custom code generator, so the code can generate any style of C++.


@iamwilhelm
Contributor

@iamwilhelm I didn't actually realise you were using gitteh in anger somewhere like that. That's pretty cool!

@samcday thanks! Gitteh did cause me to burst veins sometimes, when the lib didn't install or build on npm, but that's over now. Re: documentation, how did you plan on generating the documentation? Did you have a Cake task somewhere, or a particular preference? Should we move the documentation talk to a different issue, and leave this thread for handover talk?

@iamwilhelm
Contributor

@jmendeth I don't think this is something that needs me to make a decision on, actually! For one, I'm putting gitteh out there for some new people to take ownership of, so my vote means a whole lot less anyway.

@samcday I think we look to you, since you're the current maintainer, and you actually have an informed opinion of what might work for gitteh. You may not have to make the final decision, but you will probably have to guide the decision, or else the project will languish from indecision.

@iamwilhelm
Contributor

When it comes to needs and requirements, my needs are the following:

  1. It should have the higher-level functions (what git refers to as porcelain). I'm not clear on the distinction between what everyone calls the higher- and lower-level functions right now, since I'm not as familiar with the libgit2 library, but I think it's important for the main API to match the commands you'd use on the command line day to day, though I don't mind walking trees and commits myself. That will help new users wrap their heads around how to use the library more easily. Any more than that is icing on the cake for me. That said, I do see the value in having coverage of the entire libgit2 lib, so you can implement your own stuff.

  2. The API should be relatively stable. I know we're still at 0.17.x and not 1.0, but like others before me, I find no pleasure in building on top of shifting sands. That said, as long as we agree on something, and we're not changing things every couple of months, that's fine by me.

  3. I hadn't thought about the issue with libuv and great variance in response times. But now that y'all mention it, yes, it would be bad for me to have high variance in response times when using the lib. A fast git lib with low response-time variance is something I'd get behind. But we'd need to benchmark the two different approaches.

@lushzero

lushzero commented Sep 2, 2013

@iamwilhelm I'm pretty much in line with this view; I'm looking for an API much closer to the CLI git usage than the actual plumbing of libgit2.

After doing some additional testing, one of my concerns with the codegen approach seems to be borne out: it opens the door to a lot of memory leaks. Node itself is not awesome in that regard, but I don't see how you can keep that under control without a lot of manual coding in the C++ bindings and additional attention on the JS side, and that sort of defeats the purpose of using codegen in the first place.

A cursory review with valgrind (which should not be taken as gospel) of the log example I noted earlier shows that nodegit leaks:

1 run:
==19285== LEAK SUMMARY:
==19285== definitely lost: 58,013 bytes in 1,738 blocks
==19285== indirectly lost: 218,421 bytes in 5,173 blocks
==19285== possibly lost: 512 bytes in 3 blocks
==19285== still reachable: 43,163 bytes in 50 blocks
==19285== suppressed: 0 bytes in 0 blocks

50 runs:
==19323== LEAK SUMMARY:
==19323== definitely lost: 2,092,324 bytes in 53,720 blocks
==19323== indirectly lost: 9,883,880 bytes in 246,865 blocks
==19323== possibly lost: 707,256 bytes in 12,646 blocks
==19323== still reachable: 48,326 bytes in 541 blocks
==19323== suppressed: 0 bytes in 0 blocks

versus the sgit approach to the same log function:

1 run:
==19297== LEAK SUMMARY:
==19297== definitely lost: 660 bytes in 46 blocks
==19297== indirectly lost: 8,558 bytes in 10 blocks
==19297== possibly lost: 512 bytes in 3 blocks
==19297== still reachable: 42,117 bytes in 38 blocks
==19297== suppressed: 0 bytes in 0 blocks

50 runs:
==19303== LEAK SUMMARY:
==19303== definitely lost: 4,872 bytes in 93 blocks
==19303== indirectly lost: 26,580 bytes in 452 blocks
==19303== possibly lost: 0 bytes in 0 blocks
==19303== still reachable: 4,727 bytes in 33 blocks
==19303== suppressed: 0 bytes in 0 blocks

It's not a completely fair head-to-head comparison, given that the two approaches are diametrically opposed, but it's the reason I chose the approach I did in sgit. My use case can't survive substantive memory leaking. It's not that the nodegit approach can't be fixed either; it's that it will take as much effort as simply writing quality C++ bindings in the first place and exposing release functions.

That I also see as a problem, though: node and JS in general don't have much convention for the manual release of resources. I think that will be a problem to the extent that any references and pointers to libgit2 items are brought all the way into the JS side. I can see a few cases where very limited use would make sense, but I don't see how to avoid doing it wholesale in the codegen approach. I'm generally a big fan of "automatic" bindings and was hopeful here, but the data has convinced me. I'm a huge fan of SWIG and such, but I just don't think it makes sense in the node/git case, or at least as it relates to my use of it. If there is a compelling counter-argument I'm all for hearing it.

Libgit2 has roughly 500 functions, of which I could potentially see about 30 needing direct access from the JS side. Maybe even as few as a dozen. So my investigation leaves a couple of relatively straightforward technical items that can determine how best to proceed and combine efforts:

  • Who needs an API closer to the actual libgit2 plumbing, and who needs one that is closer to the git CLI?
  • Who favors a self-contained approach to addon API calls, and who favors bringing pointers/references into the JS side? I.e. taking strings as arguments and returning strings, versus taking references/objects as arguments and returning references/objects (see the contrast sketched after this list).
  • Who needs what level of async and thread locking? Gitteh has async on the JS and C++ sides and locking on the JS side. Both nodegit and sgit do async on only the JS side and let libgit2 do locking and threading. (When gitteh was written, async was needed on the C++ side and thread locking was needed on the JS side; as best as I can tell, neither of those things is true today.)
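
To illustrate the second point, the two calling styles look roughly like this (the sgit signature is illustrative; the nodegit calls are the ones quoted earlier in the thread):

// Self-contained, string-in / string-out (the sgit style; signature illustrative):
sgit.log('/path/to/repo.git', function(error, json) {
  var entries = JSON.parse(json);       // one call returns the whole batch
});

// Reference-passing, object-in / object-out (the gitteh/nodegit style):
git.Repo.open('/path/to/repo.git', function(error, repo) {
  repo.getCommit(sha, function(error, commit) {
    // commit wraps a live libgit2 object that must eventually be freed
  });
});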

I'm not only advocating for sgit; I'd be just as happy to contribute to nodegit. Bottom line: I need a library that can create repos, commit to repos, and pull revisions/history (log) ASAP, and I need it not to leak memory. I don't really care how that gets accomplished and would prefer a solution that has multiple contributors. On the performance level, I think all the solutions here can be made to perform more than adequately for normal use cases.

@FrozenCow

Hmm, I wonder what is causing the memory leak, since both sgit and nodegit are leaking. Could you share the test case?

It seems that sgit doesn't do asynchronous calls, since it's using v8 types inside the same function where libgit2 is called. If that's indeed the case, I think it's skewed to compare the two projects at the level of memory or performance. Asynchronous calls are vital for node libraries, even though they do cost more cpu/memory. From what I can tell nodegit frees its libgit2 objects once the garbage collector of v8 cleans up its objects (in the destructor of its ObjectWrappers). I could see this as being one potential reason why you're seeing memory leaks: v8 might not delete all objects cleanly if the process is killed before v8 cleans up the garbage. It seems nodegit does do its freeing consistently for all its objectwrappers, but only if a libgit2-free function is defined. This sounds like a good choice: it assures all objects are freed correctly.

It might be a good idea to test the memory footprint by walking the tree a great number of times and give the GC some time to clean up the garbage. I'm also interested in whether valgrind also shows memory leakage when creating a great bunch of objects in node without using any nodegit/sgit/gitteh libraries.

Anyway, maybe we should do this memory leak discussion on nodegit's issue tracker.

That all being said, I do think it's better to have a minimal layer between node and libgit2. That assures us we can use all functionality that's in libgit2. The code generation allows keeping the bindings up to date with libgit2 more easily and assuring everything works somewhat the same (no forgotten frees, for example). We can focus on getting these low-level bindings stable.
On top of the low-level bindings we can build high-level convenience functions in JS (like clone). If we only implement the functions that we have use cases for now, you'd probably see people needing to implement their own bindings themselves and do pull requests for those. It worked for gitteh, but it requires a lot of time for outsiders to get familiar with the code and do things right. It would be better for them to contribute convenience functions in JS, since that's much easier, less error-prone, and results in less disastrous situations when a mistake is made (an exception in JS that can be caught, instead of a segfault).

My use cases mostly require creating bare repositories, creating commits (based on previous commits/trees), retrieving history, retrieving trees, and retrieving files, but all without an index/working directory. That's short-term, and I can use what's in gitteh for now. Long-term I see use in merging, pushing/pulling, and preferably streaming blobs (for saving and retrieving files directly, instead of buffering things in memory first).

@lushzero

lushzero commented Sep 3, 2013

@FrozenCow I would urge you to conduct your own tests; it's pretty straightforward. My examples were using the nodegit one from its API readme (Git Log Emulation) and the log example for node-sgit. The command line I was using for valgrind was "valgrind --leak-check=full --show-reachable=yes node test.js". The nodegit example does not appear to use async (on the JS side), and I was using sgit without async either. The scripts did nothing but run the example in a for loop, and neither printed the results out; only the API calls were made. I ran them against the node-sgit repo, which has roughly 500 commits.

As I understand it, the node/v8 GC should not be depended upon to clean up external resources -- see https://code.google.com/p/v8-juice/wiki/CommentaryOnV8#No_GC_guaranty . In the test cases I pointed out, I'm pretty sure the GC wouldn't be able to clean up those items anyway. For node, my point is that you're going to have to write a bunch of C++ cleanup code -- or even a borderline custom "GC", unless you want to push all the release work to the JS end user -- to address the fact that the node GC may never run. That's just as much work for 12 to 30 API calls as just writing those non-automatically (not codegen) to begin with.

I see your use cases encompass perhaps 12 or fewer API calls as well. Is anyone else in favor of a non-codegen approach? My mind is made up at this point; I'm happy to let others contribute to nodegit/gitteh/etc., but neither meets my needs today, and I foresee months of work to iron out issues that seem apparent to me. I'd be happy to start adding other people's API needs to sgit if you're willing to test and contribute. If you can point to a roughly working example of your need in C, it can be added to sgit in about 15 minutes.

@mildsunrise
Contributor

As I understand it, the node/v8 GC should not be depended upon to clean up external resources -- see https://code.google.com/p/v8-juice/wiki/CommentaryOnV8#No_GC_guaranty . In the test cases I pointed out, I'm pretty sure the GC wouldn't be able to clean up those items anyway. For node, my point is that you're going to have to write a bunch of C++ cleanup code -- or even a borderline custom "GC", unless you want to push all the release work to the JS end user -- to address the fact that the node GC may never run. That's just as much work for 12 to 30 API calls as just writing those non-automatically (not codegen) to begin with.

So, I'm not getting you... the V8 garbage collector will free most JS objects, unless they have been explicitly persisted on the C++ side using a Persistent handle. These objects need to be freed manually by calling ->Dispose().

Other than that, yes, the GC should be expected to remove any V8 objects which have no references.

  • Non-V8 objects aren't destroyed by the GC; it's your responsibility to do that.
  • Persistent objects aren't destroyed by the GC; it's your responsibility to explicitly free them.

If you aren't doing those two things correctly, you are leaking. That's it.

@mildsunrise
Contributor

@lushzero, you were using this code to test sgit? Just to confirm.

@mildsunrise
Contributor

It might be a good idea to test the memory footprint by walking the tree a great number of times and give the GC some time to clean up the garbage.

You don't need to do that. You can run Node with the --expose-gc flag and then call gc() at any point in your code to clean everything up.

I'm also interested in whether valgrind also shows memory leakage when creating a great bunch of objects in node without using any nodegit/sgit/gitteh libraries.

I'm also interested, since I looked extensively at node-sgit's code and couldn't find any leaks in the log function.

Ideal process would be (a sketch follows the list):

  1. setup everything and require the module.
  2. call gc() & measure.
  3. run loop in a nested function.
  4. call gc() and measure again.
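
A minimal sketch of that harness, run with node --expose-gc (runLoop is a placeholder for the binding calls under test, and the setTimeout gives pending callbacks a chance to drain):

function measure(label) {
  global.gc();                          // force a GC round
  console.log(label, process.memoryUsage().heapUsed, 'bytes of JS heap used');
}

measure('before');                      // step 2
(function runLoop() {                   // step 3: exercise sgit/nodegit/gitteh here
  // ...
})();
setTimeout(function() { measure('after'); }, 0);   // step 4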

@lushzero

lushzero commented Sep 3, 2013

@jmendeth I was using the code you identified, with the exception that I was pointing at an instance of the node-sgit repo instead of the test sandbox repo. I'm not sure valgrind is the absolute best tool for memory leak detection in node; something like this is probably better: https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/ but I'm not versed in it.

Node has a tendency to leak memory nearly everywhere for many reasons, not least of which is closures. Here is perhaps the most famous example: vvo/node@e138f76

V8 has a fundamental tradeoff in that the GC is NOT called until there is memory pressure. You need to make sure that long-lived code will not unnecessarily "leak" prior to the GC being called, or you end up with a very disadvantageous performance profile I would refer to as "boom and bust". The effect is exaggerated on larger systems as the scale of memory begins to greatly outstrip the speed of the CPUs.

I think you are oversimplifying the complexities of memory management a bit across a stack that encompasses C, C++, and V8 (node); see https://developers.google.com/v8/embed as a start. I would say as a general observation that if you are depending on the node GC to clean up after your addon module, you are going to have a bad time.

@mildsunrise
Contributor

V8 has a fundamental tradeoff in that the GC is NOT called until there is memory pressure.

Yes, similar to Java's GC.

I think you are oversimplifying the complexities of memory management a bit across a stack that encompasses C, C++, and V8 (node); see https://developers.google.com/v8/embed as a start.

I may have explained myself badly, but I had already read that page several times and I understand it (or at least I think so!) correctly. Yes, V8 values are freed when they have no references to them, and by "references" I mean JS and C++ handles. Local handles are deallocated when the scope exits, so you shouldn't have problems with that. Persistent handles have to be Disposed manually. And any other value (which is not managed by V8) won't be destructed [directly].

So if a single call to my API creates lots of V8 handles but doesn't leak them (i.e. they have no extra references that prevent them from being GC'd), everything should be okay?

I would say as a general observation that if you are depending on the node GC to clean up after your addon module, you are going to have a bad time.

Does that mean that I should for example have a free method on Repository objects instead of relying on GC to notify me when they've been removed?

@mildsunrise
Contributor

Correction: I explained myself badly again. V8 values are ready to be freed on the next GC round when they have no references pointing at them.

@nkallen

nkallen commented Sep 3, 2013

I think I've "lost sight of the forest for the trees", so please forgive a few questions.

There is no question that there are a few memory leaks in the generated code -- just as there is no question there are a few bugs -- this is a work in progress! So please bear with me. I have some valgrind data that I'll share below, but I wanted to make sure I understood a few of your criticisms first.

Debugging memory leaks isn't super obvious in node.js -- you definitely need to ensure that you call the garbage collector global.gc() explicitly if you're using valgrind. I'm not sure, @lushzero, whether the data you pasted above (thanks for that!) takes this into account or not. I'm also not clear whether you're running your test on the generated code nodegit/wip or just nodegit/master -- the example from the README that you cite doesn't make sense for nodegit/wip, since the API has changed.

It's not categorically different to do memory management in generated versus hand-written code. The generated code generates free()s and so forth. You have to configure the system to put them in the right place, just as you have to remember to put them in the right place when you're coding by hand. I would argue that it's in fact easier, because once you configure the generator to free a certain class of cases, it does this automatically all over the place.

@lushzero you are totally right that garbage collected runtimes have "boom and bust". It's the same in the JVM. However, this is a fundamental "limitation" of node.js -- I don't see how you can get around it. You can call free() on the libgit2 objects -- but when you pass your JSON hashes as strings in https://github.com/DerpKitten/node-sgit/blob/master/src/sgit.cc#L215

        Local<Value> argv[2] = { Local<Value>::New(String::New(err)), Local<String>::New(String::New(stmp.c_str())) };

-- those strings need to be gc'd by v8, and if the user parses those strings with JSON.parse() -- well that too needs to be GC'd by v8. If you're going to pass data to javascript land, you're going to produce garbage.

Another thing I'm unclear about is the 'async' issues you refer to, @lushzero. As you know, there is no 'locking' or 'threading' in javascript -- this only applies to C++ bindings. libgit2 does blocking IO, and thus many functions need to be made async (i.e., pass the baton to the libuv threadpool). They need to be async because node.js is a single-threaded event loop; if you do blocking IO you will cause disastrous performance. Currently, sgit lacks the thread scheduling code that would be necessary for it to be production-ready, but I'm sure it's easy for you to add.

As a philosophical note, I'm interested in providing the majority of libgit2's functionality. I am building a library, like libgit2 itself, where, as @sam said,

"The "Goldilocks" level of abstraction for libgit2 is to provide the building blocks you'd need to implement a full git client (command-line, visual, or otherwise), without actually being opinionated on how that client is structured."

This is not possible with a few dozen functions. This requires several dozen functions. I understand why you think you don't need this, but if you were to actually consider all the things that the git command line tool can do (like rebasing, merging with different strategies, creating tags, manipulating the index, making commits without an index, and on and on) you would realize you need a lot of functionality! The libgit2 people aren't just adding features for no good reason. True, 20% of the libgit2 functions are C-idiomatic, like allocfmt, but 80% are useful in javascript land.

I'm also curious why you said, @lushzero ,

My mind is made up at this point

Isn't it a bit premature to have already made up your mind when you (like me!) are still learning about memory management and concurrency in node.js, when your preliminary valgrind tests may not be accurate for reasons pointed out by @jmendeth, and when you haven't seen whether it takes a long time to fix memory leaks and bugs in generated code or not? Maybe it'll take forever! Maybe it'll take me a day or two. Are you sure you have enough information to make up your mind?

In conclusion, I'd like to share some valgrind data. Check out this gist: https://gist.github.com/nkallen/6430565 -- it provides a repeatable valgrind test case. You'll note, after 20 runs:

==26064== LEAK SUMMARY:
==26064==    definitely lost: 1,216 bytes in 48 blocks
==26064==    indirectly lost: 8,192 bytes in 1 blocks

This isn't so bad, is it? And as far as I can tell, none of these "leaks" have to do with nodegit. They're internal caches of libgit2 and objects internal to node.js.

But I did fix a couple memory leaks today! Check out nodegit/nodegit@7f9f3c4 and nodegit/nodegit@79f1115

I think the thing to take away from these commits is that fixing memory leaks is not only fairly easy, but it can actually be done systematically: fix it in one place, fix it in many places.

@lushzero

lushzero commented Sep 3, 2013

@jmendeth I don't want to sound like an expert on V8 memory management, because I'm definitely not. Also, the terminology here becomes a complete mess with a word like "reference", which means one thing in C, another in C++, and another in V8. My point about "boom and bust" may not apply to node 0.10, as incremental deallocation seems to be employed in some circumstances now as of 0.10, or maybe even 0.8.

In short: closures, wrapped objects, and failure to release memory allocated inside the addon all make it really hard, and in many cases impossible, to depend on the GC for deallocation.

Valgrind is not the be-all and end-all here; it sounds like the tool I pointed to in my last post is much more of a gold standard for node. Valgrind just tells me enough to say that there appear to be some very difficult problems in the codegen approach with regard to memory management.

Does that mean that I should for example have a free method on Repository objects instead of relying on GC to notify me when they've been removed?

If you aren't calling the various git_repository_free(), git__free(), etc., then definitely yes, those have to be called somewhere. Resources don't deallocate themselves, GC or not. Having a manual free method on the JS side is possible but very un-node-like. This is my point about having to create almost a -- this is not really an appropriate term but I lack a better one -- "virtual GC" in the addon code. I guarantee that will take more time and energy than simply coding a dozen or two API functions in C++.

@nkallen Move the for loop to encompass the git.Repo.open call and move the global.gc call to run at the end of each loop, and that will match my numbers. The gc call doesn't make much difference. I tried both the regular and wip branches; wip was better, but not by a whole lot. It's just my opinion, but I would posit that the quote "...fixing memory leaks is not only fairly easy, but it can actually be done systematically: fix it in one place, fix it in many places." will haunt you to the ends of the earth ;) I recall saying something like that about a project a long time ago, and it haunts me still.

My mind is made up in that I need to have something passable soonish, and I can easily see chasing edge cases for a long time. Yes, the benefit of codegen is that you "fix it in one place, fix it in many places", but then you also need to start branching out to address edge cases, and pretty soon the code complexity is through the roof.

If libgit2 were a v1, final, well-established library and you said "let's use codegen once and patch on top of the result", I don't really see a problem with that. But libgit2 is due for plenty of changes itself, and I fear how that is going to mesh with a codegen approach. You said "... but 80% are useful in javascript land." regarding the libgit2 API calls. I think that is madness; I can see maybe 3-5% being reasonable to expose to JS/Node. Imagine if zlib exposed its 90-odd API calls instead of the two dozen it does in node. That just doesn't fundamentally make sense to me; Node/JS is a very high-level language.

We can agree to disagree about the level of abstraction. The command-line git has 22 options including help; everything else is parameterized. That would be a rough approximation of the maximum API I would be looking at in sgit: mostly flat, with parameterized options.

@sam

sam commented Sep 3, 2013

I think you meant another @sam.

-Sam


@nkallen

nkallen commented Sep 4, 2013

@lushzero Here is the valgrind output for the loop moved outside the opening of the Repo, with gc in the same place -- where it should be: https://gist.github.com/nkallen/6431776 As you can see, there is essentially no change. No memory leaks.

==27613== definitely lost: 1,216 bytes in 48 blocks
==27613== indirectly lost: 8,320 bytes in 2 blocks

As far as I can tell, you have made a mistake in your profile. At this point, you must admit that nodegit/wip is not leaking memory for the git log use case -- or, if not, please provide a repeatable test case in a gist so we can evaluate what's going on.

Again, I don't really understand your arguments against codegen. You seem to be bringing irrelevant considerations into the discussion. What do closures have to do with codegen versus non-codegen? What does freeing locally malloc'd objects have to do with codegen? Everything codegen-malloc'd is codegen-free'd; it's no different from hand-written code, where anything hand-malloc'd is hand-free'd.

And you've provided no evidence nor arguments for why wrapped objects don't work. They are garbage-collected just as every other object in the JS world is; they are in fact a standard way to do library bindings in the node.js world.

I'm not sure what your "haunting" experience has to do with this case. You seem to be promulgating fear rather than an argument.

@lushzero

lushzero commented Sep 4, 2013

@nkallen Here is something substantially similar to the original code, which I did not save. Rerunning it just now produces results very close to what I posted earlier. https://gist.github.com/lushzero/6432272

The exact command line I used last was: valgrind --leak-check=full --show-reachable=yes node --nouse_idle_notification --expose_gc gist.js > log2 2>&1

I feel I've adequately expressed my concerns regarding the codegen approach and why I will be pursuing something else. You seem to be taking this personally, and that is in no way my intention. I'm essentially betting that the headaches from upstream libgit2 changes and chasing a variety of edge cases, including memory management in the codegen case, will be a lot more work than manually coding a flat set of self-contained API calls, that's it. Also, we don't seem to agree at all on what the user-facing API should look like; fair enough.

In the simple example I tested, which reflects one small element of my immediate use case, nodegit leaked between 100K and 2.5MB depending on the branch and other parameters (such as using the non-wip branch README API example code) over a basic 50-loop cycle of the script. "Leaked" might as well be substituted with "used" for my purposes, depending on the resolution or the need to depend on the GC. I may not have been choosing the optimal example code, but I went with what was posted in the nodegit READMEs, which is why I chose that example to begin with.

I'm not saying that it can't be fixed -- I see you did commit several patches for memory management already today -- I'm just saying that, amongst the other reasons I've noted above, I don't see nodegit as a workable solution for my needs. Node-gitteh also is not a good fit: the binding code is a little clumsy, it lacks needed functionality, and it seems not to have been vetted for memory leaks or consumption/use either. Finally, let me say I started yesterday hoping that nodegit would be a good fit for my needs, but my own testing, however flawed, led me elsewhere.

If you are insisting that there aren't any memory issues in nodegit as it stands, so be it; everyone can run their own benchmarks and decide for themselves, which is what I suggested in the first post in this thread.

Valgrind appears not to be a good tool for hunting down and resolving memory leaks in a node application; I used it because it was handy. Most people seem to think https://github.com/lloyd/node-memwatch is the best one. I feel as if this thread has gone wildly off track, and a good deal of that is my fault, for which I apologize. I would suggest that further discussion of memory specifics and benchmarking be moved to a thread on the nodegit issues list or a new thread in this forum.

@nkallen
Copy link

nkallen commented Sep 4, 2013

I apologize for taking this personally. I hadn't had enough to eat and my temper was acting up.

FWIW, there is a bug in your benchmark, and so you are being misled by your data. You have the following code:

for (var i = 0; i < 50; i++) {
  git.Repo.open('/root/lushzero/node-sgit/.git', function(error, repository) {
    ...
  });
  global.gc();  // runs synchronously, before the open callback ever fires
}

The place you put the global.gc() is incorrect. git.Repo.open, etc., is asynchronous, meaning global.gc() runs 50 times BEFORE you even open the repo. This is why you're seeing what looks like a memory leak. But it isn't a memory leak. If you were to do this:

process.on('exit', global.gc)

You'd ensure that the garbage collector runs just before the process exits (and thus just before valgrind takes its leak summary), which is the only way to get accurate data out of valgrind.
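For illustration, the corrected benchmark would be shaped roughly like this (a minimal sketch, assuming node is run with --expose_gc as in your command line, and that the module is required as git as in your gist):

var git = require('nodegit');

for (var i = 0; i < 50; i++) {
  git.Repo.open('/root/lushzero/node-sgit/.git', function(error, repository) {
    // ... walk history, etc.
  });
}

// Fires once, after the event loop has drained and every callback above
// has actually run, so valgrind only sees memory that is genuinely unreachable.
process.on('exit', global.gc);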

For future benchmarks, please skip any versions of nodegit other than the wip branch. wip is a complete rewrite; it is the only version that uses codegen. Older versions of nodegit are most similar to node-gitteh in style. And finally, if this assuages any of your concerns, the wip codegen is generated from a slightly modified version of libgit2's documentation. That is, it takes a modified version of libgit2's own docurium specification of the API and generates code from that (see https://github.com/nodegit/nodegit/blob/wip/v0.18.0.json ). This approach, you will note, is resilient to dramatic API changes.
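To make the idea concrete, here is a purely illustrative sketch of spec-driven generation -- not nodegit's actual generator; the descriptor field names and the emitWrapper helper are invented:

var fs = require('fs');

// Load the docurium-style API specification.
var spec = JSON.parse(fs.readFileSync('v0.18.0.json', 'utf8'));

// Emit one wrapper per function descriptor. When upstream libgit2
// changes, you regenerate from the new spec instead of patching
// hand-written bindings.
Object.keys(spec.functions || {}).forEach(function(name) {
  console.log(emitWrapper(name, spec.functions[name]));
});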

@mildsunrise
Copy link
Contributor

@nkallen @lushzero

So I coded a little program that will asynchronously run that git log emulation test on the latest wip nodegit. It'll use global.gc and memwatch to take snapshot diffs and display them beautifully on your console. You'll be able to see exactly what type is leaking. It has a variable number of iterations.

You should totally try it out: https://gist.github.com/jmendeth/6434819

In my test, the repository is only opened one time. Everything else is repeated. I think I'll change it to constantly open and close the repository, but I think this is a good place to start.
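The core of the technique looks roughly like this (a sketch, not the actual gist; walkHistory is a stand-in for one pass of the git log emulation):

var memwatch = require('memwatch');

function run(iterations, done) {
  var hd = new memwatch.HeapDiff();           // snapshot "before"
  (function next(remaining) {
    if (remaining === 0) {
      memwatch.gc();                          // full GC before the "after" snapshot
      console.log(hd.end().change.details);   // per-type allocation balance
      return done();
    }
    walkHistory(function() { next(remaining - 1); });
  })(iterations);
}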

My results

I have tried putting console.logs around to make sure they are looping, and also tried putting in timeouts so that the GC can clean everything up, but the results are always the same:

  1. Logically, the number of nodes allocated before is always the same.

  2. At 50 iterations (default), the balance is negative (there's less memory
    after the test than before). The details don't list any nodegit types, which
    means all the classes Repo, RevWalk, History, EventEmitter are
    correctly deallocated. Only generic types like Code, Array, Object, appear.

  3. Exactly the same at 100 iterations. "Exactly the same" means the output
    is exactly. the. same. in both cases. Same numbers, same types.

  4. At 500 iterations, we see some nodegit-specific types leaking:

     Commit: +6 (which is the number of commits in the repo)
     EventEmitter: +1
     Oid: +1
     Repo: +1 (yeah, that makes sense)
     RevWalk: +1
    
  5. At 5000, the results are exactly the same as in 500.

You can see the results here: https://gist.github.com/jmendeth/6440106

@nkallen
Copy link

nkallen commented Sep 4, 2013

@jmendeth -- This script is awesome though, thanks! I'd love it if we can continue this discussion over here nodegit/nodegit#84 . As @lushzero rightly mentioned, at this level of technical detail we are derailing the broader discussion.

Some quick observations:

  1. memwatch detects leaks in js code only; it won't detect missing frees in the C++ code. Valgrind is still the only way to do that AFAIK.
  2. You can tell in advance there is no leak because the leftover objects aren't O(N), where N is the number of iterations. The 6 objects must be phantoms.
  3. For the leftover 6 Commit objects, this is simply due to the fact that v8 does partial garbage-collections. https://developers.google.com/v8/design?csw=1#garb_coll:

processes only part of the object heap in most garbage collection cycles. This minimizes the impact of stopping the application.

I don't know how to force a full gc, but if you incrementally call gc() as in https://gist.github.com/nkallen/6442106#file-gistfile1-js-L4 , even for 5000 runs there are no nodegit objects left over.
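In other words, something like this sketch (walkHistory again standing in for one pass of the emulation):

(function next(remaining) {
  if (remaining === 0) return;
  walkHistory(function() {
    global.gc();              // one (possibly partial) collection per iteration
    next(remaining - 1);
  });
})(5000);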

@mildsunrise
Copy link
Contributor

it won't detect missing frees in the C++ code. Valgrind is still the only way to do that AFAIK.

Yes. :(

this is simply due to the fact that v8 does partial garbage-collections.

Hmm, I thought that global.gc() would force a full garbage-collection;
if that's not true, maybe it's a better idea to use memwatch's own GC:

To aid in speedier debugging, memwatch provides a gc() method to force V8 to do a full GC and heap compaction.
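Swapping it into the test script would be a one-liner:

var memwatch = require('memwatch');
memwatch.gc(); // full GC plus heap compaction, unlike V8's partial cycles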

@FrozenCow
Copy link

@lushzero, before we can use sgit, it first needs to be fully asynchronous. If it is not, weird behavior will crop up, and comparing it to other libraries wouldn't be fair (whether we're comparing memory, CPU, or complexity does not matter). Or am I still missing something? Is sgit already async? I was under the impression that v8 objects couldn't be used outside of v8's main thread, so I presumed it was not.

Anyway, I've dabbled a bit with nodegit. It seems to work pretty well, although I have found some minor issues (documentation slightly wrong and one use case not correctly implemented), but all in all I'm pretty impressed. I like that it's just a 1-to-1 mapping of libgit2, which means you can just look up libgit2's documentation if you want details about a particular function. It might also be a good idea to include those mappings in the documentation where possible.

I haven't used all the functions that I used for gitteh yet, but for now I'm very positive about the current work. Compared to gitteh it already has more up-to-date documentation, it is easier to find the function you need, and it was easy to find the function that was causing a bug and fix it. This is mostly due to the 1-to-1 mapping and the fact that there is documentation (although still minimal and wrong in places).

@mildsunrise
Copy link
Contributor

Or am I still missing something? Is sgit already async? I was under the impression that v8 objects couldn't be used outside of v8's main thread, so I presumed it was not.

No, no, you're right. The functions don't return until all the LibGit2 operations (which are blocking) have been done.
LibGit2 is synchronous.

There are many potentially dangerous things done in SGit. Just to list an example:

const git_oid *oid;
{...}
oid = git_commit_oid(wcommit);    /* borrowed const pointer into libgit2's commit data */
{...}
git_oid *w_oid = (git_oid*) oid;  /* cast strips the const; the memory is now writable */

The first lines save the OID of a commit, which is a pointer to a const git_oid. The last line casts that pointer, stripping out the const, which is nasty. Then it modifies the memory oid points to through the new pointer.

LibGit2 gives you const pointers because you are expected not to modify the memory they point to! oid points into the area where LibGit2 holds the commit's info, and that area is expected to always contain the commit's OID. If you change that area, LibGit2 will think the commit suddenly has another OID, which will break caching, produce crashes if you call certain other functions on that commit, and cause other nasty things!


Another bad idea is that git_log outputs this:

{
  'OID of the tree of first commit': {
    'author': '...',
    'date': '...',
    'message': '...',
    '...': '...'
  },
  'OID of the tree of second commit': {
    'author': '...',
    'date': '...',
    'message': '...',
    '...': '...'
  },
  '...': '...'
}

This doesn't allow iterating over the commits in order; since keys in a hash don't have a concrete order, the original order in which the commits were walked is lost.

It's not useful to put the OIDs of the trees as hash keys, because what uniquely identifies a commit object is the commit's OID! I want to list commits, not trees. Also, since two commits can share the same tree (which is what happens when you do a revert: you get the same tree as in the earlier commit), setting both values on the same key will result in one of the two commits being lost.

@mildsunrise
Copy link
Contributor

This is the particular problem with sgit: the API is too concrete. We need more abstract APIs like the ones provided in Gitteh or nodegit. If SGit had a Commit type in its API, for example, the above problems wouldn't exist, since git_log would return an array of Commit objects (see the sketch below). The same object type would also be used, e.g., when committing, resulting in a modular API.
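For example, git_log could then return something shaped like this (field names illustrative, not sgit's actual output):

[
  { oid: 'OID of first commit',  author: '...', date: '...', message: '...' },
  { oid: 'OID of second commit', author: '...', date: '...', message: '...' }
]

An array preserves the walk order, and two commits that share a tree (as after a revert) can no longer collide on the same key.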

@lushzero
Copy link

lushzero commented Sep 4, 2013

@jmendeth you raise a valid bug regarding the keys being the tree OIDs rather than the commit OIDs. Sorting is intended to be a parameter, just as it is in the git CLI. The pointer cast you mention is ugly, but not an immediate cause for concern as the function is self-contained; it should be cleaned up though. Sgit is definitely 0.1 -- I'm just starting to clean up some things now that I'm committed to the project: removing the dependency on boost, getting travis builds to work, unit tests, documentation, etc. The code itself has had less than a half-day of coding on it.

I think there are two very different approaches here. I absolutely do not want to expose the libgit2 API into JS space. I think something like the sqlite addon is an outstanding example to work towards: it's almost automagical, but also very fast and memory efficient. Having a large list of 1-to-1 C API mappings just isn't nodelike to me, and I don't like the memory "leaking/consumption" and context-switching problems it raises. Try creating hundreds of repos, each with a thousand commits, and look at memory (which is very close to my project's needs) and performance with any of the options here.

The only way nodegit could work for me would be to manually call the GC or implement additional cleanup in the runtime of the addon, and there are numerous problems with both approaches. My scenario involves creating lots of repositories, creating lots of commits, and then doing data analysis of those. I'm focusing on an API that is very close to the git command line, both for ease of use and ease of documentation. That lends itself to about 20 main API functions and parameterized options arrays.

@FrozenCow No doubt async support needs to be implemented. Having said that, though, there are a couple of different ways to do it, each with various tradeoffs, and I haven't determined what the best approach is yet. I might suggest trying it as is, so that you can at least see what the impact looks like for your project. I see a lot of comment about the async issues but little data, and I need to see more data to make a good decision on how to implement it best. For the scenario I have, the impact of sync is much less than I would have thought for the currently supported operations; if you're on NFS it's going to be potentially catastrophic. I would be very interested in feedback on semi-realworld cases to implement as benchmarks and tests to see the impact of various tradeoffs in implementing async support.

My next steps are to implement object gets, resolve the tree OID issue jmendeth brought up, and implement git_index_add_all; at that point I think I can construct a good benchmark that will inform the best means of implementing async.

@nkallen
Copy link

nkallen commented Sep 5, 2013

Look up disk seek times (http://julianhyde.blogspot.com/2010/11/numbers-everyone-should-know.html) and you will have your answer.

@FrozenCow
Copy link

@lushzero Doing things async in node shouldn't be a discussion. Say you're creating an http server in node that does things with git. Now in that same script you want to clone a repository like the Linux kernel. Cloning a repo of that scale takes a while, say a couple of minutes. If cloning is done synchronously your server is not responding, since no other JavaScript code can be run on v8's thread while you're using that thread for cloning.

Cloning is just an example that takes a long time, but any operation will suspend the whole node process if done synchronously. That's why the gitteh and nodegit implementations are so much more complex compared to sgit: they need to do everything in a separate thread.
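A contrived sketch of the problem (cloneSync is hypothetical; no binding discussed here actually exposes such a call):

var http = require('http');

http.createServer(function(req, res) {
  res.end('ok');
}).listen(8080);

// If this call is synchronous, V8's only JavaScript thread is stuck here
// for the duration of the clone: the server above cannot answer a single
// request until it returns.
cloneSync('git://github.com/torvalds/linux.git', '/tmp/linux');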

@lushzero
Copy link

lushzero commented Sep 5, 2013

@FrozenCow Just to clarify, I'm not debating async versus not async; I'm debating the precise implementation for async: what level of abstraction is employed, which API functions it wraps, the number of threads used, how the libgit2 threads interplay, etc. From a cursory review, both gitteh and nodegit seem to use approximately the same abstraction regarding async, and it's very simple. I just started experimenting and benchmarking for sgit async this afternoon.

@nkallen
Copy link

nkallen commented Sep 6, 2013

Just an update for everyone following along. Enough people are productively using the nodegit wip branch without segfaults or noticeable memory leaks that I've merged it to master. Furthermore, the documentation is now all up to date -- it correctly merges the "raw" and "convenience" layers together. Feel free to use master as the npm artifact for now.

I will publish a nodegit 0.2.0 release in a week or so if it seems like the code is stable enough.

@lushzero
Copy link

lushzero commented Sep 9, 2013

I updated sgit to operate asynchronously and better reflect git CLI naming conventions. Best of luck to you all.

@walterheck
Copy link

Just wondering, 8 months later: I'm about to start a project that needs git, and I'm having a rough time figuring out which of these projects to use. Has the view on this changed since 8 months back?

@samcday
Copy link
Contributor Author

samcday commented May 5, 2014

@walterheck this project is still unmaintained at this point. nodegit might be a better approach.

@walterheck
Copy link

Had a feeling yeah, already started looking at nodegit here:
http://www.nodegit.org/

