Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use JGit in favor of shelling to git #16

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

dwhjames
Copy link

@dwhjames dwhjames commented Oct 6, 2013

This pull request replaces the use of Runtime.exec to invoke git, with the use of the JGit library. This appears to result in much faster import and analysis times. Aside from the new functions, the major change to datomic.codeq.core is the need to pass around an instance of org.eclipse.jgit.lib.Repository.

- replace all git execution and parsing with use of the (JGit)[http://eclipse.org/jgit] library.
- much faster import and analysis!
@dwhjames
Copy link
Author

dwhjames commented Oct 7, 2013

Apologies, a mistake fell through the net, corrected in 840bddd.

- replace shallow-tree-walk with deep-tree-walk
- invert control of tree walker: rather than the recursive helper function in commit-tx-data, pass a walker function to deep-tree-walk
- a tree walker signals if the path is new, and if not, the deep-tree-walk will skip over uninteresting subtrees
- add index to :node/object and make check for existing node much more efficient
- fix logic for testing if path is new
- make sure future is deref’ed when transacting a commit’s tx data
@dwhjames
Copy link
Author

I believe I’ve discovered and corrected a bug in c1977d3.

The code originates in 1d862fc.

Here is the code from the current master.

newpath (or (tempid? pathid) (tempid? nodeid)
            (not (ffirst (d/q '[:find ?node :in $ ?path
                                :where [?node :node/paths ?path]]
                              db pathid))))

I believe the logic is intended to determine if a path is new with respect to a tree node.

The path is new for the node if…

  1. path id is new
  2. path exists and node id is new
  3. path and node exist, but … “the node does not already have this path”

I’m a little thrown by the intention of (not (ffirst …)), but to me this expression says that “there does not exist at least one node that has the given path”.

Now, suppose I copy a file a.txt from a directory d1 in my source tree to another one d2, keeping the file name and contents the same. Now also suppose that my copy operation replaced a file that already existed d2/a.txt.

In this scenario, we are in the third case as the d2/a.txt path already existed, and the node already exists as the a.txt file was unmodified when it was copied. However, the query above will return the node for the overwritten file, and thus newpath will be incorrect computed as false.

In c1977d3, I’ve replaced the code above with this:

newpath (or (tempid? pathid) (tempid? nodeid)
            (every? #(not= nodeid (:e %))
                    (d/datoms db :vaet pathid :node/paths)))

Which in the third case is supposed to express that every node with the given path is distinct from the given node.

I’d be happy to hear any further insight on this.

no need to put an extra index on :node/object, just use the reverse index
- test a few scenarios to exhibit interesting corner cases
- utilities to perform git operations
- utilities to query and introspect the codeq db
@dwhjames
Copy link
Author

I’ve developed some tests in a9ab20d. In the process I’ve hit some interesting edge cases…

In test-mix-of-file-ops, I’ve found that codeq’s model behaves in ways one might not immediately expect if one is comparing it to browsing a change log—in particular, when reverting commits or files.

Given a base commit c1, if it is followed by c2 and only to be reverted by c3, then c3 appears not so much as the inverse of c2, but as ‘identical’ to c1. After all, their trees are identical. Thus, when searching for commits via objects, file names / paths, you’ll always get c1 and c3 together.

Furthermore, if you revert a file as part of a commit, to a state in history, then obviously the ids for the object, file name, file path, and node all exist, and so that file will not show up as part of the commit in codeq.

The contrast between these two cases of revert operations is interesting. I suppose one just has to be careful interpreting a query for commits from blobs or codeqs in the presence of reverts.

In test-path-trickery, I’ve found that ‘weird’ things start to happen when you have more than one file with the same name and contents on different paths. Given the motivation of codeq, it makes sense to be content focused and thus tree nodes are uniquely determined by the pairing of object sha and file name (or path segment). This just means that query results for finding commits by file paths can become a little unexpected in this extreme case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant