Using Askgit
SQL is so nice
- tags
- git
- sql
- tools
Contents
askgit provides a sql interface to your git repository. Let's install it and see what we can figure out about the repo.
Installing
Following the instructions on the website, we can build the go binary with:
|
|
This will download the package, the dependencies, and compile
everything into an executable. If you can't figure out where it's
installed, check that you have GOPATH
set correctly, and it will end
up in $GOPATH/bin
. And make sure that it's in your path by checking
which askgit
.
Get a test repository
I'm going to look at the ruby-git
repo so you can follow along.
|
|
Commits
The basic unit that git works on is a commit, which is a series of file changes applied to a parent, or multiple parents in the case of a merge. Let's do some basic analysis of what's in a repo, how long it's been worked on, and who has been working on it.
How old is the repo?
Let's find the earliest timestamps in the repo. For most workflows
these dates will be similar, but if the project shares patches over
email (like linux or postgres or anything on sr.ht) the author_when
is
when the code was written (according to the author) and the
committer_when
is when it was added to the repo.
|
|
First commit
And here's who did it:
|
|
Biggest commit
Which commits added the most code?
|
|
We can see that these are at all the beginning of the project.
Biggest cleanup
Which commit removed the most code?
|
|
Again most of these are at the beginning of the project, through there is a big 2014 cleanup that is probably a refactor of some kind.
How many commits do we have?
We've all seen this query before:
|
|
How about most frequent multi-committer?
Lets group the commits by author_name
and sort by the most committers.
It looks like somewhere down the line scott Chacon
changed to Scott
Chacon
so he's counted twice. Still, it looks like he's been doing a
lot.
|
|
Commits by month
We can look at the activity of the project over time using the sqlite date functions. We can see that there's a lot of activity in the beginning and it trails off.
|
|
+---------+---------------+----------------+----------------+ | MONTH | TOTAL COMMITS | SUM(ADDITIONS) | SUM(DELETIONS) | +---------+---------------+----------------+----------------+ | 2007-11 | 78 | 27597 | 6204 | | 2008-01 | 3 | 5 | 5 | | 2008-03 | 12 | 279 | 1101 | | 2008-05 | 10 | 180 | 34 | | 2008-07 | 4 | 33 | 6 | | 2008-08 | 1 | 2 | 2 | | 2008-10 | 1 | 1 | 1 | | 2008-11 | 1 | 1 | 1 | | 2008-12 | 6 | 180 | 15039 | | 2009-02 | 16 | 202 | 216 | | 2009-03 | 2 | 9 | 2 | | 2009-04 | 1 | 1 | 1 | | 2009-05 | 2 | 115 | 16 | | 2009-08 | 13 | 168 | 97 | | 2009-09 | 4 | 10 | 6 | | 2009-10 | 6 | 11 | 11 | | 2011-08 | 1 | 1 | 1 | | 2011-09 | 1 | 0 | 0 | | 2011-10 | 1 | 251 | 240 | | 2011-12 | 1 | 30 | 1 | | 2012-03 | 1 | 1 | 0 | | 2012-04 | 1 | 1 | 1 | | 2013-01 | 1 | 6 | 4 | | 2013-04 | 45 | 704 | 388 | | 2013-05 | 3 | 8 | 5 | | 2013-06 | 1 | 9 | 0 | | 2013-08 | 17 | 184 | 41 | | 2013-09 | 1 | 8 | 8 | | 2013-12 | 2 | 4 | 4 | | 2014-01 | 26 | 254 | 96 | | 2014-02 | 5 | 8 | 753 | | 2014-05 | 1 | 2 | 2 | | 2014-06 | 4 | 14 | 2 | | 2014-07 | 6 | 44 | 35 | | 2014-08 | 4 | 4 | 4 | | 2014-09 | 5 | 109 | 54 | | 2014-10 | 1 | 20 | 0 | | 2014-11 | 4 | 10 | 4 | | 2014-12 | 11 | 244 | 60 | | 2015-01 | 36 | 446 | 216 | | 2015-10 | 1 | 1 | 0 | | 2016-02 | 12 | 216 | 196 | | 2018-01 | 2 | 40 | 42 | | 2018-02 | 1 | 17 | 17 | | 2018-03 | 4 | 233 | 64 | | 2018-04 | 1 | 8 | 4 | | 2018-05 | 4 | 81 | 35 | | 2018-06 | 5 | 192 | 19 | | 2018-07 | 4 | 37 | 9 | | 2018-08 | 5 | 64 | 15 | | 2018-10 | 1 | 169 | 0 | | 2019-02 | 1 | 65 | 45 | | 2019-09 | 1 | 8 | 2 | | 2019-12 | 1 | 763 | 17 | | 2020-01 | 6 | 172 | 56 | | 2020-02 | 5 | 15 | 10 | | 2020-04 | 9 | 63 | 9 | +---------+---------------+----------------+----------------+
Looking at this you can see that nothing happened in 2010 and 2017, a whole bunch of stuff was removed on 2008-11, and really there's only a fairly small amount of work done after the initial flurry of implementation.
Tags
Tags can represent releases. Presumably the project is at a point where you want to mark it in some way, and we'll use that to look at the life cycle of what's been going on.
Get a list of tags
Here we can see a list of tags for the project.
|
|
When target_type
is NULL
it means that it's a lightweight tag and
doesn't have a commit associated with it. Most of the time people
will have a commit, and since a commit has a date it makes it easier for
us to associate with.
Finding out who did the actual tagging and when
Here we can find out who did the tagging and what the message was.
|
|
Looking at the dates of this, we can see that Scott Chacon started the project in 2008, Roberto Decurnex took over from 2013-2016, Per Lundberg in 2018, and finally James Couball did a release in 2020.
Looking at the notes it seems like it's the world around ruby-git
that is changing (Travis, ruby very support, etc.).
Figuring out start and end dates
We can ask all of the questions above based upon the timestamps of when a tag was created. This will give us an idea of what's going on in the project during that time. Let's create a simple CSV file that will make it easier to ask these questions.
|
|
Then we can make this a little simpler to loop over, where we have a start time and end time for each of the tags. We start with the initial commit as the start time, and then use the previous tag time going forward.
|
|
Tag stats
We can loop over this file and pull out some stats for what went into each tag.
|
|
refs/tags/1.0.5.1 0 +-------+-------------+ | COUNT | AUTHOR NAME | +-------+-------------+ +-------+-------------+ refs/tags/v1.0.7 10 +-------+-----------------+ | COUNT | AUTHOR NAME | +-------+-----------------+ | 6 | Scott Chacon | | 3 | Jorge Bernal | | 1 | Michael Siebert | +-------+-----------------+ refs/tags/v1.2.6 128 +-------+--------------------+ | COUNT | AUTHOR NAME | +-------+--------------------+ | 54 | robertodecurnex | | 20 | Joshua Nichols | | 8 | Daniel Mendler | | 5 | Scott Chacon | | 4 | elliottcable | | 4 | Roberto Decurnex | | 4 | Josh Nichols | | 4 | James Rosen | | 3 | Yuichi Tateno | | 3 | Michael Mallete | | 2 | Jonathan Rudenberg | | 2 | Joe Moore | | 2 | Gianni Chiappetta | | 1 | schneems | | 1 | schacon | | 1 | devyn | | 1 | Zachery Hostens | | 1 | TJ Biddle | | 1 | TIT | | 1 | Steve Schwartz | | 1 | Nathan Fixler | | 1 | Michael Hale | | 1 | Ken Pratt | | 1 | Gabriel Cardoso | | 1 | Chad Rosen | | 1 | Bryan Larsen | +-------+--------------------+ refs/tags/v1.2.7 39 +-------+------------------+ | COUNT | AUTHOR NAME | +-------+------------------+ | 22 | robertodecurnex | | 9 | Roberto Decurnex | | 2 | Cameron Walsh | | 1 | OHTAKE Tomohiro | | 1 | Guy Hughes | | 1 | David Underwood | | 1 | Ben Sedat | | 1 | Arnvald | | 1 | Andy Schrage | +-------+------------------+ refs/tags/v1.2.8 8 +-------+------------------+ | COUNT | AUTHOR NAME | +-------+------------------+ | 3 | Roberto Decurnex | | 2 | robertodecurnex | | 2 | Yuya.Nishida | | 1 | Xavier Calland | +-------+------------------+ refs/tags/v1.2.9 56 +-------+------------------------+ | COUNT | AUTHOR NAME | +-------+------------------------+ | 46 | Roberto Decurnex | | 4 | Kelly Stannard | | 1 | mwrock | | 1 | Harald Sitter | | 1 | Francis Luong (Franco) | | 1 | Dave Lane | | 1 | Brian Stevens | | 1 | Adam Hooper | +-------+------------------------+ refs/tags/v1.2.9.1 4 +-------+------------------+ | COUNT | AUTHOR NAME | +-------+------------------+ | 3 | Roberto Decurnex | | 1 | Eric Tsiliacos | +-------+------------------+ refs/tags/v1.2.10 12 +-------+------------------+ | COUNT | AUTHOR NAME | +-------+------------------+ | 10 | Roberto Decurnex | | 1 | David Vartanian | | 1 | Christian Paling | +-------+------------------+ refs/tags/v1.3.0 2 +-------+------------------+ | COUNT | AUTHOR NAME | +-------+------------------+ | 2 | Roberto Decurnex | +-------+------------------+ refs/tags/v1.4.0 12 +-------+-------------------------+ | COUNT | AUTHOR NAME | +-------+-------------------------+ | 6 | Per Lundberg | | 1 | Vern Burton | | 1 | Tom Potts | | 1 | Richard Vodden | | 1 | Rafael Reggiani Manzo | | 1 | Guilherme Maluf Balzana | | 1 | Alexander Maslov | +-------+-------------------------+ refs/tags/v1.5.0 12 +-------+---------------------+ | COUNT | AUTHOR NAME | +-------+---------------------+ | 3 | Vern Burton | | 2 | Per Lundberg | | 2 | Joshua Liebowitz | | 1 | Taichiro Suzuki | | 1 | Matias Garcia Isaia | | 1 | Kody | | 1 | Kelly Stannard | | 1 | Denis Defreyne | +-------+---------------------+ refs/tags/v1.7.0 26 +-------+--------------------+ | COUNT | AUTHOR NAME | +-------+--------------------+ | 9 | James Couball | | 2 | Yuta Harima | | 1 | cyclotron3k | | 1 | Vern Burton | | 1 | TIT | | 1 | Stephen Paul Weber | | 1 | Salim Afiune | | 1 | Per Lundberg | | 1 | Michael Camilleri | | 1 | Marcel Hoyer | | 1 | Jonas Müller | | 1 | James Bunch | | 1 | Harald | | 1 | Evgenii Pecherkin | | 1 | Antonio Terceiro | | 1 | Alex Mayer | | 1 | Agora Security | +-------+--------------------+
Files
We can also look at the files inside of the commits. askgit
does have
a file
table, but that shows the repository state at as specific
commit, and thus is less useful for actually tracking changes that
happened in a commit, or between commits.
But there's no reason why we can't use git diff-tree
to find the
information.
Files modified with a specific commit
We can see which files were changed in the latest commit:
|
|
CHANGELOG.md lib/git/version.rb
Which, at the time of this writing, is the same as:
|
|
CHANGELOG.md lib/git/version.rb
Files modified by a tag
We can use the ..
way of specifying commit ranges, so let's look to
see which files have changed between v1.5.0
and v1.7.0
.
|
|
.github/stale.yml .jrubyrc .travis.yml CHANGELOG.md CONTRIBUTING.md MAINTAINERS.md PULL_REQUEST_TEMPLATE.md README.md RELEASING.md git.gemspec lib/git.rb lib/git/base/factory.rb lib/git/branch.rb lib/git/config.rb lib/git/diff.rb lib/git/lib.rb lib/git/version.rb tests/files/encoding/dot_git/COMMIT_EDITMSG tests/files/encoding/dot_git/HEAD tests/files/encoding/dot_git/config tests/files/encoding/dot_git/description tests/files/encoding/dot_git/hooks/applypatch-msg.sample tests/files/encoding/dot_git/hooks/commit-msg.sample tests/files/encoding/dot_git/hooks/fsmonitor-watchman.sample tests/files/encoding/dot_git/hooks/post-update.sample tests/files/encoding/dot_git/hooks/pre-applypatch.sample tests/files/encoding/dot_git/hooks/pre-commit.sample tests/files/encoding/dot_git/hooks/pre-push.sample tests/files/encoding/dot_git/hooks/pre-rebase.sample tests/files/encoding/dot_git/hooks/pre-receive.sample tests/files/encoding/dot_git/hooks/prepare-commit-msg.sample tests/files/encoding/dot_git/hooks/update.sample tests/files/encoding/dot_git/index tests/files/encoding/dot_git/info/exclude tests/files/encoding/dot_git/logs/HEAD tests/files/encoding/dot_git/logs/refs/heads/master tests/files/encoding/dot_git/objects/20/aefc8947d5bf08710afabe7712a1d6040ed5bd tests/files/encoding/dot_git/objects/54/82c9609dd461acafcc859279490acfdea01f00 tests/files/encoding/dot_git/objects/87/d9aa884f84c67ac2185530f0b84d5eebda3eca tests/files/encoding/dot_git/objects/91/59312af5dd77ca1fac174a3b965a806451b5c6 tests/files/encoding/dot_git/objects/cf/921422e5382afe0c90a772a2cb37867839ae64 tests/files/encoding/dot_git/objects/d4/fc598fff13f7bd681ceb38afafcae631ab3e50 tests/files/encoding/dot_git/refs/heads/master tests/files/encoding/test1.txt tests/files/encoding/test2.txt tests/files/working/dot_git/config tests/test_helper.rb tests/units/test_config.rb tests/units/test_diff_non_default_encoding.rb tests/units/test_lib.rb tests/units/test_logger.rb tests/units/test_merge_base.rb tests/units/test_thread_safety.rb tests/units/test_thread_safty.rb
File history
It can be useful to see when a file was added. In this case again,
askgit
isn't necessarily the right way to go about it, since git log
is much faster.
Here we can show the history of a specific file,
PULL_REQUEST_TEMPLATE.md
. We'll pass in some formatting options so
that we could parse this later, and the --reverse
field so that it's
in chronological order. The pretty format that we are using is:
%aI | Author date in iso8601 |
%an | Author Name |
%h | Short commit id |
%s | Summary string |
|
|
2018-06-20T06:58:17-05:00 4aa6d1f Vern Burton Creates CONTRIBUTOR and MAINTAINERS files (#353) 2020-02-04T03:31:12+09:00 e0a385f Yuta Harima Fix broken link in a PR template (#444) 2020-02-05T00:20:01+09:00 5c89ef6 Yuta Harima fix broken link in a PR template again (#446)
So we can see that Vern Burton
did some GitHub related things in mid
2018, and Yuta Harima
did some maintenance on it in Feb of 2020.
Thoughts
Static analysis of code repositories can add some interesting insights into how a project is maintained, in terms of who is working on it, and whether the original maintainers left or moved on to other things.
For example, we could look at who has been making changes to part of a
system and, based upon who has been commiting changes to those files
and how recently they've contributed to the project, understand if the
knowledge of how it works is still with the project, or if the
knowledge has decayed. git blame
– terribly named, btw, I think it
should be more like git gratitude
– shows who last touched a specific
line of code, but they may have moved on to other things. But if
people have touched the surrounding lines or even the file itself, you
can believe that the knowledge has been refreshed within the project
as a whole.
Another thing that would be interesting to look at is churn on files, and churn on sets of files – which files have been changing, and which files change together. When commits change lots of things together, does this tell us something about the complexity of the project? Are we more likely to find bugs in "big" commits? Are files that have been unchanged in a long time more polished and complete, but ones that are frequently changing immature and need to be more tested?
Previously
Next