askgit is a great way to look at information inside of a repository. However, it currently doesn't support looking at the files changed in the commit itself: it gives you a view of the repository at the time of the commit, but not the patch itself. Since I don't know enough about Go and SQLite virtual tables, let's just create a sqlite3 database from the log file.
Get the gitlog
We are going to use our favorite test repo, ruby-git, because it's so deliciously meta.
And then let's pull out a list of commits that includes the files with changes in each commit.
We are going to load this file into a sqlite database that we'll use to do some analysis.
Parse and load
First we need a
Gemfile for the
And then we can write a script that loads everything up.
- Create the database and tables
- Go through commits.log file one line at a time
- If the line starts with |, parse out the commit fields and add them to the database
- If the line isn't blank (and doesn't start with |), parse out the file changes and add them to the database
- If the commit is already in the database, then abort.
|email@example.com|James Couball|2020-04-25T14:40:51-07:00|Release v1.7.0
Found existing commit, exiting
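The parsing steps can be sketched in plain Ruby. This is a minimal sketch, not the original script: it assumes commit lines look like |hash|email|name|date|summary and that file lines follow the tab-separated layout of git log --numstat (added, deleted, path). The names parse_log, Commit, and FileChange are hypothetical, and rows are collected in memory instead of being inserted into SQLite.

```ruby
# Sketch only: the log layout is an assumption, not taken from the original script.
Commit = Struct.new(:id, :email, :name, :date, :summary)
FileChange = Struct.new(:commit_id, :added, :deleted, :name)

def parse_log(lines)
  commits = []
  changes = []
  current = nil

  lines.each do |raw|
    line = raw.chomp
    if line.start_with?("|")
      # Commit line: limit the split so a "|" inside the summary survives.
      _, id, email, name, date, summary = line.split("|", 6)
      current = Commit.new(id, email, name, date, summary)
      commits << current
    elsif !line.strip.empty? && current
      # File line: numstat prints "-" for binary files, which to_i turns into 0.
      added, deleted, name = line.split("\t", 3)
      changes << FileChange.new(current.id, added.to_i, deleted.to_i, name)
    end
  end

  [commits, changes]
end
```

Calling parse_log(File.foreach("commits.log")) would return the two lists, ready to be written to whatever tables you create.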
Sanity check the data
We can run some of the queries that we did before.
402
2007-11-07T12:54:26-08:00
f5baa11,"beginning of Ruby/Git project","scott Chacon",firstname.lastname@example.org,2007-11-07T12:54:26-08:00
We can also look to see who has contributed the most commits in a certain time frame, like say after 2015.
count(*)    author_name
----------  ----------------
44          Roberto Decurnex
9           James Couball
9           Per Lundberg
5           Vern Burton
2           Joshua Liebowitz
2           Yuta Harima
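The same grouping can be sketched in Ruby for anyone who wants to check the numbers outside of SQL. This is a rough equivalent, not the query used above: commits_per_author is a hypothetical helper, the input is assumed to be pairs of author name and ISO 8601 timestamp, and the cutoff is passed in explicitly. It relies on Enumerable#tally, so it needs Ruby 2.7 or later.

```ruby
require "time"

# commits: array of [author_name, author_when] pairs (assumed shape).
def commits_per_author(commits, since:)
  cutoff = Time.iso8601(since)
  commits
    .select { |_name, author_when| Time.iso8601(author_when) >= cutoff }
    .map(&:first)
    .tally
    .sort_by { |name, count| [-count, name] } # busiest author first
end
```

Parsing the timestamps, rather than comparing them as strings, keeps commits with different UTC offsets in the right order.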
Looking into file change patterns
All of this we could have done with askgit, which maps directly to the repository without having to do a separate processing step. Let's see what interesting information we can find looking inside the commits.
Most touched file
This shows how many commits reference a specific file, so we can see which is the most active.
count(*)|name
22|lib/git/lib.rb
9|README.md
8|lib/git/version.rb
7|.travis.yml
7|git.gemspec
4|CHANGELOG.md
4|lib/git/diff.rb
4|tests/units/test_lib.rb
3|PULL_REQUEST_TEMPLATE.md
3|VERSION
3|tests/test_helper.rb
2|.github/stale.yml
2|CHANGELOG
2|CONTRIBUTING.md
2|MAINTAINERS.md
2|lib/git/branch.rb
2|tests/units/test_archive.rb
2|tests/units/test_bare.rb
2|tests/units/test_config.rb
2|tests/units/test_remotes.rb
Most changed files in a time frame
Let's look to see which files have changed the most in a certain time frame. This gives you an idea of where the activity of the project has been focused.
commits|lines_touched|name
22|430|lib/git/lib.rb
2|222|CONTRIBUTING.md
1|196|lib/git/status.rb
1|169|tests/files/encoding/dot_git/hooks/pre-rebase.sample
1|144|tests/units/test_merge_base.rb
1|128|tests/files/encoding/dot_git/hooks/update.sample
4|115|tests/units/test_lib.rb
1|114|tests/files/encoding/dot_git/hooks/fsmonitor-watchman.sample
4|70|lib/git/diff.rb
1|65|tests/units/test_status.rb
1|61|tests/units/test_diff_non_default_encoding.rb
1|57|RELEASING.md
1|53|tests/files/encoding/dot_git/hooks/pre-push.sample
1|52|tests/units/test_object.rb
7|51|.travis.yml
1|49|tests/files/encoding/dot_git/hooks/pre-commit.sample
1|44|tests/units/test_init.rb
2|43|tests/units/test_remotes.rb
1|42|tests/files/encoding/dot_git/hooks/prepare-commit-msg.sample
2|42|tests/units/test_config.rb
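A rough Ruby equivalent of this kind of aggregation, under stated assumptions: Change and file_activity are hypothetical names, dates are plain YYYY-MM-DD strings (which sort chronologically), and "lines touched" is taken to mean lines added plus lines deleted, which may not be exactly how the original query defines it.

```ruby
# One row per file per commit (assumed shape).
Change = Struct.new(:commit_id, :date, :added, :deleted, :name)

def file_activity(changes, since:)
  changes
    .select { |c| c.date >= since } # string compare is safe for YYYY-MM-DD
    .group_by(&:name)
    .map do |name, rows|
      { name: name,
        commits: rows.map(&:commit_id).uniq.size,
        lines_touched: rows.sum { |r| r.added + r.deleted } }
    end
    .sort_by { |row| [-row[:lines_touched], row[:name]] }
end
```

Sorting by lines touched rather than commit count surfaces files that changed rarely but heavily, such as a one-off rewrite.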
Finding commits that changed a specific file
This is a fairly mature library that doesn't require that many changes. But those changes seem to be focused in lib/git/lib.rb. What do we think was going on?
summary                                              author_name  author_when
---------------------------------------------------  -----------  -------------------------
Ruby version compatibility conflict solution (#453)  TIT          2020-04-05T20:33:35+03:00
Add no verify for commit with documentation (#454)   Agora Secur  2020-04-05T17:33:04-05:00
Remove extraneous '--' from `git stash save -- mess  Antonio Ter  2020-04-05T17:06:53-03:00
Git::Lib#normalize_encoding early return fix (#461)  James Bunch  2020-04-05T14:15:30-07:00
Fix issue with color escape codes after recent upda  Marcel Hoye  2020-02-10T22:29:14+01:00
Fix describe command's dirty, abbrev, candidates, a  Harald       2020-02-06T02:13:29+01:00
Implementation and tests required to ensure that co  James Couba  2019-12-11T10:04:03-08:00
Support merge-base (#370)                            Evgenii Pec  2018-10-02T10:03:44-04:00
Add support for unshallow (#377)                     Stephen Pau  2018-08-24T18:06:12-05:00
using File as a block so that it tears down once ev  Vern Burton  2018-08-01T09:16:42-05:00
Support 'push <remote> <branch> --delete' (#325)     Kody         2018-07-28T17:33:52+01:00
commit with custom author date (#374)                Matias Garc  2018-07-12T16:49:52-03:00
Check if branch contains commit (#174)               Kelly Stann  2018-07-12T06:19:21-04:00
config_get: Fix incorrect block name (#373)          Joshua Lieb  2018-06-25T11:46:53-07:00
Allow fetch operation to receive a `ref` param (#36  Joshua Lieb  2018-06-25T11:45:38-07:00
Fix space prefix in tag message (#316)               Denis Defre  2018-06-20T14:02:19+02:00
Enable set url on remotes (#338)                     Tom Potts    2018-05-03T13:48:03+01:00
Enable mirror option for git clone and push (#342)   Guilherme M  2018-05-03T09:37:17-03:00
Fix UTF-8 message errors (#327)                      Alexander M  2018-04-30T14:47:26+03:00
Fix ls-files for git 2.16 (#350)                     Rafael Regg  2018-03-28T08:44:59-03:00
Updating String#encode usage to prevent silly type   Roberto Dec  2016-02-25T12:31:33-03:00
Fix the encoding issue - 'split': invalid byte sequ  David Varta  2016-02-16T15:42:14+00:00
Change coupling by commit
Let's look to see which files are changed together, which will give us a sense of which files are coupled together.
First we'll create query_and_pivot.rb, which we'll use to run our query and do the count-and-pivot work that parses the output.
And now we'll create a
cochange_by_commit.rb script that does our
| file                          | commits | cofile                     | correlation | count |
|-------------------------------+---------+----------------------------+-------------+-------|
| CHANGELOG.md                  |       4 | lib/git/version.rb         |        100% |     4 |
| lib/git/config.rb             |       4 | tests/units/test_config.rb |        100% |     4 |
| Gemfile.lock                  |       3 | git.gemspec                |        100% |     3 |
| tests/units/test_config.rb    |       5 | lib/git/config.rb          |         80% |     4 |
| lib/git.rb                    |       4 | git.gemspec                |         75% |     3 |
| tests/units/test_remotes.rb   |       8 | lib/git/lib.rb             |         75% |     6 |
| tests/units/test_tags.rb      |       4 | lib/git/lib.rb             |         75% |     3 |
| tests/units/test_index_ops.rb |       8 | lib/git/lib.rb             |         75% |     6 |
| VERSION                       |       7 | git.gemspec                |         71% |     5 |
| tests/units/test_logger.rb    |       6 | lib/git/lib.rb             |         66% |     4 |
| lib/git/diff.rb               |       8 | lib/git/lib.rb             |         62% |     5 |
| tests/units/test_remotes.rb   |       8 | lib/git/base.rb            |         62% |     5 |
| lib/git/base.rb               |      31 | lib/git/lib.rb             |         61% |    19 |
| VERSION                       |       7 | lib/git/version.rb         |         57% |     4 |
| tests/units/test_init.rb      |       7 | lib/git/lib.rb             |         57% |     4 |
| tests/units/test_init.rb      |       7 | lib/git/base.rb            |         57% |     4 |
| tests/units/test_lib.rb       |      12 | lib/git/lib.rb             |         50% |     6 |
| tests/units/test_index_ops.rb |       8 | lib/git/base.rb            |         50% |     4 |
| tests/test_helper.rb          |       9 | lib/git/lib.rb             |         44% |     4 |
| lib/git/version.rb            |      12 | git.gemspec                |         41% |     5 |
| lib/git/version.rb            |      12 | CHANGELOG.md               |         33% |     4 |
| lib/git/version.rb            |      12 | VERSION                    |         33% |     4 |
| tests/test_helper.rb          |       9 | tests/units/test_lib.rb    |         33% |     3 |
I'm limiting the output to only show files that have changed more than 2 times, mainly to get the data down to something manageable to show here. The first two columns show the file name and the number of commits that it's in (for the selected time period). The next few columns show the file that changed at the same time as the first file, and the percentage of the first file's commits that also included it. So a correlation of 100% indicates they change together every time. At 75%, the cofile changes 3 out of 4 times that the main file changes.
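The pivot described above can be sketched in a few lines of Ruby. This is an illustration under assumptions, not the contents of cochange_by_commit.rb: cochange is a hypothetical name, the input is assumed to be a hash of commit id to the list of files changed in that commit, and percentages are truncated with integer division, which appears to match how values like 66% and 62% are shown in the table.

```ruby
# files_by_commit: { commit_id => [file, file, ...] } (assumed shape).
def cochange(files_by_commit, min_commits: 1)
  commits_per_file = Hash.new(0)
  pair_counts = Hash.new(0)

  files_by_commit.each_value do |files|
    files = files.uniq
    files.each { |f| commits_per_file[f] += 1 }
    # Count every ordered pair, since the correlation is not symmetric.
    files.permutation(2) { |file, cofile| pair_counts[[file, cofile]] += 1 }
  end

  pair_counts
    .select { |(file, _cofile), _count| commits_per_file[file] >= min_commits }
    .map do |(file, cofile), count|
      total = commits_per_file[file]
      { file: file, commits: total, cofile: cofile,
        correlation: count * 100 / total, count: count }
    end
    .sort_by { |r| [-r[:correlation], -r[:count], r[:file], r[:cofile]] }
end
```

The asymmetry matters: a file that always changes with a busy cofile shows 100% in one direction and a much lower percentage in the other.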
Most of the things here make sense. Unit tests change at a high coupling rate with the files that they appear to test, but you can also see lib/git/lib.rb and lib/git/base.rb change a lot with unit tests that are testing other parts of the system. Meaning, the unit test code isn't super modular with the classes that it's testing, and there's probably a bunch of stuff in these lib modules that have spread responsibilities.
VERSION totally correlated to lib/git/version.rb?
Looking above we can see that changes to VERSION are correlated to lib/git/version.rb only 57% of the time. Weird. Let's see if we can figure out what those other changes are.
We are going to use the SQL EXCEPT keyword: first get a list of all of the commit ids that reference VERSION, then subtract the ids that reference lib/git/version.rb, which leaves the commits where VERSION changed on its own.
6d5bacd
6db4fdc
cc6d6ef
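SQL's EXCEPT is a set difference, and the same idea reads naturally in Ruby as array subtraction. A small sketch, assuming rows of commit id and file name; commits_touching and changed_without are hypothetical helper names.

```ruby
# rows: array of [commit_id, file_name] pairs (assumed shape).
def commits_touching(rows, name)
  rows.select { |_id, file| file == name }.map(&:first).uniq
end

# Commits that touched `file` but not `other`, like A EXCEPT B in SQL.
def changed_without(rows, file, other)
  commits_touching(rows, file) - commits_touching(rows, other)
end
```

The order of the operands matters: swapping them gives the commits that touched the other file alone.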
Now if we do git show 6d5bacd we can see that, in fact, the change was that VERSION was removed from the repo when it was restructured. Not a super interesting finding, but it's an anomaly so it's fun to check.
Let's see which days of this project were the most active.
count(*)    date(author_when)
----------  -----------------
16          2013-04-11
14          2015-01-05
12          2007-11-16
12          2007-11-19
12          2014-01-13
11          2015-01-12
10          2009-02-12
9           2013-08-16
8           2009-08-01
8           2013-04-29
Just glancing at this you can see flurries of activity separated by wide gaps of time. Zooming in on one of these days, you can get a sense of what happened. Someone tackled a bunch of backlog work on the project and merged quite a few pull requests:
Roberto Decurnex|Updating checkout tests to cover `-b` option
Roberto Decurnex|Fixing :new_branch usage on checkout
Roberto Decurnex|Merge pull request #37 from JangoSteve/ls_remote
Roberto Decurnex|Merge pull request #161 from xavier-calland/fetch_prune
Roberto Decurnex|Updating `clone` RDoc Sorting clone options closes #178
Roberto Decurnex|Merge pull request #132 from arnvald/test-branch-create-does-not-switch-branch
Roberto Decurnex|Updating clone --b to clone --branch (just to make it a little more verbose)
Roberto Decurnex|Merge branch 'test_unit_needs_specificity' of https://github.com/kwstannard/ruby-git into kwstannard-test_unit_needs_specificity
Roberto Decurnex|Merge branch 'diff_fix_current_vs_head' of https://github.com/francisluong/ruby-git into francisluong-diff_fix_current_vs_head
Roberto Decurnex|Merge branch 'NotDaveLane-master'
Roberto Decurnex|Adding Git.clone test for :branch option
Roberto Decurnex|Merge branch 'master' of https://github.com/NotDaveLane/ruby-git into NotDaveLane-master
Roberto Decurnex|Merge branch 'master' of github.com:schacon/ruby-git
Roberto Decurnex|Updating ssh_key -> git_ssh
If you were in charge of testing things, you'd definitely want to be a bit more thorough than normal.
Change coupling by time
Commits aren't necessarily the only unit of work on a team. Depending upon the process and work style, commits could be small and frequent (I'm sure with comments like argggh and work dammnit) or they could be well structured units of work that can be thought of as logically discrete units.
For those repositories that aren't like that, let's group changes by time and see if that gives us any interesting insights into which files were modified together.
| file                          | commits | cofile                        | correlation | count |
|-------------------------------+---------+-------------------------------+-------------+-------|
| CHANGELOG.md                  |       4 | lib/git/version.rb            |        100% |     4 |
| lib/git/config.rb             |       4 | tests/units/test_config.rb    |        100% |     4 |
| tests/units/test_remotes.rb   |       6 | lib/git/lib.rb                |        100% |     6 |
| tests/units/test_tags.rb      |       3 | lib/git/lib.rb                |        100% |     3 |
| tests/units/test_init.rb      |       6 | lib/git/lib.rb                |        100% |     6 |
| lib/git/object.rb             |       4 | lib/git/lib.rb                |        100% |     4 |
| lib/git/object.rb             |       4 | lib/git/base.rb               |        100% |     4 |
| tests/units/test_base.rb      |       4 | lib/git/lib.rb                |        100% |     4 |
| tests/units/test_index_ops.rb |       6 | lib/git/lib.rb                |        100% |     6 |
| lib/git/path.rb               |       4 | lib/git/base.rb               |        100% |     4 |
| VERSION                       |       6 | lib/git/lib.rb                |         83% |     5 |
| tests/units/test_remotes.rb   |       6 | lib/git/base.rb               |         83% |     5 |
| lib/git/base.rb               |      18 | lib/git/lib.rb                |         83% |    15 |
| tests/units/test_init.rb      |       6 | lib/git/base.rb               |         83% |     5 |
| tests/units/test_config.rb    |       5 | lib/git/config.rb             |         80% |     4 |
| lib/git.rb                    |       4 | git.gemspec                   |         75% |     3 |
| lib/git/diff.rb               |       8 | lib/git/lib.rb                |         75% |     6 |
| lib/git/config.rb             |       4 | README.md                     |         75% |     3 |
| tests/units/test_base.rb      |       4 | .travis.yml                   |         75% |     3 |
| lib/git/path.rb               |       4 | .travis.yml                   |         75% |     3 |
| lib/git/path.rb               |       4 | lib/git/lib.rb                |         75% |     3 |
| lib/git/path.rb               |       4 | tests/units/test_index_ops.rb |         75% |     3 |
| .travis.yml                   |      14 | lib/git/lib.rb                |         71% |    10 |
| tests/units/test_logger.rb    |       6 | lib/git/lib.rb                |         66% |     4 |
| VERSION                       |       6 | git.gemspec                   |         66% |     4 |
| tests/units/test_index_ops.rb |       6 | .travis.yml                   |         66% |     4 |
| tests/units/test_index_ops.rb |       6 | lib/git/base.rb               |         66% |     4 |
| tests/units/test_lib.rb       |      11 | lib/git/lib.rb                |         63% |     7 |
I truncated the table since there are a lot more things that are correlated now that we've grouped by date. A lot of things make sense here, but again it seems like files are changed together a lot, which could be nothing but makes you wonder about the modularity of the files.
lib/git/base.rb and lib/git/lib.rb are both changed a lot, and they are frequently changed together, 83% of the time.
Perhaps things have been split up prematurely if multiple files always change at the same time? Or is this normal in terms of how object-oriented encapsulation works? Based upon the generic, all-encompassing names (base and lib), I wonder if you could guess why something would go into one file or another.
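Grouping by time instead of by commit only changes what counts as a unit of work. Here is a minimal sketch of that idea, with hypothetical names (files_by_day, coupling) and one loud assumption: the day is taken from the first ten characters of each ISO 8601 timestamp, so it is the date in the author's own offset rather than a normalized UTC date.

```ruby
require "set"

# rows: array of [author_when, file_name] pairs (assumed shape).
def files_by_day(rows)
  rows.each_with_object(Hash.new { |h, k| h[k] = Set.new }) do |(at, file), days|
    days[at[0, 10]] << file # "2020-01-25T10:00:00-08:00" -> "2020-01-25"
  end
end

# Same pair counting as for commits, with a day as the unit of work.
def coupling(groups)
  totals = Hash.new(0)
  pairs = Hash.new(0)
  groups.each_value do |files|
    files.each { |f| totals[f] += 1 }
    files.to_a.permutation(2) { |pair| pairs[pair] += 1 }
  end
  pairs.map { |(file, cofile), n| [file, cofile, n * 100 / totals[file]] }
end
```

Because a day lumps together unrelated commits, the correlations come out higher than the per-commit numbers, so they are better read as a hint than as proof of coupling.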
Knowledge and maturity
- Knowledge of a file is related to how much someone touched it
- Knowledge decays over time
- Active files means that something is changing
- People come and go on projects
One thing to note in this data is that scott Chacon is treated as a different author from Scott Chacon, and the same goes for Roberto Decurnex and robertodecurnex. One simple way to normalize this is to update your index database to add the aliases and mappings there.
These need to be identified by hand and re-tweaked depending upon the particular repo you are looking at.
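One way to picture the alias mapping is a small lookup table applied before counting. This is only a sketch: ALIASES and canonical_author are hypothetical names, and the two entries are just the pairs mentioned above.

```ruby
# Hand-maintained map from a name as it appears in the log to the canonical name.
ALIASES = {
  "scott Chacon" => "Scott Chacon",
  "robertodecurnex" => "Roberto Decurnex"
}.freeze

# Names without an alias pass through unchanged.
def canonical_author(name, aliases = ALIASES)
  aliases.fetch(name, name)
end
```

Keeping the map as data rather than code makes it easy to swap in a different list for each repository.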
Authorship of a file
Let's look at who "knows" about a file based upon how much they've
authored it. I'm going to pick on
lib/git/base.rb since that seems to
be an exciting hotspot.
name             author_name   commits     last_touched
---------------  ------------  ----------  ------------
lib/git/base.rb  Scott Chacon  30          2008-05-27
lib/git/base.rb  Roberto Decu  23          2015-01-12
lib/git/base.rb  Jorge Bernal  3           2008-05-06
lib/git/base.rb  Joe Moore     2           2013-04-29
lib/git/base.rb  Bryan Larsen  1           2009-10-13
lib/git/base.rb  Cameron Wals  1           2014-01-28
lib/git/base.rb  Daniel Mendl  1           2009-02-10
lib/git/base.rb  Harald Sitte  1           2015-01-12
lib/git/base.rb  James Rosen   1           2008-12-21
lib/git/base.rb  Jonathan Rud  1           2008-12-20
lib/git/base.rb  Joshua Peek   1           2008-03-15
lib/git/base.rb  Ken Pratt     1           2008-08-13
lib/git/base.rb  Michael Mall  1           2013-05-23
lib/git/base.rb  TJ Biddle     1           2013-06-14
lib/git/base.rb  Tom Potts     1           2018-05-03
lib/git/base.rb  Yuya.Nishida  1           2014-07-07
Scott Chacon and Jorge Bernal moved on to other things in 2008. Roberto did a bunch of work in 2015, but the file has been mostly untouched since then, apart from something Tom Potts did in 2018. Roberto last touched this specific file on 2015-01-12, though he stayed active in the repo for another year.
author_name       last_active
----------------  -----------
Roberto Decurnex  2016-02-25
Authorship by weight
We can also look to see how many lines of code were added by someone, to go a little deeper. Maybe they only made 1 commit, but it changed every last thing?
author_name   last_touched  added
------------  ------------  ----------
Scott Chacon  2008-05-27    520
Roberto Decu  2015-01-12    192
Jorge Bernal  2008-05-06    23
James Rosen   2008-12-21    18
Joe Moore     2013-04-29    13
Tom Potts     2018-05-03    11
TJ Biddle     2013-06-14    9
Daniel Mendl  2009-02-10    6
Joshua Peek   2008-03-15    4
Bryan Larsen  2009-10-13    2
Cameron Wals  2014-01-28    2
Jonathan Rud  2008-12-20    2
Ken Pratt     2008-08-13    2
Michael Mall  2013-05-23    2
Yuya.Nishida  2014-07-07    2
Harald Sitte  2015-01-12    1
We do see that
Tom Potts moved up a bit in the list.
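Both views of authorship, by commit count and by lines added, come from the same grouping. A rough sketch under assumptions: Touch and authorship are hypothetical names, and each Touch is taken to be one commit's change to one file, with a YYYY-MM-DD date.

```ruby
# One row per commit per file (assumed shape).
Touch = Struct.new(:file, :author, :date, :added)

def authorship(touches, file)
  touches
    .select { |t| t.file == file }
    .group_by(&:author)
    .map do |author, rows|
      { author: author,
        commits: rows.size,
        last_touched: rows.map(&:date).max, # latest date sorts last as a string
        added: rows.sum(&:added) }
    end
end
```

Sorting the result by commits gives the first ranking and sorting by added gives the second, which is how a single large commit can move someone up the list.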
Who knows about that file?
If you had a question about lib/git/base.rb, who would you ask? Scott Chacon wrote most of it, but he hasn't been around in a while. Roberto Decurnex was the last person to make substantial changes to it, but that was back in 2015. Even if he was still active on the project, which he doesn't appear to be, do we think he'd remember any of the details? Does the most recent contributor make sense to talk to, even though Tom Potts only made a small change in 2018?
We know that the file is significant because it used to change a lot with a bunch of other files. Maybe the file is super clean and looking at it means that all of its responsibilities are self-evident (I haven't opened it up yet), but my sense is that it probably has a lot of stuff in there, and that knowledge will have to be recreated the next time that anyone opens it up to make a change.
Stable or dead?
Instead of focusing on the things that have changed, let's take a look at the things that haven't. Right now we don't have a good way to see if a file is still in the latest revision, so we can put that into the database as well and join off of it.
596 current files
Now that we have that list, we can join off of it to get when things were last modified and by whom.
I'm going to filter out the
test files since there's a lot of sample
data there, not that test code isn't important.
name                          last_touched  author
----------------------------  ------------  ------------
lib/git/working_directory.rb  2007-11-08    Scott Chacon
lib/git/index.rb              2007-11-16    Scott Chacon
LICENSE                       2008-05-26    Scott Chacon
lib/git/stash.rb              2008-12-21    James Rosen
lib/git/branches.rb           2013-04-10    Roberto Decu
lib/git/repository.rb         2013-04-10    Roberto Decu
.gitignore                    2013-04-11    Roberto Decu
lib/git/log.rb                2013-08-17    Roberto Decu
Gemfile                       2014-01-13    Roberto Decu
lib/git/remote.rb             2014-01-28    Cameron Wals
lib/git/path.rb               2014-07-07    Yuya.Nishida
lib/git/author.rb             2015-01-12    Roberto Decu
lib/git/object.rb             2015-01-12    Roberto Decu
lib/git/status.rb             2018-03-07    Vern Burton
lib/git/base.rb               2018-05-03    Tom Potts
Rakefile                      2018-05-16    Per Lundberg
ISSUE_TEMPLATE.md             2018-06-20    Vern Burton
lib/git/stashes.rb            2018-08-01    Vern Burton
.travis.yml                   2018-08-22    Vern Burton
lib/git/base/factory.rb       2018-10-02    Evgenii Pech
lib/git/config.rb             2019-09-20    Salim Afiune
git.gemspec                   2019-12-11    James Coubal
lib/git/diff.rb               2019-12-11    James Coubal
.github/stale.yml             2020-01-19    Per Lundberg
lib/git.rb                    2020-01-22    cyclotron3k
lib/git/branch.rb             2020-01-23    James Coubal
CONTRIBUTING.md               2020-01-25    James Coubal
MAINTAINERS.md                2020-01-25    James Coubal
RELEASING.md                  2020-01-25    James Coubal
PULL_REQUEST_TEMPLATE.md      2020-02-04    Yuta Harima
README.md                     2020-04-05    Alex Mayer
lib/git/lib.rb                2020-04-05    TIT
CHANGELOG.md                  2020-04-25    James Coubal
lib/git/version.rb            2020-04-25    James Coubal
So from this we can see which parts of the project have been pretty much stable since the early days, from working_directory.rb probably up through object.rb, and which things have still been changing in this otherwise stable project. At the beginning of the year there was a bunch of work on things related to project administration.
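The join between current files and their last change can be sketched like this. It is an illustration only: FileTouch and last_touched are hypothetical names, dates are assumed to be YYYY-MM-DD strings, and the tests/ prefix filter stands in for however the original query excluded test files. It uses filter_map, so it needs Ruby 2.7 or later.

```ruby
# One row per commit per file (assumed shape).
FileTouch = Struct.new(:name, :date, :author)

def last_touched(current_files, touches, skip_prefix: "tests/")
  latest = touches.group_by(&:name)
                  .transform_values { |rows| rows.max_by(&:date) }

  current_files
    .reject { |name| name.start_with?(skip_prefix) }
    .filter_map { |name| latest[name] } # drop files with no recorded change
    .sort_by(&:date)                    # oldest, most stable files first
end
```

Starting from the list of current files, rather than from the change history, is what keeps deleted files such as VERSION out of the result.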
Coding creates artifacts that change over time, and we have access to this data in the repositories that we share. Having everything accessible in SQL helps with complicated joins and analysis. I've scripted this in Ruby and we are looking at a Ruby project, but nothing that we've done has looked inside of the project itself to identify areas of note and interest. It's been purely about the shape of how things are changing, what is changing together, and metadata analysis.
Now that we've gotten a taste of what this looks like (and I'm sure there's much more we can look into), there are a few other sources of metadata that we should look at:
- Issues/tickets and how they relate to the code
- Mailing lists? Chat rooms? Other discussions around the repository
- Projects that rely upon this as a dependency, and how/if those users are putting in tickets or bug fixes into the code
- Looking into the Gemfile to examine this project's dependencies to see what's going on there.
But this is already long enough.