Finding Code Hotspots in Git Repositories 🔥

Patrick DeVivo

Recently, I came across this tweet from Nicolas Carlo:

Nicolas Carlo tweet about finding hotspots in a git repo

Hotspots in a (git) codebase can be surfaced with the following:

git log --format=format: --name-only --since=12.month \
| egrep -v '^$' \
| sort \
| uniq -c \
| sort -nr \
| head -50

This defines hotspots as the files most frequently modified in the last year (by number of commits).

This shell pipeline produces the same kind of list that both MergeStat and MergeStat Lite can surface, but they do it using SQL 🎉!

MergeStat Example

MergeStat can be used to surface this list as well:

select file_path, count(*)
from git_commits
join git_commit_stats
  on git_commits.repo_id = git_commit_stats.repo_id
  and git_commits.hash = git_commit_stats.commit_hash
join repos on git_commits.repo_id = repos.id
where repo like '%mergestat/mergestat' -- limit to a specific repo
  and git_commits.parents < 2 -- ignore merge commits
  and author_when > now() - '1 year'::interval
group by file_path
order by count(*) desc
limit 50

Screenshot of MergeStat Example

MergeStat Lite Example

MergeStat Lite (our CLI) can be run against a git repo on disk to surface the same set of file paths:

select file_path, count(*)
from commits, stats('', commits.hash)
where commits.author_when > date('now', '-12 month')
  and commits.parents < 2 -- ignore merge commits
group by file_path
order by count(*) desc
limit 50

Screenshot of MergeStat Lite Example

Why bother?

As Nicolas Carlo points out, identifying hotspots in a codebase is an effective way to determine which files are worth examining as candidates for a refactor.

The SQL queries above can be modified to better suit your needs. For example:

  • Filter for particular file types by extension (maybe you only care about hotspots in .go files)
  • Filter out particular directories
  • Modify the time frame
  • Surface hotspots across multiple repositories
  • Filter hotspots based on authors
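
The same kinds of tweaks can also be sketched against the shell pipeline from the tweet. The snippet below is a rough illustration, not part of MergeStat: the `hotspots` function and its three arguments (extension, directory to skip, number of rows) are hypothetical names made up for this example. It reads a list of changed file paths on standard input, so the time frame and author are controlled by whatever `git log` command feeds it.

```shell
#!/usr/bin/env bash
# Sketch only: `hotspots` and its arguments are illustrative names,
# not something defined by git or MergeStat.
#
#   $1  file extension to keep, without the dot (e.g. go)
#   $2  top-level directory to skip (e.g. vendor)
#   $3  number of rows to print (defaults to 50)
hotspots() {
  local ext="$1" exclude_dir="$2" limit="${3:-50}"
  grep -v '^$' \
    | grep -E "\.${ext}\$" \
    | grep -v "^${exclude_dir}/" \
    | sort \
    | uniq -c \
    | sort -nr \
    | head -n "$limit"
}

# Example usage inside a git repository (the time frame and author
# are adjusted with git log's own --since and --author options):
#
#   git log --format=format: --name-only --since=6.month --author='Jane' \
#     | hotspots go vendor 20
```

Because the filtering happens on a plain list of paths, the function can be tried on any text input before pointing it at a real repository.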