Skip to main content

3 posts tagged with "github"

View All Tags

Patrick DeVivo

It's a new year, and a post listing the top 30 committers to the Blender project in 2021 was on the front-page of HN this morning.

Below is the list of top committers to the Blender project in 2021. The amount of commits obviously doesn’t mean much, but it’s a neutral metric to put limelight on people who made Blender possible last year.

As the post says, counting commits doesn't mean much alone, but it's a decent proxy for relative activity within a codebase. As a maintainer of the MergeStat project, a SQL interface to data in git, I wanted to show how to make a similar list for any repo, and take it a step further with some additional metrics.

info

Take a look at our CLI guide to run the following queries yourself, on any repo.

Top 30 Contributors by Commit Count​

Replicating the list in the original blog post can be done with the following query:

SELECT
count(*), author_name, author_email
FROM commits
WHERE parents < 2 -- ignore merge commits
AND strftime('%Y', author_when) = '2021' -- commits only authored in 2021
GROUP BY author_name, author_email
ORDER BY count(*) DESC
LIMIT 30

This yields a list of names (ordered and with counts) very similar to what's in the original post πŸ™‚.

Adding Commit Stats​

The stats table allows us to also look at the number of files modified in each commit as well as LOC added or removed. We can join the stats table with commits to measure:

  1. Total lines added by a contributor
  2. Total lines removed by a contributor
  3. Distinct files modified by a contributor

In addition to just counting the number of commits.

Blender 2021 Author Stats

SELECT
count(DISTINCT hash) AS commits,
sum(additions),
sum(deletions),
count(DISTINCT file_path),
author_name, author_email
FROM commits, stats('', commits.hash)
WHERE parents < 2 -- ignore merge commits
AND strftime('%Y', author_when) = '2021' -- commits only authored in 2021
GROUP BY author_name, author_email
ORDER BY count(*) DESC
LIMIT 30

Similar to counting commits, LOC added or removed doesn't mean much indepedently. Files modified (distinct files changed by an author over the year) is a bit more interesting - depending on the size/nature of a codebase, it could be a measure of how "deeply" involved a particular contributor is - i.e. someone who contributes to all aspects of the project, not just a subset of it.

Making Some Graphs​

Finally, by adding the --format csv flag to the mergestat command when executing the above queries, we can copy the output into a spreadsheet to make some charts.

Commit Count by Author

LOC Added/Removed by Author

Distinct Files Modified by Author

These queries should be portable to your own repositories (or any other). Try them out in our web app or with our CLI.

Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share πŸŽ‰.

Patrick DeVivo

When a project has more than a handful of regular contributors, it can be dificult to see a high level view of all the changes happening. I've felt this recently, even in a codebase that went from just 1 to 3 contributors - there's a lot going on every day, and cutting through the noise to understand the cadence of outward value is difficult.

I've found that using MergeStat to summarize progress has been a useful way to achieve aspects of this. For instance, simply knowing what PRs have merged in the past 7 days has been a valuable summary for us:

-- show me PRs that merged in the past 7 days
SELECT
base_repository_name,
title,
number,
url,
author_login,
created_at,
merged_at
FROM (
SELECT * FROM github_prs('mergestat/mergestat') UNION ALL -- replace with your repo
SELECT * FROM github_prs('mergestat/another-repo') -- can be run across multiple repositories
)
WHERE
merged_at > date('now', '-7 days') -- replace with the time period you care about
AND merged = true

Pull Requests in this context are a sensible "unit of work" (for us) - whereas individual commits or files changed would be far too noisy. This query can be modified to your needs and you can run it with our CLI or in the web app. It uses our GitHub API integration (and will need an auth token).

We run this query in a weekly GitHub action that reports the results to a Slack channel. It's been a cool way to automate a summary of work done across several of our key repositories.

Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share πŸŽ‰.

Patrick DeVivo

Unanswered issues and pull requests on GitHub repos are no fun, especially if you're the one asking for help πŸ™‚. I've seen many projects making use of automation to close (or comment on) them, though folks have argued against that.

Regardless, identifying unanswered issues can be valuable, especially for maintainers to ensure they're addressing user concerns - or even for anyone looking for places to help.

You can use the MergeStat GitHub tables to identify stale issues or PRs in a repo (or even across many repos).

info

MergeStat is an open-source tool for running SQL queries against git repositories and related data sources (like the GitHub API). Check us out if you're not familiar.

Stale Issues​

This will return the oldest 25 issues created more than 30 days ago, with no comments, that remain open.

SELECT
title, author_login, comment_count, created_at, url
FROM github_repo_prs('uber-go/zap') -- replace with your repo
WHERE
created_at < date('now', '-30 days') -- replace with how long you care about (https://www.sqlite.org/lang_datefunc.html)
AND (merged = 0 OR closed = 0)
AND comment_count = 0
ORDER BY created_at ASC
LIMIT 25

(see example output)

Stale PRs​

SELECT
title, author_login, comment_count, created_at, url
FROM github_repo_prs('mergestat/mergestat') -- replace with your repo
WHERE
created_at < date('now', '-30 days') -- replace with how long you care about (https://www.sqlite.org/lang_datefunc.html)
AND merged = 0
AND closed = 0
AND comment_count = 0
ORDER BY created_at ASC
LIMIT 25
note

You can run these queries either in the Public workspace or by installing the CLI. In either case you will need to supply a GitHub token for API authentication.

export GITHUB_TOKEN="my-github-token"
cat query.sql | mergestat -v

Considerations​

  1. Maybe comment_count = 0 isn't good enough to indicate "staleness" - maybe finding issues/PRs where there are only comments from the original author, or where there are no comments from maintainers is more accurate.
  2. The above queries can be joined with github_org_repos to see stale PRs and issues across an entire org, not just in a single repo.
  3. Maybe you could produce a dashboard (charts) to track issue staleness and alert on it - an SLA on issue response time?

If you're interested in exploring these use cases, feel free to come say hi on our Slack or shoot us a note on Twitter.