Skip to main content

10 posts tagged with "git"

View All Tags

Patrick DeVivo

We're thrilled to announce the release of MergeStat v2, a completely new approach to extracting and syncing data from Git repositories and into SQL! πŸŽ‰

v2 banner image

MergeStat's mission has always been to enable anyone to ask and answer questions about anything involved in building and shipping software. Today we're taking a major step closer to that reality, by supporting container-based syncs.

That means all the data sources MergeStat supports are now defined as OCI containers, and live in a separate repo: mergestat/syncs.

By externalizing our sync implementations, adding and using new data sources is now much easier. This means working with data from open-source code scanners such as gosec can be implemented outside the main MergeStat repo!

Screenshot of repo syncs

Check it out! πŸš€

For example, we recently added a new sync for GitHub Dependabot data in this new container-based approach, allowing for queries on Dependabot alerts across repos.

Join our Slack

Come join our community Slack! We're always happy to chat about MergeStat there πŸŽ‰.

Patrick DeVivo

One of MergeStat's more unusual data syncs is GIT_BLAME. It's unusual because the output of git blame is typically not viewed in aggregate, let alone with SQL πŸ™‚.

We spent some time playing with this data, and came up with some interesting queries! We also learned a bit about the MongoDB source code running these queries (from the time of writing):

MongoDB Logo

Check our getting started instructions to run these queries on your own data!

The GIT_BLAME sync today, runs git blame ... on every file in a repo, and stores the output in a postgres table with the following columns:

columntype
repo_idUUID NOT NULL
author_emailTEXT
author_nameTEXT
author_whenTIMESTAMP(6) WITH TIME ZONE
commit_hashTEXT
line_noINTEGER NOT NULL
lineTEXT
pathTEXT NOT NULL
_mergestat_synced_atTIMESTAMP(6) WITH TIME ZONE NOT NULL

So for every line of code in a repo, we're able to access the author information (and commit hash, which can be joined with the git_commits table) about who last modified that line.

What percent of code (by line) is each author "blameable" for?​

Of all the lines of code in a repo, what percent of the code was last modified by each author:

WITH blamed_lines AS (
SELECT * FROM git_blame
JOIN repos ON repos.id = git_blame.repo_id
WHERE repo = 'https://github.com/mergestat/mergestat' -- only look at one repo
)
SELECT ROUND(100.0*count(*)/(SELECT count(*) FROM blamed_lines), 2) AS percent, author_name FROM blamed_lines
GROUP BY author_name
ORDER BY count(*) DESC
37.12   (author #1)
27.02 (author #2)
12.51 (author #3)
6.47 (author #4)
4.47 (author #5)
3.23 (author #6)
2.25 (author #7)
1.92 (author #8)
1.59 (author #9)
...

This query tells us about how much code in a repo is directly attributable to what authors (where "attributable" means directly blameable - i.e. the author was the last to modify the line in some way). This may tell us something about how much "impact" an author has on a codebase (at least in an aggregate view).

For example, authors directly attributable to more of the current code could be considered to be more knowledgeable about the project overall. Or, if an author has a relatively low ranking on this list, their impact on the project could be considered minimal (why has none of their contribution "stuck around" over time? Is their code frequently getting rewritten by others?).

This way of viewing the data could fall apart when:

  • An author changes a lot of files with superficial changes (like by applying auto-fixes with a code linter). This alone would move the author's blameable line count up quite a bit, but it would be unfair to conclude that this author is extra knowledgeable about the code from that alone.
  • There's a lot of old code in a project from an author who's no longer an active maintainer. This person may no longer be "knowledgeable," though they may have many lines still attributable to them.
  • An author brings in a large amount of vendored (3rd party) code.

To account for some of these scenarios, we can tweak the above query to apply certain filters. For instance, let's limit our analysis to only certain types of files in a repo.

WITH blamed_lines AS (
SELECT * FROM git_blame
JOIN repos ON repos.id = git_blame.repo_id
WHERE repo = 'https://github.com/mergestat/mergestat' -- only look at one repo
AND path LIKE '%.go' -- only look at .go files
)
SELECT ROUND(100.0*count(*)/(SELECT count(*) FROM blamed_lines), 2) AS percent, author_name FROM blamed_lines
GROUP BY author_name
ORDER BY count(*) DESC

A query similar to the above could allow you to look only at blameable lines for certain directories as well.

What's the average (mean) age of a line of code?​

How long has it been since a line of code in our repo was last modified? This may be an interesting way to measure the "staleness" of code in a project. What's the age of most of our code?

SELECT avg((extract(epoch FROM (SELECT (now() - author_when)))/86400)::int) as avg_age_days from git_blame
JOIN repos ON repos.id = git_blame.repo_id
WHERE repo = 'https://github.com/mergestat/mergestat'

What about only in certain file types?

SELECT avg((extract(epoch FROM (SELECT (now() - author_when)))/86400)::int) as avg_age_days from git_blame
JOIN repos ON repos.id = git_blame.repo_id
WHERE repo = 'https://github.com/mergestat/mergestat'
AND path LIKE '%.go' -- only look at go files

What's the average age of code, by author?​

Combining the two queries above, we can look at the relative age of code by author!

SELECT count(*), avg((extract(epoch FROM (SELECT (now() - author_when)))/86400)::int) AS avg_age_days, author_name FROM git_blame
JOIN repos on repos.id = git_blame.repo_id
WHERE repo = 'https://github.com/mergestat/mergestat'
GROUP BY author_name
ORDER BY count(*) DESC

This tells us something about how "relevant" certain authors remain in a codebase. Do some authors have more recent code (on average) than others?

Join our Slack

If you found this interesting, hop in our community Slack! We're always happy to chat about MergeStat there πŸŽ‰.

Patrick DeVivo

GitHub often feels like the center of the developer universe, but a lot of developers really call Bitbucket or GitLab home. Today we're very excited to announce that the latest releases of MergeStat support Bitbucket and GitLab git repos!

Screenshot of our Git Source UI

This comes hot on the tails of a refactor to how MergeStat handles Git Sources generally, and we're really excited to now support hosts other than GitHub!

This has been one of our more commonly requested features, and we're already beginning to work with users who primarily use Bitbucket and GitLab.

Bitbucket and GitLab repo auto imports are available today as well. πŸš€

Bitbucket logo

GitLab logo

Join our Slack

If you have questions, hop in our community Slack! We're always happy to chat about MergeStat there πŸŽ‰.

Patrick DeVivo

In software, as maybe in life, change is as inevitable as death and taxes πŸ™‚. SCM/VCS tools like Git, which manage code changes, are fundamental to anyone developing code. If DORA is your thing, you'll also know that "deployment frequency" is a key metric (as a proxy for measuring development velocity over time).

In our work on MergeStat, we've often wanted to understand how frequently we ship code. In particular, we want to ensure that our work makes it's way into the hands of our users on a regular basis. Luckily, we've been able to use MergeStat to keep tabs on our release frequency. We've built an internal dashboard using Grafana to support our activities around managing our releases:

Screenshot of Grafana dashboard tracking our release frequency

This dashboard, and the queries powering it, allow us to answer some questions quickly:

  • How long has it been since our last release?
  • On average, how frequently do we typically release new versions? (Over the past 90 days)
  • How are we doing relative to our own benchmarks of a "healthy" release cadence?
    • Where time-between-releases under 1 week is healthy βœ…
    • Between 1 and 2 weeks is a warning ❗
    • More than 2 weeks is an alert 🚨

And here are some of the SQL queries powering these views:

-- retrieve the days since the last MergeStat release
SELECT name, committer_when, ROUND(EXTRACT(EPOCH FROM now() - committer_when)/(86400), 2) AS days_since_last_release
FROM git_tags
JOIN repos ON git_tags.repo_id = repos.id
JOIN git_commits ON git_tags.tag_commit_hash = git_commits.hash
WHERE repo = 'https://github.com/mergestat/mergestat'
ORDER BY committer_when DESC LIMIT 1
-- retrieve the avg (mean) time between new releases, over the last 90 days
WITH tags AS (
SELECT
name, committer_when,
committer_when - LAG(committer_when) OVER (ORDER BY committer_when) AS since_prev_tag
FROM git_tags
JOIN repos ON git_tags.repo_id = repos.id
JOIN git_commits ON git_tags.tag_commit_hash = git_commits.hash
WHERE repo = 'https://github.com/mergestat/mergestat'
AND author_when > now() - '90 days'::interval
ORDER BY committer_when ASC
)
SELECT ROUND(EXTRACT(EPOCH FROM avg(since_prev_tag))/(86400), 2) AS mean_time_between_releases FROM tags

What comes next?

  • We'd like to alert to our Slack when it's been too long since a release
  • We'd like a regular push to our Slack for a regular update on how our release cadence is
Join our Slack

If you found this interesting or would like a similar view of your release cadence, hop in our community Slack! We're always happy to chat about MergeStat there πŸŽ‰.

Patrick DeVivo

Recently, I came across this tweet from Nicolas Carlo:

Nicolas Carlo tweet about finding hotspots in a git repo

Finding hotspots in a (git) codebase can be surfaced with the following:

git log --format=format: --name-only --since=12.month \
| egrep -v '^$' \
| sort \
| uniq -c \
| sort -nr \
| head -50

This defines hotspots as the files most frequently modified in the last year (by number of commits).

This bash script looks a lot like what both MergeStat and MergeStat Lite can surface, but using SQL πŸŽ‰!

MergeStat Example​

MergeStat can be used to surface this list as well:

select file_path, count(*)
from git_commits join git_commit_stats on (git_commits.repo_id = git_commit_stats.repo_id and git_commits.hash = git_commit_stats.commit_hash)
join repos on git_commits.repo_id = repos.id
where repo like '%mergestat/mergestat' -- limit to a specific repo
and git_commits.parents < 2 -- ignore merge commits
and author_when > now() - '1 year'::interval
group by file_path
order by count(*) desc
limit 50

Screenshot of MergeStat Example

MergeStat Lite Example​

MergeStat Lite (our CLI) can be run against a git repo on disk to surface the same set of file paths:

select
file_path, count(*)
from commits, stats('', commits.hash)
where commits.author_when > date('now', '-12 month')
and commits.parents < 2 -- ignore merge commits
group by file_path
order by count(*) desc
limit 50

Screenshot of MergeStat Lite Example

Why bother?​

As Nicolas Carlo points out, identifying hotspots in a codebase is an effective way to determine which files are worth examining as candidates for a refactor.

The SQL queries above can be modified to better suit your needs. For example:

  • Filter for particular file types by extension (maybe you only care about hotspots in .go files, for example)
  • Filter out particular directories
  • Modify the time frame
  • Surface hotspots across multiple repositories
  • Filter hotspots based on authors

Patrick DeVivo

Today we're very happy to announce a new flavor of MergeStat! If you've been following our progress over the last year, you'll know that our mission is to enable operational analytics for software engineering teams. You'll also know that our approach has been heavily based on SQLite (and its virtual table mechanism) to bring data from git repositories (source code, metadata, the GitHub API, etc) into a SQL context for flexible querying and analytics.

MergeStat Management Console Illustration

This has enabled initial use cases in many interesting domains related to the software-development-lifecycle, from audit and compliance, to code querying, to release time metrics, to technical debt surfacing, and much, much more.

For the past several months, we’ve been working on an approach in service of this mission that uses a different mechanism for enabling that SQL context. That mechanism is based on syncing data into a Postgres database for downstream querying, vs using a local SQLite based approach.

This grants us two significant advantages:

  1. More compatibility with the ecosystem of data (and BI/visualization) tools
  2. Much faster query time for data spread across multiple API sources

More Compatibility​

We love SQLite, but the fact of the matter is that it’s much easier to integrate a data visualization or BI product (open-source or not) with a Postgres server (or Postgres compatible server) than with a SQLite database file. A central part of our mission has always been not just enabling SQL, but allowing our SQL solution to play well with the wide array of existing tools that companies, teams and individuals are already using.

We want anyone to be able to query MergeStat data from tools like Metabase, Grafana, Tableau, Superset, etc. Postgres compatibility takes us a step in that direction, and we’re already working with early users integrating their SDLC data with these types of tools, via MergeStat.

Faster Queries​

Our SQLite virtual-table based approach is an effective way to query a local data source (such as a git repository on disk). However, it begins to fall short when it comes to querying over large sets of data spread across many pages in a web API (or multiple APIs). What ends up happening is queries spend much more time waiting for HTTP requests to finish than actually executing SQL. This means at query time, the user is forced to wait for data collection to occur. Depending on the scope of the data involved, this can potentially take a very long time. With our new approach, data is synced and maintained in a background process, so that data collection (from potentially slow sources) occurs out-of-band from user queries.

What’s Next?​

Our SQLite approach isn’t going anywhere. We still intend to invest in that and continue building it into a valuable open-source CLI that anyone can use. It’s been renamed MergeStat Lite and now lives at mergestat/mergestat-lite.

Our new approach now lives at mergestat/mergestat. Our documentation has been updated to reflect these changes, and we couldn’t be more excited to share more in the coming days and weeks.

We have some great features planned, and will be spending more time showcasing various use-cases from our growing community.

Join our Slack

As usual, our community Slack is a great place to find help and ask questions. We're always happy to chat about MergeStat there πŸŽ‰!

Patrick DeVivo

For organizations using GitHub Enterprise, the GitHub audit log is now supported as a table in the MergeStat CLI, enabling SQL queries against audit log events πŸŽ‰. Check it out!

The new table-value-function is github_org_audit_log, and may be queried like so:

SELECT * FROM github_org_audit_log('mergestat') -- replace with your org name

and has the following columns (type is TEXT unless otherwise specified):

  • id
  • entry_type
  • action
  • actor_type
  • actor_login
  • actor_ip
  • actor_location (JSON)
  • created_at (DATETIME)
  • operation_type
  • user_login
GitHub Enterprise Only

This feature only works with GitHub Enterprise organizations, since the audit log API is only available to enterprise accounts.

This enables some interesting use-cases for GitHub enterprise users looking to monitor, alert and report on events from the GitHub audit log. We're working with some early users of this table, and will publish some guides and examples in the near future πŸ”œ.

The JSON column (actor_location) can be accessed with the SQLite JSON operators.

SELECT actor_location->>'country' FROM github_org_audit_log('mergestat')
+----------------------------+
| ACTOR_LOCATION->>'COUNTRY' |
+----------------------------+
| United States |
+----------------------------+
| United States |
+----------------------------+
| Belgium |
+----------------------------+
| ... |
+----------------------------+

It's important to note the following from the GitHub API docs as well:

GitHub audit log notes

Join our Slack

Our community Slack is a great place to find help and ask questions. We're always happy to chat about MergeStat there πŸŽ‰!

Patrick DeVivo

SQLite 3.38.0 (released 2022-02-22) added support for built-in JSON functions (no longer a compile-time option) and two new JSON operators, -> and ->>, similar to ones available in MySQL and PostgreSQL.

MergeStat is a SQL interface to data in git repositories, based on SQLite. As such, we can now take advantage of these new operators in our queries, wherever JSON data is returned.

note

To install the MergeStat CLI via homebrew, run:

brew tap mergestat/mergestat
brew install mergestat

Or see other options

package.json​

An easy source of JSON to start with is the package.json file present in most JavaScript/TypeScript codebases. Let's take a look at the one in the popular tailwindlabs/tailwindcss repo.

SELECT contents->>'name', contents->>'license' FROM files WHERE path = 'package.json'
+-------------------+----------------------+
| CONTENTS->>'NAME' | CONTENTS->>'LICENSE' |
+-------------------+----------------------+
| tailwindcss | MIT |
+-------------------+----------------------+

Here we use contents->>'name' and contents->>'license' to pull out the name and license fields from the package.json JSON file.

Let's take this up a level, and look at at the name and license fields for all repos in an org where a package.json file is present. Sticking with tailwindlabs (and now adding a GITHUB_TOKEN when we execute the following):

-- CTE (https://www.sqlite.org/lang_with.html)
-- to select all the package.json files from repos in the tailwindlabs GitHub org
WITH package_json_files AS (
SELECT
name,
github_repo_file_content('tailwindlabs', repo.name, 'package.json') AS package_json
FROM github_org_repos('tailwindlabs') repo
)
SELECT
name as repo_name,
package_json->>'name',
package_json->>'license'
FROM package_json_files
WHERE package_json IS NOT NULL
note

This query may take some time to run, as it makes a fair number of GitHub API requests.

+-----------------------------+-----------------------------+--------------------------+
| REPO_NAME | PACKAGE_JSON->>'NAME' | PACKAGE_JSON->>'LICENSE' |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss | tailwindcss | MIT |
+-----------------------------+-----------------------------+--------------------------+
| webpack-starter | NULL | NULL |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss.com | NULL | NULL |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-plugin-examples | NULL | NULL |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-intellisense | root | NULL |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-playground | NULL | NULL |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-custom-forms | @tailwindcss/custom-forms | MIT |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-typography | @tailwindcss/typography | MIT |
+-----------------------------+-----------------------------+--------------------------+
| heroicons | heroicons | MIT |
+-----------------------------+-----------------------------+--------------------------+
| tailwindui-vue | @tailwindui/vue | MIT |
+-----------------------------+-----------------------------+--------------------------+
| blog.tailwindcss.com | tailwind-blog | NULL |
+-----------------------------+-----------------------------+--------------------------+
| tailwindui-react | @tailwindui/react | MIT |
+-----------------------------+-----------------------------+--------------------------+
| heroicons.com | NULL | NULL |
+-----------------------------+-----------------------------+--------------------------+
| play.tailwindcss.com | play.tailwindcss.com | NULL |
+-----------------------------+-----------------------------+--------------------------+
| headlessui | headlessui | MIT |
+-----------------------------+-----------------------------+--------------------------+
| tailwind-play-api | NULL | NULL |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-aspect-ratio | @tailwindcss/aspect-ratio | MIT |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-forms | @tailwindcss/forms | MIT |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-line-clamp | @tailwindcss/line-clamp | MIT |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-jit | @tailwindcss/jit | MIT |
+-----------------------------+-----------------------------+--------------------------+
| prettier-plugin-tailwindcss | prettier-plugin-tailwindcss | NULL |
+-----------------------------+-----------------------------+--------------------------+

We see that most repos with a package.json file also have a name, while some do not (NULL). All the declared licences appear to be MIT.

Dependencies​

Finally, let's take a look at all the dependencies declared in package.json files across (JavaScript/TypeScript) codebases in an org. This time, let's try on the freeCodeCamp GitHub org.

WITH package_json_files AS (
SELECT
name,
github_repo_file_content('freeCodeCamp', repo.name, 'package.json') AS package_json
FROM github_org_repos('freeCodeCamp') repo
WHERE (primary_language = 'TypeScript' OR primary_language = 'JavaScript')
)
SELECT
count(*),
json_group_array(name) AS repo_names,
dep.key
FROM package_json_files, json_each(package_json->'dependencies') dep
WHERE package_json IS NOT NULL
GROUP BY dep.key
ORDER BY count(*) DESC

This query will show the most frequently used dependencies within a GitHub org (as extracted from the dependencies JSON value of package.json files). Here are the top 10 results from the above query, on freeCodeCamp:

  1. dotenv (35)
  2. express (31)
  3. body-parser (26)
  4. react-dom (16)
  5. react (16)
  6. lodash (16)
  7. cors (14)
  8. mongoose (12)
  9. chai (12)
  10. passport (11)
Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share πŸŽ‰.

Patrick DeVivo

We had a user looking for a way to track merged GitHub pull requests that were unapproved (when merged into a main branch). This is typically done with an "Admin Override" when merging a PR, or from loose branch protection rules allowing for unapproved merges into a branch.

GitHub PR admin override

Allowing repo administrators to merge without a review can be a useful shortcut for "emergency" hotfixes that avoid a potentially time-consuming code review cycle. This of course depends on the habits and culture of your engineering organization.

Too many unapproved merges, however, is probably a sign of something wrong (again, depending on the organization and codebase). For instance, a high volume of "emergency fixes" shipping without any review, or developers simply short-circuiting the review cycle to get code merged/deployed quickly.

Surfacing Unapproved, Merged PRs​

The following MergeStat query will list all unapproved pull requests merged into the main branch.

SELECT
number,
title,
date(created_at) AS created_at,
date(merged_at) AS merged_at,
author_login,
review_decision,
merged
FROM github_prs('mergestat', 'mergestat') -- set to your own repo
WHERE
merged = true
AND review_decision <> 'APPROVED'
AND base_ref_name = 'main' -- set to your own branch
ORDER BY created_at DESC

This will yield a table that looks something like (in our web app):

Unapproved PRs in mergestat/mergestat

If you'd like to know which GitHub users most frequently merge without an approval, you can run the following:

SELECT
count(*), merged_by
FROM github_prs('mergestat', 'mergestat')
WHERE
merged = true
AND review_decision <> 'APPROVED'
AND base_ref_name = 'main'
GROUP BY merged_by
ORDER BY count(*) DESC

which will display an ordered list of the most common "unapproved mergers" in a codebase.

Additional Ideas​

  1. Filter by created_at or merged_at if you only care about recently merged PRs
  2. Query for unapproved (and merged) PRs across an entire GitHub org (rather than a single repo)
  3. Create a Slack/Discord bot that alerts when an unnapproved PR is merged, or shares a monthly report
  4. Track in a dashboard (a line chart) to record instances over time
Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share πŸŽ‰.

Patrick DeVivo

It's a new year, and a post listing the top 30 committers to the Blender project in 2021 was on the front-page of HN this morning.

Below is the list of top committers to the Blender project in 2021. The amount of commits obviously doesn’t mean much, but it’s a neutral metric to put limelight on people who made Blender possible last year.

As the post says, counting commits doesn't mean much alone, but it's a decent proxy for relative activity within a codebase. As a maintainer of the MergeStat project, a SQL interface to data in git, I wanted to show how to make a similar list for any repo, and take it a step further with some additional metrics.

info

Take a look at our CLI guide to run the following queries yourself, on any repo.

Top 30 Contributors by Commit Count​

Replicating the list in the original blog post can be done with the following query:

SELECT
count(*), author_name, author_email
FROM commits
WHERE parents < 2 -- ignore merge commits
AND strftime('%Y', author_when) = '2021' -- commits only authored in 2021
GROUP BY author_name, author_email
ORDER BY count(*) DESC
LIMIT 30

This yields a list of names (ordered and with counts) very similar to what's in the original post πŸ™‚.

Adding Commit Stats​

The stats table allows us to also look at the number of files modified in each commit as well as LOC added or removed. We can join the stats table with commits to measure:

  1. Total lines added by a contributor
  2. Total lines removed by a contributor
  3. Distinct files modified by a contributor

In addition to just counting the number of commits.

Blender 2021 Author Stats

SELECT
count(DISTINCT hash) AS commits,
sum(additions),
sum(deletions),
count(DISTINCT file_path),
author_name, author_email
FROM commits, stats('', commits.hash)
WHERE parents < 2 -- ignore merge commits
AND strftime('%Y', author_when) = '2021' -- commits only authored in 2021
GROUP BY author_name, author_email
ORDER BY count(*) DESC
LIMIT 30

Similar to counting commits, LOC added or removed doesn't mean much indepedently. Files modified (distinct files changed by an author over the year) is a bit more interesting - depending on the size/nature of a codebase, it could be a measure of how "deeply" involved a particular contributor is - i.e. someone who contributes to all aspects of the project, not just a subset of it.

Making Some Graphs​

Finally, by adding the --format csv flag to the mergestat command when executing the above queries, we can copy the output into a spreadsheet to make some charts.

Commit Count by Author

LOC Added/Removed by Author

Distinct Files Modified by Author

These queries should be portable to your own repositories (or any other). Try them out in our web app or with our CLI.

Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share πŸŽ‰.