10 posts tagged with "github"

Welcoming Bitbucket and GitLab Repos 🎉

March 15, 2023

Founder & CEO

GitHub often feels like the center of the developer universe, but a lot of developers really call Bitbucket or GitLab home. Today we're very excited to announce that the latest releases of MergeStat support Bitbucket and GitLab git repos!

Screenshot of our Git Source UI

This comes hot on the heels of a refactor to how MergeStat handles Git Sources generally, and we're really excited to now support hosts other than GitHub!

This has been one of our more commonly requested features, and we're already beginning to work with users who primarily use Bitbucket and GitLab.

Bitbucket and GitLab repo auto imports are available today as well. 🚀

Join our Slack

If you have questions, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Querying GitHub Security Advisories With SQL 🛡️

January 24, 2023

Patrick DeVivo

Founder & CEO

Did you know that GitHub maintains a public database of known CVEs and security advisories for open-source codebases? The database is a public Git repository that holds JSON files in OSV format, partitioned by date. This is the data that's displayed on the github.com/advisories page, which also powers Dependabot alerts!

Since it's just a Git repo, we wanted to take it for a spin with MergeStat to see what we could learn applying some SQL to its contents.

Setup

We followed the getting started instructions to get a new local instance of MergeStat running with docker-compose. Since this is a public GitHub repo (and we're looking at the data brought in by the GIT_FILES syncer), we don't need to configure any authentication. We can bring the repo in directly and sync the file contents.

We started by establishing a view over the "raw" file contents of the repo, which extracts some values from the underlying JSON files and parses their file paths:

-- View of all GitHub Advisories in the repo
CREATE OR REPLACE VIEW github_advisories AS (
    SELECT
        path,
        split_part(path, '/', 2) as reviewed,
        split_part(path, '/', 3) as year,
        split_part(path, '/', 4) as month,
        split_part(path, '/', 5) as id,
        contents::jsonb->>'summary' as summary,
        contents::jsonb->'database_specific'->>'severity' as severity,
        -- OSV "published" timestamp, used by the over-time queries below
        (contents::jsonb->>'published')::timestamptz as published,
        -- ->> returns text, so these columns compare cleanly against string literals
        a.value->'package'->>'name' as affected_name,
        a.value->'package'->>'ecosystem' as affected_ecosystem
    FROM git_files LEFT JOIN LATERAL jsonb_array_elements(contents::jsonb->'affected') a ON true
    WHERE path LIKE 'advisories/%.json'
)

This allows us to run much cleaner queries, such as:

SELECT * FROM github_advisories

without needing to carry over the parsing and JSON logic in every query we run.
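As a rough illustration of what the view does for each file, here is a small Python sketch that mirrors the path splitting and JSON extraction. The function name `advisory_rows`, the sample path, and the sample document are made up for this example; only the column names come from the view.

```python
import json

def advisory_rows(path, contents):
    """Yield one dict per affected package, mirroring the view's columns."""
    # split_part(path, '/', n) is 1-indexed; Python lists are 0-indexed
    parts = path.split("/")
    doc = json.loads(contents)
    base = {
        "path": path,
        "reviewed": parts[1],
        "year": parts[2],
        "month": parts[3],
        "id": parts[4],
        "summary": doc.get("summary"),
        "severity": doc.get("database_specific", {}).get("severity"),
    }
    # LEFT JOIN LATERAL ... ON true keeps advisories with no affected packages,
    # so fall back to a single empty entry in that case
    affected = doc.get("affected") or [None]
    for entry in affected:
        package = (entry or {}).get("package", {})
        yield {
            **base,
            "affected_name": package.get("name"),
            "affected_ecosystem": package.get("ecosystem"),
        }

# Hypothetical sample, shaped like an OSV file in the advisory repo
sample_path = "advisories/github-reviewed/2022/05/GHSA-xxxx-xxxx-xxxx/GHSA-xxxx-xxxx-xxxx.json"
sample_doc = json.dumps({
    "summary": "Example advisory",
    "database_specific": {"severity": "HIGH"},
    "affected": [{"package": {"name": "example-pkg", "ecosystem": "PyPI"}}],
})
rows = list(advisory_rows(sample_path, sample_doc))
```

An advisory that lists several affected packages produces several rows, which is worth keeping in mind when counting advisories rather than affected packages.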

How many advisories are reviewed vs unreviewed?

SELECT
    count(1) as total,
    count(1) FILTER (WHERE reviewed = 'github-reviewed') as reviewed,
    count(1) FILTER (WHERE reviewed = 'unreviewed') as unreviewed
FROM github_advisories

201,711 total advisories (at time of writing)
19,078 advisories have been reviewed by GitHub (~9% of total)
182,633 remain unreviewed

(Clearly the GitHub team has quite a bit on their plate 😀)

How many advisories exist by ecosystem?

Ecosystem here refers to the package registry the advisory was found in. Looks like PyPI has the most and Pub the least. Note that we're excluding results where an affected ecosystem is not listed.

SELECT
    affected_ecosystem,
    count(*)
FROM github_advisories
WHERE affected_ecosystem IS NOT NULL -- where the affected ecosystem is known
GROUP BY affected_ecosystem
ORDER BY count(*) DESC

How many advisories exist by ecosystem and severity?

Severity here is one of CRITICAL, HIGH, MODERATE, and LOW.

SELECT
    affected_ecosystem,
    severity,
    count(*)
FROM github_advisories
WHERE affected_ecosystem IS NOT NULL -- where the affected ecosystem is known
GROUP BY affected_ecosystem, severity
ORDER BY count(*) DESC

Some ecosystems have fewer advisories overall, but a higher percentage of those advisories are CRITICAL.
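To put a number on that observation, the counts can be turned into a per-ecosystem share of CRITICAL advisories. The sketch below does this in Python on made-up rows (the ecosystems "A" and "B" and their counts are placeholders, not real advisory data); `critical_share` is a name chosen for this example.

```python
from collections import Counter

def critical_share(rows):
    """Return {ecosystem: fraction of its advisories that are CRITICAL}."""
    totals, criticals = Counter(), Counter()
    for ecosystem, severity in rows:
        totals[ecosystem] += 1
        if severity == "CRITICAL":
            criticals[ecosystem] += 1
    return {eco: criticals[eco] / totals[eco] for eco in totals}

# Made-up (affected_ecosystem, severity) rows, as the query above would return
# them before aggregation
sample = [
    ("A", "CRITICAL"), ("A", "LOW"), ("A", "HIGH"), ("A", "MODERATE"),
    ("B", "CRITICAL"), ("B", "HIGH"),
]
shares = critical_share(sample)
```

Here ecosystem "B" has fewer advisories than "A" but a larger critical share, which is the pattern described above.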

How many advisories have been published over time?

SELECT
    extract(year from published) || '-Q' || extract(quarter from published) as quarter,
    count(*)
FROM github_advisories
GROUP BY extract(year from published), extract(quarter from published)
ORDER BY extract(year from published) ASC, extract(quarter from published) ASC

Looks like a ton of advisories were published in Q2 2022, making it hard to see overall trends.
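For anyone post-processing the data outside of SQL, the quarter label used in the query can be derived the same way in Python. This is a minimal sketch, assuming the published timestamp is an ISO 8601 string; `quarter_label` is a hypothetical helper name.

```python
from datetime import datetime

def quarter_label(published):
    """Format an ISO 8601 timestamp as 'YYYY-Qn'."""
    # Older Python versions do not accept a trailing "Z", so normalize it first
    ts = datetime.fromisoformat(published.replace("Z", "+00:00"))
    quarter = (ts.month - 1) // 3 + 1
    return f"{ts.year}-Q{quarter}"

label = quarter_label("2022-05-24T17:38:26Z")
```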

What's been the makeup of advisory severity over time?

SELECT
    extract(year from published) || '-Q' || extract(quarter from published) as quarter,
    severity,
    count(*)
FROM github_advisories
GROUP BY extract(year from published), extract(quarter from published), severity
ORDER BY extract(year from published) ASC, extract(quarter from published) ASC

With 100% stacking

What's been the makeup of advisory ecosystem over time?

SELECT
    extract(year from published) || '-Q' || extract(quarter from published) as quarter,
    affected_ecosystem,
    count(*)
FROM github_advisories
WHERE affected_ecosystem IS NOT NULL -- where the affected ecosystem is known
GROUP BY extract(year from published), extract(quarter from published), affected_ecosystem
ORDER BY extract(year from published) ASC, extract(quarter from published) ASC

(Where the ecosystem is known).

What packages have the most advisories?

SELECT affected_name, affected_ecosystem, count(*)
FROM github_advisories
WHERE affected_ecosystem IS NOT NULL
GROUP BY affected_name, affected_ecosystem
ORDER BY count(*) DESC

Looks like there's a lot going on in tensorflow!

What packages have the most advisories, by ecosystem?

SELECT affected_name, affected_ecosystem, count(*)
FROM github_advisories
WHERE affected_ecosystem IS NOT NULL AND affected_ecosystem = 'REPLACE_WITH_ECOSYSTEM'
GROUP BY affected_name, affected_ecosystem
ORDER BY count(*) DESC
LIMIT 10

`PyPI`

`Maven`

`npm`

`Packagist`

`NuGet`

`Go`

`RubyGems`

`crates.io`

Next Steps

This has been a high level view of all the advisories found in a specific GitHub-maintained public database. To examine the advisories that impact your organization or team, you'd want to join the data in this repository with information about what packages your code brings in - potentially by using SBOMs or even parsing package manifests.
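A minimal sketch of that join, in Python, might look like the following. It assumes you already have a list of (ecosystem, package name) pairs from an SBOM or manifest and a list of advisory rows from the view. The function name, the sample data, and the case-insensitive name matching are all assumptions made for illustration, and version ranges are ignored entirely.

```python
def affected_dependencies(dependencies, advisories):
    """Return advisories whose (ecosystem, package name) matches a declared dependency."""
    declared = {(ecosystem, name.lower()) for ecosystem, name in dependencies}
    return [
        adv for adv in advisories
        if (adv["affected_ecosystem"], (adv["affected_name"] or "").lower()) in declared
    ]

# Hypothetical inputs: dependencies parsed from a manifest or SBOM,
# advisories as rows selected from the github_advisories view
dependencies = [("npm", "left-pad-example"), ("PyPI", "Example-Lib")]
advisories = [
    {"id": "GHSA-aaaa", "affected_ecosystem": "PyPI", "affected_name": "example-lib"},
    {"id": "GHSA-bbbb", "affected_ecosystem": "npm", "affected_name": "unrelated"},
    {"id": "GHSA-cccc", "affected_ecosystem": None, "affected_name": None},
]
matches = affected_dependencies(dependencies, advisories)
```

A real implementation would also need to compare the installed version against each advisory's affected ranges before raising an alert.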

Tools such as Dependabot already do this for the GitHub ecosystem, and automate the maintenance of dependencies as well. There are, however, other databases and ecosystems emerging that track advisories and vulnerabilities. MergeStat supports some of these tools today, and we are working to make it possible for anyone to easily bring the output of their own selection of tools in. We believe the easier it is to access data involved in key aspects of the software-supply-chain, the more secure and the more effective engineering organizations may be.

Join our Slack

Our community Slack is a great place to find help and ask questions. We're always happy to chat about MergeStat there 🎉!

Finding Long Running CI Jobs ⌛

January 19, 2023

Patrick DeVivo

Founder & CEO

Recently, MergeStat added support for collecting data from GitHub Actions. We wanted a way to start asking questions about our CI pipelines, and very soon we were able to leverage MergeStat to surface useful results.

Finding Our Most Expensive CI Pipelines

Last year we received a GitHub bill that surprised us. We had an expensive charge related to GitHub Actions minutes, even though most of our repos are public (Actions usage is free for public repos). We ended up finding a couple of experimental workflows in private repos that were the culprits of the higher-than-usual costs.

To find these, we exported our raw Actions usage data into a spreadsheet, aggregated the minutes, and ordered by the most expensive repos. This wasn't too difficult, but it made us think about how valuable it could be to always have this data on hand, with the ability to query it ad-hoc (vs needing to run an export, load a spreadsheet and run some formulas every time).

Now that we have GitHub Actions data syncing with MergeStat, we're able to do this:

-- find our longest running CI jobs
SELECT 
    EXTRACT(EPOCH FROM (public.github_actions_workflow_runs.updated_at - public.github_actions_workflow_runs.run_started_at))::integer/60 AS minutes_to_run,
    public.repos.repo,
    public.github_actions_workflow_runs.*
FROM public.github_actions_workflow_runs
INNER JOIN public.repos ON public.github_actions_workflow_runs.repo_id = public.repos.id
ORDER BY 1 DESC

Screenshot of SQL query for finding long running GitHub Actions in the MergeStat app
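The minutes_to_run expression is just the difference between two timestamps, converted to seconds and integer-divided by 60. Here is the same arithmetic as a small Python sketch, assuming whole-second ISO 8601 timestamps and a non-negative duration; the function name and sample values are made up.

```python
from datetime import datetime

def minutes_to_run(run_started_at, updated_at):
    """Whole minutes between two ISO 8601 timestamps, truncated like SQL integer division."""
    started = datetime.fromisoformat(run_started_at)
    updated = datetime.fromisoformat(updated_at)
    return int((updated - started).total_seconds()) // 60

# Hypothetical workflow run that took 12 minutes and 30 seconds
elapsed = minutes_to_run("2023-01-19T10:00:00", "2023-01-19T10:12:30")
```

Note that the query treats updated_at as the end of the run, so anything that touches the run record later would inflate the number.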

Improving DevEx by Finding Long Running Tests

In a very related use case, a partner reached out asking how to use MergeStat to query for long running tests. Long running tests can be a significant bottleneck in a software-delivery pipeline, especially at scale. This impacts developer experience and the rate at which teams can deliver value to production.

We were able to tweak the query above to answer this question. In short, we filtered for workflow runs whose name includes the string test (case-insensitively), which was good enough for the situation at hand:

SELECT 
    EXTRACT(EPOCH FROM (public.github_actions_workflow_runs.updated_at - public.github_actions_workflow_runs.run_started_at))::integer/60 AS minutes_to_run,
    public.repos.repo,
    public.github_actions_workflow_runs.*
FROM public.github_actions_workflow_runs
INNER JOIN public.repos ON public.github_actions_workflow_runs.repo_id = public.repos.id
WHERE public.github_actions_workflow_runs.name ilike '%test%'
ORDER BY 1 DESC
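The same filter-and-sort step can be sketched in Python for rows that have already been exported. The `runs` data and the `slowest_test_runs` helper are hypothetical; the case-insensitive substring match is meant to behave like ilike '%test%', including skipping rows whose name is missing.

```python
def slowest_test_runs(runs, needle="test", limit=5):
    """Keep runs whose name contains `needle` (case-insensitive), slowest first."""
    matching = [r for r in runs if needle.lower() in (r["name"] or "").lower()]
    return sorted(matching, key=lambda r: r["minutes_to_run"], reverse=True)[:limit]

# Made-up workflow runs
runs = [
    {"name": "Integration Tests", "minutes_to_run": 42},
    {"name": "lint", "minutes_to_run": 3},
    {"name": "unit-test", "minutes_to_run": 7},
    {"name": None, "minutes_to_run": 90},
]
slowest = slowest_test_runs(runs)
```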

Next Up

These two examples hopefully convey the value of having this type of CI/CD data on hand and accessible with SQL. Since this data can sit next to pull requests, commits, code, issues and more, we can augment these queries with additional dimensions to get deeper insights into engineering operations overall.

In addition, we're working on some exciting features that will make accessing and using this data much easier. In particular, saved queries, charts, dashboards, reports and alerts will make it possible to use this data more effectively.

Visualize which repos or teams are using the most CI minutes.
Send a Slack alert when a CI run takes longer than expected (configurable).
Track improvements in CI runtimes over different periods and teams, and use them to celebrate or communicate up.

Regardless, MergeStat's mission continues to be in service of data agility for software engineering. Check us out!

Join our Slack

Our community Slack is a great place to find help and ask questions. We're always happy to chat about MergeStat there 🎉!

MergeStat, now with some PostgreSQL 🐘 🎉

October 14, 2022

Patrick DeVivo

Founder & CEO

Today we're very happy to announce a new flavor of MergeStat! If you've been following our progress over the last year, you'll know that our mission is to enable operational analytics for software engineering teams. You'll also know that our approach has been heavily based on SQLite (and its virtual table mechanism) to bring data from git repositories (source code, metadata, the GitHub API, etc) into a SQL context for flexible querying and analytics.

This has enabled initial use cases in many interesting domains related to the software-development-lifecycle, from audit and compliance, to code querying, to release time metrics, to technical debt surfacing, and much, much more.

For the past several months, we’ve been working on an approach in service of this mission that uses a different mechanism for enabling that SQL context. That mechanism is based on syncing data into a Postgres database for downstream querying, vs using a local SQLite based approach.

This grants us two significant advantages:

More compatibility with the ecosystem of data (and BI/visualization) tools
Much faster query time for data spread across multiple API sources

More Compatibility

We love SQLite, but the fact of the matter is that it’s much easier to integrate a data visualization or BI product (open-source or not) with a Postgres server (or Postgres compatible server) than with a SQLite database file. A central part of our mission has always been not just enabling SQL, but allowing our SQL solution to play well with the wide array of existing tools that companies, teams and individuals are already using.

We want anyone to be able to query MergeStat data from tools like Metabase, Grafana, Tableau, Superset, etc. Postgres compatibility takes us a step in that direction, and we’re already working with early users integrating their SDLC data with these types of tools, via MergeStat.

Faster Queries

Our SQLite virtual-table based approach is an effective way to query a local data source (such as a git repository on disk). However, it begins to fall short when it comes to querying over large sets of data spread across many pages in a web API (or multiple APIs). What ends up happening is queries spend much more time waiting for HTTP requests to finish than actually executing SQL. This means at query time, the user is forced to wait for data collection to occur. Depending on the scope of the data involved, this can potentially take a very long time. With our new approach, data is synced and maintained in a background process, so that data collection (from potentially slow sources) occurs out-of-band from user queries.

What’s Next?

Our SQLite approach isn’t going anywhere. We still intend to invest in that and continue building it into a valuable open-source CLI that anyone can use. It’s been renamed MergeStat Lite and now lives at mergestat/mergestat-lite.

Our new approach now lives at mergestat/mergestat. Our documentation has been updated to reflect these changes, and we couldn’t be more excited to share more in the coming days and weeks.

We have some great features planned, and will be spending more time showcasing various use-cases from our growing community.

Join our Slack

As usual, our community Slack is a great place to find help and ask questions. We're always happy to chat about MergeStat there 🎉!

Querying the GitHub Audit Log with SQL

June 30, 2022

Patrick DeVivo

Founder & CEO

For organizations using GitHub Enterprise, the GitHub audit log is now supported as a table in the MergeStat CLI, enabling SQL queries against audit log events 🎉. Check it out!

The new table-valued function is github_org_audit_log, and may be queried like so:

SELECT * FROM github_org_audit_log('mergestat') -- replace with your org name

and has the following columns (type is TEXT unless otherwise specified):

id
entry_type
action
actor_type
actor_login
actor_ip
actor_location (JSON)
created_at (DATETIME)
operation_type
user_login

GitHub Enterprise Only

This feature only works with GitHub Enterprise organizations, since the audit log API is only available to enterprise accounts.

This enables some interesting use-cases for GitHub enterprise users looking to monitor, alert and report on events from the GitHub audit log. We're working with some early users of this table, and will publish some guides and examples in the near future 🔜.

The JSON column (actor_location) can be accessed with the SQLite JSON operators.

SELECT actor_location->>'country' FROM github_org_audit_log('mergestat')

+----------------------------+
| ACTOR_LOCATION->>'COUNTRY' |
+----------------------------+
| United States              |
+----------------------------+
| United States              |
+----------------------------+
| Belgium                    |
+----------------------------+
| ...                        |
+----------------------------+
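As a sketch of the kind of reporting this enables, the following Python snippet counts events per country using SQLite's json_extract function, which works on SQLite builds that predate the -> and ->> operators. It runs against a stand-in table with made-up rows rather than the real github_org_audit_log function, and assumes the SQLite build bundled with Python includes the JSON functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the audit log table, with only the columns this example needs
conn.execute("CREATE TABLE audit_log (action TEXT, actor_location TEXT)")
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?)",
    [
        ("repo.create", '{"country": "United States"}'),
        ("team.add_member", '{"country": "United States"}'),
        ("repo.destroy", '{"country": "Belgium"}'),
    ],
)
country_counts = conn.execute(
    """
    SELECT json_extract(actor_location, '$.country') AS country, count(*) AS events
    FROM audit_log
    GROUP BY country
    ORDER BY events DESC, country ASC
    """
).fetchall()
```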

It's important to note the following from the GitHub API docs as well:

Join our Slack

Our community Slack is a great place to find help and ask questions. We're always happy to chat about MergeStat there 🎉!

SQLite JSON Functions in MergeStat ⏩

March 8, 2022

Patrick DeVivo

Founder & CEO

SQLite 3.38.0 (released 2022-02-22) added support for built-in JSON functions (no longer a compile-time option) and two new JSON operators, -> and ->>, similar to ones available in MySQL and PostgreSQL.

MergeStat is a SQL interface to data in git repositories, based on SQLite. As such, we can now take advantage of these new operators in our queries, wherever JSON data is returned.

note

To install the MergeStat CLI via homebrew, run:

brew tap mergestat/mergestat
brew install mergestat

Or see other options

package.json

An easy source of JSON to start with is the package.json file present in most JavaScript/TypeScript codebases. Let's take a look at the one in the popular tailwindlabs/tailwindcss repo.

SELECT contents->>'name', contents->>'license' FROM files WHERE path = 'package.json'

+-------------------+----------------------+
| CONTENTS->>'NAME' | CONTENTS->>'LICENSE' |
+-------------------+----------------------+
| tailwindcss       | MIT                  |
+-------------------+----------------------+

Here we use contents->>'name' and contents->>'license' to pull out the name and license fields from the package.json JSON file.

Let's take this up a level, and look at the name and license fields for all repos in an org where a package.json file is present. Sticking with tailwindlabs (and now adding a GITHUB_TOKEN when we execute the following):

-- CTE (https://www.sqlite.org/lang_with.html)
-- to select all the package.json files from repos in the tailwindlabs GitHub org
WITH package_json_files AS (
    SELECT
        name,
        github_repo_file_content('tailwindlabs', repo.name, 'package.json') AS package_json
    FROM github_org_repos('tailwindlabs') repo
)
SELECT
    name as repo_name,
    package_json->>'name',
    package_json->>'license'
FROM package_json_files
WHERE package_json IS NOT NULL

note

This query may take some time to run, as it makes a fair number of GitHub API requests.

+-----------------------------+-----------------------------+--------------------------+
| REPO_NAME                   | PACKAGE_JSON->>'NAME'       | PACKAGE_JSON->>'LICENSE' |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss                 | tailwindcss                 | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| webpack-starter             | NULL                        | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss.com             | NULL                        | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-plugin-examples | NULL                        | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-intellisense    | root                        | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-playground      | NULL                        | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-custom-forms    | @tailwindcss/custom-forms   | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-typography      | @tailwindcss/typography     | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| heroicons                   | heroicons                   | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| tailwindui-vue              | @tailwindui/vue             | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| blog.tailwindcss.com        | tailwind-blog               | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| tailwindui-react            | @tailwindui/react           | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| heroicons.com               | NULL                        | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| play.tailwindcss.com        | play.tailwindcss.com        | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| headlessui                  | headlessui                  | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| tailwind-play-api           | NULL                        | NULL                     |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-aspect-ratio    | @tailwindcss/aspect-ratio   | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-forms           | @tailwindcss/forms          | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-line-clamp      | @tailwindcss/line-clamp     | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| tailwindcss-jit             | @tailwindcss/jit            | MIT                      |
+-----------------------------+-----------------------------+--------------------------+
| prettier-plugin-tailwindcss | prettier-plugin-tailwindcss | NULL                     |
+-----------------------------+-----------------------------+--------------------------+

We see that most repos with a package.json file also have a name, while some do not (NULL). All the declared licenses appear to be MIT.

Dependencies

Finally, let's take a look at all the dependencies declared in package.json files across (JavaScript/TypeScript) codebases in an org. This time, let's try it on the freeCodeCamp GitHub org.

WITH package_json_files AS (
    SELECT
        name,
        github_repo_file_content('freeCodeCamp', repo.name, 'package.json') AS package_json
    FROM github_org_repos('freeCodeCamp') repo
    WHERE (primary_language = 'TypeScript' OR primary_language = 'JavaScript')
)
SELECT
    count(*),
    json_group_array(name) AS repo_names,
    dep.key
FROM package_json_files, json_each(package_json->'dependencies') dep
WHERE package_json IS NOT NULL
GROUP BY dep.key
ORDER BY count(*) DESC

This query will show the most frequently used dependencies within a GitHub org (as extracted from the dependencies JSON value of package.json files). Here are the top 10 results from the above query, on freeCodeCamp:

dotenv (35)
express (31)
body-parser (26)
react-dom (16)
react (16)
lodash (16)
cors (14)
mongoose (12)
chai (12)
passport (11)
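The aggregation the query performs (expand each dependencies object into its keys, then count repos per key) can be sketched in plain Python as well. The repo names and package.json contents below are invented for illustration, and `dependency_counts` is a hypothetical helper.

```python
import json
from collections import Counter

def dependency_counts(package_json_by_repo):
    """Count how many repos declare each dependency in package.json."""
    counts = Counter()
    for repo, raw in package_json_by_repo.items():
        if raw is None:  # repo has no package.json
            continue
        deps = json.loads(raw).get("dependencies") or {}
        counts.update(deps.keys())
    return counts

# Made-up repos and manifests
manifests = {
    "repo-a": json.dumps({"dependencies": {"express": "^4.0.0", "dotenv": "^10.0.0"}}),
    "repo-b": json.dumps({"dependencies": {"express": "^4.0.0"}}),
    "repo-c": json.dumps({"name": "no-deps"}),
    "repo-d": None,
}
counts = dependency_counts(manifests)
```

Only the dependencies key is counted here, as in the query; devDependencies and peerDependencies would need the same treatment if they matter to you.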

Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share 🎉.

Finding Unapproved, Merged PRs

January 18, 2022

Patrick DeVivo

Founder & CEO

We had a user looking for a way to track merged GitHub pull requests that were unapproved (when merged into a main branch). This is typically done with an "Admin Override" when merging a PR, or from loose branch protection rules allowing for unapproved merges into a branch.

GitHub PR admin override

Allowing repo administrators to merge without a review can be a useful shortcut for "emergency" hotfixes that avoid a potentially time-consuming code review cycle. This of course depends on the habits and culture of your engineering organization.

Too many unapproved merges, however, are probably a sign of something wrong (again, depending on the organization and codebase). For instance, a high volume of "emergency fixes" shipping without any review, or developers simply short-circuiting the review cycle to get code merged/deployed quickly.

Surfacing Unapproved, Merged PRs

The following MergeStat query will list all unapproved pull requests merged into the main branch.

SELECT
    number,
    title,
    date(created_at) AS created_at,
    date(merged_at) AS merged_at, 
    author_login,
    review_decision,
    merged
FROM github_prs('mergestat', 'mergestat') -- set to your own repo
WHERE
    merged = true
    AND review_decision <> 'APPROVED'
    AND base_ref_name = 'main' -- set to your own branch
ORDER BY created_at DESC

This will yield a table that looks something like (in our web app):

Unapproved PRs in mergestat/mergestat

If you'd like to know which GitHub users most frequently merge without an approval, you can run the following:

SELECT
    count(*), merged_by
FROM github_prs('mergestat', 'mergestat')
WHERE
    merged = true
    AND review_decision <> 'APPROVED'
    AND base_ref_name = 'main'
GROUP BY merged_by
ORDER BY count(*) DESC

which will display an ordered list of the most common "unapproved mergers" in a codebase.
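One thing to watch for with the review_decision <> 'APPROVED' filter: in SQL, a comparison against NULL is never true, so a merged PR with no review decision at all would be silently dropped. Whether github_prs reports such PRs as NULL or as an empty string is not something this post states, so treat the following as a sketch of the pitfall on a stand-in table with made-up rows, not as a description of MergeStat's behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in table with only the columns needed for the comparison
conn.execute("CREATE TABLE prs (number INTEGER, merged INTEGER, review_decision TEXT)")
conn.executemany(
    "INSERT INTO prs VALUES (?, ?, ?)",
    [
        (1, 1, "APPROVED"),
        (2, 1, "REVIEW_REQUIRED"),
        (3, 1, None),  # merged with no review decision recorded
    ],
)
# <> skips the NULL row because NULL <> 'APPROVED' evaluates to NULL, not true
strict = [row[0] for row in conn.execute(
    "SELECT number FROM prs WHERE merged = 1 AND review_decision <> 'APPROVED' ORDER BY number"
)]
# SQLite's IS NOT treats NULL as an ordinary value, so the NULL row is kept
null_safe = [row[0] for row in conn.execute(
    "SELECT number FROM prs WHERE merged = 1 AND review_decision IS NOT 'APPROVED' ORDER BY number"
)]
```

If your results look suspiciously short, swapping <> for IS NOT is a cheap way to check whether NULL values are the cause.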

Additional Ideas

Filter by created_at or merged_at if you only care about recently merged PRs
Query for unapproved (and merged) PRs across an entire GitHub org (rather than a single repo)
Create a Slack/Discord bot that alerts when an unapproved PR is merged, or shares a monthly report
Track in a dashboard (a line chart) to record instances over time

Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share 🎉.

A Year of Commits 🎊 📊

January 1, 2022

Patrick DeVivo

Founder & CEO

It's a new year, and a post listing the top 30 committers to the Blender project in 2021 was on the front-page of HN this morning.

Below is the list of top committers to the Blender project in 2021. The amount of commits obviously doesn’t mean much, but it’s a neutral metric to put limelight on people who made Blender possible last year.

As the post says, counting commits doesn't mean much alone, but it's a decent proxy for relative activity within a codebase. As a maintainer of the MergeStat project, a SQL interface to data in git, I wanted to show how to make a similar list for any repo, and take it a step further with some additional metrics.

info

Take a look at our CLI guide to run the following queries yourself, on any repo.

Top 30 Contributors by Commit Count

Replicating the list in the original blog post can be done with the following query:

SELECT
    count(*), author_name, author_email
FROM commits
WHERE parents < 2 -- ignore merge commits
AND strftime('%Y', author_when) = '2021' -- commits only authored in 2021
GROUP BY author_name, author_email
ORDER BY count(*) DESC
LIMIT 30

This yields a list of names (ordered and with counts) very similar to what's in the original post 🙂.

Adding Commit Stats

The stats table allows us to also look at the number of files modified in each commit as well as LOC added or removed. We can join the stats table with commits to measure:

Total lines added by a contributor
Total lines removed by a contributor
Distinct files modified by a contributor

This is in addition to just counting the number of commits.

Blender 2021 Author Stats

SELECT
    count(DISTINCT hash) AS commits,
    sum(additions),
    sum(deletions),
    count(DISTINCT file_path),
    author_name, author_email
FROM commits, stats('', commits.hash)
WHERE parents < 2 -- ignore merge commits
AND strftime('%Y', author_when) = '2021' -- commits only authored in 2021
GROUP BY author_name, author_email
ORDER BY commits DESC -- count(*) would count rows per file changed after the join, not commits
LIMIT 30

Similar to counting commits, LOC added or removed doesn't mean much independently. Files modified (distinct files changed by an author over the year) is a bit more interesting - depending on the size/nature of a codebase, it could be a measure of how "deeply" involved a particular contributor is - i.e. someone who contributes to all aspects of the project, not just a subset of it.
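Because joining commits with stats yields one row per file changed in each commit, the DISTINCT counts in the query matter. The Python sketch below reproduces the aggregation on a few invented rows to make that explicit; `author_stats` and the row layout are assumptions for this example.

```python
from collections import defaultdict

def author_stats(rows):
    """Aggregate (author, commit hash, file path, additions, deletions) rows per author."""
    acc = defaultdict(lambda: {"commits": set(), "files": set(), "additions": 0, "deletions": 0})
    for author, commit, file_path, additions, deletions in rows:
        entry = acc[author]
        entry["commits"].add(commit)   # sets deduplicate, like count(DISTINCT hash)
        entry["files"].add(file_path)  # like count(DISTINCT file_path)
        entry["additions"] += additions
        entry["deletions"] += deletions
    return {
        author: {
            "commits": len(e["commits"]),
            "files": len(e["files"]),
            "additions": e["additions"],
            "deletions": e["deletions"],
        }
        for author, e in acc.items()
    }

# Made-up rows: one per file touched in each commit
rows = [
    ("alice", "c1", "a.py", 10, 2),
    ("alice", "c1", "b.py", 5, 0),
    ("alice", "c2", "a.py", 1, 1),
    ("bob", "c3", "a.py", 3, 3),
]
stats = author_stats(rows)
```

In this sample alice has three joined rows but only two commits and two distinct files, which is why a plain row count would overstate her commit count.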

Making Some Graphs

Finally, by adding the --format csv flag to the mergestat command when executing the above queries, we can copy the output into a spreadsheet to make some charts.

Commit Count by Author

LOC Added/Removed by Author

Distinct Files Modified by Author

These queries should be portable to your own repositories (or any other). Try them out in our web app or with our CLI.

Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share 🎉.

Summarizing A Week of PRs 📅

December 8, 2021

Patrick DeVivo

Founder & CEO

When a project has more than a handful of regular contributors, it can be difficult to see a high level view of all the changes happening. I've felt this recently, even in a codebase that went from just 1 to 3 contributors - there's a lot going on every day, and cutting through the noise to understand the cadence of outward value is difficult.

I've found that using MergeStat to summarize progress has been a useful way to achieve aspects of this. For instance, simply knowing what PRs have merged in the past 7 days has been a valuable summary for us:

-- show me PRs that merged in the past 7 days
SELECT
   base_repository_name,
   title,
   number,
   url,
   author_login,
   created_at,
   merged_at
FROM (
   SELECT * FROM github_prs('mergestat/mergestat') UNION ALL -- replace with your repo
   SELECT * FROM github_prs('mergestat/another-repo') -- can be run across multiple repositories
)
WHERE
merged_at > date('now', '-7 days') -- replace with the time period you care about
AND merged = true

Pull Requests in this context are a sensible "unit of work" (for us) - whereas individual commits or files changed would be far too noisy. This query can be modified to your needs and you can run it with our CLI or in the web app. It uses our GitHub API integration (and will need an auth token).

We run this query in a weekly GitHub action that reports the results to a Slack channel. It's been a cool way to automate a summary of work done across several of our key repositories.
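The reporting step could look something like the following Python sketch, which takes PR rows and formats a plain-text summary. The message format, the `weekly_summary` helper, and the sample PRs are all invented here; actually posting the text to Slack is left out.

```python
from datetime import datetime, timedelta, timezone

def weekly_summary(prs, now, days=7):
    """Build a text summary of PRs merged within the last `days` days."""
    cutoff = now - timedelta(days=days)
    recent = [p for p in prs if p["merged_at"] is not None and p["merged_at"] > cutoff]
    recent.sort(key=lambda p: p["merged_at"], reverse=True)  # newest first
    lines = [f"{len(recent)} PRs merged in the past {days} days:"]
    lines += [f"- #{p['number']} {p['title']} (@{p['author_login']})" for p in recent]
    return "\n".join(lines)

# Made-up PR rows; merged_at is None for PRs that are still open
utc = timezone.utc
prs = [
    {"number": 1, "title": "Old change", "author_login": "alice",
     "merged_at": datetime(2021, 11, 20, tzinfo=utc)},
    {"number": 2, "title": "Fix sync", "author_login": "bob",
     "merged_at": datetime(2021, 12, 6, tzinfo=utc)},
    {"number": 3, "title": "Open PR", "author_login": "carol", "merged_at": None},
    {"number": 4, "title": "Add table", "author_login": "alice",
     "merged_at": datetime(2021, 12, 7, tzinfo=utc)},
]
summary = weekly_summary(prs, now=datetime(2021, 12, 8, tzinfo=utc))
```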

Join our Slack

We've recently launched a community Slack - feel free to stop in if you have questions or anything to share 🎉.

Finding Stale Issues and PRs ⌛

December 2, 2021

Patrick DeVivo

Founder & CEO

Unanswered issues and pull requests on GitHub repos are no fun, especially if you're the one asking for help 🙂. I've seen many projects making use of automation to close (or comment on) them, though folks have argued against that.

Regardless, identifying unanswered issues can be valuable, especially for maintainers to ensure they're addressing user concerns - or even for anyone looking for places to help.

You can use the MergeStat GitHub tables to identify stale issues or PRs in a repo (or even across many repos).

info

MergeStat is an open-source tool for running SQL queries against git repositories and related data sources (like the GitHub API). Check us out if you're not familiar.

Stale Issues

This will return the oldest 25 issues created more than 30 days ago, with no comments, that remain open.

SELECT
    title, author_login, comment_count, created_at, url
FROM github_repo_issues('uber-go/zap') -- replace with your repo
WHERE
    created_at < date('now', '-30 days') -- replace with how long you care about (https://www.sqlite.org/lang_datefunc.html)
    AND closed = 0
    AND comment_count = 0
ORDER BY created_at ASC
LIMIT 25

(see example output)

Stale PRs

SELECT
    title, author_login, comment_count, created_at, url
FROM github_repo_prs('mergestat/mergestat') -- replace with your repo
WHERE
    created_at < date('now', '-30 days') -- replace with how long you care about (https://www.sqlite.org/lang_datefunc.html)
    AND merged = 0
    AND closed = 0
    AND comment_count = 0
ORDER BY created_at ASC
LIMIT 25

note

You can run these queries either in the Public workspace or by installing the CLI. In either case you will need to supply a GitHub token for API authentication.

export GITHUB_TOKEN="my-github-token"
cat query.sql | mergestat -v

Considerations

Maybe comment_count = 0 isn't good enough to indicate "staleness" - maybe finding issues/PRs where there are only comments from the original author, or where there are no comments from maintainers is more accurate.
The above queries can be joined with github_org_repos to see stale PRs and issues across an entire org, not just in a single repo.
Maybe you could produce a dashboard (charts) to track issue staleness and alert on it - an SLA on issue response time?
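The first consideration above could be sketched as follows, in Python. The queries in this post only expose comment_count, so the per-comment author data used here is a hypothetical input that would have to come from elsewhere; the `is_stale` helper, the maintainer set, and the sample items are all made up.

```python
def is_stale(item, maintainers):
    """True when no maintainer other than the original author has commented."""
    return not any(
        c["author_login"] in maintainers and c["author_login"] != item["author_login"]
        for c in item["comments"]
    )

# Made-up issues with per-comment authors
maintainers = {"maintainer-a"}
items = [
    {"title": "Bug report", "author_login": "user-1",
     "comments": [{"author_login": "user-1"}]},
    {"title": "Question", "author_login": "user-2",
     "comments": [{"author_login": "maintainer-a"}]},
    {"title": "No replies", "author_login": "user-3", "comments": []},
]
stale_titles = [i["title"] for i in items if is_stale(i, maintainers)]
```

Under this definition an issue where only the original author keeps commenting still counts as stale, which comment_count = 0 alone would miss.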

If you're interested in exploring these use cases, feel free to come say hi on our Slack or shoot us a note on Twitter.
