20 posts tagged with "sql"

Terraform Meets SQL to Secure Cloud Infrastructure ☁️

May 11, 2023

Founder & CEO

In this post we'll show how MergeStat can be used in conjunction with tfsec, a static analysis tool for terraform, to secure cloud infrastructure 🛡️.

As a reminder, MergeStat accesses git repos, runs an analysis on them, and stores the results in SQL for downstream querying and operationalization:

Our integration with tfsec can be used to build a report in Grafana that surfaces terraform IaC issues and misconfigurations. The end result here is a TFSec Grafana dashboard that can be explored and filtered, based specifically on what you're looking to target. The dashboard looks like this:

This dashboard is available as a JSON export here and should be easy to expand on!

The Setup

To begin scanning IaC repos with MergeStat, you'll need a MergeStat instance. You can start locally by following these instructions to get a Docker Compose instance running.

You'll also want to add the ghcr.io/mergestat/sync-scan-tfsec sync to your MergeStat instance:

Once this sync has been scheduled and has run:

You should see a new view available to query: tfsec_repo_issues:

-- returns all issues found by tfsec for a given repo
SELECT * FROM tfsec_repo_issues;

And that's it! You have output from tfsec stored in your PostgreSQL database. Query it from MergeStat, check out the Grafana board, or plug in another BI/SQL tool.

Join our Slack

If you found this interesting, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Finding A Pin in a Docker Stack 🪡

April 26, 2023

Peter Freiberg

Consultant. DevSecOps. Application Security.

Guest Post

This is a guest post by Peter Freiberg, a DevSecOps and application security consultant based in Melbourne, Australia.

I am a consultant who specialises in Application Security and DevSecOps. I've built teams and capabilities for organisations around Penetration Testing (ethical hacking), Secure Code Reviews, DevSecOps, Security Test Automation, Security Development Lifecycle and training.

In this article, we will look at processes and techniques for finding Dockerfiles across git repos, looking for specific image pinning practices. We will use the lens of a DevSecOps persona assessing the current state of where we are using a #knowyourcode approach.

TLDR: Pinning Docker images is good for deterministic builds but needs to be checked regularly for security issues. Query your code for current practices to get a starting point.

We’re going to explore a few key things we can search for using MergeStat:

Where are all of our Dockerfiles?
Are we using :latest instead of a pinned version?
For our Dockerfiles (maybe golden images), when is the last time they were modified?

For the rest of this article, we’re going to refer to “Golden images”, also called gold or base images. By this, we mean an image built for your organisation, that handles patching and installing core libraries and frameworks required by a developer or team, rather than all teams doing it themselves. The idea is this practice is more time efficient, reduces redundancy and creates more consistency.

We will also assume that the MergeStat instance we're working with has the following syncs enabled and scheduled for the repos we're looking at:

As a DevSecOps person, we probably want to know where all our Dockerfiles are. Here’s a query to do just that for everything we’ve loaded into MergeStat.

SELECT
    repos.repo AS repo, path
FROM git_files
INNER JOIN repos ON git_files.repo_id = repos.id
WHERE path LIKE '%Dockerfile%'

Here are some results from public repositories that we’ve loaded into our MergeStat instance.

Screenshot of query results showing file paths of anything that looks like a Dockerfile

These results will tell us at least where we have Dockerfiles across our code bases to understand the scope of how many we have and where they are (what repos, what directories). It gives us a starting point for analysis and we can reach out to developers as we find issues or have questions.

From a deterministic build point of view, people tend to use a specific version of an image. So we can query to start understanding what the current practices are around pinning to a specific version. 📌

What are we using? (Are we using our golden images?)

If we know what our golden image naming convention is, we could start by just looking through the FROM statements in Dockerfiles and eyeball them. The following query is going to show what images we are currently using across our code bases.

WITH docker_base_images AS (
    SELECT
        repos.repo AS repo,
        path,
        (regexp_matches(contents, 'FROM (.*):(.*) AS', 'gm')) AS docker_image
    FROM git_files
    INNER JOIN repos ON git_files.repo_id = repos.id
    WHERE path LIKE '%Dockerfile%'
)
SELECT
    *,
    docker_image[1] AS image,
    docker_image[2] AS version
FROM docker_base_images

We’re going to get a list that looks like the following:

Screenshot showing query results of Dockerfile image versions

The results tells us that in this example, images are from Docker hub and not an organisation specific registry. The image column would have data like my.registry.address/repositoryname if it was using a different registry. This may actually be a policy you want to enforce, or detect variance: that you need to use your company's registry for all production images.

Let’s NOT use ‘latest’ Docker images (or at least be able to find them)

Using the :latest tag can be good from a security patching viewpoint, but may introduce compatibility challenges in libraries and frameworks. The following query searches for images pinned to the :latest tag.

WITH docker_base_images AS (
    SELECT
        repos.repo AS repo,
        path,
        (regexp_matches(contents, 'FROM (.*):(.*) AS', 'gm')) AS docker_image
    FROM git_files
    INNER JOIN repos ON git_files.repo_id = repos.id
    WHERE path LIKE '%Dockerfile%'
)
SELECT
    *,
    docker_image[1] AS image,
    docker_image[2] AS version
FROM docker_base_images
WHERE docker_image[2] = 'latest'

So, we should now have a list that looks like the following.

Repo	path	docker_image	image	version
`https://github.com/Org1/xyz`	Dockerfile	["alpine","latest"]	alpine	latest
`https://github.com/TeamZ/abc`	worker/Dockerfile	["centos","latest"]	centos	latest

Keep in mind, “latest” might not be the actual latest, it’s just a tag applied during the build process. It could even be really old if that image is no longer updated. From here, we can either reach out to the developer to understand why “latest” is being used.

Golden images need to be maintained!

I know this sounds very obvious, but here’s a real-life example though of when a process or people breaks down:

My team was an embedded DevSecOps capability in a transformation program. We started noticing that 15+ feature teams were getting vulnerabilities in their images. After a bit of digging, we discovered the issue was the golden image had not been maintained in months. We did get a bit of noise from people saying “can we just suppress the results and release, it’s the golden image and we’re adhering to the pattern”. Turns out the maintainer of the golden image was refocused on other priorities and wasn’t actually maintaining the image.

When was the last time these files were modified or updated?

So, if we are starting a program to analyse our current state, looking at the age (or last modified date) of a Dockerfile could be one way of doing this.

As of this week, there’s a new helper function in the MergeStat schema, getFilesOlderThan, which does some magic joining of tables to make things easier to query.

SELECT * 
FROM public.getFilesOlderThan('%Dockerfile', 30)
ORDER BY author_when DESC

From here, we can see the author_when column is the last time changes were made to the file, in this scenario, with the path pattern of %Dockerfile.

repo	file_path	author_when	author_name	author_email	hash
`https://github.com/XYZ/utils`	Dockerfile	2023-03-20T15:12:46.000Z	Bob Smith	bob@XYZ.com	sd8f67z8xvc7
`https://github.com/ABC/core`	svc/Dockerfile	2023-03-01T17:29:22.000Z	Alice Sharma	alice@ABC.COM	98sa7s9fd32
…	…	…	…	…	…

From here, we can either look and review this file ourself, run a checker over the top (or security scan the base image) and if we’re concerned, we’ve at least got a name and email to reach out to.

Who are you going to call? (when you want to check something)

The above query showed the last author to edit the file. That would be the first attempt at contact if we were concerned. However, if they are not available, who else would you talk to? Here’s a query to get the last five author's details (and committer) of who’s modified a given file in a repo:

SELECT
    author_email, author_name, author_when, committer_email, committer_name, repo, file_path
FROM git_commits
INNER JOIN repos ON git_commits.repo_id = repos.id 
INNER JOIN git_commit_stats ON git_commit_stats.repo_id = git_commits.repo_id AND git_commit_stats.commit_hash = git_commits.hash AND parents < 2
AND repo ILIKE 'https://github.com/OWNER/REPO'
AND file_path ILIKE 'Dockerfile'
ORDER BY author_when DESC
LIMIT 5

The result is a list of the people who’ve modified the file in question for the given repository and grab the last 5.

We can use code analytics for security and governance

So, depending on your organisation, some policies you have can be queried in a “Governance as Code” way of doing things such as:

Use pinned versions (e.g. no “latest”)
Use the approved registry to build images
Make sure our Dockerfiles are checked and updated at least every X days

As we’ve shown above, these can all be queried using MergeStat and SQL. This approach can assist with identifying, remediating and managing risk. The ability to query provides a fast approach to gaining broad visibility.

Parting thoughts

Get started with some analysis and don’t let perfect be the enemy of good. I’ve spoken to a few people and heard comments like “That query will miss edge case XYZ”. Maybe, but you’ve got the 80-90% of a good starting point and can iterate on accuracy from that query!

Join our Slack

If you found this interesting, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Announcing V2: Extract & Sync Anything 🚀

April 25, 2023

Patrick DeVivo

Founder & CEO

We're thrilled to announce the release of MergeStat v2, a completely new approach to extracting and syncing data from Git repositories and into SQL! 🎉

MergeStat's mission has always been to enable anyone to ask and answer questions about anything involved in building and shipping software. Today we're taking a major step closer to that reality, by supporting container-based syncs.

That means all the data sources MergeStat supports are now defined as OCI containers, and live in a separate repo: mergestat/syncs.

By externalizing our sync implementations, adding and using new data sources is now much easier. This means working with data from open-source code scanners such as gosec can be implemented outside the main MergeStat repo!

Check it out! 🚀

For example, we recently added a new sync for GitHub Dependabot data in this new container-based approach, allowing for queries on Dependabot alerts across repos.

Join our Slack

Come join our community Slack! We're always happy to chat about MergeStat there 🎉.

Exploring Git Blame w/ SQL 🌎

March 31, 2023

Patrick DeVivo

Founder & CEO

One of MergeStat's more unusual data syncs is GIT_BLAME. It's unusual because the output of git blame is typically not viewed in aggregate, let alone with SQL 🙂.

We spent some time playing with this data, and came up with some interesting queries! We also learned a bit about the MongoDB source code running these queries (from the time of writing):

The average (mean) age of a LOC in MongoDB is 1042 days (almost 3 years!)
There are 807 distinct authors with "blameable" LOC (using author_email to measure distinctness)
The file with the oldest average (mean) LOC is debian/prerm
The file with the newest average (mean) LOC is src/mongo/db/s/README_routing_info_cache_consistency_model.md

MongoDB Logo

Check our getting started instructions to run these queries on your own data!

The GIT_BLAME sync today, runs git blame ... on every file in a repo, and stores the output in a postgres table with the following columns:

column	type
`repo_id`	`UUID NOT NULL`
`author_email`	`TEXT`
`author_name`	`TEXT`
`author_when`	`TIMESTAMP(6) WITH TIME ZONE`
`commit_hash`	`TEXT`
`line_no`	`INTEGER NOT NULL`
`line`	`TEXT`
`path`	`TEXT NOT NULL`
`_mergestat_synced_at`	`TIMESTAMP(6) WITH TIME ZONE NOT NULL`

So for every line of code in a repo, we're able to access the author information (and commit hash, which can be joined with the git_commits table) about who last modified that line.

What percent of code (by line) is each author "blameable" for?

Of all the lines of code in a repo, what percent of the code was last modified by each author:

WITH blamed_lines AS (
    SELECT * FROM git_blame
    JOIN repos ON repos.id = git_blame.repo_id
    WHERE repo = 'https://github.com/mergestat/mergestat' -- only look at one repo
)
SELECT ROUND(100.0*count(*)/(SELECT count(*) FROM blamed_lines), 2) AS percent, author_name FROM blamed_lines
GROUP BY author_name
ORDER BY count(*) DESC

12   (author #1)
02   (author #2)
51   (author #3)
47    (author #4)
47    (author #5)
23    (author #6)
25    (author #7)
92    (author #8)
59    (author #9)
...

This query tells us about how much code in a repo is directly attributable to what authors (where "attributable" means directly blameable - i.e. the author was the last to modify the line in some way). This may tell us something about how much "impact" an author has on a codebase (at least in an aggregate view).

For example, authors directly attributable to more of the current code could be considered to be more knowledgeable about the project overall. Or, if an author has a relatively low ranking on this list, their impact on the project could be considered minimal (why has none of their contribution "stuck around" over time? Is their code frequently getting rewritten by others?).

This way of viewing the data could fall apart when:

An author changes a lot of files with superficial changes (like by applying auto-fixes with a code linter). This alone would move the author's blameable line count up quite a bit, but it would be unfair to conclude that this author is extra knowledgeable about the code from that alone.
There's a lot of old code in a project from an author who's no longer an active maintainer. This person may no longer be "knowledgeable," though they may have many lines still attributable to them.
An author brings in a large amount of vendored (3rd party) code.

To account for some of these scenarios, we can tweak the above query to apply certain filters. For instance, let's limit our analysis to only certain types of files in a repo.

WITH blamed_lines AS (
    SELECT * FROM git_blame
    JOIN repos ON repos.id = git_blame.repo_id
    WHERE repo = 'https://github.com/mergestat/mergestat' -- only look at one repo
    AND path LIKE '%.go' -- only look at .go files
)
SELECT ROUND(100.0*count(*)/(SELECT count(*) FROM blamed_lines), 2) AS percent, author_name FROM blamed_lines
GROUP BY author_name
ORDER BY count(*) DESC

A query similar to the above could allow you to look only at blameable lines for certain directories as well.

What's the average (mean) age of a line of code?

How long has it been since a line of code in our repo was last modified? This may be an interesting way to measure the "staleness" of code in a project. What's the age of most of our code?

SELECT avg((extract(epoch FROM (SELECT (now() - author_when)))/86400)::int) as avg_age_days from git_blame
JOIN repos ON repos.id = git_blame.repo_id
WHERE repo = 'https://github.com/mergestat/mergestat'

What about only in certain file types?

SELECT avg((extract(epoch FROM (SELECT (now() - author_when)))/86400)::int) as avg_age_days from git_blame
JOIN repos ON repos.id = git_blame.repo_id
WHERE repo = 'https://github.com/mergestat/mergestat'
AND path LIKE '%.go' -- only look at go files

What's the average age of code, by author?

Combining the two queries above, we can look at the relative age of code by author!

SELECT count(*), avg((extract(epoch FROM (SELECT (now() - author_when)))/86400)::int) AS avg_age_days, author_name FROM git_blame
JOIN repos on repos.id = git_blame.repo_id
WHERE repo = 'https://github.com/mergestat/mergestat'
GROUP BY author_name
ORDER BY count(*) DESC

This tells us something about how "relevant" certain authors remain in a codebase. Do some authors have more recent code (on average) than others?

Join our Slack

If you found this interesting, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Welcoming Bitbucket and GitLab Repos 🎉

March 15, 2023

Patrick DeVivo

Founder & CEO

GitHub often feels like the center of the developer universe, but a lot of developers really call Bitbucket or GitLab home. Today we're very excited to announce that the latest releases of MergeStat support Bitbucket and GitLab git repos!

Screenshot of our Git Source UI

This comes hot on the tails of a refactor to how MergeStat handles Git Sources generally, and we're really excited to now support hosts other than GitHub!

This has been one of our more commonly requested features, and we're already beginning to work with users who primarily use Bitbucket and GitLab.

Bitbucket and GitLab repo auto imports are available today as well. 🚀

Bitbucket logo

GitLab logo

Join our Slack

If you have questions, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Are We Vulnerable? (e.g. Log4Shell) - Identifying Open-Source Library Risk Using MergeStat (Part 2)

March 7, 2023

Peter Freiberg

Consultant. DevSecOps. Application Security.

Guest Post

This is a guest post by Peter Freiberg, a DevSecOps and application security consultant based in Melbourne, Australia.

Previously, we looked at a few ways we can look at open-source library risk across our code. In this article, we’ll look at how we can search for new known vulnerabilities once details become available.

So, an email, article or colleague disrupts whatever you were doing and proclaims “XYZ library is vulnerable and it’s horrible!”. How do you tell if you’re vulnerable to XYZ vulnerability?

If you can query your code, you can start finding where you might be exploitable. If you’re new to MergeStat, the tldr is it's a tool to aggregate multiple Git sources, organisations and repos in a literal SQL interface (with some other magic sauce utilities too).

In this case, let’s take a look at Log4Shell. Here’s a rundown of the advisory: https://www.cisa.gov/news-events/cybersecurity-advisories/aa21-356a

Log4J Logo

One of their first steps was “Identifying assets affected by Log4Shell and other Log4j-related vulnerabilities.”

Now, assets is going to involve a lot of things that are off the shelf, but from a “knowing our code” angle there’s a few things we can do.

Our high-level process using MergeStat to identify potential locations is pretty simple:

Find all locations where we might include Log4J. In this example, we’ll look at the the package manager maven)
Search for Log4J (we can get more specific, but let’s start here)
Find the last author or committer who touched the file to assist with assessing and remediation

Find all the maven package manager files:

SELECT repo, path 
FROM git_files
INNER JOIN repos ON git_files.repo_id = repos.id
WHERE path LIKE '%pom.xml'

We’re going to get something that looks like the following

repo	path
https://github.com/ORG-A/repo1	service-discovery/blah/pom.xml
https://git.internal.xyz/TEAM-B/repoY	pom.xml
...	...

That’s pretty straight forward, but what we really want is to find all the maven files which look like they use Log4J.

SELECT repo, path 
FROM git_files
INNER JOIN repos ON git_files.repo_id = repos.id
WHERE path LIKE '%pom.xml' AND contents LIKE '%log4j%'

This will produce a similar table of results as above, this time with files that contain Log4J. Now, this is pretty broad keyword search, but if you take a look at the Log4J documentation this should catch most of the situations where Log4J appears, even if you change the package manager filename (e.g. build.gradle, build.sbt etc.) and even the Clojure build tool if that’s your thing.

Now to find out who can help us determine if we’re actually vulnerable. Hopefully we have a version number in the pom.xml or equivalent build tool. If we have to dive deeper to check deployment status then we probably need to find the last person who touched the package manager file.

Here’s a query to find all developers and committers who last modified the maven files which look like they use Log4J:

SELECT DISTINCT ON (repo, path) repo, path, author_name, author_email, author_when, committer_name, committer_email, committer_when
FROM git_files
INNER JOIN git_commit_stats ON git_files.repo_id = git_commit_stats.repo_id AND git_files.path = git_commit_stats.file_path
INNER JOIN git_commits ON git_commit_stats.repo_id = git_commits.repo_id AND git_commit_stats.commit_hash = git_commits.hash
INNER JOIN repos ON git_files.repo_id = repos.id
WHERE path LIKE '%pom.xml' AND contents LIKE '%log4j%' AND parents < 2
ORDER BY repo, path, committer_when DESC

Our results should look something like:

repo	path	author_name	author_email	author_when	committer_name	committer_email	committer_when
https://github.com/ORG-A/repo1	service/pom.xml	Bob	bob@xyz.com	2021-06-01T08:30:36Z	Alice	alice@alice.com	2021-06-01T08:30:36Z
https://int.xyz/TEAM-B/repoY	pom.xml	Josue	josue@aaa.com	2021-11-01T09:42:36Z	Patrick	pd@there.com	2021-06-01T08:30:36Z
…	…	…	…	…	…	…	…

The above query has some complexity, but this should be easy(ish) for anyone to modify, changing the path and searching on strings based on your use case.

If the above query has found people who are no longer in your organisation, then you can follow up by finding whoever was the last to author or commit to the repo:

SELECT *
FROM git_commits
INNER JOIN repos ON git_commits.repo_id = repos.id
WHERE repo = 'https://github.com/YOUR_ORG/YOUR_REPO' -- replace with your repo
ORDER BY committer_when DESC
LIMIT 1

What are the limitations of this process?

This example is only Maven, it doesn’t include other ecosystems (but could)

The above process and queries could be easily tweaked to query Gradle, Scala build tool and other JVM based languages or integrations. You could run your analysis on build.gradle or whatever your JVM package manager file is. As above take a look at the Log4J and Clojure build tool documentation for some other package managers.

It may be included in another library as a dependency (AKA Transitive Dependency)

This one is a bit trickier. If we don’t have that extracted and queryable already, we would then need to query the broader libraries and check if they use the library. However, we can at least start from a “where are all the maven, gradle, sbt etc” package manager files and run the analysis from that starting point.

We’re only looking at our code, what about 3rd party and COTS?

The above won’t help with “off the shelf” software that has something like Log4J bundled (e.g. Tomcat, or a commercial application built on top of Tomcat). Most likely, these will be discovered using other vulnerability management tools looking at virtual machines or containers.

This code may not be actually deployed

Is it test code? Proof of Concept? Decommissioned?

Who knows? At least you have a starting point with some names to start asking questions. As above, if you find the last author or committer and ask them you can then rule in or out whether it’s an issue for you.

It may be fixed in code, but not released to production

To actually confirm we’re not vulnerable across our organisation and applications, we still need to check the deployed artefact. We have the starting point of code, but can then follow up through developers we found who’ve updated those files.

Next Steps

As you can see, being able to query your code is extremely useful when looking for something which is “known bad”. We can quickly query our code bases to start the initial analysis of our posture. We’ve used Log4J here, but these types of queries could be used for Dockerfiles (e.g. some binary install through curl that’s compromised, different added packages with vulnerabilities), other language package managers (e.g. Ruby Gems, Python pip, Go mod etc).

Join our Slack

If you found this interesting, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Querying YAML in PostgreSQL with PLV8 🐘

March 4, 2023

Patrick DeVivo

Founder & CEO

MergeStat's mission is to make it possible to query anything involved in building & shipping software with SQL. Recently, we had a user looking to query YAML files across their git repos, wanting to extract and aggregate values from config files.

We thought for a bit, and came up with a solution. An odd one, but one that works surprisingly well 😃.

How it works

Ensure the PLV8 extension is installed in Postgres. This allows us to run JavaScript in the database.
Create a user-defined function plv8_yaml_to_json() that uses an existing JavaScript YAML parser, to convert YAML strings into JSON.
Write and execute SQL that uses the new function and the native Postgres JSON operators to query what you're looking for 🎉.

Here it is altogether!

What can we query?

Well, to showcase this, let's look at a git repo with a lot of YAML. One good example is the public bitnami/charts repo, which is a collection of Helm Charts for popular OSS applications to run on Kubernetes.

List all Helm charts (and their annotated category)

WITH bitnami_charts AS (
    SELECT path, plv8_yaml_to_json(contents) AS chart FROM git_files
    JOIN repos ON git_files.repo_id = repos.id
    WHERE repo = 'https://github.com/bitnami/charts'
    AND path LIKE 'bitnami/%/Chart.yaml'
)
SELECT
    chart->>'name' AS chart_name,
    chart->>'description' AS chart_description,
    chart->'annotations'->>'category' AS category
FROM bitnami_charts

chart_name	chart_description	category
airflow	Apache Airflow is a tool to express and execute workflows as directed acyclic graphs (DAGs). It includes utilities to schedule tasks, monitor task progress and handle task dependencies.	WorkFlow
apache	Apache HTTP Server is an open-source HTTP server. The goal of this project is to provide a secure, efficient and extensible server that provides HTTP services in sync with the current HTTP standards.	Infrastructure
appsmith	Appsmith is an open source platform for building and maintaining internal tools, such as custom dashboards, admin panels or CRUD apps.	CMS
argo-cd	Argo CD is a continuous delivery tool for Kubernetes based on GitOps.	Infrastructure
argo-workflows	Argo Workflows is meant to orchestrate Kubernetes jobs in parallel. It uses DAG and step-based workflows	Infrastructure
...

How many charts are in each category?

WITH bitnami_charts AS (
    SELECT path, plv8_yaml_to_json(contents) AS chart FROM git_files
    JOIN repos ON git_files.repo_id = repos.id
    WHERE repo = 'https://github.com/bitnami/charts'
    AND path LIKE 'bitnami/%/Chart.yaml'
)
SELECT
    count(*),
    chart->'annotations'->>'category' AS category
FROM bitnami_charts
GROUP BY category
ORDER BY count(*) DESC

Hopefully that gives you a sense of what's possible chaining together YAML, MergeStat, JavaScript, and Postgres!

Join our Slack

If you found this interesting, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Open-Source Git Release Tracking w/ SQL 🚀

February 26, 2023

Patrick DeVivo

Founder & CEO

In software, as maybe in life, change is as inevitable as death and taxes 🙂. SCM/VCS tools like Git, which manage code changes, are fundamental to anyone developing code. If DORA is your thing, you'll also know that "deployment frequency" is a key metric (as a proxy for measuring development velocity over time).

In our work on MergeStat, we've often wanted to understand how frequently we ship code. In particular, we want to ensure that our work makes it's way into the hands of our users on a regular basis. Luckily, we've been able to use MergeStat to keep tabs on our release frequency. We've built an internal dashboard using Grafana to support our activities around managing our releases:

This dashboard, and the queries powering it, allow us to answer some questions quickly:

How long has it been since our last release?
On average, how frequently do we typically release new versions? (Over the past 90 days)
How are we doing relative to our own benchmarks of a "healthy" release cadence?
- Where time-between-releases under 1 week is healthy ✅
- Between 1 and 2 weeks is a warning ❗
- More than 2 weeks is an alert 🚨

And here are some of the SQL queries powering these views:

-- retrieve the days since the last MergeStat release
SELECT name, committer_when, ROUND(EXTRACT(EPOCH FROM now() - committer_when)/(86400), 2) AS days_since_last_release
FROM git_tags
JOIN repos ON git_tags.repo_id = repos.id
JOIN git_commits ON git_tags.tag_commit_hash = git_commits.hash
WHERE repo = 'https://github.com/mergestat/mergestat'
ORDER BY committer_when DESC LIMIT 1

-- retrieve the avg (mean) time between new releases, over the last 90 days
WITH tags AS (
    SELECT
        name, committer_when,
        committer_when - LAG(committer_when) OVER (ORDER BY committer_when) AS since_prev_tag
    FROM git_tags
    JOIN repos ON git_tags.repo_id = repos.id
    JOIN git_commits ON git_tags.tag_commit_hash = git_commits.hash
    WHERE repo = 'https://github.com/mergestat/mergestat'
    AND author_when > now() - '90 days'::interval
    ORDER BY committer_when ASC
)
SELECT ROUND(EXTRACT(EPOCH FROM avg(since_prev_tag))/(86400), 2) AS mean_time_between_releases FROM tags

What comes next?

We'd like to alert to our Slack when it's been too long since a release
We'd like a regular push to our Slack for a regular update on how our release cadence is

Join our Slack

If you found this interesting or would like a similar view of your release cadence, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Identifying Open-Source Library Risk Using MergeStat (Part 1)

February 19, 2023

Peter Freiberg

Consultant. DevSecOps. Application Security.

Guest Post

This is a guest post by Peter Freiberg, a DevSecOps and application security consultant based in Melbourne, Australia.

Identifying Open-Source Library Risk Using MergeStat (Part 1)

One topic that arises frequently in my work is understanding what open-source code is in use in an organisation, and where that organisation may have vulnerabilities. Moreover, how do we ensure consistent visibility into our open-source risk profile over time?

Large organisations often have a lot of code, a lot of developers, and a lot of dependencies. This can make it feel like quite a daunting task, and it’s often difficult to know where to begin. The reality, however, is that with a couple of open-source tools, almost every company can easily get accurate information on a great many topics if they know where to look and how to use their data. There’s so much useful data within Git, but not many people are using it today.

MergeStat is one such tool that can help, and here we’ll showcase some queries that can assess open-source library risk and identify the general development landscape across code in an organisation. MergeStat allows you to query all your Git repositories using SQL, to start asking questions like:

Where in our organisation do we use maven (pom.xml) or other package managers of interest?
I’ve found a security issue in our dependencies, which developer modified this last so I can collaborate on remediation?
What’s our technology and language landscape?

Being able to query code ourselves saves a lot of time, the data is the data, it’s actually what’s happened versus surveys, emails and spreadsheets that take weeks to collate and analyse.

I think it’s really useful for AppSec or DevSecOps people to be able to query and analyse code at scale, and then proactively reach out to developers.

We’re going to explore a few use cases for querying an organisation’s code bases, with a little help from automation and some data augmentation, to identify some specific areas of risk.

Show me XYZ open source package manager files, so I can look for known vulnerabilities

In this example, we’re simply going to look for known package manager file types, in this case maven.

-- Find all maven (pom.xml) files across the enterprise 
SELECT repo, path 
FROM git_files
INNER JOIN repos ON git_files.repo_id = repos.id
WHERE path LIKE '%pom.xml%'

From this generated list of files, we can do our own analysis. If you’re already using an SCA tool for your organisation you can also run your own analysis. Keep in mind that having a security tool for your organisation, doesn’t mean everyone is using it (yet).

If we find a potential vulnerability in a file, we can also query who the last author (developer) was and who committed it to the repo again using another query:

SELECT author_email, author_when, committer_email, committer_when
FROM git_commits
JOIN git_commit_stats ON git_commits.hash = git_commit_stats.commit_hash
JOIN repos ON git_commits.repo_id = repos.id
WHERE file_path = 'PATH/TO/THE-FILE.extenstion' AND 
repo = 'https://github.com/YOUR_ORG/REPO'
LIMIT 1

While the last developer or committer may not have added the library which has the issue, at least they have looked at the file recently and in a good position to collaborate with us.

Across all our Git repos, show the most frequently used 3rd party dependencies

One of the first and most fundamental challenges is simply understanding what 3rd party, open-source code a company uses. This is often difficult to aggregate across many code bases and teams.

It can be trivial to access this information in a single repo, but across many (hundreds or even thousands), it can be much more difficult, at least without some automation (or a tool like MergeStat).

MergeStat is able to run open-source SBOM generators on all the Git repositories in an organisation, and store their output in PostgreSQL. In this example, Syft is used to generate and store SBOMs for supported package managers, allowing us to execute a query like this one:

-- Find the most commonly used Syft artifacts across all repos
SELECT COUNT(*) as count, name, version, type
FROM syft_repo_artifacts
GROUP BY name, version, type
ORDER BY 1 DESC

This gives us the most common SBOM artifacts in use across all repositories. This is a great starting point, as it gives us a sense of what languages and tools are in use throughout a company’s code. This is an “out of the box” feature that MergeStat provides.

It also gives us a list of any “key” dependencies - the 3rd party libraries that are most commonly relied on throughout an organisation.

Show me the oldest files that contain security vulnerabilities

Similar to how MergeStat can generate an SBOM across many repos, it can run a vulnerability scan using a tool such as Trivy or Grype. Doing so allows us to surface detected vulnerabilities across the repos in an organisation. Combining this with Git activity allows us to also filter vulnerabilities based on when the related file was last modified. This allows us to order by the vulnerabilities in the least recently modified files first (as these may be a top priority to address, as they’ve been around for a while).

-- SELECT the top 10 oldest files that contain CRITICAL vulnerabilities
SELECT MAX(committer_when) last_commit_to_file, repo, target, vulnerability_id, vulnerability_pkg_name, vulnerability_installed_version, vulnerability_severity
FROM trivy_repo_vulnerabilities
INNER JOIN git_commit_stats ON trivy_repo_vulnerabilities.repo_id = git_commit_stats.repo_id AND trivy_repo_vulnerabilities.target = git_commit_stats.file_path
INNER JOIN git_commits ON git_commit_stats.repo_id = git_commits.repo_id AND git_commit_stats.commit_hash = git_commits.hash
INNER JOIN repos ON trivy_repo_vulnerabilities.repo_id = repos.id
WHERE vulnerability_severity = 'CRITICAL'
GROUP BY repo, target, vulnerability_id, vulnerability_pkg_name, vulnerability_installed_version, vulnerability_severity
ORDER BY 1 ASC
LIMIT 10

This query uses the Trivy integration to surface the top 10 oldest files (by when they were last modified) that contain a CRITICAL vulnerability. This serves as a starting point to identify high-priority vulnerabilities - most likely the most urgent and longest standing ones. Obviously, this query can be modified to suit your needs but it does illustrate the capability.

The key point is that once you can easily query or find the file types you want, running frequent analysis is easier and more repeatable.

Show me who typically maintains 3rd party dependencies across codebases

It’s one thing to know where vulnerabilities are, it’s another thing entirely to find the right people and process to mitigate these vulnerabilities with patches or upgrades. The last person to modify a file is probably a good person to help investigate and fix an issue. An additional query here that’s very valuable to me, is one that shows who are the most frequent committers to dependency manifest files in recent history?

-- show authors with most (non-merge) commits that change package.json files in the last year
SELECT count(*), author_name, author_email, file_path, repo FROM git_commits
JOIN git_commit_stats ON git_commits.hash = git_commit_stats.commit_hash
JOIN repos ON git_commits.repo_id = repos.id
WHERE file_path LIKE '%package.json' AND author_when > now() - '1 year'::interval AND parents < 2
GROUP BY author_name, author_email, file_path, repo
ORDER BY count(*) DESC

The query above can be modified easily to:

Look at go.mod, requirements.txt, Gemfile, etc instead for different package ecosystems
Look at a different time frame - maybe 1 year is too long (lots of engineering churn) and 1 month is a better filter
Filter for specific code bases - maybe you’re only concerned with one or two code bases, as that’s where your focus is.

What Next?

These queries are a way to produce a high-level survey of the open-source risk profile of code bases across an organisation. Getting data from either direct integration of tools (e.g. Trivy, Grype) or just looking for older package manager files provides a good starting point for a security program.

These queries can be expanded on to answer additional questions:

What teams or code bases do a better job of keeping dependencies up to date and vulnerabilities out of main?
Are there individuals or teams who typically update dependencies across many code bases in the organisation? If so, are these good people to target for additional questions about security maintenance?

Once you have your git data sources plumbed into MergeStat, you can query as needed alternating between engaging in discovery and analysing immediate issues.

In Part 2, we’ll cover how to use MergeStat to starting identify overall exposure to a single vulnerability (such as a Log4Shell) that may be thrust upon your day and answer questions in a similar way. Stay tuned!

Join our Slack

If you found this interesting, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

Querying Renovate Bot Across Repos w/ SQL

February 16, 2023

Patrick DeVivo

Founder & CEO

We've been big fans of Renovate, an open-source tool and GitHub app that automates dependency management by opening PRs into your repos. In particular, we've loved:

The breadth of supported dependency types, including Dockerfiles, Helm Charts and GitHub Actions
PRs are rate limited, so that we're not overwhelmed with code changes to review
Respect for conventional commits

Recently, we wanted to know which of our repos had Renovate installed, which is a use case several folks have now shared with us! This post showcases some of the queries we put together around our use of Renovate.

We spun up an instance of MergeStat and begin writing some queries 🎉.

Which Codebases Have Renovate Installed?

Get the list of repos that have Renovate installed (by looking for a renovate.json config file in the root of the repo).

-- which repos have a renovate.json file in the root
SELECT repo, path FROM git_files
JOIN repos ON git_files.repo_id = repos.id
WHERE path = 'renovate.json'

Retrieve when the renovate.json file was added to a codebase, by looking at the commit history.

-- how long ago (in days) were renovate.json files added to our repos
SELECT
    repo,
    max(author_when) AS last_modified,
    ROUND(EXTRACT(EPOCH FROM now() - max(author_when))/86400, 2) AS last_modified_days_ago,
    min(author_when) AS first_added,
    ROUND(EXTRACT(EPOCH FROM now() - min(author_when))/86400, 2) AS first_added_days_ago
FROM git_commit_stats
JOIN git_commits ON git_commit_stats.commit_hash = git_commits.hash
JOIN repos ON git_commit_stats.repo_id = repos.id
WHERE file_path = 'renovate.json'
GROUP BY repo
ORDER BY first_added ASC

How Many Renovate PRs Have Been Opened?

-- total count of renovate PRs across repos
SELECT count(*) FROM github_pull_requests
WHERE author_login = 'renovate'

-- count of renovate PRs across repos by state
SELECT state, count(*)
FROM github_pull_requests
WHERE author_login = 'renovate'
GROUP BY state
ORDER BY count(*) DESC

What Files Does Renovate Typically Modify?

When a Renovate PR merges, what files is it typically making changes to?

-- count of modifications made by Renovate to specific files
SELECT file_path, count(*) FROM git_commits
JOIN git_commit_stats ON git_commits.hash = git_commit_stats.commit_hash
JOIN repos ON git_commits.repo_id = repos.id
WHERE author_name = 'renovate[bot]'
GROUP BY file_path
ORDER BY count(*) DESC

Here's example output from our codebases:

file_path	count
package-lock.json	259
package.json	231
go.mod	37
go.sum	36
.nvmrc	12
.github/workflows/linter.yaml	6
.github/workflows/lint-test.yaml	6
renovate.json	6
.github/workflows/release.yml	5
.github/workflows/chromatic.yml	4

Show the Cumulative Sum of Merged Renovate PRs

How much has Renovate actually been up to - and how much value are we getting from it over time?

-- calculate the cumulative sum of merged renovate PRs over time
WITH data As (
    SELECT
        date_trunc('day', created_at) AS day,
        count(1)
    FROM github_pull_requests
    WHERE author_login = 'renovate' AND state = 'MERGED'
    GROUP BY 1
)
SELECT
    day,
    sum(count) OVER (ORDER BY day ASC rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM data

How Quickly Are Renovate PRs Merged?

What's the distribution of time-to-merge (measured as the number of days between PR creation and merge) of Renovate PRs.

SELECT
    FLOOR((EXTRACT(EPOCH FROM merged_at - created_at)/86400)/5)*5 as bin_floor,
    count(*)
FROM github_pull_requests
WHERE author_login = 'renovate' AND state = 'MERGED'
GROUP BY 1
ORDER BY 1

Conclusion

We were curious to play around with this data to learn about our use of Renovate and the value we've been getting from it. If you're a Renovate user, hopefully it's interesting to you as well!

Join our Slack

If you found this interesting, hop in our community Slack! We're always happy to chat about MergeStat there 🎉.

The Setup​

What are we using? (Are we using our golden images?)​

Let’s NOT use ‘latest’ Docker images (or at least be able to find them)​

Golden images need to be maintained!​

When was the last time these files were modified or updated?​

Who are you going to call? (when you want to check something)​

We can use code analytics for security and governance​

Parting thoughts​

What percent of code (by line) is each author "blameable" for?​

What's the average (mean) age of a line of code?​

What's the average age of code, by author?​

What are the limitations of this process?​

Next Steps​

How it works​

What can we query?​

List all Helm charts (and their annotated category)​

How many charts are in each category?​

Identifying Open-Source Library Risk Using MergeStat (Part 1)

Show me XYZ open source package manager files, so I can look for known vulnerabilities​

Across all our Git repos, show the most frequently used 3rd party dependencies​

Show me the oldest files that contain security vulnerabilities​

Show me who typically maintains 3rd party dependencies across codebases​

What Next?​

Which Codebases Have Renovate Installed?​

How Many Renovate PRs Have Been Opened?​

What Files Does Renovate Typically Modify?​

Show the Cumulative Sum of Merged Renovate PRs​

How Quickly Are Renovate PRs Merged?​

Conclusion

The Setup

What are we using? (Are we using our golden images?)

Let’s NOT use ‘latest’ Docker images (or at least be able to find them)

Golden images need to be maintained!

When was the last time these files were modified or updated?

Who are you going to call? (when you want to check something)

We can use code analytics for security and governance

Parting thoughts

What percent of code (by line) is each author "blameable" for?

What's the average (mean) age of a line of code?

What's the average age of code, by author?

What are the limitations of this process?

Next Steps

How it works

What can we query?

List all Helm charts (and their annotated category)

How many charts are in each category?

Show me XYZ open source package manager files, so I can look for known vulnerabilities

Across all our Git repos, show the most frequently used 3rd party dependencies

Show me the oldest files that contain security vulnerabilities

Show me who typically maintains 3rd party dependencies across codebases

What Next?

Which Codebases Have Renovate Installed?

How Many Renovate PRs Have Been Opened?

What Files Does Renovate Typically Modify?

Show the Cumulative Sum of Merged Renovate PRs

How Quickly Are Renovate PRs Merged?