Tracing down GitLab metrics with Python

Published Mar 28, 2025

The content here is under the Attribution 4.0 International (CC BY 4.0) license

A data-oriented approach to software development is a common practice in the industry. In this post, I will share my experience building reports using Python, matplotlib, and GitLab, covering the entire process from setting up the environment to generating the reports. The GitLab instance used is gitlab.com, and the data comes from the GitLab REST API. The reports are generated with matplotlib and the data is processed with pandas.

The process described here follows three main steps:

  1. getting the data from the Gitlab API
  2. processing the data using pandas
  3. plotting the data using matplotlib

Step 2 can be skipped if you are building a simple visualization, since the data can be passed directly to matplotlib, as the quick sketch below shows.
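
As a minimal illustration of that shortcut, the snippet below plots plain Python lists straight into matplotlib without pandas; the numbers are made up for the example:

import matplotlib.pyplot as plt

# Hypothetical deployment counts, hard-coded just to illustrate skipping pandas
weeks = ["W1", "W2", "W3", "W4"]
deploys = [3, 5, 2, 7]

plt.bar(weeks, deploys)  # matplotlib accepts plain Python lists directly
plt.ylabel("Deployments")
plt.title("Deployments per week (sample data)")
plt.show()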

@startuml

actor User
participant "GitLab API" as GitLab
participant "Python Script" as Script
participant "Pandas" as Pandas
participant "Matplotlib" as Matplotlib

User -> Script: Start script
Script -> GitLab: Request data
GitLab --> Script: Return data (JSON)
Script -> Pandas: Process data
Pandas --> Script: Return processed data
Script -> Matplotlib: Plot data
Matplotlib --> User: Display visualization
@enduml

Although GitLab is the platform used here, the concept is applicable to any VCS platform that exposes a similar API.

Requirements

The following sections will cover the details of each step, including code snippets and examples. Before that, check the requirements:

  • Python 3.8 or higher
  • Pyenv
  • A GitLab account and project from which you want to collect metrics
  • matplotlib, pandas, and requests (installed via pip below)

In addition to that, for this post we will assume the following dynamics around how the GitLab API is used:

  • GitLab is used as the central source of truth for the source code (the entire team works there)
  • GitLab pipelines are used to deploy the application and run jobs

Let’s now dive into the details of each step.

Setting Up Your Environment

Local environment

For Python, the environment is prepared with pyenv; it is a good practice to use pyenv (together with the pyenv-virtualenv plugin) to manage both the Python version and the virtual environment. The virtual environment can be created using the command:

pyenv virtualenv 3.8.10 gitlab-metrics

Then, activate the virtual environment using:

pyenv activate gitlab-metrics

Next, install the required libraries using pip:

pip install matplotlib pandas requests
Hint for a reproducible environment

You can even store the requirements in a file named requirements.txt to reuse it later and reinstall the dependencies with pip install -r requirements.txt.
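
For example, assuming you are still inside the activated virtual environment, the standard pip commands below capture and later restore the exact dependency set:

pip freeze > requirements.txt      # capture the exact versions currently installed
pip install -r requirements.txt    # recreate the same environment later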

GitLab token

We will need a GitLab token that can read the repository information. To get one, go to GitLab and generate a personal access token with the read_api scope checked.

Once the token is generated, store it in an environment variable named GITLAB_TOKEN; the Python script will read it from there later on to access the data.
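
For example, in a Unix-like shell the token can be exported for the current session (the value below is only a placeholder, not a real token):

export GITLAB_TOKEN="glpat-your-token-here"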

To test that the GitLab token is visible to Python, create the following script in the same terminal session where you exported the token and name the file check_gitlab_token.py:

import os
import sys

def check_gitlab_token():
    token = os.getenv("GITLAB_TOKEN")
    if token:
        print("GITLAB_TOKEN is set.")
    else:
        print("Error: GITLAB_TOKEN is not set.")
        sys.exit(1)

if __name__ == "__main__":
    check_gitlab_token()

Once it is done, execute it:

python check_gitlab_token.py

You should see the message: GITLAB_TOKEN is set.
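
The check above only verifies that the variable is set. If you also want to confirm that the token can actually read the API, a small sketch like the one below (assuming the gitlab.com instance and the requests library installed earlier) calls the /user endpoint and prints the HTTP status and username:

import os

import requests

# Assumes GITLAB_TOKEN is exported in the current shell and gitlab.com is the instance
token = os.environ["GITLAB_TOKEN"]
response = requests.get(
    "https://gitlab.com/api/v4/user",
    headers={"PRIVATE-TOKEN": token},
)

# 200 means the token can read the API; 401 means it was rejected
print(response.status_code, response.json().get("username", "<no username>"))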

Directory structure

To express the project structure and its intent, in my personal projects I prefer to adopt a “screaming architecture” style for the source code directory. For this project, the following structure is adopted:

root
   |__ deployment-frequency
          |__________ collect-data
          |__________ stored-data
          |__________ visualize-data

Note that the top folder describes the intention of the metric; if we want to add another metric, we can follow the same structure:

root
   |__ deployment-frequency
          |__________ collect-data
          |__________ stored-data
          |__________ visualize-data
   |__ author-frequency-commits
          |__________ collect-data
          |__________ stored-data
          |__________ visualize-data

This structure favors isolation between metrics instead of generalizing the code to fit new ones. The downside is the duplication of code and of the generated data.
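
As a convenience, the skeleton for a new metric can be created with a single shell command (assuming bash brace expansion):

mkdir -p author-frequency-commits/{collect-data,stored-data,visualize-data}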

Data Acquisition

The data acquisition step focuses on calling the GitLab API and storing all the data locally, so the API calls only have to be made once. The process might take a few minutes depending on your repository size. At this stage we do not care about the data format, only about storing it.

In order to get the data from the API, the following script fetches each page of results and moves on to the next one until there are no more pages. Following the structure already presented, it lives here:

root
   |__ deployment-frequency
          |__________ collect-data
                          |__________ fetch_jobs_runs_into_production.py
          |__________ stored-data
          |__________ visualize-data

With the following source code:

import requests
from datetime import datetime
import json
import os

TOKEN = os.environ['GITLAB_TOKEN']
PROJECT_ID = os.environ['PROJECT_ID']
BASE_URL = f"https://gitlab.com/api/v4/projects/{PROJECT_ID}/jobs"
HEADERS = {"PRIVATE-TOKEN": TOKEN}
PAGE = 1
PER_PAGE = 100

JOB_NAME = "deploy"

all_jobs = []

while PAGE:
    print("{} , {}, {}", BASE_URL, PAGE, PER_PAGE)
    response = requests.get(BASE_URL, params={"page": PAGE, "per_page": PER_PAGE}, headers=HEADERS)
    jobs = response.json()

    for job in jobs:
        # Convert job's created_at timestamp to a datetime object
        job_date = datetime.strptime(job["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
        print("job date {}", job_date)

        if job["name"] == JOB_TO_BE_TRACKED:
            all_jobs.append(job)

    with open("stored-data/all_deploys.json", "w") as f:
      json.dump(all_jobs, f, indent=4)

    # Pagination logic
    PAGE = int(response.headers.get("X-Next-Page") or 0)

print("Finished")

Further changes to the script could collect all jobs regardless of name, or restrict the stored data to a given date range, as in the sketch below.
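
As one example of the date restriction, a sketch along these lines could keep only jobs created after an arbitrary cutoff (the cutoff value and the helper name are mine, not part of the original script):

from datetime import datetime

# Hypothetical cutoff: ignore jobs created before this date
CUTOFF = datetime(2024, 1, 1)

def keep_job(job, cutoff=CUTOFF):
    """Return True if the job was created on or after the cutoff date."""
    job_date = datetime.strptime(job["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return job_date >= cutoff

# Inside the pagination loop the name check would become something like:
# if job["name"] == JOB_NAME and keep_job(job):
#     all_jobs.append(job)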

Note that running the script might take a while. Once done, the structure should look like the following:

root
   |__ deployment-frequency
                |__________ collect-data
                                |__________ fetch_jobs_runs_into_production.py
                |__________ stored-data
                                |__________ all_deploys.json
                |__________ visualize-data

The next step is to generate the plot to see and analyse the data.

Visualization with matplotlib

Matplotlib is a powerful library for creating static visualizations in Python. It is widely used in the data science community and has a large number of customization options [1]. The library is easy to use and allows you to create a wide variety of plots, including line charts, bar graphs, pie charts, and more.

One challenging decision is choosing the right type of plot for your data. Here are some tips for making your visualizations clear and professional:

  • Choose the right type of plot for your data. For example, use a line chart for time series data, a bar graph for categorical data, and a pie chart for proportions.
  • Use clear and descriptive labels for your axes and titles. This will help your audience understand what the plot is showing.
  • Use colors and styles that are easy to read and interpret. Avoid using too many colors or styles that can be distracting.
  • Keep your plots simple and avoid clutter. Too much information can make it difficult for your audience to understand the main message.

Following those guidelines, the example for our GitLab pipeline runs plots the deployments over time (a per-week bar-chart variant is sketched after the script). The structure in the end will be the following:

root
   |__ deployment-frequency
                |__________ collect-data
                                |__________ fetch_jobs_runs_into_production.py
                |__________ stored-data
                                |__________ all_deploys.json
                |__________ visualize-data
                                |__________ barchart_deployment_frequency.py

The file barchart_deployment_frequency.py contains the following source code to plot the chart:

import json
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime

with open("stored-data/all_deploys.json", "r") as f:
    jobs = json.load(f)

deployments = []
for job in jobs:
    if job.get("finished_at"):  # Ensure there's a finished timestamp
        finished_time = datetime.strptime(job["finished_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
        deployments.append(finished_time)

df = pd.DataFrame(deployments, columns=["Finished At"])
df.sort_values(by="Finished At", inplace=True)

# Plot
plt.figure(figsize=(10, 5))
plt.plot(df["Finished At"], range(1, len(df) + 1), marker="o", linestyle="-", color="b")
plt.xlabel("Time")
plt.ylabel("Number of Deployments")
plt.title("Deployments Over Time")
plt.xticks(rotation=45)
plt.grid()

plt.show()

The script loads the data fetched into the stored-data folder, transforms it with pandas into a format matplotlib can understand, and plots it.
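
Since the script is named barchart_deployment_frequency.py, a variant that matches the name more literally would group the deployments per week and render them as bars. The sketch below is one way to do that with pandas; the weekly grouping is my assumption, not something prescribed above:

import json
from datetime import datetime

import matplotlib.pyplot as plt
import pandas as pd

with open("stored-data/all_deploys.json", "r") as f:
    jobs = json.load(f)

# Keep only jobs that actually finished
finished = [
    datetime.strptime(job["finished_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
    for job in jobs
    if job.get("finished_at")
]

df = pd.DataFrame(finished, columns=["Finished At"])

# Count deployments per calendar week
per_week = df.groupby(pd.Grouper(key="Finished At", freq="W")).size()

plt.figure(figsize=(10, 5))
plt.bar(per_week.index, per_week.values, width=5, color="b")  # width in days on a date axis
plt.xlabel("Week")
plt.ylabel("Number of Deployments")
plt.title("Deployment Frequency per Week")
plt.xticks(rotation=45)
plt.grid(axis="y")
plt.tight_layout()
plt.show()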

Lessons Learned

Python and matplotlib are powerful tools for data visualization, but they can be challenging to use at first. Here are some lessons learned during the process:

  • Before diving into the code, outline your goals and the types of visualizations you want to create. This will help you stay focused and avoid getting lost in the details.
  • Using the Python ecosystem made things easier, since it has an extensive community that uses these tools and collaborates on them
  • In this setup, matplotlib is used for static chart generation only

Conclusion

In this post, we explored how to build reports using Python, matplotlib, and GitLab. We covered the entire process from setting up the environment to generating reports, including data acquisition, processing, and visualization. By following these steps, you can create powerful visualizations that help you understand your data better and make informed decisions.

Additional Tips

This post has a companion repository with the code and the data used to generate the reports. It can be used as a template to set up the environment and start using the tools without having to worry about the setup.

Resources

References

  1. P. Morgan, Data Analysis From Scratch With Python: Beginner Guide using Python, Pandas, NumPy, Scikit-Learn, IPython, TensorFlow and Matplotlib. AI Sciences LLC, 2018.
