Breaking down CI jobs testgrid for the upstream kubernetes project #23

June 10, 2021

(Thursdays are the weekly 1:1 meetings with dims. You could find the agenda document here)

Today’s meeting was a discussion in continuation to this previous blog, where I wrote about dims giving me a walkthrough of the k8s testgrid dashboard.

As a takeaway, I had left for myself a home-work exercise in the end of that discussion. Below is where I’m trying to take notes of what I learnt broadly after doing that homework exercise.

And yea, the homework was to explore what I learnt about the kubernetes testgrid (above), now through the source code.

The source code for the kubernetes CI test jobs are present in this repository here ~ kubernetes/test-infra
There are broadly the following three types of CI test jobs in the kubernetes project ~
- Pre-submit jobs
- Post-submit jobs
- Periodic jobs
For the purpose of note taking, let’s start by taking the Pre-submit jobs as our example usecase.

( Here is something important to know (which is also the reason why I'm picking Pre-submit jobs for our example):

Only the Pre-submit CI jobs have this properly defined & maintained dashboard. You would not find a similar testgrid dashboard (something that we are going to discuss below in the notes) for the Post-submit jobs or Periodic jobs. Following being the major reasons:
- The Pre-submit jobs are the kind of jobs that runs everytime before a PR is submitted or merged. So, it contains mostly blocking jobs. And because these are blocking jobs & run on almost all the PRs during a day, we want the number of these jobs to be very less. And thus, the effort of maintaining a dashboard for these limited number of pre-submit jobs sound reasonabe & quite doable.
- On the other hand, the number of Periodic jobs & Post-submit jobs is very high, & currently most of these are frequently ending up failing only. So, the dashboard usually end up being all red. People argue that these jobs fail so often & so much, that maybe the evaluation/validation criteria of these jobs are not good enough & so, no use of maintaining an all time red board for this large number of jobs. )
That being said, now we could start exploring things in action.
Head onto the kubenetes CI jobs testgrid dashboard groups UI using this link ~ https://testgrid.k8s.io.

You will find something like following ~

All these tiles such as conformance, gardener, google, … , presubmits, … , wg-multi-tenancy corresponds to a set of CI jobs dashboards or a dashboard-group (as we call it in the source-code), for a specific project or a sig-group.

The source-code corresponding to all these dashboard-group tiles (shown in the above screenshot) sits here ~ kubernetes/test-infra/config/testgrids
For our example, as I mentioned above, we are checking presubmits dashboard-group.

The kubernetes testgrid dashboard-group for Pre-submit jobs (as of my writing this blog) looks like the following ~

You’ll see in the top left corner, a grid-like icon. Clicking that grid-like icon would take you back to the same list of dashboard-groups home page UI as we discussed in the last point above.

Now, on this pre-submits dashboard-group page, we see all these tabs (or the similar row entries under dashboard column in the Dashboard Group presubmits :: Overview table) ~

presubmits-alibaba-cloud-csi-driver, presubmits-cloud-provider-alibaba, … , presubmits-test-infra

These tabs corresponds to the dashboards that goes under our presubmits dashboard-group. And of course, these dashboard tabs will further divide into CI jobs (as we’ll see later in the notes).

Before we move further, it’s also interesting to look at the information provided in the Dashboard Group presubmits :: Overview table.

As we can see, there are three columns:
- Status,
- Dashboard (which is basically list of names of the dashboard such as presubmits-alibaba-cloud-csi-driver, … , etc.),
- Health
This table gives a broader view of how many CI jobs under a certain dashboard are PASSING, or FAILING, or are there any FLAKY jobs. For ex, in case of dashboard presubmits-kubernetes-nonblocking (from the above screenshot),
- The overall status of this dashboard is FAILING (because we see there is atleast one CI job failing)
- There are 6 PASSING CI jobs, 15 FLAKY CI jobs, & 1 FAILING CI job.
The source-code corresponding to this presubmits dashboard-group sits here ~ kubernetes/test-infra/config/testgrids/kubernetes/presubmits/config.yaml
Let’s try to break this presubmits/config.yaml file (I linked just above).
- The following two lines in the very beginning of the config file, tells us about the name of the dashboard-group it corresponds to.
  
  So, in this as we can see, this config file corresponds to presubmits dashboard-group.
```
dashboard_groups:
- name: presubmits
```
- further down, this section dashboard_names tells us about what all dashboards are included as tabs in the presubmits dashboard-group.
  
  As we could match from the above screenshot, all the dashboard tabs we could see there, are listed under this section.
```
dashboard_names:
- presubmits-alibaba-cloud-csi-driver
- presubmits-cloud-provider-alibaba
- presubmits-cloud-provider-vsphere-blocking
- presubmits-cluster-registry
- presubmits-kops
- presubmits-kube-batch
- presubmits-kubernetes-blocking
- presubmits-kubernetes-nonblocking
- presubmits-kubernetes-scalability
- presubmits-misc
- presubmits-node-problem-detector
- presubmits-poseidon
- presubmits-test-infra
```
Let’s move another step forward.

Click on one of the dashboard tab under the presubmits dashboard-group & see what all CI jobs are added under this.

Let’s take presubmits-kubernetes-blocking as an example. Clicking this dashboard tab, will bring something like this ~

Now, we could see all the various CI jobs collected under the presubmits-kubernetes-blocking dashboard, like ~

pull-kubernetes-e2e-kind, pull-kubernetes-e2e-kind-ipv6, … , pull-kubernetes-unit

The source-code corresponding to this presubmits-kubernetes-blocking dashboard sits in the same config file, presubmits/config.yaml –> kubernetes/test-infra/config/testgrids/kubernetes/presubmits/config.yaml.

Something to note is that although all the CI jobs under this dashbaord will be listed in same config file, but those CI jobs may or may not be defined here (we’ll discuss more about it, below in the notes).

Continuing with the same presubmits-kubernetes-blocking dashboard example above, let’s now explore the source-code for the various CI jobs listed under it:

The following lines of code under the dasboards –> name: presubmits-kubernetes-blocking –> dashboard_tab section, defines what CI jobs will fall under this particular dashboard.

dashboards:
  
  ...
    
  - name: presubmits-kubernetes-blocking
  dashboard_tab:
  - name: pull-kubernetes-e2e-kind
    test_group_name: pull-kubernetes-e2e-kind
    base_options: width=10
  - name: pull-kubernetes-e2e-kind-ipv6
    test_group_name: pull-kubernetes-e2e-kind-ipv6
    base_options: width=10
  - name: pull-kubernetes-bazel-build
    test_group_name: pull-kubernetes-bazel-build
    base_options: width=10
  - name: pull-kubernetes-bazel-test
    test_group_name: pull-kubernetes-bazel-test
    base_options: width=10
  - name: pull-kubernetes-e2e-gce
    test_group_name: pull-kubernetes-e2e-gce
    base_options: width=10
  - name: pull-kubernetes-e2e-gce-100-performance
    test_group_name: pull-kubernetes-e2e-gce-100-performance
    base_options: width=10
  - name: pull-kubernetes-node-e2e
    test_group_name: pull-kubernetes-node-e2e
    base_options: width=10
  - name: pull-kubernetes-node-e2e-containerd
    test_group_name: pull-kubernetes-node-e2e-containerd
    base_options: width=10
  - name: pull-kubernetes-integration
    test_group_name: pull-kubernetes-integration
    base_options: width=10
  - name: pull-kubernetes-verify
    test_group_name: pull-kubernetes-verify
    base_options: width=10
  - name: pull-kubernetes-typecheck
    test_group_name: pull-kubernetes-typecheck
    base_options: width=10
  - name: pull-kubernetes-dependencies
    test_group_name: pull-kubernetes-dependencies
    base_options: width=10
  - name: pull-kubernetes-e2e-gce-network-proxy-http-connect
    test_group_name: pull-kubernetes-e2e-gce-network-proxy-http-connect
    base_options: width=10
  - name: pull-kubernetes-conformance-kind-ga-only-parallel
    test_group_name: pull-kubernetes-conformance-kind-ga-only-parallel
    base_options: width=10
  - name: pull-kubernetes-e2e-gce-ubuntu-containerd
    test_group_name: pull-kubernetes-e2e-gce-ubuntu-containerd
    base_options: width=10
  - name: pull-kubernetes-verify-govet-levee
    test_group_name: pull-kubernetes-verify-govet-levee
    base_options: width=10
  - name: pull-kubernetes-files-remake
    test_group_name: pull-kubernetes-files-remake
    base_options: width=10

Now, as we see, this section just lists what all CI jobs will go under this one specific dashboard tab, but the definition of these CI jobs are not given here.

And that’s our next step to explore below.

These CI jobs under the presubmits-kubernetes-blocking tab are defined at multiple places based on the category or sub-project these CI job(s) fall into.

(something to note here is, one CI job can fall under multiple dashboard tabs, further under multiple dashboard-groups. So, that is one reason why these general purpose CI jobs are not defined just under one dashboard-group config file, but at certain separate places so they could be referenced by multiple dashboards)

Let’s start by taking few of the CI jobs listed above under the presubmits-kubernetes-blocking dashboard tab for reference example:

The pull-kubernetes-e2e-kind CI job is defined here ~ kubernetes/test-infra/config/jobs/kubernetes/sig-testing/kubernetes-kind-presubmits.yaml#L3-L47

  - name: pull-kubernetes-e2e-kind
    cluster: k8s-infra-prow-build
    optional: false
    always_run: true
    decorate: true
    skip_branches:
    - release-\d+\.\d+ # per-release settings
    labels:
      preset-dind-enabled: "true"
      preset-kind-volume-mounts: "true"
    decoration_config:
      timeout: 60m
      grace_period: 15m
    path_alias: k8s.io/kubernetes
    spec:
      containers:
      - image: gcr.io/k8s-testimages/krte:v20210512-b8d1b30-master
        command:
        - wrapper.sh
        - bash
        - -c
        - curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" && e2e-k8s.sh
        env:
        - name: FOCUS
          value: "."
        # TODO(bentheelder): reduce the skip list further
        - name: SKIP
          value: \[Slow\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]|PodSecurityPolicy|LoadBalancer|load.balancer|Simple.pod.should.support.exec.through.an.HTTP.proxy|subPath.should.support.existing|NFS|nfs|inline.execution.and.attach|should.be.rejected.when.no.endpoints.exist
        - name: PARALLEL
          value: "true"
        # we need privileged mode in order to do docker in docker
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: 7
            memory: 9000Mi
          requests:
            cpu: 7
            memory: 9000Mi
    annotations:
      testgrid-num-failures-to-alert: '10'
      testgrid-alert-stale-results-hours: '24'
      testgrid-create-test-group: 'true'
      fork-per-release: "true"

Similarily, the CI jobs pull-kubernetes-node-e2e & pull-kubernetes-node-e2e-conatinerd are defined in this same one file, here & here respectively.

For finding where rest of the CI jobs are defined under the kubernetes/test-infra repository, you can utilize the following:

Do a GitHub respository wide search.

for ex, here I’m searching for where could be pull-kubernetes-node-e2e CI job in the kubernetes/test-infra repository
Dr do a similar search using k8s hound search engine.

for ex, here I’m again searching for various places across the k8s projet, the pull-kubernetes-node-e2e2 Ci job is referenced

Moving one more step further in the testgrid dashboard, let’s try to exapand & explore one of the CI jobs under one of the dashboard tabs now.

In continuation to our above examples, let’s try to see what comes when you click presubmits (dashboard-group) –> presubmits-kubernetes-blocking (dashboard-tab) –> pull-kubernetes-e2e-kind (CI job):

We could see the following:
- on the left hand side, we see this wide grey section ~ These are the logs generated when the CI job pull-kubernetes-e2e-kind ran (spanned across multiple rows). To look at the full logs in a plain text, look at the artificats created by the CI job.
- on the right hand side, we see the small tiles in green, red, white & grey colors ~ These are statuses of this CI job run across a period of time (days, divided by hours, and so on). This tells us how this certain CI job is performing.
Based on how frequently, a certain CI job fails, you would also see that CI jobs are marked as FLAKY.

The kind of calculations required to decide if a certain job will be marked as flaky based on “this is x no of times when this job failed”, is done using the metrics calculation config files present here ~ kubernetes/test-infra/metrics/configs
And yea, finally we reached at the end. All these above steps can be repeated to understand different dashboard-groups –> dashboards –> CI jobs across the kubernetes prow testgrid platform.

That’s all for the time! o/

cogitationes, labores, et gratiae (thoughts, works, and gratitudes)

Breaking down CI jobs testgrid for the upstream kubernetes project #23

Other posts

go workspaces usage in kubernetes project 19 Jun 2025

go_modules: how to create go vendor tarballs from subdirectories 10 Apr 2025