Breaking down CI jobs testgrid for the upstream kubernetes project #23
June 10, 2021
(Thursdays are the weekly 1:1 meetings with dims. You could find the agenda document here)
Today’s meeting was a discussion in continuation to this previous blog, where I wrote about dims giving me a walkthrough of the k8s testgrid dashboard.
As a takeaway, I had left for myself a home-work exercise in the end of that discussion. Below is where I’m trying to take notes of what I learnt broadly after doing that homework exercise.
And yea, the homework was to explore what I learnt about the kubernetes testgrid (above), now through the source code.
-
The source code for the kubernetes CI test jobs are present in this repository here ~ kubernetes/test-infra
- There are broadly the following three types of CI test jobs in the kubernetes project ~
- Pre-submit jobs
- Post-submit jobs
- Periodic jobs
-
For the purpose of note taking, let’s start by taking the Pre-submit jobs as our example usecase.
(
Here is something important to know (which is also the reason why I'm picking Pre-submit jobs for our example):
Only the
Pre-submit
CI jobs have this properly defined & maintained dashboard. You would not find a similar testgrid dashboard (something that we are going to discuss below in the notes) for thePost-submit
jobs orPeriodic
jobs. Following being the major reasons:- The
Pre-submit jobs
are the kind of jobs that runs everytime before a PR is submitted or merged. So, it contains mostly blocking jobs. And because these are blocking jobs & run on almost all the PRs during a day, we want the number of these jobs to be very less. And thus, the effort of maintaining a dashboard for these limited number ofpre-submit
jobs sound reasonabe & quite doable. - On the other hand, the number of
Periodic jobs
&Post-submit jobs
is very high, & currently most of these are frequently ending up failing only. So, the dashboard usually end up being all red. People argue that these jobs fail so often & so much, that maybe the evaluation/validation criteria of these jobs are not good enough & so, no use of maintaining an all time red board for this large number of jobs. )
That being said, now we could start exploring things in action.
- The
-
Head onto the kubenetes CI jobs testgrid dashboard groups UI using this link ~ https://testgrid.k8s.io.
You will find something like following ~
All these tiles such as
conformance
,gardener
,google
, … ,presubmits
, … ,wg-multi-tenancy
corresponds to a set of CI jobs dashboards or a dashboard-group (as we call it in the source-code), for a specific project or a sig-group.The source-code corresponding to all these dashboard-group tiles (shown in the above screenshot) sits here ~ kubernetes/test-infra/config/testgrids
-
For our example, as I mentioned above, we are checking
presubmits
dashboard-group.The kubernetes testgrid dashboard-group for
Pre-submit jobs
(as of my writing this blog) looks like the following ~You’ll see in the top left corner, a
grid-like
icon. Clicking thatgrid-like
icon would take you back to the same list ofdashboard-groups
home page UI as we discussed in the last point above.Now, on this
pre-submits
dashboard-group page, we see all these tabs (or the similar row entries underdashboard
column in theDashboard Group presubmits :: Overview
table) ~presubmits-alibaba-cloud-csi-driver
,presubmits-cloud-provider-alibaba
, … ,presubmits-test-infra
These tabs corresponds to the dashboards that goes under our
presubmits
dashboard-group. And of course, these dashboard tabs will further divide into CI jobs (as we’ll see later in the notes).Before we move further, it’s also interesting to look at the information provided in the
Dashboard Group presubmits :: Overview
table.As we can see, there are three columns:
Status
,Dashboard
(which is basically list of names of the dashboard such aspresubmits-alibaba-cloud-csi-driver
, … , etc.),Health
This table gives a broader view of how many CI jobs under a certain dashboard are
PASSING
, orFAILING
, or are there anyFLAKY
jobs. For ex, in case of dashboardpresubmits-kubernetes-nonblocking
(from the above screenshot),- The overall status of this dashboard is
FAILING
(because we see there is atleast one CI job failing) - There are 6
PASSING
CI jobs, 15FLAKY
CI jobs, & 1FAILING
CI job.
The source-code corresponding to this
presubmits
dashboard-group sits here ~ kubernetes/test-infra/config/testgrids/kubernetes/presubmits/config.yaml -
Let’s try to break this
presubmits/config.yaml
file (I linked just above).-
The following two lines in the very beginning of the config file, tells us about the name of the
dashboard-group
it corresponds to.So, in this as we can see, this config file corresponds to
presubmits
dashboard-group.dashboard_groups: - name: presubmits
-
further down, this section
dashboard_names
tells us about what all dashboards are included as tabs in thepresubmits
dashboard-group.As we could match from the above screenshot, all the dashboard tabs we could see there, are listed under this section.
dashboard_names: - presubmits-alibaba-cloud-csi-driver - presubmits-cloud-provider-alibaba - presubmits-cloud-provider-vsphere-blocking - presubmits-cluster-registry - presubmits-kops - presubmits-kube-batch - presubmits-kubernetes-blocking - presubmits-kubernetes-nonblocking - presubmits-kubernetes-scalability - presubmits-misc - presubmits-node-problem-detector - presubmits-poseidon - presubmits-test-infra
-
-
Let’s move another step forward.
Click on one of the dashboard tab under the
presubmits
dashboard-group & see what all CI jobs are added under this.Let’s take
presubmits-kubernetes-blocking
as an example. Clicking this dashboard tab, will bring something like this ~Now, we could see all the various CI jobs collected under the
presubmits-kubernetes-blocking
dashboard, like ~pull-kubernetes-e2e-kind
,pull-kubernetes-e2e-kind-ipv6
, … ,pull-kubernetes-unit
The source-code corresponding to this
presubmits-kubernetes-blocking
dashboard sits in the same config file,presubmits/config.yaml
–> kubernetes/test-infra/config/testgrids/kubernetes/presubmits/config.yaml.Something to note is that although all the CI jobs under this dashbaord will be listed in same config file, but those CI jobs may or may not be defined here (we’ll discuss more about it, below in the notes).
-
Continuing with the same
presubmits-kubernetes-blocking
dashboard example above, let’s now explore the source-code for the various CI jobs listed under it:The following lines of code under the
dasboards
–>name: presubmits-kubernetes-blocking
–>dashboard_tab
section, defines what CI jobs will fall under this particular dashboard.dashboards: ... - name: presubmits-kubernetes-blocking dashboard_tab: - name: pull-kubernetes-e2e-kind test_group_name: pull-kubernetes-e2e-kind base_options: width=10 - name: pull-kubernetes-e2e-kind-ipv6 test_group_name: pull-kubernetes-e2e-kind-ipv6 base_options: width=10 - name: pull-kubernetes-bazel-build test_group_name: pull-kubernetes-bazel-build base_options: width=10 - name: pull-kubernetes-bazel-test test_group_name: pull-kubernetes-bazel-test base_options: width=10 - name: pull-kubernetes-e2e-gce test_group_name: pull-kubernetes-e2e-gce base_options: width=10 - name: pull-kubernetes-e2e-gce-100-performance test_group_name: pull-kubernetes-e2e-gce-100-performance base_options: width=10 - name: pull-kubernetes-node-e2e test_group_name: pull-kubernetes-node-e2e base_options: width=10 - name: pull-kubernetes-node-e2e-containerd test_group_name: pull-kubernetes-node-e2e-containerd base_options: width=10 - name: pull-kubernetes-integration test_group_name: pull-kubernetes-integration base_options: width=10 - name: pull-kubernetes-verify test_group_name: pull-kubernetes-verify base_options: width=10 - name: pull-kubernetes-typecheck test_group_name: pull-kubernetes-typecheck base_options: width=10 - name: pull-kubernetes-dependencies test_group_name: pull-kubernetes-dependencies base_options: width=10 - name: pull-kubernetes-e2e-gce-network-proxy-http-connect test_group_name: pull-kubernetes-e2e-gce-network-proxy-http-connect base_options: width=10 - name: pull-kubernetes-conformance-kind-ga-only-parallel test_group_name: pull-kubernetes-conformance-kind-ga-only-parallel base_options: width=10 - name: pull-kubernetes-e2e-gce-ubuntu-containerd test_group_name: pull-kubernetes-e2e-gce-ubuntu-containerd base_options: width=10 - name: pull-kubernetes-verify-govet-levee test_group_name: pull-kubernetes-verify-govet-levee base_options: width=10 - name: pull-kubernetes-files-remake test_group_name: pull-kubernetes-files-remake base_options: width=10
Now, as we see, this section just lists what all CI jobs will go under this one specific dashboard tab, but the definition of these CI jobs are not given here.
And that’s our next step to explore below.
-
These CI jobs under the
presubmits-kubernetes-blocking
tab are defined at multiple places based on the category or sub-project these CI job(s) fall into.(something to note here is, one CI job can fall under multiple
dashboard
tabs, further under multipledashboard-groups
. So, that is one reason why these general purpose CI jobs are not defined just under onedashboard-group
config file, but at certain separate places so they could be referenced by multiple dashboards)Let’s start by taking few of the CI jobs listed above under the
presubmits-kubernetes-blocking
dashboard tab for reference example:- The
pull-kubernetes-e2e-kind
CI job is defined here ~ kubernetes/test-infra/config/jobs/kubernetes/sig-testing/kubernetes-kind-presubmits.yaml#L3-L47
- name: pull-kubernetes-e2e-kind cluster: k8s-infra-prow-build optional: false always_run: true decorate: true skip_branches: - release-\d+\.\d+ # per-release settings labels: preset-dind-enabled: "true" preset-kind-volume-mounts: "true" decoration_config: timeout: 60m grace_period: 15m path_alias: k8s.io/kubernetes spec: containers: - image: gcr.io/k8s-testimages/krte:v20210512-b8d1b30-master command: - wrapper.sh - bash - -c - curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" && e2e-k8s.sh env: - name: FOCUS value: "." # TODO(bentheelder): reduce the skip list further - name: SKIP value: \[Slow\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]|PodSecurityPolicy|LoadBalancer|load.balancer|Simple.pod.should.support.exec.through.an.HTTP.proxy|subPath.should.support.existing|NFS|nfs|inline.execution.and.attach|should.be.rejected.when.no.endpoints.exist - name: PARALLEL value: "true" # we need privileged mode in order to do docker in docker securityContext: privileged: true resources: limits: cpu: 7 memory: 9000Mi requests: cpu: 7 memory: 9000Mi annotations: testgrid-num-failures-to-alert: '10' testgrid-alert-stale-results-hours: '24' testgrid-create-test-group: 'true' fork-per-release: "true"
Similarily, the CI jobs
pull-kubernetes-node-e2e
&pull-kubernetes-node-e2e-conatinerd
are defined in this same one file, here & here respectively.For finding where rest of the CI jobs are defined under the
kubernetes/test-infra
repository, you can utilize the following:-
Do a GitHub respository wide search.
-
Dr do a similar search using k8s hound search engine.
- The
-
Moving one more step further in the testgrid dashboard, let’s try to exapand & explore one of the CI jobs under one of the
dashboard
tabs now.In continuation to our above examples, let’s try to see what comes when you click
presubmits
(dashboard-group) –>presubmits-kubernetes-blocking
(dashboard-tab) –>pull-kubernetes-e2e-kind
(CI job):We could see the following:
on the left hand side, we see this wide grey section
~ These are the logs generated when the CI jobpull-kubernetes-e2e-kind
ran (spanned across multiple rows). To look at the full logs in a plain text, look at the artificats created by the CI job.on the right hand side, we see the small tiles in green, red, white & grey colors
~ These are statuses of this CI job run across a period of time (days, divided by hours, and so on). This tells us how this certain CI job is performing.
Based on how frequently, a certain CI job fails, you would also see that CI jobs are marked as
FLAKY
.The kind of calculations required to decide if a certain job will be marked as flaky based on “this is x no of times when this job failed”, is done using the metrics calculation config files present here ~ kubernetes/test-infra/metrics/configs
- And yea, finally we reached at the end. All these above steps can be repeated to understand different
dashboard-groups
–>dashboards
–>CI jobs
across the kubernetes prow testgrid platform.
That’s all for the time! o/