cogitationes, labores, et gratiae (thoughts, works, and gratitudes)

Breaking down CI jobs testgrid for the upstream kubernetes project #23

June 10, 2021

(Thursdays are the weekly 1:1 meetings with dims. You could find the agenda document here)

Today’s meeting was a discussion in continuation to this previous blog, where I wrote about dims giving me a walkthrough of the k8s testgrid dashboard.

As a takeaway, I had left for myself a home-work exercise in the end of that discussion. Below is where I’m trying to take notes of what I learnt broadly after doing that homework exercise.

And yea, the homework was to explore what I learnt about the kubernetes testgrid (above), now through the source code.


  • The source code for the kubernetes CI test jobs are present in this repository here ~ kubernetes/test-infra

  • There are broadly the following three types of CI test jobs in the kubernetes project ~
    • Pre-submit jobs
    • Post-submit jobs
    • Periodic jobs
  • For the purpose of note taking, let’s start by taking the Pre-submit jobs as our example usecase.

    ( Here is something important to know (which is also the reason why I'm picking Pre-submit jobs for our example):

    Only the Pre-submit CI jobs have this properly defined & maintained dashboard. You would not find a similar testgrid dashboard (something that we are going to discuss below in the notes) for the Post-submit jobs or Periodic jobs. Following being the major reasons:

    • The Pre-submit jobs are the kind of jobs that runs everytime before a PR is submitted or merged. So, it contains mostly blocking jobs. And because these are blocking jobs & run on almost all the PRs during a day, we want the number of these jobs to be very less. And thus, the effort of maintaining a dashboard for these limited number of pre-submit jobs sound reasonabe & quite doable.
    • On the other hand, the number of Periodic jobs & Post-submit jobs is very high, & currently most of these are frequently ending up failing only. So, the dashboard usually end up being all red. People argue that these jobs fail so often & so much, that maybe the evaluation/validation criteria of these jobs are not good enough & so, no use of maintaining an all time red board for this large number of jobs. )

    That being said, now we could start exploring things in action.

  • Head onto the kubenetes CI jobs testgrid dashboard groups UI using this link ~ https://testgrid.k8s.io.

    You will find something like following ~

    Screenshot from 2021-06-10 22-51-21

    All these tiles such as conformance, gardener, google, … , presubmits, … , wg-multi-tenancy corresponds to a set of CI jobs dashboards or a dashboard-group (as we call it in the source-code), for a specific project or a sig-group.

    The source-code corresponding to all these dashboard-group tiles (shown in the above screenshot) sits here ~ kubernetes/test-infra/config/testgrids

  • For our example, as I mentioned above, we are checking presubmits dashboard-group.

    The kubernetes testgrid dashboard-group for Pre-submit jobs (as of my writing this blog) looks like the following ~

    Screenshot from 2021-06-10 22-42-07

    You’ll see in the top left corner, a grid-like icon. Clicking that grid-like icon would take you back to the same list of dashboard-groups home page UI as we discussed in the last point above.

    Now, on this pre-submits dashboard-group page, we see all these tabs (or the similar row entries under dashboard column in the Dashboard Group presubmits :: Overview table) ~

    presubmits-alibaba-cloud-csi-driver, presubmits-cloud-provider-alibaba, … , presubmits-test-infra

    These tabs corresponds to the dashboards that goes under our presubmits dashboard-group. And of course, these dashboard tabs will further divide into CI jobs (as we’ll see later in the notes).

    Before we move further, it’s also interesting to look at the information provided in the Dashboard Group presubmits :: Overview table.

    As we can see, there are three columns:

    • Status,
    • Dashboard (which is basically list of names of the dashboard such as presubmits-alibaba-cloud-csi-driver, … , etc.),
    • Health

    This table gives a broader view of how many CI jobs under a certain dashboard are PASSING, or FAILING, or are there any FLAKY jobs. For ex, in case of dashboard presubmits-kubernetes-nonblocking (from the above screenshot),

    • The overall status of this dashboard is FAILING (because we see there is atleast one CI job failing)
    • There are 6 PASSING CI jobs, 15 FLAKY CI jobs, & 1 FAILING CI job.

    The source-code corresponding to this presubmits dashboard-group sits here ~ kubernetes/test-infra/config/testgrids/kubernetes/presubmits/config.yaml

  • Let’s try to break this presubmits/config.yaml file (I linked just above).

    • The following two lines in the very beginning of the config file, tells us about the name of the dashboard-group it corresponds to.

      So, in this as we can see, this config file corresponds to presubmits dashboard-group.

      dashboard_groups:
      - name: presubmits
      
    • further down, this section dashboard_names tells us about what all dashboards are included as tabs in the presubmits dashboard-group.

      As we could match from the above screenshot, all the dashboard tabs we could see there, are listed under this section.

      dashboard_names:
      - presubmits-alibaba-cloud-csi-driver
      - presubmits-cloud-provider-alibaba
      - presubmits-cloud-provider-vsphere-blocking
      - presubmits-cluster-registry
      - presubmits-kops
      - presubmits-kube-batch
      - presubmits-kubernetes-blocking
      - presubmits-kubernetes-nonblocking
      - presubmits-kubernetes-scalability
      - presubmits-misc
      - presubmits-node-problem-detector
      - presubmits-poseidon
      - presubmits-test-infra
      
  • Let’s move another step forward.

    Click on one of the dashboard tab under the presubmits dashboard-group & see what all CI jobs are added under this.

    Let’s take presubmits-kubernetes-blocking as an example. Clicking this dashboard tab, will bring something like this ~

    Screenshot from 2021-06-10 23-47-09

    Now, we could see all the various CI jobs collected under the presubmits-kubernetes-blocking dashboard, like ~

    pull-kubernetes-e2e-kind, pull-kubernetes-e2e-kind-ipv6, … , pull-kubernetes-unit

    The source-code corresponding to this presubmits-kubernetes-blocking dashboard sits in the same config file, presubmits/config.yaml –> kubernetes/test-infra/config/testgrids/kubernetes/presubmits/config.yaml.

    Something to note is that although all the CI jobs under this dashbaord will be listed in same config file, but those CI jobs may or may not be defined here (we’ll discuss more about it, below in the notes).

  • Continuing with the same presubmits-kubernetes-blocking dashboard example above, let’s now explore the source-code for the various CI jobs listed under it:

    The following lines of code under the dasboards –> name: presubmits-kubernetes-blocking –> dashboard_tab section, defines what CI jobs will fall under this particular dashboard.

    dashboards:
      
      ...
        
      - name: presubmits-kubernetes-blocking
      dashboard_tab:
      - name: pull-kubernetes-e2e-kind
        test_group_name: pull-kubernetes-e2e-kind
        base_options: width=10
      - name: pull-kubernetes-e2e-kind-ipv6
        test_group_name: pull-kubernetes-e2e-kind-ipv6
        base_options: width=10
      - name: pull-kubernetes-bazel-build
        test_group_name: pull-kubernetes-bazel-build
        base_options: width=10
      - name: pull-kubernetes-bazel-test
        test_group_name: pull-kubernetes-bazel-test
        base_options: width=10
      - name: pull-kubernetes-e2e-gce
        test_group_name: pull-kubernetes-e2e-gce
        base_options: width=10
      - name: pull-kubernetes-e2e-gce-100-performance
        test_group_name: pull-kubernetes-e2e-gce-100-performance
        base_options: width=10
      - name: pull-kubernetes-node-e2e
        test_group_name: pull-kubernetes-node-e2e
        base_options: width=10
      - name: pull-kubernetes-node-e2e-containerd
        test_group_name: pull-kubernetes-node-e2e-containerd
        base_options: width=10
      - name: pull-kubernetes-integration
        test_group_name: pull-kubernetes-integration
        base_options: width=10
      - name: pull-kubernetes-verify
        test_group_name: pull-kubernetes-verify
        base_options: width=10
      - name: pull-kubernetes-typecheck
        test_group_name: pull-kubernetes-typecheck
        base_options: width=10
      - name: pull-kubernetes-dependencies
        test_group_name: pull-kubernetes-dependencies
        base_options: width=10
      - name: pull-kubernetes-e2e-gce-network-proxy-http-connect
        test_group_name: pull-kubernetes-e2e-gce-network-proxy-http-connect
        base_options: width=10
      - name: pull-kubernetes-conformance-kind-ga-only-parallel
        test_group_name: pull-kubernetes-conformance-kind-ga-only-parallel
        base_options: width=10
      - name: pull-kubernetes-e2e-gce-ubuntu-containerd
        test_group_name: pull-kubernetes-e2e-gce-ubuntu-containerd
        base_options: width=10
      - name: pull-kubernetes-verify-govet-levee
        test_group_name: pull-kubernetes-verify-govet-levee
        base_options: width=10
      - name: pull-kubernetes-files-remake
        test_group_name: pull-kubernetes-files-remake
        base_options: width=10
    

    Now, as we see, this section just lists what all CI jobs will go under this one specific dashboard tab, but the definition of these CI jobs are not given here.

    And that’s our next step to explore below.

  • These CI jobs under the presubmits-kubernetes-blocking tab are defined at multiple places based on the category or sub-project these CI job(s) fall into.

    (something to note here is, one CI job can fall under multiple dashboard tabs, further under multiple dashboard-groups. So, that is one reason why these general purpose CI jobs are not defined just under one dashboard-group config file, but at certain separate places so they could be referenced by multiple dashboards)

    Let’s start by taking few of the CI jobs listed above under the presubmits-kubernetes-blocking dashboard tab for reference example:

      - name: pull-kubernetes-e2e-kind
        cluster: k8s-infra-prow-build
        optional: false
        always_run: true
        decorate: true
        skip_branches:
        - release-\d+\.\d+ # per-release settings
        labels:
          preset-dind-enabled: "true"
          preset-kind-volume-mounts: "true"
        decoration_config:
          timeout: 60m
          grace_period: 15m
        path_alias: k8s.io/kubernetes
        spec:
          containers:
          - image: gcr.io/k8s-testimages/krte:v20210512-b8d1b30-master
            command:
            - wrapper.sh
            - bash
            - -c
            - curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" && e2e-k8s.sh
            env:
            - name: FOCUS
              value: "."
            # TODO(bentheelder): reduce the skip list further
            - name: SKIP
              value: \[Slow\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]|PodSecurityPolicy|LoadBalancer|load.balancer|Simple.pod.should.support.exec.through.an.HTTP.proxy|subPath.should.support.existing|NFS|nfs|inline.execution.and.attach|should.be.rejected.when.no.endpoints.exist
            - name: PARALLEL
              value: "true"
            # we need privileged mode in order to do docker in docker
            securityContext:
              privileged: true
            resources:
              limits:
                cpu: 7
                memory: 9000Mi
              requests:
                cpu: 7
                memory: 9000Mi
        annotations:
          testgrid-num-failures-to-alert: '10'
          testgrid-alert-stale-results-hours: '24'
          testgrid-create-test-group: 'true'
          fork-per-release: "true"
    

    Similarily, the CI jobs pull-kubernetes-node-e2e & pull-kubernetes-node-e2e-conatinerd are defined in this same one file, here & here respectively.

    For finding where rest of the CI jobs are defined under the kubernetes/test-infra repository, you can utilize the following:

  • Moving one more step further in the testgrid dashboard, let’s try to exapand & explore one of the CI jobs under one of the dashboard tabs now.

    In continuation to our above examples, let’s try to see what comes when you click presubmits (dashboard-group) –> presubmits-kubernetes-blocking (dashboard-tab) –> pull-kubernetes-e2e-kind (CI job):

    Screenshot from 2021-06-11 00-35-49

    We could see the following:

    • on the left hand side, we see this wide grey section ~ These are the logs generated when the CI job pull-kubernetes-e2e-kind ran (spanned across multiple rows). To look at the full logs in a plain text, look at the artificats created by the CI job.
    • on the right hand side, we see the small tiles in green, red, white & grey colors ~ These are statuses of this CI job run across a period of time (days, divided by hours, and so on). This tells us how this certain CI job is performing.

    Based on how frequently, a certain CI job fails, you would also see that CI jobs are marked as FLAKY.

    The kind of calculations required to decide if a certain job will be marked as flaky based on “this is x no of times when this job failed”, is done using the metrics calculation config files present here ~ kubernetes/test-infra/metrics/configs

  • And yea, finally we reached at the end. All these above steps can be repeated to understand different dashboard-groups –> dashboards –> CI jobs across the kubernetes prow testgrid platform.

That’s all for the time! o/