OpenShift Release Tools

This repository contains the process tooling for OpenShift releases.

Prow

Prerequisites:

  • ci namespace exists
  • BASIC_AUTH_PASS is the password for authenticating with https://ci.openshift.redhat.com/jenkins/
  • BEARER_TOKEN is the token for authenticating with https://jenkins-origin-ci.svc.ci.openshift.org
  • HMAC_TOKEN is used for validating GitHub webhook payloads
  • OAUTH_TOKEN is used for adding labels and comments in GitHub
  • RETEST_TOKEN is used by the retester periodic job to rerun tests for PRs
  • CHERRYPICK_TOKEN is used by the cherrypick plugin to cherry-pick PRs in release branches

Ensure the aforementioned requirements are met and stand up a prow cluster:

make prow

For more information on prow, see the upstream documentation.
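To illustrate the HMAC_TOKEN prerequisite: GitHub signs each webhook body with an HMAC keyed by the shared secret and sends the hex digest in the X-Hub-Signature-256 header, prefixed with sha256=. The Python sketch below shows how such a signature can be checked. The function names and sample values are hypothetical, and this is not prow's own implementation (prow is written in Go).

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Build the header value GitHub would send for this payload and secret."""
    digest = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def is_valid_signature(secret: bytes, payload: bytes, header: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected.encode(), header.encode())


# Hypothetical secret and payload, for illustration only.
secret = b"example-hmac-token"
payload = b'{"action": "opened"}'
header = sign_payload(secret, payload)
print(is_valid_signature(secret, payload, header))
```

A payload whose signature does not match should be rejected before any plugin acts on it.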

Prow alerts

A Prometheus server runs in the CI cluster and is configured to create alerts on top of prow metrics. Clicking the expr field of an alert shows the query that is set up for that alert. For more information on alerts, see the Prometheus docs.

Possible reactions to some of these alerts:

Slow operator sync

These should not be a problem in general, but if any of them persists for more than a couple of hours, max_goroutines can be increased to allow more parallelism in the operators (note that the same option applies to both operators).

It may also be that the operators are lagging due to slow responses from Jenkins. We can figure out whether prow requests to Jenkins are slow by looking at the following metrics:

The metric in question is the Apdex score for GET request latencies from prow to Jenkins, where we assume that most requests will have a 1s RTT and tolerate up to 2.5s of RTT.
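For reference, an Apdex score counts requests at or under the target latency as satisfied, counts requests between the target and the tolerated latency as half, and divides the sum by the total number of requests. The Python sketch below applies that formula with the 1s and 2.5s thresholds mentioned above. The function name and the sample latencies are hypothetical; the alert itself is expressed as a Prometheus query, which is not reproduced here.

```python
from typing import Sequence


def apdex(latencies: Sequence[float], target: float = 1.0, tolerated: float = 2.5) -> float:
    """Apdex = (satisfied + tolerating / 2) / total, with latencies in seconds."""
    if not latencies:
        raise ValueError("at least one latency sample is required")
    satisfied = sum(1 for rtt in latencies if rtt <= target)
    tolerating = sum(1 for rtt in latencies if target < rtt <= tolerated)
    return (satisfied + tolerating / 2) / len(latencies)


# Hypothetical sample: two fast requests, two tolerable ones, one too slow.
score = apdex([0.2, 0.8, 1.5, 2.0, 3.0])
print(score)
```

A score close to 1 means most requests finish within the target latency; a falling score suggests Jenkins is responding slowly.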

Another possible mitigation for slow syncs is to shard the operators further: spin up a new deployment of jenkins_operator and tweak its label selector so that it takes over some of the load of the operator that experiences slow syncs. We would also need to change the label selector of the slow operator and add matching labels to some of the jobs it is handling.
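The routing idea behind this sharding can be sketched in Python as follows, assuming simple equality-based label selectors. The operator names, the shard label key, and the job names are all hypothetical and do not reflect the actual configuration.

```python
from typing import Dict, Optional

Labels = Dict[str, str]


def selector_matches(selector: Labels, job_labels: Labels) -> bool:
    """Equality-based match: every selector key must be present with the same value."""
    return all(job_labels.get(key) == value for key, value in selector.items())


def route_job(job_labels: Labels, operators: Dict[str, Labels]) -> Optional[str]:
    """Return the name of the first operator whose selector matches, else None."""
    for name, selector in operators.items():
        if selector_matches(selector, job_labels):
            return name
    return None


# Hypothetical operators and jobs; "shard" is an illustrative label key.
operators = {
    "operator-a": {"shard": "a"},
    "operator-b": {"shard": "b"},
}
jobs = {
    "unit-tests": {"shard": "a"},
    "e2e-tests": {"shard": "b"},  # relabeled to move load off operator-a
}
assignment = {job: route_job(labels, operators) for job, labels in jobs.items()}
print(assignment)
```

Selectors should be kept disjoint so that no job is picked up by two operators, and a job whose labels match no selector is handled by none.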

Today, we use the following mappings between Jenkins operators and masters:

Infrastructure failures

Errors in tests mean that either an underlying infrastructure failure blocks tests from executing correctly, or the tests execute correctly but a problem in the infrastructure prevents the operators from picking up the results. More often than not, this is an issue with Jenkins.

Failed requests to Jenkins are usually a problem with Jenkins and less often a misconfiguration in prow (e.g. wrong Jenkins credentials). Jenkins may also be overwhelmed by the number of jobs it is running. In that case, max_concurrency can be decreased to free up capacity in Jenkins.
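The effect of lowering max_concurrency can be illustrated with a short Python sketch, assuming the option acts as a plain cap on the number of jobs in flight. The function name and the numbers are hypothetical, and the operator's real scheduling logic is not shown here.

```python
def jobs_to_start(pending: int, running: int, max_concurrency: int) -> int:
    """Number of pending jobs that may start without exceeding the cap."""
    free_slots = max(0, max_concurrency - running)
    return min(pending, free_slots)


# With a cap of 5 and 3 jobs running, 2 of the 10 pending jobs may start.
print(jobs_to_start(pending=10, running=3, max_concurrency=5))
# After lowering the cap to 2, nothing new starts until running jobs finish.
print(jobs_to_start(pending=10, running=3, max_concurrency=2))
```

Under this assumption, a lower cap does not stop jobs that are already running; it only holds back new ones until the running count drops below the cap.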

Assuming access to the CI cluster, logs for the Jenkins operator and master can be gathered with:

oc logs dc/jenkins-origin-operator -n ci
oc logs -f dc/jenkins -n origin-ci

TODO: How to debug our Jenkins instances.

Postsubmit and batch failures

These alerts are usually triggered by flaky tests, but keep in mind that they may also come from infrastructure failures. The only thing that can be done in this case is to triage these failures, open issues in their respective repositories, and nag people to fix them. We need to be especially cautious about failures in batch tests: consecutive failures in batch tests mean we are not merging at a satisfactory rate.

Use the following links to triage these alerts:

https://deck-ci.svc.ci.openshift.org/?type=postsubmit

https://deck-ci.svc.ci.openshift.org/?type=batch

TODO: Forward alerts via e-mail.
