taskcluster / taskcluster-tools

Tools for debugging, inspecting and managing Taskcluster

Home Page: https://tools.taskcluster.net/

License: Mozilla Public License 2.0

CSS 2.04% JavaScript 97.96%

taskcluster-tools's Introduction

Taskcluster Tools

This repository contains a collection of useful tools for use with Taskcluster. Generally, we strive not to add UI to Taskcluster components, but instead offer well-documented APIs that can be easily consumed using a Taskcluster client library.

Developing Taskcluster Tools

Prerequisites for building Taskcluster Tools

Node.js v8+
Yarn

Building

First, fork this repository to another GitHub account. Then you can clone and install:

git clone https://github.com/<YOUR_ACCOUNT>/taskcluster-tools.git
cd taskcluster-tools
yarn

Code Organization

src/: source code
src/App: top-level component, holds page layout, route rendering, and login handling
src/views: route-based components typically matching a URL, e.g. /groups maps to src/views/UnifiedInspector
src/components: generic components that can be used in any view (not view-specific)

Tasks and Configuration

This project is built with Neutrino, using neutrino-preset-frontend-infra to:

Compile ES2015+ syntax to ES5-compatible JS
Compile React JSX to de-sugared JS
Show ESLint errors based on R&P rules
Build views into cacheable bundles
Import files directly via JS and output them to the bundle

Testing changes

Install npm dependencies and start it up:

yarn
yarn start

This will start a local development server on port 9000 (http://localhost:9000).

Any ESLint errors across the project will be displayed in the terminal during development.

Available targets

yarn start: the default development build, watches src/, and serves on http://localhost:9000/
yarn build: builds src/ files into a build/ directory

Memory problems during development

When building a larger project like taskcluster-tools, Node.js may run out of memory because of the number of files being built during development. As a workaround, instead of running yarn start, run the following to run the same command with more memory:

node --max-old-space-size=4096 node_modules/.bin/neutrino start

You may need to adjust the memory size to suit your machine.
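To confirm that a larger heap actually took effect, a minimal sketch is to print V8's heap limit from Node.js itself. `v8.getHeapStatistics()` is a standard Node.js API; the file name `check-heap.js` is only illustrative. Expect a value at or somewhat above the number you passed, since the limit also covers V8's other heap spaces.

```javascript
// check-heap.js (hypothetical file name): prints the V8 heap limit for this process.
// Run it with the same flag you pass to neutrino, e.g.:
//   node --max-old-space-size=4096 check-heap.js
const v8 = require('v8');

function heapLimitInMiB() {
  // heap_size_limit is reported in bytes
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`V8 heap limit: ${heapLimitInMiB()} MiB`);
```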

Testing

Until someone comes up with something better, all testing is manual. Open the tools and check that they work. :)

Ngrok Setup (optional)

Ngrok allows you to expose a web server running on your local machine to the internet. Here it is used to create an HTTPS connection so that you can log in to taskcluster-tools. To use ngrok:

Create a free account on ngrok.
Install ngrok - npm install -g ngrok or yarn global add ngrok
Run ngrok - ngrok http 9000

Note: You have to run ngrok in a separate terminal/console.

Deployment

Taskcluster uses Travis to automatically deploy the Heroku application after a successful build on master. This is done in .travis.yml:

deploy:
  provider: heroku
  api_key:
    secure: "<your encrypted API key>"

If a deployment results in the site throwing a 404, try updating the API key. An invalid key will not deploy properly.

Encrypting a key

Run heroku auth:token on the command line and then copy the token.
Navigate to https://travis-encrypt.github.io/.
Write taskcluster/taskcluster-tools in the first input field and paste the copied token in the second text field.
Click the encrypt button.
Replace the <your encrypted API key> placeholder in the snippet above with the encrypted key that was generated for you.

taskcluster-tools's People

petemoore djmitche seinlin gregarndt acmiyaguchi maurodoglio mshal selenamarie luser walac rillian imbstack armenzg garbas claudijd techchic aholachek elena-posea majabogeski runt18 andreadelrio kristelteng reznord ahal ckousik anarute tkiethanom digideskio gianabhateja anshikamittal eliperelman tp-tc srfraser hammad13060-zz pavankarthikboddeda wlach sidd0107 swapneshks thenavigat edunham sportsbitenews prachi1210 alisha17 ydidwania aneeshusa biboswan dollymathur70 cpswsg bessii fiennyangeln kriti21 serendicoder adlucem hashi93 yash-iiith blindacai kanikasaini aditi574 laghee kritisingh1 ihsavru akshitac8 mynahmarie davehouse adityac8 tuhina2020 gabbyjose nthomas-mozilla thomcc rv404674 saumya29 hrushikeshbodas hybrid1999 jgraham morristech aditya-kolla raduiman-zz drvegass puppup420247-org edil-it-them j420247 sirinartk

taskcluster-tools's Issues

CODE_OF_CONDUCT.md isn't correct

Your required text does not appear to be correct

As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

Required Text - All text under the headings Community Participation Guidelines and How to Report, are required, and should not be altered.
Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please reach out to [email protected].

(Message COC003)

Cannot use interactive connect: a.resume is not a function

I tried to connect to an interactive task, https://tools.taskcluster.net/tasks/XXXX/connect, but I only see an eternally spinning taskcluster logo.

The JS console shows: a.resume is not a function

I stepped through it in a JS debugger and found that it is caused by this line:

taskcluster-tools/src/views/InteractiveConnect/InteractiveConnect.js, line 120 in a793bf2:

listener.resume();

Most likely related to #321.
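Until the listener API in use is confirmed, one defensive option is to feature-check the method before calling it, so the view reports an error instead of spinning forever. A minimal sketch; `safeResume` is a hypothetical helper, not part of any Taskcluster library.

```javascript
// Hypothetical helper: call listener.resume() only if the listener implements it.
// Returns true when resume was called, false otherwise, so the caller can show
// an error message instead of an endless loading spinner.
function safeResume(listener) {
  if (listener && typeof listener.resume === 'function') {
    listener.resume();
    return true;
  }
  console.error('Listener does not implement resume(); check the client library version');
  return false;
}
```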

@eliperelman

proposal: renaming files to look like routes...?

@walac and I had a bit of a problem finding some redirects today...

Would it be smart to rename components after their URL route? Or is there no way that would work?

For example, src/views/one-click-loaner.js would be the redirect for /one-click-loaner/#MoCWhMwpTtKxzbvWuMexrQ, and the same for other tools.

Or would that require splitting too many things?

AWS-Provisioner work for new spot model

There are a bunch of small changes we should be making to the AWS-Provisioner page. Not sure if you'd rather have these split into smaller issues, but as they're likely tiny patches it's probably reasonable to do in one PR.

Roughly speaking, spot requests are no longer tracked anywhere by us, so we should remove them from the UI completely. Basically, the regexp [Ss]pot[- ][Rr]equest should not match anything in the provisioner UI.
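One way to enforce that rule is to check the UI's user-visible strings against the same regular expression. A minimal sketch; the function name and the sample labels are made up for illustration.

```javascript
// The pattern from this issue: nothing in the provisioner UI should match it.
const SPOT_REQUEST = /[Ss]pot[- ][Rr]equest/;

// Returns the strings that still mention spot requests.
function findSpotRequestMentions(strings) {
  return strings.filter(s => SPOT_REQUEST.test(s));
}

// Example with made-up UI labels:
console.log(findSpotRequestMentions(['Spot Request Id', 'Instance Id', 'pending spot-request']));
// -> [ 'Spot Request Id', 'pending spot-request' ]
```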

The changes are:

Remove Spot Requests from the worker-type resources tab. This table will forever be empty
Remove from the bar graphs any mention of spot-request in the tooltip
The Spot Request Id column on the worker type resources tab should be removed
The worker-type status table should have the requested spot capacity removed

This should only land after Wednesday's EC2-Manager and AWS-Provisioner deployment.

cc @eliperelman @helfi92

Inspector - filtered list of tasks isn't updated

RelEng releaseduty spends time in the inspector to keep an eye on the state of release automation. The status bar keeps the task counts for each state up to date, but the task list gets out of date. This regressed in the last week or so.

Steps to reproduce:

load a large & active task group, probably anything from a gecko push would be fine
use the Status filter to select a state which isn't All, eg Running
do something else while some tasks change state
observe the task count for the filter has changed, but not the list

Workaround - change the filter to some other value and back again.
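The symptom (counts update, list does not) suggests the filtered list is computed once and cached while the counts are recomputed. A sketch of the alternative, deriving both from the same up-to-date task map on every update. The `tasksById` shape and the `'all'` filter value are assumptions, not the inspector's actual state.

```javascript
// Assumed shape: tasksById maps taskId -> { status: { state } }, and is replaced
// (not mutated) whenever a Pulse message updates a task.
function deriveView(tasksById, filter) {
  const tasks = Object.values(tasksById);
  const counts = {};
  for (const t of tasks) {
    counts[t.status.state] = (counts[t.status.state] || 0) + 1;
  }
  // Recompute the visible list from the same data as the counts,
  // so the two cannot drift apart.
  const visible = filter === 'all' ? tasks : tasks.filter(t => t.status.state === filter);
  return { counts, visible };
}
```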

Tools UI blocks when actions.json is missing context

I forgot to add a context in my first attempt at writing an actions.json, and that completely blocked the tools.tc.net UI for the task group: https://tools.taskcluster.net/groups/A889YN3qTjiHyUhiI46ezA

don't scroll current tasks off screen

Given a 3000+ task view, sometimes I find the task I care about before the entire task graph is loaded. Then I click on it to view it, but right before I click, 100 more tasks load and the task shifts off screen, and I click on a different task. Then I have to go back, and find the task, and either wait for all 3000+ tasks to load, or try the task-click-lotto again.

I'd love it if we could load the tasks in the background but keep the current scroll position fixed. I'm not sure if this is possible... maybe if we had a sort-by-latest-loaded instead of the default sort-by-alphabetical-label? Ideally, though, it would keep sorting but grow the above- and below-screen buffers rather than shifting the view down or up.
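One common approach is to remember which task is at the top of the viewport and, after new rows are inserted, shift the scroll offset by the height of the rows that landed above it. A sketch assuming fixed-height rows; `ROW_HEIGHT` and `anchoredScrollTop` are hypothetical names.

```javascript
const ROW_HEIGHT = 32; // hypothetical fixed row height in pixels

// Given the task ids before and after a batch loads, and the id of the row that was
// at the top of the viewport, return the scrollTop that keeps that row in place.
function anchoredScrollTop(oldIds, newIds, anchorId, oldScrollTop) {
  const oldIndex = oldIds.indexOf(anchorId);
  const newIndex = newIds.indexOf(anchorId);
  if (oldIndex === -1 || newIndex === -1) {
    return oldScrollTop; // anchor unknown: leave the scroll position alone
  }
  // Rows inserted above the anchor push it down by (newIndex - oldIndex) rows.
  return oldScrollTop + (newIndex - oldIndex) * ROW_HEIGHT;
}
```

Because only the difference is added, any partial-row offset the user had scrolled to is preserved.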

Travis failed to deploy

https://travis-ci.org/taskcluster/taskcluster-tools/builds/442179673#L679

Status tab on the worker type detail is empty

Example: https://tools.taskcluster.net/aws-provisioner/gecko-1-b-linux/

This must be some sort of regression.

The textarea to edit the trigger context is too small

It makes it very difficult to see what you're writing.

pulse-inspector: this.listener.pause is not a function

This is probably a bug in https://github.com/taskcluster/taskcluster-client-web, where the pause function isn't implemented.

Terminate instances buttons need to use the correct provisioner id

In the staging instance of the provisioner UI, all the links for the Terminate buttons use the wrong provisioner base url

https://tools.taskcluster.net/aws-provisioner-staging/gecko-1-b-linux/resources

calls

DELETE https://ec2-manager.taskcluster.net/v1/worker-types/gecko-1-b-linux/resources

When it should be calling

DELETE https://ec2-manager-staging.taskcluster.net/v1/worker-types/gecko-1-b-linux/resources

I'm not entirely sure where this string is coming from, so it might not be in this repository, but the end result here is that this repository is surfacing the issue. Please refile in a new repository if more appropriate.
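One way to avoid a hard-coded production host is to derive the EC2-Manager base URL from the provisioner id in the route. The mapping below is an assumption inferred only from the two URLs in this issue, and `terminateAllUrl` is a hypothetical helper.

```javascript
// Assumed mapping, inferred from the URLs above: the staging provisioner pairs
// with the staging EC2-Manager; the production provisioner with production.
const EC2_MANAGER_BASE = {
  'aws-provisioner': 'https://ec2-manager.taskcluster.net/v1',
  'aws-provisioner-staging': 'https://ec2-manager-staging.taskcluster.net/v1',
};

// Build the URL behind the "terminate all" button for a worker type.
function terminateAllUrl(provisionerId, workerType) {
  const base = EC2_MANAGER_BASE[provisionerId];
  if (!base) {
    throw new Error(`No EC2-Manager configured for provisioner ${provisionerId}`);
  }
  return `${base}/worker-types/${encodeURIComponent(workerType)}/resources`;
}
```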

cc @helfi92 @eliperelman

option to sort clients by `lastDateUsed`

This would make it easier to kill old unused clients...
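A sketch of the sort, assuming each client object carries `clientId` and `lastDateUsed` as an ISO 8601 date string (the field name comes from this issue). Oldest first, so stale clients appear at the top.

```javascript
// Sort clients so the least recently used come first. Returns a new array
// and leaves the input untouched.
function sortByLastDateUsed(clients) {
  return [...clients].sort(
    (a, b) => new Date(a.lastDateUsed).getTime() - new Date(b.lastDateUsed).getTime()
  );
}
```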

Remove all references to `slugid` or `nice` in favor of methods exposed in taskcluster-client-web

The slugid package should no longer be used by tools anywhere. Instead, we should be pulling v4 and nice slugs from taskcluster-client-web.

Inspector is not auto-updating the state of a task

When you have a task that takes a while to resolve for example, the task inspector is not updating the state of that task when it changes. This is not the wanted behavior.

Two Taskcluster top bars when connecting to a shell

Make Credentials menu collapse on select

In theory, react-bootstrap supports this now but in practice, it doesn't seem to work.

You can see the issue by selecting one of the "Sign In" options on the "Sign In" menu. A new window opens, but if you go back to the original window, there's the menu still hanging out, open.

Filter on task state for Worker Explorers

Analyzing the status of current workers, especially those on aws-provisioner, it's hard to get information, because of the X-elements per page limitation. Being able to filter on the task state, e.g., "running" would allow me to easily track the status.

New Heroku API key needed (yes, again)

The key committed in #575 will not actually work after cutover to SSO. A new API key from a service account will be needed.

I'm sending emails to @helfi92 & @djmitche with details on approach. There need not be any downtime following those instructions.

actions.json is not fully validated before trying to trigger actions

I had an actions.json that was missing the variables field completely. That led to the following error when trying to trigger an action:

Error can't convert undefined to object

which is not really helpful.

Thankfully, the js debugger got me to:
https://github.com/taskcluster/taskcluster-tools/blob/master/src/views/UnifiedInspector/ActionsMenu.jsx#L447

It turns out actions is used without validation of its schema, and variables is required per said schema. A schema validation error would have been more useful.
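A minimal sketch of checking the fields that tripped up the tools here and in the missing-`context` report above, before using them. It is not a substitute for validating against the published actions.json schema with a real JSON Schema validator; `validateActionsJson` is a hypothetical name and only checks `variables`, `actions`, and each action's `context`.

```javascript
// Hypothetical pre-flight check for actions.json. Returns a list of
// human-readable problems; an empty list means these particular checks passed.
function validateActionsJson(doc) {
  if (doc === null || typeof doc !== 'object') {
    return ['actions.json must be a JSON object'];
  }
  const problems = [];
  if (doc.variables === null || typeof doc.variables !== 'object') {
    problems.push('missing required field "variables"');
  }
  if (!Array.isArray(doc.actions)) {
    problems.push('missing required field "actions" (expected an array)');
  } else {
    doc.actions.forEach((action, i) => {
      if (!Array.isArray(action.context)) {
        problems.push(`actions[${i}] ("${action.title || 'untitled'}") has no "context" array`);
      }
    });
  }
  return problems;
}
```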

deleting hook schedules is buggy

@Callek and I have noticed the following behavior:

When we try to delete the only schedule from a hook, the trash button does nothing
When we create a 2nd schedule on a hook, the trash button for the first item deletes the newly created 2nd item
@Callek noticed that the trash button for the 2nd item deletes the first

I was thinking this might be either

permissions: I didn't create the hook that I was trying to delete the schedule for, or
an off-by-one bug (indexes start at 0), so deleting the first item (index 1 in the buggy code) deletes the 2nd item
though the last observation, the trash button for the 2nd item deleting the 1st schedule, doesn't fit this unless the indexes wrap around

My workaround for this has been to copy the hook's task definition, save it elsewhere, and then delete it completely. We can then recreate the hook with the right schedule(s). This isn't the ideal workflow, so I'm hoping this is a relatively easy fix.
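Deleting a different entry than the one clicked is often caused by removing by a stale or shifted index, or by list items keyed by their index. A sketch of an immutable, index-exact removal; `removeScheduleAt` is hypothetical and not the hooks manager's code.

```javascript
// Return a new schedule array without the entry at `index`.
// Out-of-range indexes leave the schedule unchanged rather than deleting a neighbour.
function removeScheduleAt(schedule, index) {
  if (!Number.isInteger(index) || index < 0 || index >= schedule.length) {
    return schedule;
  }
  return schedule.filter((_, i) => i !== index);
}
```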

Placement of `new hook` and `refresh` in hooks manager

In the hooks manager, the buttons for creating a new hook or refreshing the hooks list move toward the end of the page as more hooks are added, so the user needs to scroll to the bottom of the page to reach them.

In my opinion, the UI would be more intuitive if the buttons appear on the top.

@djmitche thoughts?

Wrong syntax highlighting with escaped double quotes

{
    "ANALYSIS_ID": "PHID-DIFF-cf6l47vzbhkej7jbhynn",
    "ANALYSIS_SOURCE": "phabricator",
    "_JAVA_OPTIONS": "-Duser.home=\"/tmp/mozilla-state\""
}
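The `_JAVA_OPTIONS` value contains escaped quotes (`\"`), so a highlighter that ends a string at the next `"` splits it in the wrong place. A string-matching pattern has to consume each escape sequence as a unit. A minimal sketch of that rule, independent of whichever highlighter the tools actually use; `stringTokens` is a hypothetical helper.

```javascript
// A JSON string: an opening quote, then any run of characters that are neither
// a quote nor a backslash, or a backslash followed by any character, then the closing quote.
const JSON_STRING = /"(?:[^"\\]|\\.)*"/g;

// Return every string literal on a line, escapes included.
function stringTokens(line) {
  return line.match(JSON_STRING) || [];
}
```

Applied to the `_JAVA_OPTIONS` line, this yields two tokens: the key, and the whole value including both `\"` sequences.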

Quickly jump to a line number

Presently when viewing the logviewer, you can only highlight one or many lines, not jump. To highlight, you have to change the URL then refresh the page.

It would be great to have a way to quickly jump to a line number. One neat way to do that is through the keyboard shortcut Cmd + L. This shortcut could show a prompt asking for the line number then send a prop scrollToLine to the logviewer component. This allows the developer to quickly jump to a line number without fussing with the URL.

Logins reset page state

The process of using tools used to be:

Navigate to the edit page, make changes, click a button
Get an error because you're not logged in
Login (page changes to show login)
Click the button again

Now, logging in resets the page to its default status based on the route, deleting all of the information I had entered.

Weird error message in task-inspector

@helfi92 any ideas:
https://screenshots.firefox.com/NMKsbGPwBI01JrR8/tools.taskcluster.net

rail-afk> recently I have noticed that some tasks give me a 404-like error in the UI, even though they should be there, the error message contains the status. example: https://tools.taskcluster.net/groups/LJDnka8tTtOOdlm6-S1vFw/tasks/uBwIJZ77QgSah9sxsacppg/details
19:03 any idea what is going on?
19:05 I've seen this for tasks submitted via the old scheduler API and via in-tree scheduling

Maybe it has to do with the fact that the task doesn't have any runs yet...
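If that hypothesis is right, the view needs an explicit code path for a status whose `runs` array is empty. A sketch assuming the queue's task status shape (`state` plus a `runs` array whose entries have `runId` and `state`); `describeLatestRun` is hypothetical.

```javascript
// Summarize the most recent run, or explain that there is none yet,
// instead of treating a task with zero runs as missing.
function describeLatestRun(status) {
  const runs = (status && status.runs) || [];
  if (runs.length === 0) {
    return `No runs yet (task is ${status ? status.state : 'unknown'})`;
  }
  const run = runs[runs.length - 1];
  return `Run ${run.runId}: ${run.state}`;
}
```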

Inspector creates a task and a group listener

https://github.com/taskcluster/taskcluster-tools/blob/master/src/views/UnifiedInspector/Inspector.js

We should probably just create one listener.
I.e., listen for all events that have to do with the given taskGroupId, since we're always loading the full task group anyway, right?

We don't have to listen for artifacts separately either... Just make a single pattern with the taskGroupId and bind all the exchanges.
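A sketch of what that single pattern could look like. It assumes the queue's task routing-key layout (`primary.<taskId>.<runId>.<workerGroup>.<workerId>.<provisionerId>.<workerType>.<schedulerId>.<taskGroupId>.#`) and the `exchange/taskcluster-queue/v1/...` exchange names; check both against the queue's Pulse exchange reference before relying on them. `taskGroupBindings` is a hypothetical helper that only builds binding data.

```javascript
// Queue exchanges the inspector cares about (assumed names; verify against the
// taskcluster-queue exchange reference).
const QUEUE_EXCHANGES = [
  'task-defined', 'task-pending', 'task-running', 'artifact-created',
  'task-completed', 'task-failed', 'task-exception',
];

// One routing-key pattern, shared by every exchange, matching only the given task group.
// Key fields: primary.taskId.runId.workerGroup.workerId.provisionerId.workerType.schedulerId.taskGroupId.reserved
function taskGroupBindings(taskGroupId) {
  const pattern = `primary.*.*.*.*.*.*.*.${taskGroupId}.#`;
  return QUEUE_EXCHANGES.map(name => ({
    exchange: `exchange/taskcluster-queue/v1/${name}`,
    routingKeyPattern: pattern,
  }));
}
```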

This could reduce the load on tc-events quite a bit. Right now there are a lot of reconnects all the time :)

@eliperelman,
Also, could we use something like https://developer.mozilla.org/en-US/docs/Web/API/Page_Visibility_API to decide whether we should reconnect?
Obviously, if the user has hit the "notify me on changes" button to get desktop notifications when the task changes, then we have to reconnect... Is there another API we could use to detect whether the user is away from the computer?

notify when all tasks complete

Currently there is a checkbox that notifies you when tasks fail. It would be very useful to also have a checkbox that notifies you when every task in a task group is completed.
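A sketch of the check such a checkbox would need, assuming each task carries a `status.state` from the queue's set of states (`unscheduled`, `pending`, `running`, `completed`, `failed`, `exception`). Whether "completed" should also count failed or exception tasks as done is a product decision, so both variants are shown; the function names are hypothetical.

```javascript
const FINISHED_STATES = new Set(['completed', 'failed', 'exception']);

// True when every task in the group completed successfully.
function allCompleted(tasks) {
  return tasks.length > 0 && tasks.every(t => t.status.state === 'completed');
}

// True when no task is still unscheduled, pending or running, regardless of outcome.
function allFinished(tasks) {
  return tasks.length > 0 && tasks.every(t => FINISHED_STATES.has(t.status.state));
}
```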

Taskcluster Web banner in the Task & Group Inspector view

Let's add a way for users to try the new UI in the task & group inspector view.

two 3000+ task taskgroup inspectors hang nightly

I'm hoping to be able to open 2-3 active nightly graphs without having a hung browser. It's possible this is a nightly issue, and possible it's a tools.tc.net issue, and certainly possible that both could use improvements.

11:27 <aki> two tabs with taskgraphs of 3000+tasks + treeherder seem to hang nightly
11:31 <aki> i wonder if we can lazy load the tasks in a task group, kind of like how gmaps doesn't load the entire world's tiles, just the ones in the area you're looking at
11:32 — aki waves hands vaguely
11:33 <@dustin> sounds like a good PR for you to make :)
11:33 <@dustin> I think it only loads the taskgraph, not fetching each individual task
11:34 <aki> i do have the use case where i'm following a single task, and it's keeping track of the taskgraph

We just unified the two views (task inspector / taskgroup inspector) which is an overall win for UX. Hanging when I'm viewing a single task + a second taskgroup isn't ideal, though; I'm wondering if there's anything we can do on the tools.tc.net side to make this better.

(I do like the current unified UX quite a bit: the task completed, and I was then easily able to look at the now-running tasks that had been unscheduled because they depended on the first task. It's the hang that I'm trying to address, so if we can fix that without affecting the UI, that's the ideal solution for me.)

tool to expand scope or set of scopes

Say I have a role like assume:repo:github.com/glandium/git-cinnabar:branch:master and I want to know what scopes it grants. Currently I need to call auth.expandScopes; it would be nice to have a tool for this.

Note: assume:repo:github.com/glandium/git-cinnabar:branch:<branch> is basically a pattern made up by tc-gh...

Expose dependencies short name and current icon status

When I am looking at a task that has dependencies, we currently only have a list of taskIds, which makes it hard to track their status. It would be cool if we had, in addition:

dependency's short name
current status of the dep (unscheduled, pending, running, ...)
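A sketch of joining the dependency ids with names and states the inspector may already have loaded. It assumes a hypothetical `tasksById` map holding each task's definition (`task.metadata.name`, a standard task-definition field) and status; dependencies outside the loaded data fall back to the bare id and an `unknown` state.

```javascript
// Build display rows for a task's dependencies.
function dependencyRows(task, tasksById) {
  return (task.dependencies || []).map(taskId => {
    const dep = tasksById[taskId];
    return {
      taskId,
      name: dep ? dep.task.metadata.name : taskId,
      state: dep ? dep.status.state : 'unknown',
    };
  });
}
```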

`taskcluster signin` integration doesn't extend expiration

If the .../cli client exists and is expired, it just rotates the accessToken and that's it. The client I get back is still expired.

Note: We do set deleteOnExpiration, but I'm pretty sure that taskcluster-auth still waits a few days after expiration before deleting clients.

Follow log doesn't work

Ticking the follow log box in "Run Logs" unticks itself when the log is updated.

Pulse Inspector UI is broken

Test issue from Rollbar

This is a test issue created by Rollbar. If you can see this, it works!

notify on task failures should not happen on reruns

When monitoring a big task graph (e.g. mozilla-central), you receive a lot of failure notifications for tasks that later succeed on rerun.

Would it be possible to hold the notification until reruns are exhausted?

feature request: filter on label

We have filters for running/pending/completed etc, which are helpful.

Given a 3000+ task graph, it's not always enough. If I want to look at all the windows l10n in a running graph, it can be frustrating to have tests or completing linux or mac jobs shifting the task view... I'll probably open another issue about that. It's also nice if I want to view those 20ish tasks, no matter what state they're in, without having to scroll through the 3000+ tasks in the whole graph.
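A sketch of a label filter, assuming the label shown in the inspector is the task's `metadata.name`. A case-insensitive substring match keeps it simple and ignores task state; `filterByLabel` is a hypothetical name.

```javascript
// Keep tasks whose label contains the query (case-insensitive), in any state.
function filterByLabel(tasks, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) {
    return tasks;
  }
  return tasks.filter(t => t.task.metadata.name.toLowerCase().includes(needle));
}
```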

Edit Client view should note expected format for entering a list of scopes

I.e., state that it expects one scope per line under "Client Scopes", or improve the error messages about incorrect scopes.

I ran into some silly confusion because I was separating scopes with comma. See https://bugzilla.mozilla.org/show_bug.cgi?id=1462459
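A sketch of parsing the textarea one scope per line and warning about the comma mistake from the bug. `parseScopesText` is hypothetical; it only warns, since a comma is not necessarily invalid inside a scope.

```javascript
// Split textarea contents into scopes (one per line) and warn about lines that look
// like several scopes joined with commas. Blank lines are ignored; trim() also drops '\r'.
function parseScopesText(text) {
  const scopes = text.split('\n').map(line => line.trim()).filter(Boolean);
  const warnings = scopes
    .filter(scope => scope.includes(','))
    .map(scope => `"${scope}" contains a comma; enter one scope per line`);
  return { scopes, warnings };
}
```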

Show all clients you have scope to manage

We have some heuristic to create a default filter on clients in the client inspector...

I suggest we make a checkbox "show clients you can manage" (or some better wording).

It would use the scopes you have to determine which clientIds you can modify, and only show you those clients. This requires the inspector to understand the scopes used by taskcluster-auth.

Note, there is a proposal to simplify these scopes in: taskcluster/taskcluster-rfcs#99
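Taskcluster's scope matching is simple enough to sketch: a held scope satisfies a required scope if the two are equal, or if the held scope ends in `*` and the required scope starts with everything before that `*`. The client filter below builds on that. The scope name `auth:update-client:<clientId>` is an assumption here and may not be exactly what taskcluster-auth checks; the function names are hypothetical.

```javascript
// True if a single held scope satisfies the required scope (trailing-star semantics).
function scopeSatisfies(held, required) {
  if (held === required) return true;
  return held.endsWith('*') && required.startsWith(held.slice(0, -1));
}

function hasScope(heldScopes, required) {
  return heldScopes.some(held => scopeSatisfies(held, required));
}

// Hypothetical filter: keep clients the user could update, assuming the relevant
// scope is auth:update-client:<clientId>.
function manageableClients(clients, heldScopes) {
  return clients.filter(c => hasScope(heldScopes, `auth:update-client:${c.clientId}`));
}
```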

"Trigger context is not valid YAML" when the trigger context is invalid, even though it is JSON

The trigger context is JSON. If you write invalid JSON, a "Trigger context is not valid YAML" message is shown instead of "Trigger context is not valid JSON".
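Since the field is parsed as JSON, the error can name JSON and pass along the parser's own message. A sketch; `parseTriggerContext` is hypothetical.

```javascript
// Parse the trigger context as JSON and report errors in JSON terms.
function parseTriggerContext(text) {
  try {
    return { context: JSON.parse(text), error: null };
  } catch (err) {
    return { context: null, error: `Trigger context is not valid JSON: ${err.message}` };
  }
}
```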

Use EC2-Manager API to fetch instance state and summaries

I've left the AWS Provisioner state method in place, but it's deprecated. It really needs to be deleted because it causes a decent amount of overhead and adds a lot of complexity to the provisioner.

The ec2-manager API is now the best source of information about instance details.

We should use the GET /v1/worker-types/:workerType/stats method to determine instance counts and capacity, and the GET /v1/worker-types/:workerType/state method to get detailed information about the EC2 resources.

We should expose the DELETE /region/:region/instance/:instanceId and DELETE /region/:region/spot-instance-request/:requestId methods to delete specific instances in the EC2 resource table if the user has the correct ec2-manager:manage-instances:<region>:<instanceId> or ec2-manager:manage-spot-requests:<region>:<requestId> scope.

We should expose the DELETE /worker-types/:workerType/resources method to kill all instances of a workertype if the user has the correct ec2-manager:manage-resources:<workerType> scope.
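A sketch of gating the delete buttons on the scopes named above. The paths and scope strings come from this issue; the `/v1` prefix on every path is an assumption taken from the EC2-Manager URLs quoted earlier on this page, and `deleteActions` / `canCall` are hypothetical helpers. Scope matching follows Taskcluster's trailing-`*` rule.

```javascript
const EC2_MANAGER = 'https://ec2-manager.taskcluster.net/v1'; // assumed prefix

// Each delete action: the endpoint to call and the scope the user needs.
const deleteActions = {
  instance: (region, instanceId) => ({
    url: `${EC2_MANAGER}/region/${region}/instance/${instanceId}`,
    scope: `ec2-manager:manage-instances:${region}:${instanceId}`,
  }),
  spotRequest: (region, requestId) => ({
    url: `${EC2_MANAGER}/region/${region}/spot-instance-request/${requestId}`,
    scope: `ec2-manager:manage-spot-requests:${region}:${requestId}`,
  }),
  allResources: workerType => ({
    url: `${EC2_MANAGER}/worker-types/${workerType}/resources`,
    scope: `ec2-manager:manage-resources:${workerType}`,
  }),
};

// Show the button only if some held scope matches exactly or via a trailing '*'.
function canCall(heldScopes, action) {
  return heldScopes.some(held =>
    held === action.scope ||
    (held.endsWith('*') && action.scope.startsWith(held.slice(0, -1))));
}
```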

We should create a basic EC2 Manager debugging tools page that exposes the following endpoints with a simple "click a button, display the response" system:

/v1/internal/regions
/v1/internal/spot-requests-to-poll
/v1/internal/db-pool-stats
/v1/internal/all-state
/v1/internal/sqs-stats
/v1/internal/purge-queues (please do not run while testing, it will mess things up in production)

Use `slugid` from taskcluster-client-web

taskcluster-tools/src/views/TaskCreator/TaskCreator.js, line 5 in 9121a7f:

import { nice } from 'slugid';

uses nice() from slugid, which is a Node package (it uses Buffer, which is undefined in browsers).
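For illustration, a browser-safe way to produce a "nice" slug without `Buffer`. This follows the slugid "nice" convention as an assumption: a random v4 UUID whose first bit is cleared, so the unpadded URL-safe base64 slug starts with `[A-Za-f]`. It is a sketch only; the fix this issue asks for is to use the slug functions that taskcluster-client-web exports. `niceSlug` and `base64url` are hypothetical names.

```javascript
const URLSAFE = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_';

// Unpadded URL-safe base64 without Buffer (works in browsers and Node.js).
function base64url(bytes) {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i];
    const b1 = bytes[i + 1];
    const b2 = bytes[i + 2];
    out += URLSAFE[b0 >> 2];
    out += URLSAFE[((b0 & 0x03) << 4) | ((b1 || 0) >> 4)];
    if (b1 !== undefined) out += URLSAFE[((b1 & 0x0f) << 2) | ((b2 || 0) >> 6)];
    if (b2 !== undefined) out += URLSAFE[b2 & 0x3f];
  }
  return out;
}

// Hypothetical "nice" slug: random v4 UUID bytes with the top bit of the first byte cleared.
function niceSlug() {
  // Browsers expose crypto globally; older Node.js versions need the webcrypto module.
  const webcrypto = globalThis.crypto || require('crypto').webcrypto;
  const bytes = webcrypto.getRandomValues(new Uint8Array(16));
  bytes[6] = (bytes[6] & 0x0f) | 0x40; // UUID version 4
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // RFC 4122 variant
  bytes[0] &= 0x7f; // first slug character falls in [A-Za-f]
  return base64url(bytes); // 16 bytes -> 22 characters
}
```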

Support provisioning health and error reporting in aws-provisioner tools pages

We've talked about this in an email thread, but I wanted to get the issue on file. The supporting work in the EC2-Manager service has been completed, awaiting deployment to our production instance. All URLs are pointing at our staging instance. I'm also working on figuring out why we're not getting API references to publish for the EC2-Manager. These are the changes that I think we should make:

Split worker type details into its own page
On the overall view of the AWS-Provisioner page, have a table that shows the latest errors encountered by this provisioner. The supporting API is this
On the overall view of the AWS-Provisioner page, have a table that shows the 'health' of each region/az/instanceType. The supporting API is this. The values in this API response are approximate stats on the configuration. How they're displayed isn't critical to me, but I suspect highlighting rows which have non-zero error values as warnings and rows where those counts are >40% of the success rate as errors would be useful. Ideally, it could be viewed by Region, AZ and Instance Type. The idea is to be able to get an overview of an entire Region, AZ or Instance Type.
On the worker type detail, add a new tab to the view called 'Health' that contains the Errors for that worker type. The supporting API is this. Basically, a worker type specific view of the other errors page
On the worker type 'Health' tab, add a worker type specific overview of Region, AZ and Instance Type. The supporting API is this
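A sketch of the highlighting rule suggested above, reading ">40% of the success rate" as error count greater than 40% of the success count. The row fields `successes` and `errors` are hypothetical names; the real ones depend on the EC2-Manager health API response, which this issue links but does not quote.

```javascript
// Classify a health row per the suggestion above:
//   no errors                        -> 'ok'
//   errors > 40% of successes        -> 'error'
//   otherwise (some errors)          -> 'warning'
function classifyHealthRow({ successes, errors }) {
  if (errors === 0) return 'ok';
  if (errors > 0.4 * successes) return 'error';
  return 'warning';
}
```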

This will allow Sheriffs and others interested to see what is going on inside the provisioner. This should reduce the impact on the Taskcluster team when there are problems inside the EC2 system as well as highlight when worker types are incorrectly configured.