
cortex's People

Contributors

1vn, bluuewhale, caleb-kaiser, cbensimon, deliahu, dependabot[bot], dsuess, hassenio, howjmay, ismaelc, jackmpcollins, miguelvr, ospillinger, rbromley10, robertlucian, tthebst, vishalbollu, wingkwong, ynng, zouyee


cortex's Issues

Move aws and k8s out of operator package

Description

For example, aws.Init() could take the region, and other parameters could be passed directly (e.g. GetLogs() could take logGroup).

aws.Init() could return a client, which would be used for all API calls.

Build TensorFlow Serving from source

Description

There may be an opportunity to improve prediction latency on CPUs and/or GPUs by building TensorFlow Serving from source.

For example, on p2.xlarge, TF Serving shows this log message:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

And on m5.large, TF Serving shows this log message:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Here's an example of someone building from source.

Question: different instance types support different instruction sets. What happens if we compile against instruction sets (e.g. AVX-512 on m5.large) that are not available on the instance actually running the binary (e.g. t3.large or even t3a.large)?
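On Linux, one way to check which instruction sets a given host actually supports is to parse the flags line of /proc/cpuinfo. This is a sketch (the helper name is invented); the log messages above come from TensorFlow's own runtime check:

```python
def supported_instruction_sets(cpuinfo_text, wanted=("avx2", "avx512f", "fma")):
    """Parse a /proc/cpuinfo dump and report which of the wanted CPU flags are present."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return {f: (f in flags) for f in wanted}
    return {f: False for f in wanted}

# Usage on a live host: supported_instruction_sets(open("/proc/cpuinfo").read())
```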

Make spec kinds optional

Description

A model / aggregate / transformed_feature can be defined without using an estimator / aggregator / transformer. If so, Cortex doesn't validate types.

Motivation

Reduce the amount of required configuration

Additional Context

  • e.g. estimator has a key path, and model has keys estimator and estimator_path (only one of them may be defined)
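The "only one of them may be defined" rule from the bullet above could be checked along these lines (a sketch; the function name is hypothetical):

```python
def validate_estimator_fields(model_config):
    """Exactly one of `estimator` and `estimator_path` may be defined on a model."""
    defined = [k for k in ("estimator", "estimator_path") if model_config.get(k) is not None]
    if len(defined) > 1:
        raise ValueError("only one of estimator and estimator_path may be defined")
    return defined[0] if defined else None  # None: no estimator, so Cortex skips type validation
```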

Add built-in Cortex templates

Description

  • Add support for built-in templates
  • Implement cortex.normalize
  • Use it in quickstart?

Notes

  • There doesn't need to be actual files for this; a hardcoded map[string]string (name -> template text) should be sufficient
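The hardcoded map could look something like this (a Python sketch of the name -> template text idea; in the operator it would be a Go map[string]string, and the template body shown here is invented for illustration):

```python
# Hypothetical built-in template registry: name -> template text.
BUILTIN_TEMPLATES = {
    "normalize": (
        "- kind: transformed_column\n"
        "  name: {column}_normalized\n"
        "  transformer: cortex.normalize\n"
        "  input:\n"
        "    col: {column}\n"
        "    mean: {column}_mean\n"
        "    stddev: {column}_stddev\n"
    ),
}

def render_template(name, **args):
    """Render a built-in template by substituting the user-provided arguments."""
    return BUILTIN_TEMPLATES[name].format(**args)
```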

Motivation

Reduce the amount of configuration required for common pipelines

Ingestion of Parquet containing int or double columns throws validation errors

Description

Parquet files containing IntegerType or DoubleType columns throw errors at ingestion time because the validation check expects FLOAT_COLUMN columns to be FloatType and INT_COLUMN columns to be LongType.

Possible Solution / Implementation

During the validation check, accept FloatType or DoubleType when a column is defined as FLOAT_COLUMN, and IntegerType or LongType when a column is defined as INT_COLUMN.
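The relaxed check amounts to mapping each Cortex column type to the set of Spark types it should accept. This is a plain-Python sketch (the real validation runs against pyspark.sql.types instances, and the helper name is hypothetical):

```python
# Hypothetical sketch of the relaxed validation: each Cortex column type
# accepts more than one underlying Spark type.
ACCEPTED_SPARK_TYPES = {
    "FLOAT_COLUMN": {"FloatType", "DoubleType"},
    "INT_COLUMN": {"IntegerType", "LongType"},
    "STRING_COLUMN": {"StringType"},
}

def validate_column_type(cortex_type, spark_type):
    """Return True if the ingested Spark type is acceptable for the declared Cortex type."""
    return spark_type in ACCEPTED_SPARK_TYPES.get(cortex_type, set())
```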

Hyperparameter Tuning

Description

Enable hyperparameter tuning: the parameter space is searched (e.g. grid-based search), multiple models are trained, and the most accurate model is deployed in the API.
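For the grid-based variant, the sweep itself is straightforward (a sketch; train_and_evaluate is a stand-in for whatever training entry point this would hook into):

```python
import itertools

def grid_search(param_space, train_and_evaluate):
    """Train one model per point in the parameter grid; return the best (params, score)."""
    keys = sorted(param_space)
    best = None
    for values in itertools.product(*(param_space[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_evaluate(params)  # higher is better
        if best is None or score > best[1]:
            best = (params, score)
    return best
```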

Open Questions

  • Does this involve the model config and/or API config?
  • How to define the parameter sweep?

Reduce CLI resource name/type ambiguity

Motivation

Avoid the confusion caused by cx status -h, which makes it seem like you can run cx status models.

Option 1

Use resourceType.name to identify resources. After this change, it should be possible to run cx get resourceType, cx get resourceIdentifier, or cx status resourceIdentifier (in addition to cx get and cx status), where resourceIdentifier can be resourceName if it's unique; resourceType.resourceName is fully qualified and will always work.
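The resolution logic for Option 1 could work roughly like this (a sketch; resources is assumed to be a list of (type, name) pairs):

```python
def resolve(identifier, resources):
    """Resolve `type.name` or a bare name against known (type, name) pairs."""
    if "." in identifier:  # fully qualified: always unambiguous
        rtype, rname = identifier.split(".", 1)
        matches = [r for r in resources if r == (rtype, rname)]
    else:  # bare name: only valid if the name is unique across types
        matches = [r for r in resources if r[1] == identifier]
    if len(matches) != 1:
        raise ValueError("ambiguous or unknown resource: %s" % identifier)
    return matches[0]
```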

Option 2

Require resource type (e.g. cx get model dnn even if there is only one resource called dnn)

Option 3

Keep it the way it is, possibly improving the docs.

Option 4

Add a new CLI command to reduce ambiguity.

Create estimator and improve inputs

Description

Separate model config into estimator / model, and improve inputs

Design

  • _default / _optional:
    • _default defaults to None in any case (scalar/list/map)
    • _default can never be set to None explicitly
    • If user sets _default, _optional is implicitly set to True
    • If *_COLUMN type is anywhere in or under the type, default is not an option
    • For list and map types, _min_count is an option (and defaults to 0)
  • User provided map types cannot have keys that start with _
  • Denote cortex values with @ throughout (i.e. all input, also e.g. model: @dnn in api)
  • Add training_input to models
  • Remove type: classification in model
  • Rename "inputs" to "input"
  • input: INT is supported
  • Can't mix value / column types in a single type (e.g. FLOAT_COLUMN|INT is not allowed)
  • {arg: INT}, {STRING: INT}, {STRING: INT_COLUMN}, {STRING_COLUMN: INT}, {STRING_COLUMN: INT_COLUMN}, {STRING_COLUMN: [INT_COLUMN]}, etc... are all supported. {[STRING_COLUMN]: INT_COLUMN} is not supported.
  • Cast INT -> FLOAT for inputs, INT -> FLOAT and INT_COLUMN -> FLOAT_COLUMN for outputs
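The casting rules in the last bullet can be captured in a small lookup table (a sketch; the names are illustrative):

```python
# Allowed implicit casts, per the bullet above: INT -> FLOAT for inputs;
# INT -> FLOAT and INT_COLUMN -> FLOAT_COLUMN for outputs.
INPUT_CASTS = {"INT": "FLOAT"}
OUTPUT_CASTS = {"INT": "FLOAT", "INT_COLUMN": "FLOAT_COLUMN"}

def can_satisfy(declared, provided, casts):
    """True if `provided` matches `declared` directly or via an allowed implicit cast."""
    return provided == declared or casts.get(provided) == declared
```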

Structure

input: <TYPE>     # (short form)

input:            # (long form)
  _type: <TYPE>
  _default: <>
  ...

Examples

Aggregators

# mean

- kind: aggregator
  name: mean
  output_type: FLOAT
  input: FLOAT_COLUMN|INT_COLUMN

- kind: aggregate
  name: sepal_length_mean
  aggregator: cortex.mean
  input: sepal_length

# bucket_boundaries

- kind: aggregator
  name: bucket_boundaries
  output_type: [FLOAT]
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    num_buckets:
      _type: INT
      _default: 10

- kind: aggregate
  name: sepal_length_bucket
  aggregator: cortex.bucket_boundaries
  input:
    col: sepal_length
    num_buckets: 5

Transformers

# normalize

- kind: transformer
  name: normalize
  output_type: FLOAT_COLUMN
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    mean: INT|FLOAT
    stddev: INT|FLOAT

- kind: transformed_column
  name: sepal_length_normalized
  transformer: cortex.normalize
  input:
    col: sepal_length
    mean: sepal_length_mean
    stddev: sepal_length_stddev

# weight

- kind: transformer
  name: weight
  output_type: FLOAT_COLUMN
  input:
    col: INT_COLUMN
    class_distribution: {INT: FLOAT}

- kind: transformed_column
  name: weight_column
  transformer: weight
  input:
    col: class
    class_distribution: class_distribution

Models

# iris

- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
    num_classes: INT
  hparams:
    hidden_layers: [INT]
    learning_rate:
      _type: FLOAT
      _default: 0.01

- kind: model
  name: iris-dnn
  estimator: dnn
  target_column: class_indexed
  input:
    normalized_columns:
      - sepal_length_normalized
      - sepal_width_normalized
      - petal_length_normalized
      - petal_width_normalized
    num_classes: 2
  hparams:
    hidden_layers: [4, 4]
  data_partition_ratio: ...
  training: ...

# Fraud

- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
  training_input:
    weight_column: FLOAT_COLUMN

- kind: model
  name: fraud-dnn
  target_column: class
  input:
    normalized_columns: [time_normalized, v1_normalized, ...]
  training_input:
    weight_column: weights

# Insurance

- kind: estimator
  name: dnn
  target_column: FLOAT_COLUMN
  input:
    categorical:
      _type:
        - feature: INT_COLUMN
          categories: [STRING]
      _min_count: 1  # This wouldn't actually be necessary, it's just to demonstrate
    bucketized:
      - feature: INT_COLUMN|FLOAT_COLUMN
        buckets: [INT]

- kind: model
  name: insurance-dnn
  estimator: dnn
  target_column: charges_normalized
  input:
    categorical:
      - feature: gender
        categories: ["female", "male"]
      - feature: smoker
        categories: ["yes", "no"]
      - feature: region
        categories: ["northwest", "northeast", "southwest", "southeast"]
      - feature: children
        categories: children_set
    bucketized:
      - feature: age
        buckets: [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
      - feature: bmi
        buckets: [15, 20, 25, 35, 40, 45, 50, 55]

# Poker

- kind: estimator
  name: dnn
  target_column: INT
  input:
    suit_columns: [INT_COLUMN]
    rank_columns: [INT_COLUMN]

- kind: model
  name: poker-dnn
  type: classification
  target_column: class
  input:
    suit_columns: [card_1_suit, card_2_suit, card_3_suit, card_4_suit, card_5_suit]
    rank_columns: [card_1_rank, card_2_rank, card_3_rank, card_4_rank, card_5_rank]

# Mnist

- kind: model
  name: t2t
  target_column: INT
  input: FLOAT_LIST_COLUMN
  prediction_key: outputs

- kind: trainer
  name: mnist-t2t
  target_column: label
  input: image_pixels

TensorFlow feature_column examples

review tf.feature_column

tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list("smoker", ["yes", "no"])
),

tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("age"), [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
),

tf.feature_column.numeric_column("sepal_length_normalized"),

tf.feature_column.numeric_column("image_pixels", shape=model_config["hparams"]["input_shape"])

weight_column = "class_weight"


# cloudml-template

categorical_columns_with_identity = {
    item[0]: tf.feature_column.categorical_column_with_identity(item[0], item[1])
    for item in categorical_feature_names_with_identity.items()
}

categorical_columns_with_vocabulary = {
    item[0]: tf.feature_column.categorical_column_with_vocabulary_list(item[0], item[1])
    for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_VOCABULARY.items()
}

categorical_columns_with_hash_bucket = {
    item[0]: tf.feature_column.categorical_column_with_hash_bucket(item[0], item[1])
    for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_HASH_BUCKET.items()
}

age_buckets = tf.feature_column.bucketized_column(
    feature_columns["age"], boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
)

education_X_occupation = tf.feature_column.crossed_column(
    ["education", "occupation"], hash_bucket_size=int(1e4)
)

native_country_embedded = tf.feature_column.embedding_column(
    feature_columns["native_country"], dimension=task.HYPER_PARAMS.embedding_size
)

Design goals

  • As concise as possible
  • Be able to copy-paste the schema into its manifestation when possible, with clear rules for when that's not possible
  • No separate keys are needed for values vs. columns; the type system handles that distinction
  • Use the same input config for models, transformers, and aggregates

Motivation

  • Consistency with transformer / transformed_feature and aggregator / aggregate
  • Enable built-in trainers, so users don't need any TensorFlow code for canned estimators
  • Enable re-use of model implementations

Update Argo version

Description

Long-running jobs show as errored (they are actually still running, but Cortex reports them as errored).

To Reproduce

Run any long job (e.g. training job with a lot of steps)

Possible Solution / Implementation

This is due to a known bug in Argo, which has been fixed upstream. The fix should be released with v2.3.

Stack Trace

time="2019-01-15T01:05:13Z" level=info msg="Creating a docker executor"
time="2019-01-15T01:05:13Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation: {}\ninputs: {}\nmetadata:\n  labels:\n    appName: iris\n    argo: \"true\"\n    workloadId: wrguzieqiqtj6gic4wlx\n    workloadType: data-job\nname: wrguzieqiqtj6gic4wlx\noutputs: {}\nresource:\n  action: create\n  failureCondition: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)\n  manifest: '{\"kind\":\"SparkApplication\",\"apiVersion\":\"sparkoperator.k8s.io/v1alpha1\",\"metadata\":{\"name\":\"wrguzieqiqtj6gic4wlx\",\"namespace\":\"cortex\",\"creationTimestamp\":null,\"labels\":{\"appName\":\"iris\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"ownerReferences\":[{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"Workflow\",\"name\":\"argo-iris-7wvh7\",\"uid\":\"9bc7aee5-1861-11e9-a664-026185601f56\",\"blockOwnerDeletion\":false}]},\"spec\":{\"type\":\"Python\",\"mode\":\"cluster\",\"image\":\"764403040460.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\",\"imagePullPolicy\":\"Always\",\"mainApplicationFile\":\"local:///src/spark_job/spark_job.py\",\"arguments\":[\"--workload-id=wrguzieqiqtj6gic4wlx\n    --context=s3://cortex-cluster-david/apps/iris/contexts/83b914617c417fbe26b59c7839d77914b43f02c48fb52c713a5d4c094dff0bd.msgpack\n    --cache-dir=/mnt/context --raw-features=ad88a38e98b74d6dfc1a718b51c3da769141f46ddd0cd8fb96ca9fd01a71521,9bcb141ed9f8ee4f52c55a40e16815d2a3eddbdc118cc81751e762c73134820,ffa5ac7dad3f546d5b867892d4bf820ba34e83362042c894dc86527295b9352,bacc1fe5d16b0dbbb3527ea9de82409f67b926d0243a549cb6ed844ef76cdd7,b02d30d78186521e2399ba6a93e0f20459eb72e661234c9cf88cd289a78af30\n    
--aggregates=5566592b7087b3cf822b7a7a56085fcfd1678b363db00fefefab725f7892d37,0c8177a3736da4987b7629346ca461eced6f090c40edbbe9b67595ca00f73ce,ea21d88863bf2fa675550ae8e57d301d247b311b8f7441e1a342f0803a352e8,9fdd4490123c08fc5da726e3ca597f9b0cf9f0f7b7762a5fca7ea7b036df33b,2a55250299f0a8935f8aafba8cef94e5f7776c8e4c480e2d7f42fc5ca336d36,d59e72278cf560705ddb8cd11dbbb9dd2592165d458f51ddefa425f35fe0a3d,77b9f029fcaac3f513e549042f2f64eee9e73ad0d2c7549ec167943f2adbd46,d3e9aaa136d3e338ac1654e4da3ce5fbe6f6e6517e375672efd9ce3b20bedef,a8a3c0802612f715243577910061579af8cc49bc44ffc8f580e9746c264a237\n    --transformed-features=8cc4ed1ed6c711dc775f51d8d129f5b0030ae109c9ae1605f0530c3ecdf511b,1d26f726a9c1879096147a7219418752140c6572a86d29a8a273ac8474c4622,79df34c5ba324066ec281a70ca9f45dd027e24a243bc137efc24eb1f928b708,ba8d0bc773530313233b0c51a27526ab6f13db97b02c9b8114bd9f0667793c1,15533d18f1de1b545c1a32dd6f6df24481f07fe314f78219ad6bab782276703\n    --training-datasets=27fd4e1f80338ab4fbe8f2c67d58b6d521d2576f3cd492ad1ace5c80d94e413\n    
--ingest\"],\"driver\":{\"cores\":10,\"memory\":\"512000k\",\"envSecretKeyRefs\":{\"AWS_ACCESS_KEY_ID\":{\"name\":\"aws-credentials\",\"key\":\"AWS_ACCESS_KEY_ID\"},\"AWS_SECRET_ACCESS_KEY\":{\"name\":\"aws-credentials\",\"key\":\"AWS_SECRET_ACCESS_KEY\"}},\"labels\":{\"appName\":\"iris\",\"userFacing\":\"true\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"podName\":\"wrguzieqiqtj6gic4wlx\",\"serviceAccount\":\"spark\"},\"executor\":{\"cores\":1,\"memory\":\"512000k\",\"envSecretKeyRefs\":{\"AWS_ACCESS_KEY_ID\":{\"name\":\"aws-credentials\",\"key\":\"AWS_ACCESS_KEY_ID\"},\"AWS_SECRET_ACCESS_KEY\":{\"name\":\"aws-credentials\",\"key\":\"AWS_SECRET_ACCESS_KEY\"}},\"labels\":{\"appName\":\"iris\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"instances\":1},\"deps\":{\"pyFiles\":[\"local:///src/spark_job/spark_util.py\",\"local:///src/lib/*.py\"]},\"restartPolicy\":{\"type\":\"Never\"},\"pythonVersion\":\"3\"},\"status\":{\"lastSubmissionAttemptTime\":null,\"completionTime\":null,\"driverInfo\":{},\"applicationState\":{\"state\":\"\",\"errorMessage\":\"\"}}}'\n  successCondition: status.applicationState.state in (COMPLETED)\n"
time="2019-01-15T01:05:13Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-01-15T01:05:13Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o name"
time="2019-01-15T01:05:17Z" level=info msg=sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx
time="2019-01-15T01:05:17Z" level=info msg="Waiting for conditions: status.applicationState.state in (COMPLETED)"
time="2019-01-15T01:05:17Z" level=info msg="Failing for conditions: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)"
time="2019-01-15T01:05:17Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:05:20Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:20Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:20Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:20Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:05:41Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:41Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:05:41Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:41Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:07:17Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:07:17Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:07:17Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:07:17Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:07:22Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:07:22Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:07:22Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:07:22Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:07:22Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:09:08Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:09:08Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:09:08Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:09:08Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:09:28Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:09:28Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:09:28Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:09:28Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:09:28Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:10:51Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:10:51Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:10:51Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:10:51Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:12:11Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:12:11Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:12:11Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:12:11Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:12:11Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:12:25Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:12:25Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:12:25Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:12:25Z" level=info msg="0/1 success conditions matched"

Clean up evicted pods

Probably after some timeout, so the user can see them if they want, and so logs get uploaded.

operator-b58fbb9b5-z2ftz                    1/1     Running   0          11m
operator-b58fbb9b5-zdmjx                    0/1     Evicted   0          11m

Not using an ingested column as a raw_column results in error

Description

It should be possible to have a column defined in env.data.schema that isn't used as a raw_column (CSV and Parquet).

Application Configuration

Start from Iris, and then completely remove one of the features (e.g. sepal_width)

To Reproduce

  1. cx deploy

Stack Trace

Starting

INFO:cortex:Starting
INFO:cortex:Ingesting
INFO:cortex:Ingesting iris-1 data from s3a://cortex-examples/iris.csv
ERROR:cortex:An error occurred, see `cx logs raw_column sepal_width` for more details.
Traceback (most recent call last):
  File "/src/spark_job/spark_job.py", line 307, in run_job
    raw_df = ingest_raw_dataset(spark, ctx, cols_to_validate, should_ingest)
  File "/src/spark_job/spark_job.py", line 151, in ingest_raw_dataset
    ingest_df = spark_util.ingest(ctx, spark)
  File "/src/spark_job/spark_util.py", line 223, in ingest
    expected_schema = expected_schema_from_context(ctx)
  File "/src/spark_job/spark_util.py", line 117, in expected_schema_from_context
    for fname in expected_field_names
  File "/src/spark_job/spark_util.py", line 117, in <listcomp>
    for fname in expected_field_names
KeyError: 'petal_width'

Version

master

Additional Context

It should also be possible to not ingest all of the columns in the dataset (just Parquet for now); we should test this.

Enable external constant ingestion

Description

Allow users to upload objects to S3 and use them in aggregators, transformers, and estimators.

Motivation

Some artifacts are pre-computed and need to be accessed in the pipeline (e.g. word embeddings).

Read AWS credentials from ~/.aws for cortex_installer.sh and/or CLI

Description

Find a way to use (or ask if the user wants to use) AWS credentials from ~/.aws
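Python's standard configparser can read the ~/.aws/credentials INI format directly. This is a sketch (in the Go CLI, the aws-sdk-go shared-config loader would be the natural equivalent):

```python
import configparser
import os

def read_aws_credentials(path="~/.aws/credentials", profile="default"):
    """Read the access key ID and secret key for a profile from the shared AWS credentials file."""
    config = configparser.ConfigParser()
    config.read(os.path.expanduser(path))
    if profile not in config:
        return None
    section = config[profile]
    return (section.get("aws_access_key_id"), section.get("aws_secret_access_key"))
```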

Motivation

  • Reduce installation friction
  • Users may not want to overwrite environment variables (even though they are shell-scoped)

Additional Context

  • How do you know which AWS profile to use?
  • Is there a library to read the AWS config?

Infer raw columns from CSV headers, Parquet schema

Description

Add an option in environment.schema to use CSV headers or Parquet column names as the raw column names.
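For the CSV case, header inference is nearly a one-liner with the standard csv module (a sketch; the inferred names would still need the validation mentioned in the notes for this issue):

```python
import csv
import io

def infer_raw_columns_from_csv(csv_text):
    """Infer raw column names from the first (header) row of a CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name.strip() for name in header]
```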

Motivation

Reduce the amount of required YAML configuration when the raw data already has the column names.

Additional Context

  • This will feed into raw column data type inference
  • When using the option, it should not be possible to explicitly identify any of the raw columns
  • We still need to validate the raw column names

Resources are not allocated to Spark workloads that generate training datasets

Description

Failed to create training datasets.

To Reproduce

Reproducible in many of the examples, but easiest to reproduce in Poker.

  1. Define only the app, environment and raw columns in YAML
  2. cx deploy to process the raw columns
  3. Define a model that uses the raw columns
  4. cx deploy

Stack Trace

cx logs dnn/training_dataset

Failed to start:

time="2019-04-18T20:12:09Z" level=info msg="Creating a docker executor"
time="2019-04-18T20:12:09Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation: {}\ninputs: {}\nmetadata:\n  labels:\n    appName: recommendations\n    argo: \"true\"\n    workloadID: bord6fnh1lma5hyn8my3\n    workloadType: data-job\nname: bord6fnh1lma5hyn8my3\noutputs: {}\nresource:\n  action: create\n  failureCondition: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)\n  manifest: |-\n    {\n      \"kind\": \"SparkApplication\",\n      \"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\n      \"metadata\": {\n        \"name\": \"bord6fnh1lma5hyn8my3\",\n        \"namespace\": \"cortex\",\n        \"creationTimestamp\": null,\n        \"labels\": {\n          \"appName\": \"recommendations\",\n          \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n          \"workloadType\": \"data-job\"\n        },\n        \"ownerReferences\": [\n          {\n            \"apiVersion\": \"argoproj.io/v1alpha1\",\n            \"kind\": \"Workflow\",\n            \"name\": \"argo-recommendations-rplw6\",\n            \"uid\": \"3dca2989-6216-11e9-aaf1-02cc01957708\",\n            \"blockOwnerDeletion\": false\n          }\n        ]\n      },\n      \"spec\": {\n        \"type\": \"Python\",\n        \"mode\": \"cluster\",\n        \"image\": \"969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\",\n        \"imagePullPolicy\": \"Always\",\n        \"mainApplicationFile\": \"local:///src/spark_job/spark_job.py\",\n        \"arguments\": [\n          \"--workload-id=bord6fnh1lma5hyn8my3 --context=s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack --cache-dir=/mnt/context --raw-columns= --aggregates= --transformed-columns= --training-datasets=d6c73248656984e3d08a6165cd3b34de27253021cb94c232526f96776999d73\"\n        ],\n        \"driver\": {\n          \"cores\": 0,\n          \"memory\": \"0k\",\n       
   \"envVars\": {\n            \"CORTEX_CACHE_DIR\": \"/mnt/context\",\n            \"CORTEX_CONTEXT_S3_PATH\": \"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\",\n            \"CORTEX_SPARK_VERBOSITY\": \"WARN\",\n            \"CORTEX_WORKLOAD_ID\": \"bord6fnh1lma5hyn8my3\"\n          },\n          \"envSecretKeyRefs\": {\n            \"AWS_ACCESS_KEY_ID\": {\n              \"name\": \"aws-credentials\",\n              \"key\": \"AWS_ACCESS_KEY_ID\"\n            },\n            \"AWS_SECRET_ACCESS_KEY\": {\n              \"name\": \"aws-credentials\",\n              \"key\": \"AWS_SECRET_ACCESS_KEY\"\n            }\n          },\n          \"labels\": {\n            \"appName\": \"recommendations\",\n            \"userFacing\": \"true\",\n            \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n            \"workloadType\": \"data-job\"\n          },\n          \"podName\": \"bord6fnh1lma5hyn8my3\",\n          \"serviceAccount\": \"spark\"\n        },\n        \"executor\": {\n          \"cores\": 0,\n          \"memory\": \"0k\",\n          \"envVars\": {\n            \"CORTEX_CACHE_DIR\": \"/mnt/context\",\n            \"CORTEX_CONTEXT_S3_PATH\": \"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\",\n            \"CORTEX_SPARK_VERBOSITY\": \"WARN\",\n            \"CORTEX_WORKLOAD_ID\": \"bord6fnh1lma5hyn8my3\"\n          },\n          \"envSecretKeyRefs\": {\n            \"AWS_ACCESS_KEY_ID\": {\n              \"name\": \"aws-credentials\",\n              \"key\": \"AWS_ACCESS_KEY_ID\"\n            },\n            \"AWS_SECRET_ACCESS_KEY\": {\n              \"name\": \"aws-credentials\",\n              \"key\": \"AWS_SECRET_ACCESS_KEY\"\n            }\n          },\n          \"labels\": {\n            \"appName\": \"recommendations\",\n            \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n            
\"workloadType\": \"data-job\"\n          },\n          \"instances\": 0\n        },\n        \"deps\": {\n          \"pyFiles\": [\n            \"local:///src/spark_job/spark_util.py\",\n            \"local:///src/lib/*.py\"\n          ]\n        },\n        \"restartPolicy\": {\n          \"type\": \"Never\"\n        },\n        \"pythonVersion\": \"3\"\n      },\n      \"status\": {\n        \"lastSubmissionAttemptTime\": null,\n        \"completionTime\": null,\n        \"driverInfo\": {},\n        \"applicationState\": {\n          \"state\": \"\",\n          \"errorMessage\": \"\"\n        }\n      }\n    }\n  successCondition: status.applicationState.state in (COMPLETED)\n"
time="2019-04-18T20:12:09Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-04-18T20:12:09Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o name"
time="2019-04-18T20:12:10Z" level=fatal msg="The SparkApplication \"bord6fnh1lma5hyn8my3\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"sparkoperator.k8s.io/v1alpha1\", \"kind\":\"SparkApplication\", \"metadata\":map[string]interface {}{\"name\":\"bord6fnh1lma5hyn8my3\", \"namespace\":\"cortex\", \"creationTimestamp\":\"2019-04-18T20:12:09Z\", \"labels\":map[string]interface {}{\"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\", \"appName\":\"recommendations\"}, \"ownerReferences\":[]interface {}{map[string]interface {}{\"apiVersion\":\"argoproj.io/v1alpha1\", \"kind\":\"Workflow\", \"name\":\"argo-recommendations-rplw6\", \"uid\":\"3dca2989-6216-11e9-aaf1-02cc01957708\", \"blockOwnerDeletion\":false}}, \"generation\":1, \"uid\":\"3f1c787f-6216-11e9-aaf1-02cc01957708\", \"selfLink\":\"\"}, \"spec\":map[string]interface {}{\"image\":\"969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\", \"mainApplicationFile\":\"local:///src/spark_job/spark_job.py\", \"mode\":\"cluster\", \"restartPolicy\":map[string]interface {}{\"type\":\"Never\"}, \"type\":\"Python\", \"driver\":map[string]interface {}{\"serviceAccount\":\"spark\", \"cores\":0, \"envSecretKeyRefs\":map[string]interface {}{\"AWS_ACCESS_KEY_ID\":map[string]interface {}{\"key\":\"AWS_ACCESS_KEY_ID\", \"name\":\"aws-credentials\"}, \"AWS_SECRET_ACCESS_KEY\":map[string]interface {}{\"key\":\"AWS_SECRET_ACCESS_KEY\", \"name\":\"aws-credentials\"}}, \"envVars\":map[string]interface {}{\"CORTEX_CACHE_DIR\":\"/mnt/context\", \"CORTEX_CONTEXT_S3_PATH\":\"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\", \"CORTEX_SPARK_VERBOSITY\":\"WARN\", \"CORTEX_WORKLOAD_ID\":\"bord6fnh1lma5hyn8my3\"}, \"labels\":map[string]interface {}{\"appName\":\"recommendations\", \"userFacing\":\"true\", \"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\"}, \"memory\":\"0k\", 
\"podName\":\"bord6fnh1lma5hyn8my3\"}, \"deps\":map[string]interface {}{\"pyFiles\":[]interface {}{\"local:///src/spark_job/spark_util.py\", \"local:///src/lib/*.py\"}}, \"executor\":map[string]interface {}{\"envVars\":map[string]interface {}{\"CORTEX_CACHE_DIR\":\"/mnt/context\", \"CORTEX_CONTEXT_S3_PATH\":\"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\", \"CORTEX_SPARK_VERBOSITY\":\"WARN\", \"CORTEX_WORKLOAD_ID\":\"bord6fnh1lma5hyn8my3\"}, \"instances\":0, \"labels\":map[string]interface {}{\"appName\":\"recommendations\", \"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\"}, \"memory\":\"0k\", \"cores\":0, \"envSecretKeyRefs\":map[string]interface {}{\"AWS_ACCESS_KEY_ID\":map[string]interface {}{\"key\":\"AWS_ACCESS_KEY_ID\", \"name\":\"aws-credentials\"}, \"AWS_SECRET_ACCESS_KEY\":map[string]interface {}{\"name\":\"aws-credentials\", \"key\":\"AWS_SECRET_ACCESS_KEY\"}}}, \"imagePullPolicy\":\"Always\", \"pythonVersion\":\"3\", \"arguments\":[]interface {}{\"--workload-id=bord6fnh1lma5hyn8my3 --context=s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack --cache-dir=/mnt/context --raw-columns= --aggregates= --transformed-columns= --training-datasets=d6c73248656984e3d08a6165cd3b34de27253021cb94c232526f96776999d73\"}}, \"status\":map[string]interface {}{\"lastSubmissionAttemptTime\":interface {}(nil), \"applicationState\":map[string]interface {}{\"errorMessage\":\"\", \"state\":\"\"}, \"completionTime\":interface {}(nil), \"driverInfo\":map[string]interface {}{}}}: validation failure list:\nspec.driver.cores in body should be greater than 0\nspec.executor.instances in body should be greater than or equal to 1\nspec.executor.cores in body should be greater than 
0\ngithub.com/argoproj/argo/errors.New\n\t/root/go/src/github.com/argoproj/argo/errors/errors.go:48\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).ExecResource\n\t/root/go/src/github.com/argoproj/argo/workflow/executor/resource.go:36\ngithub.com/argoproj/argo/cmd/argoexec/commands.execResource\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:38\ngithub.com/argoproj/argo/cmd/argoexec/commands.glob..func2\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:23\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:766\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:198\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2361"

Version

master
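The validation failures at the bottom of the stack trace show that the driver/executor compute fields were serialized as zero values. A minimal client-side guard that mirrors those three operator rules (the function name and spec shape are illustrative, not Cortex's actual API):

```python
def validate_spark_spec(spec):
    """Return the list of validation errors the operator would report.

    Mirrors the three rules from the log above; zero values indicate the
    compute config was never populated before submission.
    """
    errors = []
    if spec.get("driver", {}).get("cores", 0) <= 0:
        errors.append("spec.driver.cores in body should be greater than 0")
    executor = spec.get("executor", {})
    if executor.get("instances", 0) < 1:
        errors.append("spec.executor.instances in body should be greater than or equal to 1")
    if executor.get("cores", 0) <= 0:
        errors.append("spec.executor.cores in body should be greater than 0")
    return errors
```

Running this before kubectl create would fail fast with the same three messages instead of a fatal error from the SparkApplication admission check.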

Official Website?

Hey there - I was just curious if you have a real website that isn't just a redirect to your repo?

I noticed you have docs and a job posting page but no normal landing page. Is this by design due to a stealth launch? Or perhaps issues with SEO, since there seems to be another company with a very similar name - https://www.cortexlabs.ai/

Obviously the development is more important, but it's always nice to be able to send non-technical/business colleagues to a place where they can get an overview without crawling through a repo :)


Support GCP

Notes

  • Update some docs (e.g. how to install GPUs)

Add option to cache Spark data in memory

Description

Add an option in environments to cache the entire dataset in memory for Spark jobs.

Motivation

Small data jobs could run faster

Additional Context

If the speedup is not significant, this feature is not worth implementing.

API is sometimes temporarily unavailable when updating

Description

API is sometimes temporarily unavailable when updating to a new model

Application Configuration

Fraud

To Reproduce

  1. cx deploy
  2. Modify dnn (e.g. to do 100 steps)
  3. cx deploy
  4. Repeatedly run cx predict fraud transactions.json

Actual Behavior

error: api "fraud" is updating

Expected Behavior

Successful prediction requests at all times (smooth transition to new model)
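Until the transition is smooth server-side, a client-side workaround is to retry while the API reports that it is updating. A sketch, assuming the caller surfaces the CLI's "is updating" error as an exception (predict_fn, the exception type, and the backoff policy are all hypothetical):

```python
import time

def predict_with_retry(predict_fn, sample, retries=5, backoff=0.5):
    """Retry a prediction while the API is mid-update.

    Only retries on the "is updating" error from the report above; any
    other error is re-raised immediately.
    """
    last_err = None
    for _ in range(retries):
        try:
            return predict_fn(sample)
        except RuntimeError as err:
            if "is updating" not in str(err):
                raise
            last_err = err
            time.sleep(backoff)
    raise last_err
```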


Version

master

Additional Context

Add client-side prediction batching

Description

When multiple prediction requests are made in the same API call, the API container should batch them into a single gRPC request to TensorFlow Serving

Motivation

  • Reduce latency for prediction requests which contain multiple samples
  • Increase bandwidth when prediction requests contain multiple samples

Additional Context

Here's an example, but their code might not be relevant since they don't use the tensorflow package (an optimization that is unnecessary for us since our client is always running)
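The batching itself can be sketched independently of TF Serving; the point is collapsing N round trips into one. ServingStub here is a stand-in for the gRPC stub, not real TF Serving code:

```python
class ServingStub:
    """Counts round trips; stands in for the TF Serving gRPC stub."""

    def __init__(self):
        self.calls = 0

    def predict(self, samples):
        self.calls += 1
        return [s * 2 for s in samples]  # dummy model

def predict_per_sample(stub, samples):
    """Current behavior: one RPC per sample."""
    return [stub.predict([s])[0] for s in samples]

def predict_batched(stub, samples):
    """Proposed behavior: one RPC for the whole API request."""
    return stub.predict(samples)
```

Same outputs, but the batched path makes one round trip regardless of how many samples arrive in the API call.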

Improve built-in index_string data format

Description

Option 1: Save both mappings in the analysis. Could we make it a different transformer/aggregator? Probably not.

Option 2: At startup time, predict.py can create the string -> index mapping to speed up transformations (and keep the index -> string array for reverse transformations). Is this custom to just this analysis, or should there be an API for it (e.g. "preprocess")?

Motivation

Faster real-time predictions
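Option 2 can be sketched as precomputing both mappings once at startup, so each transformation at prediction time is an O(1) lookup (function names are hypothetical):

```python
def build_index_mappings(vocab):
    """Precompute both directions once at startup.

    string_to_index serves the forward transform; index_to_string serves
    reverse transforms on the prediction response.
    """
    index_to_string = list(vocab)
    string_to_index = {s: i for i, s in enumerate(index_to_string)}
    return string_to_index, index_to_string

def index_string(value, string_to_index):
    return string_to_index[value]

def reverse_index(index, index_to_string):
    return index_to_string[index]
```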

Add metadata database

Description

Storing metadata in a relational database (e.g. resource configuration, model accuracy, logs?, job statuses?) would allow us to query information about current and past resources/workloads. We will need to add CLI commands for this as well (e.g. query past models, jobs, ...).

Notes

  • Some use cases need to be transferred from S3 to the DB (e.g. inferred data types, context persistence, logs?, job statuses?, anything else that uses S3 which isn't user data?)
  • Should we store logs in the database and not use CloudWatchLogs?
  • Which DB should we use? Should we use a managed service (e.g. RDS or DynamoDB) and provide implementations for all of our clouds, or deploy something in the k8s cluster (e.g. CockroachDB)?
  • Cache cloudwatch metrics
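For illustration, a minimal relational schema for workload statuses, sketched with sqlite3 (the actual DB choice is the open question above; the table and column names are hypothetical):

```python
import sqlite3

def init_metadata_db(conn):
    """Create a minimal workloads table (illustrative schema only)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS workloads (
               workload_id TEXT PRIMARY KEY,
               app_name    TEXT NOT NULL,
               kind        TEXT NOT NULL,
               status      TEXT NOT NULL
           )"""
    )

def record_status(conn, workload_id, app_name, kind, status):
    """Upsert the latest status for a workload."""
    conn.execute(
        "INSERT OR REPLACE INTO workloads VALUES (?, ?, ?, ?)",
        (workload_id, app_name, kind, status),
    )

def jobs_by_status(conn, status):
    """The kind of query S3 makes awkward and a DB makes trivial."""
    rows = conn.execute(
        "SELECT workload_id FROM workloads WHERE status = ?", (status,)
    )
    return [r[0] for r in rows]
```

Queries like "all failed jobs for an app" (for the new CLI commands) become one SELECT instead of listing and parsing S3 objects.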

Make raw columns optional

Description

If the user doesn't specify any raw columns, infer them from the environment's schema.

Motivation

This is a continuation of making all schemas optional to reduce the amount of required config.

Notes

  • Still check that all defined raw columns are ingested
  • Still ensure schemas from all environments match
  • Should we ingest columns that aren't used anywhere in the pipeline?
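A sketch of the fallback plus the first check from the notes above (the config shapes are assumptions, not Cortex's actual ones):

```python
def infer_raw_columns(schema, defined_raw_columns):
    """Fall back to the environment schema when no raw columns are defined.

    When raw columns are defined, still check that each one is actually
    ingested (i.e. present in the schema).
    """
    ingested = {col["name"]: col["type"] for col in schema}
    if not defined_raw_columns:
        # infer one raw column per ingested schema column
        return [{"name": n, "type": t} for n, t in ingested.items()]
    missing = [c["name"] for c in defined_raw_columns if c["name"] not in ingested]
    if missing:
        raise ValueError("raw columns not found in schema: " + ", ".join(missing))
    return defined_raw_columns
```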

Name of api auto-generated by cx init has underscore

Description

API names cannot have underscores, but cx init produces apis.yaml with this example template:

## Sample API:
#
# - kind: api
#   name: my_api
#   model_name: my_model
#   compute:
#     replicas: 1

Change the underscores to hyphens.

cx init should prevent users from using underscores

Description

cx deploy will fail validation later down the road, so we should prevent users from running e.g. cx init iris_1. We should also run the other app name validations (spaces, etc.).

Hyphens are ok
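A sketch of the validation; the only stated rules are "no underscores, no spaces, hyphens are ok", so the exact character set below is an assumption:

```python
import re

# Assumed character rules beyond "no underscores/spaces": lowercase
# alphanumerics and hyphens, not starting or ending with a hyphen.
NAME_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")

def validate_app_name(name):
    """Return an error message, or None if the name is valid."""
    if "_" in name:
        return "app names may not contain underscores (use hyphens instead)"
    if not NAME_RE.match(name):
        return "app names may only contain lowercase letters, numbers, and hyphens"
    return None
```

cx init could run this before creating the app directory, so cx init iris_1 fails immediately with an actionable message.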

Motivation

Improve developer experience

Address TF Serving GRPC Warning

Description

cx logs on an API:

/usr/local/lib/python3.5/dist-packages/tensorflow_serving/apis/prediction_service_pb2.py:131: DeprecationWarning: beta_create_PredictionService_stub() method is deprecated. This method will be removed in near future versions of TF Serving. Please switch to GA gRPC API in prediction_service_pb2_grpc.
  'prediction_service_pb2_grpc.', DeprecationWarning)

Add additional input argument options

Description

Currently we support _default, _optional, _min_count, and _max_count. We should add support for _min / _max (for floats and ints) and _values (or _allowed_values) (for all types). For example, we could use _min: 1 for num_buckets, and _allowed_values for all estimator optimizers and the BoostedTrees pruning mode.
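A sketch of how the new options could be validated alongside the existing _default / _optional semantics (the validation function itself is hypothetical):

```python
def validate_input(value, options):
    """Validate a user input against proposed _min/_max/_allowed_values options.

    Reuses the existing _optional/_default behavior for missing values.
    """
    if value is None:
        if options.get("_optional", False):
            return options.get("_default")
        raise ValueError("input is required")
    if "_min" in options and value < options["_min"]:
        raise ValueError("input must be >= {}".format(options["_min"]))
    if "_max" in options and value > options["_max"]:
        raise ValueError("input must be <= {}".format(options["_max"]))
    if "_allowed_values" in options and value not in options["_allowed_values"]:
        raise ValueError("input must be one of {}".format(options["_allowed_values"]))
    return value
```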

Resolve Spark Context file added warnings

Description

We should avoid repeatedly adding files to the Spark context so we can remove this warning:

...
INFO:cortex:Sanity checking transformers against the first 100 samples
19/03/26 01:43:53 SPARK:WARN CacheManager: Asked to cache already cached data.
INFO:cortex:Transforming class to class_indexed
19/03/26 01:43:54 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_index_string.py has been added already. Overwriting of added paths is not supported in the current version.
19/03/26 01:43:55 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_index_string.py has been added already. Overwriting of added paths is not supported in the current version.
INFO:cortex:class:          Iris-setosa    Iris-setosa    Iris-setosa
INFO:cortex:class_indexed:  0              0              0
INFO:cortex:Transforming petal_length to petal_length_normalized
19/03/26 01:43:56 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_normalize.py has been added already. Overwriting of added paths is not supported in the current version.
19/03/26 01:43:56 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_normalize.py has been added already. Overwriting of added paths is not supported in the current version.
...
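One way to silence these is to track which paths have already been shipped and skip the repeat calls. A minimal sketch (Cortex may need to track this per SparkContext rather than globally):

```python
# Paths already added to the current Spark context (assumes one context
# per job, as in spark_job.py).
_added_paths = set()

def add_file_once(spark_context, path):
    """Call spark_context.addFile(path) only the first time a path is seen.

    Returns True if the file was added, False if it was skipped.
    """
    if path in _added_paths:
        return False
    spark_context.addFile(path)
    _added_paths.add(path)
    return True
```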

Respond to prediction request with transformed columns

Description

Add a flag in the API config to include transformed feature values in the prediction response, or include these values automatically (without a flag to enable/disable it).

Motivation

Clients may need to perform post-processing on the prediction response based on the values of the transformed features (e.g. entity extraction requires the tokenized text).

Notes

  • Add e.g. -r/--raw flag to CLI to show full json response?
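The flag-gated response shape could be as simple as this (field names are hypothetical):

```python
def build_prediction_response(prediction, transformed_columns, include_transformed=False):
    """Build the API response, optionally carrying transformed feature values.

    include_transformed stands in for the proposed api config flag.
    """
    response = {"prediction": prediction}
    if include_transformed:
        response["transformed_columns"] = transformed_columns
    return response
```

A client doing entity extraction could then read e.g. the tokenized text out of transformed_columns for post-processing.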

Allow for skipping CSV columns when ingesting

Description

"_" in the schema denotes not ingesting that column, and columns can be left off of the end of the CSV. For example, if there were columns named 1, 2, 3, 4, 5 and the user only wanted 1, 2, and 4, they could be ingested with a CSV schema of [1, 2, _, 4]

Build from source instructions

Description

  • simple documentation for building from source (linked from README)
  • simple commands (probably ties into makefile)
  • see the internal onboarding docs (Dev Workflow and Dev Setup); we should be able to delete them when this is done

Motivation

A user may want to make a change to the operator to support a custom use case and build Cortex from source. This would also make it easier to onboard engineers.

Move errors to relevant packages

Description

Remove the error strings from the strings package, and move them to package-specific errors.go files.

Motivation

Avoid package dependency cycles

Resource status unavailable on S3

Version: 0.2

Stacktrace:

ERROR:cortex:An error occurred, see cx logs raw_column petal_width for more details.
Traceback (most recent call last):
  File "/src/spark_job/spark_job.py", line 311, in run_job
    run_custom_aggregators(spark, ctx, cols_to_aggregate, raw_df)
  File "/src/spark_job/spark_job.py", line 209, in run_custom_aggregators
    ctx.upload_resource_status_success(*builtin_aggregates)
  File "/src/lib/context.py", line 442, in upload_resource_status_success
    self.upload_resource_status_end("succeeded", *resources)
  File "/src/lib/context.py", line 450, in upload_resource_status_end
    status = self.get_resource_status(resource)
  File "/src/lib/context.py", line 411, in get_resource_status
    return aws.read_json_from_s3(key, self.bucket)
  File "/src/lib/aws.py", line 154, in read_json_from_s3
    obj = read_bytes_from_s3(key, bucket, allow_missing, client_config).decode("utf-8")
AttributeError: 'NoneType' object has no attribute 'decode'
19/03/25 21:38:15 SPARK:WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)

Additional information:
cx logs keeps returning error logs even after a successful run of cx refresh.
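The AttributeError happens because read_bytes_from_s3 returns None when the status object is missing, and the caller unconditionally calls .decode on the result. A defensive sketch (read_bytes stands in for aws.read_bytes_from_s3; the real fix may differ):

```python
import json

def read_json_from_s3_safe(read_bytes, key, bucket, allow_missing=False):
    """Read and parse a JSON object from S3, guarding against a missing key.

    read_bytes is assumed to return the object's bytes, or None when the
    key does not exist (as aws.read_bytes_from_s3 appears to do).
    """
    raw = read_bytes(key, bucket, allow_missing)
    if raw is None:
        if allow_missing:
            return None
        # surface a clear error instead of AttributeError on .decode
        raise FileNotFoundError("s3://{}/{} not found".format(bucket, key))
    return json.loads(raw.decode("utf-8"))
```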

Split strings package

Description

The strings package shouldn't contain anything Cortex-specific, and should be moved to lib.

pkg/lib/interfaces/interfaces.go would use PrimType from whichever package it moves to, and would use ErrorInvalidPrimitiveType from that package; then we could delete pkg/lib/interfaces/errors.go

Clarify uninstall docs

Description

  • Uninstall needs AWS credentials
  • The first section of the uninstall script is confusing (not the whole script)

Resource status not found

Description

Sometimes a deployment fails in the aggregation step.

Application Configuration

Iris examples

To Reproduce

Not consistently reproducible.

  1. Deploy iris example

Stack Trace

cortex logs sepal_width_mean

ERROR:cortex:An error occurred, see cx logs raw_column petal_width for more details.
Traceback (most recent call last):
  File "/src/spark_job/spark_job.py", line 311, in run_job
    run_custom_aggregators(spark, ctx, cols_to_aggregate, raw_df)
  File "/src/spark_job/spark_job.py", line 209, in run_custom_aggregators
    ctx.upload_resource_status_success(*builtin_aggregates)
  File "/src/lib/context.py", line 442, in upload_resource_status_success
    self.upload_resource_status_end("succeeded", *resources)
  File "/src/lib/context.py", line 450, in upload_resource_status_end
    status = self.get_resource_status(resource)
  File "/src/lib/context.py", line 411, in get_resource_status
    return aws.read_json_from_s3(key, self.bucket)
  File "/src/lib/aws.py", line 154, in read_json_from_s3
    obj = read_bytes_from_s3(key, bucket, allow_missing, client_config).decode("utf-8")
AttributeError: 'NoneType' object has no attribute 'decode'
19/03/25 21:38:15 SPARK:WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)

Version

0.2

Deploy Cortex locally

Description

Set up the infrastructure to deploy Cortex locally:

  • Refactor code
  • Update install scripts for running cortex locally

Aggregate still in "aggregating" status even when logs say OOM

Description

Aggregate says it is still in progress even though logs say out of memory

Application Configuration

Added a sum_distinct_float aggregate operation to the normalize template in fraud

Actual Behavior

Aggregate remains in "aggregating" status

Expected Behavior

Expected aggregate to "fail" given that we are summing distinctly on a float column with many different values.

Stack Trace

19/03/26 14:07:12 SPARK:WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
19/03/26 14:09:22 SPARK:WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 12, 192.168.56.198, executor 1): java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.spark.io.ReadAheadInputStream.<init>(ReadAheadInputStream.java:105)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:82)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:156)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:477)
	at org.apache.spark.sql.execution.UnsafeKVExternalSorter.sortedIterator(UnsafeKVExternalSorter.java:204)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:261)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:221)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:360)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:112)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:102)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

19/03/26 14:09:24 SPARK:ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.56.198: 
The executor with id 1 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:

ContainerStatus(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, exitCode=52, finishedAt=Time(time=2019-03-26T14:09:22Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:06:12Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
      
19/03/26 14:09:24 SPARK:WARN TaskSetManager: Lost task 1.0 in stage 8.0 (TID 13, 192.168.56.198, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: 
The executor with id 1 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:

ContainerStatus(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, exitCode=52, finishedAt=Time(time=2019-03-26T14:09:22Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:06:12Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
      
19/03/26 14:11:06 SPARK:WARN TaskSetManager: Lost task 1.1 in stage 8.0 (TID 14, 192.168.53.180, executor 2): java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.spark.io.ReadAheadInputStream.<init>(ReadAheadInputStream.java:105)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:82)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:156)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:477)
	at org.apache.spark.sql.execution.UnsafeKVExternalSorter.sortedIterator(UnsafeKVExternalSorter.java:204)

	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:261)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:221)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:360)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:112)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:102)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

19/03/26 14:11:08 SPARK:ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.53.180: 
The executor with id 2 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:

ContainerStatus(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, exitCode=52, finishedAt=Time(time=2019-03-26T14:11:07Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:09:24Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
      
19/03/26 14:11:08 SPARK:WARN TaskSetManager: Lost task 0.1 in stage 8.0 (TID 15, 192.168.53.180, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: 
The executor with id 2 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:

ContainerStatus(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, exitCode=52, finishedAt=Time(time=2019-03-26T14:11:07Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:09:24Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
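Exit code 52 is Spark's dedicated OOM exit code (SparkExitCode.OOM), which matches the java.lang.OutOfMemoryError in the task logs above. A minimal sketch of the usual mitigations, assuming the job is submitted via spark-submit (values and file name are illustrative, not from this repo):

```
# Exit code 52 = SparkExitCode.OOM. Two common mitigations:
# give each executor more heap, and/or raise the shuffle partition
# count so each task spills and sorts less data at once.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.sql.shuffle.partitions=400 \
  job.py
```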

Version

master

Support bucket regions for data ingestion

Description

Currently we default to the same region that your cluster is in. That should remain the default behavior, but we should also allow ingesting from buckets in other regions. We should also update all of our examples to specify us-west-2 explicitly, so that the examples can run from a cluster in any region.

Notes

We either need to encode the region in the bucket URL, or add a separate config field for the region
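A minimal sketch of the first option, encoding the region in the URL. The `@region` suffix syntax and the function name are hypothetical, chosen only to illustrate the parsing; nothing here is Cortex's actual scheme:

```python
import re

def parse_ingest_url(url, default_region="us-west-2"):
    """Parse a hypothetical s3://bucket[@region]/key URL.

    Falls back to default_region when no region is embedded in the URL.
    """
    m = re.match(r"s3://(?P<bucket>[^/@]+)(?:@(?P<region>[^/]+))?/(?P<key>.+)", url)
    if m is None:
        raise ValueError(f"invalid S3 URL: {url}")
    return m["bucket"], m["region"] or default_region, m["key"]
```

The config-field option would instead keep the URL untouched and read the region from the environment spec, which is less surprising but requires a schema change.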

Differentiate expected vs unexpected operator errors

Description

Distinguish between expected and unexpected errors, so that we can handle them differently (e.g. print stack traces by default only for unexpected errors).

Also, error codes should be passed through to the CLI along with the message, so that we can check codes rather than matching on error message strings.
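One way the split could look, sketched in Python. The class, function, and code names here are hypothetical illustrations, not Cortex's actual operator API:

```python
import traceback

class OperatorError(Exception):
    """An expected error: a user-facing message plus a stable, machine-checkable code."""
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code

def handle(err):
    """Turn any exception into a response the CLI can check by code."""
    if isinstance(err, OperatorError):
        # Expected: surface only the message; the CLI matches on the code
        return {"code": err.code, "message": str(err)}
    # Unexpected: include the stack trace by default
    trace = "".join(traceback.format_exception(type(err), err, err.__traceback__))
    return {"code": "err_unexpected", "message": trace}
```

Because the CLI receives `code` rather than just a message, checks like "was this a missing-bucket error?" become exact comparisons instead of fragile string matching.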
