cortexlabs / cortex
Production infrastructure for machine learning at scale
Home Page: https://cortexlabs.com/
License: Apache License 2.0
E.g. `aws.Init()` could take `region`, and other params could be passed directly (`GetLogs()` could take `logGroup`).

`aws.Init()` could return a client, which would be used for any API calls.
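The client-returning pattern could look like this minimal Python sketch (the real target is the Go `aws` package; `AWSClient`, `init`, and `get_logs` here are hypothetical stand-ins to illustrate the design):

```python
class AWSClient:
    """Holds configuration passed to init() once; all API calls go through it."""

    def __init__(self, region):
        self.region = region

    def get_logs(self, log_group):
        # GetLogs() takes log_group directly instead of reading package-level config
        return "fetching logs from {} in {}".format(log_group, self.region)


def init(region):
    # aws.Init() returns a client rather than mutating global state
    return AWSClient(region)
```

The benefit is that configuration is explicit: tests and multi-region callers can hold separate clients instead of sharing globals.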
There may be an opportunity to improve prediction latency on CPUs and/or GPUs by building TensorFlow Serving from source.

For example, on `p2.xlarge`, TF Serving shows this log message:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

And on `m5.large`, TF Serving shows this log message:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Here's an example of someone building from source.

Question: Different instance types support different instruction sets. What happens if we compile against instruction sets (e.g. AVX-512 on `m5.large`) that are not available on your instance (e.g. `t3.large` or even `t3a.large`)?
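In general, executing an instruction the CPU lacks crashes the process with an illegal-instruction error (SIGILL), so a binary compiled with AVX-512 would not reliably run on instances without it. One way to check up front on Linux is to parse the flags from `/proc/cpuinfo`; a sketch (function names are hypothetical):

```python
def cpu_flags(cpuinfo_text):
    """Extract the instruction-set flags from /proc/cpuinfo contents (Linux)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_flags(required, cpuinfo_text):
    """Return the required flags (e.g. {"avx2", "avx512f", "fma"}) the CPU lacks."""
    return {f.lower() for f in required} - cpu_flags(cpuinfo_text)
```

On a real host, the text would come from `open("/proc/cpuinfo").read()`; an empty result from `missing_flags` means the binary's instruction sets are all available.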
The `Status*` consts in resource/status.go should be shared across resource types.

A `model` / `aggregate` / `transformed_feature` can be defined without using an `estimator` / `aggregator` / `transformer`. If so, Cortex doesn't validate types.
Reduce the amount of required configuration:
- `estimator` has a key `path`, and `model` has keys `estimator` and `estimator_path` (only one of them may be defined)
- For `cortex.normalize`, a `map[string]string` (name -> template text) should be sufficient

Reduce the amount of configuration required for common pipelines.
Parquet file containing IntegerType or DoubleType columns will throw errors during ingestion time because the validation check expects FLOAT_COLUMN to be FloatType and INT_COLUMN to be LongType
During the validation check, accept FloatType or DoubleType when a column is defined to be FLOAT_COLUMN and IntegerType or LongType when a column is defined to be INT_COLUMN
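A sketch of the relaxed check, assuming a simple mapping from Cortex column types to the Spark types accepted at ingestion (names are hypothetical; the real check lives in spark_util.py):

```python
# Cortex column type -> Spark SQL types accepted during ingestion validation
ACCEPTED_SPARK_TYPES = {
    "FLOAT_COLUMN": {"FloatType", "DoubleType"},
    "INT_COLUMN": {"IntegerType", "LongType"},
}


def validate_column(column_type, spark_type):
    """True if a Parquet/CSV column's Spark type satisfies the declared type."""
    return spark_type in ACCEPTED_SPARK_TYPES.get(column_type, set())
```

With this mapping, a Parquet file with `IntegerType` or `DoubleType` columns passes validation instead of erroring at ingestion time.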
Enable hyperparameter tuning: the parameter space is searched (e.g. grid-based search), multiple models are trained, and the most accurate model is deployed in the API.
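A minimal grid-search loop over the parameter space could look like this (a sketch; the training/evaluation callback is a placeholder for a real Cortex training job):

```python
from itertools import product


def grid_search(param_grid, train_and_evaluate):
    """Train one model per parameter combination; return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_evaluate(params)  # e.g. validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The winning `best_params` would then be used for the model that gets deployed in the API.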
Avoid the weirdness that `cx status -h` makes it seem like you can do `cx status models`.

Options:
- Use `resourceType.name` to identify resources. After this change, it should be possible to do `cx get resourceType`, `cx get resourceIdentifier`, or `cx status resourceIdentifier` (in addition to `cx get` and `cx status`), where `resourceIdentifier` can be `resourceName` if it's unique; `resourceType.resourceName` is fully qualified and will always work.
- Require the resource type (e.g. `cx get model dnn` even if there is only one resource called `dnn`).
- Keep it the way it is, possibly improving the docs.
- Add a new CLI command to reduce ambiguity.
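The first option's lookup logic could be sketched like this (hypothetical helper; `known` maps each resource type to the set of defined names):

```python
def resolve(identifier, known):
    """Resolve a resourceIdentifier to (resourceType, resourceName).

    Accepts fully qualified "type.name" always; a bare name only if unique.
    """
    if "." in identifier:
        rtype, name = identifier.split(".", 1)
        if name in known.get(rtype, set()):
            return rtype, name
        raise ValueError("unknown resource: " + identifier)
    matches = [(t, identifier) for t, names in known.items() if identifier in names]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown resource: " + identifier)
    raise ValueError("ambiguous name, use resourceType.resourceName: " + identifier)
```

A bare name that appears under two types errors with a hint to use the qualified form, which matches the "always works" guarantee above.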
Separate model config into `estimator` / `model`, and improve `inputs` / `_default` / `_optional`:

- `_default` defaults to `None` in any case (scalar/list/map)
- `_default` can never be set to `None` explicitly
- If `_default` is set, `_optional` is implicitly set to `True`
- If a `*_COLUMN` type is anywhere in or under the type, a default is not an option
- `_min_count` is an option (and defaults to `0`)
- `@` throughout (i.e. all `input`, also e.g. `model: @dnn` in `api`)
- `training_input` to models
- `type: classification` in model
- `input: INT` is supported (`FLOAT_COLUMN|INT` is not allowed)
- `{arg: INT}`, `{STRING: INT}`, `{STRING: INT_COLUMN}`, `{STRING_COLUMN: INT}`, `{STRING_COLUMN: INT_COLUMN}`, `{STRING_COLUMN: [INT_COLUMN]}`, etc. are all supported; `{[STRING_COLUMN]: INT_COLUMN}` is not supported
- `INT` -> `FLOAT` for inputs; `INT` -> `FLOAT` and `INT_COLUMN` -> `FLOAT_COLUMN` for outputs

input: <TYPE>  # (short form)
input:         # (long form)
  _type: <TYPE>
  _default: <>
  ...
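The `_default` / `_optional` rules above could be enforced by a validator along these lines (a sketch, not the actual config-validation code; `contains_column_type` is a hypothetical helper):

```python
def contains_column_type(t):
    """True if a *_COLUMN type appears anywhere in or under the type spec."""
    if isinstance(t, str):
        return "_COLUMN" in t
    if isinstance(t, list):
        return any(contains_column_type(x) for x in t)
    if isinstance(t, dict):
        return any(contains_column_type(k) or contains_column_type(v)
                   for k, v in t.items())
    return False


def validate_input_schema(schema):
    """Apply the _default/_optional/_min_count rules to one long-form input."""
    if "_default" in schema and schema["_default"] is None:
        raise ValueError("_default can never be set to None explicitly")
    if "_default" in schema and contains_column_type(schema.get("_type")):
        raise ValueError("default is not an option for *_COLUMN types")
    if "_default" in schema:
        schema["_optional"] = True  # _default implies _optional
    schema.setdefault("_min_count", 0)
    return schema
```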
# mean
- kind: aggregator
  name: mean
  output_type: FLOAT
  input: FLOAT_COLUMN|INT_COLUMN

- kind: aggregate
  name: sepal_length_mean
  aggregator: cortex.mean
  input: sepal_length
# bucket_boundaries
- kind: aggregator
  name: bucket_boundaries
  output_type: [FLOAT]
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    num_buckets:
      _type: INT
      _default: 10

- kind: aggregate
  name: sepal_length_bucket
  aggregator: cortex.bucket_boundaries
  input:
    col: sepal_length
    num_buckets: 5
# normalize
- kind: transformer
  name: normalize
  output_type: FLOAT_COLUMN
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    mean: INT|FLOAT
    stddev: INT|FLOAT

- kind: transformed_column
  name: sepal_length_normalized
  transformer: cortex.normalize
  input:
    col: sepal_length
    mean: sepal_length_mean
    stddev: sepal_length_stddev
# weight
- kind: transformer
  name: weight
  output_type: FLOAT_COLUMN
  input:
    col: INT_COLUMN
    class_distribution: {INT: FLOAT}

- kind: transformed_column
  name: weight_column
  transformer: weight
  input:
    col: class
    class_distribution: class_distribution
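For context on the weight transformer, one common way to turn a class distribution into per-class weights is inverse-frequency weighting; a sketch (the formula choice is an assumption for illustration, not necessarily what the real transformer does):

```python
def class_weights(class_distribution):
    """Map {class: count} to {class: weight}, up-weighting rare classes so
    that weight * count is equal across classes: total / (n_classes * count)."""
    total = sum(class_distribution.values())
    n = len(class_distribution)
    return {c: total / (n * count) for c, count in class_distribution.items()}
```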
# iris
- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
    num_classes: INT
  hparams:
    hidden_layers: [INT]
    learning_rate:
      _type: FLOAT
      _default: 0.01

- kind: model
  name: iris-dnn
  estimator: dnn
  target_column: class_indexed
  input:
    normalized_columns:
      - sepal_length_normalized
      - sepal_width_normalized
      - petal_length_normalized
      - petal_width_normalized
    num_classes: 2
  hparams:
    hidden_layers: [4, 4]
  data_partition_ratio: ...
  training: ...
# Fraud
- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
  training_input:
    weight_column: FLOAT_COLUMN

- kind: model
  name: fraud-dnn
  target_column: class
  input:
    normalized_columns: [time_normalized, v1_normalized, ...]
  training_input:
    weight_column: weights
# Insurance
- kind: estimator
  name: dnn
  target_column: FLOAT_COLUMN
  input:
    categorical:
      _type:
        - feature: INT_COLUMN
          categories: [STRING]
      _min_count: 1 # This wouldn't actually be necessary, it's just to demonstrate
    bucketized:
      - feature: INT_COLUMN|FLOAT_COLUMN
        buckets: [INT]

- kind: model
  name: insurance-dnn
  estimator: insurance-dnn
  target_column: charges_normalized
  input:
    categorical:
      - feature: gender
        categories: ["female", "male"]
      - feature: smoker
        categories: ["yes", "no"]
      - feature: region
        categories: ["northwest", "northeast", "southwest", "southeast"]
      - feature: children
        categories: children_set
    bucketized:
      - feature: age
        buckets: [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
      - feature: bmi
        buckets: [15, 20, 25, 35, 40, 45, 50, 55]
# Poker
- kind: estimator
  name: dnn
  target_column: INT
  input:
    suit_columns: [INT_COLUMN]
    rank_columns: [INT_COLUMN]

- kind: model
  name: poker-dnn
  type: classification
  target_column: class
  input:
    suit_columns: [card_1_suit, card_2_suit, card_3_suit, card_4_suit, card_5_suit]
    rank_columns: [card_1_rank, card_2_rank, card_3_rank, card_4_rank, card_5_rank]
# Mnist
- kind: model
  name: t2t
  target_column: INT
  input: FLOAT_LIST_COLUMN
  prediction_key: outputs

- kind: trainer
  name: mnist-t2t
  target_column: label
  input: image_pixels
review tf.feature_column
tf.feature_column.indicator_column(
tf.feature_column.categorical_column_with_vocabulary_list("smoker", ["yes", "no"])
),
tf.feature_column.bucketized_column(
tf.feature_column.numeric_column("age"), [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
),
tf.feature_column.numeric_column("sepal_length_normalized"),
tf.feature_column.numeric_column("image_pixels", shape=model_config["hparams"]["input_shape"])
weight_column = "class_weight"
# cloudml-template
categorical_columns_with_identity = {
item[0]: tf.feature_column.categorical_column_with_identity(item[0], item[1])
for item in categorical_feature_names_with_identity.items()
}
categorical_columns_with_vocabulary = {
item[0]: tf.feature_column.categorical_column_with_vocabulary_list(item[0], item[1])
for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_VOCABULARY.items()
}
categorical_columns_with_hash_bucket = {
item[0]: tf.feature_column.categorical_column_with_hash_bucket(item[0], item[1])
for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_HASH_BUCKET.items()
}
age_buckets = tf.feature_column.bucketized_column(
feature_columns["age"], boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
)
education_X_occupation = tf.feature_column.crossed_column(
["education", "occupation"], hash_bucket_size=int(1e4)
)
native_country_embedded = tf.feature_column.embedding_column(
feature_columns["native_country"], dimension=task.HYPER_PARAMS.embedding_size
)
transformer / transformed_feature and aggregator / aggregate

Long-running jobs error (they are actually still running, but Cortex shows them as `error`)
Run any long job (e.g. training job with a lot of steps)
This is due to a known bug in Argo, fixed here. It should be released with v2.3 (releases)
time="2019-01-15T01:05:13Z" level=info msg="Creating a docker executor"
time="2019-01-15T01:05:13Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation: {}\ninputs: {}\nmetadata:\n labels:\n appName: iris\n argo: \"true\"\n workloadId: wrguzieqiqtj6gic4wlx\n workloadType: data-job\nname: wrguzieqiqtj6gic4wlx\noutputs: {}\nresource:\n action: create\n failureCondition: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)\n manifest: '{\"kind\":\"SparkApplication\",\"apiVersion\":\"sparkoperator.k8s.io/v1alpha1\",\"metadata\":{\"name\":\"wrguzieqiqtj6gic4wlx\",\"namespace\":\"cortex\",\"creationTimestamp\":null,\"labels\":{\"appName\":\"iris\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"ownerReferences\":[{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"Workflow\",\"name\":\"argo-iris-7wvh7\",\"uid\":\"9bc7aee5-1861-11e9-a664-026185601f56\",\"blockOwnerDeletion\":false}]},\"spec\":{\"type\":\"Python\",\"mode\":\"cluster\",\"image\":\"764403040460.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\",\"imagePullPolicy\":\"Always\",\"mainApplicationFile\":\"local:///src/spark_job/spark_job.py\",\"arguments\":[\"--workload-id=wrguzieqiqtj6gic4wlx\n --context=s3://cortex-cluster-david/apps/iris/contexts/83b914617c417fbe26b59c7839d77914b43f02c48fb52c713a5d4c094dff0bd.msgpack\n --cache-dir=/mnt/context --raw-features=ad88a38e98b74d6dfc1a718b51c3da769141f46ddd0cd8fb96ca9fd01a71521,9bcb141ed9f8ee4f52c55a40e16815d2a3eddbdc118cc81751e762c73134820,ffa5ac7dad3f546d5b867892d4bf820ba34e83362042c894dc86527295b9352,bacc1fe5d16b0dbbb3527ea9de82409f67b926d0243a549cb6ed844ef76cdd7,b02d30d78186521e2399ba6a93e0f20459eb72e661234c9cf88cd289a78af30\n 
--aggregates=5566592b7087b3cf822b7a7a56085fcfd1678b363db00fefefab725f7892d37,0c8177a3736da4987b7629346ca461eced6f090c40edbbe9b67595ca00f73ce,ea21d88863bf2fa675550ae8e57d301d247b311b8f7441e1a342f0803a352e8,9fdd4490123c08fc5da726e3ca597f9b0cf9f0f7b7762a5fca7ea7b036df33b,2a55250299f0a8935f8aafba8cef94e5f7776c8e4c480e2d7f42fc5ca336d36,d59e72278cf560705ddb8cd11dbbb9dd2592165d458f51ddefa425f35fe0a3d,77b9f029fcaac3f513e549042f2f64eee9e73ad0d2c7549ec167943f2adbd46,d3e9aaa136d3e338ac1654e4da3ce5fbe6f6e6517e375672efd9ce3b20bedef,a8a3c0802612f715243577910061579af8cc49bc44ffc8f580e9746c264a237\n --transformed-features=8cc4ed1ed6c711dc775f51d8d129f5b0030ae109c9ae1605f0530c3ecdf511b,1d26f726a9c1879096147a7219418752140c6572a86d29a8a273ac8474c4622,79df34c5ba324066ec281a70ca9f45dd027e24a243bc137efc24eb1f928b708,ba8d0bc773530313233b0c51a27526ab6f13db97b02c9b8114bd9f0667793c1,15533d18f1de1b545c1a32dd6f6df24481f07fe314f78219ad6bab782276703\n --training-datasets=27fd4e1f80338ab4fbe8f2c67d58b6d521d2576f3cd492ad1ace5c80d94e413\n 
--ingest\"],\"driver\":{\"cores\":10,\"memory\":\"512000k\",\"envSecretKeyRefs\":{\"AWS_ACCESS_KEY_ID\":{\"name\":\"aws-credentials\",\"key\":\"AWS_ACCESS_KEY_ID\"},\"AWS_SECRET_ACCESS_KEY\":{\"name\":\"aws-credentials\",\"key\":\"AWS_SECRET_ACCESS_KEY\"}},\"labels\":{\"appName\":\"iris\",\"userFacing\":\"true\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"podName\":\"wrguzieqiqtj6gic4wlx\",\"serviceAccount\":\"spark\"},\"executor\":{\"cores\":1,\"memory\":\"512000k\",\"envSecretKeyRefs\":{\"AWS_ACCESS_KEY_ID\":{\"name\":\"aws-credentials\",\"key\":\"AWS_ACCESS_KEY_ID\"},\"AWS_SECRET_ACCESS_KEY\":{\"name\":\"aws-credentials\",\"key\":\"AWS_SECRET_ACCESS_KEY\"}},\"labels\":{\"appName\":\"iris\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"instances\":1},\"deps\":{\"pyFiles\":[\"local:///src/spark_job/spark_util.py\",\"local:///src/lib/*.py\"]},\"restartPolicy\":{\"type\":\"Never\"},\"pythonVersion\":\"3\"},\"status\":{\"lastSubmissionAttemptTime\":null,\"completionTime\":null,\"driverInfo\":{},\"applicationState\":{\"state\":\"\",\"errorMessage\":\"\"}}}'\n successCondition: status.applicationState.state in (COMPLETED)\n"
time="2019-01-15T01:05:13Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-01-15T01:05:13Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o name"
time="2019-01-15T01:05:17Z" level=info msg=sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx
time="2019-01-15T01:05:17Z" level=info msg="Waiting for conditions: status.applicationState.state in (COMPLETED)"
time="2019-01-15T01:05:17Z" level=info msg="Failing for conditions: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)"
time="2019-01-15T01:05:17Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:05:20Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:20Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:20Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:20Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:05:41Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:41Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:05:41Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:41Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:07:17Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:07:17Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:07:17Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:07:17Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:07:22Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:07:22Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:07:22Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:07:22Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:07:22Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:09:08Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:09:08Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:09:08Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:09:08Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:09:28Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:09:28Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:09:28Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:09:28Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:09:28Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:10:51Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:10:51Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:10:51Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:10:51Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:12:11Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:12:11Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:12:11Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:12:11Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:12:11Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:12:25Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:12:25Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:12:25Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:12:25Z" level=info msg="0/1 success conditions matched"
Probably after some timeout, so the user could see them if they want, and so logs get uploaded.
operator-b58fbb9b5-z2ftz 1/1 Running 0 11m
operator-b58fbb9b5-zdmjx 0/1 Evicted 0 11m
It should be possible to have a column defined in env.data.schema that isn't used as a raw_column (CSV and Parquet).

To reproduce: start from Iris, and then completely remove one of the features (e.g. `sepal_width`).
Starting
INFO:cortex:Starting
INFO:cortex:Ingesting
INFO:cortex:Ingesting iris-1 data from s3a://cortex-examples/iris.csv
ERROR:cortex:An error occurred, see `cx logs raw_column sepal_width` for more details.
Traceback (most recent call last):
File "/src/spark_job/spark_job.py", line 307, in run_job
raw_df = ingest_raw_dataset(spark, ctx, cols_to_validate, should_ingest)
File "/src/spark_job/spark_job.py", line 151, in ingest_raw_dataset
ingest_df = spark_util.ingest(ctx, spark)
File "/src/spark_job/spark_util.py", line 223, in ingest
expected_schema = expected_schema_from_context(ctx)
File "/src/spark_job/spark_util.py", line 117, in expected_schema_from_context
for fname in expected_field_names
File "/src/spark_job/spark_util.py", line 117, in <listcomp>
for fname in expected_field_names
KeyError: 'petal_width'
master
It should also be possible to not ingest all of the columns in the dataset (just Parquet for now); we should test this.
Allow users to upload objects to S3 and use them in aggregators, transformers, and estimators.
Some artifacts are pre-computed and need to be accessed in the pipeline (e.g. word embeddings).
Find a way to use (or ask if the user wants to use) AWS credentials from ~/.aws
Add an option to environment.schema to use CSV headers and parquet column names as the raw column names.
Reduce the amount of required YAML configuration when the raw data already has the column names.
E.g. implement transformations/analyses that are built into Spark ML.
Failed to create training datasets.
Reproducible in many of the examples, but easiest to reproduce in Poker.
1. `cx deploy` to process the raw columns
2. `cx deploy`
3. `cx logs dnn/training_dataset`
Failed to start:
time="2019-04-18T20:12:09Z" level=info msg="Creating a docker executor"
time="2019-04-18T20:12:09Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation: {}\ninputs: {}\nmetadata:\n labels:\n appName: recommendations\n argo: \"true\"\n workloadID: bord6fnh1lma5hyn8my3\n workloadType: data-job\nname: bord6fnh1lma5hyn8my3\noutputs: {}\nresource:\n action: create\n failureCondition: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)\n manifest: |-\n {\n \"kind\": \"SparkApplication\",\n \"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\n \"metadata\": {\n \"name\": \"bord6fnh1lma5hyn8my3\",\n \"namespace\": \"cortex\",\n \"creationTimestamp\": null,\n \"labels\": {\n \"appName\": \"recommendations\",\n \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n \"workloadType\": \"data-job\"\n },\n \"ownerReferences\": [\n {\n \"apiVersion\": \"argoproj.io/v1alpha1\",\n \"kind\": \"Workflow\",\n \"name\": \"argo-recommendations-rplw6\",\n \"uid\": \"3dca2989-6216-11e9-aaf1-02cc01957708\",\n \"blockOwnerDeletion\": false\n }\n ]\n },\n \"spec\": {\n \"type\": \"Python\",\n \"mode\": \"cluster\",\n \"image\": \"969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\",\n \"imagePullPolicy\": \"Always\",\n \"mainApplicationFile\": \"local:///src/spark_job/spark_job.py\",\n \"arguments\": [\n \"--workload-id=bord6fnh1lma5hyn8my3 --context=s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack --cache-dir=/mnt/context --raw-columns= --aggregates= --transformed-columns= --training-datasets=d6c73248656984e3d08a6165cd3b34de27253021cb94c232526f96776999d73\"\n ],\n \"driver\": {\n \"cores\": 0,\n \"memory\": \"0k\",\n \"envVars\": {\n \"CORTEX_CACHE_DIR\": \"/mnt/context\",\n \"CORTEX_CONTEXT_S3_PATH\": \"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\",\n \"CORTEX_SPARK_VERBOSITY\": \"WARN\",\n 
\"CORTEX_WORKLOAD_ID\": \"bord6fnh1lma5hyn8my3\"\n },\n \"envSecretKeyRefs\": {\n \"AWS_ACCESS_KEY_ID\": {\n \"name\": \"aws-credentials\",\n \"key\": \"AWS_ACCESS_KEY_ID\"\n },\n \"AWS_SECRET_ACCESS_KEY\": {\n \"name\": \"aws-credentials\",\n \"key\": \"AWS_SECRET_ACCESS_KEY\"\n }\n },\n \"labels\": {\n \"appName\": \"recommendations\",\n \"userFacing\": \"true\",\n \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n \"workloadType\": \"data-job\"\n },\n \"podName\": \"bord6fnh1lma5hyn8my3\",\n \"serviceAccount\": \"spark\"\n },\n \"executor\": {\n \"cores\": 0,\n \"memory\": \"0k\",\n \"envVars\": {\n \"CORTEX_CACHE_DIR\": \"/mnt/context\",\n \"CORTEX_CONTEXT_S3_PATH\": \"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\",\n \"CORTEX_SPARK_VERBOSITY\": \"WARN\",\n \"CORTEX_WORKLOAD_ID\": \"bord6fnh1lma5hyn8my3\"\n },\n \"envSecretKeyRefs\": {\n \"AWS_ACCESS_KEY_ID\": {\n \"name\": \"aws-credentials\",\n \"key\": \"AWS_ACCESS_KEY_ID\"\n },\n \"AWS_SECRET_ACCESS_KEY\": {\n \"name\": \"aws-credentials\",\n \"key\": \"AWS_SECRET_ACCESS_KEY\"\n }\n },\n \"labels\": {\n \"appName\": \"recommendations\",\n \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n \"workloadType\": \"data-job\"\n },\n \"instances\": 0\n },\n \"deps\": {\n \"pyFiles\": [\n \"local:///src/spark_job/spark_util.py\",\n \"local:///src/lib/*.py\"\n ]\n },\n \"restartPolicy\": {\n \"type\": \"Never\"\n },\n \"pythonVersion\": \"3\"\n },\n \"status\": {\n \"lastSubmissionAttemptTime\": null,\n \"completionTime\": null,\n \"driverInfo\": {},\n \"applicationState\": {\n \"state\": \"\",\n \"errorMessage\": \"\"\n }\n }\n }\n successCondition: status.applicationState.state in (COMPLETED)\n"
time="2019-04-18T20:12:09Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-04-18T20:12:09Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o name"
time="2019-04-18T20:12:10Z" level=fatal msg="The SparkApplication \"bord6fnh1lma5hyn8my3\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"sparkoperator.k8s.io/v1alpha1\", \"kind\":\"SparkApplication\", \"metadata\":map[string]interface {}{\"name\":\"bord6fnh1lma5hyn8my3\", \"namespace\":\"cortex\", \"creationTimestamp\":\"2019-04-18T20:12:09Z\", \"labels\":map[string]interface {}{\"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\", \"appName\":\"recommendations\"}, \"ownerReferences\":[]interface {}{map[string]interface {}{\"apiVersion\":\"argoproj.io/v1alpha1\", \"kind\":\"Workflow\", \"name\":\"argo-recommendations-rplw6\", \"uid\":\"3dca2989-6216-11e9-aaf1-02cc01957708\", \"blockOwnerDeletion\":false}}, \"generation\":1, \"uid\":\"3f1c787f-6216-11e9-aaf1-02cc01957708\", \"selfLink\":\"\"}, \"spec\":map[string]interface {}{\"image\":\"969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\", \"mainApplicationFile\":\"local:///src/spark_job/spark_job.py\", \"mode\":\"cluster\", \"restartPolicy\":map[string]interface {}{\"type\":\"Never\"}, \"type\":\"Python\", \"driver\":map[string]interface {}{\"serviceAccount\":\"spark\", \"cores\":0, \"envSecretKeyRefs\":map[string]interface {}{\"AWS_ACCESS_KEY_ID\":map[string]interface {}{\"key\":\"AWS_ACCESS_KEY_ID\", \"name\":\"aws-credentials\"}, \"AWS_SECRET_ACCESS_KEY\":map[string]interface {}{\"key\":\"AWS_SECRET_ACCESS_KEY\", \"name\":\"aws-credentials\"}}, \"envVars\":map[string]interface {}{\"CORTEX_CACHE_DIR\":\"/mnt/context\", \"CORTEX_CONTEXT_S3_PATH\":\"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\", \"CORTEX_SPARK_VERBOSITY\":\"WARN\", \"CORTEX_WORKLOAD_ID\":\"bord6fnh1lma5hyn8my3\"}, \"labels\":map[string]interface {}{\"appName\":\"recommendations\", \"userFacing\":\"true\", \"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\"}, \"memory\":\"0k\", 
\"podName\":\"bord6fnh1lma5hyn8my3\"}, \"deps\":map[string]interface {}{\"pyFiles\":[]interface {}{\"local:///src/spark_job/spark_util.py\", \"local:///src/lib/*.py\"}}, \"executor\":map[string]interface {}{\"envVars\":map[string]interface {}{\"CORTEX_CACHE_DIR\":\"/mnt/context\", \"CORTEX_CONTEXT_S3_PATH\":\"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\", \"CORTEX_SPARK_VERBOSITY\":\"WARN\", \"CORTEX_WORKLOAD_ID\":\"bord6fnh1lma5hyn8my3\"}, \"instances\":0, \"labels\":map[string]interface {}{\"appName\":\"recommendations\", \"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\"}, \"memory\":\"0k\", \"cores\":0, \"envSecretKeyRefs\":map[string]interface {}{\"AWS_ACCESS_KEY_ID\":map[string]interface {}{\"key\":\"AWS_ACCESS_KEY_ID\", \"name\":\"aws-credentials\"}, \"AWS_SECRET_ACCESS_KEY\":map[string]interface {}{\"name\":\"aws-credentials\", \"key\":\"AWS_SECRET_ACCESS_KEY\"}}}, \"imagePullPolicy\":\"Always\", \"pythonVersion\":\"3\", \"arguments\":[]interface {}{\"--workload-id=bord6fnh1lma5hyn8my3 --context=s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack --cache-dir=/mnt/context --raw-columns= --aggregates= --transformed-columns= --training-datasets=d6c73248656984e3d08a6165cd3b34de27253021cb94c232526f96776999d73\"}}, \"status\":map[string]interface {}{\"lastSubmissionAttemptTime\":interface {}(nil), \"applicationState\":map[string]interface {}{\"errorMessage\":\"\", \"state\":\"\"}, \"completionTime\":interface {}(nil), \"driverInfo\":map[string]interface {}{}}}: validation failure list:\nspec.driver.cores in body should be greater than 0\nspec.executor.instances in body should be greater than or equal to 1\nspec.executor.cores in body should be greater than 
0\ngithub.com/argoproj/argo/errors.New\n\t/root/go/src/github.com/argoproj/argo/errors/errors.go:48\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).ExecResource\n\t/root/go/src/github.com/argoproj/argo/workflow/executor/resource.go:36\ngithub.com/argoproj/argo/cmd/argoexec/commands.execResource\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:38\ngithub.com/argoproj/argo/cmd/argoexec/commands.glob..func2\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:23\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:766\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:198\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2361"
master
Hey there - I was just curious if you have a real website that isn't just a redirect to your repo?
I noticed you have docs and a job posting page but no normal landing page. Is this by design due to the stealth launch? Or perhaps issues with SEO, as there seems to be another company very close to your naming - https://www.cortexlabs.ai/
Obviously the development is more important, but it is always nice to be able to send non-technical/business colleagues to a place where they can get an overview without crawling through a repo :)
Version: Master
Stacktrace:
19/03/26 14:07:12 SPARK:WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
19/03/26 14:09:22 SPARK:WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 12, 192.168.56.198, executor 1): java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.io.ReadAheadInputStream.<init>(ReadAheadInputStream.java:105)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:82)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:156)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:477)
at org.apache.spark.sql.execution.UnsafeKVExternalSorter.sortedIterator(UnsafeKVExternalSorter.java:204)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:261)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:221)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:360)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:112)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:102)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/03/26 14:09:24 SPARK:ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.56.198:
The executor with id 1 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:
ContainerStatus(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, exitCode=52, finishedAt=Time(time=2019-03-26T14:09:22Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:06:12Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
19/03/26 14:09:24 SPARK:WARN TaskSetManager: Lost task 1.0 in stage 8.0 (TID 13, 192.168.56.198, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason:
The executor with id 1 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:
ContainerStatus(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, exitCode=52, finishedAt=Time(time=2019-03-26T14:09:22Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:06:12Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
19/03/26 14:11:06 SPARK:WARN TaskSetManager: Lost task 1.1 in stage 8.0 (TID 14, 192.168.53.180, executor 2): java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.io.ReadAheadInputStream.<init>(ReadAheadInputStream.java:105)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:82)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:156)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:477)
at org.apache.spark.sql.execution.UnsafeKVExternalSorter.sortedIterator(UnsafeKVExternalSorter.java:204)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:261)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:221)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:360)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:112)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:102)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/03/26 14:11:08 SPARK:ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.53.180:
The executor with id 2 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:
ContainerStatus(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, exitCode=52, finishedAt=Time(time=2019-03-26T14:11:07Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:09:24Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
19/03/26 14:11:08 SPARK:WARN TaskSetManager: Lost task 0.1 in stage 8.0 (TID 15, 192.168.53.180, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason:
The executor with id 2 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:
ContainerStatus(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, exitCode=52, finishedAt=Time(time=2019-03-26T14:11:07Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:09:24Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
Add an example of a recommendation service.
Add an option in environments to cache the entire dataset in memory for Spark jobs
Small data jobs could run faster
If speedup is not significant, it's not worth implementing this feature
API is sometimes temporarily unavailable when updating to a new model
Steps to reproduce (fraud example):
1. cx deploy
2. Modify the dnn (e.g. to train for 100 steps)
3. cx deploy
4. cx predict fraud transactions.json
error: api "fraud" is updating
Expected: successful prediction requests at all times (smooth transition to the new model)
master
When multiple prediction requests are made in the same API call, the API container should batch them into a single gRPC request to TensorFlow Serving
Here's an example, but their code might not be relevant since they don't use the tensorflow package (an optimization that is unnecessary for us since our client is always running)
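The batching step itself is framework-agnostic: merge the per-sample payloads from one API call into a single batched payload, then issue one gRPC PredictRequest instead of N. A minimal sketch of that merge (the function name and payload shape are illustrative, not Cortex's actual code):

```python
# Hypothetical sketch: collect all samples from one API call into a single
# batched payload, so only one gRPC request is sent to TensorFlow Serving.

def batch_instances(instances):
    """Merge per-sample feature dicts into one dict of feature batches.

    [{"x": 1, "y": 2}, {"x": 3, "y": 4}] -> {"x": [1, 3], "y": [2, 4]}
    """
    if not instances:
        return {}
    batched = {name: [] for name in instances[0]}
    for instance in instances:
        for name, value in instance.items():
            batched[name].append(value)
    return batched
```

The batched dict would then be converted into a single PredictRequest tensor per feature, rather than building one request per sample.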
Metadata files should be stored in a shared metadata root directory, rather than a path scoped to resources like it is now. Consuming metadata will no longer require a metadata_key, and the storage path can be derived from the resource ID alone.
Option 1: save both mappings in the analysis. Could make it a different transformer/aggregator? Probably not
Option 2: At startup time, predict.py can create the mapping of string -> index to speed up transformations (keep index -> string array also, for reverse transformations). Is it custom for just this analysis? Or is there an API for this (e.g. "preprocess")?
Faster real-time predictions
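Option 2 could be sketched as follows (the input shape is an assumption: a saved list of strings in index order, as produced by the index analysis):

```python
# Sketch of Option 2: at startup, predict.py builds a string -> index dict
# for fast transformations, and keeps the index -> string list for reverse
# transformations. Per-prediction lookups then become O(1).

def build_index_maps(index):
    """index: list of strings in index order, e.g. loaded from the analysis."""
    index_to_string = list(index)
    string_to_index = {s: i for i, s in enumerate(index_to_string)}
    return string_to_index, index_to_string

# Example with assumed values from the iris example:
string_to_index, index_to_string = build_index_maps(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
)
```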
cx deploy should provide a nice Cortex validation error if the app name has an underscore (hyphens are ok)
Improve developer experience
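The check could be sketched like this (a hypothetical helper; the pattern follows DNS-1123 label rules, which Kubernetes resource names must presumably satisfy and which is likely why underscores break deployments):

```python
import re

# Hypothetical validation sketch: hyphens allowed, underscores and other
# special characters rejected with a clear Cortex error message.
APP_NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def validate_app_name(name):
    """Return an error message, or None if the name is valid."""
    if "_" in name:
        return 'app name "{}" cannot contain underscores (use hyphens instead)'.format(name)
    if not APP_NAME_RE.match(name):
        return 'app name "{}" may only contain lowercase letters, numbers, and hyphens'.format(name)
    return None
```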
Storing metadata in a relational database (e.g. resource configuration, model accuracy, logs?, job statuses?) would allow us to query information about current and past resources/workloads. We will need to add CLI commands for this as well (e.g. query past models, jobs, ...).
There is a bug in TensorFlow 1.12 which breaks dropout layers in Keras.
If the user doesn't specify any raw features, infer them from the environment's schema.
This is a continuation of making all schemas optional to reduce the amount of required config.
API names cannot have underscores, but cx init produces apis.yaml with this example template:
## Sample API:
#
# - kind: api
# name: my_api
# model_name: my_model
# compute:
# replicas: 1
Change the underscore to a hyphen in the template.
cx deploy will fail validation later down the road, so we should also prevent users from running e.g. cx init iris_1, and run other app name validation (spaces, etc.).
Hyphens are ok
Improve developer experience
We may need to revisit all of our CLI command names
Running cx logs on an API shows:
/usr/local/lib/python3.5/dist-packages/tensorflow_serving/apis/prediction_service_pb2.py:131: DeprecationWarning: beta_create_PredictionService_stub() method is deprecated. This method will be removed in near future versions of TF Serving. Please switch to GA gRPC API in prediction_service_pb2_grpc.
'prediction_service_pb2_grpc.', DeprecationWarning)
Currently we support _default, _optional, _min_count, and _max_count. We should add support for _min / _max (floats and ints), and _values (or _allowed_values) (all types). For example, we could use _min: 1 for num_buckets, and _allowed_values for all estimator optimizers and the BoostedTrees pruning mode, ...
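The proposed checks could be sketched like this (check_arg is a hypothetical helper; the surrounding config-parsing machinery is assumed):

```python
# Sketch of the proposed _min / _max / _allowed_values checks. The key
# names come from this issue; how specs are attached to args is assumed.

def check_arg(value, spec):
    """Return an error message, or None if the value passes all checks."""
    if "_min" in spec and value < spec["_min"]:
        return "must be >= {}".format(spec["_min"])
    if "_max" in spec and value > spec["_max"]:
        return "must be <= {}".format(spec["_max"])
    if "_allowed_values" in spec and value not in spec["_allowed_values"]:
        return "must be one of {}".format(spec["_allowed_values"])
    return None
```

For example, num_buckets could carry {"_min": 1}, and an optimizer arg could carry {"_allowed_values": ["adam", "ftrl", ...]}.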
We should avoid repeatedly adding files to the Spark context so we can remove this warning:
...
INFO:cortex:Sanity checking transformers against the first 100 samples
19/03/26 01:43:53 SPARK:WARN CacheManager: Asked to cache already cached data.
INFO:cortex:Transforming class to class_indexed
19/03/26 01:43:54 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_index_string.py has been added already. Overwriting of added paths is not supported in the current version.
19/03/26 01:43:55 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_index_string.py has been added already. Overwriting of added paths is not supported in the current version.
INFO:cortex:class: Iris-setosa Iris-setosa Iris-setosa
INFO:cortex:class_indexed: 0 0 0
INFO:cortex:Transforming petal_length to petal_length_normalized
19/03/26 01:43:56 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_normalize.py has been added already. Overwriting of added paths is not supported in the current version.
19/03/26 01:43:56 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_normalize.py has been added already. Overwriting of added paths is not supported in the current version.
...
Add a flag in the api config to add transformed feature values to the prediction response. Or respond with these values automatically (without adding a flag to enable/disable it).
Clients may need to perform post-processing on the prediction response based on the values of the transformed features (e.g. entity extraction requires the tokenized text).
Add a -r/--raw flag to the CLI to show the full JSON response?
"_" in the schema denotes not ingesting that column, and columns can be left off the end of the CSV. For example, if there were columns named 1, 2, 3, 4, 5 and the user only wanted 1, 2, and 4, they could be ingested with a CSV schema of [1, 2, _, 4]
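The proposed schema semantics could be sketched like this (ingest_row is a hypothetical helper shown only to illustrate the "_" and trailing-column behavior, not the actual Spark ingestion path):

```python
import csv
import io

# Hypothetical sketch: "_" in the schema skips that column, and trailing
# columns may be omitted from the schema entirely.

def ingest_row(row, schema):
    record = {}
    for name, value in zip(schema, row):  # zip drops trailing unlisted columns
        if name != "_":
            record[name] = value
    return record

reader = csv.reader(io.StringIO("1.0,2.0,3.0,4.0,5.0"))
row = next(reader)
record = ingest_row(row, ["c1", "c2", "_", "c4"])  # skips col 3, drops col 5
```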
Motivation: a user wants to make a change to the operator to support a custom use case and wants to build Cortex from source. This also makes it easier to onboard engineers.
Remove error strings from the strings package, and move them to package-specific errors.go files.
Avoid package dependency cycles
Version: 0.2
Stacktrace:
ERROR:cortex:An error occurred, see cx logs raw_column petal_width for more details.
Traceback (most recent call last):
File "/src/spark_job/spark_job.py", line 311, in run_job
run_custom_aggregators(spark, ctx, cols_to_aggregate, raw_df)
File "/src/spark_job/spark_job.py", line 209, in run_custom_aggregators
ctx.upload_resource_status_success(*builtin_aggregates)
File "/src/lib/context.py", line 442, in upload_resource_status_success
self.upload_resource_status_end("succeeded", *resources)
File "/src/lib/context.py", line 450, in upload_resource_status_end
status = self.get_resource_status(resource)
File "/src/lib/context.py", line 411, in get_resource_status
return aws.read_json_from_s3(key, self.bucket)
File "/src/lib/aws.py", line 154, in read_json_from_s3
obj = read_bytes_from_s3(key, bucket, allow_missing, client_config).decode("utf-8")
AttributeError: 'NoneType' object has no attribute 'decode'
19/03/25 21:38:15 SPARK:WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
Additional information:
cx logs keeps showing error logs even after a successful run of cx refresh.
The strings package shouldn't have anything Cortex-specific, and should be moved to lib.
pkg/lib/interfaces/interfaces.go would use PrimType from whichever package it moves to, and would use ErrorInvalidPrimitiveType from that package, so we could delete pkg/lib/interfaces/errors.go
Sometimes a deployment fails in the aggregation step.
Iris examples
Not consistently reproducible.
cortex logs sepal_width_mean
(Same stack trace as in the issue above: AttributeError: 'NoneType' object has no attribute 'decode' in aws.read_json_from_s3, followed by the Kubernetes client shutdown warning.)
0.2
Set up the infrastructure to deploy Cortex locally
Aggregate says it is still in progress even though logs say out of memory
Added a sum_distinct_float aggregate operation to the normalize template in fraud
Aggregate remains in "aggregating" status
Expected aggregate to "fail" given that we are summing distinctly on a float column with many different values.
(Same Spark logs as in the issue above: java.lang.OutOfMemoryError: Java heap space in TungstenAggregationIterator, with executors 1 and 2 lost with exit code 52.)
master
Rename the transformed_column parameter to transformed_column_name in transform_spark().
Make it clearer to the user what this argument contains.
Currently we default to the same region that your cluster is in. That should remain the default behavior, but we should allow ingesting from buckets in other regions. We should also update all of our examples to specify us-west-2, so that the examples can run from any region.
We either need to add the region to the bucket URL, or add a config field for region
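The two options might look like this in environment config (both the URL form and the region key are hypothetical, shown only to illustrate the choice):

```yaml
# Option A: region embedded in the data path (hypothetical URL form)
data: s3://my-bucket.us-west-2/data.csv

# Option B: explicit config field (hypothetical key)
data: s3://my-bucket/data.csv
region: us-west-2
```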
Distinguish between expected and unexpected errors, so we can handle them differently (e.g. print stack traces by default for unexpected errors)
Also, error codes should get passed through to the CLI in addition to the message, so that we can check codes rather than error messages.