cortexlabs / cortex
Production infrastructure for machine learning at scale
Home Page: https://cortexlabs.com/
License: Apache License 2.0
E.g. `aws.Init()` could take `region`, and other params could be passed directly (`GetLogs()` could take `logGroup`).

`aws.Init()` could return a client, which would be used for any API calls.
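The client-returning pattern could look like this minimal Python sketch (the real target is the Go `aws` package; `AWSClient`, `init`, and `get_logs` here are hypothetical stand-ins to illustrate the design):

```python
class AWSClient:
    """Holds configuration passed to init() once; all API calls go through it."""

    def __init__(self, region):
        self.region = region

    def get_logs(self, log_group):
        # GetLogs() takes log_group directly instead of reading package-level config
        return "fetching logs from {} in {}".format(log_group, self.region)


def init(region):
    # aws.Init() returns a client rather than mutating global state
    return AWSClient(region)
```

The benefit is that configuration is explicit: tests and multi-region callers can hold separate clients instead of sharing globals.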
There may be an opportunity to improve prediction latency on CPUs and/or GPUs by building TensorFlow Serving from source.

For example, on `p2.xlarge`, TF Serving shows this log message:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

And on `m5.large`, TF Serving shows this log message:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Here's an example of someone building from source.

Question: Different instance types support different instruction sets. What happens if we compile against instruction sets (e.g. AVX-512 on `m5.large`) that are not available on your instance (e.g. `t3.large` or even `t3a.large`)?
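In general, executing an instruction the CPU lacks crashes the process with an illegal-instruction error (SIGILL), so a binary compiled with AVX-512 would not reliably run on instances without it. One way to check up front on Linux is to parse the flags from `/proc/cpuinfo`; a sketch (function names are hypothetical):

```python
def cpu_flags(cpuinfo_text):
    """Extract the instruction-set flags from /proc/cpuinfo contents (Linux)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_flags(required, cpuinfo_text):
    """Return the required flags (e.g. {"avx2", "avx512f", "fma"}) the CPU lacks."""
    return {f.lower() for f in required} - cpu_flags(cpuinfo_text)
```

On a real host, the text would come from `open("/proc/cpuinfo").read()`; an empty result from `missing_flags` means the binary's instruction sets are all available.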
The `Status*` consts in resource/status.go should be shared across resource types.

A `model` / `aggregate` / `transformed_feature` can be defined without using an `estimator` / `aggregator` / `transformer`. If so, Cortex doesn't validate types.
Reduce the amount of required configuration:
- `estimator` has a key `path`, and `model` has keys `estimator` and `estimator_path` (only one of them may be defined)
- For `cortex.normalize`, a `map[string]string` (name -> template text) should be sufficient

Reduce the amount of configuration required for common pipelines.
Parquet file containing IntegerType or DoubleType columns will throw errors during ingestion time because the validation check expects FLOAT_COLUMN to be FloatType and INT_COLUMN to be LongType
During the validation check, accept FloatType or DoubleType when a column is defined to be FLOAT_COLUMN and IntegerType or LongType when a column is defined to be INT_COLUMN
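A sketch of the relaxed check, assuming a simple mapping from Cortex column types to the Spark types accepted at ingestion (names are hypothetical; the real check lives in spark_util.py):

```python
# Cortex column type -> Spark SQL types accepted during ingestion validation
ACCEPTED_SPARK_TYPES = {
    "FLOAT_COLUMN": {"FloatType", "DoubleType"},
    "INT_COLUMN": {"IntegerType", "LongType"},
}


def validate_column(column_type, spark_type):
    """True if a Parquet/CSV column's Spark type satisfies the declared type."""
    return spark_type in ACCEPTED_SPARK_TYPES.get(column_type, set())
```

With this mapping, a Parquet file with `IntegerType` or `DoubleType` columns passes validation instead of erroring at ingestion time.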
Enable hyperparameter tuning: the parameter space is searched (e.g. grid-based search), multiple models are trained, and the most accurate model is deployed in the API.
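A minimal grid-search loop over the parameter space could look like this (a sketch; the training/evaluation callback is a placeholder for a real Cortex training job):

```python
from itertools import product


def grid_search(param_grid, train_and_evaluate):
    """Train one model per parameter combination; return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_evaluate(params)  # e.g. validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The winning `best_params` would then be used for the model that gets deployed in the API.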
Avoid the weirdness that `cx status -h` makes it seem like you can do `cx status models`.

Options:
- Use `resourceType.name` to identify resources. After this change, it should be possible to do `cx get resourceType`, `cx get resourceIdentifier`, or `cx status resourceIdentifier` (in addition to `cx get` and `cx status`), where `resourceIdentifier` can be `resourceName` if it's unique; `resourceType.resourceName` is fully qualified and will always work.
- Require the resource type (e.g. `cx get model dnn` even if there is only one resource called `dnn`).
- Keep it the way it is, possibly improving the docs.
- Add a new CLI command to reduce ambiguity.
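The first option's lookup logic could be sketched like this (hypothetical helper; `known` maps each resource type to the set of defined names):

```python
def resolve(identifier, known):
    """Resolve a resourceIdentifier to (resourceType, resourceName).

    Accepts fully qualified "type.name" always; a bare name only if unique.
    """
    if "." in identifier:
        rtype, name = identifier.split(".", 1)
        if name in known.get(rtype, set()):
            return rtype, name
        raise ValueError("unknown resource: " + identifier)
    matches = [(t, identifier) for t, names in known.items() if identifier in names]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown resource: " + identifier)
    raise ValueError("ambiguous name, use resourceType.resourceName: " + identifier)
```

A bare name that appears under two types errors with a hint to use the qualified form, which matches the "always works" guarantee above.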
Separate model config into `estimator` / `model`, and improve `inputs` / `_default` / `_optional`:

- `_default` defaults to `None` in any case (scalar/list/map)
- `_default` can never be set to `None` explicitly
- If `_default` is set, `_optional` is implicitly set to `True`
- If a `*_COLUMN` type is anywhere in or under the type, a default is not an option
- `_min_count` is an option (and defaults to `0`)
- `@` throughout (i.e. all `input`, also e.g. `model: @dnn` in `api`)
- `training_input` to models
- `type: classification` in model
- `input: INT` is supported (`FLOAT_COLUMN|INT` is not allowed)
- `{arg: INT}`, `{STRING: INT}`, `{STRING: INT_COLUMN}`, `{STRING_COLUMN: INT}`, `{STRING_COLUMN: INT_COLUMN}`, `{STRING_COLUMN: [INT_COLUMN]}`, etc. are all supported; `{[STRING_COLUMN]: INT_COLUMN}` is not supported
- `INT` -> `FLOAT` for inputs; `INT` -> `FLOAT` and `INT_COLUMN` -> `FLOAT_COLUMN` for outputs

input: <TYPE>  # (short form)
input:         # (long form)
  _type: <TYPE>
  _default: <>
  ...
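The `_default` / `_optional` rules above could be enforced by a validator along these lines (a sketch, not the actual config-validation code; `contains_column_type` is a hypothetical helper):

```python
def contains_column_type(t):
    """True if a *_COLUMN type appears anywhere in or under the type spec."""
    if isinstance(t, str):
        return "_COLUMN" in t
    if isinstance(t, list):
        return any(contains_column_type(x) for x in t)
    if isinstance(t, dict):
        return any(contains_column_type(k) or contains_column_type(v)
                   for k, v in t.items())
    return False


def validate_input_schema(schema):
    """Apply the _default/_optional/_min_count rules to one long-form input."""
    if "_default" in schema and schema["_default"] is None:
        raise ValueError("_default can never be set to None explicitly")
    if "_default" in schema and contains_column_type(schema.get("_type")):
        raise ValueError("default is not an option for *_COLUMN types")
    if "_default" in schema:
        schema["_optional"] = True  # _default implies _optional
    schema.setdefault("_min_count", 0)
    return schema
```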
# mean
- kind: aggregator
  name: mean
  output_type: FLOAT
  input: FLOAT_COLUMN|INT_COLUMN

- kind: aggregate
  name: sepal_length_mean
  aggregator: cortex.mean
  input: sepal_length
# bucket_boundaries
- kind: aggregator
  name: bucket_boundaries
  output_type: [FLOAT]
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    num_buckets:
      _type: INT
      _default: 10

- kind: aggregate
  name: sepal_length_bucket
  aggregator: cortex.bucket_boundaries
  input:
    col: sepal_length
    num_buckets: 5
# normalize
- kind: transformer
  name: normalize
  output_type: FLOAT_COLUMN
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    mean: INT|FLOAT
    stddev: INT|FLOAT

- kind: transformed_column
  name: sepal_length_normalized
  transformer: cortex.normalize
  input:
    col: sepal_length
    mean: sepal_length_mean
    stddev: sepal_length_stddev
# weight
- kind: transformer
  name: weight
  output_type: FLOAT_COLUMN
  input:
    col: INT_COLUMN
    class_distribution: {INT: FLOAT}

- kind: transformed_column
  name: weight_column
  transformer: weight
  input:
    col: class
    class_distribution: class_distribution
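For context on the weight transformer, one common way to turn a class distribution into per-class weights is inverse-frequency weighting; a sketch (the formula choice is an assumption for illustration, not necessarily what the real transformer does):

```python
def class_weights(class_distribution):
    """Map {class: count} to {class: weight}, up-weighting rare classes so
    that weight * count is equal across classes: total / (n_classes * count)."""
    total = sum(class_distribution.values())
    n = len(class_distribution)
    return {c: total / (n * count) for c, count in class_distribution.items()}
```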
# iris
- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
    num_classes: INT
  hparams:
    hidden_layers: [INT]
    learning_rate:
      _type: FLOAT
      _default: 0.01

- kind: model
  name: iris-dnn
  estimator: dnn
  target_column: class_indexed
  input:
    normalized_columns:
      - sepal_length_normalized
      - sepal_width_normalized
      - petal_length_normalized
      - petal_width_normalized
    num_classes: 2
  hparams:
    hidden_layers: [4, 4]
  data_partition_ratio: ...
  training: ...
# Fraud
- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
  training_input:
    weight_column: FLOAT_COLUMN

- kind: model
  name: fraud-dnn
  target_column: class
  input:
    normalized_columns: [time_normalized, v1_normalized, ...]
  training_input:
    weight_column: weights
# Insurance
- kind: estimator
  name: dnn
  target_column: FLOAT_COLUMN
  input:
    categorical:
      _type:
        - feature: INT_COLUMN
          categories: [STRING]
      _min_count: 1 # This wouldn't actually be necessary, it's just to demonstrate
    bucketized:
      - feature: INT_COLUMN|FLOAT_COLUMN
        buckets: [INT]

- kind: model
  name: insurance-dnn
  estimator: insurance-dnn
  target_column: charges_normalized
  input:
    categorical:
      - feature: gender
        categories: ["female", "male"]
      - feature: smoker
        categories: ["yes", "no"]
      - feature: region
        categories: ["northwest", "northeast", "southwest", "southeast"]
      - feature: children
        categories: children_set
    bucketized:
      - feature: age
        buckets: [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
      - feature: bmi
        buckets: [15, 20, 25, 35, 40, 45, 50, 55]
# Poker
- kind: estimator
  name: dnn
  target_column: INT
  input:
    suit_columns: [INT_COLUMN]
    rank_columns: [INT_COLUMN]

- kind: model
  name: poker-dnn
  type: classification
  target_column: class
  input:
    suit_columns: [card_1_suit, card_2_suit, card_3_suit, card_4_suit, card_5_suit]
    rank_columns: [card_1_rank, card_2_rank, card_3_rank, card_4_rank, card_5_rank]
# Mnist
- kind: model
  name: t2t
  target_column: INT
  input: FLOAT_LIST_COLUMN
  prediction_key: outputs

- kind: trainer
  name: mnist-t2t
  target_column: label
  input: image_pixels
review tf.feature_column
tf.feature_column.indicator_column(
tf.feature_column.categorical_column_with_vocabulary_list("smoker", ["yes", "no"])
),
tf.feature_column.bucketized_column(
tf.feature_column.numeric_column("age"), [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
),
tf.feature_column.numeric_column("sepal_length_normalized"),
tf.feature_column.numeric_column("image_pixels", shape=model_config["hparams"]["input_shape"])
weight_column = "class_weight"
# cloudml-template
categorical_columns_with_identity = {
item[0]: tf.feature_column.categorical_column_with_identity(item[0], item[1])
for item in categorical_feature_names_with_identity.items()
}
categorical_columns_with_vocabulary = {
item[0]: tf.feature_column.categorical_column_with_vocabulary_list(item[0], item[1])
for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_VOCABULARY.items()
}
categorical_columns_with_hash_bucket = {
item[0]: tf.feature_column.categorical_column_with_hash_bucket(item[0], item[1])
for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_HASH_BUCKET.items()
}
age_buckets = tf.feature_column.bucketized_column(
feature_columns["age"], boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
)
education_X_occupation = tf.feature_column.crossed_column(
["education", "occupation"], hash_bucket_size=int(1e4)
)
native_country_embedded = tf.feature_column.embedding_column(
feature_columns["native_country"], dimension=task.HYPER_PARAMS.embedding_size
)
transformer / transformed_feature and aggregator / aggregate

Long-running jobs error (they are actually still running, but Cortex shows them as `error`)
Run any long job (e.g. training job with a lot of steps)
This is due to a known bug in Argo, fixed here. It should be released with v2.3 (releases)
time="2019-01-15T01:05:13Z" level=info msg="Creating a docker executor"
time="2019-01-15T01:05:13Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation: {}\ninputs: {}\nmetadata:\n labels:\n appName: iris\n argo: \"true\"\n workloadId: wrguzieqiqtj6gic4wlx\n workloadType: data-job\nname: wrguzieqiqtj6gic4wlx\noutputs: {}\nresource:\n action: create\n failureCondition: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)\n manifest: '{\"kind\":\"SparkApplication\",\"apiVersion\":\"sparkoperator.k8s.io/v1alpha1\",\"metadata\":{\"name\":\"wrguzieqiqtj6gic4wlx\",\"namespace\":\"cortex\",\"creationTimestamp\":null,\"labels\":{\"appName\":\"iris\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"ownerReferences\":[{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"Workflow\",\"name\":\"argo-iris-7wvh7\",\"uid\":\"9bc7aee5-1861-11e9-a664-026185601f56\",\"blockOwnerDeletion\":false}]},\"spec\":{\"type\":\"Python\",\"mode\":\"cluster\",\"image\":\"764403040460.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\",\"imagePullPolicy\":\"Always\",\"mainApplicationFile\":\"local:///src/spark_job/spark_job.py\",\"arguments\":[\"--workload-id=wrguzieqiqtj6gic4wlx\n --context=s3://cortex-cluster-david/apps/iris/contexts/83b914617c417fbe26b59c7839d77914b43f02c48fb52c713a5d4c094dff0bd.msgpack\n --cache-dir=/mnt/context --raw-features=ad88a38e98b74d6dfc1a718b51c3da769141f46ddd0cd8fb96ca9fd01a71521,9bcb141ed9f8ee4f52c55a40e16815d2a3eddbdc118cc81751e762c73134820,ffa5ac7dad3f546d5b867892d4bf820ba34e83362042c894dc86527295b9352,bacc1fe5d16b0dbbb3527ea9de82409f67b926d0243a549cb6ed844ef76cdd7,b02d30d78186521e2399ba6a93e0f20459eb72e661234c9cf88cd289a78af30\n 
--aggregates=5566592b7087b3cf822b7a7a56085fcfd1678b363db00fefefab725f7892d37,0c8177a3736da4987b7629346ca461eced6f090c40edbbe9b67595ca00f73ce,ea21d88863bf2fa675550ae8e57d301d247b311b8f7441e1a342f0803a352e8,9fdd4490123c08fc5da726e3ca597f9b0cf9f0f7b7762a5fca7ea7b036df33b,2a55250299f0a8935f8aafba8cef94e5f7776c8e4c480e2d7f42fc5ca336d36,d59e72278cf560705ddb8cd11dbbb9dd2592165d458f51ddefa425f35fe0a3d,77b9f029fcaac3f513e549042f2f64eee9e73ad0d2c7549ec167943f2adbd46,d3e9aaa136d3e338ac1654e4da3ce5fbe6f6e6517e375672efd9ce3b20bedef,a8a3c0802612f715243577910061579af8cc49bc44ffc8f580e9746c264a237\n --transformed-features=8cc4ed1ed6c711dc775f51d8d129f5b0030ae109c9ae1605f0530c3ecdf511b,1d26f726a9c1879096147a7219418752140c6572a86d29a8a273ac8474c4622,79df34c5ba324066ec281a70ca9f45dd027e24a243bc137efc24eb1f928b708,ba8d0bc773530313233b0c51a27526ab6f13db97b02c9b8114bd9f0667793c1,15533d18f1de1b545c1a32dd6f6df24481f07fe314f78219ad6bab782276703\n --training-datasets=27fd4e1f80338ab4fbe8f2c67d58b6d521d2576f3cd492ad1ace5c80d94e413\n 
--ingest\"],\"driver\":{\"cores\":10,\"memory\":\"512000k\",\"envSecretKeyRefs\":{\"AWS_ACCESS_KEY_ID\":{\"name\":\"aws-credentials\",\"key\":\"AWS_ACCESS_KEY_ID\"},\"AWS_SECRET_ACCESS_KEY\":{\"name\":\"aws-credentials\",\"key\":\"AWS_SECRET_ACCESS_KEY\"}},\"labels\":{\"appName\":\"iris\",\"userFacing\":\"true\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"podName\":\"wrguzieqiqtj6gic4wlx\",\"serviceAccount\":\"spark\"},\"executor\":{\"cores\":1,\"memory\":\"512000k\",\"envSecretKeyRefs\":{\"AWS_ACCESS_KEY_ID\":{\"name\":\"aws-credentials\",\"key\":\"AWS_ACCESS_KEY_ID\"},\"AWS_SECRET_ACCESS_KEY\":{\"name\":\"aws-credentials\",\"key\":\"AWS_SECRET_ACCESS_KEY\"}},\"labels\":{\"appName\":\"iris\",\"workloadId\":\"wrguzieqiqtj6gic4wlx\",\"workloadType\":\"data-job\"},\"instances\":1},\"deps\":{\"pyFiles\":[\"local:///src/spark_job/spark_util.py\",\"local:///src/lib/*.py\"]},\"restartPolicy\":{\"type\":\"Never\"},\"pythonVersion\":\"3\"},\"status\":{\"lastSubmissionAttemptTime\":null,\"completionTime\":null,\"driverInfo\":{},\"applicationState\":{\"state\":\"\",\"errorMessage\":\"\"}}}'\n successCondition: status.applicationState.state in (COMPLETED)\n"
time="2019-01-15T01:05:13Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-01-15T01:05:13Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o name"
time="2019-01-15T01:05:17Z" level=info msg=sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx
time="2019-01-15T01:05:17Z" level=info msg="Waiting for conditions: status.applicationState.state in (COMPLETED)"
time="2019-01-15T01:05:17Z" level=info msg="Failing for conditions: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)"
time="2019-01-15T01:05:17Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:05:20Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:20Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:20Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:20Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:05:41Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:41Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:05:41Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:05:41Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:05:41Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:07:17Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:07:17Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:07:17Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:07:17Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:07:22Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:07:22Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:07:22Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:07:22Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:07:22Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:09:08Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:09:08Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:09:08Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:09:08Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:09:28Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:09:28Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:09:28Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:09:28Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:09:28Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:10:51Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"
time="2019-01-15T01:10:51Z" level=warning msg="Command for kubectl get -w for sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx exited. Getting return value using Wait"
time="2019-01-15T01:10:51Z" level=info msg="readJSon failed for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx but cmd.Wait for kubectl get -w command did not error"
time="2019-01-15T01:10:51Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx resulted in retryable error EOF"
time="2019-01-15T01:12:11Z" level=info msg="kubectl get sparkapplication.sparkoperator.k8s.io/wrguzieqiqtj6gic4wlx -w -o json"
time="2019-01-15T01:12:11Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:12:11Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:12:11Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:12:11Z" level=info msg="0/1 success conditions matched"
time="2019-01-15T01:12:25Z" level=info msg="{\"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\"kind\": \"SparkApplication\"...}"
time="2019-01-15T01:12:25Z" level=info msg="failure condition '{status.applicationState.state in [FAILED SUBMISSION_FAILED UNKNOWN]}' evaluated false"
time="2019-01-15T01:12:25Z" level=info msg="success condition '{status.applicationState.state in [COMPLETED]}' evaluated false"
time="2019-01-15T01:12:25Z" level=info msg="0/1 success conditions matched"
Probably after some timeout, so the user could see them if they want, and so logs get uploaded.
operator-b58fbb9b5-z2ftz 1/1 Running 0 11m
operator-b58fbb9b5-zdmjx 0/1 Evicted 0 11m
It should be possible to have a column defined in env.data.schema that isn't used as a raw_column (CSV and Parquet).

To reproduce: start from Iris, and then completely remove one of the features (e.g. `sepal_width`).
Starting
INFO:cortex:Starting
INFO:cortex:Ingesting
INFO:cortex:Ingesting iris-1 data from s3a://cortex-examples/iris.csv
ERROR:cortex:An error occurred, see `cx logs raw_column sepal_width` for more details.
Traceback (most recent call last):
File "/src/spark_job/spark_job.py", line 307, in run_job
raw_df = ingest_raw_dataset(spark, ctx, cols_to_validate, should_ingest)
File "/src/spark_job/spark_job.py", line 151, in ingest_raw_dataset
ingest_df = spark_util.ingest(ctx, spark)
File "/src/spark_job/spark_util.py", line 223, in ingest
expected_schema = expected_schema_from_context(ctx)
File "/src/spark_job/spark_util.py", line 117, in expected_schema_from_context
for fname in expected_field_names
File "/src/spark_job/spark_util.py", line 117, in <listcomp>
for fname in expected_field_names
KeyError: 'petal_width'
master
It should also be possible to not ingest all of the columns in the dataset (just Parquet for now); we should test this.
Allow users to upload objects to S3 and use them in aggregators, transformers, and estimators.
Some artifacts are pre-computed and need to be accessed in the pipeline (e.g. word embeddings).
Find a way to use (or ask if the user wants to use) AWS credentials from ~/.aws
Add an option to environment.schema to use CSV headers and parquet column names as the raw column names.
Reduce the amount of required YAML configuration when the raw data already has the column names.
E.g. implement transformations/analyses that are built into Spark ML.
Failed to create training datasets.
Reproducible in many of the examples, but easiest to reproduce in Poker.
1. `cx deploy` to process the raw columns
2. `cx deploy`
3. `cx logs dnn/training_dataset`
Failed to start:
time="2019-04-18T20:12:09Z" level=info msg="Creating a docker executor"
time="2019-04-18T20:12:09Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation: {}\ninputs: {}\nmetadata:\n labels:\n appName: recommendations\n argo: \"true\"\n workloadID: bord6fnh1lma5hyn8my3\n workloadType: data-job\nname: bord6fnh1lma5hyn8my3\noutputs: {}\nresource:\n action: create\n failureCondition: status.applicationState.state in (FAILED,SUBMISSION_FAILED,UNKNOWN)\n manifest: |-\n {\n \"kind\": \"SparkApplication\",\n \"apiVersion\": \"sparkoperator.k8s.io/v1alpha1\",\n \"metadata\": {\n \"name\": \"bord6fnh1lma5hyn8my3\",\n \"namespace\": \"cortex\",\n \"creationTimestamp\": null,\n \"labels\": {\n \"appName\": \"recommendations\",\n \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n \"workloadType\": \"data-job\"\n },\n \"ownerReferences\": [\n {\n \"apiVersion\": \"argoproj.io/v1alpha1\",\n \"kind\": \"Workflow\",\n \"name\": \"argo-recommendations-rplw6\",\n \"uid\": \"3dca2989-6216-11e9-aaf1-02cc01957708\",\n \"blockOwnerDeletion\": false\n }\n ]\n },\n \"spec\": {\n \"type\": \"Python\",\n \"mode\": \"cluster\",\n \"image\": \"969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\",\n \"imagePullPolicy\": \"Always\",\n \"mainApplicationFile\": \"local:///src/spark_job/spark_job.py\",\n \"arguments\": [\n \"--workload-id=bord6fnh1lma5hyn8my3 --context=s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack --cache-dir=/mnt/context --raw-columns= --aggregates= --transformed-columns= --training-datasets=d6c73248656984e3d08a6165cd3b34de27253021cb94c232526f96776999d73\"\n ],\n \"driver\": {\n \"cores\": 0,\n \"memory\": \"0k\",\n \"envVars\": {\n \"CORTEX_CACHE_DIR\": \"/mnt/context\",\n \"CORTEX_CONTEXT_S3_PATH\": \"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\",\n \"CORTEX_SPARK_VERBOSITY\": \"WARN\",\n 
\"CORTEX_WORKLOAD_ID\": \"bord6fnh1lma5hyn8my3\"\n },\n \"envSecretKeyRefs\": {\n \"AWS_ACCESS_KEY_ID\": {\n \"name\": \"aws-credentials\",\n \"key\": \"AWS_ACCESS_KEY_ID\"\n },\n \"AWS_SECRET_ACCESS_KEY\": {\n \"name\": \"aws-credentials\",\n \"key\": \"AWS_SECRET_ACCESS_KEY\"\n }\n },\n \"labels\": {\n \"appName\": \"recommendations\",\n \"userFacing\": \"true\",\n \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n \"workloadType\": \"data-job\"\n },\n \"podName\": \"bord6fnh1lma5hyn8my3\",\n \"serviceAccount\": \"spark\"\n },\n \"executor\": {\n \"cores\": 0,\n \"memory\": \"0k\",\n \"envVars\": {\n \"CORTEX_CACHE_DIR\": \"/mnt/context\",\n \"CORTEX_CONTEXT_S3_PATH\": \"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\",\n \"CORTEX_SPARK_VERBOSITY\": \"WARN\",\n \"CORTEX_WORKLOAD_ID\": \"bord6fnh1lma5hyn8my3\"\n },\n \"envSecretKeyRefs\": {\n \"AWS_ACCESS_KEY_ID\": {\n \"name\": \"aws-credentials\",\n \"key\": \"AWS_ACCESS_KEY_ID\"\n },\n \"AWS_SECRET_ACCESS_KEY\": {\n \"name\": \"aws-credentials\",\n \"key\": \"AWS_SECRET_ACCESS_KEY\"\n }\n },\n \"labels\": {\n \"appName\": \"recommendations\",\n \"workloadID\": \"bord6fnh1lma5hyn8my3\",\n \"workloadType\": \"data-job\"\n },\n \"instances\": 0\n },\n \"deps\": {\n \"pyFiles\": [\n \"local:///src/spark_job/spark_util.py\",\n \"local:///src/lib/*.py\"\n ]\n },\n \"restartPolicy\": {\n \"type\": \"Never\"\n },\n \"pythonVersion\": \"3\"\n },\n \"status\": {\n \"lastSubmissionAttemptTime\": null,\n \"completionTime\": null,\n \"driverInfo\": {},\n \"applicationState\": {\n \"state\": \"\",\n \"errorMessage\": \"\"\n }\n }\n }\n successCondition: status.applicationState.state in (COMPLETED)\n"
time="2019-04-18T20:12:09Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-04-18T20:12:09Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o name"
time="2019-04-18T20:12:10Z" level=fatal msg="The SparkApplication \"bord6fnh1lma5hyn8my3\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"sparkoperator.k8s.io/v1alpha1\", \"kind\":\"SparkApplication\", \"metadata\":map[string]interface {}{\"name\":\"bord6fnh1lma5hyn8my3\", \"namespace\":\"cortex\", \"creationTimestamp\":\"2019-04-18T20:12:09Z\", \"labels\":map[string]interface {}{\"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\", \"appName\":\"recommendations\"}, \"ownerReferences\":[]interface {}{map[string]interface {}{\"apiVersion\":\"argoproj.io/v1alpha1\", \"kind\":\"Workflow\", \"name\":\"argo-recommendations-rplw6\", \"uid\":\"3dca2989-6216-11e9-aaf1-02cc01957708\", \"blockOwnerDeletion\":false}}, \"generation\":1, \"uid\":\"3f1c787f-6216-11e9-aaf1-02cc01957708\", \"selfLink\":\"\"}, \"spec\":map[string]interface {}{\"image\":\"969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest\", \"mainApplicationFile\":\"local:///src/spark_job/spark_job.py\", \"mode\":\"cluster\", \"restartPolicy\":map[string]interface {}{\"type\":\"Never\"}, \"type\":\"Python\", \"driver\":map[string]interface {}{\"serviceAccount\":\"spark\", \"cores\":0, \"envSecretKeyRefs\":map[string]interface {}{\"AWS_ACCESS_KEY_ID\":map[string]interface {}{\"key\":\"AWS_ACCESS_KEY_ID\", \"name\":\"aws-credentials\"}, \"AWS_SECRET_ACCESS_KEY\":map[string]interface {}{\"key\":\"AWS_SECRET_ACCESS_KEY\", \"name\":\"aws-credentials\"}}, \"envVars\":map[string]interface {}{\"CORTEX_CACHE_DIR\":\"/mnt/context\", \"CORTEX_CONTEXT_S3_PATH\":\"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\", \"CORTEX_SPARK_VERBOSITY\":\"WARN\", \"CORTEX_WORKLOAD_ID\":\"bord6fnh1lma5hyn8my3\"}, \"labels\":map[string]interface {}{\"appName\":\"recommendations\", \"userFacing\":\"true\", \"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\"}, \"memory\":\"0k\", 
\"podName\":\"bord6fnh1lma5hyn8my3\"}, \"deps\":map[string]interface {}{\"pyFiles\":[]interface {}{\"local:///src/spark_job/spark_util.py\", \"local:///src/lib/*.py\"}}, \"executor\":map[string]interface {}{\"envVars\":map[string]interface {}{\"CORTEX_CACHE_DIR\":\"/mnt/context\", \"CORTEX_CONTEXT_S3_PATH\":\"s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack\", \"CORTEX_SPARK_VERBOSITY\":\"WARN\", \"CORTEX_WORKLOAD_ID\":\"bord6fnh1lma5hyn8my3\"}, \"instances\":0, \"labels\":map[string]interface {}{\"appName\":\"recommendations\", \"workloadID\":\"bord6fnh1lma5hyn8my3\", \"workloadType\":\"data-job\"}, \"memory\":\"0k\", \"cores\":0, \"envSecretKeyRefs\":map[string]interface {}{\"AWS_ACCESS_KEY_ID\":map[string]interface {}{\"key\":\"AWS_ACCESS_KEY_ID\", \"name\":\"aws-credentials\"}, \"AWS_SECRET_ACCESS_KEY\":map[string]interface {}{\"name\":\"aws-credentials\", \"key\":\"AWS_SECRET_ACCESS_KEY\"}}}, \"imagePullPolicy\":\"Always\", \"pythonVersion\":\"3\", \"arguments\":[]interface {}{\"--workload-id=bord6fnh1lma5hyn8my3 --context=s3://cortex-cluster-vishal/apps/recommendations/contexts/9063143c7366987a974e14c07bf21c40bed64e03a3aa0fed55a670c7756e317.msgpack --cache-dir=/mnt/context --raw-columns= --aggregates= --transformed-columns= --training-datasets=d6c73248656984e3d08a6165cd3b34de27253021cb94c232526f96776999d73\"}}, \"status\":map[string]interface {}{\"lastSubmissionAttemptTime\":interface {}(nil), \"applicationState\":map[string]interface {}{\"errorMessage\":\"\", \"state\":\"\"}, \"completionTime\":interface {}(nil), \"driverInfo\":map[string]interface {}{}}}: validation failure list:\nspec.driver.cores in body should be greater than 0\nspec.executor.instances in body should be greater than or equal to 1\nspec.executor.cores in body should be greater than 
0\ngithub.com/argoproj/argo/errors.New\n\t/root/go/src/github.com/argoproj/argo/errors/errors.go:48\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).ExecResource\n\t/root/go/src/github.com/argoproj/argo/workflow/executor/resource.go:36\ngithub.com/argoproj/argo/cmd/argoexec/commands.execResource\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:38\ngithub.com/argoproj/argo/cmd/argoexec/commands.glob..func2\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:23\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:766\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:198\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2361"
master
Hey there - I was just curious if you have a real website that isn't just a redirect to your repo?
I noticed you have docs and a job posting page but no normal landing page. Is this by design due to the stealth launch? Or perhaps issues with SEO, as there seems to be another company very close to your naming - https://www.cortexlabs.ai/
Obviously the development is more important, but it is always nice to be able to send non-technical/business colleagues to a place where they can get an overview without crawling through a repo :)
Version: Master
Stacktrace:
19/03/26 14:07:12 SPARK:WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
19/03/26 14:09:22 SPARK:WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 12, 192.168.56.198, executor 1): java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.io.ReadAheadInputStream.<init>(ReadAheadInputStream.java:105)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:82)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:156)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:477)
at org.apache.spark.sql.execution.UnsafeKVExternalSorter.sortedIterator(UnsafeKVExternalSorter.java:204)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:261)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:221)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:360)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:112)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:102)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/03/26 14:09:24 SPARK:ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.56.198:
The executor with id 1 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:
ContainerStatus(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, exitCode=52, finishedAt=Time(time=2019-03-26T14:09:22Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:06:12Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
19/03/26 14:09:24 SPARK:WARN TaskSetManager: Lost task 1.0 in stage 8.0 (TID 13, 192.168.56.198, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason:
The executor with id 1 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:
ContainerStatus(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://b5b1b2dc6ad8c04e8162db0737f3f42a0092d67ea0b4c072abb4b311c07dca2e, exitCode=52, finishedAt=Time(time=2019-03-26T14:09:22Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:06:12Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
19/03/26 14:11:06 SPARK:WARN TaskSetManager: Lost task 1.1 in stage 8.0 (TID 14, 192.168.53.180, executor 2): java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.io.ReadAheadInputStream.<init>(ReadAheadInputStream.java:105)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:82)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:156)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:477)
at org.apache.spark.sql.execution.UnsafeKVExternalSorter.sortedIterator(UnsafeKVExternalSorter.java:204)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:261)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:221)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:360)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:112)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:102)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/03/26 14:11:08 SPARK:ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.53.180:
The executor with id 2 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:
ContainerStatus(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, exitCode=52, finishedAt=Time(time=2019-03-26T14:11:07Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:09:24Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
19/03/26 14:11:08 SPARK:WARN TaskSetManager: Lost task 0.1 in stage 8.0 (TID 15, 192.168.53.180, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason:
The executor with id 2 exited with exit code 52.
The API gave the following brief reason: null
The API gave the following message: null
The API gave the following container statuses:
ContainerStatus(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, image=969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark:latest, imageID=docker-pullable://969758392368.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/spark@sha256:a0fc3551548c7f4dd1b5a27b52c4b0a96a52b5f63e8bb0a25899234a9cbe61df, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=executor, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://394490c15a9d6b0f904574e7aba6af5aac5c4add56256d728009ebb5ea355d43, exitCode=52, finishedAt=Time(time=2019-03-26T14:11:07Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-03-26T14:09:24Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})
Add an example of a recommendation service.
Add an option in environments to cache the entire dataset in memory for Spark jobs
Small data jobs could run faster
If speedup is not significant, it's not worth implementing this feature
API is sometimes temporarily unavailable when updating to a new model
Steps to reproduce (fraud example):
1. cx deploy
2. Modify the dnn (e.g. to train for 100 steps)
3. cx deploy
4. cx predict fraud transactions.json
error: api "fraud" is updating
Expected: successful prediction requests at all times (smooth transition to the new model)
master
When multiple prediction requests are made in the same API call, the API container should batch them into a single gRPC request to TensorFlow Serving
Here's an example, but their code might not be relevant since they don't use the tensorflow package (an optimization that is unnecessary for us since our client is always running)
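The batching step itself is framework-agnostic: merge the per-sample payloads from one API call into a single batched payload, then issue one gRPC PredictRequest instead of N. A minimal sketch of that merge (the function name and payload shape are illustrative, not Cortex's actual code):

```python
# Hypothetical sketch: collect all samples from one API call into a single
# batched payload, so only one gRPC request is sent to TensorFlow Serving.

def batch_instances(instances):
    """Merge per-sample feature dicts into one dict of feature batches.

    [{"x": 1, "y": 2}, {"x": 3, "y": 4}] -> {"x": [1, 3], "y": [2, 4]}
    """
    if not instances:
        return {}
    batched = {name: [] for name in instances[0]}
    for instance in instances:
        for name, value in instance.items():
            batched[name].append(value)
    return batched
```

The batched dict would then be converted into a single PredictRequest tensor per feature, rather than building one request per sample.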
Metadata files should be stored in a shared metadata root directory, rather than a path scoped to resources like it is now. Consuming metadata will no longer require a metadata_key, and the storage path can be derived from the resource ID alone.
Option 1: save both mappings in the analysis. Could make it a different transformer/aggregator? Probably not
Option 2: At startup time, predict.py can create the mapping of string -> index to speed up transformations (keep index -> string array also, for reverse transformations). Is it custom for just this analysis? Or is there an API for this (e.g. "preprocess")?
Faster real-time predictions
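Option 2 could be sketched as follows (the input shape is an assumption: a saved list of strings in index order, as produced by the index analysis):

```python
# Sketch of Option 2: at startup, predict.py builds a string -> index dict
# for fast transformations, and keeps the index -> string list for reverse
# transformations. Per-prediction lookups then become O(1).

def build_index_maps(index):
    """index: list of strings in index order, e.g. loaded from the analysis."""
    index_to_string = list(index)
    string_to_index = {s: i for i, s in enumerate(index_to_string)}
    return string_to_index, index_to_string

# Example with assumed values from the iris example:
string_to_index, index_to_string = build_index_maps(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
)
```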
cx deploy should provide a nice Cortex validation error if the app name has an underscore (hyphens are ok)
Improve developer experience
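The check could be sketched like this (a hypothetical helper; the pattern follows DNS-1123 label rules, which Kubernetes resource names must presumably satisfy and which is likely why underscores break deployments):

```python
import re

# Hypothetical validation sketch: hyphens allowed, underscores and other
# special characters rejected with a clear Cortex error message.
APP_NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def validate_app_name(name):
    """Return an error message, or None if the name is valid."""
    if "_" in name:
        return 'app name "{}" cannot contain underscores (use hyphens instead)'.format(name)
    if not APP_NAME_RE.match(name):
        return 'app name "{}" may only contain lowercase letters, numbers, and hyphens'.format(name)
    return None
```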
Storing metadata in a relational database (e.g. resource configuration, model accuracy, logs?, job statuses?) would allow us to query information about current and past resources/workloads. We will need to add CLI commands for this as well (e.g. query past models, jobs, ...).
There is a bug in TensorFlow 1.12 which breaks dropout layers in Keras.
If the user doesn't specify any raw features, infer them from the environment's schema.
This is a continuation of making all schemas optional to reduce the amount of required config.
API names cannot have underscores, but cx init produces apis.yaml with this example template:
## Sample API:
#
# - kind: api
# name: my_api
# model_name: my_model
# compute:
# replicas: 1
Change the underscore to a hyphen in the template.
cx deploy will fail validation later down the road, so we should also prevent users from running e.g. cx init iris_1, and run other app name validation (spaces, etc.).
Hyphens are ok
Improve developer experience
We may need to revisit all of our CLI command names
Running cx logs on an API shows:
/usr/local/lib/python3.5/dist-packages/tensorflow_serving/apis/prediction_service_pb2.py:131: DeprecationWarning: beta_create_PredictionService_stub() method is deprecated. This method will be removed in near future versions of TF Serving. Please switch to GA gRPC API in prediction_service_pb2_grpc.
'prediction_service_pb2_grpc.', DeprecationWarning)
Currently we support _default, _optional, _min_count, and _max_count. We should add support for _min / _max (floats and ints), and _values (or _allowed_values) (all types). For example, we could use _min: 1 for num_buckets, and _allowed_values for all estimator optimizers and the BoostedTrees pruning mode, ...
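The proposed checks could be sketched like this (check_arg is a hypothetical helper; the surrounding config-parsing machinery is assumed):

```python
# Sketch of the proposed _min / _max / _allowed_values checks. The key
# names come from this issue; how specs are attached to args is assumed.

def check_arg(value, spec):
    """Return an error message, or None if the value passes all checks."""
    if "_min" in spec and value < spec["_min"]:
        return "must be >= {}".format(spec["_min"])
    if "_max" in spec and value > spec["_max"]:
        return "must be <= {}".format(spec["_max"])
    if "_allowed_values" in spec and value not in spec["_allowed_values"]:
        return "must be one of {}".format(spec["_allowed_values"])
    return None
```

For example, num_buckets could carry {"_min": 1}, and an optimizer arg could carry {"_allowed_values": ["adam", "ftrl", ...]}.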
We should avoid repeatedly adding files to the Spark context so we can remove this warning:
...
INFO:cortex:Sanity checking transformers against the first 100 samples
19/03/26 01:43:53 SPARK:WARN CacheManager: Asked to cache already cached data.
INFO:cortex:Transforming class to class_indexed
19/03/26 01:43:54 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_index_string.py has been added already. Overwriting of added paths is not supported in the current version.
19/03/26 01:43:55 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_index_string.py has been added already. Overwriting of added paths is not supported in the current version.
INFO:cortex:class: Iris-setosa Iris-setosa Iris-setosa
INFO:cortex:class_indexed: 0 0 0
INFO:cortex:Transforming petal_length to petal_length_normalized
19/03/26 01:43:56 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_normalize.py has been added already. Overwriting of added paths is not supported in the current version.
19/03/26 01:43:56 SPARK:WARN SparkContext: The path /mnt/context/transformer_cortex_normalize.py has been added already. Overwriting of added paths is not supported in the current version.
...
Add a flag in the api config to add transformed feature values to the prediction response. Or respond with these values automatically (without adding a flag to enable/disable it).
Clients may need to perform post-processing on the prediction response based on the values of the transformed features (e.g. entity extraction requires the tokenized text).
Add a -r/--raw flag to the CLI to show the full JSON response?
"_" in the schema denotes not ingesting that column, and columns can be left off the end of the CSV. For example, if there were columns named 1, 2, 3, 4, 5 and the user only wanted 1, 2, and 4, they could be ingested with a CSV schema of [1, 2, _, 4]
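The proposed schema semantics could be sketched like this (ingest_row is a hypothetical helper shown only to illustrate the "_" and trailing-column behavior, not the actual Spark ingestion path):

```python
import csv
import io

# Hypothetical sketch: "_" in the schema skips that column, and trailing
# columns may be omitted from the schema entirely.

def ingest_row(row, schema):
    record = {}
    for name, value in zip(schema, row):  # zip drops trailing unlisted columns
        if name != "_":
            record[name] = value
    return record

reader = csv.reader(io.StringIO("1.0,2.0,3.0,4.0,5.0"))
row = next(reader)
record = ingest_row(row, ["c1", "c2", "_", "c4"])  # skips col 3, drops col 5
```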
Motivation: a user wants to make a change to the operator to support a custom use case and wants to build Cortex from source. This also makes it easier to onboard engineers.
Remove error strings from the strings package, and move them to package-specific errors.go files.
Avoid package dependency cycles
Version: 0.2
Stacktrace:
ERROR:cortex:An error occurred, see cx logs raw_column petal_width for more details.
Traceback (most recent call last):
File "/src/spark_job/spark_job.py", line 311, in run_job
run_custom_aggregators(spark, ctx, cols_to_aggregate, raw_df)
File "/src/spark_job/spark_job.py", line 209, in run_custom_aggregators
ctx.upload_resource_status_success(*builtin_aggregates)
File "/src/lib/context.py", line 442, in upload_resource_status_success
self.upload_resource_status_end("succeeded", *resources)
File "/src/lib/context.py", line 450, in upload_resource_status_end
status = self.get_resource_status(resource)
File "/src/lib/context.py", line 411, in get_resource_status
return aws.read_json_from_s3(key, self.bucket)
File "/src/lib/aws.py", line 154, in read_json_from_s3
obj = read_bytes_from_s3(key, bucket, allow_missing, client_config).decode("utf-8")
AttributeError: 'NoneType' object has no attribute 'decode'
19/03/25 21:38:15 SPARK:WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
Additional information:
cx logs keeps showing error logs even after a successful run of cx refresh.
The strings package shouldn't have anything Cortex-specific, and should be moved to lib.
pkg/lib/interfaces/interfaces.go would use PrimType from whichever package it moves to, and would use ErrorInvalidPrimitiveType from that package, so we could delete pkg/lib/interfaces/errors.go
Sometimes a deployment fails in the aggregation step.
Iris examples
Not consistently reproducible.
cortex logs sepal_width_mean
(Same stack trace as in the issue above: AttributeError: 'NoneType' object has no attribute 'decode' in aws.read_json_from_s3, followed by the Kubernetes client shutdown warning.)
0.2
Set up the infrastructure to deploy Cortex locally
Aggregate says it is still in progress even though logs say out of memory
Added a sum_distinct_float aggregate operation to the normalize template in fraud
Aggregate remains in "aggregating" status
Expected aggregate to "fail" given that we are summing distinctly on a float column with many different values.
(Same Spark logs as in the issue above: java.lang.OutOfMemoryError: Java heap space in TungstenAggregationIterator, with executors 1 and 2 lost with exit code 52.)
master
Rename the transformed_column parameter to transformed_column_name in transform_spark().
Make it clearer to the user what this argument contains.
Currently we default to the same region that your cluster is in. That should remain the default behavior, but we should allow ingesting from buckets in other regions. We should also update all of our examples to specify us-west-2, so that the examples can run from any region.
We either need to add the region to the bucket URL, or add a config field for region
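The two options might look like this in environment config (both the URL form and the region key are hypothetical, shown only to illustrate the choice):

```yaml
# Option A: region embedded in the data path (hypothetical URL form)
data: s3://my-bucket.us-west-2/data.csv

# Option B: explicit config field (hypothetical key)
data: s3://my-bucket/data.csv
region: us-west-2
```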
Distinguish between expected and unexpected errors, so we can handle them differently (e.g. print stack traces by default for unexpected errors)
Also, error codes should get passed through to the CLI in addition to the message, so that we can check codes rather than error messages.