ecosystem's Introduction

TensorFlow Ecosystem

This repository contains examples for integrating TensorFlow with other open-source frameworks. The examples are minimal and intended for use as templates. Users can tailor the templates for their own use-cases.

If you have any additions or improvements, please create an issue or pull request.

Contents

  • docker - Docker configuration for running TensorFlow on cluster managers.
  • kubeflow - A Kubernetes native platform for ML
    • A K8s custom resource for running distributed TensorFlow jobs
    • Jupyter images for different versions of TensorFlow
    • TFServing Docker images and K8s templates
  • kubernetes - Templates for running distributed TensorFlow on Kubernetes.
  • marathon - Templates for running distributed TensorFlow using Marathon, deployed on top of Mesos.
  • hadoop - TFRecord file InputFormat/OutputFormat for Hadoop MapReduce and Spark.
  • spark-tensorflow-connector - Spark data source for importing and exporting TFRecords as Spark DataFrames.
  • spark-tensorflow-distributor - Python package that helps users do distributed training with TensorFlow on their Spark clusters.

Distributed TensorFlow

See the Distributed TensorFlow documentation for a description of how it works. The examples in this repository focus on the most common form of distributed training: between-graph replication with asynchronous updates.

Common Setup for distributed training

Every distributed training program has some common setup. First, define flags so that the worker knows about other workers and knows what role it plays in distributed training:

# Flags for configuring the task. In TF 1.x programs, `flags` typically
# comes from tf.app.flags and FLAGS is used to read the parsed values.
import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS

flags.DEFINE_integer("task_index", None,
                     "Worker task index, should be >= 0. task_index=0 is "
                     "the master worker task that performs the variable "
                     "initialization.")
flags.DEFINE_string("ps_hosts", None,
                    "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("worker_hosts", None,
                    "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("job_name", None, "job name: worker or ps")

Then, start your server. Worker and parameter server (ps) jobs usually share a common program, so parameter servers should stop at this point: they simply join the server and wait.

# Construct the cluster and start the server
ps_spec = FLAGS.ps_hosts.split(",")
worker_spec = FLAGS.worker_hosts.split(",")

cluster = tf.train.ClusterSpec({
    "ps": ps_spec,
    "worker": worker_spec})

server = tf.train.Server(
    cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
  server.join()

Afterwards, your code varies depending on the form of distributed training you intend to do. The most common form is between-graph replication.

Between-graph Replication

In this mode, each worker separately constructs the exact same graph. Each worker then runs the graph in isolation, sharing only gradients with the parameter servers. This setup is illustrated by the following diagram; note that each dashed box indicates a task. (Diagram: between-graph replication.)

You must explicitly set the device before graph construction for this mode of training. The following code snippet from the Distributed TensorFlow tutorial demonstrates the setup:

with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:%d" % FLAGS.task_index,
    cluster=cluster)):
  # Construct the TensorFlow graph.

# Run the TensorFlow graph.
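For concreteness, here is a minimal sketch (not taken from this repository) of how the two placeholder comments above are often filled in with TF 1.x APIs, continuing from the flags, cluster, and server defined earlier; build_model() is a hypothetical helper standing in for your own graph construction:

is_chief = (FLAGS.task_index == 0)

with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:%d" % FLAGS.task_index,
    cluster=cluster)):
  # Construct the TensorFlow graph.
  global_step = tf.train.get_or_create_global_step()
  loss = build_model()  # hypothetical model-building helper, not part of the original example
  optimizer = tf.train.AdagradOptimizer(0.01)
  train_op = optimizer.minimize(loss, global_step=global_step)

# Run the TensorFlow graph. MonitoredTrainingSession handles chief-only
# variable initialization, checkpointing, and recovery for the
# asynchronous between-graph setup.
with tf.train.MonitoredTrainingSession(
    master=server.target,
    is_chief=is_chief,
    checkpoint_dir="/tmp/train_logs",
    hooks=[tf.train.StopAtStepHook(last_step=10000)]) as sess:
  while not sess.should_stop():
    sess.run(train_op)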

Requirements To Run the Examples

To run our examples, the Jinja template engine must be installed:

# On Ubuntu
sudo apt-get install python-jinja2

# On most other platforms
sudo pip install Jinja2

Jinja is used for template expansion. There are also framework-specific requirements; please refer to the README of each framework.
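As a rough illustration of what template expansion involves (this is not the repository's actual tooling; the template file name and variables below are placeholders), Jinja2 can be driven from a few lines of Python:

from jinja2 import Template

# Hypothetical template file and variables; the real templates and their
# parameters live in each framework's subdirectory.
with open("tensorflow.yaml.jinja") as f:
    template = Template(f.read())

print(template.render(name="mnist", worker_replicas=2, ps_replicas=1))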

ecosystem's People

Contributors

aaudiber, alexandrnikitin, allenwang28, asimshankar, bhack, bzz, cheyang, dav009, dependabot[bot], fhoering, groodt, jhseu, jlewi, joyeshmishra, karthikvadla, liangz1, llhe, mengxr, mihaimaruseac, neggert, pgandhi999, puneith, samuelmarks, sarthfrey, skavulya, smurching, tesseract2048, thunterdb, vikatskhay, weichenxu123


ecosystem's Issues

Failed to build on hadoop with maven3

I'm getting the following error while compiling the packages in ecosystem/hadoop for my TensorFlowOnSpark test. The error seems to be with the maven-surefire-plugin.

I have the following version of maven installed on my Ubuntu 16.04:

mayub@nf-qu-bd-h01:~/ecosystem/hadoop$ mvn -v
Apache Maven 3.3.9
Maven home: /usr/share/maven
Java version: 1.7.0_67, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-oracle-cloudera/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-87-generic", arch: "amd64", family: "unix"

Below is the error

mayub@nf-qu-bd-h01:~/ecosystem/hadoop$ mvn -U clean install
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building tensorflow-hadoop 1.8.0
[INFO] ------------------------------------------------------------------------
Downloading: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-surefire-plugin/2.17/maven-surefire-plugin-2.17.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.958 s
[INFO] Finished at: 2018-06-28T20:41:30-04:00
[INFO] Final Memory: 27M/1963M
[INFO] ------------------------------------------------------------------------
[ERROR] Plugin org.apache.maven.plugins:maven-surefire-plugin:2.17 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-surefire-plugin:jar:2.17: Could not transfer artifact org.apache.maven.plugins:maven-surefire-plugin:pom:2.17 from/to central (https://repo.maven.apache.org/maven2): Received fatal alert: protocol_version -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException

Any help is appreciated.

Problem with tf.train.MonitoredTrainingSession

I built a new Docker image from your Dockerfile, but when I run mnist.py on a k8s cluster it reports that there is no module/attribute tf.train.MonitoredTrainingSession.
The TensorFlow version is 0.9. Why does this happen, and how can I solve it?

[spark-tensorflow-connector] failed to load tfrecord files using tensorflow

I'm trying out spark-tensorflow-connector on my MacBook with:

tensorflow version 1.8.0
hadoop version 2.6.5
spark version 2.1.1

I run Spark in local mode with the following code:

from pyspark.sql import Row
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import *


rows = [Row(11, 1, 23, 10.9, 14.0, [1.0, 2.0], 'r1'), 
Row(21, 2, 24, 12.0, 15.0, [2.0,2.0],'r2')]

schema = StructType([StructField("id", IntegerType()),
StructField("IntegerTypeLabel", IntegerType()),
StructField("LongTypeLabel", LongType()),
StructField("FloatTypeLabel", FloatType()),
StructField("DoubleTypeLabel", DoubleType()),
StructField("VectorLabel", ArrayType(DoubleType(), True)),
StructField("name", StringType())])

df = spark.createDataFrame(rows, schema)
df.write.mode("overwrite").format("tfrecords").option("recordType", "Example").save("./test")

When trying to load it in TensorFlow as follows:

import tensorflow as tf
dataset = tf.data.TFRecordDataset(['part-r-00003'])


def decode(x):
    feature = {'id': tf.int32,
               'IntegerTypeLabel': tf.int32,
               'LongTypeLable': tf.int64,
               'FloatTypeLabel': tf.float32,
               'DoubleTypeLabel': tf.float64,
               'VectorLabel': tf.FixedLenFeature([],tf.float64),
               'name': tf.string}
    return tf.parse_single_example(x,features=feature)

dataset = dataset.map(decode)
dataset = dataset.batch(1)
iterator = dataset.make_one_shot_iterator()
next = iterator.get_next()
with tf.Session() as sess:
    print(sess.run(next))

An error is raised:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 851, in map
    return MapDataset(self, map_func)
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1839, in __init__
    self._map_func.add_to_graph(ops.get_default_graph())
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/function.py", line 484, in add_to_graph
    self._create_definition_if_needed()
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/function.py", line 319, in _create_definition_if_needed
    self._create_definition_if_needed_impl()
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/function.py", line 336, in _create_definition_if_needed_impl
    outputs = self._func(*inputs)
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1804, in tf_map_func
    ret = map_func(nested_args)
  File "<stdin>", line 9, in decode
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/ops/parsing_ops.py", line 758, in parse_single_example
    return parse_single_example_v2(serialized, features, name)
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/ops/parsing_ops.py", line 1279, in parse_single_example_v2
    [VarLenFeature, SparseFeature, FixedLenFeature, FixedLenSequenceFeature])
  File "/Users/sean/anaconda/lib/python3.6/site-packages/tensorflow/python/ops/parsing_ops.py", line 296, in _features_to_raw_params
    raise ValueError("Invalid feature %s:%s." % (key, feature))
ValueError: Invalid feature DoubleTypeLabel:<dtype: 'float64'>.

One of the records Spark generated is:

b'\n\xa6\x01\n\x0b\n\x02id\x12\x05\x1a\x03\n\x01\x0b\n\x19\n\x10IntegerTypeLabel\x12\x05\x1a\x03\n\x01\x01\n\x16\n\rLongTypeLabel\x12\x05\x1a\x03\n\x01\x17\n\x1a\n\x0eFloatTypeLabel\x12\x08\x12\x06\n\x04ff.A\n\x1b\n\x0fDoubleTypeLabel\x12\x08\x12\x06\n\x04\x00\x00`A\n\x1b\n\x0bVectorLabel\x12\x0c\x12\n\n\x08\x00\x00\x80?\x00\x00\x00@\n\x0e\n\x04name\x12\x06\n\x04\n\x02r1'

I'm trying to figure out why. Could anybody give me a hint?
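A possible fix, offered here as an illustration rather than a reply from the original thread: tf.parse_single_example expects feature specs (tf.FixedLenFeature, tf.VarLenFeature, ...) rather than raw dtypes, and tf.train.Example only stores int64, float32 and bytes values, so the Spark column types have to be mapped onto those. A hedged sketch of a decode function along those lines:

import tensorflow as tf

def decode(serialized):
    # Feature *specs*, not dtypes; doubles and floats from Spark both end up
    # as float32 float_lists in the Example, and strings as bytes.
    feature_spec = {
        'id':               tf.FixedLenFeature([], tf.int64),
        'IntegerTypeLabel': tf.FixedLenFeature([], tf.int64),
        'LongTypeLabel':    tf.FixedLenFeature([], tf.int64),
        'FloatTypeLabel':   tf.FixedLenFeature([], tf.float32),
        'DoubleTypeLabel':  tf.FixedLenFeature([], tf.float32),
        'VectorLabel':      tf.VarLenFeature(tf.float32),
        'name':             tf.FixedLenFeature([], tf.string),
    }
    return tf.parse_single_example(serialized, features=feature_spec)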

GZIP/ZLIB compression for TFRecord files.

I am not 100% sure how reading compressed input is implemented in TensorFlow, but supporting outputting compressed TFRecord files would be amazing, as TFRecord is a rather inefficient format in terms of space.

Here are a few references to reading GZIP'd TF Records in the TensorFlow docs:

https://www.tensorflow.org/versions/r1.3/api_docs/cc/class/tensorflow/ops/fixed-length-record-reader#classtensorflow_1_1ops_1_1_fixed_length_record_reader

https://www.tensorflow.org/api_docs/python/tf/contrib/data/TFRecordDataset

https://www.tensorflow.org/versions/r1.3/api_docs/python/tf/python_io/TFRecordCompressionType
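For reference (an illustration added here, not part of the original request), the TensorFlow side can already read GZIP-compressed TFRecord files; the file name below is a placeholder:

import tensorflow as tf

# With the tf.data API (TF 1.x):
dataset = tf.data.TFRecordDataset(["part-r-00000.gz"], compression_type="GZIP")

# Or with the lower-level record iterator:
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
for record in tf.python_io.tf_record_iterator("part-r-00000.gz", options=options):
    pass  # each `record` is a serialized tf.train.Example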

There is no 'overwrite' mode when writing to tfrecords.

Most DataFrame writer formats have write 'modes' from which the user can select:
append, overwrite, ignore and error. Currently, spark-tensorflow-connector silently ignores this parameter.

Here is a toy example, which succeeds on the first run, but fails on the second run:

%pyspark
##Define toy dataset.
df = spark.createDataFrame([
    (0, "data0"),
    (1, "data1"),
    (2, "data2"),
], ["id", "data"])

##Write as TFRecords.
df.repartition(1).write.mode("overwrite").format("tensorflow").save("/storage/example/example.tfrecords")

Here is the output (on the second run):

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4573457104174058914.py", line 339, in <module>
    exec(code)
  File "<stdin>", line 6, in <module>
  File "/opt/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 550, in save
    self._jwrite.save(path)
  File "/opt/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o2651.save.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /storage/example/example.tfrecords already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1099)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1005)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:984)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply(PairRDDFunctions.scala:984)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply(PairRDDFunctions.scala:984)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:983)
	at org.trustedanalytics.spark.datasources.tensorflow.DefaultSource.createRelation(DefaultSource.scala:52)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
	at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4573457104174058914.py", line 346, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4573457104174058914.py", line 339, in <module>
    exec(code)
  File "<stdin>", line 6, in <module>
  File "/opt/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 550, in save
    self._jwrite.save(path)
  File "/opt/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o2651.save.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /storage/example/example.tfrecords already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1099)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1005)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:984)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply(PairRDDFunctions.scala:984)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply(PairRDDFunctions.scala:984)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:983)
	at org.trustedanalytics.spark.datasources.tensorflow.DefaultSource.createRelation(DefaultSource.scala:52)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
	at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
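One possible workaround (my suggestion, not something the connector provides) is to delete the existing output directory before writing, for example via the Hadoop FileSystem API reached through PySpark's JVM gateway; note that _jsc and _jvm are PySpark internals rather than a stable public API:

sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
path = sc._jvm.org.apache.hadoop.fs.Path("/storage/example/example.tfrecords")

# Remove the previous output so the second run does not hit FileAlreadyExistsException.
if fs.exists(path):
    fs.delete(path, True)  # True = recursive

df.repartition(1).write.format("tensorflow").save("/storage/example/example.tfrecords")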

Failed to find data source: tensorflow

import com.tencent.mmsearch_x.SparkTool._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{ DataFrame, Row }
import org.apache.spark.sql.catalyst.expressions.GenericRow
import org.apache.spark.sql.types._
val path = "./output/test-output.tfrecord"
val testRows: Array[Row] = Array(
  new GenericRow(Array[Any](11, 1, 23L, 10.0F, 14.0, List(1.0, 2.0), "r1")),
  new GenericRow(Array[Any](21, 2, 24L, 12.0F, 15.0, List(2.0, 2.0), "r2")))
val schema = StructType(List(StructField("id", IntegerType),
  StructField("IntegerTypeLabel", IntegerType),
  StructField("LongTypeLabel", LongType),
  StructField("FloatTypeLabel", FloatType),
  StructField("DoubleTypeLabel", DoubleType),
  StructField("VectorLabel", ArrayType(DoubleType, true)),
  StructField("name", StringType)))

val rdd = spark.sparkContext.parallelize(testRows)

//Save DataFrame as TFRecords
val df: DataFrame = spark.createDataFrame(rdd, schema)
df.show()
df.printSchema()

//.option("recordType", "Example")
df.write.format("tensorflow").save(path)  //or df.write.format("tfrecords").save(path)

both format "tensorflow" and "tfrecords" result to { Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: tensorflow }

java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal

Hi,
I am trying to save a Spark dataframe using the connector. The dataframe schema is:
root
|-- CustomerNumber: string (nullable = true)
|-- ProductId: string (nullable = true)
|-- Multiplier: decimal(38,20) (nullable = true)
|-- scaled_features: vector (nullable = true)

When saving I am getting the exception:

---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
in ()
1 #prepared.write.json("/data/home/pricing_2018_10_21")
2 tf_path ="/data/home/pricing_2018_10_22_tfrecord"
----> 3 prepared.write.format("tfrecords").option("recordType", "Example").save(tf_path)

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
711 self._jwrite.save()
712 else:
--> 713 self._jwrite.save(path)
714
715 @since(1.4)

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o3150.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:375)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1000)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:991)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:991)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:375)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:991)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:979)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply(PairRDDFunctions.scala:979)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$1.apply(PairRDDFunctions.scala:979)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:375)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:978)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource.save(DefaultSource.scala:82)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource.saveDistributed(DefaultSource.scala:109)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource.createRelation(DefaultSource.scala:71)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:72)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:88)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:150)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:190)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:108)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:682)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:682)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:89)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:175)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:84)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:126)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:682)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:287)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:281)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 1642.0 failed 4 times, most recent failure: Lost task 10.3 in stage 1642.0 (TID 155750, 10.139.64.4, executor 0): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:155)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:384)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal
at org.tensorflow.spark.datasources.tfrecords.serde.DefaultTfRecordRowEncoder$.org$tensorflow$spark$datasources$tfrecords$serde$DefaultTfRecordRowEncoder$$encodeFeature(DefaultTfRecordRowEncoder.scala:117)
at org.tensorflow.spark.datasources.tfrecords.serde.DefaultTfRecordRowEncoder$$anonfun$encodeExample$1.apply(DefaultTfRecordRowEncoder.scala:64)
at org.tensorflow.spark.datasources.tfrecords.serde.DefaultTfRecordRowEncoder$$anonfun$encodeExample$1.apply(DefaultTfRecordRowEncoder.scala:61)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.tensorflow.spark.datasources.tfrecords.serde.DefaultTfRecordRowEncoder$.encodeExample(DefaultTfRecordRowEncoder.scala:61)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource$$anonfun$2.apply(DefaultSource.scala:59)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource$$anonfun$2.apply(DefaultSource.scala:56)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:129)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1392)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
... 8 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1747)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1735)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1734)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1734)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:962)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:962)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:962)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1970)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1918)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1906)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:759)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2141)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2162)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2194)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
... 57 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:155)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:384)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal
at org.tensorflow.spark.datasources.tfrecords.serde.DefaultTfRecordRowEncoder$.org$tensorflow$spark$datasources$tfrecords$serde$DefaultTfRecordRowEncoder$$encodeFeature(DefaultTfRecordRowEncoder.scala:117)
at org.tensorflow.spark.datasources.tfrecords.serde.DefaultTfRecordRowEncoder$$anonfun$encodeExample$1.apply(DefaultTfRecordRowEncoder.scala:64)
at org.tensorflow.spark.datasources.tfrecords.serde.DefaultTfRecordRowEncoder$$anonfun$encodeExample$1.apply(DefaultTfRecordRowEncoder.scala:61)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.tensorflow.spark.datasources.tfrecords.serde.DefaultTfRecordRowEncoder$.encodeExample(DefaultTfRecordRowEncoder.scala:61)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource$$anonfun$2.apply(DefaultSource.scala:59)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource$$anonfun$2.apply(DefaultSource.scala:56)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:129)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1392)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
... 8 more
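A possible workaround (my suggestion, not from the original issue): the row encoder at this version does not appear to handle java.math.BigDecimal, so casting the DecimalType column to DoubleType before writing should avoid the ClassCastException, at the cost of some precision; prepared and tf_path are the names used above:

from pyspark.sql.functions import col

# Cast decimal(38,20) down to double before writing (precision loss is possible).
prepared_cast = prepared.withColumn("Multiplier", col("Multiplier").cast("double"))
prepared_cast.write.format("tfrecords").option("recordType", "Example").save(tf_path)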

org.tensorflow.example not found

Hi guys!
I want to compile tensorflow-hadoop, but it throws the following exception:
[ERROR] /Users/jonathanwei/summary/tensorflow/ecosystem/hadoop/src/main/java/org/tensorflow/hadoop/example/TFRecordFileMRExample.java:[23,1] package org.tensorflow.example does not exist

Contributing Spark TensorFlow connector to ecosystem

Our team has been working on a Spark TensorFlow connector that we would like to contribute back to the TensorFlow ecosystem. The connector uses the TensorFlow Hadoop input/output format, and simplifies import and export of data from TFRecords into Spark dataframes.

@jhseu, please let us know if this is something you are interested in. We would need some guidance on which directory to place the library in before creating the pull request. We were not sure if we should create a new spark directory at the root of the repo, or whether to create a new sub-directory under hadoop.

Here is a snippet of code that demonstrates the usage of the library:

import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{ DataFrame, Row }
import org.apache.spark.sql.catalyst.expressions.GenericRow
import org.apache.spark.sql.types._

val path = "test-output.tfr"
val testRows: Array[Row] = Array(
new GenericRow(Array[Any](11, 1, 23L, 10.0F, 14.0, List(1.0, 2.0), "r1")),
new GenericRow(Array[Any](21, 2, 24L, 12.0F, 15.0, List(2.0, 2.0), "r2")))
val schema = StructType(List(StructField("id", IntegerType), StructField("IntegerTypelabel", IntegerType), StructField("LongTypelabel", LongType), StructField("FloatTypelabel", FloatType), StructField("DoubleTypelabel", DoubleType), StructField("vectorlabel", ArrayType(DoubleType, true)), StructField("name", StringType)))
val rdd = spark.sparkContext.parallelize(testRows)

//Save DataFrame as TFRecords
val df: DataFrame = spark.createDataFrame(rdd, schema)
df.write.format("tensorflow").save(path)

//Read TFRecords into DataFrame.
//The DataFrame schema is inferred from the TFRecords if no custom schema is provided.
val importedDf1: DataFrame = spark.read.format("tensorflow").load(path)
importedDf1.show()

//Read TFRecords into DataFrame using custom schema
val importedDf2: DataFrame = spark.read.format("tensorflow").schema(schema).load(path)
importedDf2.show()
+--------------+-----------+----------------+-------------+---------------+----+---+
|FloatTypelabel|vectorlabel|IntegerTypelabel|LongTypelabel|DoubleTypelabel|name| id|
+--------------+-----------+----------------+-------------+---------------+----+---+
|          10.0| [1.0, 2.0]|               1|           23|           14.0|  r1| 11|
|          12.0| [2.0, 2.0]|               2|           24|           15.0|  r2| 21|
+--------------+-----------+----------------+-------------+---------------+----+---+

Checkpoint file cannot be found error on Kubernetes cluster

I followed the instructions from https://github.com/tensorflow/ecosystem/tree/master/kubernetes to run the mnist sample on Kubernetes with 2 workers and 2 ps tasks. Then I got the errors below. Do you have any idea what the possible reason is? Thanks!

  1. Error on ps-1: Not found: /train_dir/model.ckpt-0_temp_5fcd97c881a7428db8eded829b964618/part-00000-of-00002.index

  2. Error on worker-0: NotFoundError (see above for traceback): /train_dir/model.ckpt-0_temp_5fcd97c881a7428db8eded829b964618/part-00000-of-00002.index
    [[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:1/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S81)]]

The mnist.py used is from https://github.com/tensorflow/ecosystem/tree/master/docker

Hadoop: reading TFRecords from TensorFlow fails

Hi, I'm using tensorflow-hadoop to save some data from Apache Spark and then read it on the TensorFlow side as TFRecords.

Although the resulting files can be read using tf.python_io.tf_record_iterator(), I cannot find a way to make them work with queues and tf.TFRecordReader together with the tf.parse_single_example decoder.

Here is the simplest reproducible example:

  • mvn surefire:test -Dtest=TFRecordFileTest to get TFRecords tmp/tfr-test/part-m-00000

  • protoc --decode_raw < /tmp/tfr-test/part-m-00000 and get Failed to parse input. (why?)

  • reading with tf.python_io.tf_record_iterator() works

    tfrecord = "/tmp/tfr-test/part-m-00001"
    for serialized_example in tf.python_io.tf_record_iterator(tfrecord):
      example = tf.train.Example()
      example.ParseFromString(serialized_example)
      rec = example.features.feature['data'].int64_list.value
      print(rec)

    Prints the list of int values, as expected.

  • reading with tf.parse_single_example() fails

    filenames = glob.glob("/tmp/tfr-test/part-m-0000*")
    filename_queue = tf.train.string_input_producer(filenames)
    reader = tf.WholeFileReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={'data': tf.VarLenFeature(dtype=tf.int64)})
    
    data = features['data'].values
    
    sess = tf.InteractiveSession()
    tf.train.start_queue_runners(sess=sess)
    print(data.eval())

    with InvalidArgumentError: Could not parse example input, value: '

    >>> print(data.eval())
    2017-05-26 20:05:17.917177: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Could not parse example input, value: '
         Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/<python>//lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 569, in eval
        return _eval_using_default_session(self, feed_dict, self.graph, session)
      File "/<python>//lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3741, in _eval_using_default_session
        return session.run(tensors, feed_dict)
      File "/<python>//lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
        run_metadata_ptr)
      File "/<python>//lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
        feed_dict_string, options, run_metadata)
      File "/<python>//lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
        target_list, options, run_metadata)
      File "/<python>//lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Could not parse example input, value: '
    

Did I miss something or is that an expected result? Would appreciate any help here.
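One likely cause, noted here as an observation rather than a reply from the original thread: the failing snippet uses tf.WholeFileReader, which yields an entire file per read, whereas tf.parse_single_example expects one serialized tf.train.Example at a time. A hedged sketch using tf.TFRecordReader with the same queue-based TF 1.x API:

import glob
import tensorflow as tf

filenames = glob.glob("/tmp/tfr-test/part-m-0000*")
filename_queue = tf.train.string_input_producer(filenames)

# TFRecordReader yields one serialized tf.train.Example per read.
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized_example,
    features={'data': tf.VarLenFeature(dtype=tf.int64)})
data = features['data'].values

sess = tf.InteractiveSession()
tf.train.start_queue_runners(sess=sess)
print(data.eval())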

run distributed training with kubernetes: worker wait for ps reponse forever

I am new to the distributed world. I set up a Kubernetes cluster with 2 nodes, then ran the example with 1 ps and 1 worker, but my worker seems to wait forever.

Here is my log from the worker pod (kubectl logs mymnistdist-worker-0-6b11q):

2017-09-05 02:06:37.768560: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

Here is my log from the ps pod:

01:42:57.575306: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:5000

I logged into these 2 pods and found that the hostnames are:

mymnistdist-ps-0-ncfjz
mymnistdist-worker-0-6b11q

They have a random suffix, but what I give to the tf.train.ClusterSpec is:

 worker_hosts=mymnistdist-worker-0:5000
ps_hosts=mymnistdist-ps-0:5000

I googled around; it seems there is no way to set a pod name in a Kubernetes ReplicaSet.
Is there a way to work around this?

[Request] Add Kubeflow as the way to run TensorFlow distributed training jobs on Kubernetes

Hi, I am from the Kubeflow community, an open source community dedicated to making ML stacks on Kubernetes easy, fast and extensible to use.

We implemented an operator for TensorFlow on Kubernetes. Personally, I think the CRD defined by the operator is easier to use and understand than the Jinja templates. Here is one example:

apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure

I am wondering if we could add our implementation to the repository to enrich the ecosystem of TensorFlow and Kubernetes.

Your suggestions are welcome!

/cc @ddysher @jlewi @DjangoPeng @ScorpioCPH

[hadoop] DataLossError after dump data to local disk

Hi all, I am using the hadoop package and successfully transformed data to TFRecords on the cluster. But after I dumped the data to local disk and tried to use TensorFlow to read it using

for serialize_example in tf.python_io.tf_record_iterator(filename)

but I get the following error:

Traceback (most recent call last):
  File "tf_test.py", line 3, in <module>
    for serialize_example in tf.python_io.tf_record_iterator('tfrecord_test/test_file'):
  File "/usr/lib/python2.7/site-packages/tensorflow/python/lib/io/tf_record.py", line 77, in tf_record_iterator
    reader.GetNext(status)
  File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0

Can anyone help with this problem?

Fail to build the HDFS docker image because of lack of the hadoop files

We are trying to build the Docker image with HDFS support and it fails every time. It is easy to reproduce with the following commands.

git clone https://github.com/tensorflow/ecosystem

cd ./ecosystem/docker/

docker build -t tensorflow_hdfs_latest -f Dockerfile.hdfs .

Then we get this error message:

(screenshot of the build error, 2018-04-20)

The URL of the Hadoop files seems to be invalid, which can be verified by accessing http://www-us.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz.

We changed the URL to the official Hadoop archive and it works now:

RUN curl -O https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

cannot find tensorflow-hadoop-1.0-SNAPSHOT.jar

I tried to use mvn to build the hadoop project by following https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN.

After I run mvn clean package -Dmaven.test.skip=true,
the files in ./target are:
drwxr-xr-x 8 root root 4096 Nov 5 15:15 ./
drwxr-xr-x 4 root root 4096 Nov 5 15:15 ../
drwxr-xr-x 3 root root 4096 Nov 5 15:15 apidocs/
drwxr-xr-x 3 root root 4096 Nov 5 15:15 classes/
drwxr-xr-x 3 root root 4096 Nov 5 15:15 generated-sources/
drwxr-xr-x 2 root root 4096 Nov 5 15:15 javadoc-bundle-options/
drwxr-xr-x 2 root root 4096 Nov 5 15:15 maven-archiver/
drwxr-xr-x 3 root root 4096 Nov 5 15:15 maven-status/
-rw-r--r-- 1 root root 19195 Nov 5 15:15 tensorflow-hadoop-1.10.0.jar
-rw-r--r-- 1 root root 68250 Nov 5 15:15 tensorflow-hadoop-1.10.0-javadoc.jar
-rw-r--r-- 1 root root 10424 Nov 5 15:15 tensorflow-hadoop-1.10.0-sources.jar

There is no tensorflow-hadoop-1.0-SNAPSHOT.jar, so did I miss anything?

Add tensorflow-hadoop jar to public maven repository

Feature request to add versioned tensorflow-hadoop jars to a public maven repository. This will make it easier to develop Java libraries that rely on it. Currently, we need to compile our own snapshot versions of the library to use it.

How to use TensorFlow code to correctly read TFRecords produced by Spark?

I have run into a problem with spark-tensorflow-connector that I have not been able to solve in a month, so I have no choice but to ask you for advice.
I want to produce TFRecords from Spark, so I use df.write.format("tfrecords").option("recordType", "Example").save(path). But when I parse the TFRecords with TensorFlow code, the data is wrong. For example, my data in Spark is:

feature1,feature2
Hello, friend
HI , girl

When I use TensorFlow to parse the output written by spark-tensorflow-connector, I get this:

feature1,feature2
Hello, girl
HI , friend

The data is wrong.
So how can I solve this problem?
I can use df.coalesce(1).write.format("tfrecords") to avoid it, but that is very slow.
I am sorry to disturb you during your busy schedule. Waiting for your reply. Thank you very much!

Error when generate TFRecord with example code on Hadoop

Hi,

I'm trying to use MapReduce to generate TFRecords. Currently I'm starting from the example code (TFRecordFileMRExample). But after packaging the jar and running a Hadoop application, I got the following errors.

I also tested generating TFRecords locally with something like java -cp tensorflow-hadoop-1.5.0.jar org.tensorflow.hadoop.example.TFRecordLocalExample and it works, so I think the dependencies are built successfully. But the Hadoop job cannot build TFRecords.

My hadoop version is 2.7.1

Looking forward to any comments and solutions.

2018-10-11 05:28:54,606 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchMethodError: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;
	at com.google.protobuf.GeneratedMessageV3$FieldAccessorTable.<init>(GeneratedMessageV3.java:1727)
	at org.tensorflow.example.FeatureProtos.<clinit>(FeatureProtos.java:104)
	at org.tensorflow.example.Features$FeatureDefaultEntryHolder.<clinit>(Features.java:95)
	at org.tensorflow.example.Features$Builder.internalGetMutableFeature(Features.java:512)
	at org.tensorflow.example.Features$Builder.putFeature(Features.java:630)
	at org.tensorflow.hadoop.example.TFRecordFileMRExample$ToTFRecordMapper.map(TFRecordFileMRExample.java:61)
	at org.tensorflow.hadoop.example.TFRecordFileMRExample$ToTFRecordMapper.map(TFRecordFileMRExample.java:45)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1803)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Thank you very much!

[spark-tensorflow-connector] Unable to load a tfrecords file from HDFS

When trying to load a file using

var df = spark.read.format("tfrecords").load(hdfs_path)

this error is raised:

java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation$$anonfun$3.apply(TensorflowRelation.scala:42)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation$$anonfun$3.apply(TensorflowRelation.scala:42)
	at scala.Option.getOrElse(Option.scala:120)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.x$1$lzycompute(TensorflowRelation.scala:42)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.x$1(TensorflowRelation.scala:32)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.tfSchema$lzycompute(TensorflowRelation.scala:32)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.tfSchema(TensorflowRelation.scala:32)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.schema(TensorflowRelation.scala:59)
	at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:40)
	at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:389)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:147)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
	... (several dozen nested Spark REPL $iwC.<init> constructor frames elided) ...
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1083)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1364)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:859)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:890)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:838)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:859)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:904)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:815)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:947)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:947)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:947)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
	at org.apache.spark.repl.Main$.main(Main.scala:34)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:761)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The TFRecords file is generated using:

df.write.format("tfrecords").save(outputTrainFilename)

Used these versions:

org.tensorflow:spark-connector_2.11:1.10.0
org.tensorflow:hadoop:1.10.0

with Scala 2.11.

spark-tensorflow-connector support for BinaryType

I'm working on a project where I'm trying to store binary data (jpeg-encoded images). The data is currently stored in a Spark SQL BinaryType column, and I want it to end up in a BytesList field in my tfrecords.

Considering I'm trying to copy binary data byte-by-byte, this has proven surprisingly difficult, mainly because I have to convert to String in between, which introduces all sorts of issues with character encodings. This would be much easier if we could go directly to and from BinaryType.

I'm willing to contribute a PR, but I wanted to get some eyes on a proposal before I spend time writing code. I'd like to add an encoder that will go directly from a BinaryType column to BytesList. I think this is relatively uncontroversial.

The second bit is trickier. What should happen when we read a BytesList from TFRecords? Right now, I think it's assumed to be a UTF-8-encoded string and is decoded as such. I'm guessing this actually covers the majority of use cases, but it might cause problems if a user actually wants their binary data back (I haven't actually tested this). It seems like a more natural approach would be to return a BinaryType and let the user cast to StringType if that's what they want. This, of course, would break existing code, so maybe it's a non-starter.

Thoughts? My immediate concern is just the encoder, but I thought I'd raise the other issue as well for consistency's sake.
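
To make the first part concrete, here is a minimal sketch at the tf.train.Example level (not the connector's actual encoder; the file name is illustrative) of why a direct BinaryType-to-BytesList mapping avoids the string round-trip: raw bytes can be placed in a BytesList feature without any character decoding.

import tensorflow as tf

# Hypothetical JPEG payload read as raw bytes; no UTF-8 decoding involved.
jpeg_bytes = open("image.jpg", "rb").read()

example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
}))
serialized = example.SerializeToString()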

TensorFlow on Apache Ignite

Apache Ignite is a memory-centric distributed database. The goal of TensorFlow on Apache Ignite is to allow users to train neural network models and run inference on data stored in Apache Ignite, utilizing all TensorFlow functionality and Apache Ignite's data collocation capabilities.

Our team is currently working on an Apache Ignite integration for this purpose; the corresponding task is IGNITE-8335 in the Apache JIRA. We would like to contribute our work here and welcome any contributions.

As a first step we'd like to publish the first version of our design document. We will be grateful for any comments and remarks regarding this document and any other ideas from the TensorFlow community.

Distributed training is slow compared to single machine

I am trying to perform distributed training on models from TensorFlow Object Detection. I am able to run distributed training, but I notice it is much slower than single-machine training: the step duration is considerably higher in the distributed setup. I am using Hadoop HDFS to feed data to all hosts.

From Distributed Training (All machines have more or less same step duration):

INFO:tensorflow:global step 1247: loss = 3.1943 (1.972 sec/step)

INFO:tensorflow:global step 1251: loss = 2.5009 (1.218 sec/step)

INFO:tensorflow:global step 1254: loss = 2.2030 (1.746 sec/step)

INFO:tensorflow:global step 1257: loss = 1.9776 (2.222 sec/step)

INFO:tensorflow:global step 1262: loss = 2.6990 (1.161 sec/step)

INFO:tensorflow:global step 1264: loss = 1.9501 (1.682 sec/step)

INFO:tensorflow:global step 1268: loss = 1.6125 (1.288 sec/step)

INFO:tensorflow:global step 1271: loss = 2.0434 (1.223 sec/step)

From single machine training:

INFO:tensorflow:global step 236: loss = 4.1127 (0.319 sec/step)

INFO:tensorflow:global step 237: loss = 3.0208 (0.316 sec/step)

INFO:tensorflow:global step 238: loss = 3.2838 (0.367 sec/step)

INFO:tensorflow:global step 239: loss = 2.9822 (0.324 sec/step)

INFO:tensorflow:global step 240: loss = 2.9753 (0.322 sec/step)

INFO:tensorflow:global step 249: loss = 2.8071 (0.318 sec/step)

INFO:tensorflow:global step 250: loss = 3.4335 (0.328 sec/step)

INFO:tensorflow:global step 251: loss = 4.3550 (0.322 sec/step)

Why do we see such a difference between distributed and single-machine training? Is there a way to resolve this?

k8s how long is the training process?

I run distributed mnist on k8s.
1 ps and 3 workers.
After an hour, the status of the pods is:
NAME READY STATUS RESTARTS AGE
distributed-mnist-ps-0-fz4gw 1/1 Running 0 1h
distributed-mnist-worker-0-l4nv5 1/1 Running 0 1h
distributed-mnist-worker-1-8j8d7 1/1 Running 0 1h
distributed-mnist-worker-2-0rjbw 1/1 Running 0 1h

It has trained for 1 hour. How long is the training process expected to take? Thanks.

[Hadoop] java.lang.NoClassDefFoundError: org/tensorflow/example/Int64List

Hello,

I am trying to run "src/main/java/org/tensorflow/hadoop/example/TFRecordFileMRExample.java".

I am using Hadoop 2.9.0 and TensorFlow 1.6.0. After running:
mvn clean package
I run:
hadoop jar target/tensorflow-hadoop-1.6.0.jar org.tensorflow.hadoop.example.TFRecordFileMRExample /user/root

And then it fails showing:

java.lang.Exception: java.lang.NoClassDefFoundError: org/tensorflow/example/Int64List
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:491)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:551)
Caused by: java.lang.NoClassDefFoundError: org/tensorflow/example/Int64List
at org.tensorflow.hadoop.example.TFRecordFileMRExample$ToTFRecordMapper.map(TFRecordFileMRExample.java:48)
at org.tensorflow.hadoop.example.TFRecordFileMRExample$ToTFRecordMapper.map(TFRecordFileMRExample.java:43)

It seems like something is wrong with Maven? I can find org.tensorflow.example, and the problem reappears even when I use another machine.

Any idea? Thanks

Error with Maven3 Build in Ubuntu 14.04

Hi,
I keep getting an error when I do the build with Maven 3.3.9:
"/ecosystem/hadoop/src/main/java/org/tensorflow/hadoop/example/TFRecordFileMRExample.java: package org.tensorflow.example does not exist". I have successfully downloaded and installed the prerequisites. Not sure if it is a protobuf issue or an mvn issue.

Get `TypeError: can't pickle _thread._local objects` when running the AllReduce strategy

Hi,

I just followed your instructions to run the standalone AllReduce example, but I got this error: 'TypeError: can't pickle _thread._local objects'.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 10.13.6 (17G65)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:None
  • TensorFlow installed from (source or binary): Anaconda
  • TensorFlow version (use command below): 1.11
  • Python version:3.6.5
  • Exact command to reproduce:
export TF_CONFIG='{"cluster":{"worker":["localhost:10000", "localhost:10001"],"chief":["localhost:10002"]}, "task":{"type":"worker","index":0}}'
python tf_std_server.py

export TF_CONFIG='{"cluster":{"worker":["localhost:10000", "localhost:10001"],"chief":["localhost:10002"]}, "task":{"type":"worker","index":1}}'
python tf_std_server.py

export TF_CONFIG='{"cluster":{"worker":["localhost:10000", "localhost:10001"],"chief":["localhost:10002"]}, "task":{"type":"chief","index":0}}'
python tf_std_server.py

export TF_CLUSTER='{"worker":["localhost:10000", "localhost:10001"],"chief":["localhost:10002"]}'
python keras_model_to_estimator_client_alt.py /tmp/checkpoints

Describe the problem

I just followed the instructions to run the standalone CollectiveAllReduce example code on my computer, but I still get the error: 'TypeError: can't pickle _thread._local objects'.

Source code

https://github.com/tensorflow/ecosystem/blob/65299036e6f64e3d333fafd437624a6f8bb0bea6/distribution_strategy/tf_std_server.py

Feature decoder issue in the Spark TensorFlow connector

Right now, the StringFeatureDecoder uses the following logic to transform a BytesList to a String:

feature.getBytesList.toByteString.toStringUtf8.trim

Since BytesList is a List[ByteString], this might work when the BytesList contains only one string. But when there is a binary image or multiple strings, it will no longer work.

I think it might be better to map BytesList to Array[Array[Byte]] instead of String, and let users decide whether they need to convert Array[Byte] to a string after getting the dataset.
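
As a quick illustration of the limitation (a sketch using the TensorFlow Python API rather than the connector's Scala code): a BytesList can hold several entries, and an entry may be raw binary, so collapsing the list into a single UTF-8 string loses both the entry boundaries and the exact bytes, whereas a list of byte arrays preserves them.

import tensorflow as tf

# Two text entries plus some JPEG-like raw bytes in one BytesList.
bl = tf.train.BytesList(value=[b"first", b"second", b"\xff\xd8\xff"])

# Keeping a list of byte arrays preserves every entry exactly.
entries = [bytes(v) for v in bl.value]
print(entries)  # [b'first', b'second', b'\xff\xd8\xff']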

how can the parameter server stop itself?

Since we wrote the code below in the parameter server part:
server.join()

the parameter server cannot stop itself when training finishes unless we kill the process. Do you have any other suggestions?
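
One workaround that is often used for this (a sketch, not part of this repo's templates; it assumes the FLAGS, worker_spec, and server variables from the common setup) is to have each ps task block on a shared "done" queue instead of calling server.join(), and have every worker enqueue a token into that queue when it finishes:

import tensorflow as tf

num_workers = len(worker_spec)

if FLAGS.job_name == "ps":
    # A shared queue pinned to this ps task; workers enqueue a token when done.
    with tf.device("/job:ps/task:%d" % FLAGS.task_index):
        done_queue = tf.FIFOQueue(num_workers, tf.int32,
                                  shared_name="done_queue_%d" % FLAGS.task_index)
    with tf.Session(server.target) as sess:
        for _ in range(num_workers):
            sess.run(done_queue.dequeue())  # block until each worker signals
    # Every worker has finished, so the ps process can exit instead of joining forever.

On the worker side, each worker builds the same queues (same device placement and shared_name) and enqueues one token per ps task after training completes.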

Compiling the code failed: Class java.lang.FunctionalInterface not found

[ERROR] Class java.lang.FunctionalInterface not found - continuing with a stub.
[ERROR] Class java.lang.FunctionalInterface not found - continuing with a stub.
[ERROR] Class java.lang.FunctionalInterface not found - continuing with a stub.
[ERROR] Class java.lang.FunctionalInterface not found - continuing with a stub.
[ERROR] Class java.lang.FunctionalInterface not found - continuing with a stub.
[ERROR] Class java.lang.FunctionalInterface not found - continuing with a stub.
[ERROR] Class java.lang.FunctionalInterface not found - continuing with a stub.
[ERROR] 7 errors found

Sparse vectors are written in a non-useful way.

When using the spark-tensorflow-connector to write a dataframe containing a sparse vector to the tfrecords format, we write the sparse representation rather than the more useful dense representation:
(3,[1,2],[1.0,1.0]) rather than [0.0, 1.0, 1.0]

I am not sure what the default behaviour should be, or how we can handle this more elegantly, because just blindly densifying sparse vectors could cause problems.

However, tensorflow doesn't like sparse data types, so we should at least give the option of densifying.

NOTE: A current workaround is to define a UDF which densifies the vector, as sketched below.
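
A minimal sketch of such a UDF in PySpark (assuming the features column holds pyspark.ml.linalg vectors; the column names are illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Convert the (possibly sparse) ML vector into a plain dense array column
# before writing to tfrecords.
densify = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
df = df.withColumn("features_dense", densify("features"))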

Here is an example demonstrating the issue (written in PySpark):

##Define a toy corpus.
df = spark.createDataFrame([
    (0, "Spark is cool"),
    (1, "Tensorflow is cool also"),
    (2, "Why doesn't Tensorflow support normal data formats"),
    (3, "Tensorflow eats my GPU's")
], ["id", "sentence"])


##Tokenize the corpus.
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
df = tokenizer.transform(df)

##Fit a CountVectorizerModel from the corpus.
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="words", outputCol="features", minDF=2.0)
df = cv.fit(df).transform(df)

##Display Dataset
df.printSchema()
df.show(truncate=False)

#Write as TFRecords.
df.repartition(1).write.format("tensorflow").save("/storage/example/example.tfrecords")

Here is the output of the above code:

root
 |-- id: long (nullable = true)
 |-- sentence: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
+---+--------------------------------------------------+----------------------------------------------------------+-------------------------+
|id |sentence                                          |words                                                     |features                 |
+---+--------------------------------------------------+----------------------------------------------------------+-------------------------+
|0  |Spark is cool                                     |[spark, is, cool]                                         |(3,[1,2],[1.0,1.0])      |
|1  |Tensorflow is cool also                           |[tensorflow, is, cool, also]                              |(3,[0,1,2],[1.0,1.0,1.0])|
|2  |Why doesn't Tensorflow support normal data formats|[why, doesn't, tensorflow, support, normal, data, formats]|(3,[0],[1.0])            |
|3  |Tensorflow eats my GPU's                          |[tensorflow, eats, my, gpu's]                             |(3,[0],[1.0])            |
+---+--------------------------------------------------+----------------------------------------------------------+-------------------------+

Here is the output of cat /storage/example/example.tfrecords/*:

(binary TFRecord dump; aside from protobuf framing bytes, the readable fragments show each row's fields serialized as strings, e.g. the words column as "WrappedArray(spark, is, cool)" and the features column as the literal sparse representation "(3,[1,2],[1.0,1.0])" / "(3,[0],[1.0])" rather than a dense array)

Unable to start TensorFlow

Team,

I have installed Anaconda on Red Hat 6.8. I am following the TensorFlow installation steps from https://www.tensorflow.org/install/install_linux#InstallingAnaconda. As part of the installation process I am running the command (conda create -n tensorflow). Because our cluster is protected, I am unable to download the .conda directory. Can you please help me with this?

$ export PATH=/opt/app/anaconda2/python27/bin:$PATH
$ conda create -n tensorflow
Fetching package metadata ...

CondaHTTPError: HTTP None None for url https://repo.continuum.io/pkgs/free/linux-64/repodata.json.bz2
Elapsed: None

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='repo.continuum.io', port=443): Max retries exceeded with url: /pkgs/free/linux-64/repodata.json.bz2 (Caused by ProtocolError('Connection aborted.', error(97, 'Address family not supported by protocol')))",),)

$ cd /opt/app/anaconda2/python27/pkgs/tensorflow-0.10.0rc0-np111py27_0/
$ ll
total 12
drwxrwxr-x 2 python python 4096 Mar 31 11:20 bin
drwxrwxr-x 3 python python 4096 Dec 14 22:46 info
drwxrwxr-x 3 python python 4096 Dec 14 22:45 lib
$ pwd

TensorFlow implementation in MapReduce

Hello guys,

I just want to ask something. Can I implement TensorFlow in the MapReduce programming model within Hadoop?

Yes, I know that TensorFlow can be run in distributed mode, according to this reference https://www.tensorflow.org/versions/r0.11/how_tos/distributed/ . But it doesn't explain in more detail how to use it with MapReduce.

I see an Apache Hadoop MapReduce InputFormat/OutputFormat implementation for TensorFlow's TFRecords format in this repo, but I don't have any idea about this. Can you explain to me what it is? And what is it for? Can I use it to help write TensorFlow in a MapReduce way?

One last thing: can I use TensorFlow to make predictions over a lot of data? Because I only know that TensorFlow can be used to classify similar images.

I'm sorry if my questions look dumb. Well, there are many things in the world that I don't know. So, when I don't know, I will ask a question 😄
Thanks for your attention

[spark-tensorflow-connector] Reading TFRecord throws NoSuchElementException

The following exception is thrown when I read tfrecord with

spark.read.format("tfrecords").load(path)
java.util.NoSuchElementException: key not found: path
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at org.apache.spark.sql.catalyst.util.CaseInsensitiveMap.default(CaseInsensitiveMap.scala:28)
	at scala.collection.MapLike$class.apply(MapLike.scala:141)
	at org.apache.spark.sql.catalyst.util.CaseInsensitiveMap.apply(CaseInsensitiveMap.scala:28)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.x$1$lzycompute(TensorflowRelation.scala:33)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.x$1(TensorflowRelation.scala:32)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.tfSchema$lzycompute(TensorflowRelation.scala:32)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.tfSchema(TensorflowRelation.scala:32)
	at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.schema(TensorflowRelation.scala:59)
	at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:77)
	at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:423)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)

Instead, the path has to be set as an option, like

spark.read.format("tfrecords").option("path", path).load()

Maybe the README needs an update?

Add support for SequenceExample to Spark TensorFlow Connector

We received a request from @mpekalski to add support for feature lists specified in the SequenceExample tfrecord. At present, the connector only supports Example messages.

We are proposing that each feature_list gets represented as an Array of Arrays of type Long, Float, or String, depending on whether it is an Int64List, FloatList, or BytesList.

For example, a Spark SQL row with the schema:

StructType(List(
    StructField("movie_ratings", ArrayType(ArrayType(LongType, true), true)),
    StructField("actors", ArrayType(ArrayType(StringType, true), true))
))

gets represented as a SequenceExample with feature_lists:

feature_lists: {
  feature_list: {
    key: "movie_ratings"
    value: {
      feature: { int64_list: { value: [ 4 ] } }
      feature: { int64_list: { value: [ 5 ] } }
      feature: { int64_list: { value: [ 2 ] } }
    }
  }
  feature_list: {
    key: "actors"
    value: {
      feature: { bytes_list: { value: [ "Tim Robbins", "Morgan Freeman" ] } }
      feature: { bytes_list: { value: [ "Brad Pitt", "Edward Norton", "Helena Bonham Carter" ] } }
    }
  }
}
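
For reference, the equivalent SequenceExample built directly with the TensorFlow Python API would look roughly like this (a sketch to clarify the proposed mapping, not the connector's implementation):

import tensorflow as tf

def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def bytes_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

seq = tf.train.SequenceExample(
    feature_lists=tf.train.FeatureLists(feature_list={
        "movie_ratings": tf.train.FeatureList(
            feature=[int64_feature([4]), int64_feature([5]), int64_feature([2])]),
        "actors": tf.train.FeatureList(
            feature=[bytes_feature([b"Tim Robbins", b"Morgan Freeman"]),
                     bytes_feature([b"Brad Pitt", b"Edward Norton",
                                    b"Helena Bonham Carter"])]),
    }))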

@jhseu, @mpekalski, @karthikvadla, @joyeshmishra, please let us know if you would like us to make any changes to the proposed API. We will work on the pull request.

[TFRecords file is too large] 3x larger than Parquet, 1.3x larger than a plain text file

As a binary format, shouldn't a TFRecords file be much smaller than other file types (especially a plain text file saved from an RDD)?

I have created this "featureDF" and saved it in 3 different ways:
parquet: 793.4M
featureDF.write.parquet("xxx")
textFile: 1.8G
featureDF.rdd.saveAsTextFile("xxxx")
tfrecords: 2.4G
featureDF.write.format("tfrecords").option("recordType","Example").save("xxx")

and with tfrecords, option("codec","org.apache.hadoop.io.compress.GzipCodec") is not working,
so I tried the Linux command gzip -9 featureDF.tfrecord, which compressed it from 2.4G to 515M and took 5 minutes. By comparison, gzip -9 featureDF.parquet compressed the parquet file from 793.4M to 149M.

Is there any efficient way to compress these tfrecord files?
(I already tried replacing DoubleType and LongType with StringType; it did not help.)

Issues with spark-tensorflow-connector

When using spark-tensorflow-connector, I ran the given example and got the TFRecord test-output.tfr. Then I want to use it in TensorFlow, but some errors occur.
Here is my code:

import tensorflow as tf
print(tf.__version__)
fileNameQue = tf.train.string_input_producer(["/home/wj/test-output.tfr"])
reader = tf.TFRecordReader()
key, value = reader.read(fileNameQue)
features = tf.parse_single_example(value, features={
        'id': tf.FixedLenFeature([], tf.int64),
        'IntegerTypelabel': tf.FixedLenFeature([], tf.int64),
        'LongTypelabel': tf.FixedLenFeature([], tf.int64),
        'FloatTypelabel': tf.FixedLenFeature([], tf.float32),
        'DoubleTypelabel': tf.FixedLenFeature([], tf.float32),
        'name': tf.FixedLenFeature([], tf.string),
    })
# with tf.Session() as sess:
sess = tf.InteractiveSession()

coord=tf.train.Coordinator()
threads= tf.train.start_queue_runners(coord=coord)
#for i in range(2):
# a,b,c,d,e,f = sess.run([id,IntegerTypelabel,LongTypelabel,FloatTypelabel,DoubleTypelabel,name])
a = sess.run(features)
coord.request_stop()
coord.join(threads)

Here is the error info:

---------------------------------------------------------------------------
FailedPreconditionError                   Traceback (most recent call last)
<ipython-input-13-889826d0ffbd> in <module>()
      7 #for i in range(2):
      8 # a,b,c,d,e,f = sess.run([id,IntegerTypelabel,LongTypelabel,FloatTypelabel,DoubleTypelabel,name])
----> 9 a = sess.run(features)
     10 
     11 

/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    765     try:
    766       result = self._run(None, fetches, feed_dict, options_ptr,
--> 767                          run_metadata_ptr)
    768       if run_metadata:
    769         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
    963     if final_fetches or final_targets:
    964       results = self._do_run(handle, final_targets, final_fetches,
--> 965                              feed_dict_string, options, run_metadata)
    966     else:
    967       results = []

/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1013     if handle is None:
   1014       return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
-> 1015                            target_list, options, run_metadata)
   1016     else:
   1017       return self._do_call(_prun_fn, self._session, handle, feed_dict,

/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
   1033         except KeyError:
   1034           pass
-> 1035       raise type(e)(node_def, op, message)
   1036 
   1037   def _extend_graph(self):

FailedPreconditionError: /home/wj/test-output.tfr
	 [[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer)]]

Caused by op u'ReaderReadV2', defined at:
  File "/home/wj/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/wj/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/wj/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py", line 3, in <module>
    app.launch_new_instance()
  File "/home/wj/anaconda2/lib/python2.7/site-packages/traitlets/config/application.py", line 653, in launch_instance
    app.start()
  File "/home/wj/anaconda2/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 474, in start
    ioloop.IOLoop.instance().start()
  File "/home/wj/anaconda2/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 162, in start
    super(ZMQIOLoop, self).start()
  File "/home/wj/anaconda2/lib/python2.7/site-packages/tornado/ioloop.py", line 887, in start
    handler_func(fd_obj, events)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/home/wj/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 276, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 228, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 390, in execute_request
    user_expressions, allow_stdin)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 501, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/wj/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-7fd6f1c21551>", line 1, in <module>
    key, value = reader.read(fileNameQue)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/io_ops.py", line 272, in read
    return gen_io_ops._reader_read_v2(self._reader_ref, queue_ref, name=name)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 410, in _reader_read_v2
    queue_handle=queue_handle, name=name)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/wj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): /home/wj/test-output.tfr
	 [[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer)]]

Here is my environment:

Linux-3.13.0-24-generic-x86_64-with-debian-jessie-sid
('Python', '2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jul  2 2016, 17:42:40) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.12.1')
('SciPy', '0.19.0')
('Tensorflow', '1.0.0')

No version of this project runs

org.spark_project.guava.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: org/codehaus/janino/ClassBodyEvaluator
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2261)
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:412)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:366)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:889)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:130)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:120)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:114)
at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$2.apply(ExistingRDD.scala:194)
at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$2.apply(ExistingRDD.scala:193)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:815)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:815)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/codehaus/janino/ClassBodyEvaluator
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:912)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1013)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1010)
at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
... 33 more
Caused by: java.lang.ClassNotFoundException: org.codehaus.janino.ClassBodyEvaluator
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 40 more
18/11/18 18:34:23 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 6)
18/11/18 18:34:23 ERROR Executor: Exception in task 2.0 in stage 1.0 (TID 7)
(the same org.spark_project.guava.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: org/codehaus/janino/ClassBodyEvaluator stack trace as above repeats for each task)

TensorFlow on Hadoop YARN

Hadoop YARN is a commonly deployed cluster manager. Having the ability to run TensorFlow on YARN would be very useful in such environments.

Our team is currently working on a YARN application for this purpose, and would like to contribute our work here. We will provide more details of our contribution soon.

-Jason

Add support for Kubernetes on AWS

As I have seen, the templates here deploy TensorFlow on Kubernetes on Google Cloud Platform; it would be great to also add Kubernetes deployment steps for AWS (for AWS users).

Does the Spark TF connector support in-memory transformation?

I have a lot of data in Parquet format on HDFS which I want to use with TensorFlow. Currently, I can read the Parquet files into Spark and write the dataframe out as temporary TFRecord files ready for TensorFlow; however, this costs both time and space. Is there any way to read Parquet files and make them available to TensorFlow directly?

tensorflow-hadoop fails to compile with protoc 3.3.0

The newest protoc (3.3.0) causes a compilation failure with errors such as:

... hadoop/src/main/java/org/tensorflow/example/FloatList.java:[194,17] error: no suitable method found for parseFrom(ByteBuffer)
[ERROR] int,int) is not applicable

Compiling with protoc 3.1.0 (as per the README) works fine.
