io's Introduction




TensorFlow I/O


TensorFlow I/O is a collection of file systems and file formats that are not available in TensorFlow's built-in support. A full list of the file systems and file formats supported by TensorFlow I/O can be found here.

Using tensorflow-io with Keras is straightforward. Below is the Get Started with TensorFlow example, with the data-processing step replaced by tensorflow-io:

import tensorflow as tf
import tensorflow_io as tfio

# Read the MNIST data into the IODataset.
dataset_url = "https://storage.googleapis.com/cvdf-datasets/mnist/"
d_train = tfio.IODataset.from_mnist(
    dataset_url + "train-images-idx3-ubyte.gz",
    dataset_url + "train-labels-idx1-ubyte.gz",
)

# Shuffle the elements of the dataset.
d_train = d_train.shuffle(buffer_size=1024)

# By default image data is uint8, so convert to float32 using map().
d_train = d_train.map(lambda x, y: (tf.image.convert_image_dtype(x, tf.float32), y))

# Batch the data just like any other tf.data.Dataset.
d_train = d_train.batch(32)

# Build the model.
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation=tf.nn.relu),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax),
    ]
)

# Compile the model.
model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

# Fit the model.
model.fit(d_train, epochs=5, steps_per_epoch=200)

In the above MNIST example, the URLs to access the dataset files are passed directly to the tfio.IODataset.from_mnist API call. This is due to the inherent support that tensorflow-io provides for the HTTP/HTTPS file system, eliminating the need to download and save datasets to a local directory.

NOTE: Since tensorflow-io is able to detect and decompress the MNIST dataset automatically if needed, we can pass the URLs for the compressed (gzip) files to the API call as is.

Please check the official documentation for more detailed and interesting usages of the package.

Installation

Python Package

The tensorflow-io Python package can be installed with pip directly using:

$ pip install tensorflow-io

People who are a little more adventurous can also try our nightly binaries:

$ pip install tensorflow-io-nightly

To ensure you have a version of TensorFlow that is compatible with TensorFlow-IO, you can specify the tensorflow extra requirement during install:

$ pip install tensorflow-io[tensorflow]

Similar extras exist for the tensorflow-gpu, tensorflow-cpu and tensorflow-rocm packages (for example, pip install tensorflow-io[tensorflow-gpu]).

Docker Images

In addition to the pip packages, Docker images can be used to quickly get started.

For stable builds:

$ docker pull tfsigio/tfio:latest
$ docker run -it --rm --name tfio-latest tfsigio/tfio:latest

For nightly builds:

$ docker pull tfsigio/tfio:nightly
$ docker run -it --rm --name tfio-nightly tfsigio/tfio:nightly

R Package

Once the tensorflow-io Python package has been successfully installed, you can install the development version of the R package from GitHub via the following:

if (!require("remotes")) install.packages("remotes")
remotes::install_github("tensorflow/io", subdir = "R-package")

TensorFlow Version Compatibility

To ensure compatibility with TensorFlow, it is recommended to install a matching version of TensorFlow I/O according to the table below. You can find the list of releases here.

TensorFlow I/O Version TensorFlow Compatibility Release Date
0.37.0 2.16.x Apr 25, 2024
0.36.0 2.15.x Feb 02, 2024
0.35.0 2.14.x Dec 18, 2023
0.34.0 2.13.x Sep 08, 2023
0.33.0 2.13.x Aug 01, 2023
0.32.0 2.12.x Mar 28, 2023
0.31.0 2.11.x Feb 25, 2023
0.30.0 2.11.x Jan 20, 2023
0.29.0 2.11.x Dec 18, 2022
0.28.0 2.11.x Nov 21, 2022
0.27.0 2.10.x Sep 08, 2022
0.26.0 2.9.x May 17, 2022
0.25.0 2.8.x Apr 19, 2022
0.24.0 2.8.x Feb 04, 2022
0.23.1 2.7.x Dec 15, 2021
0.23.0 2.7.x Dec 14, 2021
0.22.0 2.7.x Nov 10, 2021
0.21.0 2.6.x Sep 12, 2021
0.20.0 2.6.x Aug 11, 2021
0.19.1 2.5.x Jul 25, 2021
0.19.0 2.5.x Jun 25, 2021
0.18.0 2.5.x May 13, 2021
0.17.1 2.4.x Apr 16, 2021
0.17.0 2.4.x Dec 14, 2020
0.16.0 2.3.x Oct 23, 2020
0.15.0 2.3.x Aug 03, 2020
0.14.0 2.2.x Jul 08, 2020
0.13.0 2.2.x May 10, 2020
0.12.0 2.1.x Feb 28, 2020
0.11.0 2.1.x Jan 10, 2020
0.10.0 2.0.x Dec 05, 2019
0.9.1 2.0.x Nov 15, 2019
0.9.0 2.0.x Oct 18, 2019
0.8.1 1.15.x Nov 15, 2019
0.8.0 1.15.x Oct 17, 2019
0.7.2 1.14.x Nov 15, 2019
0.7.1 1.14.x Oct 18, 2019
0.7.0 1.14.x Jul 14, 2019
0.6.0 1.13.x May 29, 2019
0.5.0 1.13.x Apr 12, 2019
0.4.0 1.13.x Mar 01, 2019
0.3.0 1.12.0 Feb 15, 2019
0.2.0 1.12.0 Jan 29, 2019
0.1.0 1.12.0 Dec 16, 2018

Performance Benchmarking

We use GitHub Pages to document the results of API performance benchmarks. The benchmark job is triggered on every commit to the master branch and facilitates tracking performance across commits.

Contributing

TensorFlow I/O is a community-led open source project. As such, the project depends on public contributions, bug fixes, and documentation. Please see the project's contribution guidelines for details.

Build Status and CI


Because of the manylinux2010 requirement, TensorFlow I/O is built with Ubuntu 16.04 + Developer Toolset 7 (GCC 7.3) on Linux. Configuring Ubuntu 16.04 with Developer Toolset 7 is not exactly straightforward, but if the system has Docker installed, the following commands will automatically build a manylinux2010-compatible whl package:

#!/usr/bin/env bash

# Repair each built wheel in dist/ into a manylinux2010-compatible wheel.
ls dist/*
for f in dist/*.whl; do
  docker run -i --rm -v "$PWD":/v -w /v --net=host \
    quay.io/pypa/manylinux2010_x86_64 \
    bash -x -e /v/tools/build/auditwheel repair --plat manylinux2010_x86_64 "$f"
done
# Docker runs as root, so restore ownership of the generated files.
sudo chown -R "$(id -nu):$(id -ng)" .
ls wheelhouse/*

It takes some time to build, but once complete, Python 3.5, 3.6, and 3.7 compatible whl packages will be available in the wheelhouse directory.

On macOS, the same command can be used. However, the script expects python on the shell path and will only generate a whl package matching that version of Python. If you want to build a whl package for a specific Python version, you have to alias that version to python in the shell. See the Auditwheel step in .github/workflows/build.yml for instructions on how to do that.

Note that the above command is also the command we use when releasing packages for Linux and macOS.

TensorFlow I/O uses both GitHub Workflows and Google CI (Kokoro) for continuous integration. GitHub Workflows are used for macOS build and test; Kokoro is used for Linux build and test. Again, because of the manylinux2010 requirement, on Linux whl packages are always built with Ubuntu 16.04 + Developer Toolset 7. Tests are run on a variety of systems with different Python 3 versions to ensure good coverage:

Python Ubuntu 18.04 Ubuntu 20.04 macOS + osx9 Windows-2019
2.7 ✔️ ✔️ ✔️ N/A
3.7 ✔️ ✔️ ✔️ ✔️
3.8 ✔️ ✔️ ✔️ ✔️

TensorFlow I/O has integrations with many systems and cloud vendors such as Prometheus, Apache Kafka, Apache Ignite, Google Cloud PubSub, AWS Kinesis, Microsoft Azure Storage, Alibaba Cloud OSS etc.

We try our best to test against those systems in our continuous integration whenever possible. Some tests, such as Prometheus, Kafka, and Ignite, are done with live systems, meaning we install Prometheus/Kafka/Ignite on the CI machine before the test is run. Others, such as Kinesis, PubSub, and Azure Storage, are done through official or unofficial emulators. Offline tests are also performed where possible, though systems covered by offline tests may not have the same level of coverage as live systems or emulators.

Live System Emulator CI Integration Offline
Apache Kafka ✔️ ✔️
Apache Ignite ✔️ ✔️
Prometheus ✔️ ✔️
Google PubSub ✔️ ✔️
Azure Storage ✔️ ✔️
AWS Kinesis ✔️ ✔️
Alibaba Cloud OSS ✔️
Google BigTable/BigQuery to be added
Elasticsearch (experimental) ✔️ ✔️
MongoDB (experimental) ✔️ ✔️


License

Apache License 2.0


io's Issues

Support audio format for tensorflow-io

With issue #11 and PR #30, tensorflow-io will have video format support through FFmpeg. Since FFmpeg supports audio as well, it is natural to add audio format support through FFmpeg in tensorflow-io.

This issue captures the necessary effort to support audio formats in FFmpeg. Note that unlike video, many audio formats need additional information fed from outside the container. That means the API arguments might differ from those for video support.
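For context, audio support did eventually land in tensorflow-io. A minimal sketch with the current API (the file name here is a placeholder):

import tensorflow_io as tfio

# Lazily opens the file; metadata such as audio.rate and audio.shape is available.
audio = tfio.audio.AudioIOTensor("sample.wav")
samples = audio[0:1024]  # read a slice of samples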

Device and GPU/CPU in Dataset

While working on the CIFAR format, I noticed we may need access to the Device (like an Eigen thread pool or GPU) for preprocessing data in a Dataset. For example, some formats are channel_first while by default we output data in channel_last, which requires a Transpose op. We could do the transformation in Python, but ideally we could do it within the kernel.

Also a related issue: at the moment Datasets are attached to the CPU device. But would it help to send data directly to the GPU if the user wants to?
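As a Python-level illustration of the channel_first to channel_last conversion discussed above, a minimal sketch using a plain tf.data map with a Transpose op (the zero-filled dataset is a stand-in for real CIFAR data):

import tensorflow as tf

# Hypothetical channel-first data, shaped (C, H, W) like CIFAR (3, 32, 32).
d = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 3, 32, 32]))

# Transpose each element from channel_first (C, H, W) to channel_last (H, W, C).
d = d.map(lambda x: tf.transpose(x, perm=[1, 2, 0]))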

Make Dockerfile for Python Development

It would be nice to have a Dockerfile to build an image with all the configuration and packages for a developer to start building and testing right away. This would include:

  • Required OS packages
  • Bazel installation
  • TensorFlow
  • Required pip packages

Set up static html documentation for R package

We need to scaffold the R package documentation website. pkgdown is a popular choice in the R community to generate this automatically through simple configuration. We should at least scaffold the following pages:

  • API reference
  • Tutorials
  • Index page

Idea: DT_VARIANT type for ImageSets / ImageSource?

When looking through the current image operations and how to add functionality, I quickly came to imagine lots and lots of per-file-format special operations - and did not really like that picture.

Currently, all operations on image and video files (on the file system or as strings) parse the files, extract one (or more) pictures as 2-3 dimensional data tensors, and then throw away all parsing information.

This severely restricts usability - all possible operations would have to be defined per file format, and each operation starts from scratch from an unopened file.

If ImageSets (single pictures, multiple pictures (TIFF), videos (single- or multi-channel)) were wrapped in a DT_VARIANT, they could become first-class objects in TF.

ImageSet operations could then extract information from the ImageSet, like cardinality, specific images, image sizes, ...

Example usage pattern in pseudo TF code:

# Extract a random image from a TIFF file
image_set = TiffImageSet('example.tif')
n = ImageSetCardinality(image_set)
index = tf.random.uniform([], minval=0, maxval=n, dtype=tf.dtypes.int64)
image = ImageSetGetImage(image_set, index)

A similar argument could be made for representing single potential images as an ImageSource object.
This way, metadata (focal length, resolution, ...) could be retrieved from the ImageSource object using special ImageSource operations, used in calculations, and then used as parameters to retrieve a cropped and scaled subset with another ImageSource operation.

Example usage:

image_source = Picture('car.jpg')
# Extract a low-resolution picture from image_source and find the license plate
license_plate_area = find_license_plate(image_source)
# Extract a high-resolution patch from image_source and read the license plate
plate_image = ImageSourceExtractPatch(image_source, license_plate_area)
tag = read_tag(plate_image)

I believe ImageSource operations could also be made batch-friendly to avoid unnecessary copy operations on batching.

This may not be a realistic idea - but I wanted to bring it up, as this may be the ideal time to specify a new generic interface. (There are enough existing formats in the repository to verify the interface, but not so many as to make the task unmanageable.)

Thoughts?

Guidance on custom filesystem

The past guide for this was at https://www.tensorflow.org/guide/extend/filesystem and I'm looking at porting the Azure Blob Storage file system to this repo.

I was wanting some guidance on what might have changed. I found the igfs registration here https://github.com/tensorflow/io/blob/master/tensorflow_io/ignite/ops/igfs_ops.cc and what seems to be the build target here https://github.com/tensorflow/io/blob/master/tensorflow_io/ignite/BUILD#L5-L43 for this file system. Are these two components all that is likely needed, from my point of view?

Support libsvm for tensorflow-io

In TensorFlow, the libsvm format is supported through tf.contrib.libsvm, which converts libsvm format into sparse tensor format. With the expected deprecation of tf.contrib, the libsvm support will be deprecated as well.

It would be great to provide continued support for libsvm through the tensorflow-io package. While tf.contrib.libsvm converts libsvm format into sparse tensors, I think in tensorflow-io we could convert libsvm to tf.data directly; see the sketch below.
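As a rough illustration of the tf.data direction (this is not an eventual tensorflow-io API; the file name and the fixed feature dimension are assumptions):

import numpy as np
import tensorflow as tf

NUM_FEATURES = 10  # assumption: fixed feature dimension for the dense output

def _parse(line):
    # "label index:value index:value ..." with 1-based libsvm indices.
    parts = line.numpy().decode().split()
    label = np.float32(parts[0])
    dense = np.zeros(NUM_FEATURES, dtype=np.float32)
    for item in parts[1:]:
        idx, val = item.split(":")
        dense[int(idx) - 1] = np.float32(val)
    return dense, label

dataset = tf.data.TextLineDataset("train.libsvm").map(
    lambda line: tf.py_function(_parse, [line], [tf.float32, tf.float32])
)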

Build tensorflow-io package on MacOS

At the moment, the tensorflow-io package is built on Linux (inside the tensorflow:custom-op image).

As users may use the package on different platforms and with different Python versions, packages built on other platforms will be needed. At the very least we should provide the same platform support as tensorflow itself.

tensorflow-io 0.4.0 release

TensorFlow v1.13.1 has been released:
https://github.com/tensorflow/tensorflow/releases/tag/v1.13.1

We will need to make a 0.4.0 release of tensorflow-io. There are several additional items we would really like to have in before the dev summit: macOS support, MNIST, PubSub.

But we could also make a 0.4.0 release now and have a 0.5.0 release immediately after (before the dev summit), to hopefully get better PR.

Here is the list of items for 0.4.0

  • Change required package to tensorflow>=1.13.0,<1.14.0
  • Update README.md to add additional supported ops since 0.3.0
  • Update RELEASE.md to capture 0.4.0 changes.
  • Build *.whl files and push to PyPI.org
  • Release R package to CRAN.

/cc @dmitrievanthony @BryanCutler @terrytangyuan

setup.py: Dependency on tensorflow

When installing the current version of tensorflow-io (pip install tensorflow-io-nightly), it tries to install its dependency package tensorflow even if tensorflow-gpu is already installed. As a result, tensorflow could be mistakenly installed alongside tensorflow-gpu.

I have no idea about specifying conditional dependencies. How about we simply do not require tensorflow in REQUIRED_PACKAGE in setup.py?
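For reference, a minimal sketch of the extras_require approach that the project eventually adopted (see the Installation section above); the version ranges below are placeholders, not actual pins:

import setuptools

setuptools.setup(
    name="tensorflow-io",
    version="0.0.0",  # placeholder
    # No hard tensorflow requirement; users pick a flavor explicitly, e.g.:
    #   pip install tensorflow-io[tensorflow-gpu]
    extras_require={
        "tensorflow": ["tensorflow>=2.16.0,<2.17.0"],
        "tensorflow-gpu": ["tensorflow-gpu>=2.16.0,<2.17.0"],
    },
)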

Release R package on CRAN

CRAN is where most R packages are published. We can release the initial version of the package once #6 is closed and the following work is complete:

  • Make sure all tests pass locally and on CI
  • Make sure certain tests that cannot run on CRAN test machines are properly skipped
  • Make sure all CRAN checks/requirements pass
  • Address feedback from CRAN maintainers' package submission review (if any)

Support BigQuery for tensorflow-io

BigQuery is Google's serverless cloud data analytics platform. During the last SIG I/O call, support for BigQuery (BigQueryDataset) was mentioned.

It would be good to add BigQuery (BigQueryDataset) to tensorflow-io so that users could use tensorflow to do more big data analytics in the cloud.

Note that Google's BigQuery seems to have no C++ client library. It does have Python support and a RESTful API. The implementation likely will need to use BigQuery's Python client library, or go over HTTP directly.

Release tensorflow-io package on pypi

The tensorflow-io package can already build whl files, though the package has not been released to PyPI yet.

It makes sense to publish the package so that users who want to use I/O packages could simply run:

pip install tensorflow-io

Action Items:

  • Create a release and cut a 0.1 version.
  • Build tensorflow-io package for Python 2.7, 3.4, 3.5, 3.6 (matching TensorFlow build environment)
  • Upload to pypi.

Support Cloud PubSub for tensorflow-io

tensorflow-io supports a basic Kinesis integration; Kinesis is a cloud-vendor-specific (AWS) streaming platform.

A similar streaming platform provided by Google Cloud is Cloud PubSub. It makes sense to provide support for Cloud PubSub in tensorflow-io as well.

One good thing with PubSub is that it has a gRPC endpoint, which should make it a lot easier to implement in C++ than BigQuery (which only has a RESTful API endpoint).

Support video format for tensorflow-io

In TensorFlow, video format is supported through tf.contrib.ffmpeg, which calls the command-line ffmpeg to decode video formats into tensors and feed them into tensorflow.

tf.contrib.ffmpeg is pretty much unmaintained, and the command-line ffmpeg invocation is really unreliable due to changes in the output text across versions.

tf.contrib.ffmpeg will also be deprecated soon, so users of tensorflow will soon have no direct access to video formats. This is a big loss for many users.

I think it makes sense to support video formats in tensorflow-io by dynamically linking ffmpeg's libraries (not command-line invocation) and generating output to tf.data.

We have to be very careful with licenses for external libraries, though as far as I know (correct me if I am wrong), ffmpeg is LGPL 2.1+, so it would be OK to only dynamically link ffmpeg from tensorflow-io (Apache 2.0 license). Also, we should not distribute the ffmpeg library directly; we should merely call the API through .so/.dll if it is already installed on the system.

Guidance on initializable iterators w/ numpy arrays

I am collecting data during the training process and using Dataset.from_tensor_slices with placeholders and an initializable iterator to refresh the dataset. The dataset uses the tensor slices to then do further preprocessing.

As new data is collected, I reinitialize the iterator's placeholders with the new numpy array data.

Since initializable iterators are deprecated now, how do you recommend I seed the dataset with the dynamic numpy arrays? Should I switch to using a generator?
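A minimal sketch of the generator approach, for what it's worth (the buffer variable and element shapes are assumptions; from_generator re-invokes the generator each time the dataset is iterated, so freshly collected data is picked up):

import tensorflow as tf

buffer = []  # refreshed with newly collected (features, label) numpy pairs

def gen():
    for features, label in buffer:
        yield features, label

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(28, 28), dtype=tf.float32),  # assumed shape
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)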

Support images natively for tensorflow-io

With the WebP format (#42/#43) it is possible to import a collection of images into tf.data natively without any additional steps. Due to historical reasons, importing png/jpg/bmp files in TensorFlow still has to go through tensor conversions. It would be good to import a collection of image files (potentially with a mixture of different formats) into tf.data without any redundant operations.

Support Python 3 for tensorflow-io

At the moment, the tensorflow-io package can be built within the docker image tensorflow:custom-op. By default tensorflow:custom-op uses Python 2.7.6. The package has not been built and tested with Python 3. We should support Python 3.

Also, in the tensorflow repo there is a lot of interest in Python 3.7 support, so 3.7 should be considered as well.

http/https filesystem support

We already have support for file systems with igfs://, gs://, and s3:// (either through tensorflow-io or through the tensorflow repo). One protocol that is missing is https:// or http://. Lots of data refer to an image (or video/audio) through a URL, so it really would be great if we could provide support for https:// or http://.

While the HTTP protocol supports the Accept-Ranges header, not all web servers provide this feature, so this has to be taken into consideration. Also, there is no need to re-download a file if it has already been saved to local disk, so it would make sense to have a local cache similar to pip or bazel.

The implementation likely will be in stages. We could have an early version first and then expand features as needed.
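With the HTTP/HTTPS file system support that eventually landed in tensorflow-io, usage looks roughly like the following sketch (the URL is a placeholder):

import tensorflow as tf
import tensorflow_io  # noqa: F401 -- importing registers the http/https file system

# Placeholder URL; any publicly readable file works the same way.
data = tf.io.read_file("https://example.com/sample.png")
image = tf.image.decode_png(data)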

tensorflow-io 0.2.0 release

This issue tracks the efforts to release tensorflow-io 0.2.0.

Since TensorFlow 1.13 has not been released, tensorflow-io 0.2.0 will be based on TensorFlow 1.12.

  • Update README.md to add additional supported ops since 0.1.0 (@BryanCutler)
  • Update README.md to add link to individual ops' README.md if possible. (@BryanCutler)
  • Change setup.py to fix tensorflow==1.12.0, as 1.13 is not compatible. (#65/@yongtang)
  • Add integration test in tests so that the build of *.whl files work correctly (#64/@yongtang)
  • Add examples in doc/tutorials so that end user could see the use case (defer to tensorflow 1.13).
  • Update R-packages so that additional ops are supported. (#70/ @terrytangyuan)
  • Update RELEASE.md to capture 0.2.0 changes (#66 / @yongtang)
  • Build *.whl files and push to PyPI.org (@BryanCutler)
  • Release R packages to CRAN.

If you want to help, you can add a comment to indicate the items you are working on.

Support parquet format for tensorflow-io

There is quite some interest in supporting parquet for tensorflow, and a PR (tensorflow/tensorflow#19461) has already been approved.

With the expected deprecation of tf.contrib, it makes sense to move PR tensorflow/tensorflow#19461 to tensorflow-io.

At the moment the blocking issue with the related PR is the bazel version. The PR requires bazel 0.17.1+ to incorporate the boost library, while tensorflow CI still uses bazel 0.15.0. We will need to find a way to work around the bazel version issue. See tensorflow/tensorflow#22449 and tensorflow/tensorflow#22964 for details.

Support NumPy for tensorflow-io

In TensorFlow's "Importing Data" guide:
https://www.tensorflow.org/guide/datasets

It is possible to read input data directly from TFRecord (TFRecordDataset), text (TextLineDataset), or csv (CsvDataset) files, but not from NumPy. Reading input from NumPy still has to use a not-so-elegant approach, as in the guide's example code:

with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

It should be possible to implement NumPy support so that reading input from NumPy can be done in a similar fashion to other input formats. This could potentially also improve performance, as it may not be necessary to read everything into memory immediately (remotely related: tensorflow/tensorflow#16933). See the sketch below.
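For reference, a minimal sketch with the API that later landed in tensorflow-io (assuming tfio.IODataset.from_numpy_file; the path is taken from the guide example above):

import tensorflow_io as tfio

# Reads the file lazily rather than loading everything into memory up front.
dataset = tfio.IODataset.from_numpy_file("/var/data/training_data.npy")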

TensorFlow 1.13 support

There are some tf.data API changes between TensorFlow 1.12 and 1.13 which break the build of tensorflow-io. This issue is created to track the effort needed to provide TensorFlow 1.13 support for tensorflow-io.

Unable to install directly using pip on macOS

@yongtang I seem unable to download the artifact from PyPI using pip:

$ pip install tensorflow-io
Collecting tensorflow-io
  Could not find a version that satisfies the requirement tensorflow-io (from versions: )
No matching distribution found for tensorflow-io

Is there a step I'm missing that we could add to the README?

Referencing: #7

Setup nightly build through Travis CI

We could set up a nightly build through Travis CI so that users can install a preview version. This could be useful as TensorFlow 1.13 is not released yet (only 1.13.0rc0) and we need to start testing early, so that we can release at roughly the same time as the 1.13.0 release.

Ideally we could set up another project on PyPI.org named tensorflow-io-nightly.

R Interface Prototype

  • Package setup and testing infrastructure
  • Implement R wrappers + unit tests
  • Integration with tfdatasets R package
  • Setup CI for R package and make sure tests all pass
  • Examples/vignettes

Setup CI for Tensorflow I/O

This issue tracks the efforts and progress of setting up CI for TensorFlow I/O. We plan to use Google CI, so coordination with TensorFlow's CI might be needed.

Travis CI enhancement

The Travis CI config needs to be improved to support additional platforms and to improve the maintainability of .travis.yml:

Platform Build Python Test R 3.5 Test
Linux Python 2.7 Ubuntu 14.04 Ubuntu 16.04+18.04 Ubuntu 16.04+18.04
Linux Python 3.4 Ubuntu 14.04
Linux Python 3.5 Ubuntu 14.04 Ubuntu 16.04
Linux Python 3.6 Ubuntu 14.04 Ubuntu 18.04
MacOS Python 2.x TBD TBD TBD
MacOS Python 3.x TBD TBD TBD
Windows Python 3.x TBD TBD TBD

Setup lint with CI

We haven't had any lint checking yet, and it would be good to set up lint checking with CI.

In Python, pylint or flake8 could be good choices.

In C++, clang-format was used but was far from ideal: clang-format produces different output across versions, which makes it really hard to maintain a consistent format. For now, let's not use clang-format; instead we could check simple things like tabs vs. spaces, end-of-line spaces, etc.

Support Apache Arrow for tensorflow-io

Note: this issue is from tensorflow/tensorflow#23001 (@BryanCutler 👍 )

Apache Arrow is a standard format for in-memory columnar data. It provides a cross-language platform for systems to communicate and operate on data efficiently.

Adding Arrow support in TensorFlow Dataset will allow systems to interface with TensorFlow in a well defined way, without the need to develop custom converters, serialize data, or write to specialized files.

It would be straightforward to add a base layer of Arrow support that works on Arrow record batches (a common struct for Arrow IPC) and extend that layer to support different kinds of Arrow Ops:

  • Python memory / Pandas DataFrames
  • Arrow Feather files
  • Parquet files
  • Socket / Pipes

A slightly more involved op could use Arrow Flight - Arrow-based messaging over gRPC. Additionally, it would be possible to define ops to connect directly to other systems that can export Arrow data.
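A minimal sketch of the Pandas entry point as it appeared in tensorflow-io (the module path and keyword argument here are assumptions to be checked against the release docs):

import pandas as pd
from tensorflow_io.arrow import ArrowDataset  # assumed module path

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [1, 2, 3]})

# Builds a tf.data-compatible dataset over the DataFrame's Arrow record batches.
dataset = ArrowDataset.from_pandas(df, preserve_index=False)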

Arrow Datasets missing AsGraphDefInternal impl

The Arrow Datasets are missing the proper AsGraphDefInternal implementations to define the input and output Nodes for serialization. This seems to have been tolerated in v1.12, but causes v1.13 to fail with a cryptic error about a thread being killed during MakeDataset.

Add bazel into build image

Do I understand correctly that the Docker image proposed for the build:

docker run -it -v ${PWD}:/working_dir -w /working_dir tensorflow/tensorflow:custom-op

doesn't contain bazel, and that we have to set it up manually? Can we prepare a Docker image that already contains bazel?

Support live video for tensorflow-io

Not sure if it is feasible, but it would be interesting if tensorflow could be used for live video processing directly. For that we will need live video stream support in tensorflow-io. That would benefit many applications, I think.

Note that video format support (not live video) has already been captured in #11 and #30.

/cc @juwangvsu

Rework on compressed file based Dataset

As the number of file-based Datasets grows, code duplication starts to happen. The biggest area of duplication is compression support. There are two types of compression:

  1. ZLIB/GZIP, where you have a single compressed entry.
  2. ZIP, where you have multiple entries inside (e.g., an npz file is essentially a ZIP).

The compression topic itself can get complicated (e.g., recursive compression). The goal of tensorflow-io, though, is to support formats commonly used in the machine learning community, so one level of compression is enough.

We should rework the Datasets to have a CompressedFileDataset-like abstraction; see the sketch below.
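For case (1), tf.data already factors compression handling out of the format logic for line-based datasets; a CompressedFileDataset abstraction would generalize the same idea to the other formats in this repo. A minimal sketch (the file name is a placeholder):

import tensorflow as tf

# GZIP support is handled by the dataset layer, not the line-parsing logic.
dataset = tf.data.TextLineDataset("data.csv.gz", compression_type="GZIP")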

Python 3.7 support

Python 3.7 is a supported version in TF 1.13.0 (rc2), so we should support 3.7 as well (we have already seen lots of interest in 3.7 in TF).

Add tests to make sure python modules are exposed correctly

While tensorflow-io has tests covering the different modules (Ignite/Kafka/etc.) through bazel test, those tests are always run inside the repo directory, not through pip install.

It would be good to have tests to make sure pip install correctly exposes the Python modules. Some simple tests along the lines of pip install tensorflow-io-*.whl && python -c "import tensorflow_io.kafka" would be good enough to serve the needs here.

TensorFlow 2.0 support

This issue is created to track TensorFlow 2.0 support for tensorflow-io. The most visible change in TensorFlow 2.0 is eager execution by default. We will need to test thoroughly and see if there is any impact on tensorflow-io.

ArrowDataset.from_pandas fails when preserve_index=True

When making an ArrowDataset from a pandas.DataFrame, if the preserve_index flag is set to True, the Dataset iterator will fail with the error:

InternalError: Missing 2-th output from node IteratorGetNext (defined at <ipython-input-42-624c3010a782>:1)  = IteratorGetNext[output_shapes=[[], [], []], output_types=[DT_DOUBLE, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)

This is because the preserve_index flag adds an additional column to the record batch, and there is no corresponding column index sent to the op.

Support live audio for tensorflow-io

One interesting use of tensorflow in the community is processing live audio streams. I think it would be great to support live audio streams in tensorflow-io so that more applications could benefit from it directly.

Note this issue is distinct from #49 (audio file formats, not live audio).

tensorflow_io namespace?

Should tensorflow-io export any symbols in the tensorflow namespace?

In most cases, directly using the tensorflow namespace is not an issue.
Many tensorflow-io kernels are fully contained in an anonymous namespace, and REGISTER_OP just defines a static variable.

However, any functionality requiring more than a single file currently exports symbols in the tensorflow namespace. If we ever get a symbol collision with tensorflow, symbol resolution may cause interesting issues.

Should tensorflow-io use a tensorflow_io namespace?
(And is there an automated way to test/protect against exporting symbols into the tensorflow namespace?)

tensorflow-io 0.3.0 release

I think we have enough new features to release 0.3.0. We could have another 0.4.0 release to match tensorflow 1.13.0, likely in a couple of weeks.

  • Update README.md to add additional supported ops since 0.2.0
  • Update RELEASE.md to capture 0.3.0 changes.
  • Build *.whl files and push to PyPI.org
  • Release R packages to CRAN.

/cc @dmitrievanthony @BryanCutler @terrytangyuan
