text's Introduction

TensorFlow Text - Text processing in TensorFlow

IMPORTANT: When installing TF Text with pip install, please note the version of TensorFlow you are running, as you should specify the corresponding minor version of TF Text (e.g., for tensorflow==2.3.x use tensorflow_text==2.3.x).
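
For example, assuming you are running tensorflow==2.3.x (the version numbers here are illustrative):

pip install -U tensorflow==2.3.0 tensorflow-text==2.3.0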

Introduction

TensorFlow Text provides a collection of text-related classes and ops ready to use with TensorFlow 2.0. The library can perform the preprocessing regularly required by text-based models, and includes other features useful for sequence modeling not provided by core TensorFlow.

The benefit of using these ops in your text preprocessing is that they are done in the TensorFlow graph. You do not need to worry about tokenization in training being different from the tokenization at inference, or about managing preprocessing scripts.
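
Because the ops run in the graph, the same preprocessing can be captured in a tf.function and exported as part of a SavedModel. A minimal sketch of the idea:

import tensorflow as tf
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()

@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.string)])
def preprocess(texts):
  # Runs as graph ops, so training and serving share identical tokenization.
  return tokenizer.tokenize(texts)

print(preprocess(tf.constant(['Everything not saved will be lost.'])))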

Documentation

Please visit http://tensorflow.org/text for all documentation. This site includes API docs, guides for working with TensorFlow Text, as well as tutorials for building specific models.

Unicode

Most ops expect that the strings are in UTF-8. If you're using a different encoding, you can use the core TensorFlow transcode op to transcode into UTF-8. You can also use the same op to coerce your string to structurally valid UTF-8 if your input could be invalid.

docs = tf.constant([u'Everything not saved will be lost.'.encode('UTF-16-BE'),
                    u'Sad☹'.encode('UTF-16-BE')])
utf8_docs = tf.strings.unicode_transcode(docs, input_encoding='UTF-16-BE',
                                         output_encoding='UTF-8')

Normalization

When dealing with different sources of text, it's important that the same words are recognized to be identical. A common technique for case-insensitive matching in Unicode is case folding (similar to lower-casing). (Note that case folding internally applies NFKC normalization.)

We also provide Unicode normalization ops for transforming strings into a canonical representation of characters, with Normalization Form KC (NFKC) being the default.

print(text.case_fold_utf8(['Everything not saved will be lost.']))
print(text.normalize_utf8(['Äffin']))
print(text.normalize_utf8(['Äffin'], 'nfkd'))
tf.Tensor(['everything not saved will be lost.'], shape=(1,), dtype=string)
tf.Tensor(['\xc3\x84ffin'], shape=(1,), dtype=string)
tf.Tensor(['A\xcc\x88ffin'], shape=(1,), dtype=string)

Tokenization

Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.

The main interfaces are Tokenizer and TokenizerWithOffsets, which each have a single method: tokenize and tokenize_with_offsets respectively. There are multiple implementing tokenizers available now. Each of these implements TokenizerWithOffsets (which extends Tokenizer), which includes an option for getting byte offsets into the original string. This allows the caller to know which bytes in the original string each token was created from.

All of the tokenizers return RaggedTensors with the innermost dimension of tokens mapping to the original individual strings. As a result, the resulting shape's rank is increased by one. If you are unfamiliar with ragged tensors, please review the guide: https://www.tensorflow.org/guide/ragged_tensor
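
For example, tokenizing a rank-1 batch of strings produces a rank-2 ragged result, where the added innermost dimension holds each string's tokens:

tokens = text.WhitespaceTokenizer().tokenize(['a b c', 'd'])
print(tokens.shape)  # (2, None): one rank higher than the (2,) input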

WhitespaceTokenizer

This is a basic tokenizer that splits UTF-8 strings on ICU-defined whitespace characters (e.g., space, tab, newline).

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad\xe2\x98\xb9']]

UnicodeScriptTokenizer

This tokenizer splits UTF-8 strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

In practice, this is similar to the WhitespaceTokenizer, with the most apparent difference being that it will split punctuation (USCRIPT_COMMON) from language texts (e.g., USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.) while also separating language texts from each other.

tokenizer = text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],
 ['Sad', '\xe2\x98\xb9']]

Unicode split

When tokenizing languages without whitespace to segment words, it is common to just split by character, which can be accomplished using the unicode_split op found in core.

tokens = tf.strings.unicode_split([u"仅今年前".encode('UTF-8')], 'UTF-8')
print(tokens.to_list())
[['\xe4\xbb\x85', '\xe4\xbb\x8a', '\xe5\xb9\xb4', '\xe5\x89\x8d']]

Offsets

When tokenizing strings, it is often desired to know where in the original string the token originated. For this reason, each tokenizer which implements TokenizerWithOffsets has a tokenize_with_offsets method that will return the byte offsets along with the tokens. The start_offsets lists the byte where each token starts in the original string (inclusive), and the end_offsets lists the byte just past where each token ends (exclusive, i.e., the first byte after the token).

tokenizer = text.UnicodeScriptTokenizer()
(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(
    ['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
print(start_offsets.to_list())
print(end_offsets.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],
 ['Sad', '\xe2\x98\xb9']]
[[0, 11, 15, 21, 26, 29, 33], [0, 3]]
[[10, 14, 20, 25, 28, 33, 34], [3, 6]]

TF.Data Example

Tokenizers work as expected with the tf.data API. A simple example is provided below.

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],
                                           ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
[['Never', 'tell', 'me', 'the', 'odds.']]
[["It's", 'a', 'trap!']]

Keras API

When you use different tokenizers and ops to preprocess your data, the resulting outputs are ragged tensors. The Keras API now makes it easy to train a model on ragged tensors without having to worry about padding or masking the data, either by using the ToDense layer, which handles all of this for you, or by relying on Keras' built-in support for natively working with ragged data.

model = tf.keras.Sequential([
  tf.keras.layers.InputLayer(input_shape=(None,), dtype='int32', ragged=True),
  text.keras.layers.ToDense(pad_value=0, mask=True),
  tf.keras.layers.Embedding(100, 16),
  tf.keras.layers.LSTM(32),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

Other Text Ops

TF.Text packages other useful preprocessing ops. We will review a couple below.

Wordshape

A common feature used in some natural language understanding models is to see if the text string has a certain property. For example, a sentence breaking model might contain features which check for word capitalization or if a punctuation character is at the end of a string.

Wordshape defines a variety of useful regular expression based helper functions for matching various relevant patterns in your input text. Here are a few examples.

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])

# Is capitalized?
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)
# Are all letters uppercased?
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)
# Does the token contain punctuation?
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)
# Is the token a number?
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)

print(f1.to_list())
print(f2.to_list())
print(f3.to_list())
print(f4.to_list())
[[True, False, False, False, False, False], [True]]
[[False, False, False, False, False, False], [False]]
[[False, False, False, False, False, True], [True]]
[[False, False, False, False, False, False], [False]]

N-grams & Sliding Window

N-grams are sequences of n consecutive tokens, produced by a sliding window of size n. When combining the tokens, there are three reduction mechanisms supported. For text, you would want to use Reduction.STRING_JOIN, which appends the strings to each other. The default separator character is a space, but this can be changed with the string_separator argument.

The other two reduction methods are most often used with numerical values, and these are Reduction.SUM and Reduction.MEAN.

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])

# Ngrams, in this case bi-gram (n = 2)
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)

print(bigrams.to_list())
[['Everything not', 'not saved', 'saved will', 'will be', 'be lost.'], []]
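
For comparison, a short sketch of the numeric reductions (the input values below are made up for illustration):

values = tf.ragged.constant([[1.0, 2.0, 3.0], [4.0, 5.0]])
print(text.ngrams(values, 2, reduction_type=text.Reduction.SUM).to_list())
print(text.ngrams(values, 2, reduction_type=text.Reduction.MEAN).to_list())
[[3.0, 5.0], [9.0]]
[[1.5, 2.5], [4.5]]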

Installation

Install using PIP

When installing TF Text with pip install, please note the version of TensorFlow you are running, as you should specify the corresponding version of TF Text. For example, if you're using TF 2.0, install the 2.0 version of TF Text, and if you're using TF 1.15, install the 1.15 version of TF Text.

pip install -U tensorflow-text==<version>

A note about different operating system packages

After version 2.10, we will only be providing pip packages for Linux x86_64 and Intel-based Macs. TensorFlow Text has always leveraged the release infrastructure of the core TensorFlow package to more easily maintain compatible releases with minimal maintenance, allowing the team to focus on TF Text itself and contributions to other parts of the TensorFlow ecosystem.

For other systems like Windows, Aarch64, and Apple Silicon Macs, TensorFlow relies on build collaborators, and so we will not be providing packages for them. However, we will continue to accept PRs to make building for these platforms easy for users, and will try to point to community efforts related to them.

Build from source steps:

Note that TF Text needs to be built in the same environment as TensorFlow. Thus, if you manually build TF Text, it is highly recommended that you also build TensorFlow.

If building on macOS, you must have coreutils installed. It is probably easiest to do with Homebrew.

  1. Build and install TensorFlow.
  2. Clone the TF Text repo:
    git clone https://github.com/tensorflow/text.git
    cd text
  3. Run the build script to create a pip package:
    ./oss_scripts/run_build.sh
    After this step, there should be a *.whl file in the current directory, with a file name similar to tensorflow_text-2.5.0rc0-cp38-cp38-linux_x86_64.whl.
  4. Install the package into your environment:
    pip install ./tensorflow_text-*-*-*-os_platform.whl

Build or test using TensorFlow's SIG docker image:

  1. Pull image from Tensorflow SIG docker builds.

  2. Run a container based on the pulled image and create a bash session. This can be done by running docker run -it {image_name} bash.
    {image_name} can be any name with the {tf_version}-python{python_version} format. An example for Python 3.10 and TF version 2.10: 2.10-python3.10.

  3. Clone the TF-Text Github repository inside container: git clone https://github.com/tensorflow/text.git.
    Once cloned, change to the working directory using cd text/.

  4. Run the configuration script(s): ./oss_scripts/configure.sh and ./oss_scripts/prepare_tf_dep.sh.
    This will update Bazel and the TF dependencies to match the TensorFlow installed in the container.

  5. To run the tests, use the bazel command: bazel test --test_output=errors tensorflow_text:all. This will run all the tests declared in the BUILD file.
    To run a specific test, modify the above command replacing :all with the test name (for example :fast_bert_normalizer).

  6. Build the pip package/wheel:
    bazel build --config=release_cpu_linux oss_scripts/pip_package:build_pip_package
    ./bazel-bin/oss_scripts/pip_package/build_pip_package /{wheel_dir}

    Once the build is complete, you should see the wheel available under the {wheel_dir} directory.

text's People

Contributors

8bitmp3, anirudh161, brianwieder, broken, cantonios, devnev39, edloper, emilypilley, fchollet, fionalang, gregbillock, howl-anderson, irinambejan, jaymessina3, jblespiau, luyaoxu, markdaoust, matthen, pcoet, qlzh727, raw-pointer, rtg0795, sachinprasadhs, stonepia, sun1638650145, synandi, tf-text-github-robot, thaink, thuang513, vam-google

text's Issues

Invalid regex pattern during tokenization

When using the Bert or Wordpiece tokenizer, I encounter the error `Invalid pattern (\p{Whitespace}+|[!-/]|[:-@]|[\[-]|[{-~] ...`.

tensorflow-text version is 2.0.0.

I thought the issue might be the vocab table, so I tried to replicate the BertTokenizer test from the repo. But this code example also fails.

I created a Colab version for the issue

I am not sure if it is a bug or a problem with my tokenizer setup.

Thank you for any insights!

2.1.0 RC builds

First of all, thanks for maintaining TF Text! 😄

Are there any plans to build release candidate releases for TF Text? Release candidate builds (2.1.0rc0 and 2.1.0rc1) have been published for the main TensorFlow package, but corresponding releases aren't available yet for TF Text. Instructions for installing TF Text state:

When installing TF Text with pip install, please note the version of TensorFlow you are running, as you should specify the corresponding version of TF Text.

If release candidate builds could be created, that would be super helpful for me, since I'm unable to easily use TF 2.0.0 due to these changes not making it into the release 😅

Universal distribution / Windows binaries

Hello,

Is it possible to add Windows binaries / a universal distribution? I couldn't install this library on Windows; no wonder, as there are no binaries for Windows on PyPI.

The tensorflow-probability project provides a universal distribution. See PyPI.

I don't know if it's a problem with how the bazel build is configured or something else. It would be great to have it on all platforms.

Many Thanks.

Potential optimization for LongestMatchStartingAt in WordpieceTokenizer?

Currently the code uses a maximum word length, and refuses to tokenize longer words. The LongestMatchStartingAt helper function considers all string prefixes starting with the entire word.

If we instead passed the maximum subtoken length, then perhaps LongestMatchStartingAt could consider prefixes of at most that length. This could speed up this function and might mean we don't need to set a maximum word length.

Sorry if I have misunderstood something!

add "step" and "drop_remainder" param to "sliding_window"

Hey team,

For the current sliding_window function, would you consider adding a step param, which indicates the number of steps to move forward for the next window?

A drop_remainder=True param would drop windows whose length is less than width.

For example:
width=2, step=2, drop_remainder=False
[1,2,3,4,5] => (1,2) (3,4) (5)

width=2, step=2, drop_remainder=True
[1,2,3,4,5] => (1,2) (3,4)

thanks!
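
For reference, tf.data already exposes both of these semantics via Dataset.window, whose shift and drop_remainder arguments behave as requested above. A minimal sketch:

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
# width=2, step=2 in the notation above; drop_remainder=True would drop [5].
for window in ds.window(size=2, shift=2, drop_remainder=False):
    print(list(window.as_numpy_iterator()))  # [1, 2], then [3, 4], then [5]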

undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumES3_ error while importing tensorflow_text

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Kubuntu 18.04 LTS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.0.0b0
  • Python version: Anaconda python 3.7.3
  • CUDA/cuDNN version: None
  • GPU model and memory: None

Describe the current behavior
Error on importing tensorflow-text, making it impossible to import.

Describe the expected behavior
Library can be effortlessly imported and used.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

I created a new environment using

conda create --name tensorflow python=3.7 numpy matplotlib scikit-learn pandas scipy
conda activate tensorflow
pip install tensorflow-text

then, when trying to import tensorflow_text the following error appears

$ python
Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> import tensorflow_text as text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kuba/.anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow_text/__init__.py", line 20, in <module>
    from tensorflow_text.python.ops import *
  File "/home/kuba/.anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow_text/python/ops/__init__.py", line 19, in <module>
    from tensorflow_text.python.ops.greedy_constrained_sequence_op import greedy_constrained_sequence
  File "/home/kuba/.anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow_text/python/ops/greedy_constrained_sequence_op.py", line 34, in <module>
    gen_constrained_sequence_op = load_library.load_op_library(resource_loader.get_path_to_datafile('_constrained_sequence_op.so'))
  File "/home/kuba/.anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/kuba/.anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow_text/python/ops/_constrained_sequence_op.so: undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumES3_
>>>

Error on saving keras custom layer with tensorflow_text.BertTokenizer

Trying to save a Keras custom layer with a tokenizer in it fails.
versions info:

tensorflow==2.1.0
tensorflow-text==2.1.1

Code to reproduce:


import tensorflow_text
import tensorflow as tf


class TokenizationLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path, **kwargs):
        self.vocab_path =vocab_path
        self.tokenizer = tensorflow_text.BertTokenizer(vocab_path, token_out_type=tf.int64)
        super(TokenizationLayer, self).__init__(**kwargs)
        
    def get_config(self):
        config = super(TokenizationLayer, self).get_config()
        config.update({
            'vocab_path': self.vocab_path,
        })
        return config

    def call(self,inputs):
        return self.tokenizer.tokenize(inputs).to_tensor()


vocab_path = r"/home/resources/bert_en_uncased_L-12_H-768_A-12/1/assets/vocab.txt"
# tensorflow_text.BertTokenizer(vocab_lookup_table = vocab_path, token_out_type=tf.int64)
inputs = tf.keras.layers.Input(shape=(), dtype=tf.string)
tokenization_layer = TokenizationLayer(vocab_path)
outputs = tokenization_layer(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

model.save("./test")

It also gives an error with

 def call(self,inputs):
        return self.tokenizer.tokenize(inputs)

Error:

AssertionError                            Traceback (most recent call last)
<ipython-input-55-e49dd5ac9a41> in <module>
----> 1 model.save("./test")

~/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/network.py in save(self, filepath, overwrite, include_optimizer, save_format, signatures, options)
   1006     """
   1007     save.save_model(self, filepath, overwrite, include_optimizer, save_format,
-> 1008                     signatures, options)
   1009 
   1010   def save_weights(self, filepath, overwrite=True, save_format=None):

~/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/saving/save.py in save_model(model, filepath, overwrite, include_optimizer, save_format, signatures, options)
    113   else:
    114     saved_model_save.save(model, filepath, overwrite, include_optimizer,
--> 115                           signatures, options)
    116 
    117 

~/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/saving/saved_model/save.py in save(model, filepath, overwrite, include_optimizer, signatures, options)
     76     # we use the default replica context here.
     77     with distribution_strategy_context._get_default_replica_context():  # pylint: disable=protected-access
---> 78       save_lib.save(model, filepath, signatures, options)
     79 
     80   if not include_optimizer:

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in save(obj, export_dir, signatures, options)
    907   object_saver = util.TrackableSaver(checkpoint_graph_view)
    908   asset_info, exported_graph = _fill_meta_graph_def(
--> 909       meta_graph_def, saveable_view, signatures, options.namespace_whitelist)
    910   saved_model.saved_model_schema_version = (
    911       constants.SAVED_MODEL_SCHEMA_VERSION)

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in _fill_meta_graph_def(meta_graph_def, saveable_view, signature_functions, namespace_whitelist)
    585 
    586   with exported_graph.as_default():
--> 587     signatures = _generate_signatures(signature_functions, resource_map)
    588     for concrete_function in saveable_view.concrete_functions:
    589       concrete_function.add_to_graph()

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in _generate_signatures(signature_functions, resource_map)
    456             argument_inputs, signature_key, function.name))
    457     outputs = _call_function_with_mapped_captures(
--> 458         function, mapped_inputs, resource_map)
    459     signatures[signature_key] = signature_def_utils.build_signature_def(
    460         _tensor_dict_to_tensorinfo(exterior_argument_placeholders),

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in _call_function_with_mapped_captures(function, args, resource_map)
    408   """Calls `function` in the exported graph, using mapped resource captures."""
    409   export_captures = _map_captures_to_created_tensors(
--> 410       function.graph.captures, resource_map)
    411   # Calls the function quite directly, since we have new captured resource
    412   # tensors we need to feed in which weren't part of the original function

~/.local/lib/python3.6/site-packages/tensorflow_core/python/saved_model/save.py in _map_captures_to_created_tensors(original_captures, resource_map)
    330            "be tracked by assigning them to an attribute of a tracked object "
    331            "or assigned to an attribute of the main object directly.")
--> 332           .format(interior))
    333     export_captures.append(mapped_resource)
    334   return export_captures

AssertionError: Tried to export a function which references untracked object Tensor("StatefulPartitionedCall/args_1:0", shape=(), dtype=resource).TensorFlow objects (e.g. tf.Variable) captured by functions must be tracked by assigning them to an attribute of a tracked object or assigned to an attribute of the main object directly.

Error: `U_FILE_ACCESS_ERROR` when build by bazel as `http_archive`

I built tensorflow/text from branch 2.0 with bazel as an http_archive. When I try to use the CaseFoldUTF8Op op, I get an error:

U_FILE_ACCESS_ERROR: Could not retrieve ICU NFKC_CaseFold normalizer [[{{node CaseFoldUTF8/CaseFoldUTF8}}]]

It looks like the logic related to normalization_data.c doesn't work, but I can't understand why.

I use bazel 0.24.1

Do you have any suggestions?

how to deal with raggedtensor in the model output

I have a seq2seq model like this

model = Sequential([
  InputLayer(input_shape=(None,), dtype='int64', ragged=True),
  tftext.keras.layers.ToDense(pad_value=0, mask=True),
  Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True),
  LSTM(n_units),
  RepeatVector(tar_timesteps),
  LSTM(n_units, return_sequences=True),
  TimeDistributed(Dense(tar_vocab, activation='softmax'))
])

And I am building the dataset from pairs of English-German sentences like this

def basic_preprocess(src, dst):
  # Preprocess
  rt_src = preprocess(src)
  rt_dst = preprocess(dst)
  # Encode tokens
  features = tf.ragged.map_flat_values(en_vocab_table.lookup, rt_src)
  labels = tf.ragged.map_flat_values(ge_vocab_table.lookup, rt_dst)

  return features, labels

My problem is that when I fit the model I get the following error, as the output is ragged (while the input is no longer ragged)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tensorflow-2.0.0/python3.6/tensorflow_core/python/util/nest.py in assert_same_structure(nest1, nest2, check_types, expand_composites)
    317     _pywrap_tensorflow.AssertSameStructure(nest1, nest2, check_types,
--> 318                                            expand_composites)
    319   except (ValueError, TypeError) as e:

ValueError: The two structures don't have the same nested structure.

First structure: type=TensorSpec str=TensorSpec(shape=(None,), dtype=tf.int64, name=None)

Second structure: type=RaggedTensor str=tf.RaggedTensor(values=Tensor("input_1/flat_values:0", shape=(None,), dtype=int64), row_splits=Tensor("input_1/row_splits_0:0", shape=(None,), dtype=int64))

More specifically: Substructure "type=RaggedTensor str=tf.RaggedTensor(values=Tensor("input_1/flat_values:0", shape=(None,), dtype=int64), row_splits=Tensor("input_1/row_splits_0:0", shape=(None,), dtype=int64))" is a sequence, while substructure "type=TensorSpec str=TensorSpec(shape=(None,), dtype=tf.int64, name=None)" is not

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)

How should I fix this?

Raspberry Pi Support

Build from source is currently broken on Raspberry Pi. When I tried to build using bazel, the following error occurred:

/root/.cache/bazel/_bazel_root/7ca663d4ab74a0b23905915c5a09aeae/external/com_google_absl/absl/strings/BUILD.bazel:83:1: C++ compilation of rule '@com_google_absl//absl/strings:internal' failed (Exit 1)
Traceback (most recent call last):
File "external/org_tensorflow/third_party/toolchains/preconfig/ubuntu16.04/gcc7_manylinux2010-nvcc-cuda10.0/clang/bin/crosstool_wrapper_driver_is_not_gcc", line 272, in <module>
sys.exit(main())
File "external/org_tensorflow/third_party/toolchains/preconfig/ubuntu16.04/gcc7_manylinux2010-nvcc-cuda10.0/clang/bin/crosstool_wrapper_driver_is_not_gcc", line 269, in main
return subprocess.call([CPU_COMPILER] + cpu_compiler_flags)
File "/usr/lib/python3.7/subprocess.py", line 323, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/lib/python3.7/subprocess.py", line 775, in __init__
restore_signals, start_new_session)
restore_signals, start_new_session)
File "/usr/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/dt7/usr/bin/gcc': '/dt7/usr/bin/gcc'
Target //oss_scripts/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.979s, Critical Path: 0.47s
INFO: 0 processes.
FAILED: Build did NOT complete successfully

The compilation process seems to rely on some CUDA gcc. The procedure I used for building is as follows:

  1. bash oss_scripts/configure.sh
  2. Remove these 2 lines from WORKSPACE as they were causing a patch error: patches = ["//third_party/icu:udata.patch"], patch_args = ["-p1"]
  3. bazel build oss_scripts/pip_package:build_pip_package

wordpiece detokenization

Hey team,

Is it possible to detokenize wordpieces with tf text? If not, is this feature something that you would consider implementing in the future?
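
In the meantime, a rough workaround is to join the pieces and strip the suffix indicator by hand. A sketch assuming the default '##' indicator (this is lossy with respect to the original spacing, unlike a true detokenize op would be):

import tensorflow as tf

pieces = tf.ragged.constant([['the', 'great', '##est', 'of', 'all', 'time']])
joined = tf.strings.reduce_join(pieces, axis=-1, separator=' ')
print(tf.strings.regex_replace(joined, ' ##', ''))
# tf.Tensor([b'the greatest of all time'], shape=(1,), dtype=string)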

No matching distribution found for tensorflow-text

I failed to install tensorflow-text.

When I enter pip install -U tensorflow-text

There was an error:

Could not find a version that satisfies the requirement tensorflow-text (from versions: )
No matching distribution found for tensorflow-text

  • Python 3.5.4 [MSC v.1900 64 bit (AMD64)] on win32
  • Tensorflow 2.0.0rc0

Unable to install latest version (2.2.0rc1)?

Hi TF-text team,

I'm trying to install the library using pip for the latest version (2.2.0rc1), but couldn't find any matching distribution. I'm using Python 3.7 and pip 20.0.2. The full error is below

ERROR: Could not find a version that satisfies the requirement tensorflow-text==2.2.0rc1 (from versions: 0.1.0, 1.0.0b0, 1.0.0b1, 1.0.0b2, 1.15.0rc0, 1.15.1, 2.0.0rc0, 2.0.1, 2.1.0rc0)
ERROR: No matching distribution found for tensorflow-text==2.2.0rc1

On PyPI it should be available, so why is pip unable to find it? Any idea?

Thanks before

unichr() and xrange() were removed in Python 3

flake8 testing of https://github.com/tensorflow/text on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./tensorflow_text/python/ops/string_ops.py:29:12: F821 undefined name 'unichr'
    return unichr(codepoint)
           ^
./tensorflow_text/python/ops/wordpiece_tokenizer_test.py:142:19: F821 undefined name 'xrange'
  for docs_idx in xrange(len(tokens)):
                  ^
./tensorflow_text/python/ops/wordpiece_tokenizer_test.py:144:23: F821 undefined name 'xrange'
    for tokens_idx in xrange(len(tokens[docs_idx])):
                      ^
3     F821 undefined name 'unichr'
3

E901,E999,F821,F822,F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree

List of Languages Supported (E.g. English)

Hi, I just wanted to ask: what are the languages supported by TensorFlow Text? From what I can see thus far, English, Chinese and German are supported. Thanks for answering! :)

Can not load SentencePiece model

I'm struggling with loading a sentencepiece model, and the error message is a bit cryptic so I'm not sure where to go next.

The error I get is the following:

2020-01-31 12:07:45.420864: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at sentencepiece_kernels.cc:211 : Internal: external/com_google_sentencepiece/src/sentencepiece_processor.cc(73) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 
Traceback (most recent call last):
  File "load.py", line 4, in <module>
    tokenizer = tensorflow_text.SentencepieceTokenizer('model.model')
  File "/home/dferreira/projects/porn_classifier_tf2/venv/lib/python3.7/site-packages/tensorflow_text/python/ops/sentencepiece_tokenizer.py", line 79, in __init__
    model=model)
  File "<string>", line 51, in sentencepiece_op
  File "<string>", line 125, in sentencepiece_op_eager_fallback
  File "/home/dferreira/projects/porn_classifier_tf2/venv/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: external/com_google_sentencepiece/src/sentencepiece_processor.cc(73) [model_proto->ParseFromArray(serialized.data(), serialized.size())]  [Op:SentencepieceOp]

I'm using Python 3.7.6 with:

tensorflow==2.1.0
tensorflow-text==2.1.0rc0
sentencepiece==0.1.85

The following is a minimal reproducible example:

  • Create a file raw_text with the content:
This is a raw text file.
With 2 lines.
  • Create train.py with the content:
import sentencepiece

sentencepiece.SentencePieceTrainer.Train('--input=raw_text --vocab_size=20 --model_prefix=model')
  • Run python train.py. You will get a model.model and model.vocab.
  • Create load.py with the content:
import tensorflow_text

tokenizer = tensorflow_text.SentencepieceTokenizer('model.model')
  • Run python load.py and you will get the error above.

It should be noted that loading the same model via sentencepiece.SentencePieceProcessor.Load works.

Like I said, I wasn't really able to interpret the error message.
How can I make this work?
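
One likely fix, sketched under the assumption that the constructor expects the serialized model proto rather than a file path (the ParseFromArray call in the error suggests this):

import tensorflow_text

with open('model.model', 'rb') as f:
    model = f.read()

tokenizer = tensorflow_text.SentencepieceTokenizer(model=model)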

Tokenizer TF.Data Transform Crashes

When using the .apply method as a transformation for a tf.data object on to a text tokenizer, it crashes all Colab sessions (both in TF 1.x and 2.0). Here is a sample taken from the official colab notebook.

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],
                                           ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = tokenized_docs.make_one_shot_iterator()
print(iterator.get_next().to_list())
print(iterator.get_next().to_list())

Tensorflow 1.x support?

Firstly, thanks for this code! It makes serializing and sharing text-based models much easier. We currently use custom ops for e.g. subword tokenization and would love to switch over to the one in this repo, facilitating sharing models over tf hub etc.

Is there any plan to support tensorflow 1.x versions?

Please fix setup.py to allow tensorflow-gpu

Please fix setup.py to allow tensorflow-gpu:

Old:
install_requires=[
    'tensorflow==2.0.0b0',
],

New:
install_requires=[
    'tensorflow-gpu >=2.0.0b0',
],
extras_require = [ 'tensorflow >=2.0.0b0', ]

Thanks,
Snehasish

Error loading '_text_similarity_metric_ops.so' when running unit tests

Running Python 3.7 on Mac OS X 10.14.6. Not sure if there is some dependency or build step I am missing, but I cannot seem to run the unit tests without the code failing to load this file. Have tried with tensorflow 1.x and 2.x. Stack trace is below. Maybe I am just missing something simple?

Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pycharm/_jb_unittest_runner.py", line 35, in <module>
main(argv=args, module=None, testRunner=unittestpy.TeamcityTestRunner, buffer=not JB_DISABLE_BUFFERING)
File "/miniconda3/envs/tf2/lib/python3.7/unittest/main.py", line 100, in __init__
self.parseArgs(argv)
File "/miniconda3/envs/tf2/lib/python3.7/unittest/main.py", line 147, in parseArgs
self.createTests()
File "/miniconda3/envs/tf2/lib/python3.7/unittest/main.py", line 159, in createTests
self.module)
File "/miniconda3/envs/tf2/lib/python3.7/unittest/loader.py", line 220, in loadTestsFromNames
suites = [self.loadTestsFromName(name, module) for name in names]
File "/miniconda3/envs/tf2/lib/python3.7/unittest/loader.py", line 220, in <listcomp>
suites = [self.loadTestsFromName(name, module) for name in names]
File "/miniconda3/envs/tf2/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
module = __import__(module_name)
File "/Users/dittmar/Development/text/tensorflow_text/python/ops/bert_tokenizer_test.py", line 32, in <module>
from tensorflow_text.python.ops import bert_tokenizer
File "/Users/dittmar/Development/text/tensorflow_text/__init__.py", line 21, in <module>
from tensorflow_text.python import metrics
File "/Users/dittmar/Development/text/tensorflow_text/python/metrics/__init__.py", line 20, in <module>
from tensorflow_text.python.metrics.text_similarity_metric_ops import *
File "/Users/dittmar/Development/text/tensorflow_text/python/metrics/text_similarity_metric_ops.py", line 28, in <module>
gen_text_similarity_metric_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_text_similarity_metric_ops.so'))
File "/miniconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/Users/dittmar/Development/text/tensorflow_text/python/metrics/_text_similarity_metric_ops.so, 6): image not found

Converting Ragged Tensor of Word Pieces to bert input

Is there an example of how to convert the ragged tensors returned by WordpieceTokenizer into the correct (dense padded) format to use with bert (or bert on tensorflow hub) that is compatible with tf.function?

For example:

With the bert preprocessing
inputs = [['test this', 'and me too']]

Would become three dense tensors of input_ids, input_masks, and segment_ids

(<tf.Tensor: id=5315, shape=(1, 48), dtype=int32, numpy=
 array([[ 101, 2774, 1142, 1105, 1143, 1315,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]], dtype=int32)>,
 <tf.Tensor: id=5316, shape=(1, 48), dtype=int32, numpy=
 array([[0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0]], dtype=int32)>,
 <tf.Tensor: id=5317, shape=(1, 48), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0]], dtype=int32)>)

Using eager mode, it's straightforward to implement naively:

import tensorflow as tf
import tensorflow_text as tf_text

wp_tokenizer = tf_text.WordpieceTokenizer(vocab_table,
                                       token_out_type=tf.int64)

ws_tokenizer = tf_text.WhitespaceTokenizer()


# a low effort implementation of berts tokenizer
@tf.function
def tokenizer_fn(x):
    ws = ws_tokenizer.tokenize(x)
    wp = wp_tokenizer.tokenize(ws)
    return tf.cast(wp, tf.int32)

def bert_featureize(features):
    """Convert the raw text into the three tensors returned by bert preprocessing"""
    
    CLS = tf.constant([101], dtype=tf.int32)
    SEP = tf.constant([102], dtype=tf.int32)
    
    # (B, Word, Word Piece)
    # `take me to letter g` -> `<tf.RaggedTensor [[12882, 28304, 10376], 
    #                                             [10155], [30839], [27908], [10105], [34109]]>`
    wp_tokens = tokenizer_fn(features)
    input_ids = [] 
    input_masks = []
    segment_ids = []
    
    # iterate over batch, flattening pieces back to word level, padding, etc
    for tensor in wp_tokens:

        ids = tf.concat([CLS, tensor.flat_values, SEP], axis=0)
        length = len(ids)
        padding = tf.zeros(ModelKeys.MAX_SEQ_LEN - length, tf.int32)
        input_id = tf.concat([ids,padding],axis=0)
        input_mask = tf.where((input_id == 0) | 
                              (input_id == ModelKeys.CLS_ID) | 
                              (input_id == ModelKeys.SEP_ID),
                              0, 
                              tf.ones(ModelKeys.MAX_SEQ_LEN, tf.int32))
        
        input_mask = tf.cast(input_mask, tf.int32) 
        segment_id = tf.cast(input_id > 0, tf.int32)
        input_ids.append(input_id)
        input_masks.append(input_mask)
        segment_ids.append(segment_id)
    return tf.stack(input_ids), tf.stack(input_masks), tf.stack(segment_ids)

input_ids, input_masks, segment_ids = bert_featureize([['test this', 'and me too']])

(<tf.Tensor: id=5315, shape=(1, 48), dtype=int32, numpy=
 array([[ 101, 2774, 1142, 1105, 1143, 1315,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]], dtype=int32)>,
 <tf.Tensor: id=5316, shape=(1, 48), dtype=int32, numpy=
 array([[0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0]], dtype=int32)>,
 <tf.Tensor: id=5317, shape=(1, 48), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0]], dtype=int32)>)

# a bert tensorflow hub module
embeddings = module(input_ids, input_masks, segment_ids)

However, decorating bert_featureize with tf.function does not work correctly:

  ValueError: slice index 1 of dimension 0 out of bounds. for 'RaggedGetItem_1/strided_slice_2' (op: 'StridedSlice') with input shapes: [1], [1], [1], [1] and with computed input tensors: input[1] = <1>, input[2] = <2>, input[3] = <1>.

We previously used a custom C++ TensorFlow op that created the input_ids etc. as part of the graph, but are looking to deprecate that now that this has been released.

Thanks!

some bugs when using tensorflow 1.x

tensorflow_version=1.15

using BertTokenizer will cause

AttributeError: 'RaggedTensor' object has no attribute 'merge_dims'

and using BasicTokenizer with lower_case=True will cause

tensorflow.python.framework.errors_impl.InternalError: U_FILE_ACCESS_ERROR: Could not retrieve ICU NFKC_CaseFold normalizer [Op:CaseFoldUTF8]

NotImplementedError: Saving is not yet supported for TextVectorization layers.

I'm defining a TextVectorization layer like this

decoder_vectorize = TextVectorization(
  name='de_vectorize',
  standardize = 'lower_and_strip_punctuation',
  split       = 'whitespace',
  max_tokens  = config.tar_vocab,
  output_mode ='int', 
  output_sequence_length=config.tar_timesteps
)

Then after training, I try to save the model, but I'm hitting errors

>>> model.save('model.h5')
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py in <listcomp>(.0)
   3312   if context.executing_eagerly():
-> 3313     return [x.numpy() for x in tensors]
   3314   elif ops.inside_function():  # pylint: disable=protected-access
   3315     raise RuntimeError('Cannot get value inside Tensorflow graph function.')

AttributeError: 'TrackableWeightHandler' object has no attribute 'numpy'

>>> model.save('model.tf', save_format='tf')
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/preprocessing/text_vectorization.py in fail(_)
    295     def fail(_):
    296       raise NotImplementedError(
--> 297           "Saving is not yet supported for TextVectorization layers.")
    298     self._table._list_extra_dependencies_for_serialization = fail  # pylint: disable=protected-access
    299 

NotImplementedError: Saving is not yet supported for TextVectorization layers.

This is on a nightly TF version 2.2.0-dev20200122. Is there a workaround?

import TensorFlow_text failure (Reason: image not found)

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ethan/venv/nlp_env/lib/python3.7/site-packages/tensorflow_text/__init__.py", line 20, in <module>
from tensorflow_text.python.ops import *
File "/Users/ethan/venv/nlp_env/lib/python3.7/site-packages/tensorflow_text/python/ops/__init__.py", line 19, in <module>
from tensorflow_text.python.ops.greedy_constrained_sequence_op import greedy_constrained_sequence
File "/Users/ethan/venv/nlp_env/lib/python3.7/site-packages/tensorflow_text/python/ops/greedy_constrained_sequence_op.py", line 34, in <module>
gen_constrained_sequence_op = load_library.load_op_library(resource_loader.get_path_to_datafile('_constrained_sequence_op.so'))
File "/Users/ethan/venv/nlp_env/lib/python3.7/site-packages/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/Users/ethan/venv/nlp_env/lib/python3.7/site-packages/tensorflow_text/python/ops/_constrained_sequence_op.so, 6): Library not loaded: @rpath/libtensorflow_framework.1.dylib
Referenced from: /Users/ethan/venv/nlp_env/lib/python3.7/site-packages/tensorflow_text/python/ops/_constrained_sequence_op.so
Reason: image not found

No matching distribution found for tensorflow-text on Windows 10

When I try to download tensorflow-text I get the error message No matching distribution found for tensorflow-text, and I'm on Windows 10. I can't use tensorflow-text yet on Windows, right?

(seg_env) C:\Users\antoi\Documents\Programming\Covent Garden\Segmentation>python -m pip install tensorflow-text
ERROR: Could not find a version that satisfies the requirement tensorflow-text (from versions: none)
ERROR: No matching distribution found for tensorflow-text

tensorflow 2.1.0
Python 3.6.7

Error with string output wordpiece tokenization

I am finding an error when trying to do wordpiece tokenization, with string subtoken output:

import tensorflow as tf
import tensorflow_text as tf_text

sess = tf.InteractiveSession()
y = tf.constant(["ab", "a", "c"], dtype=tf.string)
vocab = tf.lookup.KeyValueTensorInitializer(
    keys=y, values=y,
    key_dtype=tf.string,
    value_dtype=tf.string,
)
vocab_lookup_table = tf.lookup.StaticHashTable(vocab, "")
tokenizer = tf_text.WordpieceTokenizer(
    vocab_lookup_table=vocab_lookup_table,
    token_out_type=tf.string,
    unknown_token=None,
    suffix_indicator="",
)
sess.run(tf.tables_initializer())
print(
    sess.run(tokenizer.tokenize(["abc", "ad"]))
)

I would expect the output [["ab", "c"], ["a", "d"]], but actually see:

F tensorflow/core/framework/tensor.cc:624] Check failed: dtype() == expected_dtype (9 vs. 7) string expected, got int64
Abort trap: 6

tensorflow_text version 0.1.0rc2, tensorflow version 1.14.0

Run model using `WhitespaceTokenizer` from SavedModel Graph Error

I'm using this for some simple tokenization in an Estimator text classifier model. When I try to run my model from SavedModel export, I get this error: KeyError: 'WhitespaceTokenizeWithOffsets'

  • TF: 1.14.0
  • TF-text: 0.1.0

I export my model like this:

def serving_input_receiver_fn():
    """A function that takes no argument and returns a
    `tf.estimator.export.ServingInputReceiver"""
    
    #1 parse the proto w/ the query string
    #2 tokenize the query as "query"
    #3 embedding picks up "query"
    
    serialized_tf_example = tf.placeholder(
        dtype=tf.string,
        shape=[None],
        name='input_example_tensor')
    receiver_tensors = {'examples': serialized_tf_example}
    feature_spec = {'query': tf.FixedLenFeature(1, dtype=tf.string)}
    features = tf.parse_example(serialized_tf_example, feature_spec)
    
    sparse_tokens = tokenizer.tokenize(features["query"]).to_sparse()
    
    features = {'tokens': sparse_tokens}
    
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

classifier.export_saved_model("exports/estimators-BOW-test", serving_input_receiver_fn)

However, when I go to test my model, it gets an error:

tag_set = "serve"
saved_model_dir = "exports/estimators-BOW-test/1571023056/"
output_tensor_names_sorted = ["dnn/head/Tile:0",
                              "dnn/head/predictions/probabilities:0"]

input_examples = ["taco bell", "taco", "korean", "hancos", "tacos", "mashed potatoes"]
proto_str = make_examples(input_examples, "query")
inputs_feed_dict = {"input_example_tensor:0": proto_str}

CLASS_NAMES = np.array(["CUISINE", "DISH", "RESTAURANT", "ADDRESS"])

with tf.Session(graph=tf.Graph()) as sess:
    loader.load(sess, tag_set.split(','), saved_model_dir)
    outputs = sess.run(output_tensor_names_sorted,
                       feed_dict=inputs_feed_dict)
    for inputs, outputs in zip(input_examples, outputs[1]):
        print(inputs, CLASS_NAMES[np.argmax(outputs)])

Error:

KeyError: 'WhitespaceTokenizeWithOffsets'

I can make this error go away if, in my Python process, I import:

import tensorflow_text as text

What is happening? Shouldn't all these ops be in the graph, so that, for example, I could run this in Java or TF Serving (non-Python env)?

tokenizer doesn't work with tf.distribute.Strategy

Hi guys,

My environment is tensorflow==2.0.0 and tensorflow_text==2.0.1

I'm wondering if there's a way to get tensorflow text to work with tensorflow's distributed strategies. Here's a minimal example that doesn't work for me. Note that if I remove tf.function, I can iterate through the dataset, but then I'm in eager mode and can't take advantage of executing things in a graph.

import tensorflow as tf
import tensorflow_text as text

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],
                                           ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))

def print_dataset(dataset):
    for x in dataset:
        tf.print(x)
print_dataset(tokenized_docs)
tf.function(print_dataset)(tokenized_docs)

strategy = tf.distribute.OneDeviceStrategy(device='/cpu:0')
distributed_dataset = strategy.experimental_distribute_dataset(tokenized_docs)
with strategy.scope():
    @tf.function
    def foo(dataset):
        for x in dataset:
            tf.print(x)

foo(distributed_dataset)

# OUTPUT
'''
2019-12-11 22:33:24.268738: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-11 22:33:24.297320: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2019-12-11 22:33:24.305256: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4d16340 executing computations on platform Host. Devices:
2019-12-11 22:33:24.305312: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/dispatch.py:180: batch_gather (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.
<tf.RaggedTensor [[b'Never', b'tell', b'me', b'the', b'odds.']]>
<tf.RaggedTensor [[b"It's", b'a', b'trap!']]>
tf.RaggedTensor(values=Tensor("RaggedFromVariant/RaggedTensorFromVariant:1", shape=(None,), dtype=string), row_splits=Tensor("RaggedFromVariant/RaggedTensorFromVariant:0", shape=(None,), dtype=int64))
tf.RaggedTensor(values=Tensor("RaggedFromVariant/RaggedTensorFromVariant:1", shape=(None,), dtype=string), row_splits=Tensor("RaggedFromVariant/RaggedTensorFromVariant:0", shape=(None,), dtype=int64))
Traceback (most recent call last):
  File "test.py", line 48, in <module>
    foo(distributed_dataset)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py", line 905, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in converted code:

    test.py:45 foo  *
        for x in dataset:
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/operators/control_flow.py:337 for_stmt
        return custom_handler(extra_test, body, init_vars)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/input_lib.py:416 _autograph_for_loop
        self.reduce((constant_op.constant(0),), reduce_body_with_dummy_state)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/input_lib.py:422 reduce
        has_data, data = _get_next_as_optional(iterator, self._strategy)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/input_lib.py:200 _get_next_as_optional
        iterator._iterators[i].get_next_as_list(new_name))  # pylint: disable=protected-access
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/input_lib.py:878 get_next_as_list
        lambda: _dummy_tensor_fn(data.value_structure))
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py:507 new_func
        return func(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py:1174 cond
        return cond_v2.cond_v2(pred, true_fn, false_fn, name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/cond_v2.py:91 cond_v2
        op_return_value=pred)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py:915 func_graph_from_py_func
        func_outputs = python_func(*func_args, **func_kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/input_lib.py:878 <lambda>
        lambda: _dummy_tensor_fn(data.value_structure))
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/input_lib.py:801 _dummy_tensor_fn
        result.append(create_dummy_tensor(feature_shape, feature_type))
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/input_lib.py:784 create_dummy_tensor
        for dim in feature_shape.dims:

    TypeError: 'NoneType' object is not iterable

'''

Removing the tokenizer seems to work

import tensorflow as tf
import tensorflow_text as text

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],
                                           ["It's a trap!"]])

def print_dataset(dataset):
    for x in dataset:
        tf.print(x)
print_dataset(docs)
tf.function(print_dataset)(docs)

strategy = tf.distribute.OneDeviceStrategy(device='/cpu:0')
distributed_dataset = strategy.experimental_distribute_dataset(docs)
with strategy.scope():
    @tf.function
    def foo(dataset):
        for x in dataset:
            tf.print(x)

foo(distributed_dataset)

# OUTPUT
'''
2019-12-11 22:30:47.173794: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-11 22:30:47.202302: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2019-12-11 22:30:47.210319: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3369120 executing computations on platform Host. Devices:
2019-12-11 22:30:47.210378: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
["Never tell me the odds."]
["It\'s a trap!"]
["Never tell me the odds."]
["It\'s a trap!"]
["Never tell me the odds."]
["It\'s a trap!"]
'''

Any help or guidance would be appreciated.
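
One possible workaround (a sketch only, not verified against this exact setup): the traceback fails while the distributed input pipeline builds a dummy tensor from the element structure, iterating feature_shape.dims, which is None for a RaggedTensor element. Padding the ragged tokenizer output into a dense tensor inside the map gives the pipeline a fully specified structure:

import tensorflow as tf
import tensorflow_text as text

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],
                                           ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
# .to_tensor() pads the RaggedTensor into a regular dense tensor, so the
# distributed iterator no longer has to handle a ragged value structure.
tokenized_docs = docs.map(
    lambda x: tokenizer.tokenize(x).to_tensor(default_value=''))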

import fails: "undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs"

I encountered this bug, which is most probably a duplicate of #30, which has been closed.
Is it related to #160 (comment)?

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 LTS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.0.0
  • Python version: Anaconda python 3.7.5
  • CUDA/cuDNN version: None
  • GPU model and memory: None

Describe the current behavior
Importing tensorflow-text fails with an error, making the library unusable.

Describe the expected behavior
The library can be imported and used without errors.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

I created a new minimal environment using

conda create -n tf-test tensorflow python=3.7
conda activate tf-test
pip install tensorflow-text

then, when trying to import tensorflow_text, the following error appears:

$ python
Python 3.7.5 (default, Oct 25 2019, 15:51:11) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> import tensorflow_text as text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_text/__init__.py", line 21, in <module>
    from tensorflow_text.python import metrics
  File "/home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_text/python/metrics/__init__.py", line 20, in <module>
    from tensorflow_text.python.metrics.text_similarity_metric_ops import *
  File "/home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_text/python/metrics/text_similarity_metric_ops.py", line 28, in <module>
    gen_text_similarity_metric_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_text_similarity_metric_ops.so'))
  File "/home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/mathieu/miniconda3/envs/tf-test/lib/python3.7/site-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
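
The Ss suffix in the mangled symbol is the pre-C++11 std::string ABI, which suggests an ABI mismatch between the conda-built TensorFlow and the pip tensorflow-text wheel (an assumption, but consistent with #30). Installing both packages from pip in a clean environment may avoid the mismatch:

conda create -n tf-test python=3.7
conda activate tf-test
pip install tensorflow tensorflow-text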

RaggedTensor requires merge_dims method

On this line:

return final_tokens.merge_dims(-2, -1)

the BertTokenizer relies on RaggedTensor's merge_dims method. However, this method is not available in TF 2.0; it looks like it was added in tensorflow/tensorflow@6f29f5c but has not yet been released.

Minimal reproducing example:

import tensorflow as tf
import tensorflow_text as tft

num_oov_buckets = 3
initializer = tf.lookup.KeyValueTensorInitializer(['Hello', 'World'], [0, 1], value_dtype=tf.int64)
table = tf.lookup.StaticVocabularyTable(initializer, num_oov_buckets)

data = tf.constant(['World Hello missing'])
tok = tft.BertTokenizer(table)

session = tf.compat.v1.Session()
session.run(tf.compat.v1.tables_initializer())  # the initializer op must actually be run

output = tok.tokenize(data)

print(output[0][0].eval(session=session))
print(output[0][1].eval(session=session))
print(output[0][2].eval(session=session))

Traceback:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow_text/python/ops/bert_tokenizer.py", line 195, in tokenize
tokens = self._basic_tokenizer.tokenize(text_input)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow_text/python/ops/bert_tokenizer.py", line 126, in tokenize
return final_tokens.merge_dims(-2, -1)
AttributeError: 'RaggedTensor' object has no attribute 'merge_dims'
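
Until a release includes merge_dims, a polyfill along these lines should behave the same (a sketch, assuming a rank-3 RaggedTensor whose two inner dimensions are both ragged, as produced by the tokenizer here):

import tensorflow as tf

def merge_last_two_dims(rt):
    # Equivalent to rt.merge_dims(-2, -1): keep the flat values and route the
    # outer row splits through the inner row splits, so each batch row spans
    # all of its wordpieces directly.
    return tf.RaggedTensor.from_row_splits(
        rt.flat_values, tf.gather(rt.values.row_splits, rt.row_splits))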

tensorflow-text crashes in google colab

When executed in Google Colab, this code crashes the session:

!pip install tensorflow_text

import tensorflow as tf
import tensorflow_text as text

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = tokenized_docs.make_one_shot_iterator()
print(iterator.get_next().to_list())

It also segfaults when executed locally.
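
As a side note (this does not address the reported crash itself): make_one_shot_iterator is TF 1.x API; in TF 2.x the mapped dataset can be iterated directly in eager mode:

for tokens in tokenized_docs:
    print(tokens.to_list())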

Are convolution layers supported?

I have the following model (imports shown for completeness; vocab_size and n_units are defined elsewhere):

import tensorflow as tf
import tensorflow_text as tftext
from tensorflow.keras.layers import (InputLayer, Embedding, Conv1D,
                                     MaxPooling1D, Flatten, Dense)

model = tf.keras.Sequential([
  InputLayer(input_shape=(None,), dtype='int64', ragged=True),
  tftext.keras.layers.ToDense(pad_value=0, mask=True),
  Embedding(vocab_size, n_units),
  Conv1D(filters=32, kernel_size=8, activation='relu'),
  MaxPooling1D(pool_size=2),
  Flatten(),
  Dense(10, activation='relu'),
  Dense(1, activation='sigmoid')
])

It fails with the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-0787290fce6e> in <module>()
      7   Flatten(),
      8   Dense(10, activation='relu'),
----> 9   Dense(1, activation='sigmoid')
     10 ])
     11 model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

6 frames
/tensorflow-2.0.0/python3.6/tensorflow_core/python/training/tracking/base.py in _method_wrapper(self, *args, **kwargs)
    455     self._self_setattr_tracking = False  # pylint: disable=protected-access
    456     try:
--> 457       result = method(self, *args, **kwargs)
    458     finally:
    459       self._self_setattr_tracking = previous_value  # pylint: disable=protected-access

/tensorflow-2.0.0/python3.6/tensorflow_core/python/keras/engine/sequential.py in __init__(self, layers, name)
    112       tf_utils.assert_no_legacy_layers(layers)
    113       for layer in layers:
--> 114         self.add(layer)
    115 
    116   @property

/tensorflow-2.0.0/python3.6/tensorflow_core/python/training/tracking/base.py in _method_wrapper(self, *args, **kwargs)
    455     self._self_setattr_tracking = False  # pylint: disable=protected-access
    456     try:
--> 457       result = method(self, *args, **kwargs)
    458     finally:
    459       self._self_setattr_tracking = previous_value  # pylint: disable=protected-access

/tensorflow-2.0.0/python3.6/tensorflow_core/python/keras/engine/sequential.py in add(self, layer)
    194       # If the model is being built continuously on top of an input layer:
    195       # refresh its output.
--> 196       output_tensor = layer(self.outputs[0])
    197       if len(nest.flatten(output_tensor)) != 1:
    198         raise TypeError('All layers in a Sequential model '

/tensorflow-2.0.0/python3.6/tensorflow_core/python/keras/engine/base_layer.py in __call__(self, inputs, *args, **kwargs)
    815           # Build layer if applicable (if the `build` method has been
    816           # overridden).
--> 817           self._maybe_build(inputs)
    818           cast_inputs = self._maybe_cast_inputs(inputs)
    819 

/tensorflow-2.0.0/python3.6/tensorflow_core/python/keras/engine/base_layer.py in _maybe_build(self, inputs)
   2139         # operations.
   2140         with tf_utils.maybe_init_scope(self):
-> 2141           self.build(input_shapes)
   2142       # We must set self.built since user defined build functions are not
   2143       # constrained to set self.built.

/tensorflow-2.0.0/python3.6/tensorflow_core/python/keras/layers/core.py in build(self, input_shape)
   1013     input_shape = tensor_shape.TensorShape(input_shape)
   1014     if tensor_shape.dimension_value(input_shape[-1]) is None:
-> 1015       raise ValueError('The last dimension of the inputs to `Dense` '
   1016                        'should be defined. Found `None`.')
   1017     last_dim = tensor_shape.dimension_value(input_shape[-1])

ValueError: The last dimension of the inputs to `Dense` should be defined. Found `None`.
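
Convolution layers themselves accept the dense output of ToDense; the failure is that Flatten() over a variable-length time axis leaves the last dimension undefined, which Dense rejects. One way around this (a sketch, not an official recommendation) is to end the convolutional stack with a global pooling layer, whose output size is fixed by the filter count:

import tensorflow as tf
import tensorflow_text as tftext

vocab_size, n_units = 1000, 64  # hypothetical values
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(None,), dtype='int64', ragged=True),
    # mask=False here, since Conv1D does not consume a Keras mask.
    tftext.keras.layers.ToDense(pad_value=0, mask=False),
    tf.keras.layers.Embedding(vocab_size, n_units),
    tf.keras.layers.Conv1D(filters=32, kernel_size=8, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),  # output shape (batch, 32) is defined
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])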

Could not find a version that satisfies the requirement tensorflow-text==1.15.0

I am getting the following error while trying to use your package along with TF 1.15:

$ pip install tensorflow-text==1.15.0
ERROR: Could not find a version that satisfies the requirement tensorflow-text==1.15.0 (from versions: 0.1.0rc2, 0.1.0, 1.0.0b0, 1.0.0b2, 1.15.0rc0, 1.15.1, 2.0.0rc0, 2.0.1, 2.1.0rc0, 2.1.1, 2.2.0rc1)
ERROR: No matching distribution found for tensorflow-text==1.15.0

This is despite the fact that this version does exist: https://pypi.org/project/tensorflow-text/1.15.0/
Any thoughts on what's going on here?

Here is my environment information:

 $ pip --version
pip 20.0.2 from /Users/danielk/opt/anaconda3/envs/py367/lib/python3.6/site-packages/pip (python 3.6)

$ python --version
Python 3.6.7 :: Anaconda, Inc.
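
Since 1.15.1 does appear in pip's list of available versions, one pragmatic check (a suggestion, not part of the original report) is whether the patch release installs; if it does, the 1.15.0 wheels were most likely never published for this platform/Python combination:

pip install tensorflow-text==1.15.1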

Tensorflow text incompatible with Tensorflow 2.0.0-beta1

Hi!

I just installed tensorflow==2.0.0-beta1 and I get the following issue:

ERROR: tensorflow-text 1.0.0b0 has requirement tensorflow==2.0.0b0, but you'll have tensorflow 2.0.0b1 which is incompatible.

How can I resolve this issue?

Thanks,

Maxime
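
The error text shows that tensorflow-text 1.0.0b0 pins tensorflow==2.0.0b0 exactly. A guess (not verified): the later TF Text beta may target the newer TensorFlow beta, so it is worth trying first:

pip install tensorflow-text==1.0.0b2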

pip install of tensorflow-text disables tensorflow-gpu

It seems that a pip install of tensorflow-text>=2.0.0rc0 also installs the CPU-only tensorflow 2 package. If you previously had tensorflow-gpu installed, the newly installed package takes precedence and disables GPU access.

Steps to reproduce:

  1. Build a new Docker image with tf-gpu. Dockerfile:
FROM tensorflow/tensorflow:latest-gpu-py3-jupyter
WORKDIR /root

Build with docker build -t prueba .

Test correct GPU access:

$ docker run --runtime=nvidia --rm -it prueba:latest  python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
2019-12-06 18:55:03.225780: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-06 18:55:03.252825: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz
2019-12-06 18:55:03.253637: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x43890e0 executing computations on platform Host. Devices:
2019-12-06 18:55:03.253666: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-12-06 18:55:03.256129: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-12-06 18:55:03.361065: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 18:55:03.362691: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4460890 executing computations on platform CUDA. Devices:
2019-12-06 18:55:03.362794: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 960M, Compute Capability 5.0
2019-12-06 18:55:03.363437: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 18:55:03.364914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:02:00.0
2019-12-06 18:55:03.365825: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-12-06 18:55:03.370683: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-12-06 18:55:03.373270: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-12-06 18:55:03.373944: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-12-06 18:55:03.376933: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-12-06 18:55:03.378858: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-12-06 18:55:03.384248: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-06 18:55:03.384400: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 18:55:03.384853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 18:55:03.385166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-12-06 18:55:03.385213: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-12-06 18:55:03.385816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-06 18:55:03.385834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-12-06 18:55:03.385842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-12-06 18:55:03.385961: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 18:55:03.386314: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 18:55:03.386633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 3330 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0)
True
  2. New Dockerfile with tensorflow-text:
FROM tensorflow/tensorflow:latest-gpu-py3-jupyter
WORKDIR /root

RUN pip install "tensorflow-text>=2.0.0rc0"

Build and test... no GPU:

$ docker run --runtime=nvidia --rm -it prueba:latest  python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
2019-12-06 19:02:11.488695: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-06 19:02:11.512879: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz
2019-12-06 19:02:11.513972: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3726f30 executing computations on platform Host. Devices:
2019-12-06 19:02:11.514007: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
False
  3. Workaround: install tensorflow-text to the user site only, then uninstall the tensorflow package it pulled in. Dockerfile:
FROM tensorflow/tensorflow:latest-gpu-py3-jupyter
WORKDIR /root

RUN pip install --user "tensorflow-text>=2.0.0rc0"
RUN pip uninstall -y tensorflow

Build and test; tensorflow-gpu works:

$ docker build -t prueba -f Dockerfile .
Sending build context to Docker daemon  4.096kB
Step 1/4 : FROM tensorflow/tensorflow:latest-gpu-py3-jupyter
 ---> 88178d65d12c
Step 2/4 : WORKDIR /root
 ---> Using cache
 ---> 39616c78086e
Step 3/4 : RUN pip install --user tensorflow-text>=2.0.0rc0
 ---> Running in a08e37ee49da
  WARNING: The scripts saved_model_cli, tensorboard, tf_upgrade_v2, tflite_convert, toco and toco_from_protos are installed in '/root/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: You are using pip version 19.2.3, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Removing intermediate container a08e37ee49da
 ---> d5c415130f01
Step 4/4 : RUN pip uninstall -y tensorflow
 ---> Running in 23e3d7e9a6a8
Uninstalling tensorflow-2.0.0:
  Successfully uninstalled tensorflow-2.0.0
Removing intermediate container 23e3d7e9a6a8
 ---> 7b91c3f6eeed
Successfully built 7b91c3f6eeed
Successfully tagged prueba:latest
$ docker run --runtime=nvidia --rm -it prueba:latest  python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
2019-12-06 19:03:36.662969: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-06 19:03:36.688813: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz
2019-12-06 19:03:36.689431: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c00060 executing computations on platform Host. Devices:
2019-12-06 19:03:36.689461: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-12-06 19:03:36.691759: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-12-06 19:03:36.733931: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 19:03:36.734501: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4cd7810 executing computations on platform CUDA. Devices:
2019-12-06 19:03:36.734534: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 960M, Compute Capability 5.0
2019-12-06 19:03:36.734750: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 19:03:36.735122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:02:00.0
2019-12-06 19:03:36.735374: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-12-06 19:03:36.737036: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-12-06 19:03:36.738244: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-12-06 19:03:36.738574: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-12-06 19:03:36.740186: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-12-06 19:03:36.741472: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-12-06 19:03:36.745057: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-06 19:03:36.745242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 19:03:36.745721: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 19:03:36.746093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-12-06 19:03:36.746178: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-12-06 19:03:36.746822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-06 19:03:36.746837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-12-06 19:03:36.746847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-12-06 19:03:36.747053: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 19:03:36.747601: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-06 19:03:36.748010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 3330 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0)
True
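
Another possible workaround (an assumption, untested here): skip pip's dependency resolution entirely, so tensorflow-text cannot pull a CPU-only tensorflow on top of the preinstalled GPU build:

RUN pip install --no-deps "tensorflow-text>=2.0.0rc0"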

Op type not registered 'RegexSplitWithOffsets'

Environment

OS:

Ubuntu: Release 18.04
Python: 3.6.9

Dependency versions:

tensorflow -- 2.0.0
tensorflow-text -- 2.0.1

Error:

2019-12-16 14:59:30.379512: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at partitioned_function_ops.cc:113 : Not found: Op type not registered 'RegexSplitWithOffsets' in binary running on pc. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

Error and Pipeline Description

I am using BertTokenizer in a custom tensorflow.keras.layers.Layer subclass.

Training with model.fit --> works fine
Evaluating with model.evaluate --> works fine
Saving with model.save --> works fine
Loading with model.load --> works fine
Serving with TensorFlow Serving --> ERROR

When serving with TensorFlow Serving I get the Not found: Op type not registered 'RegexSplitWithOffsets' error.
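
For reference (a sketch of the usual remedy, not confirmed for this setup): in a Python process the custom ops are registered simply by importing tensorflow_text before the SavedModel is loaded, which covers the model.load path. The TF Serving binary, by contrast, needs the TF Text op libraries linked into its build.

import tensorflow as tf
import tensorflow_text  # imported for its side effect: registering TF Text ops

model = tf.saved_model.load('/path/to/exported_model')  # hypothetical path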

KeyError: 'WhitespaceTokenizeWithOffsets' when trying to run SavedModel

I'm using this for simple tokenization in a text classifier model. When I try to run my model from a SavedModel export, I get this error: KeyError: 'WhitespaceTokenizeWithOffsets'

TF: 1.14.0
TF-text: 0.1.0

Error:
KeyError: 'WhitespaceTokenizeWithOffsets'

When I do "import tensorflow_text" it works.

Having read various blogs and issues on the tensorflow/text and tensorflow/serving repos, I found that the problem is that the TF Text ops are not included in the core graph.

According to tensorflow/serving#1490, including the .so explicitly in Java at run time throws a segmentation fault.

Reading the reply in issue #130, no Java code was provided showing how adding the .so file explicitly solves this issue. Is there any Java code available?

References to non-existent components in docs & comments

Hi, thanks for this awesome project. We are taking a look at this lib for some NLP research projects at Twitter and it looks very promising.

Some things I found browsing the project:

There is a reference to "TextExtractor" at
https://github.com/tensorflow/text/blob/9a92e9716e58460203d59192037eaa217d958f00/docs/api_docs/python/text/sentence_fragments.md

and also in the docstring of the function:

* 0x01 (ILL_FORMED) - Text is ill-formed according to TextExtractor;

There is a reference to the "SAFT Tokenization library" in:

Uses the SAFT Tokenization library for sentence and word breaking

From the context these sound very useful. Is there a timeline for incorporating them into the code?

Either way, it might be good to remove these references for the time being to avoid confusion.

Unable to train model with Keras API and tensorflow_text

I am trying to train a model using the example from the README file, but it doesn't work:

import tensorflow as tf
import tensorflow_text as text

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(None,), dtype='int32', ragged=True),
    text.keras.layers.ToDense(pad_value=0, mask=True),
    tf.keras.layers.Embedding(100, 16),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

ERROR:
ValueError: The two structures don't have the same nested structure.

First structure: type=TensorSpec str=TensorSpec(shape=(), dtype=tf.string, name=None)

Second structure: type=RaggedTensor str=tf.RaggedTensor(values=Tensor("input_1/flat_values:0", shape=(None,), dtype=int32), row_splits=Tensor("input_1/row_splits_0:0", shape=(None,), dtype=int64))

More specifically: Substructure "type=RaggedTensor str=tf.RaggedTensor(values=Tensor("input_1/flat_values:0", shape=(None,), dtype=int32), row_splits=Tensor("input_1/row_splits_0:0", shape=(None,), dtype=int64))" is a sequence, while substructure "type=TensorSpec str=TensorSpec(shape=(), dtype=tf.string, name=None)" is not
Entire first structure:
.
Entire second structure:
.

Error: Could not retrieve ICU NFKC normalizer

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution: MacOS 10.14.5
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: 2.0.0b0
  • Python version: Python 3.7.2
  • CUDA/cuDNN version: None
  • GPU model and memory: None

Describe the current behavior
Calling normalize_utf8 raises an error.

Describe the expected behavior
normalize_utf8 should not fail

Code to reproduce the issue

import tensorflow as tf
import tensorflow_text as text
tf.enable_eager_execution()

print(text.normalize_utf8(['Äffin']))

Error:

/usr/local/lib/python3.7/site-packages/tensorflow_text/python/ops/normalize_ops.py in normalize_utf8(input, normalization_form, name)
     87       return input_tensor.with_flat_values(result)
     88     else:
---> 89       return gen_normalize_ops.normalize_utf8(input_tensor, normalization_form)

<string> in normalize_utf8(input, normalization_form, name)

/usr/local/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

InternalError: U_FILE_ACCESS_ERROR: Could not retrieve ICU NFKC normalizer [Op:NormalizeUTF8]
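
U_FILE_ACCESS_ERROR points at the ICU normalization data failing to load from the installed wheel; a plausible cause (an assumption) is a packaging problem in the beta-era macOS wheels. Upgrading to a matching stable pair is worth trying first:

pip install --upgrade tensorflow tensorflow-text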

TextVectorization layer vs TensorFlow Text

The latest TF release, 2.1, added a new Keras layer for in-graph text processing: TextVectorization. This layer seems to support custom tokenization and all the typical preprocessing steps (here is a detailed article on how to use it).

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=400)

What does this mean for TF Text, which still has to be used with the ToDense layer since very few layers support RaggedTensors? And when should each of the two be used?
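
For contrast (a sketch; the comparison is my own summary, not from the thread): TextVectorization goes from raw strings to a fixed-size dense int tensor inside Keras, while a TF Text tokenizer returns a RaggedTensor that preserves per-example token counts:

import tensorflow as tf
import tensorflow_text as text

docs = tf.constant(['Never tell me the odds.', "It's a trap!"])
tokens = text.WhitespaceTokenizer().tokenize(docs)
# tokens is a RaggedTensor: each row keeps its own number of tokens.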

Build from source instructions

The current PyPI package is a bit old, and there are no instructions for building the latest one from source.
Also, is there a nightly build somewhere?

Thanks.
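
On the nightly question: a nightly package does appear to be published on PyPI (an assumption; the exact package name may differ):

pip install tensorflow-text-nightly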
