
embeddings's Introduction

Hi 👋

I work mostly with text and write in Python. I occasionally post on [[my blog]] about my readings and thoughts.

embeddings's People

Contributors

chenghaomou


Forkers

zhuifeng414

embeddings's Issues

ValueError: window shape cannot be larger than input array shape

Hi,
Thanks for your work.

I tried to fine-tune BERT using your tokenizer API, mapping it over the Hugging Face MNLI dataset (multi_nli), but I ran into this error:

Traceback (most recent call last):
  File "/mnt/beegfs/work/vu/works/bert-tokenize/finetune_bert.py", line 27, in <module>
    tokenized_datasets = dataset.map(tokenization, batched=True)
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/dataset_dict.py", line 770, in map
    {
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/dataset_dict.py", line 771, in <dictcomp>
    k: dataset.map(
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2376, in map
    return self._map_single(
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 551, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2764, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2644, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2336, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/mnt/beegfs/work/vu/works/bert-tokenize/finetune_bert.py", line 18, in tokenization
    return tokenizer(
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/text_embeddings/base/__init__.py", line 139, in __call__
    second_embeddings = self.text2embeddings(second_text)
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/text_embeddings/visual/vtr.py", line 119, in text2embeddings
    sliding_window_view(image_array, (image_array.shape[0], self.window_size)),
  File "<__array_function__ internals>", line 180, in sliding_window_view
  File "/storage/ukp/work/vu/miniconda/envs/token-reload/lib/python3.10/site-packages/numpy/lib/stride_tricks.py", line 332, in sliding_window_view
    raise ValueError(
ValueError: window shape cannot be larger than input array shape
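
For reference, this error comes from NumPy's sliding_window_view, which refuses any window wider than the input array. A minimal standalone reproduction (plain NumPy, independent of text_embeddings; the 5-pixel width stands in for a very short rendered string and is only illustrative):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# A rendered text image that is only 5 pixels wide...
image_array = np.zeros((14, 5))

# ...cannot host a 10-pixel-wide sliding window, so NumPy raises:
# ValueError: window shape cannot be larger than input array shape
sliding_window_view(image_array, (image_array.shape[0], 10))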

Here is my code snippet:

from datasets import load_dataset
from text_embeddings.visual import VTRTokenizer
from transformers import BertForSequenceClassification

tokenizer = VTRTokenizer(
    font="/storage/ukp/work/vu/works/bert-tokenize/NotoSans-Regular.ttf",
)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

def tokenization(example):
    return tokenizer(
        example["premise"], 
        example["hypothesis"],
        padding="longest",
        truncation="longest_first",
    )

dataset = load_dataset('multi_nli')

tokenized_datasets = dataset.map(tokenization, batched=True)

Do you have any workaround to fix this?
Thanks
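
One possible workaround, sketched below under the assumption that the failure comes from rows whose premise or hypothesis is empty or nearly empty, so the rendered image ends up narrower than the tokenizer's sliding window: filter such rows out before mapping. The length check is a guess, not part of the library.

dataset = load_dataset('multi_nli')

# Drop rows with empty or whitespace-only text before tokenizing; the
# assumption is that these render to images narrower than window_size.
dataset = dataset.filter(
    lambda ex: len(ex["premise"].strip()) > 0 and len(ex["hypothesis"].strip()) > 0
)

tokenized_datasets = dataset.map(tokenization, batched=True)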

Exporting tokenizer state

Hi - I am trying out your code to train a Charformer/GBST model on Thai Wikipedia. I aim to cover both ASCII and Thai.

I'm running into three concerns now and would appreciate any advice:

  • I know that ByT5 maps a byte's value directly to its index in the embeddings. In Charformer, to add one non-ASCII character, would I need to expand vocab_size to 256 + 1, or would vocab_size need to cover the highest codepoint (i.e. vocab_size > 3675 to include 0x0e5b)? (See the byte-level sketch after this list.)

  • Once I pass the tokenizer many lines of text, I'd like to export and reuse the tokenizer. I ran into these errors when running to_onnx and to_torchscript:

Unsupported: ONNX export of Slice with dynamic inputs. DynamicSlice is a deprecated experimental op.
Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults

  • Other library code wants to pass a batch of texts (an array of strings) to the tokenizer, similar to the top-level README example. Based on the docs I know that Charformer expects torch.tensor([list(line_of_text.encode('utf-8'))]), and I need to pad these to the same length. Should I try to make this more like other tokenizers with PaddingStrategy, etc.? (A minimal batching sketch follows below.)
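
On the first and third points, here is a minimal sketch (plain PyTorch, not part of this library) of the manual byte-level batching described above. UTF-8 encodes every character, Thai included, into bytes in the 0-255 range, so the byte vocabulary itself does not need to grow with the codepoint; only extra special tokens would sit above 255. The pad id of 0 is an assumption.

import torch

lines = ["hello", "สวัสดี"]  # ASCII and Thai in one batch

# UTF-8 bytes are always 0-255, regardless of a character's codepoint.
encoded = [list(line.encode("utf-8")) for line in lines]
assert max(max(e) for e in encoded) < 256  # holds even for Thai text

# Pad every sequence to the longest one (0 used here as a placeholder pad id).
max_len = max(len(e) for e in encoded)
padded = [e + [0] * (max_len - len(e)) for e in encoded]
batch = torch.tensor(padded)  # shape: (batch_size, max_len)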
