unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

Home Page: https://unum-cloud.github.io/uform/

License: Apache License 2.0

Python 53.00% Swift 18.73% Jupyter Notebook 13.55% JavaScript 14.72%
huggingface-transformers language-vision multimodal pytorch semantic-search transformer cross-attention vector-search bert neural-network

uform's Introduction

UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation


Discord   LinkedIn   Twitter   Blog   GitHub

Multimodal Embeddings from 64 to 768 Dimensions • 1B Parameter Chat
Short Texts • Images • 🔜 Video Clips • 🔜 Long Documents
ONNX • CoreML • PyTorch
Python • JavaScript • Swift


UForm Chat Preview

Welcome to UForm, a multimodal AI library that's as versatile as it is efficient. UForm tiny embedding models will help you understand and search visual and textual content across various languages. UForm small generative models not only support conversational and chat use cases but are also great for fast image captioning and Visual Question Answering (VQA). Built on compact, custom pre-trained transformer models, they can run anywhere from your server farm down to your smartphone.

Features

  • Tiny Embeddings: 64-dimensional Matryoshka-style embeddings for extremely fast search.
  • Throughput: Thanks to the small size, the inference speed is 2-4x faster than competitors.
  • Portable: Models come with native ONNX support, making them easy to deploy on any platform.
  • Quantization Aware: Down-cast embeddings from f32 to i8 without losing much recall.
  • Multilingual: Trained on a balanced dataset, the recall is great across over 20 languages.

Models

For accuracy and speed benchmarks refer to the evaluation page.

Embedding Models

Model Parameters Languages Architecture
uform3-image-text-english-large 🆕 365 M 1 12 layer BERT, ViT-L/14
uform3-image-text-english-base 143 M 1 4 layer BERT, ViT-B/16
uform3-image-text-english-small 🆕 79 M 1 4 layer BERT, ViT-S/16
uform3-image-text-multilingual-base 206 M 21 12 layer BERT, ViT-B/16

Generative Models

Model Parameters Purpose Architecture
uform-gen2-dpo 🆕 1.2 B Chat, Image Captioning, VQA qwen1.5-0.5B, ViT-H/14
uform-gen2-qwen-500m 1.2 B Chat, Image Captioning, VQA qwen1.5-0.5B, ViT-H/14
uform-gen ⚠️ 1.5 B Image Captioning, VQA llama-1.3B, ViT-B/16

Quick Start Examples

Embedding Models

First, pip install uform. Then, load the model:

from uform import get_model, Modality

processors, models = get_model('unum-cloud/uform3-image-text-english-small')

model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]

Embed images:

import requests
from io import BytesIO
from PIL import Image

image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))
image_data = processor_image(image)
image_features, image_embedding = model_image.encode(image_data, return_features=True)

Embed queries:

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
text_data = processor_text(text)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
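
With both embeddings in hand, cross-modal retrieval reduces to a similarity comparison. Below is a minimal sketch of a cosine-similarity check with NumPy (not part of the UForm API); it assumes the embeddings are array-like, e.g. NumPy arrays or CPU tensors.

import numpy as np

def cosine_similarity(a, b) -> float:
    # hypothetical helper: cosine similarity between two embedding vectors
    a = np.asarray(a, dtype=np.float32).ravel()
    b = np.asarray(b, dtype=np.float32).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(text_embedding, image_embedding))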

For more details check out:

  • Python docs on embedding models in python/README.md
  • JavaScript docs on embedding models in javascript/README.md
  • Swift docs on embedding models in swift/README.md

Generative Models

The generative models are natively compatible with Hugging Face Transformers:

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('unum-cloud/uform-gen2-dpo', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('unum-cloud/uform-gen2-dpo', trust_remote_code=True)

prompt = 'Question or Instruction'
image = Image.open('image.jpg')

inputs = processor(text=[prompt], images=[image], return_tensors='pt')

with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id
    )
prompt_len = inputs['input_ids'].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]

For more details check out:

  • Python docs on generative models in python/README.md
  • JavaScript docs on generative models 🔜
  • Swift docs on generative models 🔜

Technical Details

Down-casting, Quantization, Matryoshka, and Slicing

Depending on the application, the embeddings can be down-cast to smaller numeric representations without losing much recall. Switching from f32 to f16 is recommended in almost all cases, unless you are running on very old hardware without half-precision support. Switching to i8 with linear scaling is also possible, but the drop in recall will be noticeable on larger collections with millions of searchable entries. Similarly, for higher-dimensional embeddings (512 or 768), a common strategy is to quantize them into single-bit representations for faster search.

import numpy as np

f32_embedding: np.ndarray = model.encode_text(text_data, return_features=False)
f16_embedding: np.ndarray = f32_embedding.astype(np.float16)
i8_embedding: np.ndarray = (f32_embedding * 127).astype(np.int8)
b1_embedding: np.ndarray = np.packbits((f32_embedding > 0).astype(np.uint8))

An alternative approach to quantization is to use Matryoshka embeddings, where the embeddings are sliced into smaller parts and the search is performed hierarchically.

import numpy as np

large_embedding: np.ndarray = model.encode_text(text_data, return_features=False)
small_embedding: np.ndarray = large_embedding[:, :256]
tiny_embedding: np.ndarray = large_embedding[:, :64]
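
Cosine distance is scale-invariant, so slicing alone is enough for cosine search; but if you plan to compare the sliced vectors with a plain dot product, or to quantize them afterwards, it can help to re-normalize the slices to unit length first. A minimal sketch, assuming the embeddings are NumPy arrays:

import numpy as np

small_embedding: np.ndarray = large_embedding[:, :256]
# re-normalize each row so the truncated vectors are unit-length again
small_embedding = small_embedding / np.linalg.norm(small_embedding, axis=1, keepdims=True)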

Both approaches are natively supported by the USearch vector-search engine and the SimSIMD numerics libraries. When dealing with small collections (up to millions of entries) and looking for low-latency cosine distance calculations, you can achieve 5x-2500x performance improvement over Torch, NumPy, SciPy, and vanilla Python using SimSIMD.

from simsimd import cosine, hamming

distance: float = cosine(f32_embedding, f32_embedding) # 32x SciPy performance on Apple M2 CPU
distance: float = cosine(f16_embedding, f16_embedding) # 79x SciPy performance on Apple M2 CPU
distance: float = cosine(i8_embedding, i8_embedding) # 133x SciPy performance on Apple M2 CPU
distance: float = hamming(b1_embedding, b1_embedding) # 17x SciPy performance on Apple M2 CPU

Similarly, when dealing with large collections (up to billions of entries per server) and looking for high-throughput search, you can achieve 100x performance improvement over FAISS and other vector-search solutions using USearch. Here are a couple of examples:

from usearch.index import Index

f32_index = Index(ndim=64, metric='cos', dtype='f32') # for Matryoshka embeddings
f16_index = Index(ndim=64, metric='cos', dtype='f16') # for Matryoshka embeddings
i8_index = Index(ndim=256, metric='cos', dtype='i8') # for quantized embeddings
b1_index = Index(ndim=768, metric='hamming', dtype='b1') # for binary embeddings
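
As a usage sketch (the integer key and the random query vector below are placeholders, not from the original README), populating and querying one of these indexes with the USearch Python API could look roughly like this:

import numpy as np
from usearch.index import Index

index = Index(ndim=64, metric='cos', dtype='f16')
vector = np.random.rand(64).astype(np.float32)  # e.g. a sliced 64-dimensional Matryoshka embedding

index.add(42, vector)               # insert one entry under integer key 42
matches = index.search(vector, 10)  # retrieve up to 10 nearest neighbors
print(matches.keys, matches.distances)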

Compact Packaging

PyTorch is a heavy dependency to carry, especially if you run on edge or IoT devices. Using the vanilla ONNX Runtime, one can significantly reduce memory consumption and deployment latency.

$ conda create -n uform_torch python=3.10 -y
$ conda create -n uform_onnx python=3.10 -y
$ conda activate uform_torch && pip install -e ".[torch]" && conda deactivate
$ conda activate uform_onnx && pip install -e ".[onnx]" && conda deactivate
$ du -sh $(conda info --envs | grep 'uform_torch' | awk '{print $2}')
> 5.2G    ~/conda/envs/uform_torch
$ du -sh $(conda info --envs | grep 'uform_onnx' | awk '{print $2}')
> 461M    ~/conda/envs/uform_onnx

Most of that weight can be further reduced down to 100 MB for both the model and the runtime. You can pick one of many supported ONNX execution providers, which include XNNPACK, CUDA and TensorRT for Nvidia GPUs, OpenVINO on Intel, DirectML on Windows, ROCm on AMD, CoreML on Apple devices, and more to come.
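
For illustration only, selecting an execution provider with the plain onnxruntime API looks roughly like the sketch below; 'model.onnx' is a placeholder path for an exported encoder, and the providers actually available depend on your onnxruntime build.

import onnxruntime as ort

print(ort.get_available_providers())  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']

session = ort.InferenceSession(
    'model.onnx',  # placeholder path to an exported ONNX encoder
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],  # tried left to right
)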

Multimodal Chat in CLI

The generative models can be used for chat-like experiences in the command line. For that, you can use the uform-chat CLI tool, which is available in the UForm package.

$ pip install uform
$ uform-chat --model unum-cloud/uform-gen2-dpo --image=zebra.jpg
$ uform-chat --model unum-cloud/uform-gen2-dpo \
>     --image="https://bit.ly/3tIVg9M" \
>     --device="cuda:0" \
>     --fp16

uform's People

Contributors

ake020675, ashvardanian, blackforestboi, gurgenyegoryan, ishkhan42, kapulkin, kimihailv, lmmx, semantic-release-bot, vovor


uform's Issues

Releasing training dataset

First of all, great work and congrats on the release. I was wondering whether you are planning on releasing the cleaned up 4M dataset?

Additional dependencies?

I've created a virtual environment, run pip3 install uform and am attempting to run the following code:

import uform

I'm receiving the following error

    import uform
  File "/home/zetaphor/Code/self-tracker/uform.py", line 1, in <module>
    from uform.gen_model import VLMForCausalLM, VLMProcessor
ModuleNotFoundError: No module named 'uform.gen_model'; 'uform' is not a package

I'm using Python version 3.11.2

Are there additional dependencies I'm missing?

CLIP for Voice

Would it be sane to get your model to support text to audio clips like this?

One of the DALLE3 engineers has a personal project called Tortoise-TTS where he has a voice version of CLIP he calls CLVP.

https://github.com/neonbjb/tortoise-tts/blob/1e061bc6752f05bccb59748c8bd7c7fc85d54988/tortoise/models/clvp.py#L24

I think he used lucidrains CLIP as a template: https://github.com/lucidrains/DALLE-pytorch/blob/58c1e1a4fef10725a79bd45cdb5581c03e3e59e7/dalle_pytorch/dalle_pytorch.py#L272

CoreML Model

Have y'all experimented with exporting the uform model (which is fantastic, by the way) as a CoreML model, so it can be run on-device more efficiently?

Benchmark script errors on loading InstructBLIP processor

I've tried running the code and found what looks like a bug in the benchmark script; I'm just diagnosing it now.

The traceback seems to point to the type of the image parameter at line 68:

 53 def bench_captions(
 54     model,
 55     processor,
 56     prompt: str,
 57     images: List[Image.Image],
 58 ) -> List[str]:
 59     total_duration = 0
 60     total_length = 0
 61     model = torch.compile(model)
 62     for image in images:
 63         seconds, text = duration(
 64             lambda: caption(
 65                 model=model,
 66                 processor=processor,
 67                 prompt=prompt,
 68                 image=image,
 69             )
 70         )
 71         total_duration += seconds
 72         total_length += len(text)
 73 
 74     del model
 75     del processor
 76     print(f"Throughput: {total_length/total_duration:.2f} tokens/s")
Traceback (captured by pytest):
scripts/bench.py:141: in <module>
    bench_captions(
scripts/bench.py:63: in bench_captions
    seconds, text = duration(
scripts/bench.py:48: in duration
    result = callable()
scripts/bench.py:64: in <lambda>
    lambda: caption(
scripts/bench.py:22: in caption
    inputs = processor(prompt, image, return_tensors="pt")
/home/louis/miniconda3/envs/uform/lib/python3.11/site-packages/transformers/models/instructblip/processing_instructblip.py:89: in __call__
    text_encoding = self.tokenizer(
/home/louis/miniconda3/envs/uform/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2802: in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
/home/louis/miniconda3/envs/uform/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2860: in _call_one
    raise ValueError(
E   ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB post_mortem >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /home/louis/miniconda3/envs/uform/lib/python3.11/site-packages/transformers/tokenization_utils_base.py(2860)_call_one()

I expanded this code out [no lambda] and it still gives the same error but the data flow is clearer

def bench_captions(
    model,
    processor,
    prompt: str,
    images: List[Image.Image],
) -> List[str]:
    total_duration = 0
    total_length = 0
    model = torch.compile(model)

    def caption_image(image, model=model, processor=processor, prompt=prompt):
        return caption(model=model, processor=processor, prompt=prompt, image=image)

    for image in images:
        seconds, text = duration(partial(caption_image, image=image))
        total_duration += seconds
        total_length += len(text)

    del model
    del processor
    print(f"Throughput: {total_length/total_duration:.2f} tokens/s")

The traceback is pointing to the loading of the processor of the InstructBLIP model.

It was reported but not resolved in transformers (I think unrelated huggingface/transformers#21366)

The bug seems to be that we are passing unnamed arguments, and they're getting misused as a result:

        inputs = processor(prompt, image, return_tensors="pt")

The InstructBLIP signature is __call__(self, images, text)

(Pdb) pp self.__call__.__func__.__code__.co_varnames
('self',
 'images',
 'text',
...

The docs say that

The InstructBlipForConditionalGeneration forward method, overrides the __call__ special method.

I think this must be what is supposed to be getting called.

Debugging in PDB shows this is what is happening

(Pdb) p images
'Summarize the visual content of the image.'
(Pdb) p text
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2787x4181 at 0x7FBE4A910090>

Does this reproduce for you?

Cause

Update: I found the cause is indeed passing positional args. If you print the processor param names they are, respectively:

  • texts, images, ... (Uform-gen)
  • text, images, ... (Llava)
  • images, text, ... (Instruct-BLIP)

I'm surprised this benchmark was working before

Solution

Since the parameter order varies you can't use positional args, but the parameter names differ too: text/texts.

In fact the odd one out here is from uform itself, so that should change, and this will work.

You can't just pass images=image (InstructBlipProcessor will get multiple values for the argument images)

This cannot be solved by passing text=text to Uform-Gen's VLMProcessor, that leads to a later error in the model.generate step.

It looks like switching the order of these arguments in VLMProcessor is the best solution.

If I patch it, everything works (but that's not to say don't fix the VLMProcessor argument order!).

def caption(model, processor, prompt: str, image: Image.Image) -> str:
    var_names = processor.__call__.__func__.__code__.co_varnames
    prompt_kwarg = next(kw for kw in iter(var_names) if kw.startswith("text"))
    processor_kwargs = {prompt_kwarg: prompt, "images": image, "return_tensors": "pt"}
    inputs = processor(**processor_kwargs)
...

Environment details

  • OS: Linux
  • Environment: conda
  • Python: 3.11.5
  • Transformers: 4.36.2
Full pip list:
(uform) louis 🌟 ~/lab/uform/uform $ pip list
Package            Version    Editable project location
------------------ ---------- ---------------------------
Brotli             1.0.9
certifi            2023.11.17
cffi               1.16.0
charset-normalizer 2.0.4
cryptography       41.0.7
filelock           3.13.1
fsspec             2023.12.2
gmpy2              2.1.2
huggingface-hub    0.20.1
idna               3.4
iniconfig          2.0.0
Jinja2             3.1.2
MarkupSafe         2.1.1
mkl-fft            1.3.8
mkl-random         1.2.4
mkl-service        2.4.0
mpmath             1.3.0
networkx           3.1
numpy              1.26.2
packaging          23.2
Pillow             10.0.1
pip                23.3.1
pluggy             1.3.0
pycparser          2.21
pyOpenSSL          23.2.0
PySocks            1.7.1
pytest             7.4.4
PyYAML             6.0.1
regex              2023.12.25
requests           2.31.0
safetensors        0.4.1
setuptools         68.2.2
sympy              1.12
tokenizers         0.15.0
torch              2.1.2
torchaudio         2.1.2
torchvision        0.16.2
tqdm               4.66.1
transformers       4.36.2
triton             2.1.0
typing_extensions  4.7.1
uform              1.0.3      /home/louis/lab/uform/uform
urllib3            1.26.18
wheel              0.41.2
Full conda list:
# packages in environment at /home/louis/miniconda3/envs/uform:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
blas                      1.0                         mkl  
brotli-python             1.0.9           py311h6a678d5_7  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2023.12.12           h06a4308_0  
certifi                   2023.11.17      py311h06a4308_0  
cffi                      1.16.0          py311h5eee18b_0  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
cryptography              41.0.7          py311hdda0065_0  
cuda-cudart               11.8.89                       0    nvidia
cuda-cupti                11.8.87                       0    nvidia
cuda-libraries            11.8.0                        0    nvidia
cuda-nvrtc                11.8.89                       0    nvidia
cuda-nvtx                 11.8.86                       0    nvidia
cuda-runtime              11.8.0                        0    nvidia
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.13.1          py311h06a4308_0  
freetype                  2.12.1               h4a9f257_0  
fsspec                    2023.12.2                pypi_0    pypi
giflib                    5.2.1                h5eee18b_3  
gmp                       6.2.1                h295c915_3  
gmpy2                     2.1.2           py311hc9b5ff0_0  
gnutls                    3.6.15               he1e5248_0  
huggingface-hub           0.20.1                   pypi_0    pypi
idna                      3.4             py311h06a4308_0  
iniconfig                 2.0.0                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46306  
jinja2                    3.1.2           py311h06a4308_0  
jpeg                      9e                   h5eee18b_1  
lame                      3.100                h7b6447c_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libcublas                 11.11.3.6                     0    nvidia
libcufft                  10.9.0.58                     0    nvidia
libcufile                 1.8.1.2                       0    nvidia
libcurand                 10.3.4.101                    0    nvidia
libcusolver               11.4.1.48                     0    nvidia
libcusparse               11.7.5.86                     0    nvidia
libdeflate                1.17                 h5eee18b_1  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.4                h5eee18b_0  
libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
libnpp                    11.8.0.86                     0    nvidia
libnvjpeg                 11.9.0.86                     0    nvidia
libpng                    1.6.39               h5eee18b_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.19.0               h5eee18b_0  
libtiff                   4.5.1                h6a678d5_0  
libunistring              0.9.10               h27cfd23_0  
libuuid                   1.41.5               h5eee18b_0  
libwebp                   1.3.2                h11a3e52_0  
libwebp-base              1.3.2                h5eee18b_0  
llvm-openmp               14.0.6               h9e868ea_0  
lz4-c                     1.9.4                h6a678d5_0  
markupsafe                2.1.1           py311h5eee18b_0  
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0           py311h5eee18b_1  
mkl_fft                   1.3.8           py311h5eee18b_0  
mkl_random                1.2.4           py311hdb19cb5_0  
mpc                       1.1.0                h10f8cd9_1  
mpfr                      4.0.2                hb69a4c5_1  
mpmath                    1.3.0           py311h06a4308_0  
ncurses                   6.4                  h6a678d5_0  
nettle                    3.7.3                hbbd107a_1  
networkx                  3.1             py311h06a4308_0  
numpy                     1.26.2          py311h08b1b3b_0  
numpy-base                1.26.2          py311hf175353_0  
openh264                  2.1.1                h4ff587b_0  
openjpeg                  2.4.0                h3ad879b_0  
openssl                   3.0.12               h7f8727e_0  
packaging                 23.2                     pypi_0    pypi
pillow                    10.0.1          py311ha6cbd5a_0  
pip                       23.3.1          py311h06a4308_0  
pluggy                    1.3.0                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pyopenssl                 23.2.0          py311h06a4308_0  
pysocks                   1.7.1           py311h06a4308_0  
pytest                    7.4.4                    pypi_0    pypi
python                    3.11.5               h955ad1f_0  
pytorch                   2.1.2           py3.11_cuda11.8_cudnn8.7.0_0    pytorch
pytorch-cuda              11.8                 h7e8668a_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pyyaml                    6.0.1           py311h5eee18b_0  
readline                  8.2                  h5eee18b_0  
regex                     2023.12.25               pypi_0    pypi
requests                  2.31.0          py311h06a4308_0  
safetensors               0.4.1                    pypi_0    pypi
setuptools                68.2.2          py311h06a4308_0  
sqlite                    3.41.2               h5eee18b_0  
sympy                     1.12            py311h06a4308_0  
tbb                       2021.8.0             hdb19cb5_0  
tk                        8.6.12               h1ccaba5_0  
tokenizers                0.15.0                   pypi_0    pypi
torchaudio                2.1.2               py311_cu118    pytorch
torchtriton               2.1.0                     py311    pytorch
torchvision               0.16.2              py311_cu118    pytorch
tqdm                      4.66.1                   pypi_0    pypi
transformers              4.36.2                   pypi_0    pypi
typing_extensions         4.7.1           py311h06a4308_0  
tzdata                    2023c                h04d1e81_0  
uform                     1.0.3                    pypi_0    pypi
urllib3                   1.26.18         py311h06a4308_0  
wheel                     0.41.2          py311h06a4308_0  
xz                        5.4.5                h5eee18b_0  
yaml                      0.2.5                h7b6447c_0  
zlib                      1.2.13               h5eee18b_0  
zstd                      1.5.5                hc292b87_0  

RPC implementation

UForm has the potential to become our primary interface for all things multi-modal, both for local on-device inference and cloud deployments. So let's add an RPC backend to it!

README example now invalid

I tried running the code in the readme, related to using unum-cloud/uform-gen2-qwen-500m

It doesn't work.
I got the following error.

 File "/home/phil/stable-diffusion-experiments/tokenspace/unum/unum2_test.py", line 3, in <module>
    model = AutoModel.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)
  File "/home/phil/stable-diffusion-experiments/venv-scratch/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/home/phil/stable-diffusion-experiments/venv-scratch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3462, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/phil/.cache/huggingface/modules/transformers_modules/unum-cloud/uform-gen2-qwen-500m/3912572ad204f82a2b0f875d3a1700faaebab719/modeling_uform_gen.py", line 63, in __init__
    self.text_config = AutoConfig.from_pretrained(
  File "/home/phil/stable-diffusion-experiments/venv-scratch/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1098, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/phil/stable-diffusion-experiments/venv-scratch/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 795, in __getitem__
    raise KeyError(key)
KeyError: 'qwen2'

How to cite your work

I can't find any research paper corresponding to this work. How can I cite your work in my research paper? I need it in the form of bibtex, for example like below:

@misc{shukor2022efficient,
      title={Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment}, 
      author={Mustafa Shukor and Guillaume Couairon and Matthieu Cord},
      year={2022},
      eprint={2208.13628},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Porting JavaScript package to browser

The current @unum-cloud/uform package uses onnxruntime-node. The WASM-based in-browser alternatives should be easy to swap in. It would be great to provide the user with a knob to select the backend or detect it automatically 🤗

ONNX Runtime crashes on exit (JavaScript)

If you run npm test you'll see models being downloaded and validated, but at the end, when the actual UForm tests have passed, it prints:

terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
  what():  /onnxruntime_src/onnxruntime/core/session/ort_env.cc:90 static void OrtEnv::Release(OrtEnv*) env_ptr == p_instance_.get() was false. 

Aborted

And exits with code 134. It's either coming from UForm or ONNX. In the first case, we may not be disposing some of the state properly.

Bug: can't load unum-cloud/uform-vl-english

uform==0.2.1

import uform

model = uform.get_model('unum-cloud/uform-vl-english')

result:

lib/python3.11/site-packages/timm/models/_factory.py:114: UserWarning: Mapping deprecated model name deit3_base_patch16_224_in21ft1k to current deit3_base_patch16_224.fb_in22k_ft_in1k.
  model = create_fn(
Traceback (most recent call last):
  File "t.py", line 5, in <module>
    model = uform.get_model('unum-cloud/uform-vl-english')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/uform.py", line 484, in get_model
    model.text_encoder.load_state_dict(state['text_encoder'])
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TextEncoder:
	Unexpected key(s) in state_dict: "backbone.embeddings.position_ids".

python crashes when loading model under macOS (by uform.get_model)

minimal reproducible code:

import uform

model = uform.get_model("unum-cloud/uform-vl-multilingual-v2")

execution result:

python main.py
Fetching 5 files: 100%|████████████████████████████████████████████████████| 5/5 [00:00<00:00, 14009.03it/s]
Segmentation fault: 11
(test-py3.11) bash-3.2$ /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

other information:

  • macOS: 14.0 (23A344) on Apple M1 (not compatible with?)
  • Python 3.11.8
  • running in poetry env

crash_report.txt

Problem with Batch Input

Hi,

Thanks for your repository. I have a question: when I want to give a batch of data points to the model, I get the following error:

for batch_idx, (inputs, targets) in enumerate(test_loader):
----> 9 image_info = model.preprocess_image(inputs).unsqueeze(0)
10 #image_info = model.preprocess_image(images).unsqueeze(0)

2 frames
/usr/local/lib/python3.10/dist-packages/uform/models.py in convert_to_rgb(image)
21 # lambda is not pickable
22 def convert_to_rgb(image):
---> 23 return image.convert("RGB")

AttributeError: 'Tensor' object has no attribute 'convert'
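
The preprocessor expects PIL images, while the default DataLoader collate stacks the dataset items into tensors. A possible workaround, sketched below under the assumption that the underlying dataset can yield PIL images and that preprocess_image returns a per-image tensor, is to keep the images as PIL objects inside the batch and preprocess them afterwards:

import torch
from torch.utils.data import DataLoader

def collate_keep_pil(batch):
    # keep (PIL.Image, target) pairs as-is instead of letting the default collate stack them
    images, targets = zip(*batch)
    return list(images), list(targets)

test_loader = DataLoader(dataset, batch_size=8, collate_fn=collate_keep_pil)

for images, targets in test_loader:
    # preprocess each PIL image individually, then stack into a batch tensor
    image_info = torch.stack([model.preprocess_image(image) for image in images])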

CoreML FP16 model

I was playing around with CoreML exports and I'm using the coco-sm tool to assess the performance. I benchmarked three configurations of the multilingual V2 model:

  • The original PyTorch FP32 UForm model (already provided in the coco-sm package)
  • The CoreML FP32 model
  • The CoreML FP16 model

The CoreML FP32 model yields metrics very close to the original PyTorch model, which is fine. However, the CoreML FP16 model gives metrics close to zero for all languages.

It looks like the drop in performance is due to the text encoder only. I tried exporting the image encoder to FP16 while keeping the text encoder in FP32, and this gave performance on par with the FP32 model.

This needs more investigation but this may be due to an overflow issue in some of the weights of the text-encoder during the FP16 conversion:

RuntimeWarning: overflow encountered in cast

Next I'm going to try the same thing with the OpenCLIP model to see if this issue only affects UForm or LLM-based text encoders in general. This would be quite unfortunate, because the FP32 multilingual model is quite heavy (> 400 MB) and it would be great to store it in FP16 to reduce its size.
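
One way to narrow this down (a rough sketch, not tied to any particular export pipeline, and only checking weights rather than activations) is to scan the FP32 text-encoder parameters for values outside the FP16 range before conversion:

import numpy as np

fp16_max = float(np.finfo(np.float16).max)  # roughly 65504

for name, param in model.text_encoder.named_parameters():  # PyTorch text encoder
    overflow = int((param.detach().abs() > fp16_max).sum())
    if overflow:
        print(f'{name}: {overflow} values exceed the FP16 range')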

No module named 'uform.models'

the problem is:
Traceback (most recent call last):
File "/content/ugen-image-captioning-hf/app.py", line 2, in
from uform import gen_model
File "/usr/local/lib/python3.10/dist-packages/uform/gen_model.py", line 18, in
from uform.models import VisualEncoder
ModuleNotFoundError: No module named 'uform.models'

Maybe this line should be "from uform.torch_models import VisualEncoder"?

Caption for Driver's License is incorrect

If I run the example captioning code on the first image at https://en.wikipedia.org/wiki/Driver's_licenses_in_the_United_States with max_new_tokens=1024, I get results like:
'A woman in a red jacket stands in front of a map, posing for a picture with a passport in her hand. The passport is on the left side of the image, and the woman is in the center. The map is in the background, providing context for the location.<|im_end|>'
OR
"The image features a postage stamp with a woman's face on it, depicting a woman in a red jacket. The stamp is from the United States and has a denomination of $10. The woman's face is prominently displayed on the stamp, making it a unique and eye-catching design. The stamp is placed in the center of the image, taking up a significant portion of the frame.<|im_end|>"

Probably attributable to the training dataset but disappointing nevertheless. Also, is there a prompt I could use to extract all the text from an image? Or would I need to fine tune for that?

Can I use the joint_embedding for Composed Image Retrieval (CIR)?

After reading https://www.unum.cloud/blog/2023-02-20-efficient-multimodality I found the multimodal encoder very interesting. My first thought was that this would produce an embedding in the same latent space as the visual and textual embeddings, for solving the CIR problem:

joint_embedding = model.encode_multimodal(image=image_info, text=text_info)

But after further examination of the loss functions (ALBEF and ViCHA) I'm not sure if that is the case.

  • Image-Text Matching (ITM) -> A linear layer on top of the multimodal emb, followed by a sigmoid to predict the similarity of both modalities. Useful to predict similarity, but not to align the embeddings.
  • Masked Language Modeling (MLM) -> BERT like
  • Masked Image Modeling -> MAE / I-JEPA like
  • Hierarchical Image-Text Contrastive (H-ITC) -> layer-wise CLIP
  • Visual Concepts Extraction (VCE) -> Keywords extraction

My insight is to make use of this late-fusion multimodal encoder to align the joint embedding to the same latent space as the image and text embeddings:

There are several papers about this problem, but the UForm multimodal encoder looks very similar to the "fusion" family of CIR methods:

(Screenshot from https://arxiv.org/abs/2303.11916v3)
