
clip.cpp's Introduction

👋 Hey, I'm Yusuf!

I'm an AI research engineer from Turkey. 📊 My work usually relates to NLProc, automatic speech recognition, and neural text-to-speech. I'm passionate about efficient implementations and green AI, as an abolitionist vegan. 🌱

🗞️ Timeline

The timeline below is dynamically updated with the messages I posted to a Telegram bot. 🤖


monatis/clip.cpp · ggerganov/llama.cpp · monatis/stable-diffusion-tf-docker · unum-cloud/usearch · damian0815/llama.cpp · abetlen/llama-cpp-python


clip.cpp's People

Contributors

denis-ismailaj · green-sky · monatis · phronmophobic · rjadr · yossef-dawoad


clip.cpp's Issues

python binding: OSError libggml.so: cannot open shared object file

Hello, first of all, huge thanks for your awesome work on clip.cpp!
I encountered an error while experimenting with your Python binding.
I installed the clip_cpp Python binding with:

 pip install clip_cpp

While the install succeeds, importing the Clip model with

from clip_cpp import Clip

results in the following error, indicating that it cannot link libggml.so:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-3550bf91251e> in <cell line: 1>()
----> 1 from clip_cpp import Clip

2 frames
/usr/lib/python3.10/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    372 
    373         if handle is None:
--> 374             self._handle = _dlopen(self._name, mode)
    375         else:
    376             self._handle = handle

OSError: libggml.so: cannot open shared object file: No such file or directory
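
Until the packaging is fixed, one possible workaround (a hedged sketch; the library path below is illustrative and assumes a locally built libggml.so) is to preload the library from an explicit path before importing the binding:

    # Hypothetical workaround: preload libggml.so from an explicit local path
    # so the subsequent import can resolve it.
    import ctypes

    ctypes.CDLL("/path/to/clip.cpp/build/ggml/src/libggml.so",
                mode=ctypes.RTLD_GLOBAL)

    from clip_cpp import Clip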

Bug: openai clip-vit-base-patch16 fails with memory error

Use any of the models here: https://huggingface.co/Green-Sky/ggml_openai_clip-vit-base-patch16

clip_model_load: ggml ctx size = 287.12 MB
.................................................clip_model_load: model size =   285.77 MB / num tensors = 397
clip_model_load: 8 MB of compute buffer allocated
clip_model_load: model loadded

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 16992432, available 16777216)
zsl: /home/green/workspace/clip.cpp/ggml/src/ggml.c:4341: ggml_new_tensor_impl: Assertion `false' failed.
Aborted (core dumped)

Prepare clip.cpp for upcoming llava.cpp

I'm still not 100% sure whether to call it llava.cpp or pick another name to indicate future support for other multimodal generation models --maybe multimodal.cpp or lmm.cpp (large multimodal model). Open to suggestions, but let's use llava.cpp as a code name for now.

  • Update CMakeLists.txt with a flag CLIP_STANDALONE to toggle standalone mode. When ON, build against the ggml submodule. When OFF, build with ggml.h and ggml.c files directly included in llama.cpp.
  • Implement a function to get hidden states from a given layer index, to be used in llava.cpp.
  • Create another repo for llava.cpp. The llava.cpp repo should add both clip.cpp and llama.cpp as submodules and build with CLIP_STANDALONE=OFF, i.e., against the ggml sources included in llama.cpp.

support for larger models

Currently, larger models don't load. (Tested with https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K.)

clip_model_load: loading model from '../models/laion_clip-vit-h-14-laion2b-s32b-b79k/ggml-model-f16.bin' - please wait...clip_model_load: n_vocab = 49408
clip_model_load: num_positions   = 77
clip_model_load: t_hidden_size  = 1024
clip_model_load: t_n_intermediate  = 4096
clip_model_load: t_n_head  = 16
clip_model_load: t_n_layer = 24
clip_model_load: image_size = 224
clip_model_load: patch_size   = 14
clip_model_load: v_hidden_size  = 1280
clip_model_load: v_n_intermediate  = 5120
clip_model_load: v_n_head  = 16
clip_model_load: v_n_layer = 32
clip_model_load: ftype     = 1
clip_model_load: ggml ctx size = 1887.22 MB
.................................................................................................................clip_model_load: model size =  1882.50 MB / num tensors = 909
clip_model_load: model loadded
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 100900784, available 100663296)
zsl: /home/green/workspace/clip.cpp/ggml/src/ggml.c:4131: ggml_new_tensor_impl: Assertion `false' failed.
Aborted (core dumped)

After modifying

new_clip->buf_compute.resize(96 * 1024 * 1024);

to use a * 100ul multiplier, it gets further, but now fails with:

clip_model_load: ggml ctx size = 1887.22 MB
.................................................................................................................clip_model_load: model size =  1882.50 MB / num tensors = 909
clip_model_load: model loadded
zsl: /home/green/workspace/clip.cpp/ggml/src/ggml.c:11044: ggml_compute_forward_soft_max_f32: Assertion `!isnan(sp[i])' failed.
zsl: /home/green/workspace/clip.cpp/ggml/src/ggml.c:11044: ggml_compute_forward_soft_max_f32: Assertion `!isnan(sp[i])' failed.
Aborted (core dumped)

Support custom mean-std normalization

Some LAION checkpoints, the large variant for example, use different mean and std values for image normalization. Preferably, figure out a way to encode which values to use, or else introduce another function that preprocesses with custom values.
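
For reference, CLIP-style preprocessing normalization is a per-channel affine transform; a minimal numpy sketch (the defaults shown are the well-known OpenAI CLIP values; LAION checkpoints may use different ones):

    import numpy as np

    def normalize_image(pixels,
                        mean=(0.48145466, 0.4578275, 0.40821073),
                        std=(0.26862954, 0.26130258, 0.27577711)):
        # pixels: float32 array of shape (H, W, 3), scaled to [0, 1]
        return (np.asarray(pixels, dtype=np.float32) - np.array(mean)) / np.array(std)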

Segmentation Fault and Core Dump when running image-search-build with Multiple images in folder Using the 4bit model

Issue Description:

Running the image-search-build command with a folder containing more than one image results in a segmentation fault and core dump. The issue occurs only when there are multiple images in the folder; it works fine with a single image.

Steps to Reproduce:

  1. Prepare a folder containing multiple images.
  2. Run the following command:
$ bin/image-search-build -m ../models/laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.q4_1.bin ../tests

clip_model_load: loading model from '../models/laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.q4_1.bin' - please wait....................................................clip_model_load: model size =    93.92 MB / num tensors = 397
clip_model_load: model loaded
main: starting base dir scan of '../tests'
main: processing 2 files in 'tests'
.Segmentation fault (core dumped)

Expected Behavior:

The program should handle folders with multiple images correctly and generate the expected output.

Actual Behavior:

The program crashes with a segmentation fault when processing a folder with multiple images, resulting in a core dump.

Additional Information:

  • I have attempted the following troubleshooting steps, but none have resolved the issue:
    • Ensuring correct file paths and contents.
    • Checking memory usage, ensuring sufficient available memory.
  • I have also tried different image folders and model files, but the problem persists.
  • The program works fine with a single image in the folder; the issue only arises when there are multiple images.

Environment Information:

  • Operating System: Ubuntu 20.04
  • RAM: 16G

Your help and support in addressing this matter would be greatly appreciated. Thank you!

include license file

Many companies require a permissive license to allow use of open-source code (typically MIT or Apache 2.0). Some projects wish to restrict use to research or personal use (often with Creative Commons BY-NC-SA). For what it's worth, ggml uses the commercial-friendly MIT license.

Would you mind adding a license file that fits with your goals for the project? That clarity would be greatly appreciated.

not enough space in the context's memory pool (on Apple M1 Max, 32GB RAM, clip-vit-b-32)

Hi there,

Thank you so much for making this library. I'm unfortunately running into the following error

./main --model '/Users/lucasigel/Downloads/laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.q4_0.bin'  --text "test" --image '/00000002.jpg' -v 1

clip_model_load: loading model from '/Users/lucasigel/Downloads/laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.q4_0.bin' - please wait....................................................clip_model_load: model size =    85.06 MB / num tensors = 397
clip_model_load: model loaded

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 12051936, available 8388608)
Assertion failed: (false), function ggml_new_tensor_impl, file ggml.c, line 4449.

zsh: abort      ./main --model  --text "test" --image  -v 1

I'm running on a Mac Studio with an M1 Max and 32 GB of RAM. I tried every available model binary on Hugging Face and still got the same memory pool error. Is this due to a memory allocation bug? I see in #17 that this was solved for some cases, and I'm wondering if there are lingering issues here.

Can you please make an exe of this project?

Hello, first of all thank you for this!

Can you please provide compiled binaries so we don't need to build the project ourselves, like stable-diffusion.cpp does? Then all we need to do is download a release and use it.

Kind regards

Metal support?

Hi, awesome work on this project!

I'm building some Swift apps using llama.cpp, and I'd love to try getting clip.cpp running on my app too.

I'm curious whether you're planning to support running clip.cpp on Metal like llama.cpp does.

Building with -DCLIP_BUILD_IMAGE_SEARCH=ON for image-search fails, ‘cos_gt’ is not a member of ‘unum::usearch’

Hi. I can compile and run clip.cpp's main in the normal way and it works. It's cool, thanks.

I cannot compile clip.cpp's image-search functionality using cmake -DCLIP_BUILD_IMAGE_SEARCH=ON. With that flag, the normal executables still compile fine, but the build fails upon reaching image-search. I tried going into the usearch _deps directory and building it myself, but that also failed to put things in the right place, I think. I don't really know what the "error: ‘cos_gt’ is not a member of ‘unum::usearch’" error means; I've been assuming the libs just aren't being found.

I am attempting to build on Debian 11 with g++ (Debian 10.2.1-6) 10.2.1 20210110. cmake version 3.18.4.

superkuh@janus:~/app_installs/clip.cpp/build4$ cmake -DCLIP_BUILD_IMAGE_SEARCH=ON ..
-- The C compiler identification is GNU 10.2.1
-- The CXX compiler identification is GNU 10.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Linux detected
-- Configuring done
-- Generating done
-- Build files have been written to: /home/superkuh/app_installs/clip.cpp/build4
superkuh@janus:~/app_installs/clip.cpp/build4$ l
total 84K
drwxr-xr-x 13 superkuh superkuh 4.0K Oct  8 12:25 ..
drwxr-xr-x  5 superkuh superkuh 4.0K Oct  8 12:25 _deps
-rw-r--r--  1 superkuh superkuh  21K Oct  8 12:25 CMakeCache.txt
-rw-r--r--  1 superkuh superkuh  12K Oct  8 12:25 Makefile
drwxr-xr-x  4 superkuh superkuh 4.0K Oct  8 12:25 ggml
-rw-r--r--  1 superkuh superkuh 2.1K Oct  8 12:25 cmake_install.cmake
drwxr-xr-x  2 superkuh superkuh 4.0K Oct  8 12:25 bin
drwxr-xr-x  3 superkuh superkuh 4.0K Oct  8 12:25 models
drwxr-xr-x  4 superkuh superkuh 4.0K Oct  8 12:25 examples
drwxr-xr-x  3 superkuh superkuh 4.0K Oct  8 12:25 tests
-rw-r--r--  1 superkuh superkuh 7.4K Oct  8 12:25 compile_commands.json
drwxr-xr-x  5 superkuh superkuh 4.0K Oct  8 12:25 CMakeFiles
drwxr-xr-x  9 superkuh superkuh 4.0K Oct  8 12:25 .
superkuh@janus:~/app_installs/clip.cpp/build4$ make
Scanning dependencies of target ggml
[  4%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[  8%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 13%] Linking C static library libggml.a
[ 13%] Built target ggml
Scanning dependencies of target clip
[ 17%] Building CXX object CMakeFiles/clip.dir/clip.cpp.o
[ 21%] Linking CXX static library libclip.a
[ 21%] Built target clip
Scanning dependencies of target quantize
[ 26%] Building CXX object models/CMakeFiles/quantize.dir/quantize.cpp.o
[ 30%] Linking CXX executable ../bin/quantize
[ 30%] Built target quantize
Scanning dependencies of target common-clip
[ 34%] Building CXX object examples/CMakeFiles/common-clip.dir/common-clip.cpp.o
[ 39%] Linking CXX static library libcommon-clip.a
[ 39%] Built target common-clip
Scanning dependencies of target extract
[ 43%] Building CXX object examples/CMakeFiles/extract.dir/extract.cpp.o
[ 47%] Linking CXX executable ../bin/extract
[ 47%] Built target extract
Scanning dependencies of target simple_c
[ 52%] Building C object examples/CMakeFiles/simple_c.dir/simple.c.o
[ 56%] Linking CXX executable ../bin/simple_c
[ 56%] Built target simple_c
Scanning dependencies of target zsl
[ 60%] Building CXX object examples/CMakeFiles/zsl.dir/zsl.cpp.o
[ 65%] Linking CXX executable ../bin/zsl
[ 65%] Built target zsl
Scanning dependencies of target main
[ 69%] Building CXX object examples/CMakeFiles/main.dir/main.cpp.o
[ 73%] Linking CXX executable ../bin/main
[ 73%] Built target main
Scanning dependencies of target image-search
[ 78%] Building CXX object examples/image-search/CMakeFiles/image-search.dir/search.cpp.o
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp: In function ‘int main(int, char**)’:
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:114:44: error: ‘cos_gt’ is not a member of ‘unum::usearch’ 
  114 |     unum::usearch::index_gt<unum::usearch::cos_gt<float>> embd_index;
      |                                            ^~~~~~
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:114:44: error: ‘cos_gt’ is not a member of ‘unum::usearch’
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:114:56: error: template argument 1 is invalid
  114 |     unum::usearch::index_gt<unum::usearch::cos_gt<float>> embd_index;
      |                                                        ^~
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:116:16: error: request for member ‘view’ in ‘embd_index’, which is of non-class type ‘int’
  116 |     embd_index.view("images.usearch");
      |                ^~~~
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:127:47: error: request for member ‘size’ in ‘embd_index’, which is of non-class type ‘int’
  127 |     if (image_file_index.size() != embd_index.size()) {
      |                                               ^~~~
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:158:31: error: request for member ‘search’ in ‘embd_index’, which is of non-class type ‘int’
  158 |     auto results = embd_index.search({vec.data(), vec.size()}, params.n_results);
      |                               ^~~~~~
make[2]: *** [examples/image-search/CMakeFiles/image-search.dir/build.make:82: examples/image-search/CMakeFiles/image-search.dir/search.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:473: examples/image-search/CMakeFiles/image-search.dir/all] Error 2
make: *** [Makefile:149: all] Error 2

I looked up the failing compile command in compile_commands.json and reran it by itself many times while altering other aspects (like the location of the _deps artifacts I had built manually).

superkuh@janus:~/app_installs/clip.cpp/build4/examples/image-search$ /usr/bin/c++ -DUSEARCH_USE_NATIVE_F16=0 -DUSEARCH_USE_OPENMP=0 -DUSEARCH_USE_SIMSIMD=0 -I/home/superkuh/app_installs/clip.cpp/. -I/home/superkuh/app_installs/clip.cpp/examples -I/home/superkuh/app_installs/clip.cpp/ggml/src/. -I/home/superkuh/app_installs/clip.cpp/ggml/src/../include -I/home/superkuh/app_installs/clip.cpp/ggml/src/../include/ggml -I/home/superkuh/app_installs/clip.cpp/build4/_deps/usearch-src/include -O3 -DNDEBUG -march=native -mf16c -mfma -mavx -mavx2 -o CMakeFiles/image-search.dir/search.cpp.o -c /home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp: In function ‘int main(int, char**)’:
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:114:44: error: ‘cos_gt’ is not a member of ‘unum::usearch’
  114 |     unum::usearch::index_gt<unum::usearch::cos_gt<float>> embd_index;
      |                                            ^~~~~~
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:114:44: error: ‘cos_gt’ is not a member of ‘unum::usearch’
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:114:56: error: template argument 1 is invalid
  114 |     unum::usearch::index_gt<unum::usearch::cos_gt<float>> embd_index;
      |                                                        ^~
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:116:16: error: request for member ‘view’ in ‘embd_index’, which is of non-class type ‘int’
  116 |     embd_index.view("images.usearch");
      |                ^~~~
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:127:47: error: request for member ‘size’ in ‘embd_index’, which is of non-class type ‘int’
  127 |     if (image_file_index.size() != embd_index.size()) {
      |                                               ^~~~
/home/superkuh/app_installs/clip.cpp/examples/image-search/search.cpp:158:31: error: request for member ‘search’ in ‘embd_index’, which is of non-class type ‘int’
  158 |     auto results = embd_index.search({vec.data(), vec.size()}, params.n_results);
      |                               ^~~~~~

I assumed the deps weren't being built, so I tried to build them myself:

superkuh@janus:~/app_installs/clip.cpp/build4/_deps/usearch-src$ cmake -DUSEARCH_BUILD_CLIB=YES .
-- Configuring done
-- Generating done
-- Build files have been written to: /home/superkuh/app_installs/clip.cpp/build4/_deps/usearch-src
superkuh@janus:~/app_installs/clip.cpp/build4/_deps/usearch-src$ make
[ 33%] Built target bench
[ 66%] Built target test
Scanning dependencies of target usearch_c
[ 83%] Building CXX object c/CMakeFiles/usearch_c.dir/lib.cpp.o
[100%] Linking CXX shared library ../libusearch_c.so
[100%] Built target usearch_c

But neither this nor similar attempts to build the various other parts of the USearch _deps helped. I can't seem to make the clip.cpp build find the USearch libs, if that is actually the problem.

Use scratch buffers

Scratch buffers can help optimize memory usage, especially for larger models.

Slower image encode the lower the quantization

I'm running the clip-vit-base-patch32_ggml model on my Intel Mac, and it looks like the lower the quantization, the slower image encoding is. I tried the main clip-vit-base-patch32_ggml-model-f32.gguf model along with the quantized variants listed below.

These are the encode times I get for a batch of 4 images:

model                                         avg batch img encode time
clip-vit-base-patch32_ggml-model-f32.gguf     272.21 ms
clip-vit-base-patch32_ggml-model-f16.gguf     665.07 ms
clip-vit-base-patch32_ggml-model-q8_0.gguf    333.96 ms
clip-vit-base-patch32_ggml-model-q5_1.gguf    322.71 ms
clip-vit-base-patch32_ggml-model-q5_0.gguf    354.86 ms
clip-vit-base-patch32_ggml-model-q4_1.gguf    330.20 ms
clip-vit-base-patch32_ggml-model-q4_0.gguf    539.32 ms

f16 looks like an outlier, taking the most time. But looking at f32 (272.21 ms) -> q8_0 (333.96 ms) -> q5_0 (354.86 ms) -> q4_0 (539.32 ms), encode time gets worse as the bit width drops. It's better with the _1 variants, though.

Does anyone know whether this is expected, or is something wrong?

python bindings 🐍: support accepting a list of inputs in the encoding methods

Iterating over a path of images and calculating the image embeddings in Python is really expensive.
I was thinking this could be offloaded to the C code and exposed through the Python binding.
As an example, from the docs notebook:

image_files = [...]  # list of image paths
# ⚠️ it takes about ~30 min to embed 5000 images of the fashion dataset
image_embeddings = [model.load_preprocess_encode_image(im) for im in tqdm(image_files)]
image_embeddings = np.array(image_embeddings, dtype=np.float16)

The preferable behavior might look like this:

image_files = [...]  # list of image paths
# accepting a list of image files;
# load_preprocess_encode_images iterates and processes them in C,
# exposed through the bindings
image_embeddings = model.load_preprocess_encode_images(image_files)
image_embeddings = np.array(image_embeddings, dtype=np.float16)

Write a better readme

Demonstrate model conversion, detail how to compile, explain the general API.

Talk about possible usage scenarios, especially the cold start issue.

Write instructions for Apple Mac

% uname -mps
Darwin arm64 arm

git clone https://github.com/monatis/clip.cpp.git --recurse-submodules clip.cpp
cd clip.cpp
mkdir build
cd build
cmake .. -GNinja
ninja

Error

CMake Warning at ggml/src/CMakeLists.txt:48 (message):
  Your arch is announced as x86_64, but it seems to actually be ARM64

See also ggerganov/whisper.cpp#66 (comment)

Move ZSL implementation to `clip` lib as a function

The implementation should be copied from zsl.cpp as a single function, similar to clip_compare_text_and_image, and it should implement the logic end-to-end.

Bonus: it could support a multi-label scheme like Hugging Face's ZSL pipeline, i.e., not squash all the scores into a softmax; see the sketch below.
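
As a rough numpy sketch of the two scoring schemes (assuming logits holds the scaled image-text similarities for N candidate labels; the sigmoid is one possible independent-scoring choice, not necessarily HF's exact mechanism):

    import numpy as np

    def single_label_scores(logits):
        # softmax: labels compete, scores sum to 1 (current behavior)
        e = np.exp(logits - np.max(logits))
        return e / e.sum()

    def multi_label_scores(logits):
        # each label scored independently, so scores need not sum to 1
        return 1.0 / (1.0 + np.exp(-np.asarray(logits)))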

Optional warmup when loading model

Currently, we set the memory buffer to a fixed size, which can be improved in the following ways.

  • Default to the memory requirement for the base model size, but allow it to be overridden with a compile-time constant.
  • Implement a mechanism that automatically discovers the memory required in the clip_model_load function and uses it in the clip_*_encode functions.

The second path may slow down initialization for many users, so it should require explicit opt-in even if it's chosen for implementation. I'm not sure it's user-friendly.

[ZSL] Results don't match Hugging Face demo

./bin/zsl -m ../../laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin --image  ../pic.png --text "playing music" --text "playing sports"
clip_model_load: loading model from '../../laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin' - please wait....................................................clip_model_load: model size =   288.93 MB / num tensors = 397
clip_model_load: model loaded

playing music = 0.5308
playing sports = 0.4692

Expected results (per the Hugging Face demo at https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K):
playing music = 1.000
playing sports = 0.000

Bug: openai's clip-vit-large-patch14-336 fails with assert

https://huggingface.co/openai/clip-vit-large-patch14-336/
This model has a larger input image size (336 instead of 224).

clip_model_load: ggml ctx size = 819.86 MB
.........................................................................clip_model_load: model size =   817.09 MB / num tensors = 589
clip_model_load: 16 MB of compute buffer allocated
clip_model_load: model loadded

GGML_ASSERT: /home/green/workspace/clip.cpp/clip.cpp:1086: nx == image_size && ny == image_size
Aborted (core dumped)

Introduce Java bindings

Since the entire public API is C-compatible, JNI or JNA bindings should be possible now. That could enable interesting use cases such as image search directly on Android phones.

Do proper benchmarking

Decide on a common dataset for proper benchmarking.

Ideally, it should compare both inference speed and vector quality, e.g., zero-shot labeling, image retrieval, etc.
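
As a starting point, a minimal timing sketch using the Python binding shown elsewhere on this page (paths are illustrative):

    import time
    from clip_cpp import Clip

    model = Clip(model_path_or_repo_id="/path/to/model.gguf")

    t0 = time.perf_counter()
    embedding = model.load_preprocess_encode_image("/path/to/image.jpg")
    print(f"image encode: {(time.perf_counter() - t0) * 1000:.2f} ms")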

Improve zero-shot labeling

As reported in #44, ZSL doesn't match HF's behavior:

After reviewing HF's code for ZSL, I figured out that they don't normalize vectors prior to the dot product calculation in ZSL. In clip.cpp, all encoding functions return normalized vectors, so we need to make normalization optional. This will require a signature change for those functions.
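
A minimal numpy illustration of why this matters: a softmax over raw dot products and one over cosine similarities generally disagree, because normalization rescales each logit by a different factor.

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    rng = np.random.default_rng(0)
    img = rng.standard_normal(512)
    txts = rng.standard_normal((2, 512))

    raw = txts @ img  # unnormalized dot products
    cos = (txts / np.linalg.norm(txts, axis=1, keepdims=True)) @ (img / np.linalg.norm(img))

    print(softmax(raw), softmax(cos))  # generally different distributions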

Additionally, we can write a single function that runs the zero-shot labeling task end-to-end, similar to clip_compare_text_and_image, and expose that function to the Python binding as well.

Question about the CLIP model in clip.cpp and llama.cpp

Hi,

I have been trying out the LLaVA models in llama.cpp and they work great! I am curious about the CLIP model and clip.cpp used in llama.cpp:

I would like to use LLaVA's CLIP model to encode both texts and images, but it seems the clip.cpp source inside llama.cpp does not have a way to encode text. Is it possible to add text encoding capabilities to the clip.cpp inside llama.cpp, or to load LLaVA's CLIP in this repo's clip.cpp and gain text encoding capabilities that way?

I hope this is not confusing and that this is the right place to ask. Thank you for the great work!

Publish as a Pip-installable Python package

Now that we have Python bindings implemented, it would be great if we provide a Pip-installable package.

There might be some complexity in shipping the binary shared library for different platforms and SIMD instruction sets, but x86_64 binaries with AVX2 for Linux and Windows should be sufficient at first. Other platforms and instruction sets (e.g., AVX512) can build from source. A packaging sketch follows below.
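
A hypothetical minimal setup.py sketch for bundling a prebuilt shared library with the package (names and globs are illustrative, not the project's actual packaging):

    from setuptools import setup

    setup(
        name="clip_cpp",
        version="0.1.0",
        packages=["clip_cpp"],
        # ship the prebuilt binaries next to the Python sources
        package_data={"clip_cpp": ["*.so", "*.dll", "*.dylib"]},
        include_package_data=True,
    )

Per-platform wheels would then be built and uploaded separately for each OS/instruction-set combination.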

Experiment with batch inference

Image encoding, especially, seems doable with reasonable effort.

I currently set the batch dimension manually to 1; instead, it could be set to the actual number of images. Concatenation may need extra attention; see the sketch below.
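
In numpy terms, the concatenation amounts to stacking the preprocessed images along a new leading batch axis (a sketch, assuming each preprocessed image is a (3, H, W) float32 array):

    import numpy as np

    imgs = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(4)]
    batch = np.stack(imgs, axis=0)  # shape (4, 3, 224, 224): batch of 4 images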

no module named 'gguf'

The GGUF library is needed to convert the model, so it should be added to requirements.txt and mentioned in the description.

Vision only model memory issue

I have created a vision-only model with the convert tool. When I try to load and use it with the Python binding as shown below, memory usage just explodes until my system crashes.

model = Clip(model_path_or_repo_id="/path/to/model")
# This is where it explodes
model.load_preprocess_encode_image("/path/to/image")

I have tried with and without quantization, with the same results. When I run the same code with the full model, i.e. not vision-only, it works. I didn't investigate further, and the full model is a decent workaround for me for now; just wanted to let you know :)

QuickGELU - not SOTA ?

I just read through the open_clip readme and found this section:

NOTE: Many existing checkpoints use the QuickGELU activation from the original OpenAI models. This activation is actually less efficient than native torch.nn.GELU in recent versions of PyTorch. The model defaults are now nn.GELU, so one should use model definitions with -quickgelu postfix for the OpenCLIP pretrained weights. All OpenAI pretrained weights will always default to QuickGELU. One can also use the non -quickgelu model definitions with pretrained weights using QuickGELU but there will be an accuracy drop, for fine-tune that will likely vanish for longer runs.

Does that mean we should not use QuickGELU, or that it should be exposed as a hyperparameter?
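
For reference, here are the two activations the quoted note compares, as scalar Python formulas:

    import math

    def quick_gelu(x):
        # QuickGELU, used by the original OpenAI CLIP weights:
        # x * sigmoid(1.702 * x)
        return x / (1.0 + math.exp(-1.702 * x))

    def gelu(x):
        # exact GELU (torch.nn.GELU's default): 0.5 * x * (1 + erf(x / sqrt(2)))
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

Per the quoted note, the activation should match whatever the checkpoint was trained with, which suggests encoding it per-model rather than hardcoding either variant.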

Hey there

Been following your progress with anticipation. Here are a couple of notes:

  • The ggml submodule commit is nowhere to be found (even in your fork); I had to manually check out your master.
  • Hardcoded paths in examples/main, ftw.
  • I got a better model to work too: https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K
    But I had to modify the conversion script to read "projection_dim" from the config instead (both times).
    Quick similarity test with the apple picture:

    model                          "a dog"   "a red apple"
    openai b32                     0.228     0.341
    laion b32 laion2b s34b b79k    0.126     0.345

keep up the good work!

Support batch inference for models other than patch32

The current batch inference code is only applicable to the patch32 model. When using other models such as patch16 or patch14, it produces incorrect results: the first embedding in a batch is correct, but all subsequent results are a single incorrect fixed value.
