
ip-adapter's Introduction

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

GitHub


Introduction

We present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pre-trained text-to-image diffusion models. An IP-Adapter with only 22M parameters can achieve comparable or even better performance than a fine-tuned image prompt model. IP-Adapter generalizes not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. Moreover, the image prompt works well together with the text prompt to accomplish multimodal image generation.

[Figure: IP-Adapter architecture overview]

Release

  • [2024/01/19] πŸ”₯ Add IP-Adapter-FaceID-Portrait, more information can be found here.
  • [2024/01/17] πŸ”₯ Add an experimental version of IP-Adapter-FaceID-PlusV2 for SDXL, more information can be found here.
  • [2024/01/04] πŸ”₯ Add an experimental version of IP-Adapter-FaceID for SDXL, more information can be found here.
  • [2023/12/29] πŸ”₯ Add an experimental version of IP-Adapter-FaceID-PlusV2, more information can be found here.
  • [2023/12/27] πŸ”₯ Add an experimental version of IP-Adapter-FaceID-Plus, more information can be found here.
  • [2023/12/20] πŸ”₯ Add an experimental version of IP-Adapter-FaceID, more information can be found here.
  • [2023/11/22] IP-Adapter is available in Diffusers thanks to Diffusers Team.
  • [2023/11/10] πŸ”₯ Add an updated version of IP-Adapter-Face. The demo is here.
  • [2023/11/05] πŸ”₯ Add text-to-image demo with IP-Adapter and Kandinsky 2.2 Prior
  • [2023/11/02] Support safetensors
  • [2023/9/08] πŸ”₯ Update a new version of IP-Adapter with SDXL_1.0. More information can be found here.
  • [2023/9/05] πŸ”₯πŸ”₯πŸ”₯ IP-Adapter is supported in WebUI and ComfyUI (or ComfyUI_IPAdapter_plus).
  • [2023/8/30] πŸ”₯ Add an IP-Adapter with face image as prompt. The demo is here.
  • [2023/8/29] πŸ”₯ Release the training code.
  • [2023/8/23] πŸ”₯ Add code and models of IP-Adapter with fine-grained features. The demo is here.
  • [2023/8/18] πŸ”₯ Add code and models for SDXL 1.0. The demo is here.
  • [2023/8/16] πŸ”₯ We release the code and models.

Installation

# install diffusers (pinned to 0.22.1)
pip install diffusers==0.22.1

# install ip-adapter
pip install git+https://github.com/tencent-ailab/IP-Adapter.git

# download the models
cd IP-Adapter
git lfs install
git clone https://huggingface.co/h94/IP-Adapter
mv IP-Adapter/models models
mv IP-Adapter/sdxl_models sdxl_models

# then you can use the notebook

Download Models

You can download models from here. To run the demo, you should also download the following models:

How to Use

SD_1.5

  • ip_adapter_demo: image variations, image-to-image, and inpainting with image prompt.

[Example results from the demo notebooks: image variations, image-to-image, inpainting, structural conditioning (ControlNet), multimodal prompts, IP-Adapter-Plus image variations and multimodal prompts, IP-Adapter-Plus face]
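
For orientation, here is a minimal sketch of how the ip_adapter_demo notebook wires things together; the paths are placeholders and the exact arguments may differ slightly from the notebook:

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, AutoencoderKL
from PIL import Image
from ip_adapter import IPAdapter

base_model_path = "runwayml/stable-diffusion-v1-5"
image_encoder_path = "models/image_encoder/"
ip_ckpt = "models/ip-adapter_sd15.bin"
device = "cuda"

# scheduler/VAE setup as used in the demo notebooks
noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1,
)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(dtype=torch.float16)

pipe = StableDiffusionPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    vae=vae,
    feature_extractor=None,
    safety_checker=None,
)

# wrap the pipeline with the IP-Adapter
ip_model = IPAdapter(pipe, image_encoder_path, ip_ckpt, device)

# image variations from a single image prompt
image = Image.open("your_reference.png")  # any reference image
images = ip_model.generate(pil_image=image, num_samples=4, num_inference_steps=50, seed=42)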

Best Practice

  • If you only use the image prompt, you can set scale=1.0 and text_prompt="" (or a generic text prompt such as "best quality"; you can also use any negative text prompt). If you lower the scale, more diverse images can be generated, but they may not be as consistent with the image prompt.
  • For multimodal prompts, you can adjust the scale to get the best results. In most cases, setting scale=0.5 gives good results. For the SD 1.5 version, we recommend using community models to generate good images. (A sketch of both settings follows below.)
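
As a concrete illustration of the two settings above (ip_model and image as in the sketch further up):

# image prompt only: full adapter strength, empty or generic text prompt
images = ip_model.generate(pil_image=image, prompt="", scale=1.0,
                           num_samples=4, num_inference_steps=50, seed=42)

# multimodal prompt: lower the scale so the text prompt can steer the result
images = ip_model.generate(pil_image=image, prompt="a girl sitting on the beach, wearing a hat",
                           scale=0.5, num_samples=4, num_inference_steps=50, seed=42)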

IP-Adapter for non-square images

As the image is center cropped by CLIP's default image processor, IP-Adapter works best for square images. For non-square images, it will miss the information outside the center crop. However, you can simply resize them to 224x224; the comparison is as follows:

[Comparison: default center crop vs. resizing to 224x224]
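
A minimal sketch of the resize workaround (assuming ip_model is an IPAdapter instance as in the sketch above):

from PIL import Image

image = Image.open("non_square_input.png")
# squash the whole image to 224x224 so CLIP's center crop becomes a no-op
image = image.resize((224, 224))
images = ip_model.generate(pil_image=image, num_samples=4, num_inference_steps=50, seed=42)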

SDXL_1.0

The comparison of IP-Adapter_XL with Reimagine XL is shown as follows:

[Figure: SDXL comparison results]
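
For reference, a minimal sketch of running the SDXL adapter with the original ViT-bigG checkpoint; the paths are placeholders taken from the download step above, and the updated ViT-H version described below instead pairs ip-adapter_sdxl_vit-h.bin with the encoder under models/image_encoder:

import torch
from diffusers import StableDiffusionXLPipeline
from PIL import Image
from ip_adapter import IPAdapterXL

base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
image_encoder_path = "sdxl_models/image_encoder"   # bigG encoder for the original SDXL adapter
ip_ckpt = "sdxl_models/ip-adapter_sdxl.bin"
device = "cuda"

pipe = StableDiffusionXLPipeline.from_pretrained(
    base_model_path, torch_dtype=torch.float16, add_watermarker=False
)

# wrap the SDXL pipeline with the IP-Adapter
ip_model = IPAdapterXL(pipe, image_encoder_path, ip_ckpt, device)

image = Image.open("your_reference.png")  # any reference image
images = ip_model.generate(pil_image=image, num_samples=1, num_inference_steps=30, seed=42,
                           prompt="best quality, high quality")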

Improvements in new version (2023.9.8):

  • Switch to CLIP-ViT-H: we trained the new IP-Adapter with OpenCLIP-ViT-H-14 instead of OpenCLIP-ViT-bigG-14. Although ViT-bigG is much larger than ViT-H, our experimental results did not find a significant difference, and the smaller model reduces memory usage in the inference phase.
  • A faster and better training recipe: in our previous version, training directly at a resolution of 1024x1024 proved to be highly inefficient. In the new version, we have implemented a more effective two-stage training strategy: first, we pre-train at a resolution of 512x512; then, we employ a multi-scale strategy for fine-tuning. (This training strategy may also help speed up ControlNet training.)

How to Train

For training, you should install accelerate and convert your own dataset into a JSON file.
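
The JSON schema is defined by the dataset class in tutorial_train.py; as a rough sketch (the field names below are assumptions, so verify them against the training script), each entry pairs an image path, relative to --data_root_path, with its caption:

import json

# hypothetical layout -- check the dataset class in tutorial_train.py for the exact field names
data = [
    {"image_file": "00001.jpg", "text": "a photo of a cat sitting on a sofa"},
    {"image_file": "00002.jpg", "text": "a man riding a bicycle down a city street"},
]
with open("data.json", "w") as f:
    json.dump(data, f)

Then launch training with accelerate: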

accelerate launch --num_processes 8 --multi_gpu --mixed_precision "fp16" \
  tutorial_train.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5/" \
  --image_encoder_path="{image_encoder_path}" \
  --data_json_file="{data.json}" \
  --data_root_path="{image_path}" \
  --mixed_precision="fp16" \
  --resolution=512 \
  --train_batch_size=8 \
  --dataloader_num_workers=4 \
  --learning_rate=1e-04 \
  --weight_decay=0.01 \
  --output_dir="{output_dir}" \
  --save_steps=10000

Once training is complete, you can convert the weights with the following code:

import torch

# path to the raw training checkpoint saved by accelerate
ckpt = "checkpoint-50000/pytorch_model.bin"
sd = torch.load(ckpt, map_location="cpu")

image_proj_sd = {}
ip_sd = {}
for k in sd:
    if k.startswith("unet"):
        # the base UNet weights are not exported
        pass
    elif k.startswith("image_proj_model"):
        # projection network that maps CLIP image embeddings to prompt tokens
        image_proj_sd[k.replace("image_proj_model.", "")] = sd[k]
    elif k.startswith("adapter_modules"):
        # decoupled cross-attention (IP-Adapter) weights
        ip_sd[k.replace("adapter_modules.", "")] = sd[k]

torch.save({"image_proj": image_proj_sd, "ip_adapter": ip_sd}, "ip_adapter.bin")
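
The converted ip_adapter.bin can then be loaded back through the inference wrapper, for example (a sketch, assuming pipe, image_encoder_path, image, and device are set up as in the demo above):

from ip_adapter import IPAdapter

ip_model = IPAdapter(pipe, image_encoder_path, "ip_adapter.bin", device)
images = ip_model.generate(pil_image=image, num_samples=4, num_inference_steps=50, seed=42)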

Third-party Usage

Disclaimer

This project strives to positively impact the domain of AI-driven image generation. Users are granted the freedom to create images using this tool, but they are expected to comply with local laws and utilize it in a responsible manner. The developers do not assume any responsibility for potential misuse by users.

Citation

If you find IP-Adapter useful for your research and applications, please cite using this BibTeX:

@article{ye2023ip-adapter,
  title={IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models},
  author={Ye, Hu and Zhang, Jun and Liu, Sibo and Han, Xiao and Yang, Wei},
  journal={arXiv preprint arXiv:2308.06721},
  year={2023}
}


ip-adapter's Issues

Some Implementation Details Discussion

Hi, dear authors.

After reading the code, I found that you feed the image features after projection (concept features) into the adapter layers.

However, some related works, e.g. InstantBooth (https://arxiv.org/abs/2304.03411) and Subject Diffusion (https://arxiv.org/abs/2307.11410), inject the image token features (patch features) into the adapter layers in the UNet of SD (they use self-attention). It seems that patch features may contain more detailed information, so the model can better preserve the characteristics of the input images.

Concept features seem to carry more high-level semantic information. Maybe this choice makes IP-Adapter more flexible, but to some extent it loses some subject/identity-driven generation ability.

This is just my personal opinion; discussion is welcome.

adding noise to the uncond images

Have you experimented with adding a little noise to the zeroed tensors?

I made a few tests with the Plus model and the results are... interesting. Basically, instead of a zeroed embed I'm passing random +/- 0.5 noise (or even higher).

This is a quick example; sometimes the result is noticeably better. You just need to keep it low, otherwise it starts "burning" the image.

[Image: noise comparison example]

Wondering if this makes any sense.
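
For reference, the change amounts to replacing the zeroed unconditional image input with low-amplitude noise, roughly along these lines (a rough sketch, not the exact patch; names follow IPAdapterPlus.get_image_embeds):

import torch

def noisy_uncond_embeds(image_encoder, image_proj_model, clip_image, amplitude=0.5):
    # Hypothetical variant of IPAdapterPlus.get_image_embeds: encode low-amplitude
    # uniform noise instead of an all-zero image for the unconditional branch.
    noise = (torch.rand_like(clip_image) - 0.5) * 2 * amplitude
    hidden = image_encoder(noise, output_hidden_states=True).hidden_states[-2]
    return image_proj_model(hidden)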

datasets

First, thanks for your great work. I have a question about the differences between the prompt image and the real (target) image. Could you offer some examples?

bug when not using torch 2.0

File "/home/code/third_party/IP-Adapter/ip_adapter/attention_processor.py", line 192, in init
raise ImportError("AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.")
ImportError: AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.
Maybe I should modify the code below

if is_torch2_available:
    from .attention_processor import IPAttnProcessor2_0 as IPAttnProcessor, AttnProcessor2_0 as AttnProcessor, CNAttnProcessor2_0 as CNAttnProcessor
else:
    from .attention_processor import IPAttnProcessor, AttnProcessor, CNAttnProcessor

to

if is_torch2_available():

Doubts regarding the model

I'm just trying to understand how IP Adapter does these powerful manipulations.

So we are encoding the image into embeddings, combining them with the prompt's embeddings, and then creating an image based on that. I see we are changing the UNet, but I don't understand how.

I also ran into issues using embeddings directly instead of passing in the prompt as a string. Here are the two doubts in more detail:

1

What does this do:

input_ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids

I saw this in the set_ip_adapter function. Is this the one that's changing the VAE decoder? What does it do, and am I incorrect in assessing this?

2

I changed some function inputs in order to use prompt embeds directly, for example doing this:

input_ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids
max_length = pipe.tokenizer.model_max_length
concat_embeds = []
# encode the prompt in chunks of the tokenizer's max length
for i in range(0, input_ids.shape[-1], max_length):
    concat_embeds.append(pipe.text_encoder(input_ids[:, i: i + max_length])[0])
prompt_embeds = torch.cat(concat_embeds, dim=1)

I tried inputting this and commenting out the following in IP-Adapter's generate(...) function:

prompt_embeds = self.pipe._encode_prompt(...)

This, however, gave incorrect results in the output image. What am I doing wrong?

3 (Out of scope)

I know this is out of scope, but is there any way of also loading a LoRA model's weights into the pipeline? If I load a LoRA model using pipe.load_lora_weights(...) before calling ip_adapter, will it retain those weights?

Thanks

I know these are many questions. Thanks a lot in advance!

Edits:

Forgot to mention earlier, but this is some amazing work you guys did, I love the idea.

Differences between your results and the colab demo output

Greetings,
First of all thank you for this achievement, the potential of this tool is astounding.

I have a problem with the sample code:
Following the code in the Colab demo precisely, I notice some differences between the output I get and the one shown in your results.

I can't understand why.
Does anyone know why?

Note: I left SD 1.5 set as in the code.

About foreground segmentation

Hello, I have a question. During inference, the foreground objects need to be picked out using instance segmentation. Is this necessary during training? Also, what is the specific segmentation method used in the paper? Looking forward to your reply, thank you.

Images from base model getting ruined if I load IP_Adapter

base_model_path = "yiffymix16_32"
image_encoder_path = "models/image_encoder/"
ip_ckpt = "models/ip-adapter_sd15.bin"
device = "cuda"

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained("yiffymix16_32", safety_checker=None, controlnet=controlnet, torch_dtype=torch.float16)

# the below line is causing issues
ip_model = IPAdapter(pipe, image_encoder_path, ip_ckpt, device)

pipe = pipe.to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

I have the above code. Now when I run pipe() independently (without IPAdapter, for example image = pipe(...)), it produces hazy, unnatural images like this:

[Image: hazy, unnatural output]

But if I comment out the line ip_model = IPAdapter(pipe, image_encoder_path, ip_ckpt, device), pipe() gives proper results. Any idea why?

About train code

I ran tutorial_train.py and saved the related parameters of the 'unet' and 'ip-adapter_sd15.bin'. But when I load the UNet parameters with StableDiffusionPipeline, I get the warning:

weights of the model checkpoint were not used when initializing UNet2DConditionModel:
 ['down_blocks.0.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, mid_block.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, mid_block.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight']

I think the reason may be these lines in the training code:

else:
    layer_name = name.split(".processor")[0]
    weights = {
        "to_k_ip.weight": unet_sd[layer_name + ".to_k.weight"],
        "to_v_ip.weight": unet_sd[layer_name + ".to_v.weight"],
    }
    attn_procs[name] = IPAttnProcessor(hidden_size=hidden_size, cross_attention_dim=cross_attention_dim)
    attn_procs[name].load_state_dict(weights)

So, I want to know why it is set up like this, and how should I modify my inference code so that it does not produce such warnings?

Cannot find or load the image encoder from models/image_encoder/ on huggingface

Hi, I am trying to run the ip_adapter_controlnet_demo_new.ipynb notebook, however I keep getting the error below from the following line. I also tried to download the model manually but couldn't find it on Hugging Face. I definitely don't have a local directory with the same name.

Thank you so much for your help.

Line:

# load ip-adapter
ip_model = IPAdapter(pipe, image_encoder_path, ip_ckpt, device)

Error:


---------------------------------------------------------------------------
HFValidationError                         Traceback (most recent call last)
File c:\Users\avika\anaconda3\envs\interactive_dance_thesis\lib\site-packages\transformers\configuration_utils.py:675, in PretrainedConfig._get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    673 try:
    674     # Load from local folder or from cache or download from model Hub and cache
--> 675     resolved_config_file = cached_file(
    676         pretrained_model_name_or_path,
    677         configuration_file,
    678         cache_dir=cache_dir,
    679         force_download=force_download,
    680         proxies=proxies,
    681         resume_download=resume_download,
    682         local_files_only=local_files_only,
    683         token=token,
    684         user_agent=user_agent,
    685         revision=revision,
    686         subfolder=subfolder,
    687         _commit_hash=commit_hash,
    688     )
    689     commit_hash = extract_commit_hash(resolved_config_file, commit_hash)

File c:\Users\avika\anaconda3\envs\interactive_dance_thesis\lib\site-packages\transformers\utils\hub.py:428, in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs)
    426 try:
    427     # Load from URL or cache if already cached
--> 428     resolved_file = hf_hub_download(
...
    703 try:
    704     # Load config dict
    705     config_dict = cls._dict_from_json_file(resolved_config_file)

OSError: Can't load the configuration of 'models/image_encoder/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'models/image_encoder/' is the correct path to a directory containing a config.json file

Pytorch compile compatibility

Hi! Thank you for your amazing work!
When I compile the UNet with torch.compile(unet), the decoupled text/image attention seems to stop working. Do you have a fix for that?

Which CLIP model do you use?

I notice that you provide an image encoder in your own space; is it different from the models released by OpenAI?

"RuntimeError: Attempting to deserialize object on a CUDA device" error in Automatic1111 v1.6, ControlNet v1.1.409 - Apple Silicon OSX

Using the SD 1.5 IP-Adapter models in a v1.6.0 Automatic1111 environment (python 3.10.13, torch 2.0.1) with the latest ControlNet on Apple ARM architecture generates a random image and produces the console runtime error below. COMMANDLINE_ARGS="--skip-torch-cuda-test --upcast-sampling --no-half-vae --use-cpu interrogate".

2023-09-10 21:04:57,087 - ControlNet - STATUS - preprocessor resolution = 512
*** Error running process: /Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/controlnet.py
    Traceback (most recent call last):
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/modules/scripts.py", line 619, in process
        script.process(p, *script_args)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/controlnet.py", line 977, in process
        self.controlnet_hack(p)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/controlnet.py", line 966, in controlnet_hack
        self.controlnet_main_entry(p)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/controlnet.py", line 808, in controlnet_main_entry
        detected_map, is_image = preprocessor(
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/utils.py", line 75, in decorated_func
        return cached_func(*args, **kwargs)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/utils.py", line 63, in cached_func
        return func(*args, **kwargs)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/global_state.py", line 35, in unified_preprocessor
        return preprocessor_modules[preprocessor_name](*args, **kwargs)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/processor.py", line 350, in clip
        from annotator.clipvision import ClipVisionDetector
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/annotator/clipvision/__init__.py", line 81, in <module>
        clip_vision_h_uc = torch.load(clip_vision_h_uc)['uc']
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/modules/safe.py", line 108, in load
        return load_with_extra(filename, *args, extra_handler=global_extra_handler, **kwargs)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/modules/safe.py", line 156, in load_with_extra
        return unsafe_torch_load(filename, *args, **kwargs)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
        result = unpickler.load()
      File "/opt/homebrew/Cellar/[email protected]/3.10.13/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pickle.py", line 1213, in load
        dispatch[key[0]](self)
      File "/opt/homebrew/Cellar/[email protected]/3.10.13/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pickle.py", line 1254, in load_binpersid
        self.append(self.persistent_load(pid))
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 1142, in persistent_load
        typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 1116, in load_tensor
        wrap_storage=restore_location(storage, location),
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 217, in default_restore_location
        result = fn(storage, location)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 182, in _cuda_deserialize
        device = validate_cuda_device(location)
      File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 166, in validate_cuda_device
        raise RuntimeError('Attempting to deserialize object on a CUDA '
    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Selection of training data

Congrats again on the great work.

Could you please help clarify how the data subset was selected from LAION?

Thanks.

Questions

Hi! First of all, thanks for the really great work.

I see that you've recently added a face-conditioned model for SD 1.5; do you have any plans to release a similar model for SDXL? Also, could you give any estimate of how long it takes to train the IP-Adapter model? You mentioned 1M steps in your paper; could you elaborate on how many hours/days that is?

I also have a few improvement ideas, based on my experience:

  1. Why do you use the very final CLIP embeddings, even after projection? They are known to contain mostly semantic information and to lose details, while you are probably interested in the details of the original image as well. This is especially important for the face-conditioned model. Maybe taking some penultimate tokens would work better? (For example, SDXL takes hidden_states[-2] for text conditioning.)
  2. There are a lot of things in the training script that could be cached: text encodings, CLIP embeddings, VAE latents (since you're not doing augmentations anyway). Also, you mentioned using DeepSpeed in your paper, but I can't see it here. Does that mean this is just a draft of the training code and you have a more advanced one?
  3. For face conditioning it would make sense to mask the final loss and only apply it to the face area, to avoid penalising the model for predicting the wrong background, for example.

The different SD models

Thanks for your work

I observe that IP-Adapter can't generate similar images from the image prompt when the image prompt is in an anime character style. But when I use an anime-style DreamBooth model or the corresponding character LoRA, IP-Adapter performs better. I'd like to ask whether IP-Adapter only works well when the foundation model can already produce results similar to the image prompt.

IPAdapter for MultiControlNet

First of all, this work is truly amazing!
Could there also be support for MultiControlNet with IP-Adapter? I was trying Canny and inpaint ControlNets and faced errors.

Original dataset

Hi! Thank you for the amazing work! It works like a charm!
I wonder which dataset you used during training? Can you share more info about it? You specified in the paper that it is a subset of the LAION & COYO datasets. Maybe you have the parameters you used for filtering the data: aesthetic score threshold, p_unsafe / p_watermark, image size? And the proportion of LAION and COYO in your data?
Do you think the results would be different when trained on a smaller dataset, let's say 1M samples? Do you think the results would improve if trained at full resolution using variable aspect-ratio bucketing instead of center crop?

Issue with `load_ip_adapter`

Maybe you guys have seen this error before

Traceback (most recent call last):
  File "/pkg/modal/_container_entrypoint.py", line 351, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 437, in run_inputs
    res = imp_fun.fun(*args, **kwargs)
  File "/root/modal_testing/adapter.py", line 188, in run
    ip_model = IPAdapterPlus(
  File "/content/IP-Adapter/ip_adapter/ip_adapter.py", line 52, in __init__
    self.load_ip_adapter()
  File "/content/IP-Adapter/ip_adapter/ip_adapter.py", line 84, in load_ip_adapter
    self.image_proj_model.load_state_dict(state_dict["image_proj"])
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Resampler:
	Missing key(s) in state_dict: "latents", "proj_in.weight", "proj_in.bias", "proj_out.weight", "proj_out.bias", "norm_out.weight", "norm_out.bias", "layers.0.0.norm1.weight", "layers.0.0.norm1.bias", "layers.0.0.norm2.weight", "layers.0.0.norm2.bias", "layers.0.0.to_q.weight", "layers.0.0.to_kv.weight", "layers.0.0.to_out.weight", "layers.0.1.0.weight", "layers.0.1.0.bias", "layers.0.1.1.weight", "layers.0.1.3.weight", "layers.1.0.norm1.weight", "layers.1.0.norm1.bias", "layers.1.0.norm2.weight", "layers.1.0.norm2.bias", "layers.1.0.to_q.weight", "layers.1.0.to_kv.weight", "layers.1.0.to_out.weight", "layers.1.1.0.weight", "layers.1.1.0.bias", "layers.1.1.1.weight", "layers.1.1.3.weight", "layers.2.0.norm1.weight", "layers.2.0.norm1.bias", "layers.2.0.norm2.weight", "layers.2.0.norm2.bias", "layers.2.0.to_q.weight", "layers.2.0.to_kv.weight", "layers.2.0.to_out.weight", "layers.2.1.0.weight", "layers.2.1.0.bias", "layers.2.1.1.weight", "layers.2.1.3.weight", "layers.3.0.norm1.weight", "layers.3.0.norm1.bias", "layers.3.0.norm2.weight", "layers.3.0.norm2.bias", "layers.3.0.to_q.weight", "layers.3.0.to_kv.weight", "layers.3.0.to_out.weight", "layers.3.1.0.weight", "layers.3.1.0.bias", "layers.3.1.1.weight", "layers.3.1.3.weight". 
	Unexpected key(s) in state_dict: "proj.weight", "proj.bias", "norm.weight", "norm.bias".

@xiaohu2015 any ideas??

ip_adapter-plus-face for SDXL?

Hello,
is there an SDXL version of the "ip-adapter-plus-face" model, or is there a way to use it with SDXL?

Thank you :)

SDXL support

Really awesome work, thank you guys for that.

Do you have a plan for supporting SDXL?

invalid load key, '<'. [ip adapter plus face demo]

https://colab.research.google.com/drive/1_Vtos4PRqZWAg69sC9XuSBBt6Mw_aDCz?usp=sharing

ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device, num_tokens=16)


UnpicklingError                           Traceback (most recent call last)
in <cell line: 1>()
----> 1 ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device, num_tokens=16)

3 frames
/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
   1031                            "functionality.")
   1032 
-> 1033     magic_number = pickle_module.load(f, **pickle_load_args)
   1034     if magic_number != MAGIC_NUMBER:
   1035         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.

A1111: scaled_dot_product_attention

I'm getting the issue AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'.
I have the latest torch, ControlNet 1.1.409, renamed the model to .pth,
and everything looks good, it's just erroring out.

Seeing if anyone else has this issue.

Does fine-grained IPAdapterPlus support control net?

I changed the ControlNet demo from IPAdapter to IPAdapterPlus, using "models/ip-adapter-plus_sd15.bin" as the adapter checkpoint, but it failed to load the IP-Adapter:
ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device)

error log:

Cell In[8], line 2
      1 # load ip-adapter
----> 2 ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device)

File /mnt/data/aigc/IP-Adapter/ip_adapter/ip_adapter.py:52, in IPAdapter.__init__(self, sd_pipe, image_encoder_path, ip_ckpt, device, num_tokens)
     49 # image proj model
     50 self.image_proj_model = self.init_proj()
---> 52 self.load_ip_adapter()

File /mnt/data/aigc/IP-Adapter/ip_adapter/ip_adapter.py:84, in IPAdapter.load_ip_adapter(self)
     82 def load_ip_adapter(self):
     83     state_dict = torch.load(self.ip_ckpt, map_location="cpu")
---> 84     self.image_proj_model.load_state_dict(state_dict["image_proj"])
     85     ip_layers = torch.nn.ModuleList(self.pipe.unet.attn_processors.values())
     86     ip_layers.load_state_dict(state_dict["ip_adapter"])

File ~/anaconda3/envs/IP-Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:2041, in Module.load_state_dict(self, state_dict, strict)
   2036         error_msgs.insert(
   2037             0, 'Missing key(s) in state_dict: {}. '.format(
   2038                 ', '.join('"{}"'.format(k) for k in missing_keys)))
   2040 if len(error_msgs) > 0:
-> 2041     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
   2042                        self.__class__.__name__, "\n\t".join(error_msgs)))
   2043 return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for Resampler:
	size mismatch for latents: copying a param with shape torch.Size([1, 16, 768]) from checkpoint, the shape in current model is torch.Size([1, 4, 768]).

About inference result

The paper says that when the scale parameter is set to 0, the model is equivalent to the base text2img model.

But when I set the scale to 0 and compare with the original text2img base model, I find there are still some differences in the generated images (all other parameters remain the same).

Is it because of set_ip_adapter() and load_ip_adapter() in the inference code?

do ipadapter_face and ipadapter together

Can we try to use ipadapter_face and ipadapter together?
Maybe like this:
hidden_states = hidden_states + self.scale1 * ip_hidden_states + self.scale2 * ip_face_hidden_states

Automatic-1111

Hi, the paper is really interesting and your results kick ass. I'm not sure how to use this in Automatic1111 though; I tried putting the models in the ControlNet model folder but they weren't showing up. Any chance you can update the readme with the process? Thanks.

Idea: Perspective or scene adapter

Is it possible to train an adapter that grabs ONLY the perspective/scene/location condition from an image?

SD1 often struggles with good perspective and angles. This would fix a big problem.

sdxl model checkpoint load error .

I use self.ip_ckpt = "./IP-Adapter/sdxl_models/ip-adapter_sdxl.bin",
but when I load it for my SDXL inpainting model:
RuntimeError: Error(s) in loading state_dict for ImageProjModel: size mismatch for proj.weight: copying a param with shape torch.Size([8192, 1280]) from checkpoint, the shape in current model is torch.Size([32768, 1280]). size mismatch for proj.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([32768]).

Training code

Hi,

This project is just awesome, thanks for it!

Will you release the training/finetuning code?

Thanks

Models .bin not showing on controlnet a1111

Hi, I placed the models ip-adapter_sd15.bin, ip-adapter-plus_sd15.bin and ip-adapter-plus-face_sd15.bin into ../stable-diffusion-webui > extensions > sd-webui-controlnet > models, but when I restart A1111 they are not showing in the model field of ControlNet (1.1.4).

Thanks

Errors in using sdxl

Hello, the plugin you made is very useful. I used it the day before and it was fine, but the next day I got an error, and many others can't use it either. SD 1.5 is okay, but SDXL gives an error. My error message is:

Error occurred when executing IPAdapter:

Input type (torch.FloatTensor) and weight type (torch.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

File "D: AI StableSwarmUI dlbackend comfy ComfyUI execution. py", line 151, in recursive_ Execute

Output_ Data, output_ Ui=get_ Output_ Data (obj, input_data_all)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D: AI StableSwarmUI dlbackend comfy ComfyUI execution. py", line 81, in get_ Output_ Data

Return_ Values=map_ Node_ Over_ List (obj, input_data_all, obj. JUNCTION, allow_interrupt=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D: AI StableSwarmUI dlbackend comfy ComfyUI execution. py", line 74, in map_ Node_ Over_ List

Results. append (getattr (obj, func) (* * sliceDict (input_data_all, i)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D: AI StableSwarmUI dlbackend comfy ComfyUI custom_nodes IPAda

Can you run ip-adapter sdxl using Colab's free tier?

I have tried running it but always run into memory issues, and it terminates at this part: IPAdapterXL(pipe, image_encoder_path, ip_ckpt, device)

Even though loading the model with StableDiffusionXLPipeline.from_pretrained works fine. I also tried using accelerate but still face issues. Does this mean it's not possible to run it on Colab's free tier?

Preserving human face likeness

First off, congratulations on this project and thank you so much for your work!

I'm testing using your SDXL demo and am generally getting good results. However, my use case is really human "personalization", like DreamBooth. I've tried using your multimodal prompt with some of my images, and the likeness of the face is not quite what I'd like, meaning that the face shows variation where I wish it would be more true to the original input.

Do you have any suggestions on settings or values that I could tweak to try and improve this? Or other ideas?

Again, thanks for everything!

IP Adapter Plus Img-to-Img Pipeline

Hello,

thank you for this wonderful model!

I am trying to run the img2img pipeline using IP-Adapter Plus, following the example in the original notebook:

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    vae=vae,
    feature_extractor=None,
    safety_checker=None
)

ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device, num_tokens=16)

images = ip_model.generate(pil_image=image, num_samples=4, num_inference_steps=50, seed=seed, image=g_image, strength=scale)

but I am getting the following error

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[7], line 6
      4 for scale in [0.7, 0.75, 0.8, 0.95]:
      5     print(scale)
----> 6     images = ip_model.generate(pil_image=image, num_samples=4, num_inference_steps=50, seed=seed, image=g_image, strength=scale)
      7     grid = image_grid(images, 1, 4)
      8     display(grid)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/IP_Adapter/ip_adapter/ip_adapter.py:132, in IPAdapter.generate(self, pil_image, prompt, negative_prompt, scale, num_samples, seed, guidance_scale, num_inference_steps, **kwargs)
    129 if not isinstance(negative_prompt, List):
    130     negative_prompt = [negative_prompt] * num_prompts
--> 132 image_prompt_embeds, uncond_image_prompt_embeds = self.get_image_embeds(pil_image)
    133 bs_embed, seq_len, _ = image_prompt_embeds.shape
    134 image_prompt_embeds = image_prompt_embeds.repeat(1, num_samples, 1)

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/IP_Adapter/ip_adapter/ip_adapter.py:239, in IPAdapterPlus.get_image_embeds(self, pil_image)
    237 clip_image = self.clip_image_processor(images=pil_image, return_tensors="pt").pixel_values
    238 clip_image = clip_image.to(self.device, dtype=torch.float16)
--> 239 clip_image_embeds = self.image_encoder(clip_image, output_hidden_states=True).hidden_states[-2]
    240 image_prompt_embeds = self.image_proj_model(clip_image_embeds)
    241 uncond_clip_image_embeds = self.image_encoder(torch.zeros_like(clip_image), output_hidden_states=True).hidden_states[-2]

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:1311, in CLIPVisionModelWithProjection.forward(self, pixel_values, output_attentions, output_hidden_states, return_dict)
   1288 r"""
   1289 Returns:
   1290 
   (...)
   1307 >>> image_embeds = outputs.image_embeds
   1308 ```"""
   1309 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-> 1311 vision_outputs = self.vision_model(
   1312     pixel_values=pixel_values,
   1313     output_attentions=output_attentions,
   1314     output_hidden_states=output_hidden_states,
   1315     return_dict=return_dict,
   1316 )
   1318 pooled_output = vision_outputs[1]  # pooled_output
   1320 image_embeds = self.visual_projection(pooled_output)

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:866, in CLIPVisionTransformer.forward(self, pixel_values, output_attentions, output_hidden_states, return_dict)
    863 if pixel_values is None:
    864     raise ValueError("You have to specify pixel_values")
--> 866 hidden_states = self.embeddings(pixel_values)
    867 hidden_states = self.pre_layrnorm(hidden_states)
    869 encoder_outputs = self.encoder(
    870     inputs_embeds=hidden_states,
    871     output_attentions=output_attentions,
    872     output_hidden_states=output_hidden_states,
    873     return_dict=return_dict,
    874 )

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:195, in CLIPVisionEmbeddings.forward(self, pixel_values)
    193 def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
    194     batch_size = pixel_values.shape[0]
--> 195     patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
    196     patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
    198     class_embeds = self.class_embedding.expand(batch_size, 1, -1)

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
    462 def forward(self, input: Tensor) -> Tensor:
--> 463     return self._conv_forward(input, self.weight, self.bias)

File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
    455 if self.padding_mode != 'zeros':
    456     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    457                     weight, bias, self.stride,
    458                     _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
    460                 self.padding, self.dilation, self.groups)

RuntimeError: GET was unable to find an engine to execute this computation

Am I doing something wrong, or is this feature not implemented in IP-Adapter Plus?

How can I download the generated images?

I am using ip_adapter_multimodal_prompts (generation with multimodal prompts) in Python, but I want to save the generated images to a folder whenever I run this code. How do I do that after the line below?

# multimodal prompts
images = ip_model.generate(pil_image=image, num_samples=1, num_inference_steps=50, seed=42,
                           prompt="wearing a hat on the beach", scale=0.6)
