
Comments (14)


xiaohu2015 commented on July 28, 2024

@Laidawang hi, the IP-Adapter only needs to be trained on SD 1.5, but it can be used with most community models. For training, you need to prepare image-text pairs and convert the data into a JSON file:

```json
[
    {"text": "A dog", "image_file": "dog.jpg"},
    {"text": "A cat", "image_file": "cat.jpg"}
]
```

JasonSongPeng commented on July 28, 2024

Dear xiaohu,

May I ask a question about the JSON file of training data? Is the 'text' field similar to the captions we use when training a LoRA model? I mean, if my images contain many elements (a table, a chair, a carpet, etc.), how should I prepare the 'text'?

Looking forward to your reply.
Best,
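
For reference, a minimal sketch of how such a JSON file could be assembled from a folder of images with matching caption files; the folder layout and file names here are assumptions, not something from the repo:

```python
import json
from pathlib import Path

# Assumed layout: data/dog.jpg with its caption in data/dog.txt, etc.
data_dir = Path("data")
entries = []
for image_path in sorted(data_dir.glob("*.jpg")):
    caption_path = image_path.with_suffix(".txt")
    text = caption_path.read_text().strip() if caption_path.exists() else ""
    entries.append({"text": text, "image_file": image_path.name})

# Write the list of {"text", "image_file"} pairs the training script expects.
with open("train_data.json", "w") as f:
    json.dump(entries, f, indent=2)
```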

Laidawang commented on July 28, 2024

@xiaohu2015, thank you for your help. So we use the same image as both the CLIP input and the ground truth; won't this limit the variety of the image embedding? In my experiments, when the scale is high (0.9 or above) the model basically reconstructs the input image completely, but when it is low (0.3) it produces rather empty scenes.
I'm trying to use inpainting to create backgrounds for some small objects with this technique.

xiaohu2015 commented on July 28, 2024

@Laidawang you could adjust the scale and add some text prompts to get good results. For now we just use the same image as both condition and ground truth, which may limit the generation ability. We are also exploring better solutions.
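
As an illustration of that advice, a minimal inference sketch that lowers the image-prompt scale and adds a text prompt; it follows the pattern of the repo's demo notebooks, but the base model and checkpoint paths are assumptions to adapt to your setup:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from ip_adapter import IPAdapter  # from the IP-Adapter repo

# Assumed local checkpoint paths; adjust to your files.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
ip_model = IPAdapter(pipe, "models/image_encoder", "models/ip-adapter_sd15.bin", "cuda")

image = Image.open("reference.jpg")
# A moderate scale (e.g. 0.5-0.6) plus a text prompt trades off
# faithfulness to the reference image against prompt-driven variation.
images = ip_model.generate(
    pil_image=image,
    prompt="a photo on a wooden table, warm lighting",
    scale=0.6,
    num_samples=1,
    num_inference_steps=50,
)
```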

Laidawang commented on July 28, 2024

@xiaohu2015 I think you use a semantically consistent prompt and image during training, which will cause problems when the input image and the prompt are semantically inconsistent.
Maybe we can try training like this, for example: prompt: a cat; image: an empty scene; GT: a cat in that scene. Or vice versa: prompt: describes the scene; image: a cat; GT: the cat in the scene. I think this would separate the influence of the prompt and the input image at the embedding level.
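
To make the proposed scheme concrete, here is a hypothetical training sample for such a dataset; the field names are made up for illustration and do not exist in the repo's data format:

```python
# Hypothetical decoupled training triplet: the prompt and the image
# condition each carry only part of the ground truth, so the model
# must learn to combine them rather than treat them as redundant.
sample = {
    "text": "a cat",                       # prompt describes the subject
    "condition_image": "empty_scene.jpg",  # image prompt supplies the scene
    "gt_image": "cat_in_scene.jpg",        # GT shows the cat in that scene
}
```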

xiaohu2015 commented on July 28, 2024

@Laidawang you are right, but building such a dataset needs a certain amount of work; of course it would make the IP-Adapter more powerful (in fact, that is in our plan). By the way, we have trained an IP-Adapter which uses a face image as the image prompt (https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter-plus-face_demo.ipynb). During training, we use the face as the image condition, but the full image is the GT.

Laidawang commented on July 28, 2024

Wow, that's really nice.

Laidawang commented on July 28, 2024

In that case, how do you make such a dataset? Can you give an example?

xiaohu2015 commented on July 28, 2024

@Laidawang you can detect the face in the image, and crop it.
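
A minimal sketch of that detect-and-crop step, using OpenCV's bundled Haar cascade as one possible face detector; the margin and file names are arbitrary choices:

```python
import cv2

# Haar cascade shipped with OpenCV; any face detector would do.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("person.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for i, (x, y, w, h) in enumerate(faces):
    # Add a small margin around the detected box before cropping.
    m = int(0.2 * max(w, h))
    x0, y0 = max(x - m, 0), max(y - m, 0)
    crop = image[y0:y + h + m, x0:x + w + m]
    cv2.imwrite(f"face_{i}.jpg", crop)
```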

hkunzhe commented on July 28, 2024

> during training, we use the face as the image condition, but the full image is the GT.

That is to say, the batch["clip_image"] in the training script corresponds to the cropped image, and the batch["images"] corresponds to the full image?

xiaohu2015 commented on July 28, 2024

yes.
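
Based on that answer, a sketch of what a dataset yielding those two keys could look like; the face_file field is a hypothetical addition produced by the cropping step above, and the real training script's preprocessing may differ:

```python
import json

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from transformers import CLIPImageProcessor

class FaceIPDataset(Dataset):
    """Each JSON entry is assumed to look like:
    {"text": "...", "image_file": "full.jpg", "face_file": "face.jpg"}
    (face_file is hypothetical, produced by a face-cropping step)."""

    def __init__(self, json_path, size=512):
        self.items = json.load(open(json_path))
        self.clip_processor = CLIPImageProcessor()
        self.transform = transforms.Compose([
            transforms.Resize(size),
            transforms.CenterCrop(size),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ])

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        full = Image.open(item["image_file"]).convert("RGB")
        face = Image.open(item["face_file"]).convert("RGB")
        return {
            # Diffusion target: the full image.
            "images": self.transform(full),
            # CLIP condition: the cropped face.
            "clip_image": self.clip_processor(
                images=face, return_tensors="pt"
            ).pixel_values[0],
            "text": item["text"],
        }
```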

hkunzhe commented on July 28, 2024

Thank you for such a quick reply! I've tried the model ip-adapter-plus-face_sd15.bin, and I find it's still hard to preserve human face likeness, as discussed in #5. Do you think it would be better to replace the original CLIP with a face-specific CLIP model like FaRL, or do you have a better suggestion?

xiaohu2015 commented on July 28, 2024

@hkunzhe I think you can give it a try. With CLIP models I found that the adapter can only learn the rough structure of the face, so a face-specific model seems more promising. However, my early experiments using features from face recognition models did not work well; they are hard to train using only the diffusion loss.
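
For context on what "using features from face recognition models" involves: IP-Adapter injects image features as extra cross-attention tokens, so a face-embedding backbone would need a projection along the lines of the sketch below (similar in spirit to the repo's image projection module; the dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class FaceEmbedProj(nn.Module):
    """Projects a single face-recognition embedding (e.g. a 512-d
    ArcFace vector) into N extra context tokens for cross-attention.
    Illustrative sketch, not the repo's actual configuration."""

    def __init__(self, embed_dim=512, cross_attention_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.cross_attention_dim = cross_attention_dim
        self.proj = nn.Linear(embed_dim, cross_attention_dim * num_tokens)
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, face_embeds):  # (batch, embed_dim)
        tokens = self.proj(face_embeds)
        tokens = tokens.reshape(-1, self.num_tokens, self.cross_attention_dim)
        return self.norm(tokens)  # (batch, num_tokens, cross_attention_dim)

# The resulting tokens would be concatenated with the text tokens
# before the UNet's cross-attention layers.
example = FaceEmbedProj()(torch.randn(2, 512))  # -> shape (2, 4, 768)
```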

KevinChen880723 commented on July 28, 2024

@xiaohu2015 Thanks for your great work!
Did you pre-train the face recognition model before training it with the diffusion model, or train them simultaneously?
Could you briefly describe your previous experiments?
Thanks a lot for your help in advance! Have a nice day :)
