
Comments (14)


xiaohu2015 commented on July 28, 2024

@Laidawang hi, the IP-Adapter only needs to be trained on SD 1.5, but it can be used with most community models. For training, you need to prepare image-text pairs and convert the data into a JSON file:

```json
[
    {"text": "A dog", "image_file": "dog.jpg"},
    {"text": "A cat", "image_file": "cat.jpg"}
]
```

JasonSongPeng commented on July 28, 2024

Dear xiaohu,

May I ask a question about the JSON file of training data? Is the 'text' field similar to the captions we use when training a LoRA model? I mean, if my images contain many elements (a table, a chair, a carpet, etc.), how should I prepare the 'text'?

Looking forward to your reply.
Best,
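
For reference, a minimal sketch of how such a JSON file could be assembled from a folder of images with matching caption files; the folder layout and file names here are assumptions, not something from the repo:

```python
import json
from pathlib import Path

# Assumed layout: data/dog.jpg with its caption in data/dog.txt, etc.
data_dir = Path("data")
entries = []
for image_path in sorted(data_dir.glob("*.jpg")):
    caption_path = image_path.with_suffix(".txt")
    text = caption_path.read_text().strip() if caption_path.exists() else ""
    entries.append({"text": text, "image_file": image_path.name})

# Write the list of {"text", "image_file"} pairs the training script expects.
with open("train_data.json", "w") as f:
    json.dump(entries, f, indent=2)
```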

Laidawang commented on July 28, 2024

@xiaohu2015, thank you for your help. So we use the same image as both the CLIP input and the ground truth; won't this limit the variety of the image embedding? In my experiments, when the scale is high (0.9 or above) the model basically reconstructs the input image completely, but when it is low (0.3) it produces rather empty scenes.
I'm trying to use inpainting to create backgrounds for some small objects with this technique.

xiaohu2015 commented on July 28, 2024

@Laidawang you could adjust the scale and add some text prompts to get good results. For now we just use the same image as both condition and ground truth, which may limit the generation ability. We are also exploring better solutions.
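
As an illustration of that advice, a minimal inference sketch that lowers the image-prompt scale and adds a text prompt; it follows the pattern of the repo's demo notebooks, but the base model and checkpoint paths are assumptions to adapt to your setup:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from ip_adapter import IPAdapter  # from the IP-Adapter repo

# Assumed local checkpoint paths; adjust to your files.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
ip_model = IPAdapter(pipe, "models/image_encoder", "models/ip-adapter_sd15.bin", "cuda")

image = Image.open("reference.jpg")
# A moderate scale (e.g. 0.5-0.6) plus a text prompt trades off
# faithfulness to the reference image against prompt-driven variation.
images = ip_model.generate(
    pil_image=image,
    prompt="a photo on a wooden table, warm lighting",
    scale=0.6,
    num_samples=1,
    num_inference_steps=50,
)
```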

Laidawang commented on July 28, 2024

@xiaohu2015 I think you use a semantically consistent prompt and image during training, which will cause problems when the input image and the prompt are semantically inconsistent.
Maybe we can try training like this, for example: prompt: a cat; image: an empty scene; GT: a cat in that scene. Or vice versa: prompt: describes the scene; image: a cat; GT: the cat in the scene. I think this would separate the influence of the prompt and the input image at the embedding level.
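
To make the proposed scheme concrete, here is a hypothetical training sample for such a dataset; the field names are made up for illustration and do not exist in the repo's data format:

```python
# Hypothetical decoupled training triplet: the prompt and the image
# condition each carry only part of the ground truth, so the model
# must learn to combine them rather than treat them as redundant.
sample = {
    "text": "a cat",                       # prompt describes the subject
    "condition_image": "empty_scene.jpg",  # image prompt supplies the scene
    "gt_image": "cat_in_scene.jpg",        # GT shows the cat in that scene
}
```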

xiaohu2015 commented on July 28, 2024

@Laidawang you are right, but building such a dataset needs a certain amount of work; of course it would make the IP-Adapter more powerful (in fact, that is in our plan). By the way, we have trained an IP-Adapter which uses a face image as the image prompt (https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter-plus-face_demo.ipynb). During training, we use the face as the image condition, but the full image is the GT.

Laidawang commented on July 28, 2024

Wow, that's really nice.

Laidawang commented on July 28, 2024

In that case, how do you make such a dataset? Can you give an example?

xiaohu2015 commented on July 28, 2024

@Laidawang you can detect the face in the image, and crop it.
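
A minimal sketch of that detect-and-crop step, using OpenCV's bundled Haar cascade as one possible face detector; the margin and file names are arbitrary choices:

```python
import cv2

# Haar cascade shipped with OpenCV; any face detector would do.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("person.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for i, (x, y, w, h) in enumerate(faces):
    # Add a small margin around the detected box before cropping.
    m = int(0.2 * max(w, h))
    x0, y0 = max(x - m, 0), max(y - m, 0)
    crop = image[y0:y + h + m, x0:x + w + m]
    cv2.imwrite(f"face_{i}.jpg", crop)
```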

hkunzhe commented on July 28, 2024

> during training, we use the face as the image condition, but the full image is the GT.

That is to say, the batch["clip_image"] in the training script corresponds to the cropped image, and the batch["images"] corresponds to the full image?

xiaohu2015 commented on July 28, 2024

yes.
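
Based on that answer, a sketch of what a dataset yielding those two keys could look like; the face_file field is a hypothetical addition produced by the cropping step above, and the real training script's preprocessing may differ:

```python
import json

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from transformers import CLIPImageProcessor

class FaceIPDataset(Dataset):
    """Each JSON entry is assumed to look like:
    {"text": "...", "image_file": "full.jpg", "face_file": "face.jpg"}
    (face_file is hypothetical, produced by a face-cropping step)."""

    def __init__(self, json_path, size=512):
        self.items = json.load(open(json_path))
        self.clip_processor = CLIPImageProcessor()
        self.transform = transforms.Compose([
            transforms.Resize(size),
            transforms.CenterCrop(size),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ])

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        full = Image.open(item["image_file"]).convert("RGB")
        face = Image.open(item["face_file"]).convert("RGB")
        return {
            # Diffusion target: the full image.
            "images": self.transform(full),
            # CLIP condition: the cropped face.
            "clip_image": self.clip_processor(
                images=face, return_tensors="pt"
            ).pixel_values[0],
            "text": item["text"],
        }
```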

hkunzhe commented on July 28, 2024

Thank you for such a quick reply! I've tried the model ip-adapter-plus-face_sd15.bin, and I find it's still hard to preserve human face likeness, as discussed in #5. Do you think it would be better to replace the original CLIP with a face-specific CLIP model like FaRL, or do you have a better suggestion?

xiaohu2015 commented on July 28, 2024

@hkunzhe I think you can give it a try. With CLIP models I found that the adapter can only learn the rough structure of the face, so a face-specific model seems more promising. However, my early experiments using features from face recognition models did not work well; they are hard to train using only the diffusion loss.
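
For context on what "using features from face recognition models" involves: IP-Adapter injects image features as extra cross-attention tokens, so a face-embedding backbone would need a projection along the lines of the sketch below (similar in spirit to the repo's image projection module; the dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class FaceEmbedProj(nn.Module):
    """Projects a single face-recognition embedding (e.g. a 512-d
    ArcFace vector) into N extra context tokens for cross-attention.
    Illustrative sketch, not the repo's actual configuration."""

    def __init__(self, embed_dim=512, cross_attention_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.cross_attention_dim = cross_attention_dim
        self.proj = nn.Linear(embed_dim, cross_attention_dim * num_tokens)
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, face_embeds):  # (batch, embed_dim)
        tokens = self.proj(face_embeds)
        tokens = tokens.reshape(-1, self.num_tokens, self.cross_attention_dim)
        return self.norm(tokens)  # (batch, num_tokens, cross_attention_dim)

# The resulting tokens would be concatenated with the text tokens
# before the UNet's cross-attention layers.
example = FaceEmbedProj()(torch.randn(2, 512))  # -> shape (2, 4, 768)
```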

KevinChen880723 commented on July 28, 2024

@xiaohu2015 Thanks for your great work!
Did you pre-train the face recognition model before training it with the diffusion model, or train them simultaneously?
Could you briefly describe your previous experiments?
Thanks a lot for your help in advance! Have a nice day :)
