
Comments (4)

weixi-feng commented on August 17, 2024

Hi,

Thank you for your interest in our work. We also observe that a certain number of our images are similar to the Stable Diffusion outputs. As written in the manuscript, for the evaluations we discarded the 20% most similar pairs and randomly sampled around 1,500 image pairs for comparison, so the 5-8% improvement would be discounted if all images were considered. Among the winning cases in the head-to-head comparison, 31% were for “fewer missing objects”, 14.1% for “better-matched colors”, and 54.8% for “other attributes or details”. So it is within expectation to see many images with details enhanced (like the last one in "two blue sheep and a red goat", 00054/64 and 00058/68 in "a red bird and a green apple", and 00026/36 in "a white goat standing...").

I also tried the banana prompt and observed detail enhancement on the "bananas" in 3/10 cases, while the rest look similar. It could be that we randomly ran into a good initialization for Fig. 4, while in general the improvement is not significant for the banana prompt.

For the conjunction prompt (i.e., using "and" to connect two objects), you may want to try multiple keys and a single value. Note that the single value is not the plain encoding of the original prompt but an aligned version (see also eq. 5-6). This method is more likely to generate both objects simultaneously (like Fig. 1 (right) or Fig. 5 (top right)). However, as reflected in Table 2, these prompts seem quite challenging for existing T2I models, and we still expect incomplete compositions in most cases. If you run this codebase and fix the seed to 42, you should be able to get the following results.

[image: combine_images: multiple keys, single value (right) / single key, multiple values (middle) / regular (left)]
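In pseudocode, the "multiple keys, single value" attention looks roughly like this. This is a minimal numpy sketch of the idea, not the actual codebase: the function name and shapes are illustrative, each noun-phrase encoding contributes its own key matrix, every resulting attention map is applied to the one aligned value encoding, and the outputs are averaged.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_key_single_value_attention(Q, Ks, V):
    """Cross-attention with several key sets and one shared value.

    Q  : (n_img, d)  image-patch queries
    Ks : list of (n_txt, d) text keys, one per encoding
         (full prompt plus each noun phrase)
    V  : (n_txt, d)  the single aligned value encoding
    Returns the average of the attention outputs over all key sets.
    """
    d = Q.shape[-1]
    outs = [softmax(Q @ K.T / np.sqrt(d)) @ V for K in Ks]
    return np.mean(outs, axis=0)
```

With a single key set this reduces to ordinary cross-attention, which is one way to sanity-check an implementation.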

I am not entirely sure, but your implementation looks correct. You may use the following initialized noise patterns and see if they result in the same images as above.
init.zip
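For reference, the reason fixed init noise makes runs comparable is that, with a deterministic sampler, the initial latent is the only stochastic input. A toy numpy sketch of seed-fixed initialization follows; the actual codebase draws the latents with torch.randn, so this will not byte-match the .pt files in init.zip.

```python
import numpy as np

def init_latent(seed, shape=(1, 4, 64, 64)):
    """Draw the initial Gaussian noise latent for one sample.

    Shape (1, 4, 64, 64) corresponds to a 512x512 Stable Diffusion image.
    The same seed always yields the same latent, so two implementations
    fed this latent (and the same sampler settings) should be comparable.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape).astype(np.float32)
```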

As mentioned in the README, improvement is not guaranteed on every sample; it is a system-level effect. Overall, we find it hard to quantify compositionality in images, and we are still working on metrics beyond human evaluation to improve the experiment section. You may download this batch of examples to get a better sense of the overall performance. Please let me know if this helps and if you have further questions.

Thanks,
Weixi

from structured-diffusion-guidance.

Birch-san commented on August 17, 2024

thanks very much for the detailed response!
okay, looking at the samples from your Google Drive: yeah, they seem to vary in much the same way my own results did. that's heartening; maybe my reproduction is close or equivalent.

I tried to see if I could generate the same images as from your Google Drive (even by turning off structured diffusion and trying to match your vanilla results).

I used the 4 noised-latent .pt files you provided, the prompt "a red car and a white sheep", 15 steps of DPM-Solver++(2M) (k-diffusion), a 16-bit UNet, 32-bit latents, 32-bit sampling, Stable Diffusion 1.4, and diffusers.

with structured diffusion off:

these don't match the "baseline" samples from your Google Drive:

err, well 3.pt came out like 00000-0-a red car and a white sheep.png. seems more likely to be a coincidence though.

maybe I'll have to resort to getting the reference implementation running in order to do a comparison.

thanks for mentioning "multiple keys, single value". I think I built something like that along the way but deleted it, thinking I'd misunderstood. originally I had indeed written it so that it aligned all the noun-phrase embeddings onto one prompt embedding (instead of onto several). so I can undelete that and try it out.
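to make sure I rebuild the alignment step the same way: as I understand eq. 5-6, each noun phrase's stand-alone encoding is spliced back into the full-prompt encoding at the phrase's token positions. a numpy sketch with my own names (the real implementation operates on CLIP token encodings, and how padding/BOS/EOS are handled is an assumption here):

```python
import numpy as np

def align_np_encoding(prompt_enc, np_enc, span):
    """Overwrite the slice of the full-prompt encoding that covers one
    noun phrase with that phrase's stand-alone encoding.

    prompt_enc : (seq_len, d) encoding of the full prompt
    np_enc     : (span_len, d) encoding of the noun phrase's tokens
                 (special/padding tokens assumed already stripped)
    span       : (start, end) token positions of the phrase in the prompt
    """
    start, end = span
    assert np_enc.shape[0] == end - start
    out = prompt_enc.copy()      # leave the original encoding untouched
    out[start:end] = np_enc
    return out
```

tokens outside the span keep the full-prompt encoding, which is what makes the result an "aligned version" rather than a plain re-encoding.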

the "many images with details enhanced" point might explain why (my implementation of) structured diffusion upgraded my bird into a photorealistic one ("A red bird and a green apple"):

[images: standard | structured]

I guess the next step is for me to run the reference implementation on my machine, see if I can get the same baseline outputs, and then see if I can get the same structured diffusion outputs with my algorithm. if it is indeed an equivalent implementation, it'd mean we can enjoy improved perf (it does more work in parallel and fuses a lot of the multiplication) plus diffusers support. the downside, however, is that if my implementation is equivalent, then the results I got would be valid too (and they didn't come close to the best results in the paper). but I haven't tried "multiple keys, single value" yet, so that's worth exploring too.
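by "fuses a lot of multiplication" I mean stacking the per-encoding matmuls into one batched einsum instead of looping in python. a numpy sketch, illustrative only (names are mine); the two forms should agree exactly:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_loop(Q, Ks, V):
    # one attention per key set, averaged -- the naive loop
    d = Q.shape[-1]
    return np.mean([softmax(Q @ K.T / np.sqrt(d)) @ V for K in Ks], axis=0)

def attn_fused(Q, Ks, V):
    # stack the key sets and do all score matmuls in one batched einsum
    d = Q.shape[-1]
    K = np.stack(Ks)                                  # (m, n_txt, d)
    scores = np.einsum('qd,mtd->mqt', Q, K) / np.sqrt(d)
    return (softmax(scores) @ V).mean(axis=0)         # (n_img, d)
```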


weixi-feng commented on August 17, 2024

I forgot to mention that the prompt is "a white car and a red sheep", and the provided init noise patterns correspond to the 12 images displayed in the reply, not to any images in Google Drive.

Even with the same noise initialization, there might be some other randomness that causes slight differences between the provided images and your generated results, but you should be able to get the same 12 images using the codebase here. We are also working on a huggingface demo using Gradio, and hopefully we can make it available soon. Hope this helps!


Edenzzzz commented on August 17, 2024

yes, this is indeed very soon.
