ugu11 / text2img-gen

Text-to-Image Generation with Diffusion Models

SDXL datasets

Training

Images at 256x256 resolution, augmented with random crops, flips, and rotations

ImageNet: 1.8 million images, 1000 object categories
OpenImages: 1.2 million images, 600 object categories
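
The augmentations listed above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's training code: the function names are hypothetical, images are nested lists of pixels, and rotations are assumed to be multiples of 90 degrees (the README does not say which angles are used). A real pipeline would more likely use an image library.

```python
import random

def random_crop(img, size, rng):
    """Cut a size x size window at a random position (img is a list of rows)."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def horizontal_flip(img):
    """Mirror every row left to right."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img, size=256, rng=None):
    """Random crop, then a coin-flip mirror, then 0 to 3 quarter turns."""
    rng = rng or random.Random()
    out = random_crop(img, size, rng)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    for _ in range(rng.randrange(4)):
        out = rotate90(out)
    return out
```

Because the crop is square, the quarter-turn rotations keep the 256x256 shape, so every augmented sample can be batched without resizing.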

Evaluation

COCO: 330,000 images, 80 object categories
ImageNet
LSUN

Metrics:

Fréchet Inception Distance (FID)
Inception Score (IS)
Learned Perceptual Image Patch Similarity (LPIPS)
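
As an illustration of how one of these metrics is computed, the Inception Score is the exponential of the mean KL divergence between each image's class distribution p(y|x) and the marginal p(y) over all generated images. The sketch below assumes the class probabilities are already available. In practice they come from a pretrained Inception classifier, which is not included here, and the function name is hypothetical.

```python
import math

def inception_score(probs):
    """probs: one list of class probabilities p(y|x) per generated image."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    mean_kl = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); zero-probability classes contribute nothing.
        mean_kl += sum(pj * math.log(pj / marginal[j])
                       for j, pj in enumerate(p) if pj > 0)
    return math.exp(mean_kl / n)
```

The score is 1 when every image gets the same prediction and reaches the number of classes when predictions are confident and evenly spread, so higher values indicate samples that are both recognizable and diverse.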

SDXL Future work:

Single stage: Currently, we generate the best samples from SDXL using a two-stage approach with an additional refinement model. This results in having to load two large models into memory, hampering accessibility and sampling speed. Future work should investigate ways to provide a single stage of equal or better quality.
Text synthesis: While the scale and the larger text encoder (OpenCLIP ViT-bigG) help to improve the text rendering capabilities over previous versions of Stable Diffusion, incorporating byte-level tokenizers or simply scaling the model to larger sizes may further improve text synthesis.
Architecture: During the exploration stage of this work, we briefly experimented with transformer-based architectures such as UViT and DiT, but found no immediate benefit. We remain, however, optimistic that a careful hyperparameter study will eventually enable scaling to much larger transformer-dominated architectures.
Distillation: While our improvements over the original Stable Diffusion model are significant, they come at the price of increased inference cost (both in VRAM and sampling speed). Future work will thus focus on decreasing the compute needed for inference, and increased sampling speed, for example through guidance-, knowledge- and progressive distillation. [FIXED WITH SDXL-TURBO]
Training formulation: Our model is trained in the discrete-time formulation of DDPM, and requires offset-noise for aesthetically pleasing results. The EDM-framework of Karras et al. is a promising candidate for future model training, as its formulation in continuous time allows for increased sampling flexibility and does not require noise-schedule corrections.
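
To make the last point concrete, here is a rough sketch of offset noise combined with the discrete-time DDPM forward step x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. Offset noise adds one extra Gaussian scalar per channel on top of the usual per-pixel noise, which lets the model learn to shift overall brightness. The function names and the default `strength` value are illustrative assumptions, and a single image is represented as nested lists of shape [channels][height][width].

```python
import math
import random

def offset_noise(channels, height, width, strength=0.05, rng=None):
    """Per-pixel Gaussian noise plus one shared random shift per channel."""
    rng = rng or random.Random()
    noise = []
    for _ in range(channels):
        shift = strength * rng.gauss(0.0, 1.0)  # one scalar for the whole channel
        noise.append([[rng.gauss(0.0, 1.0) + shift for _ in range(width)]
                      for _ in range(height)])
    return noise

def ddpm_forward(x0, noise, alpha_bar_t):
    """Noise x0 to timestep t: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [[[a * x + b * n for x, n in zip(x_row, n_row)]
             for x_row, n_row in zip(x_chan, n_chan)]
            for x_chan, n_chan in zip(x0, noise)]
```

With `strength=0` this reduces to standard DDPM noising, so the offset can be treated as a small correction layered on the usual training loop.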

SDXL Limitations:

Struggles to generate intricate structures, like hands.
Still does not achieve perfect photorealism.
Relying on large-scale datasets might introduce some social and racial biases.
Exhibits concept bleeding: it has difficulty generating images with multiple objects and concepts, and tends to merge or overlap distinct visual elements.
Struggles to render text in images. Character-level text encoders might fix this.
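
For the text-rendering limitation, the difference between character-level and byte-level inputs can be shown with a small sketch. Subword tokenizers such as the BPE tokenizers used with CLIP merge several letters into one id, so the encoder never sees the spelling directly, whereas the two hypothetical helpers below expose every character or byte. They only illustrate the input representation and are not a text encoder.

```python
def char_tokens(text):
    """One id per character: the Unicode code point."""
    return [ord(ch) for ch in text]

def byte_tokens(text):
    """One id per UTF-8 byte, so the vocabulary has at most 256 entries."""
    return list(text.encode("utf-8"))

print(byte_tokens("SDXL"))  # [83, 68, 88, 76]
print(char_tokens("é"))     # [233]
print(byte_tokens("é"))     # [195, 169]
```

The trade-off is sequence length: prompts become several times longer than with subword tokens, which raises the cost of the text encoder.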

SDXL Improvements to Attempt:

VAE similar to DALL-E 3: Replace the decoder of the VAE with a DDPM and apply Adversarial Diffusion Distillation instead of consistency distillation, or try both methods. DALL-E 3 saw improvements in the fine details of its images.
Image captioner: To handle longer and highly descriptive text prompts, train an image captioner to generate prompts from images and use it to build a dataset for training the text2img model, similar to DALL-E 3.
Transform the two-stage model into a one-stage model: Can the U-Net be conditioned to act either as a base model or as a refiner, so that a single model performs the two-stage inference (base + refinement)?
Character-level text encoders: Experiment with character-level text encoders to improve the text rendered in images.
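
The single-model idea above could look roughly like the following sketch. It assumes a hypothetical callable `model(latent, step, stage)` that performs one denoising step and receives a flag saying which role it should play. The 80/20 split between base and refinement steps is an illustrative default, not a value from this repository, and the hand-off is simplified compared with how the SDXL refiner is actually applied.

```python
def split_steps(num_steps, base_fraction=0.8):
    """Split step indices into a base part and a refinement part."""
    cut = round(num_steps * base_fraction)
    steps = list(range(num_steps))
    return steps[:cut], steps[cut:]

def run_single_model(model, latent, num_steps=40, base_fraction=0.8):
    """Run one conditioned model through both stages of the schedule."""
    base_steps, refine_steps = split_steps(num_steps, base_fraction)
    for step in base_steps:
        latent = model(latent, step, stage="base")
    for step in refine_steps:
        latent = model(latent, step, stage="refine")
    return latent
```

Only one set of weights has to be loaded, which is the memory saving this item is after. Whether a single U-Net can match two specialized models in quality is the open question.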
