$\textbf{Lumina-T2X}$: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

📰 News

  • [2024-05-10] 🔥🔥🔥 We released the technical report on arXiv.
  • [2024-05-09] 🚀🚀🚀 We released Lumina-T2A (Text-to-Audio) Demos. Examples
  • [2024-04-29] 🔥🔥🔥 We released the 5B model checkpoint and demo built upon it for text-to-image generation.
  • [2024-04-25] 🔥🔥🔥 Support 720P video generation with arbitrary aspect ratio. Examples 🚀🚀🚀
  • [2024-04-19] 🔥🔥🔥 Demo examples released.
  • [2024-04-05] 😆😆😆 Code released for Lumina-T2I.
  • [2024-04-01] 🚀🚀🚀 We released the initial version of Lumina-T2I for text-to-image generation.

🚀 Quick Start

To help you try our models quickly, we built several versions of the GUI demo site.

Lumina-T2I 5B model demo:

[node1]

Lumina-Next-T2I 2B model demo:

[node1] [node2]

For more details about training and inference, please refer to the Lumina-T2I README.md.

📑 Open-source Plan

  • Lumina-T2I (Training, Inference, Checkpoints)
  • Lumina-T2V
  • Web Demo
  • CLI Demo

Introduction

We introduce the $\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) capable of transforming textual descriptions into vivid images, dynamic videos, detailed multi-view 3D images, and synthesized speech. At the core of Lumina-T2X lies the Flow-based Large Diffusion Transformer (Flag-DiT)—a robust engine that supports up to 7 billion parameters and extends sequence lengths to 128,000 tokens. Drawing inspiration from Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space, and can generate outputs at any resolution, aspect ratio, and duration.
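As a rough illustration of what "flow-based" means here, the following is a minimal sketch of a flow matching objective in plain Python. It assumes the straight-line path between a noise sample and a data sample; this README does not state which path Lumina-T2X uses, so treat the interpolation as an assumption. All function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) are hypothetical and are not part of the Lumina-T2X codebase.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path: d/dt x_t = x1 - x0, constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)       # stand-in for the network
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def euler_sample(predict_velocity, x0, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    data = [1.0, -2.0, 0.5]

    # Toy "oracle" predictor that already knows the single data point.
    def oracle(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, data)]

    print(flow_matching_loss(oracle, data, random.Random(0)))
    print(euler_sample(oracle, [0.0, 0.0, 0.0], steps=10))
```

In a real model, `predict_velocity` would be the text-conditioned transformer, and sampling would start from Gaussian noise rather than zeros. The oracle only shows that a perfect velocity field drives the loss to roughly zero and carries the start point to the data point.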

🌟 Features:

  • Flow-based Large Diffusion Transformer (Flag-DiT): Lumina-T2X adopts the flow matching formulation and is equipped with many advanced techniques, such as RoPE, RMSNorm, and KQ-norm, demonstrating faster training convergence, stable training dynamics, and a simplified pipeline.
  • Any Modalities, Resolution, and Duration within One Framework:
    1. $\textbf{Lumina-T2X}$ can encode any modality, including images, videos, multi-views of 3D objects, and spectrograms, into a unified 1-D token sequence at any resolution, aspect ratio, and temporal duration.
    2. By introducing the [nextline] and [nextframe] tokens, our model can support resolution extrapolation, i.e., generating images/videos with out-of-domain resolutions not encountered during training, such as images from 768x768 to 1792x1792 pixels.
  • Low Training Resources: Our empirical observations indicate that employing larger models, high-resolution images, and longer-duration video clips can significantly accelerate the convergence of diffusion transformers. Moreover, by training on meticulously curated text-image and text-video pairs with high-aesthetic-quality frames and detailed captions, our $\textbf{Lumina-T2X}$ model learns to generate high-resolution images and coherent videos with minimal computational demands. Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA as the text encoder, requires only 35% of the computational resources of PixArt-$\alpha$.
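The unified 1-D token sequence described above can be sketched as follows. This is an illustrative toy, not the project's implementation: it assumes a [nextline] marker closes every row of latent tokens and a [nextframe] marker closes every frame, and the exact placement in Lumina-T2X may differ. The helpers `flatten_frames` and `sequence_length` are hypothetical names.

```python
NEXTLINE = "[nextline]"
NEXTFRAME = "[nextframe]"


def flatten_frames(frames):
    """Flatten frames (each a list of rows of tokens) into one 1-D sequence.

    Assumed layout: every row is followed by [nextline] and every frame
    is followed by [nextframe], so the 2-D/3-D shape can be recovered
    from the flat sequence alone.
    """
    seq = []
    for frame in frames:
        for row in frame:
            seq.extend(row)
            seq.append(NEXTLINE)
        seq.append(NEXTFRAME)
    return seq


def sequence_length(num_frames, height, width):
    """Length of the flattened sequence under the layout assumed above."""
    return num_frames * (height * (width + 1) + 1)


if __name__ == "__main__":
    image = [[["a", "b"], ["c", "d"]]]  # one frame, 2 rows x 2 tokens
    print(flatten_frames(image))
    # ['a', 'b', '[nextline]', 'c', 'd', '[nextline]', '[nextframe]']
    print(sequence_length(num_frames=1, height=2, width=2))  # 7
```

Because the shape lives in the markers rather than in a fixed grid size, the same flattening applies unchanged to a wider image or a longer clip, which is the property the resolution-extrapolation claim relies on.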

📽️ Demo Examples

Text-to-Image Generation


Text-to-Video Generation

720P Videos:

Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.

video_720p_1.mp4
video_720p_2.mp4

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

video_tokyo_woman.mp4

360P Videos:

video_360p.mp4

Text-to-3D Generation

multi_view.mp4

Text-to-Audio Generation

Note: Hover over the playbar and click the audio button on it to unmute the clip.

Prompt: Semiautomatic gunfire occurs with slight echo

Generated Audio:

semiautomatic_gunfire_occurs_with_slight_echo.mp4

Groundtruth:

semiautomatic_gunfire_occurs_with_slight_echo_gt.mp4

Prompt: A telephone bell rings

Generated Audio:

a_telephone_bell_rings.mp4

Groundtruth:

a_telephone_bell_rings_gt.mp4

Prompt: An engine running followed by the engine revving and tires screeching

Generated Audio:

an_engine_running_followed_by_the_engine_revving_and_tires_screeching.mp4

Groundtruth:

an_engine_running_followed_by_the_engine_revving_and_tires_screeching_gt.mp4

Prompt: Birds chirping with insects buzzing and outdoor ambiance

Generated Audio:

birds_chirping_repeatedly.mp4

Groundtruth:

birds_chirping_repeatedly_gt.mp4

More examples

⚙️ Diverse Configurations

We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.


📄 Citation

@article{gao2024luminat2x,
      title={Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers}, 
      author={Peng Gao and Le Zhuo and Ziyi Lin and Chris Liu and Junsong Chen and Ruoyi Du and Enze Xie and Xu Luo and Longtian Qiu and Yuhang Zhang and Chen Lin and Rongjie Huang and Shijie Geng and Renrui Zhang and Junlin Xi and Wenqi Shao and Zhengkai Jiang and Tianshuo Yang and Weicai Ye and He Tong and Jingwen He and Yu Qiao and Hongsheng Li},
      journal={arXiv preprint arXiv:2405.05945},
      year={2024}
}
