
M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

This is the official repository for M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models.

🚀 Introduction

The M2UGen model is a music understanding and generation model capable of music question answering as well as music generation from text, images, videos and audio, in addition to music editing. The model uses MERT as the music encoder, ViT as the image encoder and ViViT as the video encoder, with MusicGen or AudioLDM2 serving as the music decoder. These components are connected to the LLaMA 2 model through adapters, which gives the model its multiple abilities. The model architecture is given in m2ugen.py.
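
As a rough orientation, the pipeline described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual code in m2ugen.py; all function and variable names below are made up for clarity.

# A simplified, hypothetical sketch of the M2UGen data flow (the real implementation
# lives in m2ugen.py; every name below is illustrative only).
def m2ugen_generate(modal_inputs, text_prompt, encoders, adapters, llama, music_decoder):
    # 1. Each modality is encoded by its frozen encoder (MERT for music,
    #    ViT for images, ViViT for videos).
    features = {m: encoders[m](x) for m, x in modal_inputs.items()}
    # 2. Multi-modal understanding adapters project encoder outputs into the LLaMA 2 embedding space.
    adapted = [adapters[m](f) for m, f in features.items()]
    # 3. LLaMA 2 attends over the adapted features together with the text prompt.
    llm_output = llama(adapted, text_prompt)
    # 4. The music decoder (MusicGen or AudioLDM2) is conditioned on the LLM output
    #    to produce generated or edited music.
    return music_decoder(llm_output)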

To train our model, we generate datasets using a music captioning and question answering model, i.e. the MU-LLaMA model. The dataset generation methods are given in the Datasets folder.

🤗 HuggingFace Demo

We have provided a HuggingFace Space to see our model in action: M2UGen/M2UGen-Demo.

🤖 Model Setup

We use Python 3.9.17 for this project; the library requirements are given in requirements.txt. Create a conda environment using

conda create --name <env> --file requirements.txt

Ensure that the installed NVIDIA driver supports CUDA 12, as required by the PyTorch 2.1.0 build used here.
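
A quick sanity check of the environment (assuming PyTorch 2.1.0 was installed from requirements.txt):

# Verify PyTorch version and GPU availability.
import torch

print(torch.__version__)          # expected: 2.1.0
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # should be True with a sufficiently recent driver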

Our model requires Meta's LLaMA 2 model weights; details on obtaining these weights are given on HuggingFace.

The trained checkpoints for our model are available here:

The required pretrained multi-modal encoders and music decoder models can be found here:

The ckpts folder should be organized as follows:

.
├── ...
├── M2UGen
│   ├── ckpts
│   │   ├── LLaMA
│   │   │   ├── 7B
│   │   │   │   ├── checklist.chk
│   │   │   │   ├── consolidated.00.pth
│   │   │   │   └── params.json
│   │   │   ├── llama.sh
│   │   │   ├── tokenizer.model
│   │   │   └── tokenizer_checklist.chk
│   │   ├── M2UGen-MusicGen
│   │   │   └── checkpoint.pth
│   │   ├── M2UGen-AudioLDM2
│   │   │   └── checkpoint.pth
│   │   └── knn.index
└── ...

Once downloaded, the Gradio demo can be run using these checkpoints.

For the model with the MusicGen music decoder:

python gradio_app.py --model ./ckpts/M2UGen-MusicGen/checkpoint.pth --llama_dir ./ckpts/LLaMA --music_decoder musicgen

For the model with the AudioLDM2 music decoder:

python gradio_app.py --model ./ckpts/M2UGen-AudioLDM2/checkpoint.pth --llama_dir ./ckpts/LLaMA --music_decoder audioldm2 --music_decoder_path cvssp/audioldm2

🗄️ Dataset Generation

We use the MU-LLaMA and MPT-7B models to generate the MUCaps, MUEdit, MUImage and MUVideo datasets. For each dataset, run the scripts in the Datasets folder in numbered order to generate it.

The datasets are also available for download here:

Apart from the generated datasets, M2UGen also utilizes the COCO and Alpaca datasets. For the COCO dataset, download the 2014 train dataset from here and place the files in the COCO folder under Datasets. The Alpaca dataset file is already provided under Datasets/Alpaca.

🔧 Model Training

To train the M2UGen model, run the train_musicgen.sh or train_audioldm2.sh script. The scripts are designed to train the model for all three stages with MusicGen and AudioLDM2 music decoders respectively.

The main model architecture is given in m2ugen.py, and the modified MusicGen and AudioLDM2 architectures are present in the musicgen and audioldm2 folders respectively. The data folder contains the Python files that handle dataset loading; the dataset.py file shows which datasets are used at each training stage. The code for the training epochs is in engine_train.py.
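
As a rough guide to how the staged training is organized, the sketch below shows one plausible stage-to-dataset mapping. The authoritative logic is in data/dataset.py; the assignments here are assumptions for illustration only.

# Hypothetical stage-to-dataset mapping; consult data/dataset.py for the actual logic.
STAGE_DATASETS = {
    1: ["MUCaps"],                                  # assumption: captioning data for the understanding adapters
    2: ["MUCaps"],                                  # assumption: text-music pairs for the music decoder adapter
    3: ["MUEdit", "MUImage", "MUVideo", "Alpaca"],  # assumption: instruction-style data for joint fine-tuning
}

def datasets_for_stage(stage: int) -> list[str]:
    """Return the dataset names assumed for a given training stage."""
    return STAGE_DATASETS.get(stage, [])

print(datasets_for_stage(3))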

🔨 Model Testing and Evaluation

To test the M2UGen model, run gradio_app.py.

usage: gradio_app.py [-h] [--model MODEL] [--llama_type LLAMA_TYPE] [--llama_dir LLAMA_DIR]
                      [--mert_path MERT_PATH] [--vit_path VIT_PATH] [--vivit_path VIVIT_PATH]
                      [--knn_dir KNN_DIR] [--music_decoder MUSIC_DECODER]

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Name of or path to M2UGen pretrained checkpoint
  --llama_type LLAMA_TYPE
                        Type of llama original weight
  --llama_dir LLAMA_DIR
                        Path to LLaMA pretrained checkpoint
  --mert_path MERT_PATH
                        Path to MERT pretrained checkpoint
  --vit_path VIT_PATH   Path to ViT pretrained checkpoint
  --vivit_path VIVIT_PATH
                        Path to ViViT pretrained checkpoint
  --knn_dir KNN_DIR     Path to directory with KNN Index
  --music_decoder MUSIC_DECODER
                        Decoder to use musicgen/audioldm2

To evaluate the M2UGen model and the other models compared in our paper, please refer to the Evaluation folder.

🧰 System Hardware Requirements

For training, stages 1 and 2 use a single 32GB V100 GPU, while stage 3 uses two 32GB V100 GPUs. For inference, a single 32GB V100 GPU is used. Loading the model checkpoint requires approximately 49GB of CPU memory.
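
The CPU memory figure most likely reflects materializing the full checkpoint in CPU memory before weights are moved to the GPU. A minimal sketch for inspecting a checkpoint on CPU only (the path follows the ckpts layout above; the checkpoint's internal structure is not documented here, so the printout is only for orientation):

# Load the checkpoint onto CPU to inspect it without a GPU.
import torch

ckpt = torch.load("./ckpts/M2UGen-MusicGen/checkpoint.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # print the first few top-level keys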

🫡 Acknowledgements

This code contains elements from the following repositories:

✨ Cite our work

If you find this repo useful, please consider citing:

@article{hussain2023m,
  title={{M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models}},
  author={Hussain, Atin Sakkeer and Liu, Shansong and Sun, Chenshuo and Shan, Ying},
  journal={arXiv preprint arXiv:2311.11255},
  year={2023}
}

m2ugen's Issues

Data decompression failure

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

How to generate the evaluation datasets?

Hi, thanks for sharing your work, it's very impressive. We'd like to evaluate the M2UGen model and the other compared models with your evaluation datasets. In Evaluation/Image2Music/evaluate.py, we noticed a file called MUImageEvalInstructions.json, which might be the split of the evaluation dataset. However, we couldn't find this file anywhere, neither on GitHub nor on HuggingFace.

Could you please share this file, or share the code for generating the evaluation dataset split? It would help a lot. Thanks!

Cannot set up using Llama-2

First of all, thank you for making your great research results open source!

I want to reproduce the research results, but there are some setup issues when using Llama-2.

The code below refers to params.json, but this file is not included in the Llama-2 release on HuggingFace.

with open(os.path.join(llama_ckpt_dir, "params.json"), "r") as f:

https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main
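
For context, Meta's original weight release ships params.json alongside consolidated.*.pth, while the HuggingFace-converted Llama-2-7b-hf repository uses config.json instead, which is why the loader fails there. A small hypothetical check (not part of the repository):

# Hypothetical helper: check whether a local LLaMA directory is in Meta's original
# format (params.json + consolidated.*.pth), which is what this codebase expects,
# rather than the HuggingFace transformers format (config.json).
import os

def is_original_llama_format(llama_ckpt_dir: str) -> bool:
    return os.path.exists(os.path.join(llama_ckpt_dir, "params.json"))

print(is_original_llama_format("./ckpts/LLaMA/7B"))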

It seems that the code is based on the original Llama (not Llama-2).
Could you please improve it so that it works with Llama-2?

Is llama2 model finetuned on all three stages?

The paper states that in the first training stage, all parameters except those of the Multi-modal Understanding Adapters are frozen. In my understanding, LLaMA 2 should only be fine-tuned in the third stage, but in the code it seems that LLaMA 2 is fine-tuned with LoRA in all three stages, because 'llama' and 'lora' appear among the trainable parameter names in the get_trainable_params function for all three stages.
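
One straightforward way to check this empirically is to list the trainable parameter names after the model has been prepared for a given stage. A generic sketch (model here is assumed to be the constructed M2UGen module):

# Generic sketch: print which parameters are left trainable for the current stage.
import torch.nn as nn

def print_trainable(model: nn.Module) -> None:
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(name)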

Errors in loading state_dict for M2UGen when running inference.py

Hi! I tried to run inference.py but encountered the error below, which seems to indicate some missing keys and shape mismatches.
I believe I have set up the checkpoint files correctly.
For the LLaMA model, I made a request to Meta and downloaded the 7B weights with a signed download link.
Other than that, I got everything from HuggingFace, including the knn index.

Not sure what I should fix at this point. I would appreciate it if you could give me some hints! I hope the problem is just with my setup.

Traceback (most recent call last):
  File "/workspace/M2UGen/M2UGen/inference.py", line 95, in <module>
    load_result = model.load_state_dict(new_ckpt, strict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for M2UGen:
        Missing key(s) in state_dict: "vit_model.encoder.layer.12.attention.attention.query.weight", "vit_model.encoder.layer.12.attention.attention.query.bias", 
...
...
        size mismatch for vit_model.embeddings.cls_token: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([1, 1, 1024]).
        size mismatch for vit_model.embeddings.position_embeddings: copying a param with shape torch.Size([1, 197, 768]) from checkpoint, the shape in current model is torch.Size([1, 197, 1024]).
        size mismatch for vit_model.embeddings.patch_embeddings.projection.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([1024, 3, 16, 16]).
        size mismatch for vit_model.embeddings.patch_embeddings.projection.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
...

Below is the command I used to run inference

python M2UGen/inference.py --video_file "/workspace/video-test.mp4" --model ./ckpts/M2UGen-MusicGen/checkpoint.pth --llama_dir ./ckpts/LLaMA --music_decoder musicgen

and this is the structure of the ckpts folder

├── LLaMA
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   ├── tokenizer.model
│   └── tokenizer_checklist.chk
├── M2UGen-MusicGen
│   └── checkpoint.pth
└── knn.index

The performance seems to deviate from what is demonstrated on the demo page.

Thank you for sharing this valuable work. However, I encountered some peculiarities when testing the model. It appears that the performance significantly deviates from what is demonstrated on the demo page. I'm curious if others have observed similar discrepancies, or perhaps I've made an error. If that's the case, I would appreciate any corrections from the authors.

I adhered to the instructions provided in the README to set up the Gradio environment and used the M2UGen-MusicGen-Medium model for inference, which performs best in the paper. My observations of the results are below.

  1. For music understanding: While the generated audio seems acceptable, the description is entirely inaccurate.

  2. For music editing: The edited music retains its original piano sound.

  3. For image-to-music: I tested three common instruments with simple images, but the recognition was completely off, and the generated music bore no relevance to the images.

  4. I tried to replicate a result from the demo page. However, despite running the experiments three times, I was unable to achieve the perfect results showcased in the demo.

  5. I tried to generate music that would suit the mood of an image without any instrument constraints. The image's mood should be perceived as exciting and fast-paced by common standards. However, the system returned a slow and calm piece of music.

All of the above experiments were conducted without cherry-picking and were generated on the first attempt. Please do not hesitate to correct me if I have made any mistakes.
