Audio Diffusion Research

  • Team members: 楊佳誠, 邱以中, 蔡桔析

Introduction

  • This is the final project for the NYCU_DLP course.
  • After reading the paper "Palette: A Simple, General Framework for Image-to-Image Translation," we found it interesting to investigate whether its ablation results also hold in the audio domain.
  • We compare the L1 and L2 losses and also evaluate the significance of self-attention and normalization in the audio diffusion architecture.
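As a rough illustration of the two objectives being compared, the sketch below computes an L1 and an L2 loss between a predicted and a target vector. It uses plain Python lists rather than the project's training code, and the function names and toy values are assumptions made for this example.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy values (made up): three small errors and one large error.
target = [0.0, 0.0, 0.0, 0.0]
pred = [0.1, 0.1, 0.1, 2.0]
print(l1_loss(pred, target))
print(l2_loss(pred, target))
```

Because L2 squares each error, the single large error accounts for a bigger share of the L2 loss than of the L1 loss, which is one reason the two objectives can behave differently.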

Method

Model

  • The audio diffusion backbone utilizes a U-Net architecture.
  • The class_embedding of the UNet2DModel is used to combine the class label with the time embedding, so that both act as conditional inputs to the model.
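One way to picture this conditioning is as an element-wise sum of a timestep embedding and a per-class embedding vector. The sketch below mimics that sum in plain Python. The embedding size, the random lookup table standing in for learned weights, and all function names are illustrative assumptions, not the project's code.

```python
import math
import random


def time_embedding(t, dim):
    """Sinusoidal embedding of an integer timestep t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def make_class_table(num_classes, dim, seed=0):
    """Stand-in for a learned embedding table: one random vector per class."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]


def conditional_embedding(t, label, table):
    """Sum the time embedding and the class embedding element-wise."""
    dim = len(table[label])
    return [a + b for a, b in zip(time_embedding(t, dim), table[label])]


table = make_class_table(num_classes=50, dim=8)  # 50 ESC-50 classes
emb = conditional_embedding(t=10, label=3, table=table)
print(len(emb))
```

The same timestep with a different label yields a different vector, which is how the class information reaches the network.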

Dataset

  • ESC-50 consists of 5-second recordings organized into 50 semantic classes, with 40 examples per class.
  • The classes fall into five main categories: Animals, Natural, Human non-speech sounds, Interior sounds, and Exterior noises.
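The class label of each clip can be recovered from its file name. The sketch below assumes the naming convention used by the public ESC-50 release, fold-clip_id-take-target.wav; the helper name and the sample file names are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def parse_esc50_name(filename):
    """Split an ESC-50 file name into (fold, clip_id, take, target)."""
    fold, clip_id, take, target = Path(filename).stem.split("-")
    return int(fold), clip_id, take, int(target)


# Sample file names, used only to exercise the parser.
names = ["1-100032-A-0.wav", "1-100038-A-14.wav", "2-100648-A-0.wav"]
per_class = Counter(parse_esc50_name(n)[3] for n in names)
print(per_class)
```

On the full dataset, a count like this should show 40 clips for each of the 50 targets.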

Data preprocessing

  • The .wav data is preprocessed into Mel spectrograms using audio_diffusion/scripts/audio_to_images.py.
  • The Mel spectrograms are normalized according to the experiment setting.
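The exact normalization depends on the experiment setting and is not spelled out here. As one possible setting, the sketch below applies min-max scaling to a spectrogram so that its values fall in [-1, 1]. The function name, the target range, and the toy values are assumptions.

```python
def min_max_normalize(spec, low=-1.0, high=1.0):
    """Linearly rescale a 2-D spectrogram (list of rows) into [low, high]."""
    flat = [v for row in spec for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant input: map everything to the midpoint
        mid = (low + high) / 2
        return [[mid for _ in row] for row in spec]
    scale = (high - low) / (hi - lo)
    return [[low + (v - lo) * scale for v in row] for row in spec]


# Toy 2x3 "spectrogram" in decibels (made-up values).
spec_db = [[-80.0, -40.0, 0.0], [-60.0, -20.0, -80.0]]
normalized = min_max_normalize(spec_db)
print(normalized)
```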

Training

  • Our code lives in the audio_diffusion/scripts folder.
  • Use train_unet.py to train the model.
  • Load the preprocessed data from the folder that audio_to_images.py generates.

Sampling

  • Use audio_diffusion/scripts/test_cond_model.py to generate samples. This program generates 40 .wav files for each of the 50 classes in ESC-50.
  • Several things need to be modified before you run this code:
    1. Modify the parser parameters.
    2. Replace the path on line 171 with the path to your pretrained U-Net weights (e.g. /unet/diffusion_pytorch_model.bin).
    3. Modify the model_index.json file in your saved model path:
    "mel": [
        "audio_diffusion", # change null to "audio_diffusion"
        "Mel"
    ],
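The model_index.json change in step 3 can also be applied with a short script instead of by hand. This is a minimal sketch that assumes the file is plain JSON with a top-level "mel" key; the helper name is hypothetical.

```python
import json
from pathlib import Path


def patch_mel_entry(model_index_path):
    """Set the "mel" entry of model_index.json to ["audio_diffusion", "Mel"]."""
    path = Path(model_index_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["mel"] = ["audio_diffusion", "Mel"]  # replaces the null first element
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Call it with the path to the model_index.json file inside your saved model folder. All other keys in the file are left unchanged.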

Evaluation

  • We use our model to generate audio for all 50 classes, producing 40 samples per class as evaluation data.
  • The FAD score measures the similarity between the evaluation data and the original data. A lower FAD score indicates a closer match between the distributions of the generated and real audio.
  • The CA score, computed with a pretrained Contrastive Language-Audio Pretraining (CLAP) model, assesses whether our model generates the correct sound for each class.
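The CA score is not defined in detail here. Assuming it is the fraction of generated clips whose CLAP-predicted class matches the class they were generated for, it reduces to a simple accuracy. The sketch below shows only that final step; the function name and labels are made up, and the CLAP prediction itself is not shown.

```python
def classification_accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("label lists must have the same length")
    if not expected:
        raise ValueError("label lists must not be empty")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Made-up labels for four generated clips; real predictions would come
# from a CLAP-based classifier.
expected = [0, 0, 1, 1]
predicted = [0, 2, 1, 1]
print(classification_accuracy(predicted, expected))  # 0.75
```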

Expected file

  • To compute FAD and CA, the path should contain 50 folders named 0 to 49 after their labels. Each folder should contain 40 .wav files generated for that class.
  • See audio_evaluate/Predict/L2 for an example.
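Before running the evaluation, it can help to check that a folder follows this layout. The sketch below is a hypothetical helper, not part of the repository, that reports missing class folders and folders with an unexpected number of .wav files.

```python
from pathlib import Path


def check_layout(root, num_classes=50, clips_per_class=40):
    """Return a list of problems found in the evaluation folder layout."""
    root = Path(root)
    problems = []
    for label in range(num_classes):
        folder = root / str(label)
        if not folder.is_dir():
            problems.append(f"missing folder: {label}")
            continue
        count = len(list(folder.glob("*.wav")))
        if count != clips_per_class:
            problems.append(
                f"folder {label}: {count} .wav files, expected {clips_per_class}"
            )
    return problems
```

An empty list means the layout matches. For example, check_layout("audio_evaluate/Predict/L2") should return no problems if that folder is complete.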

FAD

  • Use audio_evaluate/evaluate.py to compute the FAD score.
dir_1 = path to the ground-truth .wav files.
dir_2 = path to the generated .wav files.
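
For reference, FAD is the Fréchet distance between two Gaussians fitted to embeddings of the real and the generated audio. The symbols below are introduced here for illustration: \mu_r and \Sigma_r are the mean and covariance of the embeddings from dir_1, and \mu_g and \Sigma_g are those from dir_2.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The score is zero when the two Gaussians coincide and grows as their means or covariances move apart.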

CA

Demo

cat.mp4
crickets.mp4
frog.mp4

Reference

  1. https://github.com/teticio/audio-diffusion
  2. https://github.com/LAION-AI/CLAP
  3. https://github.com/gudgud96/frechet-audio-distance
  4. https://github.com/huggingface/diffusers
  5. https://arxiv.org/abs/2111.05826
