SALMONN: Speech Audio Language Music Open Neural Network

🚀🚀 Welcome to the repo of SALMONN!

SALMONN is a large language model (LLM) that accepts speech, audio event, and music inputs, developed by the Department of Electronic Engineering at Tsinghua University and ByteDance. Rather than handling speech-only or audio-event-only input, SALMONN can perceive and understand all kinds of audio inputs, and therefore acquires emergent capabilities such as multilingual speech recognition and translation and audio-speech co-reasoning. This can be regarded as giving the LLM "ears" and cognitive hearing abilities, which makes SALMONN a step towards hearing-enabled artificial general intelligence.

🔥 News

  • [10-08] ✨ We release the model checkpoint and the inference code of SALMONN!

🌟 Structure

The model architecture of SALMONN is as follows. A window-level Q-Former serves as the connection module: it fuses the outputs of a Whisper speech encoder and a BEATs audio encoder into augmented audio tokens, which are aligned with the LLM input space. The LoRA adaptor aligns the augmented LLM input space with its output space. A text prompt instructs SALMONN to answer open-ended questions about general audio inputs, and the answers are returned as the LLM's text responses.
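The window-level idea can be illustrated with a small, dependency-free sketch. It is not the SALMONN implementation: it assumes the two encoders emit the same number of frames, that fusion is frame-wise concatenation, and it uses mean pooling as a stand-in for the learned Q-Former queries. The function names, `window_size`, and `num_queries` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Frame = List[float]


def fuse_frames(speech: Sequence[Frame], audio: Sequence[Frame]) -> List[Frame]:
    """Concatenate speech-encoder and audio-encoder features frame by frame (assumed fusion)."""
    if len(speech) != len(audio):
        raise ValueError("both encoders must produce the same number of frames")
    return [list(s) + list(a) for s, a in zip(speech, audio)]


def split_into_windows(frames: Sequence[Frame], window_size: int) -> List[List[Frame]]:
    """Cut the frame sequence into consecutive windows; the last one may be shorter."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [list(frames[i:i + window_size]) for i in range(0, len(frames), window_size)]


def pool_window(window: Sequence[Frame], num_queries: int = 1) -> List[Frame]:
    """Stand-in for the Q-Former: emit num_queries tokens per window via mean pooling.

    A real Q-Former uses learned queries with cross-attention; this placeholder
    only shows that each window yields a fixed number of output tokens.
    """
    dim = len(window[0])
    mean = [sum(frame[d] for frame in window) / len(window) for d in range(dim)]
    return [list(mean) for _ in range(num_queries)]


def augmented_audio_tokens(speech, audio, window_size: int, num_queries: int = 1) -> List[Frame]:
    """Produce one short token sequence whose length scales with audio duration."""
    tokens: List[Frame] = []
    for window in split_into_windows(fuse_frames(speech, audio), window_size):
        tokens.extend(pool_window(window, num_queries))
    return tokens
```

With this layout the number of tokens handed to the LLM is the number of windows times `num_queries`, so longer audio yields proportionally more tokens instead of being squeezed into a fixed-size summary.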

⚡️ Demos

Compared with traditional speech and audio processing tasks such as speech recognition and audio captioning, SALMONN leverages the general knowledge and cognitive abilities of the LLM to achieve cognitively oriented audio perception, which greatly improves the versatility of the model and the richness of the tasks it can handle. In addition, SALMONN can follow textual commands, and even spoken commands, with a relatively high degree of accuracy. Since SALMONN is trained only on data with textual commands, following spoken commands is also a cross-modal emergent ability.

Here are some examples of SALMONN.

Audio          Response
gunshots.wav   sac
duck.wav       story
music.wav      mc

🌈 How to run inference in CLI

  1. Our environment: Python 3.9.17, with dependencies installed via pip3 install soundfile librosa torch==2.0.1 transformers==4.28.0 peft==0.3.0.
  2. Download Whisper large v2 to whisper_path.
  3. Download Fine-tuned BEATs_iter3+ (AS2M) (cpt2) to beats_path.
  4. Download Vicuna 13B v1.1 to vicuna_path.
  5. Download SALMONN v1 to ckpt_path.
  6. Run python3 cli_inference.py --ckpt_path xxx --whisper_path xxx --beats_path xxx --vicuna_path xxx on an A100-SXM-80GB GPU. You can then enter wav_path and prompt. Enjoy!
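The launch step above can be wrapped in a small helper that assembles the command and reports which downloads are still missing. This is a sketch, not part of the repository: only the script name and the four flags come from the steps above, while the helper names and the placeholder paths in the usage block are hypothetical.

```python
import shlex
from pathlib import Path
from typing import Dict, List

# The four flags listed in the launch step above.
REQUIRED_PATHS = ("ckpt_path", "whisper_path", "beats_path", "vicuna_path")


def build_command(script: str, paths: Dict[str, str]) -> List[str]:
    """Return the argument list for launching the given script with all four paths."""
    missing = [name for name in REQUIRED_PATHS if name not in paths]
    if missing:
        raise KeyError(f"missing path arguments: {', '.join(missing)}")
    command = ["python3", script]
    for name in REQUIRED_PATHS:
        command += [f"--{name}", str(paths[name])]
    return command


def missing_on_disk(paths: Dict[str, str]) -> List[str]:
    """Names of the configured paths that do not exist yet (i.e. downloads still to do)."""
    return [name for name in REQUIRED_PATHS if not Path(paths[name]).exists()]


if __name__ == "__main__":
    # Placeholder locations; replace them with wherever you saved each download.
    example_paths = {
        "ckpt_path": "/path/to/salmonn_v1",
        "whisper_path": "/path/to/whisper-large-v2",
        "beats_path": "/path/to/beats",
        "vicuna_path": "/path/to/vicuna-13b-v1.1",
    }
    print("Not found:", missing_on_disk(example_paths))
    print(shlex.join(build_command("cli_inference.py", example_paths)))
```

Passing a different script name as the first argument reuses the same flags for any other entry point that accepts them.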

🌈 How to launch a web demo

  1. Follow steps 1-5 of the CLI inference instructions above.
  2. Run python3 web_demo.py --ckpt_path xxx --whisper_path xxx --beats_path xxx --vicuna_path xxx on an A100-SXM-80GB GPU.
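Both the CLI and the web demo depend on the pinned package versions from the environment step. A quick check before loading any large checkpoint might look like the sketch below. The pinned versions are the ones listed in the environment step; the function names are hypothetical, and the suffix handling assumes that a locally installed torch build may report a version such as "2.0.1+cu117".

```python
from importlib import metadata
from typing import Callable, Dict

# Versions pinned in the environment step of this README.
PINNED = {"torch": "2.0.1", "transformers": "4.28.0", "peft": "0.3.0"}


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a marker string if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def find_mismatches(
    pinned: Dict[str, str] = PINNED,
    lookup: Callable[[str], str] = installed_version,
) -> Dict[str, str]:
    """Map each package whose installed version differs from the pin to what was found."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        # Ignore a local build suffix (the part after "+") when comparing.
        if found.split("+")[0] != wanted:
            mismatches[package] = found
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches()
    if problems:
        for package, found in problems.items():
            print(f"{package}: expected {PINNED[package]}, found {found}")
    else:
        print("All pinned packages match.")
```

The `lookup` parameter exists only so the comparison can be exercised without installing anything; in normal use the default is enough.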

👀 Team

Team Tsinghua: Wenyi Yu, Changli Tang, Guangzhi Sun, Chao Zhang

Team ByteDance: Xianzhao Chen, Wei Li, Tian Tan, Lu Lu, Zejun Ma

✨ Citation

If you find SALMONN great and useful, please cite our paper:

@article{tang2023salmonn,
      title={{SALMONN}: Towards Generic Hearing Abilities for Large Language Models},
      author={Tang, Changli and Yu, Wenyi and Sun, Guangzhi and Chen, Xianzhao and Tan, Tian and Li, Wei and Lu, Lu and Ma, Zejun and Zhang, Chao},
      journal={arXiv:2310.13289},
      year={2023}
}
