OmniSep: Unified Omni-modal Sound Separation

Abstract

Scaling up has brought tremendous success to vision and language in recent years. In audio, however, researchers face a major challenge in scaling up training data, as most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds as desired. Finally, OmniSep employs a retrieval-augmented approach known as Query-Aug, which enables open-vocabulary sound separation. Experimental evaluations on the MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate the effectiveness of OmniSep, which achieves state-of-the-art performance in text-, image-, and audio-queried sound separation tasks.
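As a rough illustration of the blending idea behind Query-Mixup, the sketch below mixes per-modality query embeddings with random weights that sum to 1. The abstract does not give the exact mixing rule, so the convex-combination form, the function name `query_mixup`, and the assumption that all modalities share one embedding space are illustrative choices rather than the released implementation.

```python
import random


def query_mixup(features, rng=random):
    """Blend per-modality query embeddings with random convex weights.

    features: dict mapping a modality name to its embedding (list of floats).
    Assumption: every embedding has the same length and lives in a shared space.
    Returns the mixed embedding and the weights that produced it.
    """
    names = sorted(features)
    dim = len(features[names[0]])
    if any(len(features[n]) != dim for n in names):
        raise ValueError("all query embeddings must share one dimension")

    # Draw strictly positive weights, then normalise them so they sum to 1.
    raw = [rng.random() + 1e-8 for _ in names]
    total = sum(raw)
    weights = {n: r / total for n, r in zip(names, raw)}

    # Weighted sum of the modality embeddings, one dimension at a time.
    mixed = [sum(weights[n] * features[n][i] for n in names) for i in range(dim)]
    return mixed, weights


# Toy two-dimensional embeddings standing in for real encoder outputs.
queries = {"text": [1.0, 0.0], "image": [0.0, 1.0], "audio": [1.0, 1.0]}
mixed, weights = query_mixup(queries, random.Random(0))
```

Because the weights are resampled on every call, a separator trained on `mixed` sees single-modal and composed queries as points in the same space, which is the unification the abstract describes.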

A. Sound Separation with Queries of Different Modalities.

A.1. Text-Query

Query | Mixture | Interference | Target | Prediction

A.2. Image-Query

Query | Mixture | Interference | Target | Prediction

A.3. Audio-Query

Query | Mixture | Interference | Target | Prediction

B. Sound Separation with Negative Queries.

Mixture | Interference | Target | OmniSep | OmniSep+Neg
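One simple way to picture a negative query is as a subtraction in embedding space: the positive query is pushed away from the sound to be removed. The sketch below assumes the combination (1 + alpha) * positive - alpha * negative; this formula, the weight `alpha`, and the function name `combine_queries` are assumptions for illustration and are not stated on this page.

```python
def combine_queries(positive, negative=None, alpha=0.5):
    """Combine a positive query embedding with an optional negative one.

    Assumed rule: (1 + alpha) * positive - alpha * negative.
    With no negative query, the positive embedding is returned unchanged.
    """
    if negative is None:
        return list(positive)
    if len(positive) != len(negative):
        raise ValueError("positive and negative queries must share one dimension")
    return [(1.0 + alpha) * p - alpha * n for p, n in zip(positive, negative)]


# Toy embeddings: keep the first direction, suppress the second.
keep = [1.0, 0.0]
remove = [0.0, 1.0]
combined = combine_queries(keep, remove, alpha=0.5)  # [1.5, -0.5]
```

Setting `alpha` to 0 recovers the plain positive query, so the same separator can serve both the OmniSep and OmniSep+Neg columns above.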

C. Sound Separation with Query-Aug.

Query | Mixture | Target | OmniSep | OmniSep+Query-Aug
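The abstract describes Query-Aug only as a retrieval-augmented approach. A minimal sketch of one such scheme, assuming the open-vocabulary query is replaced by its nearest neighbour (by cosine similarity) in a bank of queries the model already handles well, is shown below. The bank contents, the similarity measure, and the names `cosine` and `query_aug` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_aug(query, bank):
    """Return the (label, embedding) pair in `bank` most similar to `query`.

    bank: non-empty list of (label, embedding) pairs, assumed to hold queries
    from the training vocabulary.
    """
    if not bank:
        raise ValueError("the query bank must not be empty")
    return max(bank, key=lambda item: cosine(query, item[1]))


# Toy bank with two in-vocabulary queries.
bank = [("dog barking", [1.0, 0.0]), ("violin", [0.0, 1.0])]
label, embedding = query_aug([0.9, 0.2], bank)  # label == "dog barking"
```

The retrieved embedding, rather than the raw open-vocabulary one, would then be passed to the separator.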

D. AudioSep with Query-Aug.

* The AudioSep model referenced in this section is the official open-source version, trained on over 14,000 hours of audio data.
Query | Mixture | Target | AudioSep | AudioSep+Query-Aug

E. Sound Separation on Real Videos.

* All real video samples in this section are from the Paris Olympic Games.
* Click on the label to view the query content.
Original video | Image | Image + Audio
Sample1

Original video | Text | Text - Text (negative)
Sample2

Original video | Text | Text - Audio (negative)
Sample3

F. More Examples on Negative-Query.

Original video | AudioSep (Text) | OmniSep (Text) | OmniSep (Text) + Audio (negative)
Sample1
Original video | CLIPSEP (Image) | OmniSep (Image) | OmniSep (Image) + Audio (negative)
Sample2