Represent a given set of audio inputs as a numeric vector, which can then, for example, be used as input features when fine-tuning for other ML/AI problems, or placed in an embeddings database for easy ANN search of similar audio clips. In the extreme case, it could facilitate better AI audio generation, similar to how CLIP can guide a VQGAN.
The 30-second minimum input is a bit of a bummer, though, since it may not allow much granularity in the resulting embeddings.
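
To make the embeddings-database idea a bit more concrete, here is a rough sketch of similarity search over clip embeddings. Everything in it is an assumption for illustration: the three-dimensional vectors are made-up toy values rather than real audio embeddings, the names `clip_index` and `nearest_clips` are hypothetical, and the search is an exact brute-force scan with cosine similarity, not a true ANN index (a real setup would likely swap in an approximate index once the collection gets large).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_clips(query, index, k=2):
    """Return the names of the k clips most similar to the query.

    This scans every stored vector (exact search); an ANN index would
    trade a little accuracy for speed on large collections.
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings; real ones would come from an audio model
# and have hundreds of dimensions.
clip_index = {
    "dog_bark": [0.9, 0.1, 0.0],
    "wolf_howl": [0.8, 0.3, 0.1],
    "piano_note": [0.0, 0.2, 0.9],
}

query_embedding = [0.85, 0.2, 0.05]
top_matches = nearest_clips(query_embedding, clip_index, k=2)
print(top_matches)
```

With these toy values the two animal sounds should rank above the piano note, which is the kind of "similar audio clips" lookup described above.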