
Applying machine learning and deep learning methods to audio analysis - gidim
https://www.comet.ml/blog/?p=916
======
jononor
As an introduction I guess this is OK. However, there are two major
limitations:

1: The feature extraction ends with mean-summarizing across the entire audio
clip - leaving no temporal information. This only works well for simple tasks.
At least mentioning something about analysis windows and temporal modelling
would be good, as the natural next step - be it an LSTM/GRU on the MFCCs, or a
CNN on the mel-spectrogram.
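To make point 1 concrete, here is a minimal sketch of keeping the time axis by
splitting the clip into overlapping analysis windows instead of
mean-summarizing. This is illustrative only, not the article's code;
`frame_audio` is a made-up helper, and the log-power spectrum stands in for
MFCCs:

```python
import numpy as np

def frame_audio(y, frame_length=2048, hop_length=512):
    """Split a 1-D audio signal into overlapping analysis windows.

    Returns an array of shape (n_frames, frame_length), preserving the
    temporal axis instead of collapsing it with a mean.
    """
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return y[idx]

# Per-frame features keep shape (time, features) - ready for an LSTM/GRU,
# or to be treated as an image by a CNN.
y = np.random.randn(22050)  # one second of fake audio at 22.05 kHz
frames = frame_audio(y)
spec = np.log1p(np.abs(np.fft.rfft(frames * np.hanning(2048), axis=1)) ** 2)
```

The resulting (time, features) matrix is exactly what temporal models consume;
mean-summarizing it along axis 0 would recover the article's clip-level
feature vector.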

2: The folds of the Urbansound8k dataset are not respected in the evaluation.
In Urbansound8k, different folds contain clips extracted from the same
original audio files, usually very close in time. So mixing the folds for the
test set means it is no longer entirely "unseen data". The model very likely
exploits this data leakage, as the reported accuracy is above SOTA (without
data augmentation) - unreasonable given the low-fidelity feature
representation. At least mentioning this limitation, and that the performance
number they give cannot be compared with other methods, would be prudent.

When I commented similarly on r/machinelearning the authors acknowledged these
weaknesses, but did not update the article to reflect it.

~~~
gidim
we're working on another version fixing the folds issue on Urbansound8k and
will update the article asap.

~~~
jononor
Nice!

~~~
gidim
just to clarify - are you referring to this experiment?
[https://www.comet.ml/demo/urbansound8k/be09e32700cd435fb6b55...](https://www.comet.ml/demo/urbansound8k/be09e32700cd435fb6b554befb01fc4c)

~~~
jononor
Sure, that demonstrates the issue. The problem is the use of
train_test_split(X, yy, test_size=0.2, ...) - it assumes independent samples,
which is violated for this dataset (because some clips come from the same
source audio files). The easiest (and completely acceptable) approach is to
use one fold as the validation data, one fold as the test set, and the
remaining folds for training.
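A minimal sketch of that fold-respecting split (the arrays here are dummy
stand-ins for the real Urbansound8k metadata and features):

```python
import numpy as np

# Dummy stand-ins for the Urbansound8k metadata:
# `fold` gives each clip's predefined fold (1..10).
fold = np.array([1, 1, 2, 3, 9, 10, 10, 8])
X = np.arange(len(fold) * 2).reshape(len(fold), 2)  # dummy features
y = np.zeros(len(fold))                             # dummy labels

val_fold, test_fold = 9, 10
train_mask = (fold != val_fold) & (fold != test_fold)

# Boolean masks replace train_test_split, so clips from the same source
# audio file never end up on both sides of the split.
X_train, y_train = X[train_mask], y[train_mask]
X_val, y_val = X[fold == val_fold], y[fold == val_fold]
X_test, y_test = X[fold == test_fold], y[fold == test_fold]
```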

This problem is unfortunately quite common even in academic papers using this
dataset, even though the dataset's authors warn about it.

EDIT: There is one more issue with Urbansound8k folds, and that is that the
difficulty of the various folds is quite different. So one should ideally
report the performance across all folds (mean/std or boxplot). But this is a
minor issue compared to data leakage.
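For the full picture, one would evaluate once per held-out fold and report the
mean/std across folds; a sketch (the `evaluate` lambda here is a dummy
stand-in for an actual train-and-test run):

```python
import numpy as np

def leave_one_fold_out(folds, evaluate):
    """Run one evaluation per held-out fold and summarize across folds."""
    accs = np.array([evaluate(test_fold=f) for f in np.unique(folds)])
    return accs.mean(), accs.std()

# `evaluate` is a placeholder: in practice it would train on the other
# folds and return test accuracy on `test_fold`.
folds = np.repeat(np.arange(1, 11), 3)
mean, std = leave_one_fold_out(folds, evaluate=lambda test_fold: 0.6 + 0.01 * test_fold)
```

Reporting `mean +/- std` (or a boxplot) then accounts for the varying
difficulty of the individual folds.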

PS: Nice use of the Comet.ml platform here, collaborating online on improving
the experimental setup :)

~~~
nikolaskaris
Hey jononor — we've updated the post to split the training and test sets based
on the folds. Good catch and thanks again for reporting this. Some of the
experiments in the project will still have the old code, but the blog post
will reflect this new train/test split.

~~~
jononor
Nice. Did you update the reported results also? I think they will change
quite a bit.

------
jononor
Warning: shameless self-promotion. For those who wish to go a bit beyond this
article, I gave a presentation on the topic at EuroPython.
[https://www.youtube.com/watch?v=uCGROOUO_wY](https://www.youtube.com/watch?v=uCGROOUO_wY)
It explains how to build models that can make use of temporal variations and
learn the feature representations based on the (Mel) spectrogram. Especially
suited if you are already familiar with image-classification using
Convolutional Neural Networks.

------
m0zg
As one of the long-suffering Comet.ml customers, I wish they'd spend more time
working on their site's performance and less on writing blog posts. It takes
multiple seconds for graphs to render, and leaving any part of Comet.ml UI
open in the browser leads to spinning fans and quick battery drain when
working from a laptop. The logging component will sometimes hang without
warning, and hang your training session as well. Bizarrely, there's no way to
show min/max metric values for ongoing and completed runs (AKA the only thing
a researcher actually cares about): you have to log them separately in order
to display them.

This is a weird field: these are not difficult problems to solve, yet as far
as I can tell, all of the popular choices available so far each suck in their
own unique way and there's no option that I know of that actually offers
convenience and high performance. FOSS options are barely existent, as well,
and they also suck.

For the things where Comet.ml would be too onerous to deal with, I still use
pen and paper.

~~~
alon7
We're actually very happy with Comet and have been using it on very large
projects (>50 researchers, 10k models). You can reduce the refresh interval
and the number of data points reported if things feel slow.

~~~
m0zg
I don't log that many points as it is: about 4K data points per run in total
(windowed average loss and LR every 25-30 batches, eval metrics every epoch),
for all metrics combined. I also log the same data to TensorBoard, which
renders everything pretty much instantaneously with no issues at all, even
though I tell it to not downsample beyond 5K samples per graph.

~~~
gidim
M0zg do you mind sending me an email with your project? Happy to look into it.
gideon _a t_ comet.ml

~~~
gidim
Also keep in mind that unlike TensorBoard, we keep your full data series
available in the API and only downsample the charts to 15k points.

------
syntaxing
Is there an easy way to detect a specific word and report its timestamps
throughout an audio sample? I've been trying to implement something like this
but wasn't sure how to approach it.

~~~
yorwba
If you already have the transcript without timestamps (e.g. for an audiobook
where you know the source text), you could use
[https://github.com/readbeyond/aeneas](https://github.com/readbeyond/aeneas) ,
which infers the timestamps by aligning text-to-speech output with the audio
using dynamic time warping.
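For intuition, here is a toy sketch of the dynamic-time-warping idea, heavily
simplified: real aligners like aeneas compare feature sequences such as MFCCs,
not raw scalars, and recover the alignment path rather than just the distance:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences.

    D[i, j] holds the minimal cumulative cost of aligning a[:i] with b[:j],
    allowing one element of either sequence to match several of the other.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

For example, [1, 2, 3] aligns with [1, 2, 2, 3] at zero cost, because DTW can
stretch the middle element to cover both 2s.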

If you don't have the transcript, you'd use a transcription service that also
gives you timestamps. E.g. there was a frontpage submission yesterday where
someone used AWS Transcription to count the number of words in each minute of
a talk:
[https://news.ycombinator.com/item?id=21635939](https://news.ycombinator.com/item?id=21635939)

