
Reproducible machine learning with PyTorch and Quilt - akarve
https://blog.paperspace.com/reproducible-data-with-pytorch-and-quilt/
======
p1esk
Oh, this resonates with me so much! I'm running 4 different DeepSpeech models
right now, each using a differently processed version of the LibriSpeech
dataset (mfcc/fbanks/linear spectrograms, deltas? energy? padding? etc.),
because the original DS papers didn't bother describing it, and every
implementation I found uses completely different methods and libraries.

Not to mention every one of those implementations packages its preprocessed
version into a different data format, and then creates a different data
pipeline (and I only looked at TensorFlow implementations).
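To make one of those unspecified knobs concrete: here is a minimal numpy sketch of the standard HTK-style delta-feature computation. The window size `N` and the edge padding are assumptions on my part, since (as the comment says) the papers don't pin these choices down:

```python
import numpy as np

def deltas(feats, N=2):
    """HTK-style delta features over a +/-N frame window.

    The window size N and the edge padding are assumptions --
    exactly the kind of detail the DS papers leave unspecified,
    so implementations quietly diverge here.
    """
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
               for n in range(1, N + 1)) / denom

mfcc = np.random.randn(100, 13)  # 100 frames x 13 cepstral coefficients
d = deltas(mfcc)
print(d.shape)  # (100, 13)
```

Appending deltas (and delta-deltas) roughly doubles or triples the feature dimension, so whether a given implementation uses them at all changes the model's input shape.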

~~~
stealthcat
Why don't you use STFT + Conv2D like Deep Speech 2 did? It works well in my
case.

~~~
p1esk
The DeepSpeech2 paper does not include any details about audio processing. I
see an older Baidu-Research implementation of DS1 that uses "log of linear
spectrogram from FFT energy". Also, there's a pytorch implementation [1],
where they use Librosa's STFT, is that what you're referring to?

That's two more implementations that I haven't considered. I'm sure most of
the processing steps under the hood are the same or similar, but as I'm not an
audio processing expert, I can't tell which method is better (and why).

And it's hard to tell if it "works well" because of or despite the way I
processed the files.

[1]
[https://github.com/SeanNaren/deepspeech.pytorch](https://github.com/SeanNaren/deepspeech.pytorch)
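To make "log of linear spectrogram from FFT energy" concrete, here is a minimal numpy sketch of that front end. The window length, hop, window function, and the `eps` added before the log are all assumptions; real implementations (e.g. Librosa's STFT) differ in exactly these details, which is why two "STFT front ends" rarely produce identical features:

```python
import numpy as np

def log_linear_spectrogram(signal, sample_rate=16000,
                           win_ms=20, hop_ms=10, eps=1e-10):
    """Log of the linear (power) spectrogram from FFT energy.

    Every knob (window length, hop, window function, eps) is an
    assumption here -- implementations pick different values.
    """
    win = int(sample_rate * win_ms / 1000)   # 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hanning(win)
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)               # (n_frames, win // 2 + 1)

# One second of noise at 16 kHz -> 99 frames x 161 frequency bins
feats = log_linear_spectrogram(np.random.randn(16000))
print(feats.shape)  # (99, 161)
```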

------
dkobran
In case you missed it, here's a link to the full training example that you can
run yourself:
[https://www.paperspace.com/console/jobs/jvqssfqawv5zn/logs](https://www.paperspace.com/console/jobs/jvqssfqawv5zn/logs)

Inference example:
[https://www.paperspace.com/console/jobs/js4mqzm91fj2lg](https://www.paperspace.com/console/jobs/js4mqzm91fj2lg)

Disclosure: I work on Paperspace

------
infinity0
A step in the right direction for machine learning in science, but they could
have done some more research into naming conflicts:

$ apt-cache show quilt

Package: quilt

[..]

Description-en: Tool to work with series of patches

Quilt manages a series of patches by keeping track of the changes each of them
makes. They are logically organized as a stack, and you can apply, un-apply,
refresh them easily by traveling into the stack (push/pop).

Quilt is good for managing additional patches applied to a package received as
a tarball or maintained in another version control system. The stacked
organization is proven to be efficient for the management of very large patch
sets (more than a hundred patches). As a matter of fact, it was designed by
and for Linux kernel hackers (Andrew Morton, from the -mm branch, is the
original author), and its main use by the current upstream maintainer is to
manage the (hundreds of) patches against the kernel made for the SUSE
distribution.

This package provides seamless integration into Debhelper or CDBS, allowing
maintainers to easily add a quilt-based patch management system in their
packages. The package also provides some basic support for those not using
those tools. See README.Debian for more information.

$ zcat /usr/share/doc/quilt/changelog.gz | tail -n3

Version 0.26 (Tue Oct 21 2003) - Change summary not available

~~~
akarve
i hear you. on pypi the name is uncontested so, at least in the python
ecosystem, there is only one quilt. that said, for future revisions we'll try
for a unique name because it can indeed be confusing, e.g. in the apt-get case.

------
jononor
Was not aware of Quilt for hosting datasets. Is it the go-to in this area?
What other alternatives are there?

~~~
eindiran
You can use AWS to host open datasets:
[https://aws.amazon.com/opendata/public-datasets/](https://aws.amazon.com/opendata/public-datasets/)

These are some other people working in roughly the same space:
[http://datproject.org/](http://datproject.org/)
[http://www.pachyderm.io/](http://www.pachyderm.io/)

But it does seem like Quilt is a go-to if you are looking for a "GitHub for
data" host.

~~~
jmaxfield
I use Quilt pretty much daily, and while I like AWS open datasets, I don't
think that project is as actively developed as Quilt is. The DAT project, on
the other hand, I really do like as a way to simply transfer large amounts of
data between contributors. That said, if you are just trying to get data out
there and have people use it freely for their own work, I think Quilt is the
better fit: datasets are searchable and easy to consume from Python (and I
think there's an R repo too).

------
cwyers
It seems to me like the machine learning algorithm here is mostly learning how
to add JPEG compression artifacts to images.

------
ForFreedom
Isn't quilt just blurring the pixels to an extent?

~~~
akarve
Quilt isn't doing the inference (the PyTorch model is). But in any case, no:
super-resolution is more than blurring; it's pixel inference.
[https://arxiv.org/abs/1609.05158](https://arxiv.org/abs/1609.05158)
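The core "pixel inference" step in that paper is a sub-pixel rearrangement (the same op PyTorch exposes as `nn.PixelShuffle`). Here is a pure-numpy sketch of it; the shapes are chosen for illustration only. The network *predicts* r*r values per low-resolution pixel, and this op merely lays them out on the high-resolution grid:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r).

    Each group of r*r channels at a low-res location becomes an
    r x r patch of the high-res output -- inferred pixels, not
    blurred copies of existing ones.
    """
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

lr = np.arange(16, dtype=float).reshape(4, 2, 2)  # 4 channels, 2x2
hr = pixel_shuffle(lr, 2)                          # 1 channel, 4x4
print(hr.shape)  # (1, 4, 4)
```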

------
rhacker
Please please please don't kill our favorite plot device. Make sure the
process takes exactly 3 days.

