VkFFT – Vulkan Fast Fourier Transform Library

zdw · on Sept 27, 2020

If I were a hiring person at AMD or Intel, I'd shortlist this guy for a job, as they need help competing against the headstart CUDA has in the GPU-base compute space.

slavik81 · on Sept 28, 2020

The AMD Math Libraries team is hiring [1], and one of the libraries they develop is rocFFT [2]. Disclosure: I work at AMD, though not on rocFFT.

[1]: https://jobs.amd.com/job/Calgary-GPU-Libraries-Software-Deve... [2]: https://github.com/ROCmSoftwarePlatform/rocFFT

tinus_hn · on Sept 29, 2020

The author lists his email address on the site and indicates he’s looking for a position.

slavik81 · on Sept 29, 2020

I should probably make it clear that I have nothing to do with the hiring process whatsoever. Though, it seems I could provide a referral.

jjeaff · on Sept 27, 2020

Ya, but the important question is can they invert a binary tree on a whiteboard?

umvi · on Sept 28, 2020

Just get a clear whiteboard, draw the binary tree, then flip the whiteboard 180 around the vertical axis so you are now looking through the back of the whiteboard.

repsilat · on Sept 28, 2020

Honestly, it's an O(1) solution depending on choice of data structure, possibly constant-factor more efficient through your program depending on language, and if you're programming on a whiteboard it might also arguably be the idiomatic way to do it in that context.

Limited in that you can only have one tree per whiteboard, though.

qppo · on Sept 28, 2020

"Write an FFT" is the DSP engineer interview question that's analogous to tree traversal algorithm whiteboarding. The hard part is remembering how a butterfly computation works, and you'll almost never need to implement it.

mangamadaiyan · on Sept 28, 2020

... or are leetcode-proficient, these days.

TomVDB · on Sept 28, 2020

One should hope that the non-CUDA GPU compute library ecosystem has already advanced beyond being able to calculate FFTs!

singhrac · on Sept 28, 2020

Sure, but if Nvidia/OpenAI/Google/Facebook have shown anything, it's that there's always more kernels to invent and train bigger nets with.

andi999 · on Sept 28, 2020

Last time I checked there was no good fft for AMD.

slavik81 · on Sept 28, 2020

What are the common applications for these sorts of GPU-accelerated FFTs? We mostly just solved problems analytically in undergrad, and the little bit of naive coding we did seemed pretty fast. I feel like this must be used for problems I would have learned about in grad school, if I had continued in electrical engineering.

DTolm · on Sept 28, 2020

I have used VkFFT to create GPU version of a magnetic simulation software Spirit (https://github.com/DTolm/spirit). Except for FFT it also has a lot of general linear algebra routines, like efficient GPU reduce/scan and system solvers, like CG, LBFGS, VP, Runge-Kutta and Depondt. This version of Spirit is faster than CUDA based software that has been out and updated for ~6 years due to the fact that I have full control over all the code I use. You might want to check the discussions on reddit for this project: https://www.reddit.com/r/MachineLearning/comments/ilcw2f/p_v... and https://www.reddit.com/r/programming/comments/il9sar/vulkan_...

Reelin · on Sept 28, 2020

Likely any HPC application that has an FFT somewhere in its pipeline and is otherwise amenable to being run on a GPU.

Fluid flow, heat transfer, and other such physical phenomena that you might want to simulate.

Phase correlation in image processing is another example. (https://en.wikipedia.org/wiki/Phase_correlation)

MD simulations rely on FFT but I'm not sure how much is typically (or can be) done on the GPU. For example, NAMD employs cuFFT on the GPU in some cases. (https://aip.scitation.org/doi/10.1063/5.0014475)

amelius · on Sept 28, 2020

Machine learning uses CNNs, which are directly based on FFTs.

hikarudo · on Sept 28, 2020

How are CNNs directly based on FFTs? Sure you can use CNNs with FFT features, but in my experience this is not common.

amelius · on Sept 28, 2020

Convolutions are typically computed using FFTs.

https://en.wikipedia.org/wiki/Convolution_theorem

DTolm · on Sept 28, 2020

He is not wrong, convolutions between an image and a small kernel can be done faster by direct multiplication than by padding the kernel and performing FFT + iFFT. This is what tensor cores are aiming to do really fast. However, doing a convolution betwen an image and a kernel with the similar size is the general use case for the convolution theorem and is the thing that is currently implemented in VkFFT.

hadeson · on Sept 28, 2020

It could be used to accelerate Convolutional Neural Nets training [0]

[0] https://arxiv.org/abs/1312.5851

enriquto · on Sept 28, 2020

If you could filter and focus raw radar data in realtime it would be really cool!

gorkish · on Sept 28, 2020

Software defined radio / RF DSP is another area where FFT and IFFT performance and accuracy are critical.

looping__lui · on Sept 28, 2020

Imaging. E.g., large convolutions.

HelloNurse · on Sept 28, 2020

The same as any FFT, but accelerated; with the tradeoff that the cost of moving data from and to the GPU needs to be amortized. It's also a good proof of concept for other kinds of GPU computations.

p1mrx · on Sept 28, 2020

How does using Vulkan for computation fit into the OpenCL/CUDA landscape? Is CUDA's proprietary nature doing meaningful harm, and does Vulkan help?

Jhsto · on Sept 28, 2020

You can run OpenCL kernels on Vulkan at least in theory: SPIR-V supports OpenCL memory model. CUDA might be machine translatable if you can compile into LLVM target (clang seems to have experimental support developed outside of Nvidia) which you then retarget into SPIR-V using a cross-compiler. The LLVM to SPIR-V cross-compiler however is limited in its translation for the time being.

In general, Vulkan is a thing which commands the GPU, but is not opinionated on what the language used to represent the kernel is as long as it compiles to SPIR-V. SPIR-V in itself is like parallel LLVM IR. If you look into the project source, the shaders are in GLSL which have been pre-compiled using a cross-compiler into SPIR-V. The C file you find on the project root constitutes as the loader program for the SPIR-V files.

Futhark project did some initial benchmarks on translating OpenCL to Vulkan. The results were mainly slowdowns. You can read about it in here: https://futhark-lang.org/student-projects/steffen-msc-projec...

jgavris · on Sept 28, 2020

We run OpenCL on top of Vulkan in a production application on Android, thanks to a project from Google / Codeplay and other contributors https://github.com/google/clspv. SPIR-V can't represent all of OpenCL, but maybe enough for most people's use cases.

pjmlp · on Sept 28, 2020

Badly, OctaneRender had moved away from Vulkan into CUDA, because they found out that Vulkan compute wasn't at the level that they wanted.

https://home.otoy.com/octane2020-rndr-released/

"OTOY | GTC 2020: Real-Time Raytracing, Holographic Displays, Light Field Media and RNDR Network"

https://www.youtube.com/watch?v=Qfy6CTaSHcc

littlestymaar · on Sept 28, 2020

I couldn't find any details about the migration on either links but it looks like they make massive use of Nvidia-specific features, so even with exactly the same performances it would make total sense to use Cuda just because the tooling is more mature.

pjmlp · on Sept 28, 2020

The video presentation at GTC clearly discusses it.

They moved into Optix 7 as backend.

littlestymaar · on Sept 28, 2020

The video being almost two hours long, I'm not surprised I missed it when skimming. Do you know by chance at which point of the video it is discussed?

pjmlp · on Sept 28, 2020

https://youtu.be/Qfy6CTaSHcc?t=932

littlestymaar · on Sept 28, 2020

Thanks. But I don't see how this fits with your previous statement:

> Badly, OctaneRender had moved away from Vulkan into CUDA, because they found out that Vulkan compute wasn't at the level that they wanted.

They mostly talk about Vulkan+Cuda interop which isn't really supported, and they explicitly said they consider rewriting everything using Vulkan to get rid of this issue. So from what I understand, they are still pretty bullish on Vulkan, but it will require a lot of work and it will take some time (“but probably won't be this year”).

pjmlp · on Sept 28, 2020

Optix doesn't do Vulkan and I doubt that NVidia will ever bother, and OTOY most likely will rather use they resources elsewhere like the new Metal render.

littlestymaar · on Sept 28, 2020

> and OTOY most likely will rather use they resources elsewhere like the new Metal render.

Maybe you're right, and this guy is wrong. But you gotta admit that contradicting the speaker of the video you're using as a source is unusual to say the least.

pjmlp · on Sept 28, 2020

Apparently watching a 2h video is too much to ask nowadays, or am I now supposed to link all snippets and blogs from OTOY just to win a couple Internet brownie points?

littlestymaar · on Sept 28, 2020

> Apparently watching a 2h video is too much to ask nowadays,

Sorry if I don't spend 2h of my time watching a video presenting the features included in the new version of a piece of software I will never use, just so we have the same knowledge about this product. You generously posted a link to a part of this video relevant to our discussion and I'm thankful for that, but I'm also puzzled because the speaker contradicts you in this very snippet you chose!

Good day.

querez · on Sept 28, 2020

"VkFFT aims to provide community with an open-source alternative to Nvidia's cuFFT library, while achieving better performance."

There are no error bars on the graphs, so it's very hard to judge if the minor differences are significant. I work in research, so probably I'm peculiar about this point, but: I'd expect better from anyone who's taken basic statistics. But from a quick look, it seems like the performance is pretty much just "on par".

It would also be nice to know how performance is on other hardware. I'm assuming it's tuned to nvidida GPUs (or maybe even the specific GPU mentioned). But how does this perform on Intel or AMD hardware? How does it compare to `rocFFT` or Intel's own implementation?

DTolm · on Sept 28, 2020

The FFT and iFFT are performed consecutively up to 1000 times and then each run is done 5 more times. The total result is averaged both for VkFFT and cuFFT and stays roughly the same between launches. The minor performance gains (5-20%) are noticeable. If you have a better testing technique, I am open to the suggestions.

I have tested VkFFT on Intel UHD620 GPU and the performance scaled on the same rate as most benchmarks do. There are a couple of parameters that can be modified for different GPUs (like the amount of memory coalesced, which is 32bits on Nvidia GPUs after Pascal and is 64bits for Intel). I have no access to an AMD machine, otherwise I would have refined the lauch configuration parameters for it too. I have not tested other libraries than cuFFT yet.

querez · on Sept 28, 2020

Thanks for the further clarification! If you ran this several times, you could calculate standard deviations or confidence intervals. It would be nice if you could report one such measure, so it's clearer that the differences are not just some random fluctuations. E.g. you could include them as error bars in your plots. You could also run a statistical test (in this case, a t-test is very easy to do) and report the p-value. Those are the things I'd expect my students to do if they'd have to do something like this for a report or a project, because it's the only way for people to judge if differences show clear signal or are just random fluctuations due to measurement noise.

Also: I should've said this in my first post already, which in hindsight might sound too negative: I think this is a cool project and you did a great job! I just thought this might improve the presentation of your results a bit.

DTolm · on Sept 28, 2020

GPU is a very consistent device, so the purpose of such big sample sizes and multiple launches with averaging is to reduce all the deviations almost to zero. The error is <1% in this case and showing it on the plot will not really change it. The values, however, change when I update the code and improve it, so this is by no means the final way the benchmark will look like. I will think on how to adress this better in the future, but for now I think the best solution if you doubt the results is to launch VkFFT and see what it outputs for yourself.

datenwolf · on Sept 28, 2020

> GPU is a very consistent device.

You'd think that, but I found all GPUs I'm using here to exhibit multimodal distribution of execution times in the FFT (this is for the cuFFT codepath). The GTX980 (not shown in the plot) and the Titan-X even have very prominent outliers. This is a figure that's going to be in the paper I'm currently writing:

https://dl.datenwolf.net/gpu_oct_benchmark_plots.pdf

I'm comparing the OCT processing execution times (with HOT caches, mind you) between a Titan-X and a GTX1080. The difference also shows up very prominently when looking at the kernel scheduling order as reported by NVPP.

DTolm · on Sept 28, 2020

I use the averaged data of 1000 merged launches and then average the end result over a number of runs. Merging FFT calls is actually the way how I use VkFFT in Vulkan Spirit (with some other shaders between), so this benchmark is fairly close to the real life application use case. My benchmark most likely averages out multimodal distribution effects by design.

datenwolf · on Sept 28, 2020

The OCT data we process comes in at about 4GSamples/s and my benchmark is for ~5ms of capture data, in the considered dataset 1D-FFT with a length of 2048 points and a block size of 128. It is not a synthetic benchmark, I'm measuring the real life application behavior here (and to eliminate the runtime behavior effects of the other parts I can flip a flag skipping over the DAQ codepath, working on allocated, but uninitialized buffers).

DTolm · on Sept 28, 2020

Small FFTs like 2048 only utilize one SM and the way they are given to the GPU may produce some fluctuations. It also depends on the way your code works. Synchronizations are also more impactful in this case. Do you launch a big grid that consists of multiple samples combined in a matrix or you launch each sample separately?

datenwolf · on Sept 28, 2020

I'm aware of all of that. And yes, we're very synchronization dependent. However we also spent a lot of time tinkering with the launch parameter and properly interleaving all synchronization events and fences due to our demands on achieving low latency.

Find our original publication here: https://doi.org/10.1364/BOE.5.002963

Since then we improved on that. For the resampling and complex tonemapping we determined empirically that a grid of 128 threads, each processing a whole line achieves the best throughput; there's a 2D parameter space of possible launch configurations and we brute force the whole thing (so far I didn't benchmark the RTX20xx and RTX30xx GPUs, but it was consistent between the GTX690 to GTX1080). The FFT plan is what cufftPlan1d is producing for a single axis transform over a 2D array, usually 2048 point FFT, but with up to 4096 lines (well, technically whatever the maximum dimension for 3D textures is).

> Do you launch a big grid that consists of multiple samples combined in a matrix

Of course!

> or you launch each sample separately?

Of course not, that'd be stupid.

DTolm · on Sept 28, 2020

Well, most likely I won't be able to help explaining the fluctuation easily then, as you have spent a lot of time on it already. It would be cool to try VkFFT in this usage scenario at some pont in the future though - it also can do 1D FFTs of grouped in matrix sequences.

datenwolf · on Sept 28, 2020

As I already mentioned over at https://www.reddit.com/r/vulkan/comments/i2ivzh/new_vulkan_f... I'm going to do that. And will let you know how it goes.

DTolm · on Sept 28, 2020

If you happen to need any assistance in refining VkFFT for your use case, feel free to contact me.

llukas · on Sept 28, 2020

Please check out cuFFTDx - you may be able to fuse parts of your pipeline on-chip.

eximius · on Sept 28, 2020

If it's multimodal, then averaging it out is the wrong thing to do. A histogram would be more appropriate to display the different modes.

Jhsto · on Sept 28, 2020

I think this guy will have no problem getting hired. Being conscious enough to push code online works so much better than the CV preparation courses. You know you're on the right path when you are asked to play up your CV abstract than to downplay it.

Personally, I would have a hard time hiring anyone without a Github account and less so working in a place where nobody has one.

ncmncm · on Sept 28, 2020

To me a Gitlab account, instead, would signify superior judgment.

adamnemecek · on Sept 28, 2020

Not if you want your work to be discovered.

ncmncm · on Sept 28, 2020

Only the best will discover it.

solipsism · on Sept 28, 2020

Getting downvoted, but this is no more arbitrary, myopic, and unfair to the applicant than the parent.

ncmncm · on Sept 28, 2020

The Microsoft Defense Force has been activated.

Despite it, my statement remains true: I do, in fact, adjudge candidates the more favorably for a Gitlab account than a Github account. It demonstrates conscious choice in a knee-jerk world.

(Microsoft doesn't need your assistance, boiz.)

FieryTransition · on Sept 28, 2020

If I ever got minus points for using github rather than gitlab in an interview, I think my opinion of the workplace would be that they would rather focus on minute details rather than actual problem solving and social skills.

ncmncm · on Sept 29, 2020

You would have no idea whether you got minus points for failing to make the jump to Gitlab.

solipsism · on Sept 30, 2020

You say that proudly? You'll judge a developer based on petty preferences, and then not do them the favor of telling them as much?

Must be a blast working there...

ncmncm · on Sept 30, 2020

Have you ever been told why you didn't get a job? Any one of them could have been for failing to make the jump.

That said, I have never nixed anybody for being only on Github. And, thank you, it is a blast, but mainly because we don't have meetings.

whateveracct · on Sept 28, 2020

Yep. I put all my new stuff on GitLab. It's just a better tool. Especially it's issue-tracking and CI (even considering actions)

I just link my GitLab very obviously on my GitHub

oxxoxoxooo · on Sept 27, 2020

What is "Native zero padding to model open systems"? And how come it is "up to 2x faster than simply padding input array with zeros"?

gct · on Sept 28, 2020

So you can pad your input array with zeros, but the algorithm doesn't know that it's padded, and will just compute with those zeros like any other value. If you could tell it that they were zeros it could take advantage of x*0=0 and x+0=x to significantly reduce computation. That's what I think that is.

DTolm · on Sept 28, 2020

That is almost the correct answer. To go even further, there are sequences that are completely full of zeros in the padded case of multidimensional FFTs and we can omit their FFTs entirely.

oxxoxoxooo · on Sept 28, 2020

Thank you for the reply! Could you be more specific? In the case of 1D FFT, the right half (possibly zero-padded) of the signal is completely mixed up with the left half after the first pass [of breath-first FFT]. If the right half was all zeros, would it still be twice as fast in 1D case? Do you have any pointers to literature which discusses this?

DTolm · on Sept 28, 2020

No, the 1D case will mostly save on the fact that it transfers 2x times less data from the vram to the chip. The up to 2x times increase in performance was mainly related to 2D and 3D cases, where only 1/4 or 1/8 of the data is nonzero. In 2D, when doing 1D FFTs over x-axis, we omit sequences after Ny/2 because we know they are full of 0 and thus their result will be 0. So we do 0.5Ny x-axis ffts and full Nx y-axis ffts. For a square system this will mean a drop from 2N to 1.5N sequences. In 3D the drop will be even bigger, from 3N^2 to (1/4+1/2+1)=1.75N^2 sequences (almost 2x).

Lichtso · on Sept 28, 2020

Very cool!

Seems a bit more feature complete than my take on the problem: https://github.com/Lichtso/VulkanFFT

Still, to beat CUDA with Vulkan a lot is still missing: Scan, Reduce, Sort, Aggregate, Partition, Select, Binning, etc.

DTolm · on Sept 28, 2020

I have some of these routines like Reduce and Scan in my other project https://github.com/DTolm/spirit. It also has implementations of linear algebra solvers like CG, VP, Runge-Kutta and some others. These routines have to be inlined in users shaders in some way to have a good performance. Releasing them as a standalone library will require some thinking due to the fact that some routines have multiple shader dispatches.

meisel · on Sept 28, 2020

Warning: LGPL license

ncmncm · on Sept 28, 2020

... which, being a header-only library, happens to place no restrictions or requirements of any kind on the calling program.

detaro · on Sept 28, 2020

I don't think it's that easy? LGPLv3 has an explicit carve-out for headers which makes that scenario easy, but this is 2.1...

loa_in_ · on Sept 28, 2020

Paragraph 5 of the LGPL version 2.1 states:

A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License.

Reelin · on Sept 28, 2020

I think there's a misunderstanding here.

By falling outside the scope of the license, the LGPL isn't viral in the way the GPL is. Code you write can use an LGPL licensed library but be licensed differently itself.

However, you (the developer) are still subject to various legal requirements if you make use of an LGPL licensed library! If you fail to meet those requirements (basically, allow relinking against a modified version of the library) then you are in violation of the license and (in most jurisdictions) subject to penalty under copyright law.

pabs3 · on Sept 28, 2020

The program seems like it would contain inlined code from the header though right?

detaro · on Sept 28, 2020

continuing:

> However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables.

When a "work that uses the Library" uses material from a header file that is part of the Library, the object code for the work may be a derivative work of the Library even though the source code is not. Whether this is true is especially significant if the work can be linked without the Library, or if the work is itself a library. The threshold for this to be true is not precisely defined by law.

That certainly does have consequences for what you can do with the software - the object code of your compiled program will include parts of the library.

meisel · on Sept 28, 2020

In that case, if header-only is outside the scope of the license, it begs the question why they would pick that license in the first place. But anyways, it doesn't seem clear from the passage how headers fit in, considering that these headers are not just APIs, they contain the implementation themselves.

kbumsik · on Sept 28, 2020

In this case I interpret LGPL as "Please don't maintain your own fork (with bugfixs) in your company internally, contribute to my repo directly", which make sense.

DTolm · on Sept 28, 2020

This was indeed what I was thinking in the first place, when I made VkFFT - your project doesn't have to be open-source, but please share your modifications to VkFFT. I think about switching it to MPL 2.0, is this one better for everybody?

Reelin · on Sept 28, 2020

Yes, I believe the MPL fits the usecase you describe.

As far as I understand it, proprietary code using an LGPL licensed library is more or less incompatible with templates in header files (!!!) since there's no way (AFAIK?) to relink against a modified version without providing your full source code. Supposedly the LGPLv3 provides an exception for header files but personally I wouldn't go anywhere near it because it seems quite vague - "small" macros, templates that are less than 10 lines (what constitutes a line?) etc.

So as currently licensed (LGPL), I don't think your library is usable as part of a proprietary project.

The MPL, in contrast, places no relinking requirements on the developer. You only have to share any changes you happen to make to the MPL licensed code.

ncmncm · on Sept 28, 2020

Your code, your license. Nobody else has earned a say.

Ask anybody who actually contributes, usefully, what they think. Their opinion might mean something.

andi999 · on Sept 28, 2020

Because they do not know, and are domain experts, not licence experts: like probably almost everybody else here.

ncmncm · on Sept 28, 2020

It doesn't "beg the question" (that being the name of a logical fallacy), but does invite it.

Sometimes LGPL is chosen out of confusion, sometimes symbolically. Whatever the legal demands it does not impose, it would be rude not to honor its intent.

fulafel · on Sept 28, 2020

Doesn't this mean there is no permission to use it, as that's the default situation?

ncmncm · on Sept 30, 2020

No, because it says you may re-publish provided that you abide by the following: [nothing].

phkahler · on Sept 27, 2020

Isn't LGPL 2.1 is an odd license for something like this? Does it produce a library?

microcolonel · on Sept 27, 2020

> Does it produce a library?

It is a library.

bialpio · on Sept 28, 2020

A _header-only_ library. Not sure how LGPL works for those - not much to avoid linking against... Throw it in your own .dll / .so and use that in your closed-source projects? Standard disclosure: IANAL.

formerly_proven · on Sept 28, 2020

Uh, no, it's not. The shaders are clearly part of the work, so you need to make sure that the shaders are "dynamically linked"; i.e. can be replaced by the end user with their own version in order to comply with the terms of the LGPL.

loa_in_ · on Sept 28, 2020

Paragraph 5 of the LGPL version 2.1 states:

A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License.

kbumsik · on Sept 28, 2020

I am confused with that paragraph. So if I statically compiled a project with Qt without any modification, does it fall outside the scope of LGPL as well?

rjeli · on Sept 28, 2020

IIRC it’s unclear if static linking LGPL is okay, so most people steer clear

cycloptic · on Sept 28, 2020

Static linking is okay and is allowed as a derivative work, but you need to ship object files.

https://www.gnu.org/licenses/gpl-faq.en.html#LGPLStaticVsDyn...

>(1) If you statically link against an LGPLed library, you must also provide your application in an object (not necessarily source) format, so that a user has the opportunity to modify the library and relink the application.

This is covered in sections 6 and 6a of the LGPL 2.1.

DTolm · on Sept 28, 2020

Hello, I am the author of VkFFT. When I made VkFFT I wanted license to be like this: your project doesn't have to be open-source, but please share your modifications to VkFFT. I think about switching it to MPL 2.0, is this one better for everybody?

drran · on Sept 28, 2020

Goal of LGPL is to protect freedom of developers to modify code of a library (e.g. to improve it, fix bugs, fix incompatibilities, port to new hardware, etc.) shipped as part of proprietary application. This freedom is important for developers of niche OS'es. This protection works, because of hard consequences for proprietary code in case of license violation: proprietary code must be released under GPL then.

MPL doesn't protect such developers, but it "forces" contribution of changes back to original code. In reality, it's ignored.

Header-only library cannot be compiled as separate library, so requirements of LGPL cannot be met by proprietary code.

cycloptic · on Sept 28, 2020

That would probably do what you want because it applies on a per-file basis and not to the entire program, but make sure you read the MPL FAQ and the fine print before making such a change: https://www.mozilla.org/en-US/MPL/2.0/FAQ/

snovv_crash · on Sept 28, 2020

Yes, MPLv2 does what you want. Consider that it's also used by Eigen. I strongly advocate for MPLv2 for math style libraries.

Seirdy · on Sept 28, 2020

It might be better to simply switch to the LGPL v3 or later, since that explicitly addresses header libs.

kbumsik · on Sept 28, 2020

So then the OP is a LGPL header-only library, which is always statically compiled in every source. Do we need to do the same thing, and if yes, how we provide an application to allow recompiling?

cycloptic · on Sept 28, 2020

If you can't use the object files to rebuild another version of the program with a modifier version of the library then no that probably wouldn't be compliant. In that case you would probably have to modify the library so it's not inlining every function. (Disclaimer: IANAL)

pabs3 · on Sept 28, 2020

The program seems like it would contain inlined code from the header though right?

fluffything · on Sept 28, 2020

> Support for big FFT dimension sizes. Current limits: C2C - (2^24, 2^15, 2^15),

What about bigger than big? > 2^29 or so ? Are these sizes for double precision ?

DTolm · on Sept 28, 2020

Currently, I hit the limit of maximum workgroups amount for one submit dispatch (this is why y and z axis are lower than x one for now). It can be removed by adding multiple dispatches to the code, which I will do in one of the next updates. To go past 2^24 I need to polish the four stage FFT algorithm to allow for >2 data transfers, which I have implemented, but not yet tested. There will also be a single precision limit in this range, as the twiddle factors values will be close to 1e-8 which will be close to a machine error.

bobowzki · on Sept 28, 2020

I wonder if this works on the raspberry pi with the new Vulkan drivers.

Mizza · on Sept 28, 2020

I'm very eager to see GPU acceleration make its way into audio production, which is all still heavily CPU bound.

A Free GPUFFT implementation will certainly help! Great work.

mmis1000 · on Sept 28, 2020

https://en.wikipedia.org/wiki/AMD_TrueAudio I believe AMD did that, but there is little to no softwares actually make use of it.

adamnemecek · on Sept 28, 2020

It's not gonna happen, audio is much less throughput intensive but a lot more latency sensitive.

reitzensteinm · on Sept 28, 2020

You can read off a GPU in 10us, which is just a single sample at 96khz.

If your entire stack lived in the GPU, and you're just reading out the result, this is trivial.

If you're constantly copying buffers back and forth because some effects are implemented in the CPU and some in the GPU, not so much!

It's probably the case that a full stack GPU implementation would blow what we have out of the water, but you'd lose your entire ecosystem in the process, so it's probably never going to happen.

cma · on Sept 28, 2020

Sony is trying some stuff along those lines with the PS5. They have one compute unit on the GPU with a few features fused off that is dedicated to audio.

Twirrim · on Sept 28, 2020

What's the latency for integrated GPUs?

reitzensteinm · on Sept 28, 2020

Crystalwell had a shared CPU/GPU L4 cache with ~50ns latency. I don't think there was a programming model where you could bounce data back and forth that fast, but I don't see a reason why the hardware wouldn't be capable of it.

codetrotter · on Sept 28, 2020

I would think a GPU might help if you have a lot of audio channels and a lot of effects on each channel.

But even if that is not the case, machine learning is making its way into music production tools more and more. No doubt a beefy GPU will be useful to a lot of music production professionals in the future at least, as the tools they are using begin to leverage ML more and more.

viraptor · on Sept 28, 2020

Why do you think it's not going to happen? And for which use case?

The time budget to refresh a video frame is 8ms on 120HZ if everything else came free. In practice closer to <4ms. So even looking at the close to worst conditions, that's about the delay of the sound traveling a meter - should be fine for a lot of real life applications.

necubi · on Sept 28, 2020

Audio processing is real-time, which means that you cannot miss your deadline. If you do miss you get audible glitches, whereas in graphics you just get a slowdown. For that reason, audio code is written in a very particular, real-time safe style that avoids locks, allocations, syscalls, and anything else that is not guaranteed to return within a bounded amount time.

How long the deadline is depends on your buffer size and sample rate. To my ear, buffer sizes of >128 samples (at a sample rate of 44.1 KHz) have detectable latency (although the amount of latency will depend also on how many applications are in your signal chain). At 128 samples you have just under 3ms to do your processing.

Also note that for graphics, the output is the GPU itself. So you don't need to wait for the output to move back to the CPU, it's already where it needs to be.

viraptor · on Sept 28, 2020

But let's keep some realistic context. The audio output is under 1 MBps. You can push that much over original PCI (not express) and ~1ms delay on PCI was "everything must be broken, reset the whole bus". Pushing audio samples both ways on PCIe will not be an issue.

https://www.cycfi.com/2019/04/gpu-dsp-latency/

> PCI-E 3.0 standard guarantees data transfer for 4 kb data with 1-2 μsec (3-10 round trip).

Copying CPU-GPU-CPU:

> size: 8192 bytes, time: 4.72 us,

This should not be a meaningful impact in any audio workflow.

andi999 · on Sept 28, 2020

What OS are we talking? It sounds like you need a real time os.

necubi · on Sept 29, 2020

MacOS has the ability to prioritize audio threads to avoid scheduling misses [0]. Linux has a set of patches [1] to give it soft real-time capabilities. I don't know as much about Windows, but I assume it has similar capabilities.

And it's not like this is ABS firmware, where someone might die if you miss your deadline, and where hard real-time OSes are used. But you do get glitches in the audio stream, which in performance contexts is still pretty bad.

[0] https://developer.apple.com/documentation/audiotoolbox/workg... [1] https://wiki.archlinux.org/index.php/Professional_audio#Real...

krzepah · on Sept 28, 2020

What are you talking about ? Most people working with sound today do so by using a DAW like Ableton, QBase or Fruity Loop that work on Windows or sometimes Mac Os

andi999 · on Sept 28, 2020

Comment before was saying you cannot miss a deadline; i didnt know this was possible with windows or macos.

viraptor · on Sept 28, 2020

They're not hard-realtime, so they definitely can miss a few samples. You can workaround that by reconfiguring an external device rather than generating the samples locally, but that's not what Ableton or FL do.

Win and Mac have soft-realtime scheduling and the mentioned software does what it can without guarantees https://help.ableton.com/hc/en-us/articles/209072289-How-to-... I've been to a live gig before where the performance was paused a few times because artist's mac was misbehaving.

colejohnson66 · on Sept 28, 2020

Could it be possible to “prerender” the audio on the GPU when it’s not being worked on (say, a track not being edited)? Then just play that track if it’s not edited before the user hits play?

Mizza · on Sept 28, 2020

This is classic way of reducing CPU usage, just bounce a part of a track to raw audio an play it back so it doesn't need to render in real time. A GPU doesn't really change the equation there.

There are some methods of synthesis which rely on FFT which can't really be done well in real-time with the CPU (PadSynth, PaulStretch) that I'm hoping this will help with.

singhrac · on Sept 28, 2020

I've heard credible claims that GPUs these days (esp. TPUs) have lower latency for big models than CPUs. I haven't really investigated, but I could see it happening if you give the TPU a huge L1 cache or something.

someguydave · on Sept 28, 2020

Perhaps for large calculations? Otherwise the PCI transfer delay would be a big latency hit?

adamnemecek · on Sept 28, 2020

Yeah until TPUs can directly communicate with the sound card, it sounds slow.

rektide · on Sept 28, 2020

may someday please someone help dethrone the underlord of AI & rise us up

person_of_color · on Sept 28, 2020

This guy will get a foot in but still have to do a gotcha interview loop