In general, carefully written devectorized Julia will be faster than equivalent vectorized code because it can avoid allocating containers at intermediate stages of the computation. This surprises people coming from e.g. Matlab, R, or NumPy, because vectorized code is usually faster in those environments. In those languages, direct scalar code is slow, and vectorization is about expressing the computation in a form that calls out to fast C operations, but even the C pays a price to allocate intermediates. In Julia, the JIT emits fast machine code for scalar operations, so you don't need to contort your code into a form that calls out to C. There are JIT compilers for Matlab and Python too, but Julia's JIT is very high quality and is used for everything, all the time.
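A minimal NumPy sketch of the intermediate-allocation point (values are illustrative): each binary operator in a vectorized expression allocates a fresh temporary array, whereas explicit `out=` arguments, roughly what a devectorized Julia loop gets for free, reuse one buffer.

```python
import numpy as np

a, b = 0.05, 0.2
x = np.random.randn(100_000)

# Vectorized: b * x -> temp1, a + temp1 -> temp2, np.exp(temp2) -> result.
# Three arrays are allocated to produce one answer.
y_vec = np.exp(a + b * x)

# Same computation funneled through a single preallocated buffer.
buf = np.empty_like(x)
np.multiply(x, b, out=buf)
np.add(buf, a, out=buf)
np.exp(buf, out=buf)

assert np.allclose(y_vec, buf)
```

This is why the devectorized loop wins: the scalar version touches no heap at all inside the loop.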
The main changes I made here were hoisting computations of constants out of the loop and preallocating a 2D Float64 array to hold the results; each was worth roughly 2x. I've chosen the storage layout so that the array is accessed in column order. Both optimizations have their own sections in the Performance Tips chapter of the Julia Manual:
edit: on my machine the running time seems dominated by the generation of random numbers, rather than anything else.
perhaps another optimisation is to eliminate the `exp` from the inner loop and just do addition: if you're only interested in the final price, you can do a single `exp` at the end.
    paths[1, i] = st = s0
    for j in 2:M
        st *= exp(a + b*randn())
        paths[j, i] = st
    end
    paths[1, i] = log_st = log(s0)
    for j in 2:M
        log_st += a + b*randn()
        paths[j, i] = log_st
    end
Another approach is to keep reseeding a fast RNG with a slow but higher quality RNG.
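A hedged Python sketch of that idea (class name and reseed interval are my own invention): a fast PRNG serves most draws, and every so often it is reseeded from the OS entropy source, which stands in for the slow, higher-quality RNG.

```python
import os
import random

class ReseedingRNG:
    """Fast PRNG (Mersenne Twister here, as a stand-in) periodically
    reseeded from the slow but high-quality OS entropy source."""

    def __init__(self, reseed_every=1_000_000):
        self.reseed_every = reseed_every
        self.count = 0
        self.rng = random.Random(os.urandom(16))

    def gauss(self):
        self.count += 1
        if self.count % self.reseed_every == 0:
            self.rng.seed(os.urandom(16))  # slow, high-quality reseed
        return self.rng.gauss(0.0, 1.0)

rng = ReseedingRNG(reseed_every=10_000)
samples = [rng.gauss() for _ in range(50_000)]
```

Whether this preserves the statistical properties you need depends on the fast generator and the reseed schedule, so treat it as a sketch rather than a recipe.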
using Random

paths = zeros(Float64, M, I)
rands = zeros(Float64, M-1)
for i in 1:I
    randn!(rands)            # refill the per-path buffer of M-1 normals
    paths[1, i] = st = s0
    for j in 2:M
        st *= exp(a + b*rands[j-1])
        paths[j, i] = st
    end
end
Another option would be to generate all (M-1)×I random numbers ahead of time, but that would be less memory efficient.
the difference i see between generating numbers on-the-fly using boost and reading them from disk is 6.41s vs 0.32s, i.e. a 20x speedup, for 10 repetitions of 100,000 paths over 100 timesteps.
edit: here's the code - https://gist.github.com/fcostin/3ed3512ad167adf9c785
Consider averaging n numbers from a buffer that you've filled with m random variates with m < n: the standard deviation of your average will decrease like 1/sqrt(m), not 1/sqrt(n), so you might as well have just averaged the independent random numbers you produced instead of reusing some of them.
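A quick numerical check of that claim (all parameter values illustrative): average n numbers drawn with wraparound from a buffer of only m independent variates, and the spread of the averages tracks 1/sqrt(m), not 1/sqrt(n).

```python
import math
import random

def avg_with_reuse(n, m, rng):
    """Average n draws taken (with wraparound) from a buffer of m variates."""
    buf = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return sum(buf[k % m] for k in range(n)) / n

rng = random.Random(0)
n, m, trials = 10_000, 100, 2_000
avgs = [avg_with_reuse(n, m, rng) for _ in range(trials)]
sd = math.sqrt(sum(a * a for a in avgs) / trials)  # true mean is 0

# sd comes out near 1/sqrt(m) = 0.1, far above 1/sqrt(n) = 0.01,
# confirming that the reused draws buy no extra accuracy.
print(sd)
```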
Maybe the circular buffer idea would be useful in non-quantitative contexts, though, e.g. for simulating something in a game or a movie, where all you need is the appearance of randomness.
a) work slower than RNG – do too much work, wasting CPU;
b) work faster than RNG – produce very poor randomness.
Neither of these situations is good, and dealing with two threads in the first place is a bit of a complication. If you're going to do that, it's better to synchronize the two threads so that the work thread blocks when it needs more randomness than the RNG thread has produced and vice versa.
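A minimal sketch of that synchronization using a bounded queue (sizes and names are my own): `queue.Queue` with a `maxsize` blocks the producer when the buffer is full and the consumer when it is empty, which is exactly the two-way blocking described above.

```python
import queue
import random
import threading

BUF = queue.Queue(maxsize=1024)  # bounded buffer: blocks both sides
N = 10_000

def rng_thread():
    rng = random.Random(42)
    for _ in range(N):
        BUF.put(rng.gauss(0.0, 1.0))  # blocks when the work thread lags

def work_thread(out):
    total = 0.0
    for _ in range(N):
        total += BUF.get()            # blocks when the RNG thread lags
    out.append(total / N)

result = []
producer = threading.Thread(target=rng_thread)
consumer = threading.Thread(target=work_thread, args=(result,))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

Note that in CPython the GIL limits how much real parallelism this buys; the point here is only the blocking discipline, not throughput.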
I recently added a Monte Carlo based option pricing example, which can be seen here: https://github.com/arrayfire/arrayfire_examples/blob/master/...
I am bringing this up because all the different paths can be calculated in parallel. Usually without a matrix library in C++, you will end up having to write low level CUDA / OpenCL / SSE code to get the performance.
Here is what an intel i7 CPU does using our library + OpenCL:
Compute Device: Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz, 7875 MB, OpenCL Version: 1.2
Time for 100000 paths - vanilla method: 65.815 ms, barrier method: 68.271 ms
BTW, we are working on porting the library to other languages as well. Right now, it is supported for R, Java and Fortran.
Currently I use BSM; however, live performance is poor when extracting implied volatility from the NBBO of option spreads, since I use a naive approach to iterate and converge on the IV. I assume that's what people use CUDA and GPUs for: calculating the greeks and pricing of the whole US option chain series in real time.
I'm curious whether using CUDA or a GPU rented from AWS would make performance faster in a) extracting IV from market ticks, b) figuring out break-even and hedging/adjustment points in different complex option spreads, and c) calculating expected payoff in terms of the Kelly criterion for a spread or potential adjustment.
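Before reaching for a GPU, the naive iteration in (a) can usually be replaced by Newton-Raphson on the Black-Scholes vega, which converges in a handful of steps per quote. A minimal sketch (all parameter values illustrative, and assuming a plain European call under BSM):

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(s, k, t, r, sigma):
    """Black-Scholes price of a European call."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)

def implied_vol(price, s, k, t, r, tol=1e-8, max_iter=50):
    sigma = 0.2                       # starting guess
    for _ in range(max_iter):
        d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
        vega = s * math.sqrt(t) * math.exp(-0.5 * d1 * d1) / math.sqrt(2 * math.pi)
        diff = bs_call(s, k, t, r, sigma) - price
        if abs(diff) < tol:
            return sigma
        sigma -= diff / vega          # Newton step
    return sigma

# Round-trip check: price at sigma = 0.3, then recover it.
p = bs_call(100, 105, 0.5, 0.01, 0.3)
iv = implied_vol(p, 100, 105, 0.5, 0.01)
```

For deep in/out-of-the-money quotes where vega is tiny, Newton can misbehave, so production code typically falls back to bisection; the GPU question then becomes one of batching many such solves, not speeding up a single one.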
Gl with your endeavor.
I don't work in finance anymore, but Heston is not a pricing method per se. In the linked article, the OP assumes that the price of a bitcoin follows a lognormal distribution (i.e. dS_t = mu*S_t*dt + sigma*S_t*dW_t), while Heston assumes a different diffusion process with stochastic variance (dS_t = mu*S_t*dt + sqrt(v_t)*S_t*dW_t, with dv_t = kappa*(theta - v_t)*dt + zeta*sqrt(v_t)*dW'_t).
I would say that Monte Carlo is in general slower and less accurate than the other methods you mentioned, but more flexible in the sense that it allows more complicated diffusion models or payoffs (e.g. an option on the max or min of a basket of underlyings).
If you can give me links or pseudocode for these algorithms, I can look into it (maybe even make a simple example and put it on our github page!).
There were some useful discussions on HN at https://news.ycombinator.com/item?id=7383121
Thus Julia can be seen as a tool that leverages Pandas, not as a rival. In fact, most Python libraries are very easy to wrap in Julia because of the PyCall package (https://github.com/stevengj/PyCall.jl).
Does either Python or Julia have an IDE as good as RStudio? Where it's really easy to write code and get visualizations side by side?
I'd recommend Python as it has a lot of feature parity with R these days (especially with the Pandas library for fast dataframe operations).
I still fire up R/RStudio with ggplot2 for more complicated plots though...
Haven't used it long enough to form an opinion on it, but check it out; I think it's what you're looking for.
$ time pypy testcode.py
$ time python testcode.py
$ time python testcode_vectorized.py
are finite-element models or some other approximation of the probability distribution used in practice as an alternative to simulation when closed-form solutions are not available?
To be honest, if Julia wanted to really captivate an audience it would need to come out with a package or library which did something needed by the community in a way that Python didn't. Then you could conceivably see larger adoption rates.
Also, if history is any guide, the "new" and the "hot" tend not to beat an incumbent just because they're new and hot. Having the most room to improve is not a blessing. You have to improve at a rate that is faster and greater than the incumbent can improve. And if they can't improve at a faster rate, libraries and ecosystems win by offering something compelling that is either hard or impossible to replicate in the incumbent.
I don't really like to look at it so competitively, however. Open source is notoriously not a zero-sum game – everyone wins when there are more good options, whether they are in Python or Julia.
I take it the point is that Julia is generally faster without the optimization step, but the article seemed so heavily focused on this particular problem that I have to question whether it matters in this case.
In a high frequency trading situation, if you can estimate a "correct" price faster than your competitor can, then you get to make the trade, and they don't. Doesn't get much more performance sensitive than that.