Show HN: Vector - A High-Level Programming Language for GPU Computing (zhehaomao.com)
142 points by zhemao on Dec 21, 2013 | 54 comments



Please think about renaming it. This is the most generic, hard-to-find name possible. Consider: `vecmao` has only 13.5k hits on Google.


hear, hear! some of my least favorite things to search for and research are matters related to "R", "C", "dock", "boost", "Go"... i never had any such problems with "erlang", "numpy", or "gromacs".

Please do not underestimate the critical nature and endless annoyance of naming things after common concepts (or, worse yet, single characters).


There seems to be a trend of naming things with the language as a suffix, like xx.js, xx.py, or xx.go.


there's a language named "js" or "py"? i'm only aware of languages named "javascript" or "python" ;)


You knew what OP meant. ;)

The suffix is just meant to be unambiguous. Those familiar with the target language will understand what language the library is written for.


What's that thing about there being only two difficult things in Computer Science?

But yeah, we aren't particularly concerned about the name, since we don't really plan on continuing development.


By 13.5k I mean 89. Use double quotes around the search term.


High-level???

This is high level — http://hackage.haskell.org/package/accelerate


Higher level I guess. Higher level than CUDA or OpenCL. But Accelerate is pretty cool. I've been meaning to learn Haskell, so I might take a closer look at it some time. We were aiming for a different niche. Vector is basically just "Coffeescript for CUDA", in that most of the syntax is a one-to-one mapping, but annoying details like memory management are abstracted away.


I don't get you. Isn't C high-level?


No, C is as low-level as you can get. It maps directly onto the instructions of the register machine it runs on.


I invite anyone who thinks C is low level to try their hand at CUDA or OpenCL.

You think C is bad? Those are far worse.

This is at least a great step in the right direction. It's not "low level" as you describe just because it's C.


You know what's worse than CUDA or OpenCL? General purpose algorithms implemented in a graphics API.

I read a lot of GPGPU papers at university, and I could never understand the older ones that described algorithms by mapping everything to graphics elements and computed the solutions as a side effect of rendering something.

Next to that, understanding an algorithm implemented in CUDA is a breeze.


i will preface this by saying i'm a C programmer at heart,

CUDA and OpenCL demand a depth of understanding of both C and of how your code is executed on many-core processors, but i wouldn't call either of them terrible.

i would, however, very much like to see a widespread higher-level API for doing compute on GPUs, if only to encourage people to understand the lower-level details.


I didn't say at all that C is bad; as a matter of fact, I write most of my code in C. It is, however, a low-level programming language, not much different from assembly.

The reason is that it operates on concepts that are not abstract but are specific to register machines. A high-level language completely abstracts the underlying architecture, so the code can be executed in any possible environment by means of translation, be it a $10MM Cray, a cellular automaton, or a mechanical computer. In essence, a high-level language provides an abstract notation for computation.

C, however, is not abstract at all. Variables? How would those map to a dataflow computation? Pointers? They would not work beyond register CPUs (e.g. it is very cumbersome to translate C programs to JavaScript as a result, and it usually leads to emulating memory with an array). Fixed-width types? Volatile pointers? Returns from the middle of a procedure? Gotos? Come on, how are those even going to run on a non-conventional architecture?

C also does not completely define the language semantics, leaving certain operations implementation-specific or undefined. It also (before C11) didn't define any memory model, making it impossible to even describe an algorithm that depends on specific properties of memory accesses in a way that is portable across different register machines. C algorithms are close to impossible to translate to run on memory-less computation devices, as the whole concept of C is based around having local memory and a stack with certain properties (unless one wants to emulate a register machine; see the JavaScript remark above).

In certain areas this has led to some ugly solutions like CUDA, where the language looks like C but the semantics are completely different, and GPUs are about the closest thing to the original C target that you can get.


Assembly directly maps into machine instructions.

The mapping between C and assembly is very nontrivial, and is definitely not one-to-one.
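Even a trivial loop illustrates this. A sketch (not tied to any particular compiler or flags):

    /* sum.c -- sums n floats; there is no fixed assembly this "maps into" */
    float sum(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

An optimizing compiler may unroll this loop, vectorize it with SIMD instructions, keep s in a register or spill it, or inline the whole function at a call site, so the emitted instructions need not resemble the source line by line.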


C is nowhere near as low-level as you can get. C-- is closer. LLVM IR is even closer. but really, the lowest level you can get is the assembly of the architecture you're running on.

at the time of its creation, the general consensus was that C was too high-level, that it abstracted away the actual workings of the code.

since then we've been introduced to high-level languages that have made us re-evaluate what it means to be low- or high-level. but make no mistake, C is... at least medium-level!


At the risk of being pointlessly pedantic: assembly is surely not the lowest level relevant to the topic at hand. When programming for very high performance, you usually need to consider the microarchitecture you are targeting.


Hmm. Most "super-high-performance" projects I've seen find they can get more bang for the buck by switching to a different architecture (an FPGA, etc.) or by exploiting parallelism (buying ten computers), not so much by optimizing the machine code.


While it makes CUDA more readable, I feel like the time it takes to write code in this language will be very close to the time it takes someone experienced to write the actual CUDA code.


Maybe not faster to write, but it'd be less repetitive. I've written a bit of CUDA code, and having to put in a bunch of cudaMemcpy calls everywhere got pretty old. Also, reduce is pretty annoying to implement properly, and I'd rather not have to do it again for every possible reducing function.
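To illustrate the repetition, here's roughly what every "run this on N floats" call looks like in plain CUDA (a sketch with a made-up `square` kernel, no error checking):

    #include <cuda_runtime.h>

    __global__ void square(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d[i] *= d[i];
    }

    void square_on_gpu(float *h, int n) {
        float *d;
        size_t bytes = n * sizeof(float);
        cudaMalloc((void **)&d, bytes);                    // allocate device memory
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // copy host -> device
        square<<<(n + 255) / 256, 256>>>(d, n);            // launch the kernel
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // copy device -> host
        cudaFree(d);                                       // clean up
    }

Every kernel ends up wrapped in some variation of that malloc/memcpy/launch/memcpy/free dance, which is exactly the kind of thing I'd like abstracted away.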


That is what libraries are for.
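Thrust already ships a generic reduce, for example; you only swap the reducing functor (a minimal sketch):

    #include <cstdio>
    #include <vector>
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    int main() {
        std::vector<float> h = {3.f, 1.f, 4.f, 1.f, 5.f};
        thrust::device_vector<float> d(h.begin(), h.end());  // host -> device copy

        // Same reduce skeleton, different reducing functions:
        float sum = thrust::reduce(d.begin(), d.end(), 0.0f, thrust::plus<float>());
        float mx  = thrust::reduce(d.begin(), d.end(), h[0], thrust::maximum<float>());

        printf("sum=%f max=%f\n", sum, mx);
        return 0;
    }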


Very interesting! I'm also implementing a programming language for my undergrad dissertation (but specifically for agent-based simulations).

The thing that struck me most about vector was the radically different for loops (compared to C). I'm assuming you're purposefully crippling them to make parallelisation easier? Or is there another reason?

EDIT: One other thing - the website fails to scroll nicely on a Mac (in Chrome). I had to manually use the scroll bars instead of being able to two-finger swipe...


Yes, the special for loop syntax is to make it consistent with the "pfor" syntax. The "pfor" syntax is that way so that it can be parallelized.

Also, I can't believe I forgot to mention this in the post, but both for and pfor can sweep multiple iterators, so

    for (i in 0:10, j in 0:5) {
    }
Is equivalent to

    for (i = 0; i < 10; i++) {
        for (j = 0; j < 5; j++) {
        }
    }


Hey, @zhemao, I wasn't kidding about wanting to talk about bringing you on board here. It seriously takes a lot of talent to do what you've done :)



Startup culture is a cancer. Quit trying to sway him from true greatness. All hail Emperor Bozos.


Hey thanks. I've actually already accepted a full-time offer from Amazon, so I'm not looking around anymore. My teammates have all accepted full-time offers from other companies as well.


When I was about your age, I joined IBM for 6 years. The work was great and I liked everything I did there (well, at least for the first 3 years). In hindsight, though, I realize that I basically wasted those years.


Ah, interesting. Of course, you can do the same thing in one for loop in C like so:

    for (i = 0, j = 0; i < 10; j = (j + 1) % 5, j == 0 ? i++ : 0) {
    }
Not that you ever would, of course, but it does demonstrate the power of for loops in C.


Not sure what's going on with the scrolling. It's just a plain static webpage with some CSS. No fancy JS or anything.


Just tried it again using Chrome on OSX. Works fine for me. Do you have any weird browser extensions that could be screwing it up?


Not that I know of - but there was an odd iframe on top of the page that stopped scrolling from working. When I got rid of it, it started working again. Odd...


I'm wondering about the timings on page 36 of vector.pdf; those can't be seconds, or it would be way too slow. (I've written a program[1] to calculate the Mandelbrot set on the CPU with SIMD optimizations and SMT support; on my ageing laptop with a Core 2 Duo it calculates the start set in about 0.07 seconds.) It would be interesting if you provided the pure C program that was used for the timings, as then I could get a real grasp of the performance of the GPU variant.

[1] https://github.com/pflanze/mandelbrot.git

(BTW, also in the PDF, page 35, you write "computes the number of iterations til convergence for that point", that should be "divergence", right?)

PS. I'm quite impressed by what you achieved in the given time frame.


You can find the benchmarks in the "bench" directory of the git repo. The CPU code we generate for the benchmark is not particularly optimized and is completely single-threaded (so not really a fair comparison).


Oh well, this is embarrassing. I rewrote the CPU benchmark in C and it does indeed perform much faster. I think it has something to do with the use of the CUDA complex number functions. Unfortunately, I do not have my desktop with the GPU set up to recompute the GPU numbers.


I'm getting the following when running "vagrant up"; this is on Debian.

  $ vagrant up
  /home/chrishaskell/src/vector/Vagrantfile:7:in `<top (required)>': undefined method `configure' for Vagrant:Module (NoMethodError)
          from /usr/lib/ruby/vendor_ruby/vagrant/config/loader.rb:115:in `load'
          from /usr/lib/ruby/vendor_ruby/vagrant/config/loader.rb:115:in `block in procs_for_source'
          from /usr/lib/ruby/vendor_ruby/vagrant/config.rb:41:in `block in capture_configures'
          from <internal:prelude>:10:in `synchronize'
          from /usr/lib/ruby/vendor_ruby/vagrant/config.rb:36:in `capture_configures'
          from /usr/lib/ruby/vendor_ruby/vagrant/config/loader.rb:114:in `procs_for_source'
          from /usr/lib/ruby/vendor_ruby/vagrant/config/loader.rb:51:in `block in set'
          from /usr/lib/ruby/vendor_ruby/vagrant/config/loader.rb:45:in `each'
          from /usr/lib/ruby/vendor_ruby/vagrant/config/loader.rb:45:in `set'
          from /usr/lib/ruby/vendor_ruby/vagrant/environment.rb:377:in `block in load_config!'
          from /usr/lib/ruby/vendor_ruby/vagrant/environment.rb:392:in `call'
          from /usr/lib/ruby/vendor_ruby/vagrant/environment.rb:392:in `load_config!'
          from /usr/lib/ruby/vendor_ruby/vagrant/environment.rb:327:in `load!'
          from /usr/bin/vagrant:40:in `<main>'
If you post the generated C code then I'll give the timings and try to compare what it's doing differently.

The CPU I'm using (Intel(R) Core(TM)2 Duo CPU T9300 @ 2.50GHz) was released in July 2006 [1]. The GPU you're using was released on 15 June 2007 [2]. My CPU code calculates the 1246x998 pixel image of the zoomed-out view (real=-2..2, imag=-1.6..1.6, maxdepth=200) in 0.07 seconds; if your GPU code does about the same in 0.61 sec, then that's about 8 times slower than what the slightly older CPU can do with hand-optimized C code. That wouldn't be such a pretty result yet :)

[1] http://en.wikipedia.org/wiki/Intel_Core_2 [2] http://en.wikipedia.org/wiki/GeForce_8_Series


I've redone the CPU benchmark in C and run the CPU and GPU benchmarks on an EC2 G2 instance. The blog post has been updated with the corrected results.


Nothing against this particular language, but... I feel like there is a new language at least every day. It would seem that this does more harm than good to the developer community's progress. Of course, languages need to be iterated on in addition to the programs they compose. But, there is now such a large spread of similar languages that it necessarily slows the development of the most productive ones by blurring/resetting the focus constantly. Many technical problems can be solved with existing languages, rather than eliciting the distraction of a brand new language. Though, in this case, there is perhaps a clear purpose for the specialization of the language. There is certainly a benefit to new languages that offer truly new concepts or optimizations.


I actually welcome new languages, even if some may be buggy and lacking in features. Most will likely end up lost or unused, but the knowledge gained from developing them spreads out into the industry. Also, it is distracting if you try to follow every new trend. Like you said, there are already a good number of options available; no need to learn each and every new language. Though it is fun to download one on a Saturday and learn about the ideas the creator(s) had in mind when developing it.


I agree. This was just a class project, and I don't plan on continuing development. These features would be a lot more useful rolled into existing programming languages.


Would you be interested in trying to adapt some of your approaches into a C++ GPGPU library (https://github.com/kylelutz/compute)?


Hey that's pretty cool, and would probably make OpenCL usable by mere mortals. One improvement that I see you could borrow from vector is getting rid of this explicit copying business. Take a look at the array implementation in our runtime library.

https://github.com/vectorlang/vector/blob/master/rtlib/vecto...

Basically, the VectorArray class contains both the host array pointer and the device array pointer. There are also two boolean flags, h_dirty and d_dirty. When you modify array elements on the host, h_dirty is set to one. Then, when you run a kernel, the data is copied to the device if h_dirty is set, h_dirty is cleared, and d_dirty is set. When you try to read an array element again on the CPU, the data is copied from device to host if d_dirty is set, and d_dirty is then cleared.
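A stripped-down sketch of the idea (just the shape of it, not the actual class in rtlib):

    #include <cuda_runtime.h>
    #include <cstddef>

    // Host and device copies of the data, synchronized lazily via dirty flags.
    struct LazyArray {
        float *host;
        float *dev;
        size_t n;
        bool h_dirty, d_dirty;

        void set(size_t i, float v) {            // write on the host
            host[i] = v;
            h_dirty = true;
        }

        float get(size_t i) {                    // read on the host
            if (d_dirty) {                       // device has newer data: copy back
                cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
                d_dirty = false;
            }
            return host[i];
        }

        float *device_ptr() {                    // called right before a kernel launch
            if (h_dirty) {                       // host has newer data: copy over
                cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
                h_dirty = false;
            }
            d_dirty = true;                      // assume the kernel writes to it
            return dev;
        }
    };

The copies only happen when a stale side is actually touched, so chaining several kernels back to back never bounces the data through the host.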


I hear this same refrain about Linux distros. Personally, I think there's a lot of merit to the proliferation of languages. Languages don't just change the way we write code, they change the way we think. A programming language monoculture would lead to a thinking monoculture and that would be disastrous for innovation.


We agree. Language bloat is a cacophony. That's why we don't call ArrayFire a language. It's just a library with compatibility for existing languages, e.g. C, C++, and Fortran.


Copycatting is the sincerest form of flattery :p

http://arrayfire.com


Well crap. I assumed something like it had been done before, but I'd never heard of this.


How similar is it?


Reminds me of an early version of ArrayFire from 2009 or so. The project highlights 3 aspects:

* Automatic memory management - Been in ArrayFire since 2008

* Their pfor statement - See ArrayFire's GFOR, http://www.accelereyes.com/arrayfire/c/page_gfor.htm

* High-order functions - Been in ArrayFire since 2009

It's always interesting to watch other people reinvent the wheel; it takes a lot of talent, though. If the people behind this want an awesome opportunity to join our team (where we live this stuff every day and have developed a great culture and customer focus), give me a holler. Find me at http://notonlyluck.com


It's interesting how much startups tend to talk about how great the culture is. Can you elaborate on this 'developed culture'? I am really curious and hoping for a real response, not fluff.


I've written dozens of posts about it. Maybe peruse some of the posts here: http://notonlyluck.com/category/culture/


I see, thanks.


Howdy goldenkey, founder of https://commando.io here. Would you mind shooting us an e-mail? You can find the address on the site.


This was an undergraduate project? Props.



