Parallella, a $99 Linux Supercomputer (zdnet.com)
354 points by microwise 1259 days ago | 167 comments

I'm really disappointed about how shallow the discussions about Adapteva are, and have been, on HN.

To remind everyone, the H = hacker. This device is a godsend, as far as I'm concerned. For the first time ever I get fully documented access to a compute array on a chip. No, the architecture wasn't designed for anything specific, like graphics, but that means I don't get bogged down in details I don't care about, like some obscure memory hierarchy.

The chip is plain, simple, low-power, and begging for people to have an imagination again. Stop asking what existing things you can do with it, ask what future things having something like this on a SoC would enable.

Also, you should really be thinking about the chip at the instruction level, writing toy DSL to asm compilers. Thinking along the lines of, oh yeah I'll use OpenCL so I can be hardware agnostic, is never going to allow you to see what can be possible with it. If you read the docs you'll see what a simple and regular design it is, perfect for writing your own simple tooling.

It's been a long time, but I feel like a kid again. Like when I first discovered assembly on my 8086. Finally a simple device I can tinker with, play, and wring performance out of.

Hallelujah! :)

This is exactly my take on the Parallella board, and it's the exact reason why I donated to the Kickstarter back in the Fall.

I want to try my hand at writing a real, efficient, many:many message-passing API on top of SHM. It's something I've been interested in for a while (and am doing in a side project for x86_64). Not because it hasn't been done a thousand times before, but because it's neat.

I want to write a compiler for the Parallella. Not because there aren't compilers already, but because I've never written a compiler that targets a RISC architecture before. I've never written a compiler that respects pipelining.

I want to write a parallelized FFT based on the original paper for the Parallella. I've used various FFT libraries before, but never actually implemented an FFT straight up. Why? Not because it's never been done before, but just because it's an idea that appeals to me. And for practice parallelizing algorithms . . .

I want to write a raytracer for the Parallella. Not because I haven't written a raytracer before, but because I think that I'll be able to do something interesting with a Parallella raytracer that I haven't done before: real-time (ish) raytracing. Not because that hasn't been done before, but because it'd be neat to build.

I want to build a distributed physics engine. Not because there aren't excellent open-source physics engines (Bullet, ODE, etc.) -- but because I find the problem interesting. It's something I've wanted to do for a while, but never got around to. Why? Because it's interesting.

I could go on, but I'll stop here. The Parallella, I think, is a catalyst for a lot of small projects that I've wanted to do for a while. The Parallella is my excuse to spend time on random projects that will never go anywhere beyond a page on my website describing what they are, plus a link to the source code.

And, you know what? That seems perfect to me. That's why I want a Parallella, and that's why I'm eagerly awaiting mine within the next month or three. (Hopefully!)

Also worth mentioning is the Mojo board, also a Kickstarter project (http://www.kickstarter.com/projects/1106670630/mojo-digital-...) which is a simple FPGA system neatly packaged not unlike the Raspberry Pi.


FPGA allows you to explore a lot of things that just aren't possible in a traditional CPU, no matter how parallel.

Sounds cool, but worth pointing out that the Parallella is Zynq based, and so comes with a Xilinx FPGA built into the SoC that includes the dual ARM cores. The FPGA provides the "glue" for the Epiphany chip to talk to the CPU, but there's plenty of spare capacity.

The more the merrier, though. I wish I had time to play with FPGAs - I have a Minimig (an Amiga reimplementation where the custom chips are all in an FPGA) and I'm on the list for an FPGA Replay (targeting FPGA reimplementations of assorted home computers, including the Amiga, and arcade machines).

Nice! Even CC-by-sa licensed!

Do you know how its Spartan-6 XC6SLX9 compares to the Zynq 7010 on the Parallella?

If they are roughly equal, I guess the Parallella is a better deal since it has an ARM too.

Too bad Icarus Verilog can't synthesize at all anymore; Xilinx ISE is very heavy-weight and not FOSS, and they both use that.

Well stated. When I got an OLPC the point wasn't that it was a "good" laptop, it was that it was a "documented" laptop, and that is rare indeed. What I find particularly amusing are the comments on things like the Raspberry Pi (and Parallella) which talk about all the things they do with "real" computers that these things cannot do. It's amusing because when the first 8-bit machines (the Mark-8, the Altair 8800, and the KIM-1) came out, the exact same criticisms were leveled against them by the same sorts of people: people who were using the existing infrastructure (mini-computers at the time of the Altair), whining that you couldn't run multi-user, you had no storage to speak of, you probably couldn't even assemble a program, much less compile one, on them, etc. etc. etc. The Apple II got huge amounts of disrespect from the establishment as a "baby's toy" or a kid's toy designed to look like a computer. Then VisiCalc hit and those voices faded.

Not enough people understand what is and what isn't possible in a parallel computer setup. Creating a really really simple setup is a great way to let them get their heads around it without getting the information overload shutdown.

"...I'll use OpenCL so I can be hardware agnostic..."

Woah now...

Using OpenCL is eminently sensible for a large swath of applications. And "Hardware Agnosticism" is beneficial for a large number of reasons. This is only the first "accessible" platform that has been released... we can expect many, many more. DSL -> asm on each can get a bit tiresome.

Many apps can benefit from running in more than one place. For instance, I scaled out an authoritative game server using OpenCL recently. Because I used OpenCL, I could run it on Amazon GPGPU instances as well as locally on the server here. Two different sets of cards... same codebase... the only difference was that instead of hundreds of thousands of users on the local server, you can support millions on Amazon. There is something to be said for that sort of flexibility.

Couple that with a clustered vert.x event bus, and you begin to see the power such a system might bring... and that's just for a trivial application like gaming!!!

Imagine the benefits accrued to other, more complex, applications!

You should think carefully prior to being dismissive of hardware agnosticism. If you want to tinker and determine what is possible with this particular platform, by all means, use platform specific tools. However, if you want to bring the power of parallel processing to bear in solving problems wherever you find them... thinking about hardware agnosticism really is a "must-do" as my daughter would say.

I think his point is worrying about hardware agnosticism with a device like this is a bit like putting the cart before the horse.

"I'm really disappointed about how shallow the discussions about Adapteva are, and have been, on HN. To remind everyone, the H = hacker."

Thank you for saying this. I wanted to write about this but never found the time. I have started reading HN less and less because there is no longer substance in the comments, and they usually focus on something that makes up 10% of the article; I actually find it weird at times. I am depending more and more on the weekly newsletters but still miss the old comments, which were optimistic, on point and taught me something about the subject central to the article.

Back on topic. "What Adapteva has done is create a credit-card sized parallel-processing board" This is so cool. As a Linux user, I hope to get my hands on one of these ASAP! Take a look at the Kickstarter video: http://www.kickstarter.com/projects/adapteva/parallella-a-su...

  Stop asking what existing things you can do with it,
Ask away, I'll keep linking http://en.wikipedia.org/wiki/Embarrassingly_parallel

When I was 13, I started to tinker around with electronics. I didn't really have a proper grasp of electronics except from electronics playsets at 8-10. I designed some circuits using the bits of digital logic knowledge I had, but they called for dozens of chips to implement simple tasks. Then I stumbled across 8-bit AVRs. The possibilities compared to my feeble steps around XOR gates and flip-flop ICs... they put me in awe. So I started playing around with them, and tried to push bounds, optimize code, addressing, buses, multiplexing, etc. And then I found out about the 32-bit ARM chips that came for a similar price to the AVRs. Again, the possibilities... And then CPLDs and FPGAs, neural network ICs, and so on. Every single time, this rush of excitement. I'm an addict, amn't I?

And the Adapteva is similar. From what it sounds like, you just have this brute power at your fingertips. I'll read into the specs a little bit, but it sounds frankly awe-inspiring even before opening them up. Want to PWM-control 500 LEDs individually? From what it sounds like in this article, the thing has power for double that (off the top of my head). Want to create stereoscopic 3D models using a camera and an IMU in real time? It sounds like it can do that (rule-of-thumb calculation). Want to overlay your own 3D models onto it and display them in 3D using the Oculus?

What I'm trying to get across is, if you're not as excited as if you found out Santa is real, you're not excited enough.

EDIT: THERE'S A D*"§ FPGA ON THAT THING!!!! I'll stop datasheeting to avoid hyperventilation.

ask what future things having something like this on a SoC would enable.

I asked that and came up blank. And I haven't seen answers from anyone else, either. Has Adapteva themselves shown any examples where their chip beats a GPU?

I can think of two: documentation and simplicity. Comparing Parallella with GPUs only in performance is missing the point. The board is open and quite understandable for non-experts like me. It's a platform for learning and experimenting, like a Raspberry Pi but more geeky. You will probably have an easier time tinkering with this and getting it to do useful things, that's the point. I feel, like the parent commenter, quite excited about this.

In the comment thread on the article someone points out that the Adapteva chip doesn't do double precision floating-point, which limits its usefulness (to put it mildly). If the goal is to provide people with a low-cost platform to experiment with parallel programming, surely a decent NVidia card gives you less expensive (given you can plug it into a PCI slot and it will work) access to more CPUs that run faster and do more.

It took a long time for GPUs to get double precision floating point, and plenty of GPGPU work was done with them prior to that, so it's not a deal breaker.

Not sure if world first or AMD's first, but it was around this timeframe, 2007: "AMD Delivers First Stream Processor with Double Precision Floating Point Technology" http://phys.org/news113757140.html

In 32 or so years of programming, I've hardly ever done anything that needed, or used, floats. It may limit its usefulness, but most of what people tend to want double precision for is incidentally also stuff that is easily vectorized, in which case a GPU will crush it anyway.

And a "decent NVidia card" doesn't allow me to run arbitrary independent C programs on each individual core, and doesn't give me full low-level guides for hardware access. It's a completely different beast.

Well, you can still do double-floats, combining two 32-bit floats for a greater precision. While that doesn't get you full double precision, it just might be enough. And of course you can extend the same idea to implement quad-floats and so on.
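To make the double-float idea concrete, here is a minimal sketch of addition on (hi, lo) pairs of single-precision values. It is not taken from any Epiphany toolchain: the helper names `f32`, `two_sum` and `df_add` are made up for illustration, and float32 arithmetic is emulated in Python by rounding each intermediate result through `struct`, on the assumption that this matches what single-precision hardware would compute.

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a, b):
    """Error-free addition: s is the rounded float32 sum, e the rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def df_add(x, y):
    """Add two double-floats, each an unevaluated (hi, lo) pair of float32s."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    return two_sum(s, e)  # renormalize so lo stays small relative to hi

one = (1.0, 0.0)
tiny = (f32(1e-8), 0.0)
print(f32(one[0] + tiny[0]))  # plain float32 add: the tiny term is lost
print(df_add(one, tiny))      # double-float add: the tiny term survives in lo
```

The pair carries roughly twice the significand bits of a single float, but the exponent range stays that of float32, which is one reason this is not full double precision.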


One interesting application could be realtime 3D rendering because this is an area with small overhead. I know that the chip does not support floating point but that could be simulated by fixed point integers.
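For anyone curious what simulating fractions with fixed-point integers looks like, here is a rough sketch using a Q16.16 layout. The format choice and the function names are assumptions for illustration only, not anything from the Adapteva docs.

```python
FRAC_BITS = 16          # Q16.16: 16 integer bits, 16 fractional bits (assumed layout)
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to the nearest Q16.16 integer."""
    return int(round(x * ONE))

def to_float(q):
    """Convert a Q16.16 integer back to a float for display."""
    return q / ONE

def fx_mul(a, b):
    # The raw product carries 32 fractional bits; shift back down to 16.
    # On 32-bit hardware the intermediate product needs a 64-bit register.
    return (a * b) >> FRAC_BITS

def fx_div(a, b):
    # Pre-shift the dividend so the quotient keeps 16 fractional bits.
    return (a << FRAC_BITS) // b

print(to_float(fx_mul(to_fixed(1.5), to_fixed(2.25))))
print(to_float(fx_div(to_fixed(3.0), to_fixed(2.0))))
```

Addition and subtraction are ordinary integer operations on the raw values; only multiply and divide need the extra shift.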

I agree with Shamanmuni that the great advantage of the Parallella chip over GPUs is that it is open source (full documentation). It's a practical study tool for real parallel programming tasks that many students can afford.

It supports floating point. Just not double precision floating point. That's good enough for 3D rendering.

"I fundamentally disagree that SoCs are different than desktop technology. They're just smaller. High-end architectures are already driven to be as power-efficient as possible, so when you cut them down they're still efficient. For example, Kabini is an "SoC" that has the same GCN CU as a "desktop" Kaveri or a discrete Radeon."

You didn't really read what I said, did you? A key factor for embedded electronics is power draw. Based on a quick Google, AMD Kabini is using approximately 15W of power: http://techreport.com/news/24186/new-details-early-benchmark...

On the other hand, the 64-core Parallella is using approximately 2W: http://www.kickstarter.com/projects/adapteva/parallella-a-su...

Hope you can start to see the difference now.

Yes, but in those 15W of power, Kabini will likely have 128 stream processors, and I suspect more memory bandwidth: http://forums.anandtech.com/showthread.php?t=2278693

The Parallella doesn't seem inherently more appropriate for embedded devices; it just depends on your requirements. Kabini would be embarrassingly power-hungry in plenty of embedded applications, while the Parallella might be laughably slow in plenty of other embedded applications.

Don't forget, by the way, that "embedded" doesn't mean "battery".

You're nitpicking. I already gave a few examples of where Parallella would be a good fit. To remind you, let's revisit the OpenCV for robotics application. The Parallella is shaping up to be a great device for OpenCV applications, do you at least admit that?

I see, I am less confused now. You commented elsewhere on this page, and assume that we all have read those comments, though they are not in this chain right here.

You're approaching the question from a different angle. The key word is SoC. Think embedded performance, rather than desktop performance.

Just to give you a few examples... OpenCV for robotics platforms, cheap low-power SDR capable of transmission, SIP encryption and compression. One might argue you could stick a GPU in a robot, I'd personally want something better suited to the task (lower power).

I fundamentally disagree that SoCs are different than desktop technology. They're just smaller. High-end architectures are already driven to be as power-efficient as possible, so when you cut them down they're still efficient. For example, Kabini is an "SoC" that has the same GCN CU as a "desktop" Kaveri or a discrete Radeon.

This is wrong. For example, low-power embedded ARM chips are not simply cut-down high-end x86 chips. If you optimize for power usage instead of raw performance, there are many design decisions that come out differently, resulting in a design that is qualitatively different and not just a "scaled-down powerhouse".

ZenoArrow is talking about embedded SoCs though. While Kabini is a SoC, it isn't really suited for embedded applications. These chips usually have a ton of GPIO, built-in support for different communication protocols, analog to digital converters, a lower power draw, etc. etc.

Exactly. I'm very excited to get my board and break into it. It really is intended to be a testbed for multiprocessing and hacking.

One thing that I'd like to see is what other people do with this product. I think that really will be the best part of this board. I'd equate it to Minecraft (if I may be so bold). They didn't create a computer in Minecraft; they created the possibility to create a computer, and that was enough. That's how I see this board.

My biggest question is what do I need to know to use this? How can I write things that take advantage of this massive parallelization? Is there any reading anyone would recommend? Or maybe some basic examples for writing GPU based parallel software?

How can we get started with this?

I'm so stoked about this board! I've been really wanting to have a board to try out hybrid computing/HPC with for a long time now, but everything has been a bit beyond my reach/justification pricewise. I got so excited about this board reading about it. The sky is the limit with this hardware, and now it'll be that much more available to everybody.

Thinking about everything I can do with this board is making my head explode!

I'm really happy that you clarified this for me. I was a bit confused because the newest Kickstarter video focuses mainly on a young girl using the computer to surf the web and doesn't really detail anything that has been mentioned in these comments. After reading everything, it seems awesome though.

writing toy DSL to asm compilers

Sounds like an ideal use of Forth.

Good reminder!

Do you see applications in an embedded sense, or are you looking at it to augment a regular computer's capability?

I'm actually thinking Adapteva has a lot of future in present areas of growth.

1) On the mobile side, you can have Epiphany, their compute fabric, as a unit directly on the mobile SoC. You can do codec offload, like WebP, WebM, SILK/Opus. You can do basic computer vision for augmented reality applications, or image recognition. Or perhaps physics, integrate gyro output, position the device in absolute three space. I dunno, the point is the compute is open, there for exploitation. It's not like OpenCL where I have to beg the drivers to be available, correct, or performant. Nor is it like Qualcomm's Hexagon, where who knows if I can use it, and I sure as hell won't without signing an NDA.

2) As far as cloud and heterogeneous compute goes, again I see an embedded Epiphany being useful. Everybody whines about various things, like for example missing double-precision. Firstly, it's not like the architecture can't be extended in future. But more importantly they miss little details. Each node in Epiphany can branch and do integer. You can see it doing wire-speed protobuf de/coding and other parallel data shuffling of long-living data that could be compressed, or interleaved somehow.

I'm more of a low-power, cloud kind of guy. So that's what I'll be playing with the most when I get my hands on the kit. That and maybe some parallel graph rewriting. Who knows, the sky's the limit.

I wish to pick one nit: This is not the first chipset to be fully documented and have this sort of massively-parallel structure. GreenArrays is producing hardware right now if you want to go play: http://www.greenarraychips.com/

The GreenArrays eval board is $450, though. Parallella is $99 with everything. Parallella's original intent was to fit in an Altoids tin, but it turns out rounded corners are expensive ;). Would have been a nice case, though.

And yes, I gave $99, hoping to have one soon.

AFAIK the GreenArrays chip provides only 128 bytes per core while Parallella supports 32 KBytes per core.

As a former Forth hacker I was enthusiastic at the first glance of the GA, but 128 bytes per core were really disappointing. What could that amount of RAM be useful for?

Ok, since when is the ARM Cortex fully documented, and what depth does your comment add to this discussion? Your emotional appeal is more shallow than any other comment here! Throw in some programmer jargon and nostalgia, and a smiley and your comment is practically garbage. :)

There's so much negativity in this thread. Wasn't the whole idea that these guys had plans and an architecture to scale up to the order of a teraflop by 2014, and 20 by 2022? And look! They're shipping. This first chip may not be impressive, but I'll welcome a new player to the market who has big plans to innovate.

I don't get the negativity either. If you look at the architecture manual, this is like a cheap Tilera. It's an interesting programming model (lots of cores in a shared memory SMP with weak memory ordering), and the CPUs are pretty vanilla RISC architectures. For $99, it's a great way to play with something that has the properties of the kinds of CPUs you might see in a future supercomputer.

I wrote my notes up here last time this was doing the rounds: http://41j.com/blog/2012/10/my-take-on-the-adapteva-parallel...

I'm pretty skeptical. Having played with the Tilera, I'm not sure it gives you enough of a benefit to warrant the extra effort. The Parallella also looks a lot like a Tilera; I do wonder if there might be IP issues there down the line.

I also still think our best bet for this kind of thing is multicore ARM systems.

Not to take anything away from Parallella, but you can also get to play around with something that's used in today's supercomputers by buying any GeForce 5xx and starting to program in CUDA.

A fully documented ARM architecture as an accelerator chip is certainly interesting though. It will take time until the software tooling catches up, but the initial buzz in the HPC community about those ARM newcomers is certainly there. I'd give them a good chance in the long run to outrun Intel MICs and catch up to NVIDIA Tesla.

What I'd like to see next is a PCI Express expansion card using this technology. See, one of the great benefits of Tesla cards is that you can swap them out in your supercomputers just like you do with RAM - and you get the newest chip architecture, as long as your PCI bus can handle the load. For multi-purpose systems, however, you often still like to have a good number of x86 cores in there.

Small cheap cards for learning are a great idea.

But then they've included some really weird wording. They know that people are hostile to that wording, yet they chose to continue to use it.

When you're educating people it's important to be clear about terminology.

Having said that, I think it's neat, and I wish them luck. I think they're missing one of the main points of the RPi's success - it is dirt cheap. At $35 people will take a risk. At $90 people need to think about it. That might sound odd on HN where people tend to have a lot more disposable income.

I am very happy to see this coming out. The negativity found here is really disappointing. I will be very happy to get hold of a good number of cores running with a hardware memory model that is more out-of-order than x86.

Sounds like a glorious playground :)

As someone who uses supercomputers, I'm not sure I entirely understand the market of this product. It's really cool and I'd love to have one to tinker with, but due to its high parallelization, I see no benefit of using this over a graphics card. I'm not sure if $99 can get you a GPU that reaches 90 GFlops though... perhaps that's where the benefit lies.

EDIT: After reviewing their website, I notice they state

> One important goal of Parallella is to teach parallel programming...

In this respect, I can see how this is useful. Adapting scientific software to GPUs can be difficult and isn't the easiest thing to get into for your average person. This board, with its open-source toolkit and community could make this process a lot easier.

You can't do individual branching for every single strand of computation on a typical GPU. With their chips they are trying to create a "third" category between highly parallel hardware with few individual instruction threads, and normal CPUs.
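A toy cost model may help show why per-strand branching matters. This is only a sketch under simplifying assumptions: the cycle counts are invented, the "lockstep" model is a cartoon of how GPUs handle divergent branches within one group of lanes, and nothing here is measured on real hardware.

```python
def lockstep_cycles(branches, cost):
    """Lockstep group: every distinct path taken by any lane is stepped through by the whole group."""
    return sum(cost[path] for path in set(branches))

def independent_cycles(branches, cost):
    """Independent cores: each runs only its own path; the slowest core finishes last."""
    return max(cost[path] for path in branches)

cost = {"a": 10, "b": 40, "c": 25}   # hypothetical cycle counts per code path
lanes = ["a", "b", "c", "a", "b", "c", "a", "a"]

print(lockstep_cycles(lanes, cost))      # pays for all three paths in sequence
print(independent_cycles(lanes, cost))   # pays only for the longest path
```

When every lane takes the same path the two models cost the same, so the gap only opens up for branch-heavy, irregular code.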

And the $99 is not for a mass-produced card or chip, but for a computer from an initial small production run that includes one of the Parallella chips.

You pay for a dual-core ARM computer based on the Zynq SoC (which means you get an FPGA in the deal) with an Epiphany chip. Getting a Zynq dev-board for that price in itself makes it worth it for a lot of people.

I think this may just be a novel way to sell a dev board for their custom silicon and get some of that heavy kickstarter press coverage.

If you figure that what they're really trying to do is get people familiar with it and see how well it might augment one of their existing ARM products it starts to make a lot of sense.

For instance, I have a low-end 4-bay ARM-based NAS. Its insanely modest specs (1.6 GHz single core + 512 MB RAM) actually are quite sufficient for most NAS tasks. But it's really more like a home server platform, as they have all sorts of add-ons that include things like CCTV archiving, DVR, IP PBX - you get the picture. But if you really start treating it like a general-purpose server, you quickly realize that some common workloads perform horribly on that ARM core, and it's frustrating.

It can easily push 800 Mbps or so with NFS, SMB or CIFS, but if you want to rsync+ssh you're looking at less than a tenth of that because of the various fp needs of that chain. Native rsync with no ssh/no compression does somewhat better but still poorly due to its heavy use of cryptographic hash functions for delta transfers.

There are plenty of other examples: file system compression, repairing multipart files with par2 (kind of like RAID for file sets), face detection, file integrity hashing. And if it could do on-the-fly video transcoding (don't even think about it), it could happily replace another full system I have running Plex server.

There are probably a lot of devices where the designers default to ARM but have to skip features that are heavy on fp. If somebody in the firm has played around with a chip you can just drop in without changing your SoC or toolchain, that starts sounding pretty good, I'd guess - and likely still far cheaper than an Atom SoC.

Interesting observations. I know AMD is making an ARM chip (http://www.anandtech.com/show/6418/amd-will-build-64bit-arm-...). Have they said anything about FP?

It's an interesting (read: niche) market, to be sure. It's the same market as someone who might buy 8 Raspberry Pi boards, or 4 ODroid U2 machines, for the purpose of learning about parallel computation.

The Epiphany chip (the coprocessor on these boards) is supported as of GCC 4.8, so we also may see some novel ways to offload work to this chip in the future.

You can buy an AMD 7750 for $99, which theoretically gives you in the range of 800-900 GFLOPS (single precision).

Each TigerSHARC-like DSP core has 32 kByte embedded memory, and memory-like access to other cores' memory space. The system has GBit Ethernet, runs Linux and takes only 2-3 W power.

This is ideal cluster material.

You can get close to 1TF in the GPU space for $99 these days, but at much lower power efficiency. The Adapteva hardware looks to be around 18GF/W, whereas a GPU is about half that at $99.

The AMD Radeon HD 7790 ($150) does 21 GFLOPS/Watt. So I would say Adapteva is definitely comparable to GPUs in this approximate price range.

And that's just at the Kickstarter price. GPUs have had decades of competition to squeeze the extra costs out and build up economies of scale.

Comparing it with a GPU is a natural discussion to have. I believe there is more information on their website to answer the question. But maybe someone else can explain the programming paradigm difference. http://www.adapteva.com/introduction/

Each of the 64 cores can independently run arbitrary C/C++ code, which is much more flexible than a GPU. Each core has 32KB of local memory, which can also be accessed by the other cores, and there's 1GB of external memory too.
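Here is a small sketch of how "local memory that other cores can also access" can work in a flat address space. It assumes the layout I recall from the architecture reference linked below, namely a 32-bit address whose upper 12 bits are a core ID (6-bit row, 6-bit column) and whose lower 20 bits are an offset inside that core; treat the bit widths, the example coordinates and the function names as assumptions to check against the manual.

```python
LOCAL_BITS = 20          # assumed: low 20 bits address memory within one core
COL_BITS = 6             # assumed: 6-bit column id, with the row id above it
LOCAL_MEM = 32 * 1024    # 32KB of local memory per core, as stated above

def global_addr(row, col, offset):
    """Build the address another core would use to reach (row, col)'s local memory."""
    if not 0 <= offset < LOCAL_MEM:
        raise ValueError("offset outside the 32KB local memory")
    core_id = (row << COL_BITS) | col
    return (core_id << LOCAL_BITS) | offset

def split_addr(addr):
    """Recover (row, col, offset) from a global address."""
    core_id = addr >> LOCAL_BITS
    row = core_id >> COL_BITS
    col = core_id & ((1 << COL_BITS) - 1)
    return row, col, addr & ((1 << LOCAL_BITS) - 1)

addr = global_addr(32, 8, 0x100)   # hypothetical core coordinates
print(hex(addr), split_addr(addr))
```

Under this scheme a remote read or write is just an ordinary load or store to an address with another core's ID in the top bits, which is what makes the design feel like plain shared memory.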

Specs: http://www.adapteva.com/products/silicon-devices/e64g401/

Architecture: http://www.adapteva.com/wp-content/uploads/2012/10/epiphany_...

SDK docs: http://www.adapteva.com/wp-content/uploads/2013/04/epiphany_...

I wonder how much work it would take to port cpuminer to this platform for use in mining bitcoins and litecoins. Probably not worth it for bitcoins with the new FPGA hardware but could be interesting for litecoin depending on the memory access speed to main memory.

You mean the new ASIC hardware? FPGAs are ancient in the crazy world of Bitcoin, although a Litecoin FPGA will emerge in the next few months considering Litecoin's value.

Looks like I was about a year behind.

Interesting architecture. I like how well-documented everything is. Usually, either the low-level ISA for accelerator chips is not documented at all (like with GPUs), or detailed documentation is only available under NDA, and only proprietary development tools are available (like with FPGAs).

The topology reminds me of this paper "The Landscape of Parallel Computing Research: A View from Berkeley" http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-18...

> the low-level ISA for accelerator chips is not documented at all (like with GPUs)

Let me show you the AMD Southern Islands ISA specs: http://developer.amd.com/wordpress/media/2012/10/AMD_Souther...

Exactly. That's the real point. Are you going to be able to run these together and get on the TOP500? Lord, no! But can you learn how to parallel program on one? Sure, and that's no small matter.

Could it be good for path finding and other NP Hard problems? It could do good things for the AI world if that's the case.

First of all, which is the NP-hard pathfinding problem you're talking about? When I hear "pathfinding" I think "shortest path", which is (deterministic) polynomial time (exact class dependent on exactly which variant of the problem, but even Floyd-Warshall is Θ(|V|³)).
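For reference, a minimal Floyd-Warshall sketch; the three nested loops over the vertices are where the Θ(|V|³) comes from. The edge-list format and the function name are just choices for this illustration, and it assumes there are no negative cycles.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest paths for vertices 0..n-1; edges are (u, v, weight) triples."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    # Allow vertex k as an intermediate stop, one k at a time.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

dist = floyd_warshall(4, [(0, 1, 4), (1, 2, 1), (0, 2, 7), (2, 3, 2)])
print(dist[0][2], dist[0][3])
```

For a fixed k the (i, j) updates do not depend on each other, so the inner two loops are the part one would spread across cores.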


Anyway, no. If you have an NP-hard problem, and you want an exact answer (i.e. you are denying yourself approximate solutions), and you want to solve it for large inputs, unless you have either proven that P=NP by construction (heh), or you have a non-deterministic computing machine (heh), you're basically screwed. Going parallel isn't going to help, any more than a hypothetical billion-GHz serial CPU is going to help. Asking this question suggests a fundamental lack of understanding about what is interesting (or rather, infuriating) about NP-hard problems.

Parallel processing models give you, at best, linear speedup. If your problem is O(too-big), and your input is large, linear speedup doesn't help, no matter how much linear speedup you have.
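A quick back-of-the-envelope sketch of that point, assuming a brute-force algorithm that needs 2^n operations and an idealized machine with perfect linear speedup. The operation budget and the function name are made up for illustration.

```python
def largest_n(budget_ops, cores, work=lambda n: 2 ** n):
    """Largest input size n whose total work fits in budget_ops per core, given perfect linear speedup."""
    n = 0
    while work(n + 1) <= budget_ops * cores:
        n += 1
    return n

budget = 10 ** 12   # hypothetical: operations one core can do in the time we are willing to wait
for cores in (1, 64, 1_000_000):
    print(cores, largest_n(budget, cores))
```

Multiplying the core count by p only adds about log2(p) to the input size you can handle, so 64 cores buy roughly six more elements and a million cores roughly twenty.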

> Parallel processing models give you, at best, linear speedup.

Most commonly, yes, though in practice superlinear speedups can occur in some situations. (Not that this negates your overall point, just a nitpick.)

Can you give one or more examples?

Say you're running a problem of size N on a processor with a cache of size N/2. Adding another processor with another N/2 cache means your whole problem now fits in cache, so you'll probably suffer fewer cache misses, and thus could end up running more than 2x faster.
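As a rough illustration of that cache argument, here is a toy timing model. The hit and miss latencies, the 50% miss rate for the single processor and the function name are all invented numbers for the sketch, and it ignores the cost of loading each half into cache in the first place.

```python
def run_time(accesses, miss_rate, t_hit=1, t_miss=100):
    """Total time for a number of memory accesses at a given cache miss rate."""
    return accesses * ((1 - miss_rate) * t_hit + miss_rate * t_miss)

accesses = 1_000_000
one_cpu = run_time(accesses, miss_rate=0.5)        # cache holds only half the data
two_cpu = run_time(accesses // 2, miss_rate=0.0)   # each CPU's half fits in its own cache

print(one_cpu / two_cpu)   # well above 2 whenever misses are much slower than hits
```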

Wikipedia also points out possibilities at an algorithmic level: https://en.wikipedia.org/wiki/Speedup#Super_linear_speedup

Hmn. I don't find those examples compelling, but maybe I'm just missing something.

I don't have a thought about the Wikipedia backtracking example.

Regarding cache-vs-RAM, or regarding RAM-vs-disk, I see no reason you cannot take the work that the parallel units do, and serialize it onto a single processor. Let me consider the example you gave.

Initially, you have two processors with caches of size N/2, and a problem of size N on disk. You effortlessly split the problem into two parts at zero cost (ha!), load the first part on procA, the second part on procB. You pay one load-time. Now you process it, paying one processing-time, then store the result to disk, paying one store-time. Now you effortlessly combine it at zero cost (again, ha!), and you're done.

In my serial case, I do the same effortless split, I load (paying one load), compute (paying one compute), then store (paying one store). Then I do that again (doubling my costs). Then I do the same effortless combine. My system is 1/2 as fast as yours.

In short, I think "superlinear speedup" due to memory hierarchy is proof-by-construction that the initial algorithm could have been written faster. What am I missing?

OK, bear with me as we get slightly contrived here...(and deviate a little from my earlier, somewhat less thought out example).

Say your workload has some inherent requirement for random (throughout its entire execution) access to a dataset of size N. If you run it on a single processor with a cache of size N/2, you'll see a lot of cache misses that end up getting serviced from the next level in the storage hierarchy, slowing down execution a lot. If you add another processor with another N/2 units of cache, they'll both still see about the same cache miss rate, but cache misses then don't necessarily have to be satisfied from the next level down -- they can instead (at least some of the time) be serviced from the other processor's cache, which is likely to be significantly faster (whether you're talking about CPU caches relative to DRAM in an SMP system or memory relative to disk between two separate compute nodes over a network link).

For a more concrete example somewhat related (though not entirely congruent) to the above scenario, see http://www.ece.cmu.edu/~ece845/docs/lardpaper.pdf (page 211).
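The two-cache scenario above can be sketched as a toy cost model. All latencies are invented, and it assumes the two caches hold disjoint halves of the dataset, so every local miss can be served by the peer.

```python
def avg_access_time(p_local_hit, t_local, t_miss):
    """Mean cost of one access given a local hit rate and a miss penalty."""
    return p_local_hit * t_local + (1.0 - p_local_hit) * t_miss

# Hypothetical latencies in arbitrary time units (not measured values).
T_LOCAL = 1.0     # hit in the processor's own cache
T_REMOTE = 10.0   # miss served from the other processor's cache
T_NEXT = 100.0    # miss served from the next level down (DRAM or disk)

# Uniform random access over N items with an N/2 cache: half the accesses hit.
HIT_RATE = 0.5

one_proc = avg_access_time(HIT_RATE, T_LOCAL, T_NEXT)
# Assumption: the caches hold disjoint halves, so a local miss is a peer hit.
two_proc = avg_access_time(HIT_RATE, T_LOCAL, T_REMOTE)

# Two processors issue accesses concurrently, so throughput doubles on top
# of the cheaper average access.
speedup = 2.0 * one_proc / two_proc
print(f"speedup with 2 processors: {speedup:.1f}x")
```

With these made-up numbers the speedup is far above 2x, purely because misses got cheaper. The local hit rate never changed.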

Hmn. I think the reason there's superlinear speedup in the paper you linked is because the requests must be serviced in order. If you only care about throughput, and you can service the requests out-of-order, then you can use LARD in a serial process too, to improve cache locality, and achieve speed 1/Nth that of N-machine LARD. But to serve requests online, you can't do that reordering, so with one cache you'd be constantly invalidating it, thus the increased aggregate cache across the various machines results in superlinear speedup.

So, mission accomplished! I now believe that superlinear speedup is a real thing, and know of one example!

It runs MPI and has a GBit Ethernet interface. They've already built an 8-node demo cluster.

The Zynq 7020 arguably has enough spare capacity and I/O ports to implement a 3d torus with ~GByte/s throughput for each link.
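For anyone unfamiliar with the topology: in a 3D torus each node has six direct links, and the grid wraps around in every dimension. A minimal sketch of the addressing follows; the function name and grid size are hypothetical and have nothing to do with any actual Parallella tooling.

```python
def torus_neighbors(node, dims):
    """Return the six neighbours of `node` in a 3D torus of shape `dims`.
    Coordinates wrap around, so edge nodes link to the opposite face."""
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

# Hypothetical 4x4x4 cluster: the corner node still has six distinct links.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```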

I find it incredibly dishonest of Adapteva to equate it to a "theoretical 45 GHz CPU". There are much better ways to talk about the performance level of their hardware than that metric, especially given the rest of the text in their Kickstarter pitch is aimed at people who need to inherently understand the hardware's execution model in order to program it effectively.

The computing industry has established language and metrics to discuss computing performance and, while the waters often get muddied when the hardware is wide, that's a step too far.

  This board should deliver about 90 GFLOPS of performance, or — in terms 
  PC users understand — about the same horse-power as a 45GHz CPU.
That doesn't seem too outrageous to me.

Edit: They state the real fact and then give another figure explicitly stating it's an attempt to translate this into a metric the average user can somewhat relate to.

According to http://en.wikipedia.org/wiki/FLOPS#Computing it seems that they're off by a factor of two, but I'm guessing that's just an honest mistake.

Second edit: I was under the impression that this was the result of dumbing down by a journalist, however it seems it's from Parallella itself. That is a bit disingenuous indeed.

It is 90 GFLOPS at less than 5 Watts. That's not too shabby. Adapteva claims that the Epiphany cores have a GFLOPS/W ratio of 50.

See here: http://streamcomputing.eu/blog/2012-08-27/processors-that-ca...

Also, the boards for the backers feature a ZYNQ-7020 SOC by XILINX which sports a 1.3M Gate FPGA, available to the user. This ain't bad either.

Most others - like my MacBook here [1] - are advertising GHz / core not "summing" the cores.

So the statement "PC users understand" is false.

[1] http://store.apple.com/us/browse/home/shop_mac/family/macboo...

A single Ivy Bridge core has 8 Flops/MHz of computing power. 45GHz Ivy Bridge would be able to do 360GFlops.

If you are correct for the first clause(8 Flops/MHz), 45GHz of Ivy Bridge core has 360k Flops (8 Flops/MHz * 45GHz ==> 8 Flops * 45k).

8 double-precision flops/cycle/core is the correct figure for Ivy Bridge and Sandy Bridge. With Haswell adding FMA, that figure doubles again(!)

Hum, no. Sandy/Ivy Bridge can only execute 4 double-precision instructions per cycle per core, in the form of two SSE instructions per cycle (one instruction doing adds, the other doing muls, executed by different units).

Doing 8 double-precision instructions per cycle would translate to either four 128-bit SSE instructions, or two 256-bit AVX instructions per cycle, which is not possible (unless I did not keep track of the latest AVX capabilities).

It should read 8 FLOPS per cycle double precision. So a 3 GHz 4 core Ivy Bridge processor could theoretically peak at 96 GFLOPS double precision, 192 GFLOPS single precision.
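The arithmetic being argued over in this subthread is just cores × clock × FLOPs per cycle. A small sketch, using only the figures quoted in the comments above (the function name is mine):

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak: every core retires flops_per_cycle each cycle."""
    return cores * ghz * flops_per_cycle

# 3 GHz, 4-core Ivy Bridge at 8 double-precision FLOPs/cycle/core.
print(peak_gflops(4, 3.0, 8))    # double precision
# Single precision packs twice as many values per vector register.
print(peak_gflops(4, 3.0, 16))   # single precision
# The hypothetical single 45 GHz core at 8 FLOPs/cycle.
print(peak_gflops(1, 45.0, 8))
```

These are peak figures only. Sustained throughput depends on keeping both the add and multiply units busy every cycle.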

The only reason it isn't a big lie is that it's an utterly meaningless statement. In any case it is a misrepresentation of what modern CPUs are capable of.

Me neither. The only catch is that you can't get serial computation that fast, but I assume anyone buying something called "Parallella" would realize that already.

Because all of us dumb PC users measure performance in terms of "horse-power". :-)

They do answer that on their kickstarter page: http://www.kickstarter.com/projects/adapteva/parallella-a-su...

Why do you say the Parallella is a 45GHz computer?

We have received a lot of negative feedback regarding this number so we want to explain the meaning and motivation. A single number can never characterize the performance of an architecture. The only thing that really matters is how many seconds and how many joules YOUR application consumes on a specific platform.

Still, we think multiplying the core frequency(700MHz) times the number of cores (64) is as good a metric as any. As a comparison point, the theoretical peak GFLOPS number often quoted for GPUs is really only reachable if you have an application with significant data parallelism and limited branching. Other numbers used in the past by processors include: peak GFLOPS, MIPS, Dhrystone scores, CoreMark scores, SPEC scores, Linpack scores, etc. Taken by themselves, datasheet specs mean very little. We have published all of our data and manuals and we hope it's clear what our architecture can do. If not, let us know how we can convince you.
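For reference, the multiplication in that answer, plus what it implies per core when combined with the 90 GFLOPS figure quoted earlier in the thread. This is a sketch using only numbers from this page; the function names are made up.

```python
def aggregate_clock_ghz(cores, mhz_per_core):
    """The 'sum of clocks' metric: cores times per-core frequency."""
    return cores * mhz_per_core / 1000.0

def flops_per_cycle_per_core(peak_gflops, cores, mhz_per_core):
    """How many FLOPs each core must retire per cycle to hit the peak."""
    return peak_gflops / aggregate_clock_ghz(cores, mhz_per_core)

print(aggregate_clock_ghz(64, 700))             # the "45 GHz" headline number
print(flops_per_cycle_per_core(90.0, 64, 700))  # roughly 2 FLOPs/cycle/core
```

So the 90 GFLOPS claim corresponds to about two floating-point operations per core per cycle, which is the number to compare against the 8 per cycle discussed for Ivy Bridge above.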

I read this as:

Why do you say the Parallella is a 45GHz computer?

This is a Kickstarter project. In order to be successful, we need to attract as much attention from spam blogs as possible. To do that, facts are not particularly useful. What we need is something exciting. If we say we have 64 cores, that's not exciting. 64? I've forgotten how to count that low. Similarly, if we say we have a 700MHz processor, most people listening to us talk will actually start laughing in our faces. So that's no good. But thanks to our mathematical forefathers, there are many ways to make small numbers big. We could add the two numbers, saying we have a 764MHz machine. But that's not exciting, and the units don't work. We could divide the two numbers, yielding 10.94MHz. The units work, but that number is even smaller! Finally, we could try multiplication! And, boy, does that deliver! 45GHz!

TLDR: The only reason you're here is because of our misleading and dishonest claim. But now you're here. Please cough up your hard-earned cash which we might not use to go on a nice tropical vacation. You can trust us, we'd never mislead you...

Their answer to 'Why do you call the Parallella a supercomputer?' is also pretty curious: ;)

> The Parallella project is not a board, it's intended to be a long term computing project and community dedicated to advancing parallel computing.

> The current $99 board aren't considered supercomputers by 2012 standards, but a cluster of 10 Parallella boards would have been considered a supercomputer 10 years ago.

Wait, what? :D (emphasis mine)

> Our goal is to put a bona-fida supercomputer in the hands of everyone as soon as possible but the first Parallella board is just the first step. Once we have a strong community in place, work will being on PCIe boards containing multiple 1024-core chips with 2048 GFLOPS of double precision performance per chip. At that point, there should be no question that the Parallella would qualify as a true supercomputing platform.

They are clearly wrong. The purpose of higher clock rates is to produce a given answer in a smaller amount of time (latency). The purpose of adding more processors (cores) is to produce more answers in a given time (throughput). They are free to report their results using any standard measurement of throughput. Their answer is weasely.

But then, clock rate is irrelevant to latency and throughput. Really matters how much more work per cycle you get done, how fast you can move IO, and the cost of throughput per watt (and whether the system can meet your requirements at all).

They are "clearly wrong" when talking to geeks about specific types of problems. For most people this means nothing, and multiplying it is fine. And for a lot of situations where you are considering batch jobs, multiplying it is fine as a quick illustration. It is not as if the raw numbers tell you anything anyway, since the characteristics of the system are so unusual.

They miscalculated how people would interpret it, and got burned. But they've been clear about what it is they actually mean the whole time.

I don't see a big difference between this and "petaflops" measurements that are the de-facto standard in bragging about supercomputers. You really only hit that peak performance for embarrassingly-parallel problems, but unless you have a specific workload or benchmark to talk about, it's the best you have and is a fairly well-accepted practice in the industry.

On a related note "petaflops" would be a great name for a pet bunny.

"this board should deliver about 90 GFLOPS of performance, or --in terms PC users understand-- about the same horse-power as a 45GHz CPU."

This is wrong.

A 4-core 3.0 GHz x86-64 processor delivers more GFLOPS than the Parallella: 96 GFLOPS with SSE instructions, because each core can execute 8 single-precision operations, 4 adds and 4 muls, each cycle. And yes, when Parallella claims 90 GFLOPS, they mean single-precision.

For example, for the same price as the Parallella, you can get a $100 Phenom II X4 965 (4-core, 3.4 GHz, 125W) delivering 109 GFLOPS. Count $200 to include minimal mobo/RAM/PSU (if all you care about is raw GFLOPS).

The main advantage that Parallella has with their exotic architecture over x86-64 is a better GFLOPS/Watt metric. But if you care about this metric you should consider GPUs, which beat Parallella: http://parallelis.com/parallela-supercomputing-for-all-of-us...

Parallella may not beat anything on GFLOPS/Watt and GFLOPS/$, but if they can maintain ease of development (x86-64's stronghold) while doing not too badly on these 2 metrics (dominated by GPUs), they may be a good compromise and may have a shot at succeeding in the HPC market.

Exactly right. ARM's lure isn't really the current performance for supercomputing, it's rather the expectation that they will hit the next big performance wall much later than x86 because of its simple architecture that's suitable for the maximum amount of cores per die space. Give it 2-3 years and we might have the big step in supercomputing architecture at hand.

Note: the $99 version has 16 cores, not 64 cores. http://www.kickstarter.com/projects/adapteva/parallella-a-su... (+ 2 ARM cores)

Can anyone explain what the practical differences are between something like this and a GPGPU approach? It doesn't sound particularly performant compared to modern GPUs otherwise. Maybe they add in some more general purpose instructions for a little more flexibility?

A typical GPU can execute a small number of threads on a large number of streams of data carefully laid out in memory. Every time you want to do something conditionally on just one data stream, you waste a lot of capacity.

In contrast, the Epiphany chips can execute individual threads on each core in parallel on data either local to the core (fastest), on any other core, or in separate main memory.

The current Epiphany chips aren't too spectacular, since the core count is "low". They can "only" execute 16 individual instruction streams in parallel. But that's on a chip the size of your finger nail, and their roadmap is aiming for 1024 core chips.

They're effectively aiming for people to find ways of making effective use of simple, small, power efficient cores for problems that are not "data parallel" enough to be efficiently done on GPU's.
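The "waste a lot of capacity" point can be illustrated with a toy model of lockstep execution. It assumes a group of lanes that must all step through both sides of an if/else whenever any lane takes each side; the lane count and cycle costs are invented.

```python
def simd_utilization(takes_then, then_cost, else_cost):
    """Fraction of lane-cycles doing useful work when lanes run in lockstep.
    Each branch side is executed (by all lanes) if at least one lane needs
    it; lanes on the other side sit idle for that many cycles."""
    lanes = len(takes_then)
    n_then = sum(takes_then)
    n_else = lanes - n_then
    cycles = (then_cost if n_then else 0) + (else_cost if n_else else 0)
    useful = n_then * then_cost + n_else * else_cost
    return useful / (lanes * cycles)

# 32 lanes all taking the same path: no waste.
print(simd_utilization([True] * 32, then_cost=100, else_cost=1))
# One lane takes an expensive path, 31 lanes wait for it.
print(simd_utilization([True] + [False] * 31, then_cost=100, else_cost=1))
```

With independent instruction streams, as on the Epiphany cores, the 31 cheap lanes would simply finish and move on instead of idling. That is the trade being described above.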

This might be a really stupid question:

How difficult would it be in the practical sense to keep all of the cores on something like this "fed" with enough information to get benefits from its concurrency?

I mean to "feed" all 64 cores enough data/code so they can all "do something" concurrently is one hell of a job all on its own!

Depends a lot on the type of problem, and I think that's going to be what makes or breaks them. They have some good examples, but you're right, it's a hard problem and one of the reasons it's so important for them to get these dev boards out.

Awesome explanation, thanks. I love that you can learn about things outside your field here without it being super dumbed down or having to google half a dozen domain specific terms.

If nothing else it's very small and uses very little power

I backed this project simply because it's a great idea to build a very small highly parallel computer that runs on very little power. Maybe this one won't hit it out of the park but it might give other people ideas. Building the first one of anything is always hard. Add a little serendipity and we might get an entirely new use for computers.

Just saying that I could do more with a $99 graphics card sort of misses the point.

It's such a great idea that AMD Kabini already did it.

Ok, so we should stop there and not encourage further development in the market? I want a 1000 core "Raspberry Pi" that sells for $50.

Let me know if there's anything else that I can do to help.

How many megahashes does this hardware compute?

I love this board, but keep in mind that the entire premise of this board is parallel execution of _separate instruction streams_. From the performance people are getting from GPU's for bitcoin mining, I presume the calculations can be done extremely parallel with few instruction streams - for that a normal GPU is likely to be a far better choice.

Much, much less than the specialized ASIC platforms do.

But when we consider the consuming power of ASIC platform, I think this board has strength. They said this board consumes 5 watt for typical jobs.


> But when we consider the consuming power of ASIC platform, I think this board has strength.

No, it can't possibly have.

SHA is half bitshifts-by-constants. On an ASIC platform, those essentially refactor to no-ops. There is no way, no how general-purpose hardware could ever possibly get anywhere near even a piss-poor special purpose ASIC for this task. If you think otherwise you simply don't understand the domain. Those 600-watt ASIC systems contain multiple chips and run at tens of GHashes/s. That 5-watt chip, if it's very, very good, might maybe break 40MHash/s.
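To make the "bitshifts-by-constants" remark concrete, these are the two small sigma functions from the SHA-256 message schedule, written as a plain Python sketch. On a general-purpose core each rotate and shift costs instructions; in fixed-function hardware a rotation by a constant is just a permutation of wires.

```python
MASK32 = 0xFFFFFFFF

def rotr(x, n):
    """Rotate a 32-bit word right by a constant n."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def small_sigma0(x):
    """SHA-256 message-schedule function: ROTR 7, ROTR 18, SHR 3."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def small_sigma1(x):
    """SHA-256 message-schedule function: ROTR 17, ROTR 19, SHR 10."""
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

print(hex(small_sigma0(1)), hex(small_sigma1(1)))
```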

It's nowhere near fast enough. My 7970s can push out about 1.3Ghash/s and combined they are capable of around 7 TFLOPs. When (/if) they release the BFL Jalapeño it'll run at 5 Ghash/s and be powered by USB. 90 GFLOPs is equivalent to a decent processor, but nowhere near powerful enough for bitcoin mining.
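A crude way to sanity-check that is to scale the quoted hash rate by the ratio of peak FLOPS. This assumes hash rate is proportional to floating-point peak, which is a loud assumption: SHA-256 is integer work, so treat the result as an order-of-magnitude guess, not a benchmark.

```python
def naive_mhash_estimate(ref_mhash, ref_gflops, target_gflops):
    """Scale a reference hash rate by the ratio of peak GFLOPS.
    Assumption: hashing throughput tracks floating-point peak linearly."""
    return ref_mhash * target_gflops / ref_gflops

# Figures from the comment above: about 1.3 GHash/s at about 7 TFLOPS.
estimate = naive_mhash_estimate(1300.0, 7000.0, 90.0)
print(f"about {estimate:.1f} MHash/s for a 90 GFLOPS part")
```

That lands in the low tens of MHash/s, the same ballpark as the "might maybe break 40MHash/s" guess elsewhere in the thread.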

Not enough. Don't bother.

They say 90 GFlops.


> For example a Radeon 6990 has 5.2 gigaFLOPS of computing power[1] and yields roughly 800 megahash/s in bitcoin mining.

That was in July 2011. Mining is harder now.

Radeon 6990 is 5.1 TERAflops (5099 GigaFLOPS) ... several orders of magnitude faster than this thing


Doing a single hash isn't harder though. Increased difficulty just means you have to do more hashes. So the numbers given should still be about right.
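Put differently, difficulty only changes how many hashes you expect to need. The expected count is approximately difficulty × 2^32. A sketch with a hypothetical difficulty value, not the real network figure:

```python
def expected_hashes_per_block(difficulty):
    """Approximate expected number of hashes to find one block."""
    return difficulty * 2 ** 32

def expected_seconds_per_block(difficulty, hashes_per_second):
    """Expected solo-mining time per block at a constant hash rate."""
    return expected_hashes_per_block(difficulty) / hashes_per_second

# Hypothetical difficulty, and the 40 MHash/s guess from this thread.
difficulty = 1_000_000
seconds = expected_seconds_per_block(difficulty, 40e6)
print(f"about {seconds / 86400 / 365:.1f} years per block")
```

The cost of one hash stays the same on a given device. Only the expected waiting time moves with difficulty.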

Cayman XT [Radeon 6970] outputs 2.7 TFLOPS in Single-Precision and 675 GFLOPS in Dual Precision. [1] I think the 6990 is just two of those, or at least that's normally their convention.

[1] http://www.brightsideofnews.com/news/2010/12/13/amd-radeon-h...

I agree with you. This might be good for bitcoin mining. :)

The real question is: How much power does it consume compared to its hashing power?

While I find this quite exciting from a pure developer perspective, it also reminded me that I haven't had anything I'd call a Desktop box in quite some time.

If I were to ever get a Desktop machine again, it would have to be cheap and light, definitely don't want anything clunky, otherwise a laptop seems preferable to me. There do not seem that many products that would fill that gap, Intel's NUC is too expensive, the Raspberry PI too slow. Apple's mini Mac seems like the best proposition in this segment.

I wonder if the Parallella could not only be used as a development center, but also as a Desktop computer? It won't run any fancy games, that's clear, but it may actually be usable for browsing, watching videos and office duties.

Wow, I remember seeing the original Kickstarter for this and thinking "this will never see the light of day", yet here it is. I still find it a bit of an odd product; neither for hobby nor business, but it sure is cheap.

It's a developer board. The product is the chips, not this board. This board is there mainly to get a dev board in the hands of people who might want to build cool stuff with it.

That they've actually managed to get it price competitive with a lot of cheap ARM computers, despite sporting a Zynq (ARM SoC with built in FPGA) is amazing.

Can't help but wonder if they are in fact taking a loss, backed by Adapteva.

They seem to actually have support from some of the hardware manufacturers. From Update #31 "Much gratitude goes out to the component manufacturers who really “got it” (Xilinx, Analog Devices, Intersil, Micron, Microchip, Samtec all deserve special thanks). Without their help we would be losing $100 per board!"

So, the backers are getting a Very Good Deal, with the hopes that a successful launch will make demand high enough to make the $99 viable with volume.

I can think of some amazing uses for this. I'm tempted to get one just to port this old hack of mine to it:


That's very cool!

A supercomputer is a cluster of machines connected by a high-throughput, low-latency interconnect.

Hundreds of servers connected together with 1Gigabit is still a "grid cluster" .. you need at least 10Gigabit Ethernet (over iWARP) or infiniband (RDMA) to be considered a supercomputer.

This is marketing B.S.! this B.S. is "emphasized" by the 90GFLOPS = 45Ghz thing. 90GFLOPS is by a single 45GHz "ALU" (perhaps an ALU doing Multiply-Add - MADD op.) not a full fledged CPU (like the i7 or Xeon, which has 4-8 cores with each core having 3 ALU's) as the readers might imply.

For example the i7 3770K does 121.6GFLOPS @ "only" 3.5Ghz (ref> table page 2 http://elrond.informatik.tu-freiberg.de/papers/WorldComp2012...)

Measuring performance with GHz is soooo Pentium III! The whole thing is very misleading, and I don't like that!

Supercomputer? Not even funny! It's a Super-"Raspberry Pi". That's it!

How does this compare to the new Intel MIC (Xeon Phi) co-processor boards? I think they claim 1TFLOP. Can we think of this as a low-powered alternative?


The general idea is similar - lots of cores with distributed SRAM memory and some shared DRAM, all sitting on 2D mesh network. The main difference is that Epiphany is made of custom simple RISC cores, while Xeon Phi uses 1st gen Pentiums with huge SIMD FPUs slapped on for higher FP throughput (and TDP).


It looks like (from info on wikipedia pages) the Xeon Phi 3100, gets about 3.3 GFLOPS/WATT, whereas the Epiphany E64G401 manages about 50 GFLOPS/WATT.

So something like 10 of these might compare to 1 xeon phi, and still be cheaper in terms of hardware, and much cheaper in terms of power consumption.
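A back-of-the-envelope version of that comparison, using only figures quoted in this thread (1 TFLOP for the Phi, 90 GFLOPS per board, 3.3 vs 50 GFLOPS/W). The function names are made up, and the chip-level GFLOPS/W figure ignores the rest of the board.

```python
import math

def boards_needed(target_gflops, gflops_per_board):
    """Whole boards required to match a target peak."""
    return math.ceil(target_gflops / gflops_per_board)

def watts(gflops, gflops_per_watt):
    """Power implied by a peak figure and an efficiency figure."""
    return gflops / gflops_per_watt

print(boards_needed(1000, 90))            # boards to match ~1 TFLOP
print(round(watts(1000, 3.3)))            # Xeon Phi at 3.3 GFLOPS/W
print(watts(1000, 50))                    # Epiphany chips alone at 50 GFLOPS/W
```

At the roughly 5 W per board mentioned elsewhere in the thread, a dozen boards would draw more than the chip-only figure suggests, but still well under the Phi estimate.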

Off the top of my mind (sorry, I don't have the time to double check now):

Phi has shared GDDR and distributed caches. Phi cores and caches are connected through a bidirectional ring interconnect, not a 2D mesh network. Still similar, but not as much.

I wish these things had just a bit more memory. Most of the interesting algorithms I work with (bioinformatics) really want 4G of memory. A lot of them you can squeeze down to 2G but 1G is just out of the question.

I think this is cool, but wouldn't learning OpenCL be more future-proof for someone wanting to get into parallel processing? Seems like there is more drive behind GPU development than specialist hardware like this.

This is really cool. Since it's linux I assume it can run the JVM, correct? That's incredibly powerful, as even GPU programming requires bridge libraries. And what, $99? That's incredible. I'm going to get one...

IIRC Linux does not run on Adapteva. Linux runs on the ARM which is next to the Adapteva chip.

This is perfect for numerical computing applications like software-defined radio or image processing, which can now be done on embedded platforms. I'll definitely be ordering a board when they're available.

Can someone explain to me how a $99 computer can have 45ghz of processing power, but an i7 costs 3x that for 1/10th that clock speed? What does this $99 miss out on that my i7 has the capability of doing?

First of all, this 45GHz figure definitely isn't valid for modern x86 chips - thanks to multiple cores and SIMD instructions they reach a few dozen GFLOPS at stock frequencies.

Furthermore, x86 chips pack all of their performance into a small number of cores, which makes them much more useful for common scalar code. And if 20 times higher scalar performance isn't enough to convince you to pay a premium, the complexity required to achieve this level of scalar performance definitely is enough to discourage Intel from selling you i7s for $99.

This is kind of a novelty. Your i7 has way, way more power for jobs which only use a few cores. Most normal jobs are like that, so unless you have specific requirements, the i7 is going to give you much better performance.

I'm not familiar with this so I have a question: Can you interconnect a few of these boards to create a more powerful unit? I notice they have "expansion connectors"...

I only want to know one thing. How fast can it mine Bitcoins?! I feel like that's the new "...but can it run Crysis?"

It is not easy finding people with good parallel programming skills. Hopefully, this will help things along.

Of course if you learn on Adapteva then your knowledge may not translate to the "worse" architectures that are used in the real world. If you want to learn parallel programming, the computer you already have supports threads, CSP, actors, OpenMP, OpenCL, etc.

It might be interesting to have a go at writing a version of Connection Machine Lisp for it.

Cool. I don't know any other cheap way to experiment with optimization for 64 cores.

Hmm I wonder if there is a way to bypass your graphics card and use this as a GPU?

I want one, or two, maybe more. I'm totally fascinated with parallel computing.

So where do I buy one?

Sign up on the site and they'll mail you when you can order. If you only need FPGA, you can get the Mojo (see link elsewhere in thread) from May.

Can it run XBMC?


usb 3.0 would have been nice

I wonder how those in the performance computing sector feel about running a proprietary supervisor with built-in DRM on each and every CPU? Raspberry users might not care when it's just for hobbyist applications, but I doubt any serious scientist is going to overlook that.


Intel platforms have a very similar risk via SMM and the platform code & controller. It's less advanced, but it can easily exert full control over the system without the os allowing it, minus access to some registers and on die cache. It could DMA in or out of the gpu memory as well.

Whether your soc vendor forces a secure supervisor to load is up to them, and i'd be surprised if an HPC builder had trouble finding vendors to supply parts with a totally controllable boot chain.

I'm sure there are ways to obscure it, but there are just as many ways on x86 platforms, the only real difference being that you could pull the eprom and reflash it and inspect the other board components. There's also plenty of evil things you can put in a soc without relying on trustzone.

Bottom line is you have to trust your vendor. If you want a soc integrated and fab monitored by a business/state that is politically aligned with yours it is probably just a matter of paying a premium.

The hardware cost of TrustZone is rather low and vendors of "compute SoCs" have no reason to ship hypervisor software on their chips.

And the Raspberry Pi probably doesn't run any secure mode hypervisor either.

Trustzone is just a set of hardware features. Most ARM devices don't come with a proprietary supervisor. In fact, Linux used to run in the secure world on some development devices.

Can it mine BitCoin competitively?

> Can it mine BitCoin competitively?

Prior to the popularity of mining using GPUs, it would have been the shizzle.

Today's ASIC-based systems will hash circles around it.

It's only pulling 2W, so it really depends on the performance per W. Maybe that'll be my first project...

Could you imagine a Beowulf cluster of these?

Only on Slashdot.

So how fast is this really? It doesn't sound like much of a supercomputer to me. If it were so super for $99, it'd have been hyped everywhere already and gamers would not buy desktops anymore. It rather sounds like a platform to practice multithreading on.

Supercomputer =/= "a fast computer".

Then what is it?

Wikipedia also seems to say it's "a fast computer": "a computer at the frontline of current processing capacity, particularly speed of calculation"


A supercomputer is a machine that's I/O bound instead of CPU bound, at least as a first approximation.

You'll get lots of specs thrown at you like in mid 2013 a supercomputer means using X, Y, and Z technologies. But that is just a longer format version of the above.

A pessimist usually warps the definition to a machine that's primarily programmer limited rather than CPU or I/O limited, LOL.

Over the decades, as parallelism has become popular, it's drifted more toward being financially limited than anything else; in the long run this is probably going to be the new definition: an overall system whose performance is limited solely by economics. You might think that's all computers. Not so: there are plenty that are inherently limited by architecture to low performance, or limited by programming to single-core / single-thread tasks.

The biggest bummer of supercomputers in the parallel era is no one is doing anything about latency. That's nice that your 2000 processor design with 20 deep pipelines eventually after enormous latency can really churn stuff out, but the olden days pursuit of low latency resulting in speed was pretty interesting technologically. Hilariously you'll even get noobs who don't even understand the difference between latency and speed or claim there isn't one.

I've heard folks say before (and I include university profs in that) that it's a computer design focused on parallel calculation and low-latency I/O.

Not sure if this machine qualifies under that, but that is at least one competing definition to just "cutting edge and very fast"

These guys are completely dishonest. I saw their kickstarter video where they said that for $99 you could have "a computer many times faster than anything on the market ZOMG".

Yeah, maybe it's faster for all those times during the day when you calculate matrix chain products. But for largely single-threaded tasks, like EVERYTHING you do on a day to day basis, it's going to be significantly slower than your average dual-core i3.

I backed them on kickstarter, and I don't remember seeing any claim like what you claim to have seen.

To me it was always clear that the current models are not particularly fast. They may be fast "per watt", and if they succeed in their roadmap, then their future 1024 core chips may be fast for the subset of problems that they are suitable for.

In the meantime, the kickstarter page is/was careful to focus on this as a stepping stone, and developer platform for playing with the technology first and foremost, and not as being about delivering some incredibly fast computer for end users.

If anything, they've provided an extreme amount of data, down to cycle counts for memory accesses and the instruction set, and they've dumped a lot of code in our laps, including drivers etc., and the final unit actually comes with a faster version of the Zynq SoC than what they promised, after Xilinx apparently gave them an amazing deal.
