
I don't really agree, at least from a longer term perspective. It's early days yet, but XLA seems to be a promising intermediate representation for letting the DL frameworks run on a wider array of hardware without user-facing software changes. It has traction with Google, NVIDIA, and IIRC Intel, and maybe more (others are definitely using the same approach of splitting the compute graph and scheduling subgraphs, but I'm not certain whether they're using XLA specifically; I know some, like MindSpore, aren't).

XLA has already proven its value by allowing PyTorch to run on TPUs (shittily, but that appears to be more of a VM/GCP infra problem than an XLA problem). The work done for TPUs (and to a lesser extent for GPU optimization) has started to expose some of the major issues, so work can begin on addressing them: the cost of dynamic XLA recompilation as tensor shapes change, and the fact that a lot of important code assumes accelerator-to-CPU communication isn't too expensive, which becomes a huge issue when you try to compile the graph into machine-specific code with XLA or similar, because it forces you to compile only small subgraphs.
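To make the recompilation point concrete, here's a rough sketch of what it looks like with PyTorch/XLA's lazy tensors and metrics report (I'm writing the torch_xla API from memory, so names may differ slightly between versions):

    import torch
    import torch_xla.core.xla_model as xm
    import torch_xla.debug.metrics as met

    device = xm.xla_device()
    model = torch.nn.Linear(128, 10).to(device)

    # Every distinct input shape produces a new lazily-traced graph,
    # and each new graph has to be compiled by XLA from scratch.
    for seq_len in (128, 256, 512):
        x = torch.randn(4, seq_len, 128, device=device)
        loss = model(x).sum()
        xm.mark_step()  # cut the lazy graph here and force compilation

    # The CompileTime counter in this report grows with each new shape.
    print(met.metrics_report())

Eager-mode PyTorch on a GPU just shrugs at new shapes; under XLA each shape is effectively a new program, which is where the compilation latency I'm describing comes from.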

It's early, but the rise of a really effective IR in XLA, combined with the huge amount of resources that Google and NVIDIA can pour into it, makes me very bullish on purpose-built hardware for AI training. It will take a while, I admit.




Mr/Mrs anonymous HN person, please put some info in your profile. You clearly have some deep knowledge of TPUs that I didn't expect to pop up offhandedly on HN. You're correct on all counts: dynamic tensor shapes are more or less impossible with XLA, making it more or less impossible to train a model on arbitrary image-size inputs, even though the math would allow for it; the PyTorch XLA work on TPUs is indeed kind of shitty, and I'm surprised as heck that anyone besides me has said so; and XLA as an IR is promising for portability. Now I'm curious what you've been doing to have experienced these things, since there don't seem to be many others who have (or at least, who are vocal about it).

I agree with you, but I think we differ on our timetables. I am bearish for the next two years, at which point I’ll awaken from my slumber and become a flaming bull. (It helps to remember that “we overestimate the impact of years, but underestimate the impact of decades.” I try to plan accordingly.)

In other words, if you’re bullish that two years from now we’ll start seeing portability implemented in the field across various HPC chips, then we fully agree. But that’s also a glacial pace; GPT-2 changed the world almost two years ago now, and DALL-E seems to be the next frontier for doing interesting generative work. So, we’ll split the difference and say that the bears and bulls will meet in two years for a deep learning hackathon. As a bonus, the pandemic will be over by then, so it can be an in-person meetup.


Ah I see - I think we're pretty much on the same page in terms of timetables. Although if you include TPUs, I think it's fair to say that custom accelerators are already a moderate success.

Updated my profile. I've been working on DL training platforms and distributed training benchmarking for a bit so I've gotten a nice view into the GPU/TPU battle.

Shameless plug: you should check out the open-source training platform we are building, Determined[1]. One of the goals is to take our hard-earned expertise in training infrastructure and build a tool where people don't need to have that expertise themselves. We don't support TPUs, partly because of a lack of demand/TPU availability, and partly because our PyTorch TPU experiments were so unimpressive.

[1] GH: https://github.com/determined-ai/determined, Slack: https://join.slack.com/t/determined-community/shared_invite/...



+1 please put some info in your profile.


I'm dying to know what they work on, either officially or in their spare time. https://news.ycombinator.com/item?id=26586151


> from a longer term perspective

Yes.

Longer term, new hardware will also make it practical to train large models in a fully parallelized, fully distributed manner -- i.e., without having to backpropagate gradients, which requires a lot of complex bookkeeping and plumbing for distributed training.

Recent progress suggests this will happen. See, for example:

https://arxiv.org/abs/2006.04182

https://arxiv.org/abs/2103.03725

https://arxiv.org/abs/2010.01047

I for one am excited to see what happens over the next decade as it becomes trivial to train/use models with 1K, 1M, or 1B times more dense connections than present state-of-the-art models.
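For anyone wondering what "training without backpropagating gradients" can look like in practice, here's a toy sketch of one such family of approaches, feedback-alignment-style updates, where every layer receives the global output error through a fixed random matrix instead of a backpropagated gradient. This is just my own NumPy illustration of the general idea, not the specific method from any of the papers above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 2-hidden-layer MLP trained with feedback-alignment-style updates:
    # each layer gets the global output error through a fixed random matrix
    # (B1, B2), so no gradient is backpropagated layer by layer.
    d_in, d_h, d_out = 32, 64, 10
    x = rng.normal(size=(256, d_in))
    y = np.tanh(x @ rng.normal(0, 0.5, (d_in, d_out)))  # learnable targets

    W1 = rng.normal(0, 0.1, (d_in, d_h))
    W2 = rng.normal(0, 0.1, (d_h, d_h))
    W3 = rng.normal(0, 0.1, (d_h, d_out))
    B1 = rng.normal(0, 0.1, (d_out, d_h))   # fixed feedback matrices
    B2 = rng.normal(0, 0.1, (d_out, d_h))

    lr = 0.05
    for step in range(500):
        h1 = np.tanh(x @ W1)                # forward pass
        h2 = np.tanh(h1 @ W2)
        y_hat = h2 @ W3

        e = y_hat - y                       # global output error
        d2 = (e @ B2) * (1 - h2 ** 2)       # error broadcast directly
        d1 = (e @ B1) * (1 - h1 ** 2)       #   to each hidden layer

        W3 -= lr * h2.T @ e / len(x)
        W2 -= lr * h1.T @ d2 / len(x)
        W1 -= lr * x.T @ d1 / len(x)

    print("final MSE:", float(((y_hat - y) ** 2).mean()))

The relevant property for hardware is that d1 and d2 don't depend on each other, so the per-layer updates can in principle be computed in parallel across devices, which is exactly the kind of bookkeeping-free distribution the comment above is pointing at.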



