Hacker News
new
|
past
|
comments
|
ask
|
show
|
jobs
|
submit
login
Instrumentation checklist for running large GPU clusters
(
together.ai
)
2 points
by
amaldavid
5 months ago
|
hide
|
past
|
favorite
|
2 comments
amaldavid
5 months ago
[–]
Just stumbled upon this blog which details out testing and validating large GPU clusters before running training workloads. Any other similar blogs which adds more nuance in terms of debugging the issues once we identify them as well?
roanakb
5 months ago
|
parent
[–]
this is a good one for debugging rdma:
https://docs.redhat.com/en/documentation/red_hat_enterprise_...
Consider applying for YC's Spring batch! Applications are open till Feb 11.
Guidelines
|
FAQ
|
Lists
|
API
|
Security
|
Legal
|
Apply to YC
|
Contact
Search: