How to implement a multi-CDN strategy (streamroot.io)
131 points by jlouazel 5 days ago | 52 comments





This is a nicely written article, however it's worth noting that the performance/reliability/availability differences across CDNs at a particular moment in time are pretty much nonexistent. These providers share the same backbone networks, the same IX PoPs, etc., and thereby offer little diversification benefit. See https://blog.edgemesh.com/understanding-diversification-netw...

Where multi-CDN really shines is in region-specific solutions (e.g. China, India, Brazil, Argentina). It's probably worth noting that the team at Streamroot helps do this client side, and their P2P-style option helps localize traffic as well. The former is certainly the way to go, and the latter really helps add network-level diversification. Of course - I'm biased, as we offer similar lower-level solutions.


This statement is not even close to true. At a particular point in time, different CDNs can have very different performance even for the same ISP in the same region.

Absolutely. If you were a CDN's only customer this might be true, but the reality is that you're not, and they are always going to be oversubscribed. Having worked at a CDN provider (Cloudflare), I can tell you that they are constantly battling resource contention from DDoS and other reliability issues.

Multi-CDN is the way to go for performance and availability, though as a customer it can be challenging because you’re forced to limit your configuration to the lowest common denominator of features and there’s not a great way to test consistency of your configurations across all vendors.

This article is essentially a high level sales pitch though; I didn’t find it all that useful. I implemented multi-CDN at Pinterest using Cedexis (DNS based), though with modern DNS providers like NSOne, Cloudflare, Dynect, a modern spark-based ETL pipeline, and the browser navigation timing API (RUM), it wouldn’t be too challenging to build something resembling Cedexis yourself.


Indeed the article is more an introduction to multi-CDN concepts, but take into account that it was written for the HTTP video streaming use-case, and not static content CDNs. A client-side implementation of switching would not be very useful for that kind of content.

For Cedexis, I think the strength is not only the configurable DNS routing system, but also that they set up a lot of probes for different CDNs & clouds and share globally aggregated data that anyone can access, which can be useful when you don't have Alexa top-1000 traffic.


Have to disagree here: you can get very different performance from different CDNs for the same user at the same time. Some CDNs have their own backbone (at least partially), many use different routes, and much of the time the issue is not the backbone but the peering interconnections, which can differ between each ISP & CDN pair. And a CDN's capacity is shared between all its customers, so if you get a huge peak from one of them, it can impact the others too.

Old but good example: before Apple started building out its own CDN, it was using the leading commercial one, and when Apple pushed its iOS/macOS updates, other broadcasters had big trouble delivering their streams at the same time because the CDN was overloaded - but that doesn't mean all the other CDNs were also down. That's also why most video broadcasters now do multi-CDN for their biggest live events, like the Super Bowl or the World Cup: to be able to distribute the load across several networks.

In this case - given this and the other comments - I stand corrected! I would love to see some data on examples of this occurring in the wild, and it 100% makes sense that a congested CDN provider would impact neighbors. It would be great if someone could do a writeup on examples and the detection/mitigation strategies. Perhaps issues like these (alongside cost) are driving the DIY CDN adoptions (Apple being the exemplar, but also Tesla etc.)? The Pinterest example is a great real-world one too - and they do an awesome job, especially given the size of that cache - so there must be some real value from a performance standpoint! Out of curiosity, do these dynamic switching decisions seem to work better at the server level or the client level?

Shower thought: what if html/http/browsers supported, as a primitive, the concept of "fetch this asset from url A, or if that doesn't work, B, or if that doesn't work, C ..."?

If video is being served via HLS (which it probably is in 2018), then the manifests support redundant streams, where multiple hosts can be specified for each stream. [0]

hls.js supports this, as do many other clients. IME it works nicely for providing some client-side switching in case one of your hosts/CDNs goes down.

[0] https://developer.apple.com/library/archive/documentation/Ne...


Both HLS and DASH support redundant streams (by adding a redundant variant URL in the HLS playlist, and multiple BaseURLs in the DASH manifest). It's indeed the simplest way to get a client-side fallback. If you use it, you should make sure that the player supports it, and that the retry mechanisms are configured correctly (for instance, all the MaxRetry config params in hls.js: https://github.com/video-dev/hls.js/blob/master/docs/API.md#... )
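For reference, this is roughly what redundant variants look like in an HLS master playlist (hostnames are placeholders): consecutive EXT-X-STREAM-INF entries with identical attributes act as backups for one another, and a compliant player fails over to the next URL when the current host errors out.

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
https://cdn-a.example.com/720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
https://cdn-b.example.com/720p/playlist.m3u8
```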

"If that doesn't work" isn't the problem.

As a silly limiting example, imagine that you host Netflix on your dial-up connection as url A.

It works.

Oh, okay, right, let's set a timeout then, if it takes more than 1 second to load, we try url B.

That works, but now we've got a 1 second delay on everything. Okay, we'll update the default to be url B.

Conditions are changing all the time as a result of bottlenecks in the infrastructure moving about.

What I think you'd actually need to do is something like this - initially, fetch from multiple endpoints simultaneously with an early-cancel (so you don't waste bandwidth on the slower ones).

For N seconds you just use the fastest one (perhaps with an 'if it doesn't work' mechanism, sure).

Every N seconds you re-evaluate the fastest endpoint using the multi-fetch.

And so on and so forth.

There are better algorithms, this is back of the envelope stuff.
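The "multi-fetch with early cancel" idea above can be sketched in a few lines. This is only an illustration, not a real client: `fetch` is a caller-supplied probe function, and Python threads can't be force-killed, so a real implementation would abort the underlying sockets instead of just discarding the slow futures.

```python
import concurrent.futures

def fastest_endpoint(urls, fetch, timeout=2.0):
    """Race the same small probe request against several endpoints and
    return the URL that responds first. `fetch` downloads a probe from
    one URL; slower probes are discarded once a winner is known."""
    with concurrent.futures.ThreadPoolExecutor(len(urls)) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        done, not_done = concurrent.futures.wait(
            futures, timeout=timeout,
            return_when=concurrent.futures.FIRST_COMPLETED)
        for f in not_done:
            # best effort: cancel probes that haven't started yet
            f.cancel()
        winner = next(iter(done), None)
        return futures[winner] if winner else None
```

Re-running this every N seconds, as the comment suggests, gives a crude rolling re-evaluation of the best endpoint.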


Your solution to bandwidth congestion is for everyone to use 3x+ more bandwidth than they need?

Is this a bot?

Firstly, I'm not solving anything. I'm explaining why fallback URLs are not equivalent to CDNs.

You don't use a CDN because your site doesn't work, you use it because it's faster.

Secondly, no, doing an occasional speed test, using data you'd be downloading anyway, then selecting an endpoint between speedtests does not increase bandwidth usage by 3x.

Baffled.


Some people use CDNs as regular static-site web hosts, or hosts of an SPA client JS blob when they have an otherwise-"serverless" architecture. CDNs are not always about serving large media assets.

They're not consuming 3x bandwidth if they're bailing out after downloading less than a kilobyte of a video that's tens or hundreds of megabytes in size.

That extra bandwidth is a rounding error in the grand scheme of things.

It could be important, though, for the client to signal the server to close the connection. Theoretically the connection would drop after several seconds and the server would stop transmitting, but I could imagine some middleware cheerfully downloading the whole stream and throwing it away.


The problem with this approach is that you're only considering time to first byte, which is part of the equation, especially for smaller files like scripts; but for larger files like video segments, throughput is more important. If you only wait for 1 kB to download, then you essentially measure time to first byte.

Also, the instruction to stop the download is not instantaneous, so by the time you realize you have downloaded 1 kB on the client side, the server might already have sent the whole video segment, so this is not the way to go if you want to reduce congestion.


For video you can fetch different chunks from different endpoints simultaneously, not the same chunk, therefore not wasting bandwidth at all.

This is more or less what we do with our client-side switching solution at Streamroot: we first make sure the user has enough video segments in their buffer, and then we try to get the next video segments from different CDNs, so we're able to compare the download speeds, latency and RTT between the different CDNs without adding any overhead. You don't necessarily download the segments at the same time, but with some estimation and smoothing algorithms you're able to compute meaningful scores for each CDN. The concepts here are very close to the problem of bandwidth estimation for HTTP adaptive bitrate streaming formats like HLS & DASH, because you have an unstable network and can only estimate the bandwidth from discrete segment measurements. [0]

If you want to do it at a sub-asset level (video segment, image or JS file), it's possible with byte-range requests (ask for bytes 0-100 from CDN A and 101-200 from CDN B), but in that case you still add some overhead for establishing the TCP connections, and since you need the whole asset before you can use it, you'll just limit the download speed to the minimum of the two.

[0] https://blog.streamroot.io/abr-algorithms-work-optimize-stac...
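The "estimation and smoothing" the comment mentions can be as simple as an exponentially weighted moving average of measured segment throughput per CDN. This is an illustrative sketch under that assumption, not Streamroot's actual scoring algorithm; the weighting constant is arbitrary.

```python
class CdnScorer:
    """Track an EWMA of observed download throughput per CDN and pick
    the currently best-scoring one (illustrative sketch only)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # weight of the newest sample
        self.scores = {}     # CDN name -> smoothed throughput, bits/s

    def record_segment(self, cdn, size_bytes, seconds):
        sample = size_bytes * 8 / seconds  # instantaneous bits/s
        prev = self.scores.get(cdn)
        self.scores[cdn] = (sample if prev is None
                            else self.alpha * sample + (1 - self.alpha) * prev)

    def best(self):
        return max(self.scores, key=self.scores.get) if self.scores else None
```

Feeding every downloaded segment's size and duration into `record_segment` gives each CDN a score that reacts to congestion without jumping on a single noisy measurement.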


I think that could be very beneficial. If it were a built-in feature of HTTP (or, more broadly, TCP/IP), it would not only save people the hassle of reinventing the wheel, it would also be easier to ensure it's on by default for all static resources and thus get the benefit across the board.

Perhaps it could be done in a flexible, extensible way as well. Create a limited language (no loops or dangerous stuff) to express policy, search order, etc. And design it so the client side doesn't necessarily have carte blanche and the server side can maintain some control if necessary.


Internal browser support for local caching based on a hash versus "where it came from" would be helpful as well.

Yes, that would be great. Imagine a git-like web, where browsers could fetch the difference in chunks of a cached file when there are changes on the server side.

Doesn't work for streaming live events (pay-per-view), which is the main use case for multi-CDN.

Their list of redundancy, agility, and cost doesn't seem exclusive to video, though it's perhaps more compelling given video's time sensitivity and bandwidth requirements.

The problem is that that leaks information about your viewing habits on one site, to another.

Can't you do that with DNS records where there are multiple IPs on a A record?

Basically, we're already doing this for fault tolerance and load balancing within a single CDN. Except that currently we randomize the IPs. To enforce priorities, you'd want the IPs in the A record at least partially ordered by provider.


Multiple IPs on an A record work to some extent: most (many?) browsers will silently retry another IP from the list if some of the IPs don't accept a connection; I don't know if they'll try another IP on timeout, though.

But you can't actually expect any ordering to make it through to the client. Your authoritative server may reorder the records, their recursive server may reorder the records, and the client resolution library may also reorder them. There's actually an RFC advocating reordering records in client libraries; it's fairly misguided, but it exists in the wild. Reordering is also likely to happen in OS DNS caches where those are used.
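The "silently retry another IP" behavior described above can be sketched like this: resolve every address for the host, then attempt a TCP connection to each in resolver order, moving on after a failure or timeout.

```python
import socket

def connect_any(host, port, timeout=1.0):
    """Try a TCP connection to each resolved address for `host`, in the
    order the resolver returned them (which, per the comment above, may
    not be the order the zone publishes). Returns the first socket that
    connects; raises the last error if all addresses fail."""
    last_err = None
    for family, _, _, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as err:
            last_err = err
    raise last_err or OSError("no addresses for %s" % host)
```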


For reference, RFC 3484 [1] is the misguided RFC that tells people they should sort their DNS responses to maximize the common prefix between the source and destination address. This is probably helpful when the common prefix is meaningful, but when numbering agencies give out neighboring /24s to unrelated networks, and related networks often have IPs widely distributed across the overall address space, it's not actually useful.

[1] https://www.ietf.org/rfc/rfc3484.txt


Thanks for clarifying!

That's not quite what an anycast to those addresses does. It's more like an approximation of the nearest server.

They do for some limited items e.g. <object> does nested fallback.

That's not sufficient for something like CDN selection though, you want a fallback in case of failure but you first want to select based on various criteria.


Combine with SRI and some convention to just ask one of several hosts for it based on the hash(es), and we have content-addressable loading.

So IPFS?

IPFS would be a particular implementation of those more general concepts. SRI + multiple HTTP sources would be another more incremental approach.
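A minimal sketch of the "SRI + multiple HTTP sources" idea, with hypothetical hostnames and an elided hash: both hosts serve the same hash-pinned file, so a failure on the first simply triggers a fallback load of a byte-identical copy.

```html
<!-- Hypothetical example: if cdn-a fails (network error or SRI
     mismatch), onerror loads the same file from cdn-b; the integrity
     hash guarantees both copies are identical before execution. -->
<script src="https://cdn-a.example.com/lib.min.js"
        integrity="sha384-..."
        crossorigin="anonymous"
        onerror="loadFallback(this)"></script>
<script>
  function loadFallback(failed) {
    var s = document.createElement('script');
    s.src = 'https://cdn-b.example.com/lib.min.js';
    s.integrity = failed.integrity;
    s.crossOrigin = 'anonymous';
    document.head.appendChild(s);
  }
</script>
```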

I'm still waiting for a browser to figure out I mean "com" when I typed "cim," and you're thinking CDN retries would work?

Are you thinking of something like BitTorrent? Why ask for the whole file from a list of hosts when you could ask for any bit of the file they might have?

Then people would create browser plugins or greasemonkey scripts to always optimise for things the viewer cares about (time to start, likelihood of switching providers mid-stream, likelihood of getting full resolution for the longest subset of the video, ...) and disregard the prioritisation set by the provider (which might care about costs, which depend on contracted minimums, overage tariffs etc.).

Then providers would need to combat this by dropping the most expensive CDNs, causing a race to the bottom in which everyone loses: users have worse streaming experience, providers lose customers, good CDNs make less money, margins for bad CDNs are squeezed.


The number of people who would install that kind of add-on is so small it would have essentially zero effect on the provider's costs …

This sounds similar to https://www.conviva.com/precision/

Unfortunately you need to know a lot more and the devil is in the details. Supporting the various streaming devices/browsers is a huge pain in the ass.

Full Disclosure: I worked for both Conviva, and Akamai.


Nikolay from Streamroot.io here, co-author of the article.

Yes, Conviva provides a service that can give you information about the QoS of each CDN by aggregating data from their customers (they provide a video analytics solution), but it doesn't do the switching (neither on the server side nor on the client side), so the video player would need to implement the switching logic itself.

The solution from Streamroot can use these kinds of APIs, like Conviva Precision or those of its competitors like Youbora and Cedexis, and the real value it adds is the client-side switching capability for the players, so it's quite complementary to those solutions.

And indeed the devil is in the details, that's why we built this client-side SDK so the customers don't have to implement all the logic themselves on each platform and device. It was easier for us as we already have SDKs and plugins for most players for our P2P hybrid delivery solution.


There are several factors to consider in a multi-CDN delivery solution.

First, is it VoD or live? HLS and DASH both have a second-URL option (BaseURL in DASH), and the client determines when to choose that fallback URL. If playback falls back to the second URL, the viewer may already have experienced some buffering or bitrate downshifts that triggered the player's decision.

Although stream playback recovers/continues, the user experience could have been, and likely was, impacted. Here a second CDN in the multi-CDN deployment was accessed by the client, but there is no intelligence in the provider selection: typically the (perceived) most reliable CDN gets the first spot, and the backup CDN gets the fallback position (second URL) in the manifest/MPD.

With live, you have the opportunity to provide intelligent CDN selection on every manifest/MPD refresh. If your multi-CDN selection layer has intelligence and real-time access to performance metrics, the manifest can be pointed (directed) at the alternate CDN. This requires a level of manifest management at the session level, so that the m3u8 retains the proper historical CDN selection and doesn't break playback for that session (in most if not all cases).
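The per-refresh manifest rewriting described above boils down to repointing segment URLs at the chosen CDN each time the playlist is served. A minimal sketch (assuming absolute segment URLs; real session-level pinning of earlier choices is omitted):

```python
from urllib.parse import urlsplit, urlunsplit

def repoint_manifest(m3u8_text, cdn_host):
    """Rewrite every URL line in an HLS playlist to the chosen CDN's
    host, leaving #-tag lines untouched. Illustrative only: a real
    system must keep already-served segments on their original CDN
    per session so playback isn't broken mid-stream."""
    out = []
    for line in m3u8_text.splitlines():
        if line and not line.startswith("#"):
            parts = urlsplit(line)
            line = urlunsplit((parts.scheme, cdn_host, parts.path,
                               parts.query, parts.fragment))
        out.append(line)
    return "\n".join(out)
```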

There are client solutions (SDKs), DNS solutions, and cloud solutions that are neither client- nor DNS-based. You get to decide how you want integration to be managed and how much ongoing effort your team can or can't invest in the solution.

What is most important to consider is the viewer experience, and how playback can best be delivered to avoid buffering and downshifts: the things that cause a viewer to abandon your content and possibly not come back.

If a CDN is performant and N+1 users are now beginning to watch a stream on that provider's network, capacity could be (and often is) an issue, so continuing to send users to that CDN may produce a sub-optimal experience. Metrics measuring playback detect that bitrates are dropping and buffering is increasing, and new requests are served by an alternate CDN providing a better playback experience.

Video is a tightly controlled series of events. We work with chunks of 10s, 6s, 2s, continually trying to balance the benefits of large buffers against fast start times.

With an SDK/client-based solution, you have the engineering effort of keeping up with OS/hardware updates, testing new code in the SDKs, and then pushing out across several platforms, players, etc. It can be daunting.

With DNS, you have TTLs to manage (lower is better, and faster for the next user), and there is no intelligent mid-stream switching once the client is pulling manifests from a specific provider.

With a cloud-based solution, each individual stream/user/device is measured, and CDN selection can be performed in real time for live, and for _each_ request on VoD.

Disclaimer, I work at DLVR, and formerly Cedexis. = ]


For VoD I like the approach where you use a fast and reliable CDN for the first seconds, and in the background buffer the rest of the video from a cheap location/CDN.

This works if you download video faster than real time which is almost always the case. That way you get the best of both worlds.
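The two-tier idea reduces to a trivial selection rule per segment; hostnames and the cutover point below are made up for illustration.

```python
def pick_cdn(segment_index, startup_segments=3,
             fast_cdn="premium-cdn.example.com",
             cheap_cdn="budget-cdn.example.com"):
    """Serve the startup-sensitive first segments from a fast CDN, then
    switch to a cheaper one once the buffer covers the cutover window.
    All parameters here are illustrative assumptions."""
    return fast_cdn if segment_index < startup_segments else cheap_cdn
```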


That's clever!

What do metrics show for UX for that workflow? (Bonus: What tool for capturing play data?)


Using this comment thread to plug a question: Is there a way to use Cloudflare (or other DNS server providers) in order to dynamically fallback to CDN A (e.g. Cloudfront) in case CDN B (e.g. Netlify CDN) is down?

I don't think Cloudflare Load Balancing can do this (yet), but Dyn can: https://dyn.com/active-failover/

The crucial part is CNAME compatibility. Most DNS services I've had experience with can only do failover between IPs.


Thank you, will try to implement this with Dyn

You can control DNS failover with custom health checks in AWS Route 53. You can also do latency based routing. I can't speak for Cloudflare, but I imagine they have similar capability.

Thank you. I overall prefer CF over R53, will try to check out if this can be achieved with either workers or page rules.

I work for a startup called DLVR (Deliver) and we do this for video streams. HLS, DASH, MSS http://www.dlvr.com

Pretty sure Netlify's CDN is Akamai. So top tier for sure.

Off topic: the cookie banner completely hides the navbar/logo/navigation.

Thanks for noticing! We'll make sure to improve this

well the most annoying part is the navbar stays... worst UX.

I guess they only care about video. For websites, multi-CDN essentially means building your own CDN; using other CDNs isn't even a good idea, since they don't provide enough granularity of control to monitor and choose nodes, and therefore limit what you can achieve in terms of latency and availability. DNS is also your biggest, and often your only, friend here; learning it and deploying it yourself is critical, and you shouldn't rely on any vendor to do it.


