Where multi-CDN really shines is helping with region-specific solutions (e.g. China, India, Brazil, Argentina, etc.). It's probably worth noting that the team at Streamroot helps do this client side, and their P2P-style option helps localize traffic as well. The former is certainly the way to go, and the latter really helps add network-level diversification. Of course, I'm biased, as we offer similar lower-level solutions.
Multi-CDN is the way to go for performance and availability, though as a customer it can be challenging: you're forced to limit your configuration to the lowest common denominator of features, and there's no great way to test the consistency of your configurations across all vendors.
This article is essentially a high-level sales pitch though; I didn't find it all that useful. I implemented multi-CDN at Pinterest using Cedexis (DNS based), though with modern DNS providers like NSOne, Cloudflare, and Dynect, a modern Spark-based ETL pipeline, and the browser navigation timing API (RUM), it wouldn't be too challenging to build something resembling Cedexis yourself.
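For anyone curious, the RUM half is fairly approachable. Here's a rough sketch (the CDN hostnames and the beacon endpoint are made up) that times a small probe object on each CDN via the Resource Timing API and reports the results:

    // Rough sketch: time a small probe object on each candidate CDN via
    // the Resource Timing API, then beacon the results home.
    // The CDN hostnames and the /rum endpoint are placeholders.
    const CDNS = ["https://cdn-a.example.com", "https://cdn-b.example.com"];

    async function probe(base: string): Promise<number> {
      const url = `${base}/probe.gif?cb=${Date.now()}`; // cache-buster
      await fetch(url, { mode: "no-cors", cache: "no-store" });
      const entry = performance.getEntriesByName(url).pop();
      return entry ? entry.duration : Number.POSITIVE_INFINITY;
    }

    async function reportRum(): Promise<void> {
      const timings = await Promise.all(CDNS.map(probe));
      const payload = Object.fromEntries(CDNS.map((c, i) => [c, timings[i]]));
      navigator.sendBeacon("/rum", JSON.stringify(payload)); // fire-and-forget
    }

Aggregate enough of those beacons per network/geo and you have the raw data a Cedexis-style router needs.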
For Cedexis, I think the strength is not only the configurable DNS routing system, but also that they set up a lot of probes for different CDNs & clouds and share globally aggregated data that anyone can access, which can be useful when you don't have Alexa top-1000 levels of traffic.
hls.js supports this, as do many other clients. IME it works nicely for providing some client-side switching in case one of your hosts/CDNs goes down.
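For HLS specifically, the mechanism is redundant variant streams: list the same rendition more than once in the master playlist with different hosts, and a compliant client (hls.js included) fails over to the next URI when one errors out. Roughly (hosts invented):

    #EXTM3U
    #EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
    https://cdn-a.example.com/720p/playlist.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
    https://cdn-b.example.com/720p/playlist.m3u8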
As a silly limiting example, imagine that you host Netflix on your dial-up connection as URL A.
Oh, okay, right, let's set a timeout then: if it takes more than 1 second to load, we try URL B.
That works, but now we've got a 1-second delay on everything. Okay, we'll update the default to be URL B.
Conditions are changing all the time as bottlenecks move around the infrastructure.
What I think you'd actually need to do is something like this: initially, fetch from multiple endpoints simultaneously with an early-cancel (so you don't waste bandwidth on the slower ones).
For N seconds you just use the fastest one (perhaps with an 'if it doesn't work' mechanism, sure).
Every N seconds you re-evaluate the fastest endpoint using the multi-fetch.
And so on and so forth.
There are better algorithms, this is back of the envelope stuff.
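A browser-side sketch of that loop, with invented endpoints and arbitrary numbers, using AbortController for the early-cancel:

    // Sketch: every REEVAL_MS, race all endpoints for the next chunk, take
    // the first to respond, and abort the rest so we don't keep paying
    // bandwidth for the slower connections. Endpoints are placeholders.
    const ENDPOINTS = ["https://a.example.com", "https://b.example.com"];
    const REEVAL_MS = 30_000; // re-run the race every N seconds

    let fastest = ENDPOINTS[0];
    let lastRace = 0;

    async function race(path: string): Promise<Response> {
      const controllers = ENDPOINTS.map(() => new AbortController());
      const winner = await Promise.any(
        ENDPOINTS.map((base, i) =>
          fetch(base + path, { signal: controllers[i].signal }).then((res) => {
            if (!res.ok) throw new Error(`${base}: HTTP ${res.status}`);
            return { res, i };
          })
        )
      );
      controllers.forEach((c, i) => { if (i !== winner.i) c.abort(); }); // early-cancel losers
      fastest = ENDPOINTS[winner.i];
      lastRace = Date.now();
      return winner.res;
    }

    async function fetchChunk(path: string): Promise<Response> {
      if (Date.now() - lastRace > REEVAL_MS) return race(path); // re-evaluate
      return fetch(fastest + path); // between races, just use the winner
    }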
Firstly, I'm not solving anything. I'm explaining why fallback URLs are not equivalent to CDNs.
You don't use a CDN because your site doesn't work, you use it because it's faster.
Secondly, no, doing an occasional speed test, using data you'd be downloading anyway, then selecting an endpoint between speedtests does not increase bandwidth usage by 3x.
That extra bandwidth is a rounding error in the grand scheme of things.
It could be important, though, for the client to signal the server to close the connection. Theoretically the connection would drop after several seconds and the server would stop transmitting, but I could imagine some middleware cheerfully downloading the whole stream and throwing it away.
Then, the instruction to stop the download is not instantaneous: by the time the client realizes it has downloaded 1 kB, the server might already have sent the whole video segment from the other side, so this isn't the way to go if you're trying to optimize congestion.
If you want to do it at the sub-asset level (video segment, image, or JS file), it's possible with byte-range requests (ask for bytes 0-100 from CDN A and 101-200 from CDN B), but then you still add some overhead for establishing the TCP connections, and since you need the whole asset before you can use it, you've just limited the download speed to the minimum of the two.
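For what it's worth, the byte-range version is simple enough to sketch (URLs invented; real code would first need the asset length, e.g. from a HEAD request):

    // Sketch: fetch the two halves of one asset from two CDNs in parallel
    // with Range requests, then stitch them back together. Completion time
    // is bounded by the slower half, as noted above.
    async function splitFetch(
      urlA: string,
      urlB: string,
      length: number // total asset size in bytes, known in advance
    ): Promise<Uint8Array> {
      const mid = Math.floor(length / 2);
      const [a, b] = await Promise.all([
        fetch(urlA, { headers: { Range: `bytes=0-${mid - 1}` } }),
        fetch(urlB, { headers: { Range: `bytes=${mid}-${length - 1}` } }),
      ]);
      const [bufA, bufB] = await Promise.all([a.arrayBuffer(), b.arrayBuffer()]);
      const out = new Uint8Array(length);
      out.set(new Uint8Array(bufA), 0);
      out.set(new Uint8Array(bufB), mid);
      return out;
    }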
Perhaps it could be done in a flexible, extensible way as well. Create a limited language (no loops or dangerous stuff) to express policy, search order, etc. And design it so the client side doesn't necessarily have carte blanche and the server side can maintain some control if necessary.
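Entirely hypothetically, a policy document in such a language might look like:

    {
      "order": ["cdn-a", "cdn-b"],
      "select": { "metric": "p95_latency_ms", "window_s": 30 },
      "fallback_timeout_ms": 1000,
      "server_may_override": true
    }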
Basically, we're already doing this for fault tolerance and load balancing within a single CDN. Except that currently we randomize the IPs. To enforce priorities, you'd want the IPs in the A record at least partially ordered by provider.
But you can't actually expect any ordering to make it through to the client. Your authoritative server may reorder the records, their recursive server may reorder the records, and the client resolution library may also reorder the records. There's actually an RFC advocating reordering records in client libraries; it's fairly misguided, but it exists in the wild. Reordering is also likely to happen in OS DNS caches where those are used.
That's not sufficient for something like CDN selection though: you want a fallback in case of failure, but first you want to select based on various criteria.
Then providers would need to combat this by dropping the most expensive CDNs, causing a race to the bottom in which everyone loses: users have worse streaming experience, providers lose customers, good CDNs make less money, margins for bad CDNs are squeezed.
Unfortunately you need to know a lot more and the devil is in the details. Supporting the various streaming devices/browsers is a huge pain in the ass.
Full Disclosure: I worked for both Conviva and Akamai.
Yes, Conviva provides a service that can give you information about QoS per CDN by aggregating data from their customers (they provide a video analytics solution), but it doesn't do the switching (neither on the server side nor on the client side), so the video player needs to implement that logic itself.
The solution from Streamroot can use these kinds of APIs, like Conviva Precision or those from its competitors like Youbora and Cedexis, and the real value it adds is the client-side switching capability in the players, so it's quite complementary to those solutions.
And indeed the devil is in the details; that's why we built this client-side SDK, so customers don't have to implement all the logic themselves on each platform and device. It was easier for us since we already have SDKs and plugins for most players for our P2P hybrid delivery solution.
First, is it VoD or Live?
HLS (and DASH) have a second-URL option (BaseURL in DASH), and it's up to the client to determine when to choose that fallback URL. If playback falls back to the second URL, the viewer's experience likely included some buffering or bitrate downshifts that triggered the player's decision.
Although stream playback recovers/continues, the user experience could have been, and likely was, impacted. Here the client accessed a second CDN in the multi-CDN deployment, but there is no intelligence in the provider selection. Typically the (perceived) most reliable CDN gets the first spot, and the backup CDN gets the fallback position (second URL) in the manifest/MPD.
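In DASH that takes the form of multiple BaseURL elements in the MPD, which the client works through in order on failure (hosts invented):

    <MPD ...>
      <BaseURL>https://cdn-a.example.com/vod/</BaseURL>
      <BaseURL>https://cdn-b.example.com/vod/</BaseURL>
      ...
    </MPD>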
In Live, you have the opportunity to provide intelligent CDN selection on every manifest/MPD refresh. If your multi-CDN selection layer has intelligence and access to performance metrics in real time, that manifest can now be pointed (directed) to the alternate CDN. This requires manifest management at the session level, so that the m3u8 retains the proper historical CDN selection and doesn't break playback for that session (in most if not all cases).
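A minimal sketch of that session-level manifest management (names invented; pickBestCdn is assumed to consult your real-time metrics): each refresh assigns new segment lines to the currently best CDN, while already-served segments keep their original host:

    // Sketch: per-session live manifest rewriting. Tags pass through;
    // bare lines are treated as segment URIs. Segments already handed out
    // keep their host so the session's history stays consistent.
    type Session = { assigned: Map<string, string> }; // segment URI -> CDN host

    function rewriteManifest(
      m3u8: string,
      session: Session,
      pickBestCdn: () => string // assumed: consults real-time CDN metrics
    ): string {
      return m3u8
        .split("\n")
        .map((line) => {
          if (line.startsWith("#") || line.trim() === "") return line;
          const host = session.assigned.get(line) ?? pickBestCdn();
          session.assigned.set(line, host); // sticky per segment
          return `https://${host}/${line}`;
        })
        .join("\n");
    }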
There are client solutions, DNS solutions, and cloud solutions that are neither client (SDK) nor DNS based. You get to decide how you want integration to be managed and how much ongoing work your team can or can't invest in the solution.
What is most important to consider is the viewer experience, and how playback can best be delivered to avoid buffering and downshifts: the things that cause a viewer to abandon your content and possibly not come back.
If a CDN is performant, and N+1 users are now beginning to watch a stream on that provider's network, capacity could be (and often is) an issue. Continuing to send users to that CDN may produce a sub-optimal experience. Metrics measuring playback determine that bitrates are dropping and buffering is increasing, and new requests get served by an alternate CDN providing a better playback experience.
Video is a tightly controlled series of events. We work with chunks of 10s, 6s, 2s, continually trying to balance the benefits of large buffers and fast start times.
With an SDK (client-based) solution, you have the engineering effort of keeping up with OS/hardware updates, testing new code in the SDKs, and then pushing it out across several platforms, players, etc. It can be daunting.
With DNS, you have TTLs to manage (lower is better, i.e. faster for the next user), and there is no intelligent mid-stream switching once the client is pulling manifests from a specific provider.
With a cloud-based solution, each individual stream/user/device is measured, and selection can be performed in real time for Live, and for _each_ request on VoD.
Disclaimer: I work at DLVR, and formerly Cedexis. = ]
This works if you download video faster than real time, which is almost always the case (e.g. a 6-second segment fetched at 2x real time finishes in 3 seconds, leaving headroom to probe or switch). That way you get the best of both worlds.
What do metrics show for UX for that workflow?
(Bonus: What tool for capturing play data?)
The crucial part is CNAME compatibility. Most DNS services I've had experience with can only do failover between IPs.
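In other words, the selection layer needs to answer with the chosen CDN's hostname rather than an IP, so that CDN's own DNS can still do its edge mapping; something like (hosts invented):

    ; answer from the routing layer for this resolver/user
    video.example.com.  60  IN  CNAME  customer123.cdn-a-edge.example.net.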