I looked through the source code and saw string concatenation to set the active texture unit, among many other inefficiencies. Everything does isReady() checks multiple times per frame (rather than creating the object only once it's ready). The material parameters are indexed in the worst possible way. The stats gathering code is still active in release builds. The list goes on and on :)
Some caches are even implemented with bugs. Half the code uses a cache in one way and the other half in a different way. That was fun to dig into.
I tried to optimize Babylon for a few days at work before giving up - my general rule of thumb is that if I'm about to refactor more than 20% of a codebase, it's much, much faster to rewrite it, provided I'm already familiar with the problem space. It took about two weeks to write a proprietary renderer that ran circles around Babylon - though it only supported static meshes.
Babylon was useful for prototyping, but the mobile performance isn't there; even after days of profiling and optimizing every slow path, it was still an order of magnitude slower than an in-house renderer designed for performance from the ground up.
That's to optimize the scene, not the engine itself :)
There could be some tradeoffs to those suggestions as well. For example, using unindexed geometry for simple meshes can still be slower if there are many vertex attributes. It's also not uncommon to render tessellated meshes - there's a sweet spot in triangle size for mobile GPUs, at least tile-based ones like PowerVR. With VR barrel distortion applied in the vertex shader during the main pass, you definitely don't want cubes made out of only 12 triangles.
Vertex count isn't that important a metric anyway; you can push a few million polygons in a few hundred draw calls to mobile GPUs every frame and still run at a smooth 30 FPS. Desktop is an order of magnitude higher (5k draw calls/frame is common). The number of draw calls, the cost of their shaders and how fast the CPU can push them are much more important. There's little difference between 20k and 40k polygon meshes, but there's a huge one between 20 and 40 draw calls. It's creating batches that's costly, not running them.
We also had heuristics to determine an appropriate device pixel ratio without completely disabling the scaling. So for mobile devices with a ratio of 3, instead of tripling the pixel count we'd settle for a ratio in between. Text projected in 3D was just unreadable on iPhone without this, and going all the way to 3x was overkill.
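A sketch of the idea, assuming a Babylon-style engine object (the 2x knee and the 0.25 blend factor are illustrative, not our production values):

    // TypeScript sketch: soften the device pixel ratio instead of using it
    // as-is or clamping it to 1. The numbers are made up for illustration.
    declare const engine: { setHardwareScalingLevel(level: number): void };

    function softenedDPR(): number {
      const dpr = window.devicePixelRatio || 1;
      return dpr <= 2 ? dpr : 2 + (dpr - 2) * 0.25; // a 3x phone lands at 2.25x
    }

    // Render at the softened resolution (hardware scaling is 1/ratio).
    engine.setHardwareScalingLevel(1 / softenedDPR());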
I did call freeze() on materials, but the material/effect caches were trashed quite often and the bind() implementation is very expensive; it does quite a few hash lookups and indirections. A lot of our uniforms had to be updated every frame, so we ended up separating the materials from their parameters and indexing the latter with bitfields. Setting up a shader was just looping through a dirty bitfield and doing the minimum number of uniform uploads. This also made global parameters easy (a binary OR of the material and global parameter bitfields). There were only 3 arrays of contiguous memory to touch to fully set up a shader (values, locations, descriptors), and they could be reused between materials, so it was very CPU-cache friendly.
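A rough sketch of the shape of it (names and types are made up, not our actual code):

    // Three parallel, contiguous arrays fully describe a shader's parameters.
    const enum ParamType { Float, Vec3, Mat4 }

    function flushUniforms(gl: WebGLRenderingContext,
                           materialBits: number,
                           globalBits: number,
                           values: Float32Array[],
                           locations: WebGLUniformLocation[],
                           descriptors: ParamType[]): void {
      // Fold global parameters into the material's dirty bits with a single OR.
      let dirty = (materialBits | globalBits) >>> 0;
      while (dirty !== 0) {
        const i = 31 - Math.clz32(dirty);    // index of the highest dirty bit
        dirty = (dirty & ~(1 << i)) >>> 0;   // clear it
        switch (descriptors[i]) {
          case ParamType.Float: gl.uniform1fv(locations[i], values[i]); break;
          case ParamType.Vec3:  gl.uniform3fv(locations[i], values[i]); break;
          case ParamType.Mat4:  gl.uniformMatrix4fv(locations[i], false, values[i]); break;
        }
      }
    }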
Looking at the profiler, most of the lost performance came from the engine, not the scene.
Here's the string concatenation to set the active texture unit. (By the way, the fastest way to do it is "gl.TEXTURE0 + channel" instead of building a string to look up the proper constant.)
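Paraphrased (the slow version is the shape of what the engine did, not the exact line):

    declare const gl: WebGLRenderingContext;
    const channel = 3;

    // Slow: build the string "TEXTURE3", then look the constant up by name.
    gl.activeTexture((gl as any)["TEXTURE" + channel]);

    // Fast: TEXTURE0..TEXTURE31 are consecutive integer enums, so just add.
    gl.activeTexture(gl.TEXTURE0 + channel);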
As for the broken cache, I think it was Engine._activeTexturesCache; sometimes it's indexed by texture channel, other times by GL enum values (this makes the cache array explode to ~30k elements and causes cache misses in half the code paths).
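From memory, the mismatch looked roughly like this (paraphrased, not the actual source):

    declare const gl: WebGLRenderingContext;
    declare const texture: WebGLTexture;
    const channel = 7;
    const cache: (WebGLTexture | null)[] = [];

    cache[channel] = texture;               // path A: keyed by unit index (7)
    cache[gl.TEXTURE0 + channel] = texture; // path B: keyed by the GL enum
                                            // (33991), so the array balloons
                                            // past 30k slots and the two paths
                                            // never hit each other's entries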
From what I remember, lots of caches are needlessly trashed many times per frame.
There's also noticeable overhead from all of those "private static <constant> = value;" fields exposed through public getters.
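The pattern, simplified:

    class Material {
      private static _TriangleFillMode = 0;

      // Every read of Material.TriangleFillMode goes through an accessor
      // call instead of a plain constant lookup.
      public static get TriangleFillMode(): number {
        return Material._TriangleFillMode;
      }
    }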
You won't see it in the code. Run it through the debugger; the value of "channel" is sometimes the value of the GL enum rather than the index of the texture unit.