This was my question as well. How well do these systems perform on a stream of data?

I am not an expert in this field: how does tracking actually work along the time dimension? There must be some sort of "state" carried over from frame to frame? What is the "size" of this state? Surely objects should not just disappear and reappear for a few frames? You can often see this latter effect in many of the automatic labeling demos you find on GitHub.

