I've rewritten almost all of it now. One important note about our use case: our product is a library that application developers add to their applications. This means we do not control the application, so we have to play very nice. Additionally, any dependencies we include (such as an analytics library) reflect on us. This is also why it is hard to get apps re-released when our code changes: users punish app updates that have no visible functional differences.
Here is a list of things that mattered:
* The library needs to have a posture on how it is used by multiple different components in the same app. For example it can intend there to be one canonical source/package, or each component could make a private fork.
* If the canonical package is chosen then it must work with concurrent but different reporting ids and settings. (For example Google screw this up by having the tracker be a singleton.) It also needs to be possible for tools to find out the version number so they can complain about it being out of date.
* For the private fork posture it is easiest if the code is all in one file (use nested classes). Each fork should use a different SQLite database name so that forks don't clash with each other.
* The library will have "slow" work that needs to be done. This includes updating the SQLite database with new events, clearing out events that are too old on startup, and sending event batches to the server. I updated the Mixpanel code so that it returns Runnables for that work, and then my library can run them on its existing slow work threads. Most users, however, will want the library itself to work out where to run the slow work.
* I deleted the code that reads the unique device id (aka UDID). Some companies are happy grabbing that, but our privacy policy is far stronger. We generate a random unique id string on first run instead. Even having the device id code present but unused is enough to set off binary analyzers.
* You'll want to grab some other stuff by default (e.g. carrier information, device model, OS version).
* Make debugging easy. For example the logcat mechanism on Android works nicely. Mixpanel was only logging the name of the API call, not any detail. For example it would say "track" instead of "track: clicked" (where "clicked" is the event type).
* Sessions are what matter most. Mixpanel has no concept of sessions. For example when they purge unsent old events, they just delete the ones older than the time frame (which was hard coded as 48 hours). This means it could end up deleting the first half of a session but transmitting the rest. A better approach is to have a session id that is updated on each start, then delete all events belonging to old session ids.
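To make the session point concrete, here is a minimal sketch of session-based purging. All the names (`SessionEventStore`, `startSession`, `purgeOldSessions`) are made up for illustration, and an in-memory list stands in for the SQLite table. I'm also assuming that "old" means the session started longer ago than the time frame; in a real library this would be a DELETE keyed on a session id column.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: an in-memory stand-in for the SQLite events table.
final class SessionEventStore {
    static final class Event {
        final String sessionId;
        final String type;

        Event(String sessionId, String type) {
            this.sessionId = sessionId;
            this.type = type;
        }
    }

    private final List<Event> events = new ArrayList<>();
    private final Map<String, Long> sessionStartMillis = new HashMap<>();
    private String currentSessionId;

    // Call once per app start: a fresh random id marks the new session.
    String startSession(long nowMillis) {
        currentSessionId = UUID.randomUUID().toString();
        sessionStartMillis.put(currentSessionId, nowMillis);
        return currentSessionId;
    }

    // Every event is tagged with the session it was recorded in.
    void track(String type) {
        if (currentSessionId == null) {
            throw new IllegalStateException("startSession must be called first");
        }
        events.add(new Event(currentSessionId, type));
    }

    // Removes whole sessions that started more than maxAgeMillis ago.
    // The current session is never purged. Returns the number of events removed.
    int purgeOldSessions(long nowMillis, long maxAgeMillis) {
        int before = events.size();
        events.removeIf(e ->
                !e.sessionId.equals(currentSessionId)
                        && nowMillis - sessionStartMillis.get(e.sessionId) > maxAgeMillis);
        sessionStartMillis.entrySet().removeIf(s ->
                !s.getKey().equals(currentSessionId)
                        && nowMillis - s.getValue() > maxAgeMillis);
        return before - events.size();
    }

    // Snapshot of the events still waiting to be sent.
    List<Event> pending() {
        return new ArrayList<>(events);
    }
}
```

Because the age test is applied to the session rather than to individual events, a session is either sent whole or dropped whole.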
In terms of implementation details, a comparison of Google Analytics to Mixpanel is useful.
Google only has one tracker instance, although you wouldn't know that from the API, so using it from multiple places silently doesn't work. They have an extremely complicated custom variable scheme for adding extra data for each event. Ultimately their database stores a query string for each event. If there are 10 to send then they make 10 separate GET requests.
Mixpanel supports multiple instances, but almost everything was hard coded (e.g. dispatch intervals, expiry of old data). You supply events with arbitrary JSON data, including a list of "super properties" which are added to every event. This is a very good approach. The database stores the events. When submitting, a POST request is generated with a batch of events (up to 50; again this was hard coded, as two different numbers in two different places).
If you use query strings (in the sense of a GET) then you risk the data being logged by proxy servers, you can hit URI length limits, and you cannot batch.
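As a rough sketch of the batching side, something like the following merges the super properties into each event and splits the pending events into batches, with the batch size set in exactly one place. The names (`EventBatcher`, `withSuperProperties`, `batches`) are hypothetical, and letting the event's own properties win on a key clash is my assumption, not something Mixpanel documents here. Serializing each batch to JSON and doing the POST is left to whatever the app already uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: groups pending events into batches for one POST each.
final class EventBatcher {
    private final int maxBatchSize;
    private final Map<String, Object> superProperties;

    EventBatcher(int maxBatchSize, Map<String, Object> superProperties) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be at least 1");
        }
        this.maxBatchSize = maxBatchSize;
        this.superProperties = new LinkedHashMap<>(superProperties);
    }

    // Super properties go in first, so the event's own values win on a clash
    // (an assumption made for this sketch).
    Map<String, Object> withSuperProperties(Map<String, Object> event) {
        Map<String, Object> merged = new LinkedHashMap<>(superProperties);
        merged.putAll(event);
        return merged;
    }

    // Splits the pending events into lists of at most maxBatchSize events,
    // each of which would become the body of a single POST request.
    List<List<Map<String, Object>>> batches(List<Map<String, Object>> pending) {
        List<List<Map<String, Object>>> out = new ArrayList<>();
        for (int start = 0; start < pending.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, pending.size());
            List<Map<String, Object>> batch = new ArrayList<>();
            for (Map<String, Object> event : pending.subList(start, end)) {
                batch.add(withSuperProperties(event));
            }
            out.add(batch);
        }
        return out;
    }
}
```

With a batch size of 50, 120 pending events become three requests (50, 50 and 20 events) instead of 120 separate GETs.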