All this engineering talent and effort to track dwell times and clicks for ads. At some point, I think we need to take a critical look at the ad economy and whether continued investments actually help anybody, or rather just give certain parties competitive advantages in selling ads. Smart regulations can perhaps make ads "dumber" and drive a lot of the engineering away, as well as being a boon to privacy, while still leaving sites with a way to generate revenue from content.
Twitter very clearly thinks I'm a doctor. I am not a doctor. They show me lots of ads that are obviously oriented towards medical professionals–many of them include the language "your patients" as well as medical jargon. My wife is not a doctor, my parents aren't doctors, the only doctors I know are casual friends and family I see at most once a year. I do not have any medical conditions, and nobody I know has any of the conditions treated by the products being advertised.
I've been using Twitter at least a few times a week for several years, I follow and engage with lots of other users, none of whom are doctors as far as I know. If the ads I'm seeing are derived from this complex personality and interest profiling, it has totally misfired. This has been the case for at least two or three years.
Several years ago a conversation about a similar topic prompted me to look at the ad targeting data Facebook had on me. At the time I'd had a Facebook account for 12 years with lots of posts, group memberships and ~500 friends. Their cutting edge data collection and complex ad targeting algorithms had identified my "Hobbies and activities" as: "Mosquito", "Hobby", "Leaf" and "Species": https://imgur.com/nWCWn63. Whatever that means.
I've managed millions of dollars in ad spend on these platforms over time, and still regard most targeted ad platforms as dancing on the edge between legitimacy and being blatantly fraudulent. They work well if you're a sophisticated buyer, but if you're not they're pretty much a hole in the internet into which you can pour money.
Also, most people in tech are outliers of some sort, with unusual search histories on these platforms. Simply put, we are not the norm. We aren't 'norms'.
While I'm sure you're correct that the accuracy of such tracking platforms is not 100%, I suspect that for those who do not understand technology, tracking is much better and more accurate.
My position, for clarity: "being able to use a phone" is not "understanding technology". Some seem to think they are tech-aware because they can navigate a phone's OS, or use a computer for work by using Word.
These people are likely tracked more accurately; their lack of understanding, combined with their heavy usage of computing devices, makes them most susceptible to tracking.
* It being the Advertising "Industry"
Maybe fintech could offer similar challenges in some areas? I'm thinking HFT etc.
I greatly enjoyed the challenge of building data pipelines in my role at $LAST_COMPANY, an SSP: delivering valid transactional and related-entity data to a near-realtime reporting system, and scaling up and down as needed to maintain data timeliness while minimising costs (ad traffic, like internet traffic as a whole, has high seasonality throughout the day).
But I didn't enjoy working for a company in the ad-tech market - far too many deals that, while legal, felt sleazy, and (from my limited experience) often seemed to begin with handshakes made by sales reps on booze- and cocaine-fueled nights in red light districts.
If I could solve similar business problems without being in a market that makes me want to have a long shower, I'd be keen.
When all you care about is the next quarterly report, it's real easy to sell the idea of getting rid of all that expensive technical competence from the company payroll, letting someone else know that stuff instead, then hiring a team of cheap oompa loompas straight outta college to spend all day writing -X-M-L- YAML.
> Google also makes $1B equity investment in CME Group
Wonder if anyone is running large NATS Jetstream/Liftbridge or Pulsar (yahoo runs those) clusters. I guess Pulsar might be #2 in terms of adoption at large scale?
This architecture certainly exists, but is a lot more burdensome and less frequent than partitioning by customer id across a Kafka topic.
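For anyone unfamiliar with the pattern, "partitioning by customer id" usually just means keying each record by that id, so the default partitioner hashes the key onto a stable partition and all of one customer's events stay in order. A minimal sketch with the plain Java client (broker address, topic name and ids are made up):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class CustomerKeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String customerId = "customer-42";          // hypothetical id
                String event = "{\"type\":\"page_view\"}";  // hypothetical payload

                // Keying by customer id means the default partitioner hashes the key,
                // so every event for this customer lands on the same partition and is
                // consumed in order relative to that customer's other events.
                producer.send(new ProducerRecord<>("events", customerId, event));
            }
        }
    }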
I'm very hostile to a lot of hipster tech but Kafka is one of the few genuinely good pieces of tech from the whole "Big Data" craze of the past decade.
AFAIK LINE has not had such success. I wouldn't be surprised if most people in the US did not know of LINE; unless they were avid readers of TechCrunch or something, it just doesn't come up.
Would be interesting to know what % of people on HN know about LINE though
I wouldn't be surprised if it was perfectly fine though -- with compression (and all the video/image specific tricks) the file sizes should get pretty small...
The link is a great reference by the way.
> Your API should use cloud storage (for example, AWS S3) and simply push a reference to S3 to Kafka or any other message broker.
This is more or less what I figured. We already archive to S3 anyways so switching to using it as transport would be straightforward.
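For anyone following along, the suggestion (sometimes called the claim-check pattern) is: put the heavy bytes in object storage and push only a pointer through Kafka, so consumers fetch the payload themselves. A rough sketch with the AWS SDK v2 and the plain Kafka client; the bucket, topic and object key are all made up, and it assumes default AWS credentials/region config:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class ClaimCheckProducer {
        public static void main(String[] args) {
            String bucket = "video-segments";               // hypothetical bucket
            String objectKey = "segments/stream-1/000042.ts"; // hypothetical segment
            byte[] segmentBytes = new byte[0];              // stand-in for the real payload

            // 1. Put the heavy payload in object storage.
            try (S3Client s3 = S3Client.create()) {
                s3.putObject(
                    PutObjectRequest.builder().bucket(bucket).key(objectKey).build(),
                    RequestBody.fromBytes(segmentBytes));
            }

            // 2. Push only the reference through Kafka; consumers fetch from S3.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("segment-refs", objectKey,
                    "s3://" + bucket + "/" + objectKey));
            }
        }
    }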
I'm by no means a Kafka expert or a video expert of course, but glad I could serve as a rubber duck. Maybe there are some lessons to be learned from Encore?
> The link is a great reference by the way.
Yeah the amount of info in there is pretty good -- feels like Kafka could definitely be tuned to do the job but maybe it's better to just start with something better attuned.
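For reference, the "tuning" in question mostly comes down to a few size and batching knobs that have to be raised together across producer, broker (message.max.bytes, replica.fetch.max.bytes) and consumer (max.partition.fetch.bytes, fetch.max.bytes). A producer-side sketch, where the values are illustrative rather than recommendations:

    import java.util.Properties;

    public class LargeMessageProducerConfig {
        static Properties producerProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            // Allow ~10 MB requests (producer default is ~1 MB). The broker- and
            // consumer-side limits mentioned above have to be raised in step,
            // or large records are rejected or stall the consumer.
            props.put("max.request.size", "10485760");
            props.put("compression.type", "zstd"); // frames may not compress much, metadata does
            props.put("batch.size", "1048576");    // larger batches amortise per-request overhead
            props.put("linger.ms", "20");          // trade a little latency for better batching
            return props;
        }
    }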
> This is more or less what I figured. We already archive to S3 anyways so switching to using it as transport would be straightforward.
Yeah, I figured this is what you were trying to avoid -- the round trips to S3 to get the data to the processing would be wasteful if the data in this case is small enough to flow along the processing route. Guess it really depends on your data. I could have sworn I saw some analysis of how Kafka performs versus the size of the messages it must deliver...
Looks like DZone has some good content, and LinkedIn of course... Ah, I finally found the one I was looking for, and it's DZone. All those links make mention of message size.
If it’s that your company signed a sponsorship with Google, I understand. But then why not replace Kafka with Google-managed services altogether? Are there things that Kafka does that PubSub doesn’t do that you really need (such as unlimited message retention)?
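(On the retention point: keeping messages indefinitely is just a topic-level config in Kafka. A minimal sketch with the admin client; the topic name and partition/replica counts are made up:)

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class UnlimitedRetentionTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            try (AdminClient admin = AdminClient.create(props)) {
                // retention.ms=-1 disables time-based deletion, so the topic keeps
                // messages until any size-based limit (if configured) kicks in.
                NewTopic topic = new NewTopic("events", 12, (short) 3)
                        .configs(Map.of("retention.ms", "-1"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }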
For example, a look at this new architecture from the end-user's perspective, compared to 2010 Twitter, would be interesting. I'm sure much of the technology is needed for monetization, but it would be a fascinating look anyway.
Amazon's solutions can be fit into various architectures, but are more generalist tools than the BigQuery, DataFlow and BigTable combination.
Google's solutions are also cheaper and/or easier to work with, for very large data processing.
Btw, Bunny is also $10 flat rate for unlimited image processing... and they only bill the post-processed bandwidth cost. The price of my AWS equivalent for that pipeline was quite the comparison. :)
Is Bunny Optimizer (bunny.net) what you are referring to?
Plus, as you found, $10 for unlimited image processing :)
I look forward to the day they have a key value store along with functions at edge. I expect their pricing model to be simply irresistible.
> but are more generalist tools than the BigQuery, DataFlow and BigTable combination
Doesn't AWS provide similar tools?
BigQuery == Redshift
DataFlow == Kinesis (Analytics/Flink/EMR-Spark/EMR-BEAM)
BigTable == DynamoDB (or HBase on EMR)
"Kinesis (Analytics/Flink/EMR-Spark/EMR-BEAM)" - this is kinda the point. In AMZ you're building out the dataflow process that fits your use-case and optimizes where you would like, rather than submitting to an existing (Google) design that is optimized+priced for very large dataset.
I'm surprised that they mentioned Twitter EventBus; I thought they were migrating away from that to Apache Kafka entirely. Mind you, they've got a lot of tech going on, so it's not surprising if it's still present in legacy systems.
Fun fact: if you dig into the architecture of Twitter EventBus, it will look awfully similar to the architecture of Apache Pulsar (storage separated from brokers, storage based on BookKeeper), and that's no coincidence, as Sijie Guo, CEO of StreamNative, developed EventBus at Twitter (and was also a main dev of BookKeeper). StreamNative is to Pulsar as Confluent is to Kafka.
And the reasons that Twitter moved from EventBus to Kafka also apply to Pulsar, which is worth keeping in mind the next time an HN commenter proclaims "Le roi (Kafka) est mort, vive le roi! (Pulsar)".
I think somewhere in there is a link to a story about how only one popular user (was it Ashton Kutcher?) could tweet at a time. I seem to recall it ran on a single MySQL server for quite a while too.
Also, I wonder why they went with GCP instead of AWS. Does Twitter have a deal with Google that I'm not aware of?
It is 463k eps for anyone wondering.
Sanity check: say 10^5 seconds/day. Then it would be
400 * 10^9 / 10^5 = 4 * 10^6
An options order consists of more data than an average tweet, and we can certainly process them at a higher rate than Twitter would need in practice. Many financial exchanges operate on a single thread, handling 1-100 million transactions per second with jitter measured in tens of microseconds. I don't see why other software products & services can't leverage similar concepts.
I'm sure there are other excuses available, but this is not a good one. Non-repudiation is a critical system requirement of any financial exchange. The figures stated include persistence to durable media, and the whole point of running everything on a single thread is to ensure serialized processing of all activities (i.e. consistency).
I would argue the need for replicating tweets is less urgent than ensuring 7-8 figure financial transactions don't go unaccounted for. We could probably make some compromises for the twitter use case to make this even faster.
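To make the single-thread point concrete, here's a toy sketch of the single-writer idea: one thread owns all exchange state, so every order is applied in strict serial order with no locks on the hot path. It deliberately leaves out what makes real exchanges fast and durable (preallocated ring buffers instead of a BlockingQueue, journalling to durable media before acknowledging), which are marked as hypothetical placeholders:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class SingleThreadedMatcher {
        record Order(long id, String symbol, long price, long qty) {}

        private final BlockingQueue<Order> inbound = new ArrayBlockingQueue<>(1 << 16);

        void submit(Order o) throws InterruptedException {
            inbound.put(o); // producers block when the queue is full (backpressure)
        }

        // The only thread that ever touches the order book; processing is
        // therefore serialized by construction, which gives consistency.
        void runMatchingLoop() throws InterruptedException {
            while (true) {
                Order o = inbound.take();
                // journal(o); // hypothetical: persist before applying, for non-repudiation
                // match(o);   // hypothetical: mutate the book; no other thread touches it
            }
        }
    }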
But that wouldn't be anywhere near as cool as building "cloud infra", right?
I would have expected one of the big Internet websites to use better technology.
Then it's a message bus built on top of TCP. Anyone with a basic understanding of networking can see that if you have a producer that wants to send the same data to multiple consumers efficiently, you should use multicast.
Kafka also lacks proper mechanisms to throttle the speed of producers when consumers are too slow, which is the first thing you should ever be concerned about whenever you introduce a queue.
If you want something somewhat decent within Javaland you could try Aeron.
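To make the multicast point above concrete, this is roughly what one-to-many delivery looks like with plain IP multicast in Java; the group address and port are made up. Note that raw multicast gives no delivery guarantees (market-data feeds layer sequence numbers and retransmission on top), and it generally isn't available in public clouds:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.net.MulticastSocket;
    import java.net.NetworkInterface;
    import java.nio.charset.StandardCharsets;

    public class MulticastSketch {
        static final String GROUP = "239.1.2.3"; // hypothetical site-local group
        static final int PORT = 4446;            // hypothetical port

        // One send; the network fans the packet out to every joined receiver.
        static void publish(String msg) throws Exception {
            try (DatagramSocket socket = new DatagramSocket()) {
                byte[] data = msg.getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(data, data.length,
                        InetAddress.getByName(GROUP), PORT));
            }
        }

        // Each consumer joins the group and receives its own copy of that one send.
        static void subscribe() throws Exception {
            try (MulticastSocket socket = new MulticastSocket(PORT)) {
                NetworkInterface nif =
                        NetworkInterface.getByInetAddress(InetAddress.getLocalHost());
                socket.joinGroup(new InetSocketAddress(InetAddress.getByName(GROUP), PORT), nif);
                byte[] buf = new byte[1500];
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                System.out.println(new String(packet.getData(), 0, packet.getLength(),
                        StandardCharsets.UTF_8));
            }
        }
    }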
I don't think you know the basics of Kafka's API or how it works internally.
You completely missed the point. It's the producer that needs to be paused, and the reason for that is that memory is not infinite. You cannot just keep buffering until the consumer catches up, because it may never catch up.
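For what it's worth, the producer-side knobs that do exist in Kafka are client-local: a bounded in-memory buffer that makes send() block (and eventually fail) rather than grow without limit. Whether that counts as real end-to-end backpressure is exactly the disagreement here, since the broker persists to disk and a slow consumer just falls behind in the log; nothing pushes back from consumer to producer. A sketch, with the broker address and topic made up:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class BoundedBufferProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            // Client-local backpressure only: when the 32 MB unsent buffer fills
            // (e.g. the broker is slow or unreachable), send() blocks for up to
            // max.block.ms and then throws, instead of buffering without bound.
            props.put("buffer.memory", "33554432");
            props.put("max.block.ms", "5000");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key", "value"));
            }
            // Note: none of this reacts to a slow *consumer*; consumers simply fall
            // behind in the log and the broker keeps data until retention expires.
        }
    }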
It's not "fine".
There is a Signals and Threads podcast episode that goes a bit into the history of it.
If anything, in HFT environments people use L2 switches for latency reasons. Those operate at the Ethernet level, so they don't really care about IP at all.
Anyway I don't see why that's specific to electronic trading or even just a low-latency concern. Sending the same traffic to hundreds of people with unicast means using hundreds of times the bandwidth, which is a huge problem.
But there are a lot of professional software devs with zero real networking experience. Sure, they may understand the textbook definition of TCP, or they may even have seen pictures of fibre with labels like 'this is how far light travels in a nanosecond'. But they would have no idea how to calculate the serialization latency of a 10G link, let alone know the duty cycle required to saturate one.
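For anyone who wants the back-of-the-envelope version: serialization latency is just bits on the wire divided by the line rate. A quick worked example for a full-size Ethernet frame on a 10G link, with framing overhead per 802.3:

    public class SerializationLatency {
        public static void main(String[] args) {
            double linkBitsPerSecond = 10e9;  // 10GBASE-x line rate
            int payloadBytes = 1500;          // max standard Ethernet payload
            int overheadBytes = 38;           // header+FCS (18) + preamble/SFD (8) + min IFG (12)
            double frameBits = (payloadBytes + overheadBytes) * 8.0;

            double serializationSeconds = frameBits / linkBitsPerSecond;
            // ~1.23 us per full-size frame on the wire.
            System.out.printf("per-frame serialization: %.3f us%n", serializationSeconds * 1e6);
            // To saturate the link you need a full-size frame roughly every 1.23 us,
            // i.e. about 812k frames per second.
            System.out.printf("frames/s to saturate: %.0f%n", linkBitsPerSecond / frameBits);
        }
    }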
But none of that matters in the cloud (which is where Twitter is stacking their Jenga tower in the original blog post). Even if both Google and AWS have custom silicon (or at least FPGAs) doing hardware offload for their internal SDN encapsulation protocols at the server level, and even if their custom switches all support it, it doesn't matter: they hide all of that from you, the customer, and rarely even acknowledge its existence.
Tell that to the Netty folks.
I fear that technologists (including myself) are fascinated by the exotic solutions required by extreme centralization, and are more than happy to solve those rather than question the need for them in the first place.
I'm further suggesting that there is a relationship between scale and ethics, via moral hazard, that is worth exploring. Too bad I'm getting downvoted instead of engaged with.