One of the easier parts of this involves addressing, which in the UK is notoriously easy, reliable, and easy to process - especially the best-in-class Ordnance Survey stuff like AddressBase Premium, right?
A quick trawl of Github will shed some light on it - especially how much of a pain it is to get ABP into a usable state - and this is data that's core and integral to the service, the "are you a real user, a typo, a fraudster, a data supply issue, or getting things wrong in good or bad faith?" kind of business logic.
And it's doubly hard, because the government requires people to update their license when they change address - which often enough involves a new-build property, where the address (let alone UPRN - sometimes even the USRN!) is completely new to you.
Thinking bigger: imagine sitting at your desk during the first couple of weeks on the job, database validation checks running merrily in the background while you're staring at a screen. There's a mild frown forming on your face. You'd been scrolling over a list of rejected records in front of you, largely looking good - _how did they miss THAT fraud _ you'd briefly chuckled to yourself - but _this_ one...
It's a valid business entity, trading from the valid address, and you've hand-checked both _and_ got a junior who lives nearby to send you a photo of it, and, well, the wit running the business has decided to trade under the name _FUCKOFFEE_, and... that's... just going to have to be someone else's problem, you shrug.
(to be clear: the hard part of the DVLA project is _not_ implementing the coding, database, and systems design work)
You've sort of identified how to do it: break it up into problems.
Addresses are hard? Use https://postcodes.io or make your own - that's a project in its own right.
Separately out trading name from registered names needs to be an API from Companies House, or an internal service that API-ifies Companies House data.
Fraud detection? That needs to sit somewhere - let's break out all the fraud detection into a separate system that can talk to the other systems, and have it running continuously over the data. It'll need people to update fraud queries and also to make sure the other systems' data stays integrated with it.
Finally you need something on top that orchestrates the services and exposes them via a gov.uk website, and copes with things like "I don't have my address yet; can I use What3Words instead?" and another one with a UI and lots of RBAC and approvals for DVLA users to do lookups and internal admin.
Heh, you’ve fallen into the exact trap I was trying to expose, which is why I chose addresses as an illustration point :)
The first step with anything address-y is to try and nail down exactly what an address is in the project context. Quick example - property shells, a building at 1-2 Street Name that contains a bunch of flats, but doesn’t itself have residents or its own postal delivery point. They’re mega useful for an address autocomplete (sadly, the vast majority of geocoders are trash for the uk’s addresses), are they sth people should be able to use (without a flat number etc.) for their driving license? Probably not. Commercial venues? Maybe, what about pubs? Ok, so dual-use maybe, but man this stuff gets painful in a hurry.
Next up - historic addresses and how’re you going to link ‘em all together. It’s nasty, edge-case-strewn work - and for the most part, unavoidably so. It’s why people get their backs up when someone dismisses it out of turn, cos if they have worked with it in the past, they’d qualify anything they wrote with: * presuming a well-formed address source + pipeline.
Edit: for what it’s worth, companies house only lists corporate entities and partnerships as defined in whatever act of parliament. Self employed etc can call themselves whatever - and do! - and the only record of it can be as vague as a nondescript line from the VOA.
I like this trap. Why would you need historic addresses for this service? In my mind the main reason for the DVLA knowing your address at all is so they know where to post a fine and a new driving licence/car documentation to. Why do they need historic addresses in their core system?
>Why would you need historic addresses for this service?
The police (and other authorities like councils) who issue penalties, need to know who was the registered keeper of a vehicle on the date of an alleged offence.
That's where the DVLA's Keeper At Date Of Event (KADOE) system comes in.
It's currently being transitioned to a modern API:
Now you've got two addresses to handle - the vehicle's keeper and the licence holder.
Further, the DVLA isn't sending correspondence relating to criminal matters, that's coming from the police who use the Police National Computer, into which the driver and vehicle files are fed along with data from the motor insurers bureau.
I’m happy you’re taking it in the spirit intended :) it’s a trap I frankly despise but that’s cos I’m old and bitter.
The problem being addressed - if you’ll forgive the pun - is that you’re not storing someone’s current address; what you have is their _most recently known to us_ address, which obvs over time can become a problem, least of all if you’re wasting time and money sending undeliverable post. (I have a vague memory of Royal Mail fining bulk delivery users for not pre-screening, not sure if that was a particularly dull dream or not tho).
The thing it’s important to keep in mind is - there is no single nor centrally-held repository of addresses within the UK. I don’t mean about oh mr so-and-so lives at 11 acacia avenue. I mean for just the addresses themselves.
Throw in the mad mixture of Scotland having a separate national statistics agency that’s independent of the ONS, plus Northern Ireland having the same -plus- a separate OS in the form of OSNI, the whole landscape’s set up for pain and failure.
I use MapLibre Native to render static PNGs and JPEGs from vector MBTiles and mbstyle json files.
The nuts and bolts of it boil down to writing a short program in C++ that accepts a few arguments and makes the relevant library calls, and is then called as an external program from my sorbet code. I guess you could PInvoke it but I’ve never got my head around that part of .Net :)
It’s cross-platformish _enough_ for me, as I dev on macos and host on Linux. Can’t think of any reason it wouldn’t compile on windows though.
The flexibility is rather good, and I created a DSL without realising that I was doing so by supplying one program variation with a list of map types, dimensions, starting coordinates + zoom levels etc, which it can iterate over to produce a bunch of static map images.
Highly recommend it, the initial learning curve was steep - for me, at least - but is all one-time learning stuff that you don’t to relearn later on.
It's basically a method to help you find a downward slope, right?
In Britain, until surprisingly recent times, water engineers had been known to use dowsing rods to help find water pipes. As far as I can tell, it's a... method... that helped them to stop overthinking the problem, and subconsciously look at where the the pipe's likely to be, given the many small clues indicating previous engineering work.
You can get performance benefits from using natural keys, as many databases store rows in the order of a table's primary key (sometimes called the clustered index, though it may or may not have a unique constraint requirement depending on the DBMS and a few other bits).
In the author's example, if the first column in the natural index was the city name (or city ID!), and locations are often pulled from the database by city, you'll see a read time performance benefit because each cityName's restaurants will be stored together.
This is why UUID-based systems can suffer worse read + write performance; their rows will be stored in the order of its UUIDs (that is, randomly spread around), making read, insert, and update performance lower.
What to do? I favour a mixed approach: have a unique integer ID column used internally, expose a unique UUID to the public where necessary and - with a BIG DO NOT OPTIMIZE PREMATURELY warning - really think about how the data is going to be queried, updated, and inserted. Ask - does it makes sense to create a clustered index based on the table data? Is there enough data to make it worthwhile? Where can the index fail if changes need to be made? Under some circumstances, it might even make sense to use a natural key with the integer column included right at the end!
The only hard rule I have is using UUIDs for clustered indexes. Unless the tables are teeny-tiny, the system is most likely suffering without anyone being aware of it.
Nope, but it does have a Visibility Map due to its MVCC implementation. Go ahead and do an index-only lookup on a table using a UUIDv4. Then repeat with something k-sortable. See how many buffer hits each one has.
I assure you, those add up. Nothing like explaining to a dev that they have hit the theoretical memory transfer limit of the hardware, and it cannot possibly go any faster unless they refactor their schema (or a faster DDR version comes out).
>> many databases store rows in the order of a table's primary key (sometimes called the clustered index [...])
>Note for readers: Postgres doesn't do that
And I am climbing out of the rabbit hole that is Postgres' CLUSTER keyword :) Really funky, looks like it's a one-time operation that isn't persisted?
Looks like Postgres stores its index entries in ascending order by default (NULLs at the end), and if so, the point's worth keeping in mind on that front too.
I really need to do a deep dive on Postgres internals one of these days, it's an amazing - and different! - system its developers have created.
> And I am climbing out of the rabbit hole that is Postgres' CLUSTER keyword :) Really funky, looks like it's a one-time operation that isn't persisted?
Correct. If you're in a situation that needs it, you'll need to run it at some interval. I have it running once a week on one of our databases.
Also, postgres uses this to solve a problem that clustered indexes on other databases don't solve - a clustered index isn't necessarily on an index you care about for querying. It still has to go back to the table data to get fields not in that index, which could have a random-access performance penalty if you're getting a lot of rows. Postgres's CLUSTER changes the order of the table data to match an index, which allows a sequential read on the disk, avoiding that performance penalty.
This is one of the situations where people complain the postgres query planner is doing the wrong thing and they want hints to tell it to use the index, but the query planner is actually protecting them from a huge random-disk-access performance penalty they don't realize they'd hit if they did that. The right thing to do is either use CLUSTER or, if they're on an SSD or the data can all fit in memory, change the random-access-penalty config value so the query planner no longer takes this into account.
> This is why UUID-based systems can suffer worse read + write performance; their rows will be stored in the order of its UUIDs (that is, randomly spread around), making read, insert, and update performance lower.
This isn't always the case anymore. UUID standards have been developed in that mind, and they are not completely random anymore. UUIDs can give a hint, for example, about the time when it was created, which gives them some order.
Any suggestions for geolocating datacenter IPs, even very roughly? I'm analysing traceroute data, and while I have known start and end locations, it's the bit in the middle I'm interested in.
I can infer certain details from airport codes in node hostnames, for example.
It would also be possible - I guess - to infer locations based on average RTT times, presuming a given node's not having a bad day.
Anyone have any other ideas?
Edit: A couple of troublesome example IPs are 193.142.125.129, 129.250.6.113, and 129.250.3.250. They come up in a UK traceroute - and I believe they're in London - but geolocate all over the world.
Those IPs are owned by Google and NTT, who both run large international networks and can redeploy their IPs around the world when they feel like it. So lookup based geolocation is going to be iffy, as you've seen.
Traceroute to those IPs certainly looks like the networking goes to London.
The google IP doesn't respond to ping, but the NTT/Verio ones do. I'd bet if you ping from London based hosting, you'll get single digit ms ping responses, which sets an upper bound on the distance from London. Ping from other hosting in the country and across the channel, and you can confirm the lowest ping you can get is from London hosting, and there you go. It could also be that its connectivity is through London, but it's elsewhere --- you can't really tell.
Check from other vantage points, just to make sure it's not anycast; if you ping 8.8.8.8 from most networks around the world, you'll get something nearby; but these IPs give traceroutes to london from the Seattle area, so probably not anycast (at least at the moment, things can change).
If you don't have hosting around the world, search for public looking glasses at well connected network that you can use for pings like this from time to time.
"TULIP's purpose is to geolocate a specified target host (identified by IP name or address) using ping RTT delay measurements to the target from reference landmark hosts whose positions are well known (see map or table)."
> A couple of troublesome example IPs are 193.142.125.129, 129.250.6.113, and 129.250.3.250. They come up in a UK traceroute - and I believe they're in London - but geolocate all over the world.
If I'm running a popular app/web service, I would have my own AS number and I will have purchased a few blocks of IP addresses under this AS and then I would advertize these addresses from multiple owned/rented datacenters around the world.
These BGP advertisements would be to my different upstream Internet service providers (ISPs) in different locations.
For a given advertisement from a particular location, if you see a regional ISP as upstream, you can make an educated guess that this particular datacenter is in that region. If these are Tier 1 ISPs who provide direct connectivity around the world, then even that guess is not possible.
If you have ability to do traceroute from multiple probes sprinkled across the globe with known locations, then you could triangulate by looking at the fixed IPs of the intermediate router interfaces.
Even this is is defeated if I were to use a CDN like Cloudflare to advertise my IP blocks to their 200+ PoPs and ride their private networks across the globe to my datacenters.
> If you have ability to do traceroute from multiple probes sprinkled across the globe with known locations
Everyone who's aware of RIPE Atlas has that ability.
I have almost a billion RIPE Atlas credits. A single traceroute costs 60. I have enough credits to run several traceroutes on the entire IPv4 internet. (the smallest possible BGP announcement is /24, so max of 2^24 traceroutes, but in reality it's even less).
A battery-powered sky camera I put together using a raspberry pi pico that connects over wifi to send photos every hour. Ended up going crazy and making a custom PCB for it. Need to do more work on the battery life + camera exposure levels, but overall I'm pretty happy with it, especially how reliable it is given my minimal C skills.
The first group of views... pins down the disappearance of MID as the need to eliminate excessive expressions and redundant synonyms under MID's competition with WIÐ in the collocational phrases.
The second group of views ... establishes a push chain and a drag chain movement between the three prepositions: MID, WIÐ and AGAINST.
The third group... observes the cognitive advantage of spatial sense in the survival of prepositions and explains WIÐ's victory over MID...
And concludes its introduction with:
All three camps of opinions were confined to theoretical hypotheses lacking a quantitative approach, nor was due attention paid to the grand sociolinguistic backdrop of the English feudal society. Therefore, this paper aims to build on the previous research by introducing linguistic data based on quantitative methods, as well as connecting the change to the grander socioeconomic background of the medieval society.
While I enjoy protobufs, it's one of those things I feel I have to re-learn every time I write a new implementation. Never quite get to grips with them in the same way as, say, regex.
One of the easier parts of this involves addressing, which in the UK is notoriously easy, reliable, and easy to process - especially the best-in-class Ordnance Survey stuff like AddressBase Premium, right?
A quick trawl of Github will shed some light on it - especially how much of a pain it is to get ABP into a usable state - and this is data that's core and integral to the service, the "are you a real user, a typo, a fraudster, a data supply issue, or getting things wrong in good or bad faith?" kind of business logic.
And it's doubly hard, because the government requires people to update their license when they change address - which often enough involves a new-build property, where the address (let alone UPRN - sometimes even the USRN!) is completely new to you.
Thinking bigger: imagine sitting at your desk during the first couple of weeks on the job, database validation checks running merrily in the background while you're staring at a screen. There's a mild frown forming on your face. You'd been scrolling over a list of rejected records in front of you, largely looking good - _how did they miss THAT fraud _ you'd briefly chuckled to yourself - but _this_ one...
It's a valid business entity, trading from the valid address, and you've hand-checked both _and_ got a junior who lives nearby to send you a photo of it, and, well, the wit running the business has decided to trade under the name _FUCKOFFEE_, and... that's... just going to have to be someone else's problem, you shrug.
(to be clear: the hard part of the DVLA project is _not_ implementing the coding, database, and systems design work)
reply