
If I'm understanding correctly, it sounds like the plan is to use heuristics and machine learning to guess which URLs look "tricky".

I'm highly skeptical of an approach that involves training users to rely on a black-box ML system. That just makes them ever more dependent on technology they can't possibly understand and puts more power in Google's hands. By being the sole arbiter of what is "tricky," Google gets to blacklist the entire Internet.

It would be better to help users understand the URL. I don't mean expecting users to parse the syntax on sight; I mean finding ways to display or represent it so that the important information is easier to see and fraud is easier to spot.


Along the lines of your last thought (helping users understand the URI by displaying it in an idiot-proof manner), this wouldn't be hard at all if they simply had 4 separate areas: protocol, subdomain (if any), domain & public suffix combined, and path. E.g.:

    https://www.example.net/foo.html
becomes:

    [https] [www] [example.net] [foo.html]
Then they can colour the domain & public suffix box e.g. black and the rest light grey, much like they already do, BUT it's also clear which box you always need to look at to determine the site's identity.

It could go even further and obscure the contents of the first, second, and fourth boxes until you mouse over or focus them (though all of the boxes should have a light red background for http and a light green one for EV, even if you can't see the text in them), and the last box should sit far from the one before it, to avoid e.g.:

    [https] [www.example.net] [example.org] [foo.html]
    [https] [www] [example.org] [www.example.net/foo.html]
(With both of the above, it would be easy to accidentally think you were somewhere at example.net, even though you're really somewhere at example.org.)

Clicking on any box (or the regular Ctrl+L) could turn it back into one box (for easy URI copying), and defocusing would revert it again. Power users could set a knob to simply always display the one bar they've been looking at for the last 25+ years.

Maybe there could even be a conditional 5th area for the query parameters (GET variables), not shown at all by default (without input-area focus), who knows:

    [https] [news] [ycombinator.com] [reply] [id=19032043&goto=item%3Fid%3D19031237%2319032043]
Just my wild 4am ideas... probably lots of things wrong with it I can't imagine right now.
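
Here's a rough sketch of the splitting logic in Python, just to show it's feasible. The tiny SUFFIXES set is a toy stand-in for the real Public Suffix List, which is what a browser would actually need to find the registrable domain:

    from urllib.parse import urlsplit

    SUFFIXES = {"com", "net", "org", "co.uk"}  # toy stand-in for the Public Suffix List

    def four_boxes(url):
        parts = urlsplit(url)
        labels = parts.hostname.split(".")
        # Find the longest known public suffix, then take one more label
        # to form the registrable domain (e.g. "example.net").
        suffix_start = len(labels) - 1
        for i in range(len(labels)):
            if ".".join(labels[i:]) in SUFFIXES:
                suffix_start = i
                break
        registrable = ".".join(labels[suffix_start - 1:])
        subdomain = ".".join(labels[:suffix_start - 1])
        return [parts.scheme, subdomain, registrable, parts.path.lstrip("/")]

    print(four_boxes("https://www.example.net/foo.html"))
    # -> ['https', 'www', 'example.net', 'foo.html']
    print(four_boxes("https://example.com.phishing.com/foo.html"))
    # -> ['https', 'example.com', 'phishing.com', 'foo.html']

The second example is exactly why the registrable-domain box matters: everything the attacker controls lands in the subdomain box, not the identity box.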


I'd personally invert the order of the 2nd and 3rd areas. Yes, it'll look ugly, but it's way easier for users to parse for phishing:

    https://example.com.phishing.com/foo.html -> [https] [phishing.com] [example.com] [foo.html]


You could go big-endian all the way:

    [https] [com] [phishing] [com] [example] [foo.html]


Phishing is not the only issue with URLs.


While we're at it, we could make the query parameters a text field that expands into a table, for easier editing of values.
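
A quick round-trip sketch in Python, using the reply URL from upthread (parse_qsl/urlencode handle the expand-into-rows and fold-back steps; the table UI itself is left to the imagination):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    url = "https://news.ycombinator.com/reply?id=19032043&goto=item%3Fid%3D19031237%2319032043"
    parts = urlsplit(url)

    # Expand the query string into editable (key, value) rows.
    rows = parse_qsl(parts.query)
    print(rows)  # [('id', '19032043'), ('goto', 'item?id=19031237#19032043')]

    # Edit a value in the "table", then fold it back into the URL.
    rows[0] = ("id", "12345")
    print(urlunsplit(parts._replace(query=urlencode(rows))))
    # -> https://news.ycombinator.com/reply?id=12345&goto=item%3Fid%3D19031237%2319032043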


Ditch the protocol and show a lock or not. My parents don’t know what “https” means.

Or ditch the protocol and not render http at all by default.


It is worse than a blacklist.

A blacklist is easy to understand: as long as we trust Google (lots of us don't), everything is fine.

With ML, not even Google has a full picture of what's going on.


At this point, ML to govern things like autoplay or the address bar is just whitewashing for biased data. You feed biased training data into an ML algorithm and now it's unbiased! Select your training data set and other parameters such that you get the result you want and you're good to go. The history of machine learning in tech is a history of obvious biases creeping into training data - whether it's a medical algorithm "learning" that background markings are an indicator of disease instead of classifying the tissue it's meant to, or recidivism-risk algorithms going off race and gender to the point that they produce actively bad data, or face-detection algorithms thinking that Asian people are squinting.

I don't even think that YouTube necessarily should get an autoplay prompt on first use, but it's pretty convenient that ML-based approaches like this are used instead of much simpler ones.

Lots of research is going into crafting adversarial data for known ML algorithms, as well. If this address-bar ML runs on the client (it'd have to, right?), then it's not hard to do a training run against it and come up with custom-tailored URLs/sites that get the ML to classify your attack as good.


While I agree that it is whitewashing biased data, maybe they get good results with new and unseen URLs that try to look like some relevant page using the same tricks as the scam URLs in the corpus.


Really good point.

ML equals diffusion of responsibility.


No, it doesn't. Google is still responsible. The team at G maintaining it is still responsible.

An ML solution is a completeness-vs-correctness trade-off. ML can make the blacklist virtually infinitely long, whereas a human team would likely burn out (and make more/different mistakes).


And what do you think will happen when they mess up pas.com/signup? Do you think they'll fix that for you? I hope the team at Firefox is licking their chops to implement a better, distinguishing alternative.


Until the day a Google domain is accidentally blacklisted. Then suddenly there will be an internal whitelist where they get to decide what goes in.


Does this really sound that alien to people? I seem to recall AOL controlling the internet, and blacklisting much of it, for a large portion of the population. Recall when people used to think that AOL keyword search was the entirety of the internet? This doesn't seem much different from the old AOL tactics, in my opinion.


> Recall when people used to think that AOL keyword search was the entirety of the internet?

I'm not the OP, but I personally don't remember any of that because I'm not an American (like a major part of the populace on the Internet) and I've never used AOL. And maybe AOL failed in America exactly because it did the things you mention, i.e. "controlling and blacklisting" a large part of the Internet.


Not alien, just a highly undesirable outcome.

In the 90s, while mass AOL CD mailings were going out, there was fear that an "AOLization" of the internet would happen.

The same incentives that produced AOL's curated, walled garden are present today for Google, Facebook, etc.


AOL wasn't the internet. It was a private network that eventually allowed you access to the internet as its popularity increased. Once dial-up died, they made a broadband client.

If Google is trying to make its own private internet on top of the public internet, I'm sure a few antitrust regulators will start asking about its hold on the search and ad markets.


Agreed. And what fun we'll have when Google's ML system screws up our authentic site's classification. I'm sure they'll jump right up with an apology.


> I'm highly skeptical of an approach that involves training users to rely on a black-box ML system.

Google did it with YouTube. If they do it to Chrome, I don't know if they can handle the developer frustration that will ensue (I'll put a nice red fullscreen "incorrect browser" banner on my website if users visit from Chrome).


It's not ML, according to an update to the article at the very end:

> Correction January 29, 10:30pm: This story originally stated that TrickURI uses machine learning to parse URL samples and test warnings for suspicious URLs. It has been updated to reflect that the tool instead assesses whether software displays URLs accurately and consistently.


My take was that the plan is to come up with a plan. Not the first time I've heard of that regarding URIs.



