Hacker News
10 Second Teleportation (upollo.ai)
117 points by caydenm on Jan 18, 2024 | 66 comments



I found a few leads googling around Palo Alto Networks docs website:

- "Advanced URL Filtering" seems to have a feature where web content is either can be evaluated "inline" or "web payload data is also submitted to Advanced URL Filtering in the cloud" [1].

- If a URL is considered 2 spooky to load on the user's endpoint, it can instead be loaded via "Remote Browser Isolation" in a remote-desktop-like session, on demand, for that single page only [2].

I think either (or both) could explain the signals you're detecting.

[1]: https://docs.paloaltonetworks.com/advanced-url-filtering/adm....

[2]: https://docs.paloaltonetworks.com/advanced-url-filtering/adm...


This looks exactly like it!! Nice find!


Ex-PANW here. It's almost certainly the firewall's URL Filtering feature (aka PAN-DB).

When someone makes an HTTP request, the firewall takes the host and path from the request and looks them up first in a local cache on the data plane, then in the cloud. (As you can imagine, bypassing the entire feature is therefore trivial for malware. You just open a connection to an arbitrary IP address and put, say, google.com in the host header. As far as the firewall can tell, you are in fact talking to google.com.)
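
To make that bypass concrete, here is a minimal Node.js sketch (the IP address and path are placeholders, not anything from this thread): the TCP connection goes to an arbitrary server, but a filter that classifies on the Host header would log it as a visit to google.com.

  // Minimal sketch of the Host-header trick described above.
  // The IP address and path are placeholders; a URL filter that trusts the Host
  // header would classify this as traffic to google.com, even though the TCP
  // connection goes to an unrelated server.
  const http = require("http");

  const req = http.request(
    {
      host: "203.0.113.10",              // arbitrary server (placeholder IP)
      port: 80,
      path: "/anything",
      method: "GET",
      headers: { Host: "google.com" },   // what the filter sees and looks up
    },
    (res) => {
      console.log("status:", res.statusCode);
      res.resume();
    }
  );
  req.on("error", console.error);
  req.end();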

When the URL isn't already known to the cloud, or hasn't been visited more recently than its TTL, it goes into a queue to be refreshed by the crawler, which will make its way there shortly thereafter to classify the page.

Palo Alto has other URL scanners, but none that would reliably visit the page after the user. URLs carved out of SMTP traffic, for example, would mostly be visited before the real user, not after.


Would that explain getting past an auth wall though, i.e. loading the HTML page as if the user were logged in but without auth headers and cookies?


You may be slightly misreading the write-up. Note the following two bits:

> What we found were user agents purporting to be from a range of devices including mobile devices, all only ever loading a single page without any existing state like cookies.

> The behavior itself is also strange, how did it load these pages which were often behind an authwall without ever logging in or having auth cookies?

I don't think they mean to say that pages behind authentication were successfully loaded without authenticating. If cookies are required to load the page, you aren't loading it without them. So I read this as "The sessions weren't authenticated, so where on earth did they even find these URLs?"

The answer is that there's a real, authenticated user behind a firewall, and every unknown URL this user visits is getting queued up for the crawler to classify later, query string and all. So the crawler's behavior looks like the user's, but offset by a few seconds and without any state. Presumably the auth wall is doing its job and rejecting these requests.
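
For illustration only, here is a rough sketch of what that pattern looks like in request logs and how one might flag it; the log-record shape (url, ip, userAgent, hasCookies, timestamp) is an assumption made up for this example, not anything from the article.

  // Illustrative sketch: flag request pairs that look like
  // "real user first, stateless crawler a few seconds later".
  function findTeleportPairs(requests, maxDelayMs = 30_000) {
    const byUrl = new Map();
    for (const r of requests) {
      if (!byUrl.has(r.url)) byUrl.set(r.url, []);
      byUrl.get(r.url).push(r);
    }
    const pairs = [];
    for (const [url, reqs] of byUrl) {
      reqs.sort((a, b) => a.timestamp - b.timestamp);
      for (let i = 1; i < reqs.length; i++) {
        const first = reqs[i - 1];
        const second = reqs[i];
        const delay = second.timestamp - first.timestamp;
        if (
          delay > 0 && delay <= maxDelayMs &&
          !second.hasCookies &&                  // the crawler arrives with no state
          second.ip !== first.ip &&              // from a different network
          second.userAgent !== first.userAgent   // often an older Chrome build
        ) {
          pairs.push({ url, first, second, delaySeconds: delay / 1000 });
        }
      }
    }
    return pairs;
  }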


OP here, I was trying to say that these pages were behind an authwall and were loading with userids for a specific user, but without any of the cookies that would support that auth.

This led us to believe the page was MitM'd rather than crawled directly (as a crawler would not be able to impersonate the user).


That's how I read it also. If the ids you're referring to were in the URL, it's almost certainly URL Filtering. The URLs are fed to the crawler via MITM, so you were basically right.


> I don't think they mean to say that pages behind authentication were successfully loaded without authenticating.

Hm, are you sure? From the article:

> Would render and execute all scripts on that page as if it was that user

> [...] scans pages by grabbing the page contents, sending it to a render queue and then processing it [...]

I know a system that fits the bill for the observed behavior: https://news.ycombinator.com/item?id=39051083

But apparently PAN can do it too: https://news.ycombinator.com/item?id=39051077


> Hm, are you sure? From the article:

> > Would render and execute all scripts on that page as if it was that user

If there is a valid user ID (or other user/session identifier) in the request URL or body, but no valid auth cookies, the system may respond with a page that references the same scripts the user would get, just with no data. In that case the scripts would run as they would for the user (perhaps requesting further resources, directly or by inserting references into the DOM, which is how they know the scripts ran), but render a "no data" message where the information would be.
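
As a toy illustration of that point (route, cookie name, and markup are invented for this sketch), a server can serve the same shell and script tags whether or not the session cookie is valid, so any embedded analytics script executes either way:

  // Toy Node.js server: without a valid session cookie it still returns the same
  // HTML shell and <script> tags, just with a "no data" placeholder, so analytics
  // or telemetry embedded in the page runs for both the user and the crawler.
  const http = require("http");

  http.createServer((req, res) => {
    const authed = /session=valid/.test(req.headers.cookie || "");
    const body = authed
      ? "<div id='account'>Balance: $1,234.56</div>"
      : "<div id='account'>No data available.</div>";
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(`<!doctype html>
      <html><body>
        ${body}
        <script src="/static/app.js"></script>
        <script src="/static/analytics.js"></script>
      </body></html>`);
  }).listen(8080);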


Might as well be a browser extension.

I remember setting up a Confluence server which was only used by me, but had public access (still password protected).

When checking the logs, I noticed an external IP trying to access pages which I had accessed previously, but they got redirected to the log-in page. The paths were very specific, some which I had bookmarked, so it was clear that there was an extension logging my browsing and some server or person then tried to access my pages.


Could it be a MitM "enterprise browser" like Talon or Island, and/or related browser extensions?

https://www.paloaltonetworks.com/company/press/2023/palo-alt...

> Dec. 28, 2023 Palo Alto Networks .. announced that it has completed the acquisition of Talon Cyber Security, a pioneer of enterprise browser technology ... Talon's Enterprise Browser will provide additional layers of protection against phishing attacks, web-based attacks and malicious browser extensions. Talon also offers extensive controls to help ensure that sensitive data does not escape the confines of the browser.

https://www.island.io/product

  Set hyper-granular policies ... boundaries across all users, devices, apps, networks, locations, & assets 

  Log any and all browser behavior, review screenshots of critical actions, & trace incidents down to the click

  Critical security tools embedded into the browser: like native browser isolation, automatic phishing protection, & web filtering


> Palo Alto Networks ... after reading product page after product page, we couldn’t work out exactly what product it was

Well that definitely tracks.


I remember working somewhere that had something like this. Most people had Windows machines, but I had a Mac that I had set up myself.

My machine wanted me to accept a client certificate from Palo Alto Networks.

I did not and kept refusing.

I think they had some sort of intrusive mitm proxy that filtered everything everyone was doing/browsing.


The usual way is to require a custom CA on all clients; it sounds like an ineffective setup if you can just ignore it. I.e. it should be an intermediate certificate for the proxy that you need to acknowledge.
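
If you want to check whether you're behind such a proxy, one rough way is to look at who issued the certificate your client actually receives; a minimal Node.js sketch (example.com is just an example target):

  // Rough TLS-interception check: print the issuer of the certificate the client
  // actually receives. Behind a decrypting proxy this is typically the corporate
  // CA (or an intermediate issued by it) rather than a public CA.
  const tls = require("tls");

  const socket = tls.connect(
    {
      host: "example.com",
      port: 443,
      servername: "example.com",
      rejectUnauthorized: false, // allow inspecting the chain even if the intercepting CA isn't trusted
    },
    () => {
      const cert = socket.getPeerCertificate(true);
      console.log("subject:", cert.subject && cert.subject.CN);
      console.log("issuer: ", cert.issuer && cert.issuer.CN);
      socket.end();
    }
  );
  socket.on("error", console.error);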


I believe it was the browser's warning dialog telling me about the untrusted connection and asking if I wanted to accept the certificate.

I suspect most of the other non-dev machines in the building had the CA installed by IT.


It could be a chat preview generator. Users DM links to some internal project pages in a chat tool and the tool fetches the page in the background in an attempt to render a preview.


That was on my list of candidates as well! Those usually have a specific user agent making it clear what they are, they appear from a company's netblock (e.g. Facebook, Microsoft), and they cannot access authed pages (unless the key is in the URL).

In this case these all appeared to be MitM'ed pages from a security device, since the key wasn't in the URL and the page contained userids for a specific user.


In that case, the preview system would do (e.g.) GET https://example.com/private/page, but it would get a 401 Unauthorized response back, have none of the page content, and not execute any of the scripts included in that /private/page:

> * That somehow had the page content from a user

> * Would render and execute all scripts on that page as if it was that user


Same thing happened with my work computer on the office network, which has a MITM HTTPS firewall. The IP address jumps between the coasts at random, confusing the Windows weather widget, and images fail to load on a lot of websites because the IP address change triggers something in their CDN. Everything works fine when I'm WFHing, so it has to be the office network.

Oh, and this can also happen when a mobile user jumps off their home wifi network onto an internationally roaming data card. Why would they do that? Because data is cheaper that way, or they are actually tourists. So please do not block users just because they are doing this teleportation dance.


My mail provider locked my account after I used the satellite internet on an intercontinental flight (my IP location must have bounced all over). Got a serious scare later at my hotel, since pretty much all of my itinerary plans and details were kept in that account.

Thankfully that could be resolved, but it wasn't a great way to start a vacation.


Here's my wild guess:

Some other code running in the browser window (probably a browser extension, but possibly another script tag in the page, inserted by an intermediate firewall/proxy) is doing this. It could be corporate spyware (i.e. forced on users by the IT department), or an extension that only tends to be used by large institutions (because it relates to some expensive enterprise product). Alternatively, it could be a much more popular browser extension, but it only executes this capture when it determines that the user is within a target list of large institutions.

I'm making the same guess as the author about the execution process: that the code is shipping a huge amount of page content to a cloud server, e.g. the full DOM, and then rendering that DOM in this older Chrome version. It's not fetching the same page from the origin server, which is how it's able to do this without auth cookies.
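
A minimal sketch of what that capture side could look like in the browser, assuming an injected script or extension; the collector endpoint is made up for illustration, and the ten-second delay simply mirrors the article's observation:

  // Hypothetical capture script (collector URL is invented). It serializes the DOM
  // as already rendered for the authenticated user -- no cookies needed by whoever
  // receives it -- and ships it off once the page has likely finished loading.
  setTimeout(() => {
    const snapshot = {
      url: location.href,
      capturedAt: new Date().toISOString(),
      html: document.documentElement.outerHTML, // full DOM as the user sees it
    };
    navigator.sendBeacon(
      "https://collector.example.com/snapshots", // hypothetical endpoint
      new Blob([JSON.stringify(snapshot)], { type: "application/json" })
    );
  }, 10_000); // ~10s: long enough for the page to settle, short enough to beat navigation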

As part of rendering, the page's script tags all get executed again, which is why Upollo is seeing this. (Note that I don't know if this re-execution of script tags is deliberate. There's a good chance that it's an unintended side-effect of loading the DOM into Chrome, but it doesn't seem to break anything so nobody's bothered to disable it.)

It's only sampling a small percentage of executions, which is why it's not continually happening for every interaction by these users.

It's waiting ten seconds so that the page's network interactions are likely to have finished by then. Waiting longer would increase the odds of the user navigating to another page before the code has had a chance to run.

The article doesn't say if there are particular kinds of pages being grabbed, but looking for commonality between them would help.

The main thing that stumps me – assuming I've understood it correctly – is why the second render is happening across such a diverse set of cloud networks.


Browser extension is what we originally thought for exactly the same reasons you did. We started to see some requests show up from iOS devices which didn't support extensions so that made us think MitM corporate proxies.

The diversity of cloud networks looks to be due to these being deployed by individual institutions (e.g. universities, corporations, etc.) rather than being run only from Palo Alto Networks' data centers.

We also saw slightly different configurations with different browser versions, but with the same pattern of behaviour.


iOS has supported Safari extensions since iOS 15 (late 2021). There are far fewer extensions for Safari than Chrome or Firefox; they've been steadily adding more as Safari gets closer to the same Web Extension standard used by other browsers, but most developers still shun iOS support since the extension has to be wrapped in an iOS app rather than being loaded from the web.

https://support.apple.com/guide/iphone/get-extensions-iphab0...


"Palo Alto Networks" is something that shows up clearer than anything else in my lighttpd logs, as they include the "we're palo alto networks doing research, contact us here(email) for us not to scan" in http request headers. They appear to do full ipv4 range scan many times a day IIRC.

Funnily enough, I got motivated to try to make my own crawler show up the same way in my server logs through raw scan breadth, i.e. by hitting so many servers that I'd see my own crawler in the logs without any kind of targeting. A kind of "planetary-level experiment" born of curiosity.

Had to tweak masscan settings till my crappy router could keep up with the routing load. Ended up with something like 500 addresses/sec, which pales in comparison to the best hardware used for this, which, combined with masscan, scans the IPv4 space in 6 minutes. Managed to scan 1% of the IPv4 space while I slept before I started to get seriously throttled and received a quite angry email from my ISP. Just told them "Oh thanks for noticing, I've now fixed the offending device" (pressed Ctrl+C) and never ran the scan again lol.

Ran the scan with masscan with no blacklist. Don't recommend it, at least not more than once, unless you get a good blacklist to follow.


> masscan

> This is an Internet-scale port scanner. It can scan the entire Internet in under 5 minutes, transmitting 10 million packets per second, from a single machine.

Absolutely insane


Aren't there systems where a server does the browsing and/or page rendering but it's controlled by terminals using other protocols?

Just speculatively, if someone was managing the setup of a room full of NSA analysts browsing for OSINT, how would they cover their tracks? What would that traffic look like?


It would look much like any other institution full of people doing general web browsing. A university full of foreign students googling stuff in their home languages. A hospital full of patients googling about random stuff. An airport full of international passengers surfing twitter feeds for war news.


What they choose to investigate is itself revealing. CDNs and large hosting providers for example would be in a position to make inferences by observing and correlating traffic from that origin. I would be trying to obfuscate it using a VPN distributed over a range of countries and IP addresses. That could appear strange to a host, depending on how they implement it.


Except that most open source material is now encrypted. They could see lots of traffic towards Twitter/Facebook/YouTube/google and lots of overseas news sources but would have little insight into actual content.


The NSA doesn't want any of the hundreds of thousands of Twitter/Facebook/YouTube/Google workers to have that insight either. And when the sites they're visiting aren't encrypted, do they just not browse to them because of OPSEC? It would all be at risk of side-channel analysis. They aren't going to leave it to chance. Count on them covering their tracks.


Sounds like it could easily be the Cisco Umbrella junk I've seen at a few gov/university networks. They install MITM CAs [0] on managed hardware, so they can definitely see page content.

[0] https://docs.umbrella.com/deployment-umbrella/docs/install-c...

Edited to add link to docs.


What would the benefit of something like this even be? Is it possible it's some sort of tool that archives a user's internet usage?


It appears this is to find threats that might not otherwise have triggered, or to work out if particular sites are dangerous, without monitoring a user's machine.

It is scary that, for people in a corporate environment, this could be rendering banking, messaging, or any other page's contents.


In a corporate environment, i.e. one with managed laptops/workstations, it's best to assume that your employer can access the content of every page you visit.

Some employers might not actually do that, but that decision usually isn't static, nor will a change in it have to be reported to you under most policies.


I spoke w. a Palo Alto vendor rep a few months ago. We were talking about the features of the firewall appliance one of my clients was using.

They have a feature that effectively "tests" what the user is about to load in a virtual environment, and sees if that content behaves abnormally. I forgot what they called it. It sounds like this could be it.


Maybe remnants of Genesis Market [0]?

[0] https://en.wikipedia.org/wiki/Genesis_Market


A lot of larger orgs (universities etc.) use Palo Alto Networks GlobalProtect as their VPN for accessing the org's intranets.

Maybe related somehow to that?


With the rise of “store all the things you see / interact with for LLM recall” solutions, it could just be that. Some kind of async archival


> obviously some kind of security system

I don't know where the "security" bit comes from, but this is, to me, obviously web scraping.


This could be a browser AI tool that scans pages (Copilot does this in Edge… it is useful).


Could it be a "read it later" type of article reader/storage service? I know of at least one that fits the bill in that it uploads locally-viewed HTML to a server which then renders that page in a headless Chrome instance for archival:

I've recently been wondering how Omnivore, unlike e.g. Pocket, is able to store paywalled content (for which I have a subscription) on iOS when saving it via the Omnivore app target in the share sheet, but not when directly pasting the target URL in the webapp or iOS app.

Turns out that sharing to an iOS app actually enables [1] the app to run JavaScript in the Safari web context of the displayed page, including cookies and everything!

If I'm skimming the client and server source code correctly, it does just that: it seems to serialize and upload the HTML of the page [2] and then invoke Puppeteer on the server [3]. Puppeteer drives a scriptable, headless Chrome – that would fit the bill of "an outdated Chrome running in a data center"!

Omnivore can also be self-hosted since both client and server are open-source; that would explain you seeing multiple data center IPs.

[1] https://developer.apple.com/library/archive/documentation/Ge...

[2] https://github.com/omnivore-app/omnivore/blob/main/apple/Sou...

[3] https://github.com/omnivore-app/omnivore/blob/57aca545388904...
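
For anyone curious what the server-side half of such a pipeline looks like in general (this is not Omnivore's or Palo Alto's actual code, just a minimal Puppeteer sketch): re-rendering captured HTML re-executes its script tags, which is exactly the second "visit" an analytics library would observe.

  // Minimal Puppeteer sketch: render previously captured HTML in headless Chrome.
  // page.setContent() executes the snapshot's <script> tags again, producing the
  // stateless "second visit" from a data-center IP described in the article.
  const puppeteer = require("puppeteer");

  async function renderSnapshot(html) {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: "networkidle0" }); // scripts run here
    const screenshot = await page.screenshot({ fullPage: true });
    await browser.close();
    return screenshot;
  }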


I wonder if this could be iCloud Private Relay? It appears that it's effectively a VPN with some redirection layers that change often, though I don't know the exact details.


From the article:

> But wait, these are different devices, they have none of the same cookies. If this were a VPN it would be the same device.


Prisma Cloud / Global Protect?


Who thinks it is appropriate to use a one-second looping GIF in an article?

Could be interesting, but I can't read this shit with flashing images.


Unrelated to the article directly, it's kinda neat that the site's text selection highlight color is randomized on every mousedown.


  /* Text Higlight Color /**/
  :root {
    --highlight-color: null;
  }

  ::selection {
    background: var(--highlight-color);
    color:#FFFFFF;
  }
  ::-moz-selection { /* Code for Firefox */
    color: #FFFFFF;
    background: var(--highlight-color);
  }
  </style>

  <!-- Text Highlight -->
  <script>
    const colors = ["#F76808", "#30A46C", "#0091FF", "#6E56CF", "#E5484D"];
    window.addEventListener("mousedown", (e) => {
      const color = colors.shift();
      document.documentElement.style.setProperty("--highlight-color", color);
      colors.push(color);
    });
  </script>


Josh on our team is so happy people discovered and liked his easter egg!


On the other hand, as someone with ADHD who is easily distracted, I couldn't even finish reading the article. Those GIFs are super annoying and don't play just once; they play again and again.

Sad, because it sounded interesting, but no way I could focus enough to actually comprehend it.


However, setting the highlight color to pure white in the CSS isn't very nice for people who load the page with JavaScript disabled.


I'm ashamed to admit that I spent more time highlighting random snippets of text than I did reading the article.


I went back just to see and it chose WHITE as the first option, making the text invisible lol


Just curious, what browser? This seems to be a bug.


Chrome, nothing special on it


did cursor-fidgeting-while-reading lead to this discovery?


I used to do this. I had to retire several mice because the underside of the left mouse button was worn away from hitting the micro-switch so much...


I'm missing something

> strange devices show up for some of our customers' users

> how did it load these pages which were often behind an authwall without ever logging in or having auth cookies?

Either

- The customer has screwed up user auth big time and some X knows that.... let's go with no

- OP's data is wrong or they are reading it wrong

- They are explaining it badly.


What's happening is that some MitM Palo Alto Networks system is intercepting the HTML contents of the page, waiting a bit, and then rendering that HTML content again in an old Chrome on a separate machine. It's as if you went to an authenticated page that only you can see, like https://news.ycombinator.com/flagged?id=aaron695, did "View Source", copied and pasted that source into an HTML file, sent me the HTML file, and I opened it on my computer.


Are you sure it has the page contents, or has it just got the URLs that were called?

Either way it feels like malware on a client machine, but that doesn't necessarily mean the page contents are being read by the malware.

I guess you could confirm this with some JavaScript that only loads a URL if the Chrome version is not the latest -- the attempt to load that URL would not occur on GoodChrome, but it would on the "security" device. So if the page contents were being shipped to BadDevice wholesale, the URL would be loaded; but if BadDevice were just re-requesting the URLs that GoodChrome had called, the URL wouldn't be requested.
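
A sketch of that experiment (the cutoff version and beacon endpoint are invented for illustration): the snippet only fires if the executing Chrome is older than the current release, so a hit on the endpoint means the full page content, not just the URL, reached the outdated renderer.

  // Version-gated canary: only outdated Chrome builds execute the beacon.
  // GoodChrome (current version) never runs this branch; the "security" device's
  // older Chrome would, but only if it received and re-executed the page's scripts.
  const LATEST_KNOWN_CHROME = 121; // assumption: whatever is current when deployed

  const match = navigator.userAgent.match(/Chrome\/(\d+)/);
  const major = match ? parseInt(match[1], 10) : null;

  if (major !== null && major < LATEST_KNOWN_CHROME) {
    navigator.sendBeacon("https://example.com/old-chrome-canary", location.href);
  }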


Exactly! Our library is embedded in these pages and, similar to Segment or other analytics tools, gets told information about user events from that state. Sometimes that state is stored in the page that is sent over the wire (e.g. a userid), and as a result we get a request saying a particular user is on the other side of the world.


Ten Second*


Thank you. I'm not the brightest, and could not figure out the headline. What was the first teleportation?


Yet another example of how HN's automated title modification screws things up


I thought I messed it up! I had no idea this was a thing.


You can always edit it after the fact.


Explicit feedback to the submitter that a submission title (or URL) is being modified (e.g., URL canonicalisation, denumeralisation, de-howification) might help ameliorate this issue.

It's bad enough trying to proofread my own comments, what with being navigated away from the comment (I've a few recent typos caught well outside the edit window which nag at me as I type). Changing a submitter's content without notice is ... less than optimal.

I'll note that certain edit features (e.g., year edits) do involve a confirmation, which has in fact proved useful.



