
I wonder who you are referring to when you say "we"

I think the poster refers to NATO countries. Considering you are a Russian citizen, it's safe to say you're not included in this "we", though frankly I think you would also benefit in the end.

Great job! It seems you have around 200k companies to list. How do you handle scraping at that scale, given that all websites are different? What if the schema and markup change? I'd be interested to hear what the DevOps aspect looks like.


Thank you so much. In some cases I was able to standardize where the title and location are located on the page (Greenhouse, Lever, etc.). But mostly it relies on a validated dataset of job descriptions: lists of phrases from it are matched against the plain text of a page (with markup removed). The scraper also remembers which companies and career pages have job listings, and it prioritizes those companies, visiting them more often than the ones that don't. Currently there are 4 worker services, each visiting about 10k company websites per day.
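
A minimal sketch of that kind of matching and prioritization, with illustrative phrases, weights and field names rather than the real implementation:

    import re

    # Illustrative phrase list; not the production dataset.
    JOB_PHRASES = ["apply now", "open positions", "we're hiring", "job openings"]

    def strip_markup(html: str) -> str:
        # Crude tag removal; a real crawler would use a proper HTML parser.
        return re.sub(r"<[^>]+>", " ", html)

    def looks_like_job_page(html: str) -> bool:
        text = strip_markup(html).lower()
        return any(phrase in text for phrase in JOB_PHRASES)

    def crawl_priority(company: dict) -> float:
        # Companies with known listings get revisited more often.
        score = 1.0
        if company.get("has_listings"):
            score += 5.0
        score += min(company.get("days_since_visit", 0), 30) / 30
        return score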


I didn't build this, but here's my guess: most companies use a handful of ATSs (applicant tracking systems), like Greenhouse, Lever and Workday. Almost all of the jobs posted on these platforms are public, and their top-level pages are indexable.

If I built something like this, I would start by searching for pages that contain HTML fragments indicative of those systems, a few times per week (since job listings don't change much).

While this won't do anything to reveal "real" ghost jobs (job reqs that are hidden or generic enough to be used for interesting referrals), it's probably a minor edge over LinkedIn Jobs (the home of stale jobs). Many of these companies cross-post to those platforms anyway.
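
The fingerprinting could be as simple as the sketch below; the marker strings are the hosted domains these ATSs typically use, listed as illustrative assumptions rather than an exhaustive set:

    # Illustrative ATS fingerprints; not exhaustive.
    ATS_MARKERS = {
        "greenhouse": ["boards.greenhouse.io"],
        "lever": ["jobs.lever.co"],
        "workday": ["myworkdayjobs.com"],
    }

    def detect_ats(html: str) -> str | None:
        lowered = html.lower()
        for ats, markers in ATS_MARKERS.items():
            if any(marker in lowered for marker in markers):
                return ats
        return None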


Feel free to correct me on this, but here's my understanding. Comprehensive support products cover four main sub-products:

1. FAQ/Knowledge bases with search functionality.

2. Conversational mediums and agent notifications (e.g., live chat widget, messenger support).

3. Ticket management systems and agent management, which is the core of Zendesk/Intercom. This is the most difficult to operationalize as it requires process architecture, SLA management, etc.

4. Orchestration and workflow management, which can be done inside #3, though standalone products are available as well.

Most new post-LLM startups target #2 but face platform risks as they rely on companies covering #1, #3, and #4 (e.g., Zendesk, Intercom, Gorgias).

I feel InKeep is doing some combination of #2, but emphasising that you can support clients wherever they are (i.e. GitHub, Discord, Slack) instead of asking them to submit tickets in a website widget.

Another issue for AI support startups is the vertical/horizontal trap. Most LLM deployments require solid tuning per client, especially for enterprises like us. Startups often avoid this initially, opting for a more horizontal, general path (e.g., AI support for Shopify merchants). This is where enterprise players have an advantage: companies like ServiceNow, Zoom, and Oracle offer both support products and implementation services.

Neat business imo.


are you implying that a custom implementation service for enterprises is a good business?


that's the reality of the post-LLM Customer Support business.


any differences from nango or supaglue?


Yes, we aim to focus on companies building with their own LLMs by providing embeddings and chunking out of the box on top of all the data we sync across different software.
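
Roughly the shape of it, as a sketch only: embed() stands in for whatever embedding model is plugged in, and none of the names reflect the actual API:

    from typing import Callable

    def chunk_text(text: str, max_words: int = 200) -> list[str]:
        # Naive fixed-size chunking by word count.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    def index_record(record: dict, embed: Callable[[str], list[float]]) -> list[dict]:
        # One synced record (ticket, doc, CRM note, ...) becomes several
        # embedded chunks, ready to load into a vector store.
        return [
            {"source_id": record["id"], "chunk": chunk, "vector": embed(chunk)}
            for chunk in chunk_text(record.get("body", ""))
        ]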


sounds like a neat use-case!


What's wrong with AI agents accessing website content? We seem to have been happy with Google doing that for ages in exchange for displaying the website in search results.


The website owner chooses. They can say "nope" in robots.txt. Not everyone respects this, but Google does. Google can choose not to show that site as a result, if they want to.

This adds a third option besides yes and no, which is "here's my price". Also, because Cloudflare is involved, bots that just ignore a "nope" might find their lives a bit harder.
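
The yes/no part is plain robots.txt; a quick illustration with Python's stdlib parser (GPTBot is OpenAI's published crawler token, the rest of the policy is made up):

    import urllib.robotparser

    # A site that welcomes Googlebot but says "nope" to an AI crawler.
    ROBOTS_TXT = """\
    User-agent: Googlebot
    Allow: /

    User-agent: GPTBot
    Disallow: /
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True
    print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False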


Robots.txt is for crawlers. It's explicitly not meant to say one-off requests from user agents can't access the site, because that would break the open web.


Yep, there are really two parts to this.

* Some company's crawler they're planning to use for AI training data.

* User agents that make web requests on behalf of a person.

Blocking the second one because the user's preferred browser is ChatGPT isn't really in keeping with the hacker spirit. The client shouldn't matter; I would hope that the web is made to be consumed by more than just Chrome.


The thing people have been doing for ages is a trade: I let you scrape me and in return you send me relevant traffic. The new choice isn't about a trade, so it's different.


And AI agents scrape your content in exchange for what exactly?


Sorry, I'm distinguishing here between an AI agent that basically automates the visual lookup for a user, and scraping by big tech to feed content into LLMs. I don't see any problem with the first one tbh.


Yeah, there's a lot of confusion between AI training and AI agent access, and it's dangerous.

Training embeds the data into the model and has copyright implications that aren't yet fully resolved. But an AI agent using a website to do something for a user is not substantially different than any other application doing the same. Why does it matter to you, the company, if I use a local LLaMA to process your website vs an algorithm I wrote by hand? And if there is no difference, are we really comfortable saying that website owners get a say in what kinds of algorithms a user can run to preprocess their content?


> But an AI agent using a website to do something for a user is not substantially different than any other application doing the same.

If the website is ad-supported then it is substantially different - one produces ad impressions and the other doesn't. Adblocking isn't unique to AI agents of course but I can see why site owners wouldn't want to normalize a new means of accessing their content which will inherently never give them any revenue in return.


I don't believe that companies have the right to say that my user agent must run their ads. They can politely request that it does and I can tell my agent whether to show them or not.


True, but by the same measure your user agent can politely request a webpage and the server has the right to say 403 Forbidden. Nobody is required to play by the other party's rules here.


Exactly. The trouble is that companies want the benefits of being on the open web without the trade-offs. They're more than welcome to turn me down entirely, but they don't do that because that would have undesirable knock-on effects. So instead they try to make it sound like I have a moral obligation to render their ads.


For traditional search indexing, the interests of the aggregator and the content creator were aligned. AIs, on the other hand, are adversarial to the interests of content creators: a sufficiently advanced AI can replace the creator of the content it was trained on.


We're talking in this subthread about an AI agent accessing content, not training a model on content.

Training has copyright implications that are working their way through courts. AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.


Ok, fine, let's restrict it to AI agents only, without training. It's still an adversarial relationship with the content creator. When you take an AI agent and ask it "find me the best Italian restaurant in city xyz", it scans all the restaurant review sites and gives you back a recommendation. The content creator bears all the burden of creating and hosting the content and reaps none of the reward, as the AI agent has now inserted itself as a middleman.

The above is also a much clearer / more obvious case of copyright infringement than AI training.

> AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.

This is a non sequitur, but yes, you are right: everything in the future will be behind a login screen and search engines will die.


> The content creator bears all the burden of creating and hosting the content and reaps none of the reward, as the AI agent has now inserted itself as a middleman.

My god, what's happened to our industry. Locking the web down to known clients that are sufficiently "not the user's agent" betrays everything the web is for.

Do you really hate AI so much that you'll give up everything you believe in to see it hurt?


Like I said in another comment, I'm pointing out what is going to actually happen based on incentives, not what I want to happen. I'd much rather the open web continue to exist and I think AI will be a beneficial thing for humanity.

edit: to be clear, it's already happening. Blogs are moving to substack, twitter blocks crawling, reddit is going the same way in blocking all crawlers except google.


To be optimistic, as long as anonymous access is a thing, or creating free accounts is a thing, such crawler blocks can probably be bypassed. I hope so, at least.


> reaps none of the reward

Just to be clear what we're talking about: the reward in question is advertising dollars earned by manipulating people's attention for profit, right?

I frankly don't think that people have the right to that as a business model and would be more than happy to see AI agents kill off that kind of "free" content.


It reminds me of urban legends like "YouTube listens to you and then recommends videos based on your offline chats".


He is a French citizen after all [0] https://www.forbes.ru/milliardery/446937-pavel-durov-polucil... (ru)


WhatsApp is still quite popular and it’s not blocked.


Many would oppose the idea, but if any service (e.g. eBay, LinkedIn, Facebook) were to dump a snapshot to S3 every month, that could be a solution. You can't prevent scraping anyway.


We publish a live stream of minutely updated OpenStreetMap data in ready-to-digest form on https://planet.openstreetmap.org/ and S3. Scraping of our data still happens.

Our S3 bucket is thankfully supported by the AWS Open Data Sponsorship Program.
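
For anyone wondering what "ready to digest" means in practice, here's a rough sketch of pulling the latest minutely diff; the URL layout follows the documented replication/minute convention, but treat the details as illustrative rather than official client code:

    import gzip
    import requests

    BASE = "https://planet.openstreetmap.org/replication/minute"

    def latest_sequence() -> int:
        # state.txt carries the newest sequence number, e.g. "sequenceNumber=6543210".
        state = requests.get(f"{BASE}/state.txt", timeout=30).text
        for line in state.splitlines():
            if line.startswith("sequenceNumber="):
                return int(line.split("=", 1)[1])
        raise ValueError("sequenceNumber not found in state.txt")

    def diff_url(seq: int) -> str:
        # Sequence 6543210 maps to .../006/543/210.osc.gz.
        padded = f"{seq:09d}"
        return f"{BASE}/{padded[0:3]}/{padded[3:6]}/{padded[6:9]}.osc.gz"

    seq = latest_sequence()
    osc_xml = gzip.decompress(requests.get(diff_url(seq), timeout=60).content)
    print(f"minute diff {seq}: {len(osc_xml)} bytes of OsmChange XML")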


Would the snapshot contain the same info (beyond any doubt) that an actual user would see if they opened LinkedIn/Facebook/the service from Canada on an iPhone on a Saturday morning (for example)? If not, the snapshot is useless for some use cases and we are back to scraping.


Data from S3 isn't free though; it still costs money, and there are limits depending on what tier you pay for.


Yeah, you can get dumps of Wikipedia and stackoverflow/stackexchange that way.

(Not sure if they're created by the admins or a third party, but doing it once for many is better than overlapping individual efforts.)


I wonder how new payment solutions compete with the growth of Visa in emerging markets and what impact they have on the company's strategy overall. Here in Brazil, I usually use PIX for payments, although card payments are still possible. I imagine a similar situation exists in Southeast Asia or even India.

