Hacker News | goncharom's comments

A multi-purpose scraper to turn any webpage into structured data: https://news.ycombinator.com/item?id=45870231

It uses LLMs to generate Python code that scrapes a webpage and fits the result to any Pydantic model you provide:

  from hikugen import HikuExtractor
  from pydantic import BaseModel
  from typing import List
  
  class Article(BaseModel):
      title: str
      author: str
      published_date: str
      content: str
  
  class ArticlePage(BaseModel):
      articles: List[Article]
  
  extractor = HikuExtractor(api_key="your-openrouter-api-key")
  
  result = extractor.extract(
      url="https://example.com/articles",
      schema=ArticlePage
  )
  
  for a in result.articles:
      print(a.title, a.author)


Relevant read (not my own): https://simone.org/advertising/



Every time I see comments like these I think about this research from Anthropic: https://www.anthropic.com/research/mapping-mind-language-mod...

LLMs activate similar neurons for similar concepts not only across languages, but also across input types. I’d like to know if you’d consider that a good representation of “understanding” and, if not, how you would define it.


Anthropic is pretty notorious for peddling hype. This is a marketing article: it has not undergone peer review and should not be mistaken for scientific research.


It has a proper paper attached right at the beginning of the article.


It’s not peer-reviewed, and was never accepted by a scientific journal. It’s a marketing paper masquerading as science.


If I could understand what the brain scans actually meant, I would consider it a good representation. I don't think we know yet what they mean. I saw a headline the other day about a person with "low brain activity" who was in complete denial about it. I would be too.


As I said then, and probably echoing what other commenters are saying: what do you mean by understanding when you say computers understand nothing? Do humans understand anything? If so, how?


Does a computer understand how hot or cold it is outside? Does it understand that your offspring might be cranky because they’re hungry? Or not hungry, just tired? Can it divine the difference?

Does a computer know if your boss is mad at you or if they had a fight with their spouse last night, or whatever other reason they may be grumpy?

Can a computer establish relationships… with anything?

How about when a computer goes through puberty? Or menopause? Or a car accident? How do those things affect them?

Don’t bother responding, I think you get the point.


Yes, this literally just appends site:reddit.com to the query and redirects to Google. The page is a single HTML file: https://github.com/goncharom/greggit

This was meant as a silly project! I caught myself adding site:reddit.com to a lot of Google searches, so I figured I’d just make a shortcut.
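
For anyone curious what "append and redirect" amounts to, here is a minimal sketch in Python. It is not the code from the repo (the real page is a single HTML file); the function name `reddit_search_url` is made up for illustration, and the only assumption is Google's standard `/search?q=` URL format.

```python
from urllib.parse import urlencode


def reddit_search_url(query: str) -> str:
    """Build a Google search URL restricted to reddit.com (illustrative helper)."""
    # Append the site: operator to whatever the user typed.
    scoped_query = f"{query.strip()} site:reddit.com"
    # urlencode escapes spaces and special characters for the query string.
    return "https://www.google.com/search?" + urlencode({"q": scoped_query})


print(reddit_search_url("best mechanical keyboard"))
```

A browser-side version would do the same string concatenation and then navigate to the resulting URL.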


The regeneration loop was probably the most interesting part to work on: you need very strict constraints on what “good” content looks like, and you need to pin down the specific issue when codegen fails. I found Pydantic annotations to be especially useful for this.
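
To make that concrete, here is a rough sketch of how field annotations can double as the definition of "good" content and as the source of specific error messages. This is not Hikugen's actual code: the `validation_feedback` helper and the particular constraints are assumptions chosen for illustration, using only standard Pydantic features (`Field`, `ValidationError.errors()`).

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Article(BaseModel):
    # Constraints act as a machine-checkable definition of "good" content.
    title: str = Field(min_length=1, description="Headline text, no HTML tags")
    author: str = Field(min_length=1, description="Byline name")


class ArticlePage(BaseModel):
    articles: List[Article]


def validation_feedback(raw: dict) -> str:
    """Return '' if raw fits ArticlePage, else one line per violated field.

    Hypothetical helper: the returned text is what could be sent back to the
    LLM so it knows exactly which field to fix.
    """
    try:
        ArticlePage(**raw)
        return ""
    except ValidationError as exc:
        lines = []
        for err in exc.errors():
            # err["loc"] is a path such as ("articles", 0, "title").
            path = ".".join(str(part) for part in err["loc"])
            lines.append(f"{path}: {err['msg']}")
        return "\n".join(lines)


print(validation_feedback({"articles": [{"title": "", "author": "Ana"}]}))
```

The exact wording of each message depends on the installed Pydantic version, but the field path pinpoints where the generated code went wrong.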


I've been working on web scraping using LLMs. I just shared one of the libraries I created to get structured data from arbitrary pages: https://news.ycombinator.com/item?id=45870231

Instead of sending the page's HTML to an LLM and asking for the data directly, Hikugen asks the LLM to generate Python code that fetches the data, and enforces that the extracted data conforms to a Pydantic schema defined by the user. I'm using this to power yomu (https://github.com/goncharom/yomu), a personal email newsletter built from arbitrary websites.
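
A simplified sketch of that generate, run, validate cycle is below. It is an illustration under stated assumptions, not Hikugen's implementation: `extract_with_retries`, the `extract(html)` convention for generated code, and `fake_llm` (a stand-in that returns canned code instead of calling a model) are all hypothetical names.

```python
from typing import Callable, List, Optional

from pydantic import BaseModel, ValidationError


class Article(BaseModel):
    title: str
    author: str


class ArticlePage(BaseModel):
    articles: List[Article]


def extract_with_retries(
    html: str,
    generate_code: Callable[[Optional[str]], str],
    max_attempts: int = 3,
) -> ArticlePage:
    """Ask for extractor code, run it, validate the result, retry with feedback."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        source = generate_code(feedback)
        namespace: dict = {}
        try:
            # WARNING: exec runs untrusted code; real use needs sandboxing.
            exec(source, namespace)
            raw = namespace["extract"](html)
            return ArticlePage(**raw)  # the schema is the acceptance test
        except ValidationError as exc:
            feedback = f"Output did not match the schema:\n{exc}"
        except Exception as exc:
            feedback = f"Generated code failed: {exc!r}"
    raise RuntimeError(f"No valid extraction after {max_attempts} attempts: {feedback}")


def fake_llm(feedback: Optional[str]) -> str:
    """Stand-in for a model call: the first attempt omits the author field."""
    if feedback is None:
        return 'def extract(html):\n    return {"articles": [{"title": "Hello"}]}\n'
    return 'def extract(html):\n    return {"articles": [{"title": "Hello", "author": "Ana"}]}\n'


page = extract_with_retries("<html></html>", fake_llm)
print(page.articles[0].title, page.articles[0].author)
```

One appeal of generating code rather than extracting directly is that a validated extractor can be reused on later fetches of the same site without another model call.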

