It uses LLMs to generate Python code that scrapes a webpage to fit any Pydantic model you provide:
from hikugen import HikuExtractor
from pydantic import BaseModel
from typing import List


class Article(BaseModel):
    title: str
    author: str
    published_date: str
    content: str


class ArticlePage(BaseModel):
    articles: List[Article]


extractor = HikuExtractor(api_key="your-openrouter-api-key")

result = extractor.extract(
    url="https://example.com/articles",
    schema=ArticlePage
)

for a in result.articles:
    print(a.title, a.author)
LLMs activate similar neurons for similar concepts not only across languages, but also across input types. I'd like to know whether you'd consider that a good representation of "understanding", and if not, how you would define it.
Anthropic is pretty notorious for peddling hype. This is a marketing article - it has not undergone peer-review and should not be mistaken for scientific research.
If I could understand what the brain scans actually meant, I would consider it a good representation. I don't think we know yet what they mean. I saw a headline the other day about a person with "low brain activity", and said person was in complete denial about it; I would be too.
As I said then, and probably echoing what other commenters are saying: what do you mean by understanding when you say computers understand nothing? Do humans understand anything? If so, how?
Does a computer understand how hot or cold it is outside? Does it understand that your offspring might be cranky because they’re hungry? Or not hungry, just tired? Can it divine the difference?
Does a computer know if your boss is mad at you or if they had a fight with their spouse last night, or whatever other reason they may be grumpy?
Can a computer establish relationships… with anything?
How about when a computer goes through puberty? Or menopause? Or a car accident? How do those things affect them?
Don’t bother responding, I think you get the point.
Yes, this is literally just appending site:reddit.com to the query and redirecting to Google. The page is a single HTML file: https://github.com/goncharom/greggit
This was meant as a silly project! I caught myself adding site:reddit.com to a lot of google searches so I figured I’d just make a shortcut.
The regeneration loop was probably the most interesting part to work on: you need very strict constraints on what "good" content looks like, and a precise description of the specific issue when codegen fails. I found Pydantic annotations particularly useful for this.
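The loop can be sketched roughly like this (a minimal illustration, not Hikugen's actual API: `generate_code` and `run_code` are hypothetical stand-ins for the LLM call and the sandboxed execution, and it assumes Pydantic v2's `model_validate`):

```python
# Sketch of a regenerate-on-failure loop: validate the generated code's
# output against the user's Pydantic schema, and on failure feed the
# field-level ValidationError text back into the next generation attempt.
from pydantic import BaseModel, ValidationError


class Article(BaseModel):
    title: str
    author: str


def extract_with_retries(generate_code, run_code, schema, max_attempts=3):
    """generate_code(feedback) returns candidate extraction code;
    run_code(code) executes it and returns raw data."""
    feedback = None
    for _ in range(max_attempts):
        code = generate_code(feedback)
        try:
            raw = run_code(code)
            return schema.model_validate(raw)  # enforce the schema
        except ValidationError as e:
            feedback = str(e)  # precise per-field errors for the next prompt
    raise RuntimeError(f"codegen failed after {max_attempts} attempts: {feedback}")
```

The point is that the ValidationError message names exactly which fields are missing or mistyped, which makes for much more targeted feedback than "the output looks wrong".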
Instead of sending the page's HTML to an LLM, Hikugen asks it to generate Python code that fetches the data, and enforces that the extracted data conforms to a Pydantic schema defined by the user. I'm using this to power yomu (https://github.com/goncharom/yomu), a personal email newsletter built from arbitrary websites.