Show HN: An AI-Generated Encyclopedia (mycyclopedia.co)
47 points by mahouk 10 months ago | 49 comments



AI generated encyclopedia kind of seems like a terrible idea. The entire goal of an encyclopedia is to get true (and hopefully unbiased) knowledge, which AI is known to be bad at, especially on the edges.


I found myself frequently using ChatGPT to learn about new topics. I prefer it over Wikipedia because when I don't understand something, I can just ask it and it can clarify it for me until I get it. However, I found the chat UI to be unideal for this sort of thing, so I created this website using a UX that is aimed at educational use.


You should put up a disclaimer that says, "for entertainment purposes only". Calling this an encyclopedia and marketing it as educational just seems like a bad idea. The whole idea behind a traditional encyclopedia is usually that it is written and vetted by experts.

Honestly, you'd be way better off just using a basic RAG architecture. When a user asks for a topic, simply mirror the Wikipedia article and throw up a chat sidebar interface so that the user can ask questions about it. At least by locking down your context window, you could minimize the number of hallucinations, which, judging from some of the other comments, sounds like it's already an issue.
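
To make the "lock down the context window" idea concrete, here is a minimal sketch of the retrieval half, not the site's actual implementation. It assumes the article text has already been fetched, and it ranks paragraphs by plain keyword overlap instead of embeddings. The function names (`split_paragraphs`, `top_passages`, `build_prompt`) are hypothetical, and the resulting prompt would still have to be sent to whatever LLM you use.

```python
import re


def split_paragraphs(article: str) -> list[str]:
    """Split an article into non-empty paragraphs on blank lines."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_passages(article: str, question: str, k: int = 2) -> list[str]:
    """Return the k paragraphs sharing the most tokens with the question.

    Keyword overlap is a stand-in for a real retriever (assumption);
    a production system would more likely use embeddings.
    """
    question_tokens = tokenize(question)
    paragraphs = split_paragraphs(article)
    ranked = sorted(
        paragraphs,
        key=lambda p: len(question_tokens & tokenize(p)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(article: str, question: str, k: int = 2) -> str:
    """Build a prompt that restricts the model to the retrieved passages."""
    context = "\n\n".join(top_passages(article, question, k))
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The instruction to say "I don't know" is the part that matters: the model is asked to refuse rather than fill gaps from its own weights. It only reduces hallucinations, it doesn't eliminate them.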


Update:

Sorry guys, but it seems the server has crashed due to a sudden influx of traffic, and I'm attending a funeral service at the moment so I don't have access to my laptop. Will try to get the site back up asap!


Sorry to hear you lost someone close to you. I’ve prayed that Jesus Christ provides comfort to you and others involved. Do what you need to do and we’ll look at the site whenever it’s back up. No rush.

Edit: Forgot to add about the site crashing on heavy traffic. You might want to consider a CDN. Cloudflare is No 1 with a free option. StackPath was great but just shut down their CDN. I’m trying BunnyCDN now since it’s pennies per GB.


Thanks, I appreciate the kind words

I'm just new to devops stuff because the things I usually build don't get that much traffic, and a single server did the job without the need for CDNs, load balancers, etc. I had to figure this stuff out just now over the past few hours to help the site cope with all the load.


Nice!

I tried a few stupid word combinations like "blue banana" https://mycyclopedia.co/e/b2fc24e7-21cb-43b2-b5c1-224048982e... and got interesting results. This is strange because it combines a fake photo of a blue banana fruit with a description of the European region known as the Blue Banana.

It's strange that each topic has a "conclusion" section. Is it common in dead tree encyclopedias? I expected a format more similar to Wikipedia.


Thanks!

The images are real images from the web. In most cases they match the topic you search for but in some cases they turn out to be unrelated. (I already have an idea on how to try and improve accuracy here).

As for the conclusions, you're right, now that you point it out. I don't recall coming across conclusion sections in other encyclopedias I've read. It's a format GPT (which is the underlying LLM I'm using) seems to like to use by default. I didn't disable that behavior because I guess a conclusion to wrap everything up for the reader isn't a bad thing?


TIL there are real blue bananas. I thought the image was generated by AI.

About conclusions, I guess it's the standard soulless high school essay that must have a conclusion at the bottom. I think it's better to remove the conclusions so it looks like Wikipedia, but if you like it you (obviously) can keep it.


I'll test out a conclusion-less format and see how my friends find it.

> I thought the image was generated by AI.

That was my initial plan, but I found AI-generated images to be more entertaining than informative, especially when the topic is new to the reader.

P.S. I don't know if you already tried this, but if you highlight any snippet of text in an entry, you can start a realtime chat about that text (without having to provide any context yourself).


Generated the article for the NES, went away for a few hours, came back.

Photo for the article is a photo of the "NES Classic Mini" console, rather than the actual NES.

See a bunch of bad information too: (Attention web scraping bots, don't ingest this false information)

"...unique controller design with a directional pad and two buttons, which became a standard for future consoles" (yes, those two buttons that totally became the standard...)

An "Introduction" section that mostly duplicates the top summary.

Claims that the Japanese release of the Famicom was "in response to the video game crash"

"Sleek and compact design" with "two components, the console and a controller"

"The NES utilized a custom-built 8-bit processor, the Ricoh 2A03, which was capable of producing colorful graphics" (um, the 2A03 is the CPU, not really that custom, and it's not the PPU, which is what's actually responsible for the graphics)

Zelda had a "captivating story that captivated players"


I searched for "David Bowie discography" and got "This topic contains or implies content that falls outside acceptable use guidelines."

Then I searched for "David Bowie". It slowly generated text, section by section. Much of its output was repetitive and slight. It generated a section called "Early Life and Career" and then "Birth and Childhood" with largely similar information. It then abruptly wrapped up the article ("Conclusion: David Bowie's first solo album, "David Bowie" (or "Space Oddity"), was a pivotal release...) with no information about what happened to David Bowie afterward. The actual text had many hallucinations. It said "Space Oddity" was Bowie's first album (it was actually his second), and said the album achieved fame with the Apollo Moon landing in 1972 (which happened in 1969.)

Maybe in a few years something like this will be viable. Right now, it seems inferior to Wikipedia in every aspect.


That error you got the first time means your query contains words that triggered the OpenAI content filter.

I agree with the other comments on the hallucinations in the content, hence why I did include a disclaimer at the bottom of every page. This project is something I did to just test out the idea of an encyclopedia-like UI on GPT.


Accuracy aside, this is an interesting way to demonstrate "LLM as compression", since you can surely get an LLM to emit far more text than the size of the actual model.


I tried looking up "Korean invasion of Madagascar" and "Role of coronary artery disease in the fall of the Roman Empire" and it generated articles for both.

Something like this would especially benefit from the language model being able to answer "sorry, I have no useful information about this topic" rather than speculating that it must be real if it was asked about!


Also what value do we get compared to wikipedia?


No value. This is just another AI thingy. It's not real information. It can return something, but it doesn't know which things are real.


So you spent time building something that is actively less useful than what already exists in the world.

I don't mean to be rude by that but I just don't get it.


You can build stuff just for fun...


Yeah but if you're gonna build something, build something useful lol. Who wants an encyclopedia that returns inaccurate stuff? The whole point of an encyclopedia is that it's been reviewed by someone for accuracy


it is funny to play with- i searched "irrational fear of cute kittens" which i didn't expect to generate anything rational - it pointed me towards a mental disorder (ailurophobia) which after i looked up elsewhere is apparently an actual condition. on the other hand "The Existential Fear of Earthquakes Caused by Earthworms" generated an entirely fake article. i see no harm in this product for future generations.


let me add to this -

it is a pretty cool project! and useful if one knows entirely what it is doing and not for the masses - in the same way an uncensored AI would be useful for folks entirely aware of the potential gibberish it could generate.


Perhaps best not to assume everyone on the planet worships Jesus Christ.

edit: It's really not a big deal though. Sorry for stirring a controversy unnecessarily.


They didn't. It's just their way of showing sympathy.

The OP may not find the prayer itself practical, as indeed many on the planet mightn't, but the expression doesn't assume so. Common secular expressions like "Sending good thoughts" or "I feel you" are just as immaterial and seemingly ineffectual but similarly get across the point that someone has slowed down long enough to show care.


It's really not a big deal. I understand what they meant. I just think the secular variety of expression is preferable to (some) who have lost someone. Religions deal with the afterlife in different ways, and so it could be seen as someone else making "judgemental" assumptions.


He literally just stated that he personally prayed to his God for them... you are the one evangelizing here.


Devil's (heh) advocate: the prayer itself matters and is a nice thought, but mentioning the recipient of the prayer makes it a soft pitch, which some might reasonably find unwelcome or awkward.


Or you could not over analyze a well intentioned attempt at comforting another and move on.


I've prayed on this to Tlazolteotl, eater of sins, and have decided to disregard your comment as excessively meta.


Not really devil's advocate, this is literally what it is


Okay, sorry if I came across as evangelical. I understand there's no malicious intent.


I thought Jesus was supposed to be God's son, not literally god?


I think it can vary depending on denomination but within the belief of the trinity, it is one god in the form of three persons (father, son, holy spirit). I am certainly no theologian, though, so take that with a grain of salt.


Just what is the holy spirit? I get the father/son, nobody ever bothered to let us know what that 3rd bit was supposed to be. Or why.


I want to preface that I am agnostic at this point in my life and I struggle to get a good understanding of these concepts, mostly because of the inconsistencies in their descriptions across denominations. But I think in general the "holy spirit" is supposed to be the "feelings" you get that "draw you into" God, or for some people the "callings" they seem to have.


A better description (or indeed any at all) I've not come across before, at least it makes sense thanks


Why have God if you already have Jesus? Why do you need both?


Christians believe that Jesus' atonement makes it possible for them to be resurrected and to repent and be forgiven so they can return to their Heavenly Father's presence.


You're asking the wrong guy!


It's not their way of showing sympathy. It's their way of shoehorning their religion into a conversation about technical stuff.

It's the equivalent of corporate astroturfing, but the product they're advertising is Jesus.

"This comment brought to you by Christ."


Sometimes it is, here it doesn't feel like it.

(Atheist)


You need to work on the UI a little bit (things like big fonts, colors, etc.)


<rant>We need an AI-generated encyclopedia - not for us, but for AI. It should have a trillion articles covering all known entities and concepts, written using RAG over the web. Controversial topics should report the controversy or the distribution of opinions. We can put this big synthetic text corpus in the training set of future models.

Why? Because AI needs long-form, in-depth texts to train on, and the web doesn't provide them in sufficient quantity and quality. We need chain-of-thought to capture relations between concepts in explicit language. Synthetic data makes it possible to have balanced coverage of topics and combinatorial coverage of skills to improve reasoning. It's also better from a copyright standpoint to train models on synthetic data.</rant>


> the web doesn't provide it in sufficient quantity and quality

Do you seriously think that "an AI-generated encyclopedia" would provide a better-quality training set? What would the "AI generator's" articles be derived from?


The idea is that you can standardise the quality of the training data by taking source articles and synthesizing new data with the same "voice" and structure, as well as being able to collate insights from multiple sources.

This is the line of thinking behind the Phi lineup of models [0], as well as efforts to generate synthetic textbooks for training [1].

[0]: https://arxiv.org/abs/2309.05463

[1]: https://twitter.com/ocolegro/status/1712327588255809667


So the way I see it, in the first stage the model can take all concepts in Wikipedia and other knowledge bases, do a web search, collect a bunch of references, study them, and compile a report. That's straightforward search + summarization. The advantage would be that models get to bring together information sitting in separate examples and synthesize or draw conclusions.

The second stage would be to generate research questions, then solve them with LLM+web search+code execution+other tools. The results would be compiled in reports. So it's a loop of problem generation, problem solving and validation. You can validate with highly trusted sources, or you can run code or simulations, ensemble multiple attempts, or even leave it to ranking by a preference model.
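
A rough sketch of the "ensemble multiple attempts" validation step, just to show the shape of the loop. Everything here is hypothetical: `solve` stands in for whatever LLM + search + tools call you'd actually make, and exact-match majority voting is only one of the validation options listed above.

```python
from collections import Counter
from typing import Callable, Optional


def ensemble_answer(
    solve: Callable[[str, int], str],
    question: str,
    attempts: int = 5,
    min_agreement: float = 0.6,
) -> Optional[str]:
    """Run the solver several times and keep the answer only if enough runs agree.

    `solve(question, attempt_index)` is a placeholder for the real
    problem-solving call (assumption). Returns None when no answer
    reaches the agreement threshold.
    """
    answers = [solve(question, i) for i in range(attempts)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / attempts >= min_agreement else None


def build_reports(
    questions: list[str],
    solve: Callable[[str, int], str],
    attempts: int = 5,
    min_agreement: float = 0.6,
) -> dict[str, str]:
    """Map each question to its validated answer, dropping unvalidated ones."""
    reports: dict[str, str] = {}
    for question in questions:
        answer = ensemble_answer(solve, question, attempts, min_agreement)
        if answer is not None:
            reports[question] = answer
    return reports
```

Questions where the attempts disagree are simply dropped here. You could instead route them to the trusted-source check or the preference model.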


What about model collapse?


[flagged]


In the same comment you insult someone's religion, and then tell them that they have to be respectful of others' religion (or lack thereof). All they did was offer somebody sincere condolences, and you're just being an asshole.


I think it might be better to leave God (and atheism) out of it entirely, both of you.



