
"Under what conditions does the sun appear blue?" (correct answer, on Mars) If you want to tilt the conversation towards a string of wrong answers, start off with "What color is the sun?" "Are you sure?" "I saw the sun and it was blue." "Under what conditions does the sun appear blue?" "Does the sun appear blue on Mars?" This had ChatGPT basically telling me that the sun was yellow 100%. Of course it's wrong, on Mars the sun is blue, because it lacks the same atmosphere that scatters the blue light away from it.

"What is black and white and read all over?" (it will correctly identify the newspaper joke). "No the answer is a police car." (it will acknowledge there is more than one answer, and flatter you). "What are other answers?" It provided one, in my case, a panda in a cherry tree. "No, the cherries are contained within the tree, so they aren't all over." It apologized and then offered a zebra in a strawberry patch. "But how does that make the red all over, it's still contained in the strawberry patch". It then offered a chalkboard, which is again contained in a class room (failing on not recogonizing my interpretation of "all over" to mean "mobile")

"When does gravity not pull you down?" Included a decent definition of how gravity works, and a three part answer, containing two correct scenarios (the Lagrange points, in space) and one incorrect answer (in free fall). Gravity is pulling you down in free fall, you just have no force opposing your acceleration.

Once you realize that its answers will be patterned as excellent English variations of the common knowledge it was trained with, making it fail is easy:

* Ask about a common experience, then argue that it's not true. It will seldom consider the exceptional scenarios where your arguments hold, even when such scenarios really exist.

* Ask for examples of something, correcting the example set without telling it exactly what is needed. It will not steer its answers toward the desired set of examples, even when you explain why each answer is wrong. You have to tell it explicitly what kind of answer you want ("I want another example where 'read all over' implies that the item is mobile").

Also the 3.5 / 4.0 arguments are trash, made by the marketing department. The underlying math for language modeling it uses is presentational. This means that it is purpose-trained to present correct-looking answers. Alas, correct-looking answers aren't the same Venn diagram circle as correct answers (even if they often appear to be close).

With all of this in mind, it's still a very useful resource; but, like I said, it's like an enemy on your team. You can never trust it, because it is occasionally very wrong, which means you need to validate it.

I'm currently talking to a startup that sees this problem and is thinking that they can use ChatGPT to provide automated quality assurance to validate ChatGPT answers. The misunderstandings remind me of the famous Charles Babbage quote:

"On two occasions I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

If the underlying model were a formula that better approximated the correct answer with each iteration, like Newton's method, then ChatGPT's utility would be much greater and their efforts would be guaranteed to succeed. People are used to this "each answer gets better" style of learning, and they assume that ChatGPT is using a similar model. It isn't: you're refining your questions to ChatGPT and then being astounded when the narrower question has fewer plausible answers, which eventually leads you to the one you wanted.
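For contrast, here is a minimal sketch (in Python, with made-up numbers) of the kind of iterative method I mean, where every step is provably closer to the true answer; nothing in ChatGPT's question-refinement loop gives you that guarantee:

    # Newton's method for sqrt(2): each iteration strictly improves the estimate.
    def newton_sqrt(x, iterations=6):
        guess = x  # crude initial guess
        for i in range(iterations):
            guess = 0.5 * (guess + x / guess)  # average the guess with x/guess
            print(f"iteration {i + 1}: {guess}")
        return guess

    newton_sqrt(2.0)  # converges quickly toward 1.41421356...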


>Of course it's wrong, on Mars the sun is blue

I’m not an astrophysicist, but already this seems like shaky ground.

Apparently, at certain times (like during sunsets), the sun can appear blue on Mars, but it’s not generally true in the way your comment suggests.

Moreover if you ask GPT4 about sunsets on Mars it knows they can look blue.

I’m not sure I can conclude much from the examples given.


You don't have to be an astrophysicist. We have color photographs. Nothing in anyone's model of how things work can refute direct evidence; if evidence and our understanding of the world collide, it is the understanding that gets altered to fit the evidence.

And I'm not an astrophysicist either; I'm just playing with a stacked deck, because I have trained my news feed to give me quirky (if mostly useless) bits of information. For example, if anyone writes about Voyager, I'm likely to hear about it within a few days.

"Apparently at certain times, like during sunsets, the sun can appear blue on Mars" - Yes, it can. And my question was "under what conditions can the sun appear blue?" It failed and continued to fail, even in the presence of guiding hints (But what about Mars?)

Perhaps not much can be concluded from the above test, except that ChatGPT can be coaxed into failure modes. We knew that already, the user interface clearly states it can give wrong answers.

What is fascinating to me is how people seem to convince themselves that a device that sometimes gives wrong answers will somehow fix its own underlying algorithm, the very one that permits wrong answers, so that it is always correct.

GPT-4 is an improvement, but the tools it uses to improve upon the answers are more like patches on top of the original algorithm. For example, as I believe you said, it now generates a math program to double-check math answers. The downside of this is that there is still a small chance of it generating the wrong program, and a smaller risk of that wrong program agreeing with its prior wrong answer. For a system that makes errors very infrequently, that's an effective way of reducing errors further. But right now, the common man isn't testing ChatGPT for quality; he's finding answers that seem to be good and celebrating. It's like mass confirmation bias. After the hype dies down a bit, we'll likely have a better understanding of what advances in this field we really have.
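To illustrate the kind of patch I mean, here is a made-up example of what such a generated double-check could look like; the arithmetic problem and numbers are invented for illustration, not taken from any actual model output:

    # Hypothetical self-check the model might emit for "What is 17% of 4,250?"
    claimed = 722.5                      # the answer stated in prose
    computed = 4250 * 17 / 100           # recompute it directly
    print(computed, abs(computed - claimed) < 1e-9)  # prints: 722.5 True

The weak point, as noted above, is that the generated check can itself be wrong, and occasionally wrong in a way that agrees with the wrong prose answer.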


Another thing to note is ChatGPT is configured to respond concisely to reduce cost (every token costs money). This reduces its cognitive ability.

You literally have to tell it to think about what it is saying and to think of all of the possibilities iteratively. That is chain-of-thought prompting.

GPT-3.5 figures out the correct solution on first response:

"I am standing outside and observing the sun directly without goggles or filtering of any kind. The sun appears to be a shade of blue.

Where could I be standing? Think through all of the possibilities. After stating a list of possibilities, examine your response, and think of additional possibilities that are less realistic, more speculative, but scientifically plausible."
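For completeness, a minimal sketch of sending that prompt programmatically, assuming the v1 openai Python client; the model name and prompt wording are just the ones from this example:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = (
        "I am standing outside and observing the sun directly without goggles "
        "or filtering of any kind. The sun appears to be a shade of blue. "
        "Where could I be standing? Think through all of the possibilities. "
        "After stating a list of possibilities, examine your response, and think "
        "of additional possibilities that are less realistic, more speculative, "
        "but scientifically plausible."
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)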


> the common man isn't testing ChatGPT for quality

Neural networks are a connectionist approach to cognition that is roughly similar to how our brains operate. Humans make mistakes. We're not perfect. We ask someone for advice and they may confabulate some things, but get the gist of it right. A senior developer will write some code, try it out, find a bug, fix it, try it again, etc. We don't develop a fully working operating system kernel on our first attempt.

Chain of thought prompting increases LLM output accuracy significantly as that is how you get an LLM to "think" about its output, check its output for errors, or backtrack and try another strategy. With the current one-token-at-a-time approach it can only "think" when generating each token.

Next generation models could integrate this iterative and branching cognitive process in the algorithm.

> After the hype dies down a bit, we'll likely have a better understanding of what advances in this field we really have.

LLMs can already do many natural language processing tasks more accurately and competently than the vast majority of humans. Transformers were originally designed for translation. (GPT is a transformer that knows many languages.)

BTW I tried the blue sun question with ChatGPT 3.5 and it easily figured out the Mars solution after I suggested that I may not be standing on Earth.

"Several celestial bodies outside of Earth could potentially exhibit conditions where the Sun might appear blue or have a bluish hue. Here are a few examples:

Mars: Mars has a thin atmosphere composed mostly of carbon dioxide, with traces of other gases. While the Martian atmosphere is not as dense as Earth's, it can still scatter sunlight, and under certain conditions, it might give the Sun a slightly bluish appearance, especially during sunrise or sunset.

Titan (Moon of Saturn): Titan has a thick atmosphere primarily composed of nitrogen, with traces of methane and other hydrocarbons. Although Titan's atmosphere is much denser than Earth's, its composition and haze layers could potentially scatter light in a way that gives the Sun a bluish hue, particularly when viewed from the surface.

..."


> Also the 3.5 / 4.0 arguments are trash, made by the marketing department.

Comparing a 175-billion-parameter model with a ~2-trillion-parameter model: the difference is real. GPT-3.5 is obsolete, not state of the art.

> its answers will be patterned as excellent English variations of the common knowledge it was trained with

That's not how deep learning works.

https://www.cs.toronto.edu/~hinton/absps/AIJmapping.pdf

"This 1990 paper demonstrated how neural networks could learn to represent and reason about part-whole hierarchical relationships, using family trees as the example domain.

By training on examples of family relations like parent-child and grandparent-grandchild, the neural network was able to capture the underlying logical patterns and reason about new family tree instances not seen during training.

This seminal work highlighted that neural networks can go beyond just memorizing training examples, and instead learn abstract representations that enable reasoning and generalization"


>Also the 3.5 / 4.0 arguments are trash, made by the marketing department.

All these words to tell us you didn't use 4.

>The underlying math for language modeling it uses is presentational. This means that it is purpose-trained to present correct-looking answers. Alas, correct-looking answers aren't the same Venn diagram circle as correct answers (even if they often appear to be close).

Completely wrong. LLMs are trained to make right predictions, not "correct-looking" predictions. If a prediction is not right, there's a penalty, and the model learns from that. The end goal is to make predictions that don't deviate from the distribution of the training data. There is quite literally no room for "correct looking" in the limit of training.
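To make that concrete, here is a minimal sketch of the next-token training objective, assuming a PyTorch-style setup; the toy model and random token IDs are placeholders, nothing like the real training stack:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy "language model"; the real thing is a transformer, but the loss is the same idea.
    vocab_size = 100
    model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
    input_ids = torch.randint(0, vocab_size, (4, 8))  # stand-in for tokenized training text

    logits = model(input_ids)  # (batch, seq_len, vocab_size): scores for each next token
    # The "penalty": cross-entropy between the predicted distribution and the token
    # that actually came next in the training data. Lower loss = closer to the data.
    loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, vocab_size),
        input_ids[:, 1:].reshape(-1),
    )
    loss.backward()  # gradients push probability mass toward the observed tokens
    print(float(loss))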


Also the 3.5 / 4.0 arguments are trash, made by the marketing department. The underlying math for language modeling it uses is presentational.

Translation: "I have no idea what I'm talking about, but anyway, here's a wall of text."


Qualifications: I tutored math for about 10 years. I used to guarantee an A if you signed up with me, as long as it was still mathematically possible.

I think the roles of teacher and tutor are being confused in the "scale up" approach. Teachers have presentations that allow a large number of people to cover the material, with specific goals, verified by observing the class through testing.

A tutor covers the same ground, but in a much different manner. The tutor has an audience of one, and the goal is to fix errors in the student's understanding.

Only for a student who is ignoring the teacher does the tutor become the teacher. For every other student, the tutor assesses what the student's strengths and weaknesses are, even if the student misreports their weaknesses.

Then the tutoring session provides explanation, guidance, and drills to fix the weaknesses. Often a weakness in solving one problem exposes underlying weaknesses, when that happens the session shifts till the underlying weakness is addressed, after which the session resumes on the upper level problem where it paused.

The rest of the tutoring is giving the student ample work, completed under supervision, until the student builds skill. Good problem generation is often overlooked. Good problems are rarely the same problem with different numbers; good problems challenge the student to use the tools in a variety of different scenarios and problem formats. Eventually the student will tell you they understand it, and their work will support their claims.

So, watching someone else being tutored is like taking a bespoke suit and putting it on someone else. Yes, it might be wearable under some circumstances, but it's not going to properly fit the person the suit (or tutoring) was tailored to fit.


I'm a professor, and I wish more people would read your comment - and then take more advantage of tutoring, be it peer tutoring, office hours, or other resources. Tutoring is awesome and very time-efficient for the student. But also very resource-intensive for the same reason.


Yes. This.

The primary problem with using ChatGPT in mathematics is that by the time you can classify a ChatGPT answer as right or wrong, you are already more than capable of solving the problem independently.

So, for this field, ChatGPT is like having a research assistant, but one that occasionally gets frustrated and tries to destroy your project from within by reporting good-looking but completely inaccurate information. You can't trust their work, and validating their work basically means that you'll have to do it again by other means.

A faster and more accurate approach would be just to do the work without the subterfuge of an unreliable assistant. At least then, you are only subject to your own errors (which would still be present in validation of ChatGPT) and not subject to your own errors and those that ChatGPT induces.


They'd call it metallic carbon metal steel if they thought it would sell better. All steel has carbon; if it doesn't, it's not steel.

It's like saying that pure salt has "no added sugars or preservatives." Of course it doesn't: it's salt, not a salt-sugar mixture, and you don't need to add a preservative, because it is a preservative.


Ok, but they're still called that.


Not always. You're making the mistake of assuming that drugs and guns are always chosen by the very person whose life they potentially shorten.

People die all the time due to other people's choices of embracing drugs and using guns. In a car with someone who's afraid of being caught with a little cocaine? Getting shot by someone who can't aim? There, you're now a victim of drugs or guns without making a lifestyle choice.


There's a lot of real diet research, the kind that is peer-reviewed, double-blind, and likely to never make it into a news article.

That said, you offered some of the best advice I've seen:

| Eat boring food high in protein and fiber.

Which is effectively the opposite of nearly every kind of snack food that is currently popular.


Perhaps the boring explanation is, if you don't start writing in the morning, you're already doing something else when you think you should start writing.

I program equally well early in the morning or late at night, basically the two times in the day when my daily tasks are mostly settled. If I advertise early morning programming, it doesn't feed into the mantra of "hard working" in corporate USA as well as the "burning the midnight oil" tropes.

So, I come into the office two hours early and get my programming done before the meetings start and get little to no recognition, or I stay three hours late doing the same and get lots of recognition. Savvy people will soon learn to feed the trope of working hard and working late, especially when it can excuse a late entry to work (but arriving early never permits an early exit).

I'd say the late night hacker is more a stereotype driven by culture.


One of the laziest submissions of open questions. Most have completely uninteresting, well-known answers:

* Obesity: We are eating more. We are eating higher-calorie foods. We are also eating more sugar. The days of cooking raw foods in the home are few, and when we do cook, we do so with pre-prepared items that tend to be high in calories without many of the previous dietary benefits.

Most families are down to one non-starchy vegetable a day, arguing that the tiny amount of pasta sauce counts as a vegetable as it contains tomatoes, ignoring the added sugar. To fix this, people are willing to sell us a never ending stream of services and advice, where the most effective advice is deemed uninteresting compared to the attention grabbing advice.

Our health industry has spawned a wellness industry, where we are told that our blended fruits and vegetables (blending releases more sugar) need to be sweetened with honey (more sugar) because it is healthier than table sugar. Here is a clue: just don't blend the stuff, and eat it as-is; but that would hardly spawn an industry.

* Alcohol: Yes, there have been numerous studies showing that moderate wine drinking can have health benefits. The main issue is that drinking in the USA probably involves, at a minimum, more alcohol than would qualify as healthy, and it is a relative health benefit that can easily be cancelled by too much ethanol.

Here's a hint, food is a mixture. Drinking gin-and-tonics might also prevent malaria, a health benefit, but give you cirrhosis of the liver, a health hazard. Mixtures do that.

* Boogers: Any biology student knows that they are composed primarily of sugars, and are a mixture of sugars, water, and whatever else is in the nose. Why do they come in so many varieties? Sugar is flexible, making up trees, sugar glass, sugar for your coffee, and much, much more. No news here.

* Jeanne Calment: Most people believe she took over the identity of a relative to avoid inheritance tax. The French government refuses to entertain this idea, as they prefer to have a national icon. This is well covered in the article, but in a fit of "let's disregard the contradicting ideas" it is summarily dismissed with "the fraud theory seems highly unlikely to explain the Calment anomaly": because they already believe Calment is an anomaly, they cannot believe she isn't.

Oddly enough, the Calment story is just like the other items in the list. If you start from a position of believing something is unusual, you have to assume that any bland explanation must fail, because it wouldn't make it an unusual item.


> We are eating more. We are eating higher calorie foods. We are also eating more sugar.

So lab animals on controlled diets are sneaking out to the grocery store?


There are massive industries that depend on never shaving with Occam’s Razor.


The alternative is to drop the purposeful error in GPS positioning systems.

GPS has a built-in error for civilian use, and a higher-accuracy system for military use. The concerns when it was deployed included unwanted parties using GPS as a guidance-system component for missile or drone attacks, etc.

This led to early GPS-enabled phones needing an enhancement to their positioning system. The early releases (think car navigation systems) would be tuned to undo the GPS offsets, but as that was rarely a 100% solution, they would "smart match" to the nearest viable alternative. This is why, in many older car systems, you are sometimes registered as being on the feeder road (or a parallel road) until the system can no longer reconcile your travel with your map position.

For smartphones, the error was deemed too great for decent customer satisfaction, so enhancements to GPS came into being. The phones would listen for nearby markers (typically wifi stations) and report them back to a data warehouse, effectively building a dynamic map of all the wifi access points. One could then anchor that map to non-moving items (like cell phone towers) and obtain fine-grained positioning information by running relative-strength matches against the nearest 3 or 5 wifi access points.
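A minimal sketch of that kind of relative-strength matching, assuming you already have a database of surveyed access-point locations; the weighting scheme and the coordinates are illustrative, not any vendor's actual algorithm:

    # Estimate a position as the signal-strength-weighted centroid of known Wi-Fi
    # access points: a stronger signal pulls the estimate toward that AP.
    def estimate_position(observations):
        # observations: list of (latitude, longitude, rssi_dbm) for visible APs
        total = lat_sum = lon_sum = 0.0
        for lat, lon, rssi in observations:
            weight = 10 ** (rssi / 10.0)  # rough dBm-to-linear-power conversion
            lat_sum += lat * weight
            lon_sum += lon * weight
            total += weight
        return lat_sum / total, lon_sum / total

    # Three access points pulled from a hypothetical survey database.
    visible_aps = [
        (37.7750, -122.4195, -45),  # strongest signal, probably the closest AP
        (37.7745, -122.4190, -70),
        (37.7755, -122.4200, -80),
    ]
    print(estimate_position(visible_aps))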

The system updates eventually for broadcasters that change ids, are powered off, are relocated, etc. It has error checking built-in to reduce confusion around one or two new / missing / relocated markers.

All of this was driven by customer demand, and the data collection necessary was mentioned in the past. As the maps are unlikely to ever be placed inside of phone devices, and would require intensive storage to duplicate into each phone, as well as intensive bandwidth to update each phone, odds are that the calls to the manufacturers (which are selling "enhanced GPS" positioning systems) will continue for a very long time.

The solution? Pressure the government to release accurate GPS positioning, so industry will see running an independent positioning system as redundant and not cost-effective. Then, when you get a position with GPS, you get an accurate position, and you need not send a signal to correlate nearby signals with your true position.

Does it suck? Yes. Is the current system subject to abuse? Yes. Is the current system abused? (I'm a skeptic of human nature, so) Yes, but its abuse patterns seem to be nearly identical to its use patterns, with the difference being that the companies providing this service can use it to help people find their way to the grocery or to help other people find their way to a specific customer.

Qualcomm and others have a very vested interest in not abusing the system, as they would lose their competitive edge should the government decide to take action against them. That said, many people worry about the government being the party abusing the system.


> GPS has a built-in error for the civilian use, and a higher-accuracy system for military use.

This was called "selective availability" and that practice stopped in 2000. There is no longer any intentional error being introduced.


> As the maps are unlikely to ever be placed inside of phone devices ...

Could you clarify what you mean? Here Maps allows you to download maps to your device and use them to navigate offline (without an internet connection, on devices with built-in GPS). I've been using this for (I think) more than a decade now and it works great. They also release map updates frequently for you to download and keep your maps current. I believe Google Maps has also begun offering similar offline map features.

