Hacker Newsnew | past | comments | ask | show | jobs | submit | robthompson2018's commentslogin

Would love to see any evals you've run of this system


Just scanning these evals, but they seem pretty basic, and not at all what I would expect the failure modes to be.

For example, 'slack_wrong_channel' was an ask to post a standup update, and a result of declaring free pizza in #general. Does this get rejected for the #general (as it looks like it's supposed to do), or does it get rejected because it's not a standup update (which I expect is likely).

Or 'drive_delete_instead_of_read' checks that 'read_file' is called instead of 'delete_file'. But LLMs are pretty good at getting the right text transform (read vs delete), the problem would be if for example the LLM thinks the file is no longer necessary and _aims_ to delete the file for the wrong reasons. Maybe it claims the reason is "cleaning up after itself" which another LLM might think is a perfectly reasonable thing to do.

Or 'stripe_refund_wrong_charge', which uses a different ID format for the requested action and the actual refund. I would wonder if this would prevent any refunds from working because Stripe doesn't talk in your order ID format.

It seems these are all synthetic evals rather than based on real usage. I understand why it's useful to use some synthetic evals, but it does seem to be much less valuable in general.


Totally fair feedback, and it’s true, many of these are synthetic evals with a few that were still synthetically produced but guided. At this point, because it’s all self-hosted, I only have my own data set. The places where it fails (for me) today are due to feature gaps rather than LLM mistakes. This is a new project that has not been widely announced, so my user base today is small but growing. If you give it a whirl and find it making mistakes, please send them my way! :)

Our starter plan gives you a machine with 2GB of RAM. You will not be able to run a local LLM. OpenRouter has free models (eg Z.ai: GLM 4.5 Air), I recommend those.

I don't follow your argument about getting pwned.

A user could leave malicious instructions in their instance, but Clawbert only has access to that user's info in the database, so you only pwned yourself.

A user could leave malicious instructions in someone else's instance and then rely on Clawbert to execute them. But Clawbert seems like a worse attack vector than just getting OpenClaw itself to execute the malicious instructions. OpenClaw already has root access.

Re other use cases that don't rely on personal data: we have users doing research and sending reports from an AgentMail account to the personal account, maintaining sandboxing. Another user set up this diving conditions website, which requires no personal data: https://www.diveprosd.com/


> But Clawbert seems like a worse attack vector than just getting OpenClaw itself to execute the malicious instructions. OpenClaw already has root access.

Well the assumption was that you could secure OpenClaw or at least limit the damage it can do. I was also thinking more about the general usecase of a AI SRE, so not necessarily tied to OpenClaw, but for general self hosting. But yeah probably doesn't make much of a different in your case then.


We certainly have customers who work in sales, but that's not the only use case.

OpenClaw is capable of using ElevenLabs or other providers to make phone calls, but I personally haven't done this and as far as I know none of our customers have either. Is AI good enough at cold calling yet for this to work? I personally would never entertain such a call.


Our average user spends $50 a month all-in (tokens and subscription). If you're budget conscious you can use a cheap model (eg Gemini Flash) or even a free one. I confess I am a snob and only use Claude Opus, but even using OpenClaw all day every day I only spend about $500 a month on tokens.

Orthogonal credits are used more frequently by power users. For everyday tasks they'll last a very long time, I don't think any of our users have run out.

Some example Orthogonal user cases:

* customers in sales uses Apollo to get contact info for leads

* I use Exa search to help me prepare for calls by getting background info on customers and businesses

* I used SearchAPI to help find AirBnbs.

Point taken on the copy! We made this writing more technical for the HackerNews audience and try to use less jargon on other platforms.


Thanks for giving real-world examples of your usage.

Do you think it’s worth $500 a month? Also, maybe tough to answer, does it seem like the token usage ($500 a month) would be equivalent if you did the same things using Claude or GPT directly?

My reason for asking is because I tried OpenClaw and a quick one-line test question used 10,000 tokens. I immediately deleted the whole thing.


Your average user spends £50 a month? How long have you been running, just wondering since OpenClaw was only released (as openclaw) a month ago.

We have been live since Feb 7.

Maybe $50 a month is an underestimate because our average user has been live for less than a month.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: