Launch HN: CamelQA (YC W24) – AI that tests mobile apps
141 points by vercantez 6 months ago | 54 comments
Hey HN! We're camelQA (https://camelqa.com/). We’re building an AI agent that can automate mobile devices using computer vision. Our first use case is for mobile app QA. We convert natural language test cases into tests that run on real iOS and Android devices in our device farm.

Flaky UI tests suck. We want to create a solution where engineers don’t waste time maintaining fragile scripts.

camelQA uses a combination of accessibility element data and an in-house, vision-only RCNN object detection model paired with Google SigLIP for UI element classification (see a sample output here - https://camelqa.com/blog/sole-ui-element-detector.png). This way we're able to detect elements even when they have no accessibility data associated with them.
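
For illustration, a detect-then-classify pipeline along these lines could look like the sketch below, using an off-the-shelf torchvision detector and the public SigLIP checkpoint as stand-ins. The model choices, label set, and threshold are illustrative only (a production detector would be trained on UI screenshots rather than COCO), not the actual production models.

    # Illustrative only: a generic detector proposes boxes, SigLIP labels each crop.
    # A real UI detector would be fine-tuned on app screenshots, not COCO.
    import torch
    from PIL import Image
    from torchvision.models.detection import fasterrcnn_resnet50_fpn
    from torchvision.transforms.functional import to_tensor
    from transformers import AutoModel, AutoProcessor

    UI_LABELS = ["a button", "a text field", "an icon", "a checkbox", "a navigation tab"]

    detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
    siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

    def detect_ui_elements(screenshot_path, score_threshold=0.5):
        image = Image.open(screenshot_path).convert("RGB")
        with torch.no_grad():
            detections = detector([to_tensor(image)])[0]
        elements = []
        for box, score in zip(detections["boxes"], detections["scores"]):
            if score < score_threshold:
                continue
            x1, y1, x2, y2 = (int(v) for v in box.tolist())
            crop = image.crop((x1, y1, x2, y2))
            inputs = siglip_proc(text=UI_LABELS, images=crop,
                                 padding="max_length", return_tensors="pt")
            with torch.no_grad():
                logits = siglip(**inputs).logits_per_image  # (1, len(UI_LABELS))
            elements.append({"bbox": (x1, y1, x2, y2),
                             "label": UI_LABELS[int(logits.argmax())]})
        return elements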

Under the hood, the agent uses Appium to interface with the device. We use GPT-4V to reason at a high level and GPT-3.5 to execute the high-level actions. Check out a GIF of our playground here (https://camelqa.com/blog/sole-signup.gif).
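
For a rough idea of that reason/execute split, one iteration of such a loop could look like the sketch below. The prompts, model names, and JSON action schema are illustrative guesses rather than the production agent, and `elements` is assumed to come from a detector like the one sketched above.

    # Sketch of one iteration: GPT-4V plans in plain English, a cheaper model
    # grounds the plan against detected elements, and Appium executes on device.
    import json
    from openai import OpenAI
    from appium import webdriver
    from appium.options.ios import XCUITestOptions

    client = OpenAI()
    driver = webdriver.Remote("http://localhost:4723", options=XCUITestOptions())

    def plan_next_step(task, screenshot_b64):
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": [
                {"type": "text",
                 "text": f"Task: {task}. Describe the single next UI action to take."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            ]}],
        )
        return resp.choices[0].message.content

    def to_concrete_action(step, elements):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content":
                f"Step: {step}\nElements: {json.dumps(elements)}\n"
                'Answer with JSON only: {"action": "tap", "bbox": [x1, y1, x2, y2]}'}],
        )
        return json.loads(resp.choices[0].message.content)  # naive; real code would validate

    step = plan_next_step("Bookmark the wiki page for Ilya Sutskever",
                          driver.get_screenshot_as_base64())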

Since we’re vision based, we don’t need access to your source code and we work across all app types - SwiftUI and UIKit, React Native, Flutter.

We built a demo for HN where you can use our model to control Wikipedia on a simulated iPhone. Check that out here (https://demo.camelqa.com/). Try giving it a task like “Bookmark the wiki page for Ilya Sutskever” or “Find San Francisco in the Places tab”. We only have 5 simulators running so there may be a wait. You get 5 minutes once you enter your first command.

If you want to see what our front end looks like, we made an account with some test runs. Use this login (Username: hackerNews Password: 1337hackerNews!) to view our sandboxed HN account (https://dash.camelqa.com/login).

Last year we left our corporate jobs to build in the AI space. It felt like we were endlessly testing our apps, even for minor updates, and we still shipped a bug that caused one of our apps to crash on subscribe (the app in question - https://apps.apple.com/us/app/tldr-ai-summarizer/id644930471...). That was the catalyst for camelQA.

We’re excited to hear what you all think!




As someone who has worked on a mobile dev team, I can only applaud your effort! :)

I have a few questions:

As with all new AI-based RPA & testing frameworks (there are quite a few in YC), I'm curious about the costs and performance. Let's say I want to run a few smoke tests (5-10 end-to-end scenarios) on my app across multiple iOS and Android devices with different screen sizes and OS versions before going into production.

What would it cost, and how long would it take to complete the tests?

Do you already have customers running such real-world use cases with it?


Good questions. Execution cost is indeed higher than with traditional test automation scripts, but much lower than the human cost of writing and maintaining those scripts. We're starting at $500/month, and our plans go up from there depending on how many devices you want to test across. We do have customers running across multiple devices and OS versions today.


Appium project starter here. Congrats on the launch! If you ever want to talk shop, let me know!

I'm glad to see more vision-first, AI-powered testing tools in the world.


Wow thanks! We're only able to build this because of tools like Appium. Thanks for all of your contributions to this space.


Props for everything you do and share openly, Jason, and for your robots!


We'd love to! What's the best way for us to get in touch?


Contact info should be visible in my bio here. Otherwise, a DM on Twitter works, too. (@hugs)


Very cool! I don't have this pain point currently, but I can absolutely see the utility. I like the built-in demo tool (although it sadly means you have no need for DemoTime lol).

The demo.camelqa.com page needs some styling. I would invest a few minutes here. Maybe a loading spinner too if you're expecting 15-second latency.

Technically, is this doing clever things with markup, or literally just feeding the image into a multimodal LLM and getting function calls in response?


Thanks for the feedback! We'll add some styling to the demo page. We process the image with an object detection model and a classification model, and we also use accessibility element data to get a better understanding of what on the screen is interactive.


Why don't you also use GPT-4V for that part?


GPT-4V is great for reasoning about what is on the screen. However, it struggles with precision. For example, it is not able to specify the coordinates to tap when it decides to tap an icon. That's where the object detection and accessibility elements help. We can precisely locate interactive elements.
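
For illustration, the "precisely locate and tap" step can be as simple as tapping the center of a detected bounding box via Appium's W3C touch actions. This is a sketch rather than the production implementation; note that on iOS, screenshot pixels usually need dividing by the device scale factor to get logical tap coordinates.

    # Tap the center of a detected element's bounding box with a W3C touch pointer.
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.actions import interaction
    from selenium.webdriver.common.actions.action_builder import ActionBuilder
    from selenium.webdriver.common.actions.pointer_input import PointerInput

    def tap_bbox_center(driver, bbox, scale=1.0):
        x1, y1, x2, y2 = bbox
        # Convert screenshot pixels to logical points (e.g. scale=3.0 on a 3x iPhone).
        x, y = int((x1 + x2) / 2 / scale), int((y1 + y2) / 2 / scale)
        actions = ActionChains(driver)
        actions.w3c_actions = ActionBuilder(
            driver, mouse=PointerInput(interaction.POINTER_TOUCH, "touch"))
        actions.w3c_actions.pointer_action.move_to_location(x, y)
        actions.w3c_actions.pointer_action.pointer_down()
        actions.w3c_actions.pointer_action.pause(0.1)
        actions.w3c_actions.pointer_action.release()
        actions.perform()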


Have you tried putting a pixel grid over the image with labelled guidelines every 100px?

That was one thing I never got around to testing with DemoTime, but I was always curious about it.

Anyway, sorry for the nitpicks; this is a nice product. Congratulations on the launch.

Always good to see substantial tech.


Thanks! Yes, we experimented with that! I think that because GPT sees images in patches, it has a hard time with absolute positioning, but that's just a guess.
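
For anyone who wants to try the grid idea themselves, here's a minimal Pillow sketch (grid spacing and colors are arbitrary):

    # Overlay labelled guidelines every `step` pixels before sending the image to the model.
    from PIL import Image, ImageDraw

    def overlay_grid(in_path, out_path, step=100):
        image = Image.open(in_path).convert("RGB")
        draw = ImageDraw.Draw(image)
        width, height = image.size
        for x in range(0, width, step):
            draw.line([(x, 0), (x, height)], fill=(255, 0, 0), width=1)
            draw.text((x + 2, 2), str(x), fill=(255, 0, 0))
        for y in range(0, height, step):
            draw.line([(0, y), (width, y)], fill=(255, 0, 0), width=1)
            draw.text((2, y + 2), str(y), fill=(255, 0, 0))
        image.save(out_path)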


I've done something similar and found the same thing. It also could not calibrate when I drew a dot on its last suggested coordinates:

"You said the play button was at 100, 200 and a green circle is drawn there. Is the circle located on the button, or do you need to adjust it?"

Something along those lines. It was also given the size of the image.

Nope, it's in the right ballpark, but it could not make fine adjustments or get any closer to the button.


Having worked on mobile infra for many years now for a couple of very large iOS teams, I'm excited to learn more, and kudos for putting yourselves out there.

1. Integration tests are notoriously slow, and the demo seemed to take some time to do basic actions. Is it even possible to run these at scale?

2. > Flaky UI tests suck

They can be flaky, but it's often due to bad code and architecture. Any data to back up that your tool makes tests less flaky? I could see a scenario where there are two buttons with the same text, but under the hood we'd use different in-code identifiers to determine which button should be tapped in the UI.

Overall I'm a bit skeptical because most UI tests are pretty easy to write today with very natural DSLs that are close to natural language, but I definitely want to follow along and hear about more production use cases.


Great questions.

1. Yes, running tests in parallel helps. We also cache actions so subsequent runs are much faster (this is disabled in the demo).

2. I agree that testing can be much more reliable and pleasant in some codebases than others; I have not been blessed with those kinds of codebases in my career. Flakiness is from personal experience automating UI tests specifically and having them break when a new nondeterministic popup modal is added or another engineer breaks an identifier/locator strategy. That being said, if you like writing UI tests and your codebase supports easily maintaining them, there are some really cool DSLs like Maestro!


> We also cache actions so subsequent runs are much faster

Interesting. What do you cache? How do you know whether one change needs to be rerun versus another?

>Flakiness is from personal experience automating UI tests specifically and having them break when a new nondeterministic popup modal is added or another engineer breaks an identifier/locator strategy

A modal popping up isn't a flake, though; flakiness is often when a button is on screen but the test runner can't seem to find it due to run-loop issues or emulator/simulator issues. If a modal pops up on the screen during a test, how does CamelQA resolve this, and how would it know whether it's an actual regression? A modal popping up at the wrong time _could_ be a real regression, versus a developer forgetting to configure some local state.


1. The AI agent writes an automation script (similar to an Appium script) that we can replay after the first successful run. If there are issues, the AI agent gets pulled back into the loop.

2. You can define acceptance criteria in natural language with camel.
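
As a hedged sketch of what that cache-then-fall-back-to-the-agent behavior could look like (the `run_agent` and `execute_action` hooks below are hypothetical placeholders, not camelQA's API):

    # Replay a recorded action script when one exists; on any failure, hand control
    # back to the AI agent and re-record the script for the next run.
    import json
    import pathlib

    CACHE_DIR = pathlib.Path("action_cache")

    def run_test(driver, test_name, nl_test_case, run_agent, execute_action):
        cache_file = CACHE_DIR / f"{test_name}.json"
        if cache_file.exists():
            try:
                for action in json.loads(cache_file.read_text()):
                    execute_action(driver, action)      # replay cached tap/type/swipe steps
                return "passed (replayed cached script)"
            except Exception:
                pass  # a locator broke or the screen changed; let the agent take over
        actions = run_agent(driver, nl_test_case)        # agent resolves the NL test live
        CACHE_DIR.mkdir(exist_ok=True)
        cache_file.write_text(json.dumps(actions))
        return "passed (agent run; script re-recorded)"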


> most UI tests are pretty easy to write today with very natural DSLs that are close to natural language

Wouldn't it be a better/cheaper/faster solution to use LLMs to write UI/integration tests?


The issue with this approach is that, for all but the simplest apps, it is not possible to deduce the runtime element information needed to write traditional UI tests from the source code alone. This can only be done reliably at runtime, which is what we do: we run your app and iteratively build UI tests that can be reused later.
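
To make the runtime point concrete, this is roughly how the live view hierarchy can be inspected with Appium; the bundle id is just an example:

    # The element tree (types, labels, frames) only exists once the app is running.
    from appium import webdriver
    from appium.options.ios import XCUITestOptions
    from appium.webdriver.common.appiumby import AppiumBy

    options = XCUITestOptions()
    options.bundle_id = "org.wikimedia.wikipedia"   # example app; use any installed bundle id

    driver = webdriver.Remote("http://localhost:4723", options=options)
    print(driver.page_source)                        # XML dump of the live view hierarchy
    for el in driver.find_elements(AppiumBy.CLASS_NAME, "XCUIElementTypeButton"):
        print(el.get_attribute("label"), el.rect)    # labels and frames resolved at runtime
    driver.quit()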


Why is the branding/mascot a camel?

I'm reminded of Waldo, a mobile testing automation product that was acquired in 2023.

Their mascot is another camelid (not sure if alpaca or llama). https://www.waldo.com/


We originally spelled it qaml, which stood for "quality assurance, machine learning." That wasn't obvious to anyone, and everyone would pronounce it "qwamel," which we hated, so we decided on camelQA instead. Thanks for the question. I was waiting for someone to ask it!


I noticed the first letter in camelQA is very explicitly lower-case. I just assumed someone on the team had very strong opinions about camelCase.


Your demo is very concise and well crafted. Is your host naturally smooth, or did it take many takes? Good job.


Our host is naturally smooth. We've tried a few different platforms to host this demo and landed on MacStadium. Glad you enjoyed your experience.


This is excellent. Definitely useful and well communicated on your site. I'm curious where you want to expand it; it seems like it could be used to track not just your own apps but also other companies' apps, and to monitor information and new UX from competitors (I've seen apps like ChangeTower for the web). Is this the direction you're planning to take this? More initial thoughts here: https://www.youtube.com/watch?v=FrLNG2vtxsA


Goated Guild, wait, can we call you GG? GG, your YouTube videos are dope; thank you for commenting. I'll leave that answer to our CEO - we have a lot of different paths we're considering today. Many people are interested in using camel's agent to operate an iPhone or Android device using natural language. But after we master test execution and test case creation for mobile, we're moving on to web.

Tracking competitive apps would be an interesting use case for automation. One of our partners at YC asked us to use camel to automatically refresh the waitlist for a Tesla Cybertruck lol.


Yeah, the two big issues with UI tests: they're flaky and slow.

Curious how using GPT and vision combats flakiness? I'd think the entropy of GPT and anything less than 100% accuracy in the computer vision pieces would lead to more flakiness.

I also wonder about the speed and cost of running the tests, when E2E tests are already traditionally slow and expensive. The computer vision and GPT elements seem costlier and slower.


We use GPT-4V to reason about the screen and decide what to do next. It does make mistakes. Here's a video of it thinking a page in the Shop app is an ad (https://www.youtube.com/watch?v=MKyO-U7j4Hs).

The upside is that we do prompt hacking on our end to break out of loops and heal after it's made a mistake. Having said that, we're working on improving this!

On costs, it's cheaper than you'd think. The entire playground demo cost us less than $10. It's more expensive than running a script, but we believe the cost of intelligence will go down over time.

On speed, yes it is slow. We minimize this by parallelizing tests across devices on our device farm. We can normally turn results around in 2.5-4 hours depending on the number of tests.
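
For illustration, the fan-out itself can be as simple as the sketch below, assuming one session per device; the device names and the `run_one` hook are placeholders, not our scheduler.

    # Fan a test suite out across a pool of devices and collect results.
    from concurrent.futures import ThreadPoolExecutor

    DEVICES = ["iphone-15-pro-01", "iphone-se-02", "pixel-8-01"]   # hypothetical pool
    TESTS = ["login", "signup_mfa", "purchase", "logout"]

    def run_one(device, test):
        ...  # open an Appium session against `device` and run `test`
        return device, test, "passed"

    with ThreadPoolExecutor(max_workers=len(DEVICES)) as pool:
        futures = [pool.submit(run_one, DEVICES[i % len(DEVICES)], test)
                   for i, test in enumerate(TESTS)]
        for future in futures:
            print(future.result())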

Thanks for the questions!


Demo looks very slick.

How far off is this from being able to integrate into a CI/CD pipeline? I'd love it if this could trigger off a PR, then block merging if it wasn't sure how to execute some regular user flow (even if that were due to it not understanding how to perform an action, since that may mean my flow doesn't make sense).


Thanks! We are releasing an API soon along with some common CI/CD pipeline integrations. We're working hard on speeding tests up so that we don't slow down your pipeline.


This looks awesome, automated UI testing is so hard to get right but also very important. Great work so far!


Thank you! It is hard to get right. And from what we've noticed among the users we've been talking to, no one company has won in this space just yet.


> no one company has won in this space just yet

Detox from Wix is pretty good for React Native apps.


I LOVE this. I pitched something similar (albeit far less intelligent) to my last employer only to get scoffed at, so it makes me really happy to see someone actually make and productize it. Wishing you success!


Scoffed at! Aw that sucks. Well when we blow up I hope you get your rightfully deserved "I told you so" moment. Thanks for the support :)


Great demo! This is going to be a huge time/money saver for companies.


Seems similar to App Quality Copilot - https://www.mobile.dev/app-quality-copilot


They're building something based on Appium, specifically through Maestro, which is an excellent use of Appium. But it isn't AI. We're excited to try it out!


Maestro is not based on Appium. It's built from the ground up; you can learn more about its internals here [0].

> it isn't AI

Hmm, why? I understand "AI" is an incredibly broad term, and there are maybe some fundamental differences between how App Quality Copilot and CamelQA work (I tried neither), but from looking at App Quality Copilot's website, it sure looks like "AI".

[0]: https://blog.mobile.dev/maestro-re-building-the-ios-driver-6...


I love this idea. Is there something similar for web-apps?

I wonder if you can easily add AI-based fuzzing or AI-based sample workflows to a testing pipeline.


Thanks! Web app support is coming very soon. That would be a really cool application of the API.


I'd love to give an AI a list of workflows to try out every time I push an update to my site.

In addition to pass/fail, I can see it even leaving some comments about ease-of-use. There's a lot of value here!


This is a great idea. I assume that inference costs will be higher for the time being, but it does aim to solve a real problem. Kudos!


Thanks! Yes, inference costs are non-negligible right now, but we think they will come down over time.


Congrats, very exciting. Do you support configuring things like usernames and passwords? How do you handle MFA?


Yes and yes. Check out the sole GIF we've provided. It shows an example of camelQA navigating sign-up MFA: it follows a deep link to the email app, verifies the email, navigates back to the app, and confirms sign-up.
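
Scripted by hand, that kind of flow could look roughly like the sketch below; the bundle id for the app under test and the element identifiers are hypothetical, and the agent drives this visually rather than from a fixed script.

    # Switch to the mail client, read the verification code, return, and enter it.
    from appium import webdriver
    from appium.options.ios import XCUITestOptions
    from appium.webdriver.common.appiumby import AppiumBy

    APP_UNDER_TEST = "com.example.myapp"        # hypothetical bundle id
    MAIL_APP = "com.apple.mobilemail"

    driver = webdriver.Remote("http://localhost:4723", options=XCUITestOptions())

    driver.activate_app(MAIL_APP)                # jump to the email app
    message = driver.find_element(
        AppiumBy.IOS_PREDICATE, "label CONTAINS 'verification code'")
    code = message.text                          # simplified; real code would parse the digits
    driver.activate_app(APP_UNDER_TEST)          # back to the app under test
    driver.find_element(AppiumBy.ACCESSIBILITY_ID, "mfa_code_field").send_keys(code)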


It would be nice to just invite the AI tester to TestFlight instead of uploading the build :)


Noted :) Thanks for the feedback. I think that would be an easier way to kick off camelQA as well. We'll have to figure out how to do that in our remote physical device farm.


This is very cool. Is there a version for browsers?


Thanks. It's on our roadmap, but we're starting with mobile first.


Any support for non-mobile native apps? e.g. macOS?


Not yet, but it's on the roadmap! We have a monthly newsletter if you'd like to stay up to date.


Will do, and good luck with getting some traction! I am bullish on automation like this, and loved the demo.



