
Automate any GUI using screenshots - ciudilo
http://sikuli.csail.mit.edu/
======
Corrado
A lot of folks on here are saying that this is cool but useless because there
are better ways to click a button on a screen. If you read through their paper
([http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2...](http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf))
you'll find more practical examples of what can be done with this type of
system.

One such example is to track real-time images from a webcam pointed at a baby
and use Sikuli to watch for a yellow dot placed on the baby's forehead.
Another is to track movements of something across the screen; in this case a
bus moving along Google Maps.

I agree that there are better ways to do most of the things in their examples
and that they should probably re-work their videos a bit, but just because
this system doesn't solve your problems the way you want it to doesn't mean
it's useless.

~~~
thorax
I don't think anyone is saying the technology isn't cool (and I see no one
using the word "useless" but you).

I believe most people are warning about the demonstrated technology for this
task:

> Sikuli is a visual technology to search and automate graphical user
> interfaces (GUI) using images (screenshots).

As I said below, for personal scripting of known applications I think this GUI
automation is great. But upon seeing it, we had all hoped to see a technology
which would get over our biggest hurdles in GUI development/testing. As it
doesn't sound like it would do that (well) for a number of reasons (e.g.
localization, themes, OS/app versioning, design changes during development,
coloring, etc.), it loses a lot of its practical application appeal.

It's still really cool technology and is a cool way to think about this kind
of problem.

------
tbrownaw
Hm, makes me think of
[http://blog.objectmentor.com/articles/2008/06/22/observation...](http://blog.objectmentor.com/articles/2008/06/22/observations-on-test-driving-user-interfaces).

It sounds... scary. Like it will work well enough at first, and then explode
when someone changes their desktop theme (especially icon theme), or wants to
upgrade to a new version of whatever. Treating things as change-controlled
APIs when they aren't just seems dangerous. Still I guess there's at least
some amount of change control coming from platform conventions and human
interface guidelines, and this comes closer to operating at the correct level
of abstraction to benefit from that.

~~~
joe_the_user
Yes,

One problem with GUIs from the start has been that automating their behavior
is fragile.

Automating actual keystrokes and mouse movements is the most fragile and
kludgy of automations. Even drilling to the level of system messages is
problematic.

Aside from its other good qualities, the web is really nice for reducing
possible user interactions to a very codified set of operations.

The article here points to ... _the wrong way of doing things_

~~~
undees
I wondered about that, too---how fragile will this be in the face of minor GUI
changes? (A fuzzy-match parameter would be a nice addition to the API.) One
encouraging sign is that the screenshots are stored as .pngs in the file
system. If an app update does nothing but change button graphics or text, the
script could presumably pull new graphics from the same place.

That just leaves the other 90% of GUI changes that involve moving a setting to
another screen, or changing the way an entire interaction works. ;-)

All the stuff you said about GUI automation being klunky is true. Still, this
thing could come in handy as a kind of "GUI batch file," or perhaps as a tool
to help produce screencasts.

~~~
mjs
They do say that it does fuzzy matching, so it can still work with minor
changes.
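
For illustration, here is a minimal sketch of what threshold-based fuzzy matching can look like. It is not Sikuli's actual matcher: plain normalized cross-correlation over small grayscale grids stands in for it, and the names `similarity`, `find`, and `min_similarity` are made up for this example.

```python
from math import sqrt


def similarity(patch, template):
    """Normalized cross-correlation of two equal-sized grayscale grids (-1 to 1)."""
    a = [p for row in patch for p in row]
    b = [p for row in template for p in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0:
        # A flat region has no structure to correlate; only an exact copy counts.
        return 1.0 if a == b else 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def find(screen, template, min_similarity=0.7):
    """Return (row, col, score) of the best match >= min_similarity, else None."""
    th, tw = len(template), len(template[0])
    best = None
    for r in range(len(screen) - th + 1):
        for c in range(len(screen[0]) - tw + 1):
            patch = [row[c:c + tw] for row in screen[r:r + th]]
            score = similarity(patch, template)
            if score >= min_similarity and (best is None or score > best[2]):
                best = (r, c, score)
    return best


# The stored "screenshot" of a plus-shaped icon.
template = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]

# The same icon on screen, slightly dimmed and with one corner pixel changed,
# as might happen after a minor theme tweak.
screen = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 60, 240, 10, 0],
    [0, 0, 240, 240, 240, 0],
    [0, 0, 10, 240, 10, 0],
    [0, 0, 0, 0, 0, 0],
]

hit = find(screen, template)  # tolerant threshold: the altered icon is found
strict = find(screen, template, min_similarity=0.999)  # near-exact: no match
```

With the threshold at 0.7 the altered icon is still located at row 1, column 2; at 0.999 the same search returns `None`. That is the trade-off a fuzzy-match parameter exposes: tolerate minor changes, or avoid false matches.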

I don't think any program guarantees that the GUI is a stable API! I do wonder
about its ability to handle unexpected conditions, or for the scripter to know
that these conditions exist in the first place. For example, if the GUI changes
based on the day of week, the script will probably either get very confused,
or do the wrong thing. I imagine there's some way to do conditionals (if you
see this, then do thing 1, otherwise do thing 2) but it's hard for the
scripter to discover all the possibilities. Still, might be useful for some
things.

------
liuliu
Sometimes I see HN fail to collaboratively discover this kind of interesting
topic. I posted the paper well before the media coverage (more than 100 days
ago): <http://news.ycombinator.com/item?id=810986> and it got no upvotes.

~~~
petercooper
There are a lot of interesting papers out there but, from what I've seen, HN
voters tend to appreciate actual products and services a lot more than theory
(and certainly don't appreciate PDFs). There's undoubtedly room for a more
compsci/theoretical HN equivalent - unless such a thing exists
(Reddit/compsci?)

Nonetheless, that's a cool paper and it's a shame it didn't make the front
page first time round :-)

~~~
ramchip
There used to be Academic Hacker News ( <http://hnacademic.com:40106/> ) but
it mostly died from lack of activity. It's a shame.

~~~
petercooper
Darn, that looks awesome. I suspect it died from the usual "launching a small
online community" problems of hitting critical mass, etc. Maybe this - or
something like it - could take off given the right amount of care by a group
of dedicated folks..

------
vdm
The people dissing this have obviously never dreamed of automating a 16-bit
Visual Basic 3 Windows app (that's Win16, not Win32) so it can be run from a
webapp front-end and gradually obsolesced.

Autohotkey works, but matching by screenshots with computer vision would cut
the amount of work required in half.

Bravo!

------
nex3
I can imagine this being useful for knowing to stop when things start going
wrong. One problem I've had with GUI-automators in the past is that they've
just kept automating after something unexpected happened and put them into an
invalid state. It seems like Sikuli could avoid this by literally knowing when
the screen looks wrong.
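
A minimal sketch of that idea, assuming a hypothetical `is_visible(image)` check in place of a real screenshot match. The "screen" here is just a set of image names, and `run_script` and `UnexpectedScreen` are invented for the example rather than taken from Sikuli.

```python
class UnexpectedScreen(Exception):
    """Raised when the screen does not show what the next step expects."""


def run_script(steps, is_visible):
    """Run (expected_image, action) steps in order, halting on the first surprise.

    is_visible is a stand-in for a screenshot match; it takes an image name
    and returns True if that image is currently on screen.
    """
    for index, (expected_image, action) in enumerate(steps):
        if not is_visible(expected_image):
            raise UnexpectedScreen(f"step {index}: {expected_image!r} not on screen")
        action()
    return len(steps)


# Simulated screen contents and a log of what the script did.
screen = {"login_button.png"}
log = []


def click_login():
    log.append("clicked login")
    # Simulate something going wrong: an error dialog appears instead of the inbox.
    screen.clear()
    screen.add("error_dialog.png")


steps = [
    ("login_button.png", click_login),
    ("inbox.png", lambda: log.append("opened inbox")),
]

try:
    run_script(steps, lambda image: image in screen)
except UnexpectedScreen as err:
    log.append(str(err))
```

The second step never runs: the script stops as soon as the inbox it expects is not on screen, instead of clicking blindly into an invalid state.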

------
jrockway
Rube Goldberg would be proud.

(Sometimes you should take a step back and ask yourself, "is looking for
pictures on the screen _really_ the best way to do this"? The example they
show on the main page is a one-line "ifconfig" invocation, for example.)

~~~
m0th87
Their example is a bit juvenile. Even so, keep in mind that it is a one-line
ifconfig invocation _for those that understand the terminal and know about
ifconfig_, which does not hold for the vast majority of users. This is about
making automation more usable. I'm not sure whether it is a productive approach,
because my perspective is adulterated by technical knowledge. But at the very
least, it is a very interesting approach which could prove fruitful for your
average user.

~~~
mquander
Sure, but there's also a precondition for using this: you have to understand
that it's actually even possible to automate repetitive tasks on a computer,
and have the motivation to do so. Hell, I know programmers who don't do that.

How many people fit that description, but will still find this easier than
ifconfig? My guess is few.

That said, I think this is a neat hack.

~~~
m0th87
I think its usefulness (or lack thereof) can't accurately be critiqued by
anecdotes from technical users, but rather needs to be quantified through
usability testing. It's easy to blow this idea off, but the "solve something
your users didn't know needed to be solved" advice I occasionally see on HN
keeps echoing in my head.

------
sacrilicious
Two wonderful things about this:

1\. as frankenstein-ed together as the tech is, it works*

2\. this is arguably more natural than 'workflow' recording functionality like
Automator, and I found the actual 'code' highly readable (although inscrutably
hard to debug or test or _run_ without the IDE...)

All in all I love the way the idea works right now, although Java feels less
than elegant on the Mac.

*(er, although for me it's got a killer bug - using the hotkey to make a screenshot does not work, gives no option of cancelling... hardcore crasher in my book)

------
actf
This tool looks really interesting - and I love the idea that it can be
programmed using Python. I've used a number of GUI automation tools in the
past like AutoHotkey (which I can also highly recommend) - this one looks
like it would make it easier to do certain tasks that are difficult in
AutoHotkey, for example: interacting with webpages or other applications that
don't have standard interfaces that can be examined with system APIs.

The screenshot approach this tool takes is very unique. My only criticism is
that, judging by the video, the image processing approach seems slow compared
to an AutoHotkey script.

What I'm really waiting for is a tool that can take this one step further and
do OCR on any on-screen text. This would make it easy to interact with GUIs
that present text that can't be read using system APIs - imho that would be
the holy grail of GUI automation.

------
mlapeter
If someone could take this concept a step further and let you create a self
contained process that users could download and run just by clicking (like
tasks in photoshop), I could see some uses:

\- Some tech support situations where you have to have a user do x amount of
steps on their computer that are the same for all users. Sort of like an
automated Geek Squad.

\- Sell a prepackaged GTD style organization system that creates all the
folders for you in the right places, downloads files (pre-made budget
spreadsheet for example) into them, etc. (trivial, but it's a pain point for
people)

\- Make a bunch of different productivity apps that mimic the steps a
professional programmer/ photographer/ marketer etc does when they first set up
a new computer (bookmarks, preference settings, etc.)

------
nodogbite
Clearly Sikuli has flaws, but for a research project, their presentation and
execution are impressive. Their efforts should be commended. Hopefully they'll
continue enhancing their scripting environment so that the scripts are robust
to significant variation in the GUI.

------
thorax
Very cool, but would have major limitations outside of just making a
"personal script" or, at best, a script for a heavily locked-down
enterprise/academic setup.

Because it uses literal images, it seems like any change in OS theme, OS
version, app version, localization (e.g. text or control shape), or colors
(e.g. high contrast mode) would break the scripts.

It'd be neat to use for GUI automation during software development except for
the fact that the GUI changes, the button wordings are tweaked, etc.

In all of these cases, back-end or OS GUI automation is probably better, but
if you have an unchanging environment or want a quick on-the-fly test, the
screenshot approach is novel and probably a bit cooler.

------
onyrac
Agreed the demo object is silly, but there are problems that are hard to solve
without GUI automation. For example, this tool could be great for scraping
Flash-based websites, which are notoriously painful to automate. And the
integration with Python means that you can easily mix and match with
conditional statements, calls to OCR libraries, etc...

------
amjith
This is a much nicer and more intuitive alternative to <http://autohotkey.com>
on Windows. I've tried introducing AutoHotkey at work to automate some of the
mundane tasks, but the learning curve of AutoHotkey was difficult for most of
my co-workers. I'm going to introduce this at my workplace.

------
dpcan
If you skip to the last 30 seconds of the first SIX MINUTE video tutorial, you
can see the app in action. Otherwise, you have to sit through a whole class on
how to use the app before you even know if you want to use it.

Little lesson in creating a good video demo....

Get to the point.

Then provide more videos for details.

(I guess you could say this should be expected from an MIT project website)

------
timf
cf. <http://news.ycombinator.com/item?id=1069608>

------
kenshi
It looks like a more advanced version of tools like Quick Test Pro.

There is big money in tools like that, but I can tell you, it's a real PITA to
write test scripts using tools such as these. Given the option, you are better
off exposing your app's object model to a scripting language, and letting
testers script it like that.

Obviously that doesn't work for third-party or legacy apps. So it definitely
has a market. And their computer vision algorithms have to be better than the
godawful bitmap comparison tools that QTP used.

------
subbu
The best use case I can think of for this is writing automated test cases for
a browser-based app. Selenium does a pretty good job of that already.

The demo (automatically setting an IP) is a one-time job. How many times do we
have to do this task? So there is no need for me to automate those kinds of
jobs. But having said that, this could still be useful in some use cases. One
example I could think of is testing desktop apps.

------
ststrat
This is incredibly useful. That's why Redstone Software has been selling it
for years, under the name Eggplant - see
<http://www.testplant.com/products/eggplant_functional_tester> . It takes a
lot of work in QA to figure out why this is useful (back me up on this one,
experienced QA engineers) and the right way to do it, so I'll give you the
Cliff Notes: This sort of bitmap recognition lets you automate that "last
mile" QA groups can never seem to automate. AutoHotkey, Selenium, and other
things all help automate lots of aspects of the interface with tons of caveats
and gotchas. This is a much more useful, if less pleasingly elegant, solution.

When you are automating testing it's relatively easy to automate back-end
stuff, write unit tests, write scripts wrapping CLI interfaces and so on,
but every automation team that deals with GUIs eventually stubs their toe on
automating the user interface. By having the computer automate the GUI task in
the same way a human user executes it ("I want to click the Apple Menu -
Where is the Apple icon I know is on top of the Apple Menu? - Ah! There it is!
I'll click it") you make it easier, or even possible, for the people writing
the QA automation to automate the GUI in a reasonable amount of time.

There are some pitfalls. What if someone changes the theme on the automation
rig? Well, you're an engineering team, not a preschool - DON'T change the
theme! What if somebody changes an icon in the app you're testing? Fortunately
you have access to the bitmap (it's saved with the rest of the build files,
yeah?) and of course the change notes for the build tell you the icon has been
updated. Well, of course it isn't in the change notes, but when a test that
was working fails you can easily run to the point where it says "Can't find
the foo button." This is a hint to look for the foo button and think about why
it can't be found.

Finally, all good scripting languages have an escape hatch to call other
programs that can do things better than they can and return a result. Need to
check an old COM object through its native interface? Write a small Windows
app that your script calls to get that state.

It takes a lot of experience and frustration with trying to fully automate
tests on a GUI to understand why this is useful. And the cry of "Bitmaps break
because things change" - well, no they don't. Not on a computer. Not if you
know what you're doing and have control of the source. (Please disable all
auto-update systems on your test rig or you will be surprised at some point.)
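
The "escape hatch" mentioned above can be sketched in a few lines. This is an illustration in plain Python 3 using the standard `subprocess` module, not Sikuli- or Eggplant-specific code. The helper name `query_helper` is invented, and a trivial Python one-liner stands in for the small native program a real test rig would call.

```python
import subprocess
import sys


def query_helper(command, timeout=10):
    """Run an external helper program and return its trimmed standard output.

    Raises subprocess.CalledProcessError if the helper exits with an error,
    and subprocess.TimeoutExpired if it takes longer than `timeout` seconds.
    """
    result = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode output to str
        timeout=timeout,
        check=True,           # treat a non-zero exit code as a failure
    )
    return result.stdout.strip()


# Stand-in helper: a real rig would invoke its own small native program here.
state = query_helper([sys.executable, "-c", "print('ready')"])
```

The test script can then branch on `state` like on any other value, while the awkward part (talking to the native interface) lives in a separate program.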

------
vdm
How often do you have to read and re-type an error message to Google, because
the text cannot be copied and pasted? This technology could OCR the screen text
and Google it for you automatically.
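
As a rough sketch of the second half of that pipeline, assuming the OCR step has already produced a string (no OCR library is used here, and `search_url` is a made-up helper name), turning the recognized text into a search link needs only the standard library:

```python
from urllib.parse import urlencode


def search_url(error_text, base="https://www.google.com/search"):
    """Build a search URL from raw recognized text."""
    # Collapse line breaks and runs of spaces that OCR output tends to contain.
    query = " ".join(error_text.split())
    return f"{base}?{urlencode({'q': query})}"


# Example with text as it might come out of an OCR pass over an error dialog.
url = search_url("Error 0x80070005:\n   Access is denied.")
```

Opening the result is then a single call to `webbrowser.open(url)` from the standard library.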

The demo video is proof of concept; make sure you read the paper.

<http://sikuli.csail.mit.edu/documentation.shtml>

------
RK
I've had to use some non-scriptable, proprietary software that this might
actually be useful for in doing repetitive tasks. This is especially true at
some places where I have done some engineering consulting (non-software). It
would probably fall in the category of ugly hack, but would also save some
headache for me.

------
sebastian
Does anyone know if a sikuli script can be run from the command line without
having to use the sikuli IDE?

~~~
gnubardt
They provide a command line tool for running Sikuli scripts without the IDE.

~~~
sebastian
Perfect!

Tnx

------
ideamonk
Sikuli comes at the right time for me, going to use it to automate browser
testing and generate reports :)

------
amichail
Here's a more ambitious vision of this idea published in 2000:

[http://www.cs.washington.edu/homes/lsz/papers/slpz-
cacm00.pd...](http://www.cs.washington.edu/homes/lsz/papers/slpz-cacm00.pdf)

~~~
liuliu
But the MIT version uses more state-of-the-art methods (MSER/SIFT hybrid local
descriptor, ruby-like syntax etc.).

~~~
amichail
This also comes from MIT, but from the media lab.

I think it's a more ambitious vision because it talks about programming by
example systems, which automatically generalize from examples.

But Sikuli is very cool even if its generalization is limited to ignoring
minor differences in images.

------
pwim
How well would this work for game playing bots? If this can abstract away the
detection and clicking of regions, it would make building one much more
approachable.

------
dirtbox
I wonder if I can teach it how to leet at Team Fortress.

