Hey guys, I made this site and just gave a talk about it at SHDH. Someone must have submitted it. Thanks for all your feedback, I really appreciate it!
nmap is too aggressive. It's a prelude to actual hacking attempts and gets flagged as such by intrusion-detection systems. Don't use it for this, or you may end up in legal trouble.
Agreed. UnderTheSite makes a specific point of only considering information that would be returned when a user's browser hits a website. It doesn't scan or probe ports or URLs.
Awww, I wish I didn't have to leave SHDH before the lightning talks started!
I wrote a Ruby program to do something similar to what you're doing: https://github.com/jpf/domain-profiler - If you ever start profiling sites using information from places other than what the server returns, perhaps what I've done can help inspire you?
I'd like to suggest avoiding manual additions of technology for as long as possible. Focus on adding more ways to match specific technologies. After all, a site could always advertise more technologies in server headers or meta generator tags.
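Header and meta-generator matching along those lines can be sketched like this (the signature table entries and the `detect` helper are purely illustrative assumptions, not the site's actual matcher):

```python
import re

# Illustrative signatures: technology name -> (response header, pattern),
# plus patterns matched against the <meta name="generator"> content.
HEADER_SIGNATURES = {
    "nginx": ("Server", re.compile(r"\bnginx\b", re.I)),
    "Phusion Passenger": ("X-Powered-By", re.compile(r"Phusion Passenger", re.I)),
}
GENERATOR_SIGNATURES = {
    "WordPress": re.compile(r"\bWordPress\b", re.I),
}
META_GENERATOR = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)', re.I)

def detect(headers, html):
    """Return the set of technologies matched from response headers
    and the meta generator tag of a single page fetch."""
    found = set()
    for tech, (header, pattern) in HEADER_SIGNATURES.items():
        if pattern.search(headers.get(header, "")):
            found.add(tech)
    m = META_GENERATOR.search(html)
    if m:
        for tech, pattern in GENERATOR_SIGNATURES.items():
            if pattern.search(m.group(1)):
                found.add(tech)
    return found
```

Everything here comes from a normal page load, so it stays within the "only what the browser would fetch" rule.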
A mechanism for sites themselves to advertise their stack seems like a great idea (though I'd prefer it not occur via fixed URLs like humans.txt or robots.txt, but via headers or meta tags). I'd just suggest not allowing arbitrary additions to a site's stack without any way to verify them.
Minor bug: For my site, it says YUI. Having written it from scratch, I'm pretty sure there is no YUI anywhere. It appears to match jqueryui.js as yui.js.
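For what it's worth, that false positive looks like a plain substring match; anchoring the signature on word boundaries avoids it. A minimal sketch (the `script_matches` helper is hypothetical):

```python
import re

def script_matches(src, library):
    """Match only when the library name appears as a whole token in the
    script path, so 'jqueryui.js' no longer triggers the 'yui' signature."""
    return re.search(r"\b" + re.escape(library) + r"\b", src) is not None
```

With this, `"yui.min.js"` still matches `"yui"`, but `"jqueryui.js"` does not, because there is no word boundary between `jquery` and `ui`.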
Does the probe try to fetch signal URLs (like admin pages)? UnderTheSite.com makes a point of only looking at data that would normally be fetched by your browser. Probing a site for URLs could be considered offensive.
I'd add SSL/https. I'd also add "Strict-Transport-Security" and "Content-Security-Policy", both of which can be seen by looking at HTTP response headers.
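Both headers are cheap to check from a single response. A minimal sketch (the `security_headers` helper is an assumption; it just inspects a dict of response headers, case-insensitively since header names are case-insensitive):

```python
def security_headers(headers):
    """Report which transport-security headers a response advertises."""
    keys = {k.lower() for k in headers}
    return {
        "HSTS": "strict-transport-security" in keys,
        "CSP": ("content-security-policy" in keys
                or "content-security-policy-report-only" in keys),
    }
```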
Your analyzer mistakenly detected my website as using Microsoft IIS and ASP.NET, which is weird since it also detected Google App Engine (which is correct). ASP doesn't run on GAE.
Just a thought: do you think website owners should mention the technologies they use in an HTTP header? For example, your analyzer can't detect that I'm using Java and the Spring framework.
I'm not sure there is a Django matcher yet. Django is hard to detect reliably. Do you know of a good signature (like a header or form token) that it always uses?
What triggers the error message "The pattern that you entered appears to be too general - can you make it more specific for this technology?"? I tried to add a technology for the use of rel="nofollow", using the XPath expression //a/@rel[contains(., "nofollow")], and got that error message. What do I need to do to make it more specific?
My 2c: nice design (graphics), but I would make the first page less busy. BuiltWith is going to be a tough competitor. You need to match them (precision/recall) and add things they don't have (trending techs? extra info: who the hosting provider is? where in the world the site is hosted? response times? bad-link stats? ...)
It says that my company's site [1] is running ASP.NET on Microsoft IIS, which it's not. To be fair, it also mentions Ruby on Rails, Apache, and Phusion Passenger, which are all correct. Aside from these minor glitches, this is a pretty cool project.
Great idea. How about saving the results you generate and allowing people to search for sites based on the technologies they use? For example, I might want to see all sites that use jQuery but run something other than Apache.
In terms of features, I'd love to see more emphasis on the aggregate/comparison data. For example, most popular server side framework, most popular JS libraries, most popular hosting platforms and so on.