
httpdiff – diff responses to two HTTP/HTTPS requests - jgrahamc
https://github.com/jgrahamc/httpdiff
======
tlrobinson
The Unix way:

    
    
        diff <(curl -vs https://news.ycombinator.com/ 2>&1) <(curl -vs https://news.ycombinator.com/ 2>&1)
    

As a shell function:

    
    
        httpdiff () {
            diff <(curl -vs "$1" 2>&1) <(curl -vs "$2" 2>&1)
        }
    
        httpdiff https://news.ycombinator.com/ https://news.ycombinator.com/

~~~
jgrahamc
Which works well for that one example. But not others.

    
    
        diff <(curl -Lvs https://www.google.com/ 2>&1) <(curl -Lvs http://www.google.com/ 2>&1)
    

On my machine that produces 33k of output. I like one-liners and using the
shell, but this tool was built not for fun but because I was debugging things
that were painful.

~~~
tlrobinson
That's nothing a couple "1> /dev/null"'s can't solve (or redirect to a
temporary file then "diff -q"). But yeah, yours is a bit nicer.
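
For reference, the temporary-file variant might look roughly like this. It is only a sketch: `quietdiff` and the `FETCH` override are made-up names for illustration, and it reports whether the two responses differ without showing how.

```shell
# quietdiff URL_A URL_B: say whether two responses differ, not how.
# Hypothetical helper; FETCH is an assumed override so curl can be swapped out.
quietdiff () {
    local fetch a b status
    fetch=${FETCH:-curl -sL}      # -s: no progress meter, -L: follow redirects
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    $fetch "$1" > "$a"
    $fetch "$2" > "$b"
    # diff -q exits 0 when the files match and 1 when they differ.
    diff -q "$a" "$b" > /dev/null
    status=$?
    if [ "$status" -eq 1 ]; then
        printf 'responses differ: %s %s\n' "$1" "$2"
    fi
    rm -f "$a" "$b"
    return "$status"
}
```

Usage would be `quietdiff https://www.google.com/ http://www.google.com/`, with the exit status usable in scripts.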

------
brokentone
This is very nice. I'm working on something similar as part of my build
process so that I can more directly notice changes as they are made in markup,
and catch changes that I didn't expect.

The issue I've hit, that you may want to consider, is a suppression list of
sorts. The ability to silence diffs on things that look like dates for example
would be rather valuable.
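
A rough sketch of what such a suppression step could look like with plain shell tools, assuming the responses are filtered before they reach `diff`. The name `mask_dates` and the three patterns are illustrative only; a real list would need tuning per site.

```shell
# mask_dates: replace date-like and time-like substrings on stdin with
# placeholders so they no longer show up as differences.
# Hypothetical filter; the patterns are examples, not a complete list.
mask_dates () {
    sed -E \
        -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}/<DATE>/g' \
        -e 's/[0-9]{1,2} [A-Z][a-z]{2} [0-9]{4}/<DATE>/g' \
        -e 's/[0-9]{2}:[0-9]{2}:[0-9]{2}/<TIME>/g'
}
```

It would slot into the one-liner above as `diff <(curl -s "$1" | mask_dates) <(curl -s "$2" | mask_dates)`.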

------
johns
Awesome tool! Our Traffic Inspector also has this feature for requests
captured through our API debugging proxy:
[https://www.runscope.com/docs/comparisons](https://www.runscope.com/docs/comparisons)

------
nijiko
I made a gist with a collection of one-liners here:

[https://gist.github.com/Nijikokun/d6606c036d89d3b1574c](https://gist.github.com/Nijikokun/d6606c036d89d3b1574c)

------
some_furry
I wrote a similar tool while debugging a web scraper for my current dayjob
employer, but it was buried inside of our application rather than standalone.

(Also, it was PHP, which a lot of people hate.)

------
billyhoffman
Nice tool, jgrahamc. You should consider expanding how you detect/show diffs
of the response bodies, since that has a lot of applications: detecting
content changes, detecting ads/malicious code, detecting crawl duplicates,
security audits, etc.

Years ago I found that Levenshtein distance is super helpful for determining
_how_ different the responses are, and used it as part of a black box web
security scanner. You can do this on just the raw HTML, but that's noisy and
shows many incidental differences. It's better to use an HTML-aware string
distance function that diffs just page content. I used that as a channel for
detecting blind SQL injection (in combination with some other things).
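
For anyone who hasn't met it, here is a minimal sketch of plain Levenshtein distance using awk. This is the textbook two-row dynamic program over raw strings, not the HTML-aware variant described above, and `levenshtein` is just an illustrative name. It is O(n*m), so it is only practical on short inputs, and `awk -v` interprets backslash escapes in its arguments.

```shell
# levenshtein A B: print the edit distance between two strings.
# Textbook dynamic program keeping only the previous and current rows.
levenshtein () {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a); m = length(b)
        for (j = 0; j <= m; j++) prev[j] = j
        for (i = 1; i <= n; i++) {
            cur[0] = i
            for (j = 1; j <= m; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                best = prev[j - 1] + cost                         # substitute
                if (prev[j] + 1 < best) best = prev[j] + 1        # delete
                if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1  # insert
                cur[j] = best
            }
            for (j = 0; j <= m; j++) prev[j] = cur[j]
        }
        print prev[m]
    }'
}
```

For example, `levenshtein kitten sitting` prints 3.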

I also found that you can go a level higher, and use Levenshtein on just the
HTML tag structure of different responses. By looking at page structure, and
applying different weights based on the HTML tags that were added/removed you
can group similar pages, which usually maps to the different functional
areas/templates of a site. As in, you can say "these 5 pages are all product
details pages", "these 10 pages are all blog posts", etc. Super helpful for a
security scanner, since this could inform crawling/auditing choices and speed
up audits. It also allowed us to say "you have an XSS vulnerability in your
Blog comments form" instead of just saying "you have XSS vulnerabilities in
these 100 pages".
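
As a very crude illustration of the "tag structure only" idea, one could strip a page down to its sequence of tag names and compare those sequences. This sketch uses a regex rather than a real HTML parser and applies no per-tag weights, so it is a stand-in for the approach described, not a reproduction of it; `tag_skeleton` is a hypothetical name.

```shell
# tag_skeleton: reduce HTML on stdin to its sequence of tag names,
# one per line, lowercased, with closing tags kept as "/name".
# Regex-based and crude: markup inside scripts or attributes can fool it.
tag_skeleton () {
    grep -oE '</?[A-Za-z][A-Za-z0-9]*' | tr -d '<' | tr 'A-Z' 'a-z'
}

# Possible use (bash): count the skeleton lines that differ between two
# saved pages as a rough structural distance.
#   diff <(tag_skeleton < a.html) <(tag_skeleton < b.html) | grep -c '^[<>]'
```

Two pages built from the same template should produce near-identical skeletons even when their text differs completely.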

Anyway, there is a lot of value in detecting how different/similar various
responses are. See some of Google's published work about detecting near
duplicates for web crawling...

------
pokoleo
Why not just write this as a bash one-liner?

    
    
        function httpdiff {
            # Quote the URLs so characters such as '?' or '*' are not glob-expanded.
            diff <(curl -L "$1") <(curl -L "$2")
        }

~~~
TazeTSchnitzel
Because `diff` only understands naïve textual differences. A tool that
understands the HTTP format can give you more meaningful info.
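
To make that concrete: header names are case-insensitive and header order usually carries no meaning, yet a line-by-line `diff` flags both. A small sketch of normalizing saved header blocks first (this is not how httpdiff itself works, and `normalize_headers` is a made-up name):

```shell
# normalize_headers: read a raw HTTP header block on stdin, drop the
# status line and blank lines, lowercase header names, tidy the space
# after the colon, and sort, so that only real changes remain.
normalize_headers () {
    tr -d '\r' |
    awk '/^[^ \t:]+:/ {
        i = index($0, ":")
        name = tolower(substr($0, 1, i - 1))
        value = substr($0, i + 1)
        sub(/^[ \t]+/, "", value)
        print name ": " value
    }' |
    sort
}
```

Headers can be captured with `curl -s -D - -o /dev/null URL | normalize_headers` and the two results diffed as before.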

------
beersigns
Getting x509 issues. Output is below:

$ httpdiff https://www.google.com http://www.google.com

Doing GET: https://www.google.com http://www.google.com

Error doing GET https://www.google.com: Get https://www.google.com: x509:
certificate signed by unknown authority

I'm sitting behind a pretty heavy proxy though; could be that. That, or maybe
an OS X (10.9.5) certificate store issue?

------
chair6
Taking it down a level or two, occasionally you want to do something similar
with DNS -
[https://gist.github.com/chair6/1748a6676120a0aacea2](https://gist.github.com/chair6/1748a6676120a0aacea2).

------
Humjob
Very cool!

I'm a bit of a newb though - how do I install this?

~~~
jgrahamc

        1. You need Go
    
        2a. Type 'make'. It will build the binary and place it in bin/
        2b. Use the go tool chain directly to build. Does the same thing as 2a.

~~~
NateDad
Why the makefile and src directory? Why not put the source in the root
directory, so that go get github.com/jgrahamc/httpdiff would just work? The
makefile doesn't even do anything useful, and just makes it so that poor
Windows folks can't build your code for no good reason.

~~~
lclarkmichalek
Yes, you have to do the archaic
`go get github.com/jgrahamc/httpdiff/src/httpdiff`. Truly a travesty. Meanwhile,
the makefile allows you to avoid the whole GOPATH nonsense, which gets a plus
from me.

------
dmritard96
I wrote a tool for doing this with arbitrarily nested json docs a while back:
[https://github.com/ChannelIQ/jsoncompare](https://github.com/ChannelIQ/jsoncompare)

------
Newky
I recently did this manually by piping responses to temp files called before
and after, and then using vimdiff as my differ, which proved quite effective.

This tool looks great, but would not have worked with my particular use case,
which was doing some migration of user data, and diffing the user accounts to
make sure that they had changed in the expected way.

It might be an idea to add the ability to query the same host twice, but have
a user input trigger when to test each host.
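
Something like the following shell sketch is what I have in mind: snapshot one URL, wait for a keypress, snapshot it again, then diff. The names `before_after` and `FETCH` are made up for illustration, and plain `diff` stands in for vimdiff.

```shell
# before_after URL: fetch URL, wait for Enter, fetch it again, show the diff.
# Hypothetical helper; FETCH is an assumed override for the fetch command.
before_after () {
    local fetch before after status
    fetch=${FETCH:-curl -sL}
    before=$(mktemp) || return 2
    after=$(mktemp) || { rm -f "$before"; return 2; }
    $fetch "$1" > "$before"
    printf 'Snapshot taken. Make your change, then press Enter... ' >&2
    read -r _
    $fetch "$1" > "$after"
    diff "$before" "$after"
    status=$?
    rm -f "$before" "$after"
    return "$status"
}
```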

------
anonfunction
I'm working on a project[1] that is based around this functionality. The big
"diff" is that I called it response-diff in our original spec, and we're
planning to write it in Node.js.

[1]
[https://github.com/Mashape/changehook](https://github.com/Mashape/changehook)

------
m0dest
Should rename to "diff two HTTP(S) responses" instead of requests.

~~~
jgrahamc
Better?

------
artursapek
Perfect use case for Go.

~~~
jnpatel
Can you elaborate?

~~~
anonfunction
Not OP but I'll give it a shot. Golang has a great standard HTTP package that
makes this project a breeze to implement. Combine that with cross-compilation
and statically linked binaries and you get a tool that can be run virtually
anywhere a developer would want to without needing to set up a new
environment.

~~~
artursapek
You took the words right out of my mouth!

------
lukasm
is there any reason to use md5? (apart from backward compatibility etc.)

~~~
rudolf0
It doesn't really make a difference in a case like this. Though certainly not
an ideal algorithm, the chance of a collision for something like this is low
enough to never be a concern.

~~~
akjsdfh
counterpoint: is there any reason not to use a stronger hash function?

Why leave something dangerous lying around when /probably/ nothing is going to
go wrong... until someone picks it up and decides to do something with it that
was unexpected, when better alternatives abound?
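
For what it's worth, swapping in SHA-256 costs nothing at the shell level. A minimal sketch, assuming the digest is only used to tell whether two bodies are byte-identical (`body_digest` is an illustrative name):

```shell
# body_digest: print the SHA-256 hex digest of whatever arrives on stdin.
# Uses sha256sum where available (GNU coreutils), else shasum -a 256.
body_digest () {
    if command -v sha256sum > /dev/null 2>&1; then
        sha256sum | cut -d' ' -f1
    else
        shasum -a 256 | cut -d' ' -f1
    fi
}
```

Comparing `curl -s "$1" | body_digest` against `curl -s "$2" | body_digest` then gives the same yes/no answer as an MD5 comparison.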

