Seems like this could be done in JavaScript without an XHR, so your info never gets sent to them.
However, https://www.htmlwasher.com/privacy/:
"The Operator may collect the personal data, such as, without limitation, (i) name; (ii) age; (iii) sex; (iv) address; (v) homepage URL address; (vi) telephone number; (vii) email address; (viii) bank account number; as well as (ix) any information relating and relevant to the Services, including, without limitation, opening and administering the Account, or getting feedback for improving the Services."
" In the event that the Operator is involved in a bankruptcy, merger, acquisition, reorganization or sale of assets, your personal data may be sold or transferred as part of that transaction."
It does make me wonder what the owners of the top Google results for JSON and XML prettifiers do with that data. The number of passwords and other private info that gets pasted into those is probably pretty high.
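For what it's worth, prettifying JSON doesn't need a web service at all. A minimal sketch using only Python's standard `json` module (the function name `pretty_json` is just made up for illustration), so nothing pasted in ever leaves the machine:

```python
import json


def pretty_json(text: str) -> str:
    """Re-indent a JSON document locally; raises ValueError on invalid JSON."""
    # json.loads parses the text, json.dumps re-serializes it with indentation.
    # Key order is preserved as written in the input.
    return json.dumps(json.loads(text), indent=2)


if __name__ == "__main__":
    print(pretty_json('{"b":1,"a":[1,2]}'))
```

The same thing is available from the shell as `python -m json.tool file.json`.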
In the event of a sell-off, the new owners won't be asking about the original "intent" of the creator. They will be looking at the contract for ways to make money. The fact that you are paying nothing for this product makes it doubly suspicious.
cat tea-dance.html | pandoc --from=html --to=markdown | pandoc --from=markdown --to=html
Is your project any different, aside from its "service-oriented" nature? (Also, I don't see any way to use it other than from the browser.)
Though, on second glance, it doesn't do what HtmlWasher is doing here: stripping out classes, etc. It just cleans up the markup, fixing unmatched tags and so forth.
I only found out this tool/lib exists after building Html Washer - I am considering using their lib as an underlying lib for my project
"Bleach is intended for sanitizing text from untrusted sources."
I think HtmlWasher should have something on the About tab.
thanks, I will consider using Bleach as an underlying lib / part of my service
to an even greater extent than templating systems, sanitization systems of this type need to be built by an expert and align perfectly with how browsers parse tags, which is no small feat.
to give more concrete examples, from a few minutes of testing:
<a href="javascript://%0Aalert`xss`">1</a> <- xss on click
<img src=javascript:alert(2)> <- XSS in Opera Mobile, Opera 10, early versions of IE
<img src="/logout"> <- csrf which affects nearly everything built without security knowhow
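To make the first two cases concrete, here is a rough sketch of a URL scheme check, in Python with only the standard library. The allowlist and the function name `url_scheme_allowed` are assumptions for illustration, not anything HtmlWasher actually does. It is meant to be run on attribute values after the HTML parser has decoded entities, and it is nowhere near a complete defence:

```python
import re

# Assumed allowlist, for illustration only.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# A scheme is a letter followed by letters, digits, "+", "." or "-", then ":".
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def url_scheme_allowed(value: str) -> bool:
    """Return True if the URL is relative or uses an allowlisted scheme."""
    # Drop whitespace and control characters first: browsers ignore tabs and
    # newlines inside a URL, so "java\tscript:" still runs as javascript:.
    cleaned = re.sub(r"[\x00-\x20]", "", value)
    match = _SCHEME_RE.match(cleaned)
    if match is None:
        # No scheme at all: a relative URL such as "/logout". It is allowed
        # here, which is exactly why this check does nothing for the CSRF case.
        return True
    return match.group(1).lower() in ALLOWED_SCHEMES


if __name__ == "__main__":
    for url in ("javascript://%0Aalert`xss`", "javascript:alert(2)", "/logout"):
        print(repr(url), url_scheme_allowed(url))
```

Note that `/logout` passes: a scheme check can't tell a harmless relative image from one that triggers a state-changing GET, so that one has to be handled on the application side.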
I wrote an HTML file in Microsoft Word, then uploaded that .html file, which had 800 lines. HtmlWasher cleaned up all the file content: the endless meta tags, nonsense IE-style tags, etc.
Explain yourself
It has a tiny little web interface, which remains online today on some underpowered server. It doesn't work well with anything except XHTML, though. http://htmlcleaner.blackholestudios.nl/
It doesn't do magic (like indentation or removing/simplifying CSS) if that's what you're after, but it gives you straightforward capabilities to filter out script elements, check/suppress event handler attributes and other places where JavaScript can occur maliciously in HTML, enforce presence of HTML elements, etc. Since it's entirely driven by an SGML DTD grammar for HTML it can be customized to death really (for context-dependent filtering, injection prevention, whatever).
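As a very rough illustration of the "check event handler attributes and script elements" part: the sketch below is not the SGML/DTD approach described above, just a much simpler stand-in built on Python's standard `html.parser`. The names `ScriptAudit` and `audit` are invented for this example. It only reports findings and does not rewrite anything:

```python
from html.parser import HTMLParser


class ScriptAudit(HTMLParser):
    """Collect places where JavaScript can appear in an HTML fragment."""

    def __init__(self):
        super().__init__()
        # Each finding is a (kind, tag, attribute_name_or_None) tuple.
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "script":
            self.findings.append(("element", tag, None))
        for name, _value in attrs:
            if name.startswith("on"):
                self.findings.append(("event-handler", tag, name))


def audit(html_text: str):
    parser = ScriptAudit()
    parser.feed(html_text)
    parser.close()
    return parser.findings


if __name__ == "__main__":
    print(audit('<div onclick="x()"><script>1</script><p>ok</p></div>'))
```

This misses `javascript:` URLs, CSS-based tricks, and anything that depends on how a particular browser recovers from malformed markup, which is where a grammar-driven tool earns its keep.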
[1]: http://sgmljs.net/blog/blog1701.html
By all means correct me if I'm wrong, but I recall learning that there's a semantic difference between b/strong (as well as i/em).
https://softwareengineering.stackexchange.com/a/255588
http://www.kylheku.com/cgit/hc/tree/
I used this to allow HTML in mailing list e-mails to be incorporated into the web archive. (The archiver is a modified version of Lurker.)
P.S. "wl" stands for "whitelist": what elements are allowed to pass through, and of those, which attributes are allowed to pass through. The condensed "wl" config file is translated into compiled-in static tables by the wl.txr script. No run-time config.
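For anyone curious what an element/attribute whitelist looks like in miniature, here is a sketch in Python using the standard `html.parser`. This is not the hc code (which uses compiled-in tables generated from the "wl" file); the table contents and the names `WHITELIST`, `WhitelistFilter`, and `clean` are assumptions made up for this example. It also does not validate attribute values, so an allowed `href` could still carry a `javascript:` URL:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical static table: allowed element -> allowed attribute names.
WHITELIST = {"p": set(), "b": set(), "i": set(), "br": set(), "a": {"href"}}
VOID_ELEMENTS = {"br"}                # emitted without an end tag
DROP_CONTENT = {"script", "style"}    # drop the element and the text inside it


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip += 1
            return
        allowed = WHITELIST.get(tag)
        if allowed is None:
            return  # element not whitelisted: drop the tag, keep its text
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in allowed
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self._skip:
                self._skip -= 1
            return
        if tag in WHITELIST and tag not in VOID_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def clean(html_text: str) -> str:
    parser = WhitelistFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(clean('<p class="x" onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>'))
```

Keeping the table as a module-level constant mirrors the "no run-time config" idea: what passes through is decided when the program is built, not by whatever input arrives.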
https://github.com/yabwe/medium-editor/blob/master/spec/past...
