Sites heavy on JS often have minified, obfuscated (at least, the names are replaced) JS files. However, many sites have huge amounts of whitespace sticking around in their HTML. Why don't people "minify" their HTML?
Because technically it's not very sound. Here are examples of where "minifying" (removing whitespace) could go wrong if applied blindly to any served HTML document...
1) pre tags - pre tags preserve formatting, so removing whitespace inside them will alter how the content is rendered
2) any empty tags - it's becoming a thing of the past, but there are many instances where browsers will render a tag with a single space inside differently than a tag with nothing inside. In other words, a space within a tag may be intentional on the developer's part.
3) spaces inside attributes may matter - you could have an attribute on an HTML tag such as data-whatever="1 2\n 3", and collapsing those spaces could be bad, depending on what the developer intended
Additionally there are some other things to consider...
1) GZIP if used will make the impact of scrubbing out whitespace almost nonexistent
2) Most HTML served is dynamic, meaning that the HTML compression will need to be run on every HTML response - this could hurt performance. (If you're just compressing static HTML once it should be fine.)
None of your first three points are really a problem. No, you won't be able to do a simple regex based solution, but the rules for where whitespace matters in HTML are rigorously standardized. Obey them, and your minifier will work just fine.
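To make "not a simple regex" a bit more concrete, here is a minimal sketch of a parser-based approach using Python's standard html.parser. The class name, the PRESERVE set and the sample markup are all made up for illustration, and the set is an assumption rather than a complete reading of the spec. It only collapses whitespace in ordinary text nodes; start tags (and so attribute values) are echoed exactly as written, and text inside pre-like elements is passed through untouched. It also assumes no CSS such as white-space: pre is applied to other elements.

```python
import re
from html.parser import HTMLParser

# Elements whose text must be passed through untouched.
# Assumed, non-exhaustive list chosen for this sketch.
PRESERVE = {"pre", "textarea", "script", "style"}


class WhitespaceMinifier(HTMLParser):
    """Collapses whitespace runs in text nodes outside PRESERVE elements."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.preserve_depth = 0

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written,
        # so whitespace inside attribute values is never touched.
        self.out.append(self.get_starttag_text())
        if tag in PRESERVE:
            self.preserve_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in PRESERVE and self.preserve_depth:
            self.preserve_depth -= 1

    def handle_data(self, data):
        if self.preserve_depth:
            self.out.append(data)
        else:
            # Collapse to one space rather than nothing: a space between
            # inline elements can still be visible when rendered.
            self.out.append(re.sub(r"\s+", " ", data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def minify(html):
    parser = WhitespaceMinifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<div   data-x="1  2">\n   <p>Hello   world</p>\n  <pre>a\n   b</pre>\n</div>'
print(minify(sample))
```

A real minifier would need to handle more cases (processing instructions, inline vs. block context, and so on), but the point stands that the parser, not a regex, decides where whitespace is safe to touch.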
>1) GZIP if used will make the impact of scrubbing out whitespace almost nonexistent
Maybe if you were just removing whitespace (although you still will see a difference). Removing comments and omitting optional closing tags will take you further. Minified JS compresses smaller than un-minified JS, so it's reasonable to think the same would be true of HTML.
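If you want to check this on your own pages rather than take anyone's word for it, something like the following rough sketch would do. The sizes helper and the sample page are invented for illustration, and the regex is a deliberately naive stand-in for a real minifier (it is not safe on arbitrary HTML). It just reports byte counts before and after gzip so you can compare.

```python
import gzip
import re


def sizes(html: str) -> dict:
    """Return byte counts for a page, raw and gzipped, before and after
    a naive whitespace strip. Illustration only: the regex below ignores
    pre tags, inline spacing, and everything else a real minifier must
    respect."""
    minified = re.sub(r">\s+<", "><", html)
    raw = html.encode("utf-8")
    small = minified.encode("utf-8")
    return {
        "raw": len(raw),
        "minified": len(small),
        "raw_gzip": len(gzip.compress(raw)),
        "minified_gzip": len(gzip.compress(small)),
    }


# Made-up, heavily indented sample page.
rows = "\n".join(
    f'        <li>\n            <a href="/item/{i}">Item {i}</a>\n        </li>'
    for i in range(200)
)
page = f"<html>\n    <body>\n        <ul>\n{rows}\n        </ul>\n    </body>\n</html>\n"

for name, size in sizes(page).items():
    print(f"{name}: {size} bytes")
```

The numbers will depend heavily on the page, so run it on real markup before drawing conclusions.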
>Most HTML served is dynamic, meaning that the HTML compression will need to be run on every HTML response - this could have some performance negatives. (If you're just compressing static HTML once it should be fine.)
For templated HTML, the minification should be done on the template itself, not on the final output. You really do have to weigh the pros and cons of GZIPping dynamically generated HTML, so pre-minified HTML templates could be a pretty big win.
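As a rough sketch of what "minify the template, not the output" could look like, here is a toy example using Python's string.Template as a stand-in for a real template engine. The function names and the template are hypothetical, and the regex is again a naive placeholder for a proper minifier. The minification cost is paid once at startup; each request only does the substitution.

```python
import html
import re
from string import Template


def minify_template(source: str) -> str:
    # Naive placeholder: only strips whitespace-only runs between tags.
    return re.sub(r">\s+<", "><", source)


TEMPLATE_SOURCE = """\
<ul>
    <li>
        <b>$name</b>
    </li>
</ul>
"""

# One-time cost when the application starts, not on every request.
PAGE = Template(minify_template(TEMPLATE_SOURCE).strip())


def render(name: str) -> str:
    # Per-request work is just escaping and substitution; the dynamic
    # value itself is never run through the minifier.
    return PAGE.substitute(name=html.escape(name))


print(render("Ada"))
```

Because the dynamic values are inserted after minification, any whitespace in user content is left exactly as it was.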
Minifiers don't just remove whitespace. Eliminating comments, for example, can save a lot (although it seems that most HTML out there is poorly commented in the first place). A smart minifier can also drop optional closing tags like </li> for a bit more gain (although I don't know of a minifier that does).
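For what it's worth, here is a hedged sketch of those two extra steps. Both helper names are made up, both use plain regexes for brevity, and both assume already whitespace-minified input with no comment-like or tag-like text inside script, style or pre content. The second one only drops </li> where another <li> or the end of the list follows directly, which is the case where the HTML spec allows the end tag to be omitted.

```python
import re


def strip_comments(html: str) -> str:
    # Leaves IE conditional comments (<!--[if ...]>) alone, since old
    # versions of IE act on them. Naive: it would also strip
    # comment-like text inside <script> or <pre>.
    return re.sub(r"<!--(?!\[if).*?-->", "", html, flags=re.DOTALL)


def drop_optional_li_end_tags(html: str) -> str:
    # Remove </li> only when the very next thing is another <li ...>
    # or the closing tag of the list.
    return re.sub(r"</li>(?=<li[\s>]|</ul>|</ol>)", "", html, flags=re.IGNORECASE)


sample = "<ul><!-- nav --><li>One</li><li>Two</li></ul>"
print(drop_optional_li_end_tags(strip_comments(sample)))
```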
Who uses a text editor that inserts whitespace into HTML? I am aware that some editors like to show indentation levels in HTML code, for various reasons, but for me HTML code is always more readable if it has no extra whitespace at all. Whitespace is not meaningful in HTML, and I always set my text editor (I have used various brands of text editor over the years) so that my HTML output has no extra whitespace. Why use an editor that adds whitespace in the first place?
AFTER EDIT: TazeTSchnitzel asks a fair question. I put each new major element on a new line. In general, I try to make paragraphs look like paragraphs, headings look like headings, and so on, with just newline whitespace but without leading whitespace before elements (which has annoyed me for the last few weeks in a website updating project I was working on). Thanks for asking the clarifying question.
FURTHER EDIT: Yes, thanks for the statement that indentation shows nested structure (which is what I guessed is the usual rationale for extra whitespace in HTML code). Despite that obviously sensible practice (which, after all, leads to the MEANINGFUL white space in Python code), I have seen plenty of examples of HTML pages that have unmatched tags even though they have so much whitespace that the "view source" view of the page is mostly off to the right of my screen. Agreeing that being able to view source code structure is important, may I suggest that as one reason to like Notepad++ as one of the many editor choices available to persons who write code? In the recent project I worked on updating, the original programmers had left many unmatched tags and inconsistent structures in the code, and I was able to strip out all the extraneous code AND fix the structure by using Notepad++ to find (for example) the beginning and ending div elements surrounding big, complicated blocks of code. Notepad++ shows code structure with structure lines overlaid on the raw source code view.
Obviously (to me anyway) you're right, and I'm surprised there are people who don't indent their HTML like that. But a thought just occurred to me -- why don't editors do this automatically and without creating actual whitespace (tabs/spaces)? If you look at XML in your web browser it will automatically be formatted with indents based on tag hierarchy.
I see your point, but I think there are a few solid reasons it would be a bad idea.
The basic text file (at least in Unix) is king, and the medium by which we transport all sorts of code around. Adding a layer of abstraction to an editor like that, where it shows you one thing but saves the file differently, breaks this premise and means your new editor now doesn't play well with others.
- Unless your source control is in on it, you are suddenly going to see a different file than the one you were working with when resolving conflicts.
- Your whole team is now forced to use your editor to get the same view of the code as you.
- grepping, line counts, and most third-party text manipulation/wrangling tools become useless unless they understand your editor's format
With the majority of websites, the content changes most of the time, which means the HTML has to be compressed after each change. Sure, there are scripts to do this, but in comparison CSS/JS files rarely change; they're generally changed only when a new feature or design is implemented.
Another reason is that websites are becoming more dynamic, so the HTML isn't very cacheable, whereas CSS/JS are extremely cacheable. On every request you have the extra task of running minification on the complete HTML of a page (especially if your website is dynamic and you're using a script to do this), and that is time you could have used to transfer data.
Moreover, there is other low-hanging fruit that most websites need to tackle first: minimizing HTTP requests, removing unnecessary images, compressing images, and minifying and combining CSS/JS.
Minifying dynamically generated files on the fly is harder than minifying static ones (it's obviously possible, but you have to take it into consideration), and there are some cases where it might actually decrease performance, because every page request requires minifying.
Personally, I think the benefits of minifying your HTML won't pay off until you're receiving a significant amount of traffic anyway. That's why the large majority of sites will see better value from minifying/combining their CSS/JS and tackling the other low-hanging fruit first.
I understood the question, which is why I said there are still performance issues if you try to do this, and it's only beneficial to websites that receive a large volume of traffic, as they are the ones who will see the benefits. There is other low-hanging fruit that can improve performance before you even look at minifying HTML.
Yeah, I like to minify the HTML as well ... most template engines have an "omit whitespace" option or flag which lets you do this.
Every now and then whitespace turns out to be significant though ... so you need to be slightly intelligent about how you handle it.
Not sure. We simply used a templating engine. Either a) we didn't take the time to read the documentation properly to find out if there is an efficient minifier, or b) there isn't an efficient templating engine that minifies.
I'm betting big on using preprocessor languages such as HAML or Jade.
HTML can be made minimal, but it can't be done automatically. It takes some HTML knowledge and, most importantly, a few thousand lines of written code behind you to make it optimized, if not minified. What I want to say is that over time you learn how to write code in a way that is as optimized as it can be. At least I did. For example, I never write table structure hierarchically; I always leave the <table> on top, then group together the <tr><td></td></tr>, and then put each table cell on a new line. This way I can still locate the start of each new row, so the table structure stays human-readable and a bit optimized. There are a lot of tricks like this that I learned over time.