# hyperbuild
A fast HTML minifier written in C, heavily influenced by [kangax's html-minifier](https://github.com/kangax/html-minifier).
Available in different flavours:
- Standalone 64-bit Linux executable (this)
- [Node.js](https://github.com/wilsonzlin/hyperbuild-nodejs)
- [Express](https://github.com/wilsonzlin/hyperbuild-express)
- [Webpack](https://github.com/wilsonzlin/hyperbuild-webpack)
- [Apache](https://github.com/wilsonzlin/hyperbuild-apache)
- [Nginx](https://github.com/wilsonzlin/hyperbuild-nginx)
## Features
### Streaming minification
hyperbuild minifies as it parses, streaming processed HTML directly to the output without building a DOM/AST or making multiple passes. This allows for very fast processing and near-constant memory usage.
### Super fast
hyperbuild is written in C, and uses technologies like Emscripten and Cython to preserve performance in higher-level languages.
### Smart whitespace handling
hyperbuild has advanced whitespace minification with smart defaults that leave whitespace untouched in `pre` and `code`, trim and collapse it in content tags, and remove it in layout tags, allowing the use of `inline-block` without ugly syntax or CSS hacks.
## Parsing
Current limitations:
- UTF-8 in, UTF-8 out, no BOM.
- Not aware of exotic Unicode whitespace characters.
- Tested and designed for Linux only.
- Follows HTML5 only.
### Errors
Errors marked with a `⌫` can be suppressed using the [`--suppress`](#--suppress) option.
Use the error name without the `HBE_PARSE_` prefix.
#### `HBE_PARSE_MALFORMED_ENTITY` ⌫
It's an error if the sequence of characters following an ampersand (`&`) does not form a valid entity.
Entities must take one of the following forms:
- `&name;`, where *name* is a reference to a valid HTML entity
- `&#nnnn;`, where *nnnn* is a Unicode code point in base 10
- `&#xhhhh;`, where *hhhh* is a Unicode code point in base 16

A malformed entity is an ampersand not followed by a sequence of characters that matches one of the above forms. This includes the case where the semicolon is missing.
Note that this is different from `HBE_PARSE_INVALID_ENTITY`, which is when a well-formed entity references a non-existent entity name or Unicode code point.
While an ampersand by itself (i.e. followed by whitespace or as the last character) is a malformed entity, it is covered by `HBE_PARSE_BARE_AMPERSAND`.
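
The distinction between a well-formed entity, a malformed entity, and a bare ampersand can be sketched as follows. This is only an illustration in Python, not hyperbuild's C implementation; the `classify_ampersand` helper, the regular expression, and the choice of ASCII whitespace characters are assumptions made for the example.

```python
import re

# Hypothetical pattern for the three documented forms:
#   &name;   &#nnnn;   &#xhhhh;
WELL_FORMED = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);")

# Assumed whitespace set (ASCII only).
ASCII_WHITESPACE = " \t\n\r\f"


def classify_ampersand(text, pos):
    """Classify the ampersand at text[pos] (illustrative only)."""
    assert text[pos] == "&"
    following = text[pos + 1 : pos + 2]
    # An ampersand at the end of input or followed by whitespace is "bare".
    if following == "" or following in ASCII_WHITESPACE:
        return "BARE_AMPERSAND"
    # Otherwise it must match one of the three forms, semicolon included.
    if WELL_FORMED.match(text, pos):
        return "WELL_FORMED"
    return "MALFORMED_ENTITY"
```

Note that a well-formed entity can still be invalid: `&nosuchname;` matches the first form here, and whether the name exists is a separate check.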
#### `HBE_PARSE_BARE_AMPERSAND` ⌫
It's an error to have an ampersand followed by whitespace or as the last character.
This is intentionally a different error to `HBE_PARSE_MALFORMED_ENTITY` due to the ubiquity of bare ampersands.
An ampersand by itself is not *necessarily* an invalid entity. However, HTML parsers and browsers may have different interpretations of bare ampersands, so it's a good idea to always use the encoded form (`&amp;`).

When this error is suppressed, bare ampersands are output untouched.
#### `HBE_PARSE_INVALID_ENTITY` ⌫
It's an error if an invalid HTML entity is detected.
If suppressed, invalid entities are output untouched.
See [entityrefs.c](src/main/c/rule/entity/entityrefs.c) for the list of entity references considered valid by hyperbuild.
Valid entities that reference a Unicode code point must be between 0x0 and 0x10FFFF (inclusive).
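
As a rough illustration of the range check, here is a Python sketch that takes an already well-formed numeric entity and tests its code point. The function name is hypothetical, and hyperbuild's own check lives in its C sources.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound stated above (inclusive)


def numeric_entity_is_valid(entity):
    """Return True if a well-formed numeric entity such as '&#169;' or
    '&#xA9;' references a code point within 0x0..0x10FFFF."""
    body = entity[2:-1]  # strip the leading '&#' and the trailing ';'
    if body.startswith("x"):
        code_point = int(body[1:], 16)
    else:
        code_point = int(body, 10)
    return 0x0 <= code_point <= MAX_CODE_POINT
```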
#### `HBE_PARSE_NONSTANDARD_TAG` ⌫
It's an error if an unknown (non-standard) tag is reached.
See [tags.c](src/main/c/rule/tag/tags.c) for the list of tags considered valid by hyperbuild.
#### `HBE_PARSE_UCASE_TAG` ⌫
It's an error if an opening or closing tag's name has any uppercase characters.
#### `HBE_PARSE_UCASE_ATTR` ⌫
It's an error if an attribute's name has any uppercase characters.
#### `HBE_PARSE_UNQUOTED_ATTR` ⌫
It's an error if an attribute's value is not quoted with `"` (U+0022) or `'` (U+0027).
This means that `` ` `` is not a valid quote mark, regardless of whether this error is suppressed. (Backticks are accepted as attribute value quotes by Internet Explorer.)
#### `HBE_PARSE_ILLEGAL_CHILD`
It's an error if a tag is declared inside a parent that it can't be a child of.
This is a very simple check, and does not cover the comprehensive HTML rules, which involve backtracking, tree traversal, and lots of conditionals.
This rule is enforced in four parts:
[whitelistparents.c](src/main/c/rule/relation/whitelistparents.c),
[blacklistparents.c](src/main/c/rule/relation/blacklistparents.c),
[whitelistchildren.c](src/main/c/rule/relation/whitelistchildren.c), and
[blacklistchildren.c](src/main/c/rule/relation/blacklistchildren.c).
#### `HBE_PARSE_UNCLOSED_TAG`
It's an error if a non-void tag is not closed.
See [voidtags.c](src/main/c/rule/tag/voidtags.c) for the list of tags considered void by hyperbuild.

This includes tags that close automatically because of siblings (e.g. `<li><li>`), as requiring explicit closing tags gives guarantees about the structure that greatly reduce the complexity of the minifier.
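
To see why explicit closing keeps things simple, consider a minimal stack-based check. This is a Python sketch under simplifying assumptions (a regular expression stands in for a real tag parser, and attribute values containing `>` are not handled); it is not how hyperbuild is implemented. The void tag list is the `$void` set given in this README.

```python
import re

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "keygen", "link", "meta", "param", "source", "track", "wbr"}

# Simplified tag matcher: optional '/', a lowercase tag name, then anything up to '>'.
TAG = re.compile(r"<(/?)([a-z][a-z0-9]*)[^>]*>")


def find_unclosed_tag(html):
    """Return the name of the first tag found to lack an explicit closing
    tag, or None if every non-void tag is closed."""
    stack = []
    for match in TAG.finditer(html):
        is_closing = match.group(1) == "/"
        name = match.group(2)
        if name in VOID_TAGS:
            continue
        if not is_closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        elif stack:
            # The innermost open tag was never closed explicitly.
            return stack[-1]
        # Stray closing tags with nothing open are outside this sketch.
    return stack[-1] if stack else None
```

With this rule, `<ul><li>a<li>b</ul>` is reported as leaving `li` unclosed, even though browsers would close it implicitly.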
#### `HBE_PARSE_SELF_CLOSING_TAG` ⌫
It's an error if a tag is self-closed. Self-closing syntax is valid in XML, but not in HTML.
#### `HBE_PARSE_NO_SPACE_BEFORE_ATTR`
It's an error if there is no whitespace before an attribute.
Most likely, the cause of this error is either invalid syntax or something like:
```html
<div class="a"name="1"></div>
```
(Note the lack of space between the end of the `class` attribute and the beginning of the `name` attribute.)
#### `HBE_PARSE_UNEXPECTED_END` and `HBE_PARSE_EXPECTED_NOT_FOUND`
General syntax errors.
#### Additional errors
There are additional implicit errors that are treated as general syntax errors due to the way the parser works:
- Closing void tags; see [voidtags.c](src/main/c/rule/tag/voidtags.c) for the list of tags considered void by hyperbuild.
- Placing whitespace between `=` and attribute names/values.
- Placing whitespace before the tag name in an opening tag.
- Placing whitespace around the tag name in a closing tag.
- Not closing a tag before the end of the file/input.
#### Notes
- Closing `</script>` tags end single-line and multi-line JavaScript comments in `script` tags.
  For this to be detected by hyperbuild, the closing tag must not contain any whitespace (e.g. `</script >` is not detected).
### Options
#### `--in`
Path to a file to process. If omitted, hyperbuild will read from `stdin`.
#### `--out`
Path to a file to write to; it will be created if it doesn't exist already. If omitted, the output will be streamed to `stdout`.
#### `--keep`
Don't automatically delete the output file if an error occurred. If the output is `stdout`, or the output is a file but `--buffer` is provided, this option does nothing.
#### `--buffer`
Buffer all output until the process is complete and successful. This won't truncate or write anything to the output until the build process is done, but will use a non-constant amount of memory.
This applies even when the output is `stdout`.
#### `--suppress`
Suppress errors specified by this option. hyperbuild will quietly ignore the problem and continue processing when one of the provided errors would otherwise occur.
Suppressible errors are marked with a `⌫` in the [Errors](#errors) section. Omit the `HBE_PARSE_` prefix. Separate the error names with commas.
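
The naming convention can be sketched in Python as below. The set contains the errors marked with `⌫` in this README; the `parse_suppress` helper and its handling of unknown names (raising an error) are assumptions for illustration, not documented hyperbuild behaviour.

```python
# Errors marked as suppressible in this README, without the HBE_PARSE_ prefix.
SUPPRESSIBLE = {
    "MALFORMED_ENTITY", "BARE_AMPERSAND", "INVALID_ENTITY", "NONSTANDARD_TAG",
    "UCASE_TAG", "UCASE_ATTR", "UNQUOTED_ATTR", "SELF_CLOSING_TAG",
}


def parse_suppress(value):
    """Turn a comma-separated --suppress value into full error names."""
    names = [name for name in value.split(",") if name]
    unknown = [name for name in names if name not in SUPPRESSIBLE]
    if unknown:
        # Assumed behaviour: reject names that are not suppressible.
        raise ValueError("not suppressible: " + ", ".join(unknown))
    return {"HBE_PARSE_" + name for name in names}
```

For example, the value `BARE_AMPERSAND,UCASE_TAG` corresponds to `HBE_PARSE_BARE_AMPERSAND` and `HBE_PARSE_UCASE_TAG`.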
## Minification
### Theory
#### Whitespace
##### Beginning and end
```html
<p>
··The·quick·brown·fox↵
</p>
```
##### Between text and tags
```html
<p>The·quick·brown·fox·<strong>jumps</strong>·over·the·lazy·dog.</p>
```
##### Contiguous
```html
<select>
··<option>Jan:·········1</option>
··<option>Feb:········10</option>
··<option>Mar:·······100</option>
··<option>Apr:······1000</option>
··<option>May:·····10000</option>
··<option>Jun:····100000</option>
</select>
```
##### Whole text
```html
<p>
···↵
</p>
```
#### Content
##### Specific tags
Tags not in one of the categories below are **specific tags**.
##### Formatting tags
```html
<strong> moat </strong>
```
##### Content tags
```html
<p>Some <strong>content</strong></p>
```
##### Content-first tags
```html
<li>Anthony</li>
```
```html
<li>
<div>
</div>
</li>
```
##### Layout tags
```html
<div>
<div></div>
</div>
```
##### Overview
|Type|Content|
|---|---|
|Formatting tags|Text nodes|
|Content tags|Formatting tags, text nodes|
|Layout tags|Layout tags, content tags|
|Content-first tags|Content of content tags or layout tags (but not both)|
### Options
Note that only existing whitespace will be up for removal via minification. Entities that represent whitespace will not be decoded and then removed.

For options that have a list of tags as their value, the tags should be separated by commas.

An `*` (asterisk, U+002A) can be used to represent the complete set of possible tags. Providing no value represents the empty set.
Depending on the option, these two values effectively enable or disable it entirely.
For brevity, hyperbuild has built-in sets of tags that can be used in place of declaring all their members; they begin with a `$` sign:
|Name|Tags|Source|
|---|---|---|
|`$content`|`address`, `audio`, `button`, `canvas`, `caption`, `figcaption`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, `legend`, `meter`, `object`, `option`, `p`, `summary`, `textarea`, `video`|[contenttags.c](src/main/c/rule/tag/contenttags.c)|
|`$contentfirst`|`dd`, `details`, `dt`, `iframe`, `label`, `li`, `noscript`, `output`, `progress`, `slot`, `td`, `template`, `th`|[contentfirsttags.c](src/main/c/rule/tag/contentfirsttags.c)|
|`$formatting`|`a`, `abbr`, `b`, `bdi`, `bdo`, `cite`, `data`, `del`, `dfn`, `em`, `i`, `ins`, `kbd`, `mark`, `q`, `rp`, `rt`, `rtc`, `ruby`, `s`, `samp`, `small`, `span`, `strong`, `sub`, `sup`, `time`, `u`, `var`, `wbr`|[formattingtags.c](src/main/c/rule/tag/formattingtags.c)|
|`$layout`|`blockquote`, `body`, `colgroup`, `datalist`, `dialog`, `div`, `dl`, `fieldset`, `figure`, `footer`, `form`, `head`, `header`, `hgroup`, `html`, `main`, `map`, `menu`, `nav`, `ol`, `optgroup`, `picture`, `section`, `select`, `table`, `tbody`, `tfoot`, `thead`, `tr`, `ul`|[layouttags.c](src/main/c/rule/tag/layouttags.c)|
|`$specific`|All [SVG tags](src/main/c/rule/tag/svgtags.c), `area`, `base`, `br`, `code`, `col`, `embed`, `hr`, `img`, `input`, `param`, `pre`, `script`, `source`, `track`|[specifictags.c](src/main/c/rule/tag/specifictags.c)|
|`$heading`|`hgroup`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`|[headingtags.c](src/main/c/rule/tag/headingtags.c)|
|`$media`|`audio`, `video`|[mediatags.c](src/main/c/rule/tag/mediatags.c)|
|`$sectioning`|`article`, `aside`, `nav`, `section`|[sectioningtags.c](src/main/c/rule/tag/sectioningtags.c)|
|`$void`|`area`, `base`, `br`, `col`, `embed`, `hr`, `img`, `input`, `keygen`, `link`, `meta`, `param`, `source`, `track`, `wbr`|[voidtags.c](src/main/c/rule/tag/voidtags.c)|
|`$wss`|`pre`, `code`|[wsstags.c](src/main/c/rule/tag/wsstags.c)|
As an example, for `--MXcollapseWhitespace`, here are some possible values:
|Arguments|Description|
|---|---|
|`--MXcollapseWhitespace $wss`|Collapse whitespace in all tags except `$wss` ones|
|`--MXcollapseWhitespace $content,$wss`|Collapse whitespace in all tags except `$content` and `$wss` ones|
|`--MXcollapseWhitespace $content,$wss,dd`|Collapse whitespace in all tags except `$content` and `$wss` ones, as well as the `dd` tag|
|`--MXcollapseWhitespace sup,dd`|Collapse whitespace in all tags except `sup` and `dd`|
|`--MXcollapseWhitespace`|Collapse whitespace in all tags|
|`--MXcollapseWhitespace *`|Don't collapse whitespace in any tag|
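
How such a value maps to a set of excepted tags can be sketched in Python. Only three of the built-in sets from this README are included to keep the example short; the helper names and the `"*"` sentinel are assumptions made for illustration, not part of hyperbuild.

```python
# A few built-in sets copied from the table in this README.
BUILTIN_SETS = {
    "$wss": {"pre", "code"},
    "$media": {"audio", "video"},
    "$heading": {"hgroup", "h1", "h2", "h3", "h4", "h5", "h6"},
}
ALL_TAGS = "*"  # sentinel standing in for "every possible tag"


def expand_tag_list(value):
    """Expand a comma-separated option value into a set of tag names."""
    if value == "*":
        return ALL_TAGS
    tags = set()
    for item in value.split(","):
        if not item:
            continue  # an empty value yields the empty set
        if item.startswith("$"):
            tags |= BUILTIN_SETS[item]
        else:
            tags.add(item)
    return tags


def is_excepted(tag, expanded):
    """True if the option's minification should be skipped for this tag."""
    return expanded == ALL_TAGS or tag in expanded
```

Under this reading, `$wss,dd` excepts `pre`, `code` and `dd`, an empty value excepts nothing, and `*` excepts every tag.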
#### `--MXcollapseWhitespace $wss`
Reduce a sequence of whitespace characters in text nodes to a single space (U+0020), unless they are a child of the tags specified by this option.
<table><thead><tr><th>Before<th>After<tbody><tr><td>
```html
<p>
··The·quick·brown·fox↵
··jumps·over·the·lazy↵
··dog.↵
</p>
```
<td>
```html
<p>·The·quick·brown·fox·jumps·over·the·lazy·dog.·</p>
```
</table>
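
The transformation itself is small. Here is a Python sketch applied to the text node from the example above; the exact set of characters hyperbuild treats as whitespace is an assumption (ASCII only, in line with the limitation about exotic Unicode whitespace).

```python
import re

# Assumed whitespace set: ASCII space, tab, newline, carriage return, form feed.
WHITESPACE_RUN = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space (U+0020)."""
    return WHITESPACE_RUN.sub(" ", text)
```

Leading and trailing runs are collapsed to one space each rather than removed; removing them is the job of whitespace trimming.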
#### `--MXdestroyWholeWhitespace $wss,$content,$formatting`
Remove any text nodes that only consist of whitespace characters, unless they are a child of the tags specified by this option.
Especially useful when using `display: inline-block` so that whitespace between elements (e.g. indentation) does not alter layout and styling.
<table><thead><tr><th>Before<th>After<tbody><tr><td>
```html
<div>
··<h1></h1>
··<ul></ul>
··A·quick·<strong>brown</strong>·<em>fox</em>.↵
</div>
```
<td>
```html
<div><h1></h1><ul></ul>
··A·quick·<strong>brown</strong><em>fox</em>.↵
</div>
```
</table>
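
A Python sketch of the idea, using a hypothetical flat list of child nodes in which text nodes are plain strings and child tags are `("tag", markup)` tuples. hyperbuild itself streams its input and does not build such a list.

```python
ASCII_WHITESPACE = " \t\n\r\f"  # assumed whitespace set


def destroy_whole_whitespace(nodes):
    """Drop text nodes that consist only of whitespace; keep everything else."""
    return [
        node
        for node in nodes
        if not (isinstance(node, str) and node.strip(ASCII_WHITESPACE) == "")
    ]
```

Text nodes that contain anything other than whitespace are left untouched, including their leading and trailing whitespace.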
#### `--MXtrimWhitespace $wss,$formatting`
Remove any whitespace from the start and end of a tag, if the first and/or last node is a text node, unless the tag is one of the tags specified by this option.
Useful when combined with whitespace collapsing.
Other whitespace between text nodes and tags is not removed, as it is not recommended to mix non-formatting tags with raw text.
Basically, a tag should contain either only text and [formatting tags](#formatting-tags), or only non-formatting tags.
<table><thead><tr><th>Before<th>After<tbody><tr><td>
```html
<p>
··Hey,·I·<em>just</em>·found↵
··out·about·this·<strong>cool</strong>·website!↵
··<div></div>
</p>
```
<td>
```html
<p>Hey,·I·<em>just</em>·found↵
··out·about·this·<strong>cool</strong>·website!↵
··<div></div></p>
```
</table>
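
In Python, the rule could be sketched like this, again over a hypothetical list of child nodes where text nodes are strings and child tags are `("tag", markup)` tuples. This is an illustration only, not hyperbuild's streaming implementation.

```python
ASCII_WHITESPACE = " \t\n\r\f"  # assumed whitespace set


def trim_whitespace(nodes):
    """Strip leading whitespace from a first text node and trailing
    whitespace from a last text node; leave all other nodes alone."""
    nodes = list(nodes)
    if nodes and isinstance(nodes[0], str):
        nodes[0] = nodes[0].lstrip(ASCII_WHITESPACE)
    if nodes and isinstance(nodes[-1], str):
        nodes[-1] = nodes[-1].rstrip(ASCII_WHITESPACE)
    return nodes
```

Whitespace in the middle, such as the text node before `<div></div>` in the example above, is kept because it is neither the first nor the last node.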
#### `--MXtrimClassAttribute`
Don't trim and collapse whitespace in `class` attribute values.
<table><thead><tr><th>Before<th>After<tbody><tr><td>
```html
<div class="
hi
lo
a b c
d e
f g
"></div>
```
<td>
```html
<div class="hi lo a b c d e f g"></div>
```
</table>
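
The default behaviour shown above (trim, then collapse) amounts to the following Python sketch. The function name and the ASCII-only whitespace set are assumptions for the example.

```python
import re

ASCII_WHITESPACE = " \t\n\r\f"  # assumed whitespace set


def trim_class_attribute(value):
    """Trim the ends of a class attribute value and collapse inner runs of
    whitespace to single spaces."""
    trimmed = value.strip(ASCII_WHITESPACE)
    return " ".join(re.split(r"[ \t\n\r\f]+", trimmed))
```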
#### `--MXdecEnt`
Don't decode any valid entities into their UTF-8 values.
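
For a sense of what decoding means when this option is not set, here is a Python sketch using the standard library's `html.unescape`. It is more lenient than the rules described under [Errors](#errors) (for example, it accepts some entities without a semicolon), so treat it as an approximation rather than a description of hyperbuild's behaviour.

```python
import html


def decode_entities(text):
    """Decode entities and return the result as UTF-8 bytes."""
    return html.unescape(text).encode("utf-8")
```

For instance, `&copy;` becomes the two bytes `C2 A9`, which is shorter than the six-character entity.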
#### `--MXcondComments`
Don't minify the contents of conditional comments, including downlevel-revealed conditional comments.
#### `--MXattrQuotes`
Don't remove quotes around attribute values when possible.
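
Whether quotes can be dropped depends on the value. The Python sketch below applies the HTML syntax rule for unquoted attribute values (non-empty, and free of whitespace, `"`, `'`, `=`, `<`, `>` and `` ` ``). hyperbuild's actual criteria are not spelled out here and may be stricter, so the function is an assumption-laden illustration.

```python
# Characters that may not appear in an unquoted attribute value in HTML.
FORBIDDEN_UNQUOTED = set(" \t\n\r\f\"'=<>`")


def can_drop_quotes(value):
    """Return True if the value would still be valid HTML without quotes."""
    return value != "" and not (set(value) & FORBIDDEN_UNQUOTED)
```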
#### `--MXcomments`
Don't remove any comments. Conditional comments are never removed regardless of this setting.
#### `--MXoptTags`
Don't remove optional starting or ending tags.
#### `--MXtagWS`
Don't remove spaces between attributes when possible.
### Non-options
#### Explicitly important
The following minification strategies, which remove attributes and tags, are not available in hyperbuild, as such attributes and tags should not have been declared in the first place.

If they exist, it is assumed there is a special reason for them.

- Remove empty attributes (including ones that would be empty after minification e.g. `class=" "`)
- Remove empty elements
- Remove redundant attributes
- Remove `type` attribute on `<script>` tags
- Remove `type` attribute on `<style>` and `<link>` tags