Scrap dump - notes on HTML
Mar. 10th, 2012 11:53 am
Old notes from when I was trying to design my own next version of HTML:
1. Different tag types increase the complexity of writing a parser.
The language has more syntactic special cases than is elegant: ordinary <html> tags, <!-- comment --> tags, <![CDATA[ ]]> sections, <!DOCTYPE> declarations, and more. Each has its own parsing and closing rules.
Possible resolutions, and why they are bad solutions:
- Yeah, so?
  - The complexity makes it harder for developers to write parsers.
- Make all the special stuff use one of the <! > formats.
  - Breaks backwards compatibility.
- Make everything use the <html> tag format.
  - Breaks backwards compatibility.
- All the special tag types can be seen as meta instructions, but this triggers the next problem.
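To make the branching concrete, here is a minimal sketch in Python (chosen only for illustration) of a lexer that has to carry a separate opener and terminator for each construct. The `CONSTRUCTS` table and the `lex_constructs` name are hypothetical, and the sketch ignores details a real parser must handle, such as `>` inside quoted attribute values and the case-insensitive spelling of `<!DOCTYPE>`.

```python
# Each construct needs its own opener and its own terminator.
# Order matters: the generic "<" must be tried last.
CONSTRUCTS = [
    ("comment", "<!--", "-->"),
    ("cdata", "<![CDATA[", "]]>"),
    ("doctype", "<!DOCTYPE", ">"),
    ("tag", "<", ">"),
]


def lex_constructs(source):
    """Split source into (kind, text) pairs, one branch per construct."""
    tokens = []
    pos = 0
    while pos < len(source):
        for kind, opener, closer in CONSTRUCTS:
            if source.startswith(opener, pos):
                end = source.find(closer, pos + len(opener))
                if end == -1:
                    raise ValueError(f"unterminated {kind} at offset {pos}")
                end += len(closer)
                tokens.append((kind, source[pos:end]))
                pos = end
                break
        else:
            # No opener matched, so this is plain text up to the next "<".
            nxt = source.find("<", pos)
            if nxt == -1:
                nxt = len(source)
            tokens.append(("text", source[pos:nxt]))
            pos = nxt
    return tokens
```

Even in this toy form, every new construct adds a row to the table and another ordering constraint, which is the kind of complexity the list above complains about.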
2. Contents of future meta tags will be displayed by old browsers
Imagine the creation of a new tag <foobar> that is an instruction to the browser to do something, not data to be shown. Old browsers will ignore the unknown tag itself but still treat whatever it contains as data and display it. This will annoy users and authors.
Consider the current practice of placing <script> in the body, and imagine a browser that does not know what to do with <script>.
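The effect is easy to reproduce with a toy "old browser". The sketch below uses Python's standard `html.parser` module; the `OldBrowser` class and the `render` helper are hypothetical names, and the model is deliberately crude: it ignores every tag and displays all character data, roughly the way a browser treats elements it does not recognize.

```python
from html.parser import HTMLParser


class OldBrowser(HTMLParser):
    """Toy renderer: knows no tags at all, displays every piece of text."""

    def __init__(self):
        super().__init__()
        self.shown = []

    def handle_data(self, data):
        # Text is displayed no matter which tag it sits inside.
        self.shown.append(data)


def render(page):
    """Return the text this toy browser would put on screen."""
    browser = OldBrowser()
    browser.feed(page)
    browser.close()
    return "".join(browser.shown)
```

Feeding it `<p>Hello</p><foobar>prefetch next page</foobar>` returns `Helloprefetch next page`: the instruction text leaks onto the page right next to the real content.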
Possible resolutions, and why they are bad solutions:
- Use <! > for all metadata tags.
  - Wrong context. These are currently instructions to the lexer, not the renderer.
  - Breaks backwards compatibility.
- Establish a tag type hierarchy and an inheritance language, and expect authors to define the next HTML version's meta tags as descendants of the meta tag.
  - Every HTML7 page would have to include these redundant definitions for backwards compatibility, wasting bandwidth.
- Force all metadata tags to go in the <head> section.
  - Ideally, metadata should be applicable to any scope, and any tag should be able to act as its own scope for metadata instructions.
3. Collision between anonymous self-closing tags and anonymous closing tags
If anonymous (nameless) tags are allowed in a future version of the language, the structure </> can be read in one of two ways:
- A self-closing tag that has no name
- The closing tag for an earlier opening tag, with the name omitted.
Consider this example:
< attr=value>
< /> <!-- Is this a closing tag or a blank tag inside the anonymous tag? -->
< /> <!-- Does this close the first tag or produce a second tag after it? -->
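The ambiguity can be enumerated mechanically. In the hedged sketch below, the example is reduced to three tokens (one anonymous opening tag followed by two ambiguous slash tags), and every combination of readings is tried. The token names, `interpret`, and `valid_readings` are all hypothetical, and the rule that an unclosed tag makes a reading invalid is an assumption borrowed from strict XML-style syntax.

```python
from itertools import product


def interpret(tokens, readings):
    """Build a nesting tree, reading each ambiguous slash tag as told.

    Returns a list-of-lists tree, or None if the reading is not well formed.
    """
    readings = iter(readings)
    root = []
    stack = [root]
    for tok in tokens:
        if tok == "open":
            node = []
            stack[-1].append(node)
            stack.append(node)
        else:  # the ambiguous "</>" token
            if next(readings) == "close":
                if len(stack) == 1:
                    return None  # nothing left to close
                stack.pop()
            else:
                stack[-1].append([])  # empty anonymous tag
    return root if len(stack) == 1 else None  # assumption: unclosed tags are errors


def valid_readings(tokens):
    """Map each well-formed combination of readings to its tree."""
    slashes = sum(1 for tok in tokens if tok != "open")
    results = {}
    for readings in product(("close", "empty"), repeat=slashes):
        tree = interpret(tokens, readings)
        if tree is not None:
            results[readings] = tree
    return results
```

For `["open", "slash", "slash"]` two readings survive. `("close", "empty")` gives `[[], []]`, two sibling tags. `("empty", "close")` gives `[[[]]]`, one tag containing an empty one. Both are well formed, so a parser has no principled way to choose between them.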
Possible resolutions, and why they are bad solutions:
- Forbid anonymous opening tags (require names for all tags).
  - I can't think of a reason this is bad, other than that it would scuttle one of my ideas for changing the language.
- Forbid anonymous closing tags.
  - Anonymous tags now cannot be closed.
- Forbid empty anonymous tags.
  - Adds some complexity for writers.
  - Potential to cause unexpected behavior if a tag dynamically becomes anonymous and empty.
  - Inconsistent with rest of language.
  - Authors will use </> regardless.
  - Inconsistent with rest of language.
  - Breaks backwards compatibility.
  - Does not address the problem.
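HTML5 avoids this problem: every tag must have a name, and on HTML elements the trailing slash is permitted only on void elements such as <br>, where it has no effect.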
4. CDATA ]]> versus JavaScript
CDATA sections end at ]]>. The ]] part is a common character pair in JavaScript and any other scripting language that uses [] for array indexing, so an expression such as a[b[0]]>c ends the section early.
Possible resolutions, and why they are bad solutions:
- Point and laugh at all the JavaScript programmers.
  - Does not fix the problem.
- Use a different end-of-scope token.
  - Breaks backwards compatibility.
  - Arbitrary contents will trigger the same problem.
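A small sketch shows the collision. In XML the terminator is the three-character sequence `]]>`, so the first function reports where a section holding a given script would end early. The second applies a widely used workaround that these notes do not mention: splitting each `]]>` across two adjacent CDATA sections. The function names and the sample script are hypothetical.

```python
import xml.etree.ElementTree as ET


def premature_end(script):
    """Offset where a CDATA section holding `script` would end early, or -1."""
    return script.find("]]>")


def wrap_cdata(script):
    """Wrap script in CDATA, splitting every "]]>" across two sections.

    This is a common workaround, not something the notes above propose.
    """
    return "<![CDATA[" + script.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def round_trip(script):
    """Parse the wrapped script back out with the standard XML parser."""
    return ET.fromstring("<script>" + wrap_cdata(script) + "</script>").text


SAMPLE = "if (grid[rows[0]]>limit) { shrink(); }"
```

Here `premature_end(SAMPLE)` is 15, the offset of the `]]` that closes the nested index, while `round_trip(SAMPLE)` returns the script unchanged. The workaround only moves the burden onto whoever generates the markup, which fits the objection that any fixed terminator can collide with arbitrary contents.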
Some of the changes that I would make to HTML:
- Make comments nest.
- Make line breaks a character &br; rather than a tag <br>.
- Separate lexing and parsing to allow parsing to be done in parallel. In plain English, the lexer should not need to know the name of the tag it is inside to recognize where the tag ends. In practice, this would mean using an XML-like strict syntax rather than HTML5's developer-friendly, error-tolerant approach. Browser developers could always choose to accept a more lenient syntax, and web servers could automatically tidy HTML code before sending it. In general, the language syntax should be completely separate from the concept of what the tags are meant to represent.
- Any tag should be able to have a src attribute. This will end the web-breaking JavaScript circus act that "HTML5" sites perform to cache data locally and reduce server processing time. To optimize traffic flows, the HTTP spec can be modified to send timestamps for attached files in a HEAD response and to send multiple files in response to a GET.
- Allow tag inheritance through an "aka" attribute:
<p attr="value" aka="fred" />
<fred>This is a paragraph with predefined attributes</fred>
- CSS should be considered a language for mass-assigning attributes to HTML tags. The styles should be considered official attribute sets. Some of the deprecated style tags should be brought back as tags that are guaranteed to have certain style attributes set to standards.
- Canvas should be an optional plugin with its own standard scripting interface. Browsers should not be required by the HTML spec to support it.
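The lexing/parsing split proposed above can be sketched as follows. This is a rough illustration under a stated assumption: a hypothetical strict syntax in which `<` and `>` never appear unescaped outside tag delimiters. Real XML does allow `>` in attribute values. The names `lex_strict`, `classify`, and `classify_all` are made up for the example. The lexer finds tag boundaries without looking at tag names, and each token is then classified on its own, so that step can be handed to parallel workers.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# A token is either one complete tag or a run of text; no tag-name table needed.
TOKEN = re.compile(r"<[^<>]*>|[^<]+")


def lex_strict(source):
    """Split source into tag and text tokens, rejecting stray '<'."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"stray '<' at offset {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def _name(body):
    parts = body.split()
    return parts[0] if parts else ""  # "" stands for an anonymous tag


def classify(token):
    """Classify one token using nothing but the token itself."""
    if not token.startswith("<"):
        return ("text", token)
    body = token[1:-1].strip()
    if body.startswith("/"):
        return ("close", _name(body[1:]))
    if body.endswith("/"):
        return ("empty", _name(body[:-1]))
    return ("open", _name(body))


def classify_all(source):
    """Lex once, then classify tokens independently on a thread pool."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(classify, lex_strict(source)))
```

For `<p>a <b>bold</b><br/></p>` the lexer yields seven tokens and `classify_all` labels them open `p`, text, open `b`, text, close `b`, empty `br`, close `p`. HTML cannot be lexed this way because the contents of elements like <script> follow different rules, so the lexer has to know which tag it is inside.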