AXDC – Sam Ruby, IBM


Sam is reprising a talk he did several years ago.

Most people learned HTML via “View Source”. What’s the downside to this?

Must be willing to seek and see the truth. Look at the messages and actually understand them (The Matrix).

Focus on identity, is x equal to y?

Unicode, letter “a” in two different fonts. Yes, they are both U+0061.

“a” vs. “A”, not the same.

Look at encodings, 0x41 all the way down.

Attractive nuisance — hazardous object which can be expected to attract children to investigate or play.

As applied to Unicode and RSS. People don’t worry about encodings, they do the simplest thing, take the default, strcat strings from various encodings, leave it up to the browser to handle the resulting mess (unfortunately, it often did).
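A minimal sketch of what byte-level concatenation across encodings does, in Python. The two fragments are hypothetical stand-ins for text pulled from different sources; nothing here is from the talk itself.

```python
# Hypothetical fragments: the same character "é" arriving from two sources.
latin1_part = "\u00e9".encode("iso-8859-1")   # b'\xe9'
utf8_part = "\u00e9".encode("utf-8")          # b'\xc3\xa9'

# Byte-level concatenation, the moral equivalent of strcat.
mixed = latin1_part + utf8_part


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Not valid UTF-8: 0xE9 announces a multi-byte sequence that never arrives.
as_utf8 = try_decode(mixed, "utf-8")
# "Valid" ISO-8859-1, but the second é turns into two unrelated characters.
as_latin1 = try_decode(mixed, "iso-8859-1")
```

No single encoding label makes the combined bytes come out right, which is the mess the browser is left to guess at.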

Two more A’s, U+0041, U+0391 (Greek Alpha).

Four things look like a lowercase i (U+0069, U+2139, U+2148, U+2170).

Imagine the confusion that would result if these were allowed in domain names and URLs!

Two glyphs could look the same yet have different encodings.
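To see the look-alikes for what they are, one option is to ask for their code points and names. A small Python sketch using the standard `unicodedata` module, with the characters taken from the list above:

```python
import unicodedata

# The look-alike characters mentioned in the notes.
look_alikes = ["\u0041", "\u0391", "\u0069", "\u2139", "\u2148", "\u2170"]

for ch in look_alikes:
    # Print the code point and the official Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Latin A and Greek Alpha may render identically, but they never compare equal.
same = "\u0041" == "\u0391"
```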

Now get into accented characters. Multiple ways to do “e with accent.” Comparing them will produce different results.
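A short Python sketch of the accented "e" problem, assuming the two spellings in question are the precomposed character and the letter followed by a combining accent. Normalizing both sides first is one common way to make the comparison behave.

```python
import unicodedata

composed = "\u00e9"      # single code point: e with acute
decomposed = "e\u0301"   # two code points: e + combining acute accent

# A naive comparison sees two different strings.
raw_equal = composed == decomposed

# After normalizing both to the same form (NFC here), they match.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```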

Default encodings. HTML is iso-8859-1. XML is UTF-8. Would be nice if XML required the encoding declaration. Userland RSS feeds still don’t have them. Microsoft has Windows 1252.

Planet RDF will take HTML, run it through an iso-8859-1 to utf-8 conversion, producing good utf-8 triples, even if the input isn’t iso-8859-1. This chokes on things like the Euro character.
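A sketch of why a blind conversion like this never fails yet still gets the Euro wrong. The function name is hypothetical and this is not Planet RDF's actual code, only the same idea in a few lines of Python.

```python
def blind_latin1_to_utf8(raw):
    """Hypothetical converter: assume the input is ISO-8859-1 no matter what."""
    # ISO-8859-1 assigns a character to every byte, so this decode cannot fail.
    return raw.decode("iso-8859-1").encode("utf-8")


euro_cp1252 = "\u20ac".encode("cp1252")         # the Euro sign as Windows-1252: b'\x80'
converted = blind_latin1_to_utf8(euro_cp1252)   # well-formed UTF-8 bytes
round_trip = converted.decode("utf-8")          # U+0080, a control character, not the Euro
```

The output is perfectly good UTF-8, which is exactly why the damage goes unnoticed.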

Footer of msdn2.microsoft.com has this issue, very hard to solve.

Win-1252 differs from 8859-1 in 27 places, including the Euro symbol and smart quotes. Kills RSS feeds. Most web clients are on Windows, most web servers don’t indicate the encoding. Most browsers have given up and consider 8859-1 and 1252 to be the same.
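The differences can be enumerated directly. A Python sketch, assuming Python's built-in `cp1252` and `iso-8859-1` codecs as the reference tables; bytes that Python's cp1252 codec leaves undefined are skipped rather than counted.

```python
def cp1252_differences():
    """Map each byte that decodes differently under cp1252 to its cp1252 character."""
    diffs = {}
    for byte in range(256):
        raw = bytes([byte])
        try:
            as_1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            continue  # undefined in Python's cp1252 table
        if as_1252 != raw.decode("iso-8859-1"):
            diffs[byte] = as_1252
    return diffs


diffs = cp1252_differences()
for byte, ch in sorted(diffs.items()):
    print(f"0x{byte:02X} -> U+{ord(ch):04X}")
```

All of the differing bytes fall in the 0x80 to 0x9F range, where 8859-1 has control characters and 1252 has printable ones.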

Google for pages that link to feedvalidator.org yet contain bad 1252 characters. Code points are used to generate bad XML entities. Don’t confuse encoding with code point. This is a pervasive problem.
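One way the encoding and code point confusion shows up, sketched in Python: building a numeric character reference from the Windows-1252 byte value instead of the Unicode code point. The helper name is made up for illustration.

```python
def numeric_char_ref(ch):
    """Build a numeric character reference from the Unicode code point."""
    return f"&#{ord(ch)};"


left_quote = "\u201c"   # left double quotation mark (a "smart quote")

# The mistake: using the byte value from the Windows-1252 encoding.
cp1252_byte = left_quote.encode("cp1252")[0]
wrong_ref = f"&#{cp1252_byte};"

# The reference XML expects: the code point, independent of any encoding.
right_ref = numeric_char_ref(left_quote)
```

In XML the first form names a C1 control character, not a quotation mark.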

UTF-8 has a nice property, strings can be recognized as such with high probability.
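A minimal Python sketch of that property: try to decode as UTF-8 and see whether it succeeds. The function name is hypothetical. Pure ASCII also passes, since ASCII is a subset of UTF-8, so this is a heuristic rather than a proof.

```python
def looks_like_utf8(data):
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'

utf8_ok = looks_like_utf8(utf8_bytes)
latin1_ok = looks_like_utf8(latin1_bytes)
```

Text in a legacy single-byte encoding rarely happens to form valid multi-byte UTF-8 sequences, which is what makes the check useful.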

Now consider URIs. Lots and lots of cases. URI spec recently revised; normalization method not yet nailed down. Add Unicode to make it even more interesting. Don’t believe it when they say that URIs give you queries for free. What encoding is used for the data in the query string?

CLR’s System.Uri.Equals says that all of his examples are the same; using these as XML namespaces requires them each to be distinct!
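The two notions of equality can be put side by side. A Python sketch with made-up example URIs (the ones from the talk are not recorded here) and a deliberately small, hypothetical normalizer that only lowercases the scheme and host, drops a default port, and fills in an empty path. Real normalization has many more cases.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(uri):
    """Hypothetical, partial normalizer: scheme, host, default port, empty path."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


a = "HTTP://Example.COM:80"
b = "http://example.com/"

literal_equal = a == b                           # how XML namespace names are compared
normalized_equal = normalize(a) == normalize(b)  # how a URI library might compare
```

Which answer is right depends on who is asking, which is the point of the namespace example.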

Presumptions about layering are very dangerous. Case in point, HTTP has tons of problems.

GET /index.html HTTP/1.0

This omits the supposedly optional Host: header, which is really required. Leaky DNS abstraction.
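For comparison, a small Python sketch that builds the request text with and without the Host header. The helper and the host name are hypothetical; the request line is the one shown above.

```python
def build_request(path, host=None, version="1.0"):
    """Hypothetical helper: build the text of a minimal GET request."""
    lines = [f"GET {path} HTTP/{version}"]
    if host is not None:
        # Without this line, a server hosting several sites on one IP address
        # cannot tell which site the request is for.
        lines.append(f"Host: {host}")
    return "\r\n".join(lines) + "\r\n\r\n"


without_host = build_request("/index.html")
with_host = build_request("/index.html", host="www.example.com")
```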

Content-type, XML header, meta tag in HEAD, and form post all have a charset or encoding attribute.

Ruby’s Postulate:

Accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata. HTTP Content-Type is a hint, XML declarations are ignored by browsers, meta charset is typically seen as authoritative, charset attributes on elements aren’t supported.

More layering issues…

Accept-encoding: gzip

Mozilla unzips, starts to parse, http-equiv causes it to restart, issues page reload with bad byte range. This is an example of bad layering, does the byte range apply before or after gzip?

Character references, lots of ways to say the same thing, all the same if you respect all of the rules for XML processing. Broken in lots of RSS feeds, obvious to him and a few others, not to everyone else.
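A quick Python sketch of "all the same if you respect the rules", using the standard library XML parser. The spellings below are illustrative ways of writing the less-than sign.

```python
import xml.etree.ElementTree as ET

# Several spellings of the same character, "<".
spellings = ["&#60;", "&#x3c;", "&#x3C;", "&#0060;", "&lt;"]

# A conforming XML parser resolves every one of them to the same text.
parsed = [ET.fromstring(f"<t>{s}</t>").text for s in spellings]
```

Code that compares the raw markup instead of the parsed text sees five different strings.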

Ampersands in URLs are supposed to be encoded within HTML.

Escaping in XML is broken for character references. There is no way to look at a string of XML and determine if it is escaped or not. Trips up the pros every day.

This is all stated explicitly in Atom; this will allow for better tool support and better diagnostics.

Double-escaped: to put “A&B” in an RSS feed, must say “A&amp;amp;B”.
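The double escaping is easy to reproduce. A Python sketch using `html.escape` and `html.unescape` from the standard library as stand-ins for whatever escaping a feed generator applies:

```python
from html import escape, unescape

text = "A&B"
once = escape(text)    # escaped as HTML: "A&amp;B"
twice = escape(once)   # escaped again for embedding in the feed: "A&amp;amp;B"

# The ambiguity: "A&amp;B" could be the escaped form of "A&B",
# or literal text that still needs escaping. The string alone cannot say.
maybe_escaped = "A&amp;B"
reading_as_escaped = unescape(maybe_escaped)   # "A&B"
reading_as_literal = maybe_escaped             # "A&amp;B"
```

Both readings are plausible, so a consumer has to be told how many layers of escaping were applied.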

RSS & Atom — can title contain markup, and many other issues. He’ll talk about all of this tomorrow.

WS-*, take the blue pill. Hide it all in tools.

Summary, comparing characters and URIs is surprisingly difficult; it can lead to security holes. Layering is the problem, not the solution.

You won’t find reality in any specification.

Spec authors are responsible for the confusion that they create.
