Web weirdness

Wed Jun 20 14:11:06 BST 2007

David Cantrell writes:

> My understanding is that you didn't need to encode ampersands in URLs

You probably don't _need_ to, in that browsers try to be very
accommodating, but if you send a page containing a URL with a raw
ampersand through the W3C HTML validator it complains with:

  Entity references start with an ampersand (&) and end with a semicolon
  (;). If you want to use a literal ampersand in your document you must
  encode it as "&amp;" (even inside URLs!).

and points you here:

  http://www.htmlhelp.com/tools/validator/problems.html#amp
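
As a rough illustration of what the validator is objecting to, here is a
minimal Python sketch that finds raw ampersands in an attribute value. The
names RAW_AMPERSAND and raw_ampersands are made up for this example, and
the check is deliberately simplified: it accepts anything shaped like
"&name;" or "&#123;" without checking that the entity is actually
defined, so it is not a substitute for a real validator.

```python
import re

# Hypothetical helper, not part of any validator: matches an "&" that is
# NOT the start of something shaped like "&name;", "&#123;" or "&#x1F;".
RAW_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def raw_ampersands(attribute_value):
    """Return the offsets of ampersands that should have been "&amp;"."""
    return [m.start() for m in RAW_AMPERSAND.finditer(attribute_value)]

# A raw ampersand separating query parameters is flagged ...
print(raw_ampersands("page.cgi?a=1&image=blah"))
# ... while the properly encoded form is not.
print(raw_ampersands("page.cgi?a=1&amp;image=blah"))
```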

> unless they would otherwise look like the beginning of an entity - so
> the string '&quot;' would have to be represented as '%XXquot;' or
> somesuch.

There are two ways of writing that, with different meanings:

* &amp;quot%3B is the way of writing in HTML a URL fragment which will
  display in a browser's URL bar like &quot%3B, where (presuming this is
  in the query part) the ampersand signifies that this is a new
  parameter.

* %26quot%3B would appear exactly like that in a browser's URL bar; as
  part of a URL query it is all literal text, with all of the
  characters, including the ampersand, being continuations of the value
  for the preceding parameter.
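
The two cases can be sketched with Python's standard library, treating URL
encoding and HTML encoding as two separate layers. The parameter name "a"
and the values here are invented for the example; only html.escape,
urllib.parse.quote and urllib.parse.parse_qsl are real library calls.

```python
from html import escape
from urllib.parse import parse_qsl, quote

# Case 1: the ampersand is a parameter separator. It stays a plain "&" in
# the URL itself; only the HTML layer turns it into "&amp;".
separator_query = "a=1&quot%3B=2"            # made-up query string
separator_html = escape(separator_query)     # what goes in the href

# Case 2: the ampersand is literal data inside a value, so it is
# percent-encoded at the URL layer and HTML escaping has nothing to do.
literal_query = "a=" + quote("1&quot;", safe="")
literal_html = escape(literal_query)

print(separator_html)              # a=1&amp;quot%3B=2
print(parse_qsl(separator_query))  # two parameters: a, and one named quot;
print(literal_html)                # a=1%26quot%3B
print(parse_qsl(literal_query))    # one parameter whose value is 1&quot;
```

Keeping the layers apart like this (percent-encode the data first, then
HTML-escape the finished URL) avoids having to decide by eye which
ampersands mean what.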

> &image isn't a named entity though.  Anything that thinks it is is
> broken.

I believe there are circumstances in SGML (which HTML claims to be) in
which the trailing semicolon is optional.  I don't think this is one of
them, but ...

> Browsers, for example, treat &image=blah correctly,

Arguably the 'correct' thing to do is to report a syntax error and
refuse to parse the document.  Otherwise software has to guess at what
the error and fix are; that different software makes different guesses
isn't surprising.

Smylers