Re: The Reference Concrete Syntax is not Current Practice (Was Re: Standards, Work Groups, and Reality)

Arjun Ray (aray@pipeline.com)
Wed, 27 Sep 95 00:39:07 EDT
On Tue, 26 Sep 1995, Marc Salomon wrote:

> Arjun Ray <aray@pipeline.com> writes:
>
> |I have trawled untold megabytes of the mail archives and have yet to find
> |a good discussion of the *really* insidious problem. I'm nearly convinced
> |that the SGML gestalt actively hinders its appreciation, by conflating
> |parsing with validation as a practical prescription, so that *categories*
> |of error fail to be distinguished.
>
> I remember in '93 reading a paper (from cern?) asserting that browser user
> agents should be forgiving in parsing HTML.

And this legitimate Quality of Implementation issue has been perniciously
confounded with bugs paraded as features. There's no such thing as error
*recovery* without error *detection* first. The failure to distinguish
categories of variance handling, as if everything were a single
monolithic "parsing HTML liberally" problem, has swept the Concrete Syntax
disaster under the rug of DTD violations.

There are at least three levels on which a practical "parser" -- qua
said single monolithic entity -- must contend with unexpected input:

1. Semantic, e.g. unrecognized GIs or attributes. The parser has simply
failed to identify an otherwise lexically correct piece of markup. HTML
already has an application convention for this, and how the Working Group
deals with the constant bombardment of tag-featurism isn't at issue here.
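
(A throwaway sketch of my own, just to make that convention concrete --
the GI list is invented for illustration. An unrecognized GI means the
tags are dropped but their content still renders; note that nothing is
ever *reported*:)

#include <stdio.h>
#include <string.h>

static const char *known_gis[] = { "P", "EM", "STRONG", "A", "IMG" };

static int is_known_gi(const char *gi)
{
    size_t i;
    for (i = 0; i < sizeof known_gis / sizeof known_gis[0]; i++)
        if (strcmp(gi, known_gis[i]) == 0)
            return 1;
    return 0;
}

int main(void)
{
    /* <BLINK>warning</BLINK>: BLINK is unknown, so both tags vanish
       and "warning" is still rendered -- with no error registered. */
    printf("BLINK -> %s\n",
           is_known_gi("BLINK") ? "honor markup" : "drop tags, keep content");
    return 0;
}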

2. Structural, e.g. elements out of context, or overlaps. The parser has
identified a lexically correct piece of markup; it just doesn't belong at
the point where it's encountered. Clearly, the behavior here is
implementation-defined, to the extent that the standard contains no
explicit DTD enforcement rules. Again, not the issue here.
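
(Again a sketch of mine, not anyone's actual code: detection at this
level is as cheap as a tag stack, which makes silence here a choice, not
a necessity. The classic overlap <B><I>...</B>...</I> is the test case:)

#include <stdio.h>
#include <string.h>

#define MAXDEPTH 64

static const char *open_gis[MAXDEPTH];
static int depth;

static void start_tag(const char *gi)
{
    if (depth < MAXDEPTH)
        open_gis[depth++] = gi;
}

static void end_tag(const char *gi)
{
    if (depth == 0 || strcmp(open_gis[depth - 1], gi) != 0)
        fprintf(stderr, "structural error: </%s> out of context\n", gi);
    else
        depth--;        /* what to *do* about a mismatch is the
                           implementation-defined part */
}

int main(void)
{
    start_tag("B");
    start_tag("I");
    end_tag("B");       /* detected: I is still open */
    end_tag("I");
    return 0;
}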

3. Lexical, e.g. the notorious missing quote and arbitrary COM balancing
in comment declarations. Here, the lexically correct piece of markup is
*itself* in doubt. But how is the tokenization *failure* registered? This
is where the Concrete Syntax comes in -- or so we would hope.
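
(What registering a tokenization failure might look like, in a sketch of
my own devising: inside a tag, a literal is scanned to its matching
quote, and running out of input first is *detected* as a lexical error --
the precondition for any talk of recovery:)

#include <stdio.h>

/* Scan an attribute value literal; s[0] is the opening quote.
   Return a pointer just past the closing quote, or NULL on failure. */
static const char *scan_literal(const char *s)
{
    char q = *s++;
    while (*s != '\0' && *s != q)
        s++;
    if (*s == '\0') {
        fprintf(stderr, "lexical error: unterminated %c...%c literal\n",
                q, q);
        return NULL;    /* failure registered; recovery may now follow */
    }
    return s + 1;
}

int main(void)
{
    scan_literal("\"work>");      /* the notorious missing quote: caught */
    scan_literal("\"a > b\"");    /* legal '>' inside a literal: fine    */
    return 0;
}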

You are grievously misled if you believe the lexical behavior of the
Mosaic family of browsers to be an instance of error recovery -- that
there is some supplementary "smart" heuristic at work on top of a "basic"
Concrete Syntax implementation. For that to be true, all legal input
would have to be tokenized correctly. And that is manifestly *not* the
case. In fact, there is no error *detection* at all. The *tokenization*
algorithm itself implements *some other* syntax.

(a) It's text until STAGO, ETAGO, or MDO.
(b) After STAGO/ETAGO/MDO, immediately scan forward for the first
occurrence of '>'. Designate that as TAGC/MDC, and copy everything
in between (that was passed over in this scan) to a buffer.
(c) Tokenize buffer for GI, attributes and values.
(d) Goto (a)
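
(The same thing in C, as a sketch -- my paraphrase for illustration, not
a quotation of HTMLparse.c; step (c) is elided and the buffer is just
printed. Note that step (b)'s blind scan consults neither quotes nor
comment delimiters:)

#include <stdio.h>
#include <string.h>

static void naive_scan(const char *p)
{
    while (*p != '\0') {
        if (*p == '<') {                        /* STAGO/ETAGO/MDO all
                                                   begin with '<'      */
            const char *q = strchr(p + 1, '>'); /* (b): first '>' wins */
            if (q == NULL)
                break;
            printf("[buffer: %.*s]", (int)(q - (p + 1)), p + 1);
            p = q + 1;                          /* (d): back to (a)    */
        } else {
            putchar(*p++);                      /* (a): ordinary text  */
        }
    }
    putchar('\n');
}

int main(void)
{
    /* The missing quote "works": the blind scan never sees the quote. */
    naive_scan("before <missing quotes=\"work> after");
    return 0;
}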

If you're wondering why <missing quotes="work> and <!-- broken comments work>
succeed while legal input fails, the algorithm above should make it clear. It
happens to be the entirety of the tokenization in HTMLparse.c. Calling
this "forgiving" is simply wrong: you have to get a spec *right* before
you can worry about what to do when something goes wrong.
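
To make "legal input fails" concrete, take one example of my own (IMG and
ALT are real enough; the point is the legal '>' inside an attribute value
literal). Given the perfectly conforming tag

    <IMG ALT="a > b" SRC="x.gif">

step (b) designates the '>' *inside the literal* as TAGC, so step (c) is
handed the truncated buffer

    IMG ALT="a

and the leftover

    b" SRC="x.gif">

is rendered as ordinary text. Meanwhile <missing quotes="work> sails
straight through, because the blind scan never knew there was a quote to
balance.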

And yet, the lexical syntax embodied in HTMLparse.c is the syntax that
other software is converging to, in the name of bug-for-bug compatibility.
(Try <URL:http://www.ctg.com/> -- look at Line 20 of the HTML source to
find a problem that has been known for a while. Should it be fixed? Why?)

I find inexplicable -- and distressing -- the reluctance of the Working
Group to confront this issue, especially the significance of HTMLparse.c.

So, I'll try once more: can the Working Group offer an explanation of why
HTMLparse.c should be considered a conforming implementation of the
Reference Concrete Syntax as it applies to HTML?

Arjun Ray
(I speak for myself only.)