Re: The Reference Concrete Syntax is not Current Practice (Was Re: Standards, Work Groups, and Reali

Arjun Ray (aray@pipeline.com)
Mon, 25 Sep 95 23:37:41 EDT

On Mon, 25 Sep 1995, Daniel W. Connolly wrote:

> In message <Pine.3.89.9509242308.A25843-0100000@alpha>, Arjun Ray writes:
> >
> The HTML 2.0 spec defers to the SGML spec on lexical issues. But
> it does a certain amount of handwaving about "parsed token sequences,"
> a term which is not really used or defined in the SGML spec.

SGML takes no prisoners: parsing and validation are identical concepts.
Scream-and-die when something doesn't parse^H^H^H^H^Hvalidate is the SGML
way. The fragility of the Concrete Syntax reflects this, as does the
unwillingness to distinguish lexical tokenization from content model
enforcement as essentially *different* meanings of "parsing".
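
The two meanings come apart easily. A fragment such as

    <UL><P>Loose prose where only LI may appear.</UL>

tokenizes without incident -- every tag is lexically impeccable -- yet
fails validation, because the 2.0 DTD gives UL a content model of
(LI)+. Current practice breaks the *first* stage, which is the deeper
problem.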

> I'm working on a collection of materials to get the lexical stuff
> in HTML cleared up once and for all. Things like a conformance
> test suite, reference code, and some sort of paper to explain this
> stuff.

This is very good news. But it may be too late.

> I expect the gist of the paper to be:
>
> HTML 2.0 lexical syntax is the same as for "Basic SGML Documents"
> (SGML std section 15.1.1), with the following exceptions
> as application conventions:
>
> * SHORTREF applies only to attributes, not to tags. More
> precisely:
> * no Empty Start-tags: <>, Unclosed Start-tags: <abc<def>
> nor NET-enabling Start tags: <em/foo/ (SGML 7.4.1)
> * no Empty End-tags: </>, Unclosed End-tags: </foo<bar>,
> nor Null End-tags: <em/foo/ (SGML 7.5.1)

Er... I believe that's SHORTTAG YES in the FEATURES section. And even
though it is criminally overloaded, I think only NET is the really
problematic feature. Empty tags may foul up in the event of unrecognized
GIs, but that could be an application convention too: unrecognized tags
are treated as content-model=EMPTY.
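
To be concrete about the hazard (contrived markup, assuming SHORTTAG
YES): per 8879,

    <EM/so/ it goes

is token-for-token identical to

    <EM>so</EM> it goes

but no deployed browser I know of honors the NET delimiters; the
slashes come out as data, or worse. Empty and unclosed tags at least
fail in ways an error-correcting lexer can recover from.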

> * no internal document type declaration subset.
> (SGML 11.1 production 110,112)

We'll need some sort of macro mechanism, unless the idea is to force
everyone to use an editing tool with such facilities.
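
The internal subset is the only macro facility SGML hands an author.
A hypothetical declaration:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN" [
      <!ENTITY cs "Reference Concrete Syntax">
    ]>

lets &cs; stand for the declared text anywhere in the instance. Take
the subset away, and every such substitution has to happen inside an
authoring tool instead.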

> * no marked sections (SGML 10.4)

A pity, because <![ IGNORE [ ... ]]> is a much more robust way (in the
sense of the author's intent being recoverable by an error-correcting
lexer) of "commenting" out chunks of text. Changing the default
status-keyword from INCLUDE to IGNORE might simplify things also.
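
Compare (hypothetical markup):

    <![ IGNORE [
      <p>A withdrawn draft -- tags, entities & all.
    ]]>

with the same chunk wrapped in <!-- ... -->: per 8879 the comment ends
at the interior --, and what follows is a syntax error, whereas the
marked section survives anything short of a stray ]]>.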

> And I'm working on a matching lex specification, since the grammar in
> the SGML specification isn't sufficient to build a parser from it --
> it's ambiguous, and it mixes all sorts of lexical levels together.
>
> And I hope to build the validation suite collaboratively. For a preview,
> take a look at:

Cool stuff :-)

> >In all the agonizing over content model violations, sight has been lost of
> >the far more fundamental fact that current practice has been divorced
> >from the Reference Concrete Syntax.
>
> I'd agree it's time to bring this issue to the forefront again.

Thank you. Considering Amanda Walker's testimonial (and I can tell a
distressingly similar story), simply wishing the Concrete Syntax into
existence isn't going to solve any problems, or deal squarely with the
growing mass of material already out there on the Web that wouldn't know
the CS from a hole in the ground.

> > HTML *as it is being used in practice*
> >poses a *tokenization problem* for any SGML-compliant implementation.
> >It's ad hoc parsing all the way, and to expect implementors today to
> >consider SGML compliance *at the lexical level* is to ask competitive
> >suicide of them, insofar as HTML is taken to *mean* current practice.
>
> Well... I'm not asking the vendors to take HTML to mean current
> practice to that extent.

I think Amanda has already pointed out the *realities* of that :-(

> >Does the Working Group have an estimate of the percentage of existing
> >documents that can "conform" *only* to the parsing heuristic embodied in
> >
> ><URL:
> >ftp://ftp.ncsa.uiuc.edu/Mosaic/Unix/source/Mosaic-src/libhtmlw/HTMLparse.c>?
>
> Well... the folks at OpenText could probably answer that question to
> about 6 significant digits. My guesstimate is 47.94567%. But a more
> important question is: which will get us to reliable interoperability
> quicker: getting the document base to be SGML-conforming, or getting
> all the deployed software to agree on a specification derived from
> HTMLparse.c?

How much of the deployed *popular* software is *not* derived from
HTMLparse.c? There's a goal conflict here. Somebody might be able to
devise a BNF for this sophomoric heuristic, but it won't be interoperable
with SGML systems. OTOH, for interoperability, the magnitude of the
fly-swatting problem is something the Working Group seems reluctant to
discuss (vide Larry Masinter's comment).

> I suggest we approach this education problem -- which is what it is --
> from the top down: educate the implementors of browsers and authoring
> tools, and the authors of HTML "how-to" documents. Let the consumer
> community learn from them. That's why I started the HTML 2.0
> specification effort in the first place.

I'm sorry to disagree, but what you have here is actually a *re*-education
problem. The explosive growth of the Web mainly comprises people who come
from a different software universe/paradigm: closed proprietary programs,
misleading manuals, a thriving genre of "tips, tricks and traps" books,
and so on. The common theme is that the *program* is the *final* arbiter
of what's OK. People learn from and adapt to the implementation. There's
no such thing as "the program is doing this wrong."

The real idiocy with HTMLparse.c isn't that it was concocted (it serves as
an object lesson in why it's *vital* to get the specs right) but that
it continues to be deployed. It's been around for over two and a half
years. It has become a vested interest, and the GUM, comfortably cosy
with their <missing "quotes> and <!-- broken comments>, aren't likely to be
swayed from the "truth" they imagine they've already grokked. HTMLparse.c,
through the Mosaic family of browsers, has sabotaged the Concrete Syntax
and with it any real hope for an SGML-interoperable HTML. That is simply a
fact.
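
To see what the GUM has actually been trained on, consider two
hypothetical but depressingly typical fragments:

    <A HREF="home.html>Home</A>
    <!-- a note with a -- in the middle>

A conforming lexer scans the attribute value literal until the next
quote, swallowing markup wholesale, and ends the comment at the
interior --, flagging the remainder as an error. HTMLparse.c quietly
"fixes" both, and its users now believe that *is* HTML.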

Let the people responsible for HTMLparse.c try to make a standard out of
it. I see no reason why this Working Group should harbor it, unless the
Group is as much in thrall to a mere name as everybody else seems
to be. That would be a shame.

> >And from that standard's perspective, how much of current practice is the
> >Working group prepared to declare explicitly as non-conforming?
>
> Well... with the publication of HTML 2.0, we've declared a whole lot
> of stuff to be non-conforming. At the lexical level, I don't expect to
> budge much. (At the content-model level, we're in need of a clean
> extension mechanism.)
>
> >But putting a significant fraction of the existing document base beyond
> >the pale isn't half as relevant as the fact that there are implementations
> >on which this fraction "works", and for people whose concerns are limited
> >to that, what a standard actually stipulates counts for much less than
> >the ability of the implementors to simply *claim* conformance.
>
> Let's look at this carefully: it starts with _an_ implementation on
> which this fraction works. A single implementation does not make for a
> healthy market. Other companies have stepped in. Ask them if they're
> happy that they've had to reverse engineer the behaviour of other
> browsers. Ask them how much it cost. That cost is passed on. We all
> pay.

I doubt that they're happy. Ask Amanda. More importantly, look at what
they're doing, or have already done. The cost has already been passed on.
Ask Amanda again.

Regards,

Arjun Ray
(I speak for myself only.)