(Computational) Linguistics and the Web: Hot research questions
I've spent the last ten years trying to feed technologies and
insights from Linguistics and Computational Linguistics into the
infrastructure of the Web. In this talk I'll give brief but intense
introductions to four areas of research interest from (C)L and
related disciplines which have the potential for making a real
impact on the way the Web works:
- A novel declarative approach to fixup of broken XML/(X)HTML
-
'HTML in the wild' isn't grammatical, and the majority opinion
is that only code and/or English can be used to standardise the
fixup process. There is precedent for error-correcting parsers,
and I'll describe a variant that might work for HTML.
- Counter-augmented Finite-State Automata for parsing XML
-
Parsing XML content models which allow numeric occurrence ranges
(e.g. between 2 and 10 occurrences of (<x> followed by an
optional <y>)) has historically involved worst-case exponential
space. A new formalism, FSAs with counters, improves the
situation considerably.
- Functional XML -- Self-describing documents meet the lambda
calculus
-
XML is increasingly valuable as a vehicle for information whose
utility depends largely on its being manipulated, aggregated,
transformed, etc. Traditional approaches to specifying this
processing have been external, i.e. scripting languages for XML
data. The alternative presented here is XML documents which
define their own processing.
- Identity, URIs and the (Semantic) Web
-
Is there any reason to suppose that the Semantic Web will
succeed where thirty years of AI-based work on Knowledge
Representation have, well, failed? The only real difference is
the use of URIs for naming properties, classes and individuals.
The current dominant ideology with respect to how names work in
ordinary language offers some insight on this question.
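
To make the occurrence-range example from the counter-augmented
automata item concrete, here is a minimal sketch of a recogniser for
"between 2 and 10 occurrences of (<x> followed by an optional <y>)".
It is not the formalism presented in the talk: the function name, the
token representation (plain strings "x" and "y") and the default
bounds are assumptions made for illustration. The point it tries to
show is that a single integer counter plus a fixed handful of states
can stand in for copying the sub-automaton once per permitted
occurrence.

```python
def matches_xy_range(tokens, lo=2, hi=10):
    """Recognise (x y?){lo,hi} over a sequence of element names.

    Hypothetical helper for illustration only. The automaton has two
    control states (tracked by `after_x`) and one counter (`count`).
    """
    count = 0        # number of (x y?) groups started so far
    after_x = False  # True when the last token was x, so a y may follow

    for tok in tokens:
        if tok == "x":
            count += 1          # every x opens a new occurrence
            if count > hi:      # upper bound exceeded: reject early
                return False
            after_x = True
        elif tok == "y" and after_x:
            after_x = False     # the optional y closes the current group
        else:
            return False        # unexpected token, or a y with no x before it

    # Accept only if the number of occurrences falls within the range.
    return lo <= count <= hi


if __name__ == "__main__":
    print(matches_xy_range(["x", "y", "x"]))   # two occurrences
    print(matches_xy_range(["x"]))             # only one occurrence
    print(matches_xy_range(["x"] * 11))        # eleven occurrences
```

This particular content model is deterministic (every occurrence
starts with <x>), which is what keeps the sketch so short; content
models with more ambiguity would need more care than is shown here.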
Henry S. Thompson