A pipeline proposal

Richard Tobin, 2006

Introduction

This describes an unimplemented XML pipeline language. It is intended to be a minimal language on which a more realistic one could be based, either by extension or by means of a front-end that compiled another language to it. Some features may appear grossly inefficient (such as using an entire document to specify a condition), but the more realistic system would either optimise these or simply provide constructs that are implemented as if using the inefficient mechanisms. Other features may appear inadequate; I note where extensions are clearly needed.

I have not yet invented an XML syntax for the language; it is described abstractly in terms of components, pipes and so on.

Components

A component has a type, zero or more named inputs, zero or more named outputs, and zero or more named parameters.

I am not defining names. They might well be URIs or (for names local to a pipeline) NCNames.

The component type is a name identifying the function of the component, for example XInclude or XSLT 1.0. A number of standard component types will be provided, but they are not enumerated here.

Inputs and outputs are named in order that they can be distinguished. Each input is connected either to an input of the containing pipeline or to an output of another component in the pipeline. Each output is connected either to an output of the pipeline or to the input of another component in the pipeline. Inputs and outputs are collectively referred to as ports. The connections between them are known as pipes. Each port must be connected to exactly one other port. The directed graph of pipes must be acyclic.
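The two structural rules above (each port has exactly one connection, and the directed graph of pipes is acyclic) could be checked along the following lines. This is only a sketch: the representation of a port as a (component, port name) pair, the reserved component name "$pipeline" standing for the containing pipeline's own ports, and the function name check_pipes are all assumptions made for illustration and are not part of the proposal.

```python
from collections import defaultdict

PIPELINE = "$pipeline"  # hypothetical marker for the containing pipeline's own ports


def check_pipes(ports, pipes):
    """Check a pipeline's wiring.

    ports: set of (component, port_name) pairs, including the pipeline's own.
    pipes: list of (source_port, destination_port) pairs.
    Raises ValueError if a port lacks exactly one connection or a cycle exists.
    """
    connections = defaultdict(int)
    for source, destination in pipes:
        connections[source] += 1
        connections[destination] += 1
    for port in ports:
        if connections[port] != 1:
            raise ValueError(f"port {port} has {connections[port]} connections")

    # Build the component-level graph, ignoring the pipeline's own ports,
    # which are endpoints rather than nodes that could close a cycle.
    successors = defaultdict(set)
    for (source_component, _), (destination_component, _) in pipes:
        if PIPELINE not in (source_component, destination_component):
            successors[source_component].add(destination_component)

    # Depth-first search: meeting a component that is still in progress
    # means the pipes form a cycle.
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(component):
        state[component] = IN_PROGRESS
        for following in successors.get(component, ()):
            if state.get(following) == IN_PROGRESS:
                raise ValueError(f"cycle through {following}")
            if following not in state:
                visit(following)
        state[component] = DONE

    for component in list(successors):
        if component not in state:
            visit(component)
```

A pipeline whose source input feeds an XInclude component, which in turn feeds an XSLT component connected to the pipeline's result output, would pass this check; two components feeding each other would not.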

Pipes transmit XML documents. These documents may be conceived of as XML infosets, and may have extensions (as in the case of the PSVI), but an implementation is not required to support anything but vanilla XML documents.

Each input accepts, and each output produces, either a single document or a sequence of documents. This is referred to as the cardinality of the port. The ends of a pipe must be connected to ports of the same cardinality. A sequence may contain zero documents.

As described, the cardinality is either one or zero-or-more. It might be useful to be able to constrain it further (e.g. exactly two, not more than four) in which case the ends of a pipe would have to be connected to ports with compatible cardinality.

There is a standard component that accepts a single document and produces a sequence containing that one document.

There might also be a component that accepts a sequence and produces its first document as a single output, or one that requires the sequence to contain exactly one document.

An extension might allow the user to connect ports of different cardinalities and automatically insert the required component to do the conversion.
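The cardinality conversions described above might be sketched as follows. The names Cardinality, to_sequence, first_document, only_document and converter_for are hypothetical, and the choice of the strict "exactly one document" converter for the sequence-to-single direction is an assumption; the text leaves open which of the two behaviours such a component would have.

```python
from enum import Enum


class Cardinality(Enum):
    ONE = "one"
    SEQUENCE = "zero-or-more"


def to_sequence(document):
    """Standard component: wrap a single document in a one-element sequence."""
    return [document]


def first_document(sequence):
    """Possible component: pass on the first document of a sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    return sequence[0]


def only_document(sequence):
    """Possible component: insist that the sequence holds exactly one document."""
    if len(sequence) != 1:
        raise ValueError(f"expected exactly one document, got {len(sequence)}")
    return sequence[0]


def converter_for(source, destination):
    """Return the converter an extension might insert between two ports,
    or None when the cardinalities already agree."""
    if source is destination:
        return None
    if source is Cardinality.ONE:
        return to_sequence
    return only_document  # assumed choice; first_document is the alternative
```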

There is a standard duplicator component that accepts a document and produces a copy of it on each of two outputs.

More copies can of course be obtained by using multiple duplicators. An extension might allow outputs to be connected to more than one input and insert the necessary duplicators.

There are standard sink components that accept a single document or a sequence of documents and produce no output.

There is a standard component that takes a URI as a parameter and produces the XML document retrieved from that URI.

Component parameters have string values. The value can be specified either as a literal or as the value of one of the containing pipeline's parameters.

So parameter values can be set at compile time or passed in to the pipeline at run-time (for example by command-line arguments). An extension might provide for parameter values to be computed by other components.
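As an illustration of the two ways of specifying a parameter value, the sketch below resolves a component's parameter bindings against the parameters passed to the containing pipeline. The class names Literal and PipelineParameter and the function resolve_parameters are hypothetical; the proposal does not define a concrete representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A value fixed when the pipeline is written."""
    value: str


@dataclass(frozen=True)
class PipelineParameter:
    """A reference to one of the containing pipeline's parameters."""
    name: str


def resolve_parameters(bindings, pipeline_parameters):
    """Turn a component's bindings into concrete string values.

    bindings: dict mapping parameter name to Literal or PipelineParameter.
    pipeline_parameters: dict of the strings passed to the pipeline at run time.
    """
    resolved = {}
    for name, binding in bindings.items():
        if isinstance(binding, Literal):
            resolved[name] = binding.value
        else:
            # A missing pipeline parameter raises KeyError.
            resolved[name] = pipeline_parameters[binding.name]
    return resolved
```

Here a literal corresponds to a value set at compile time, and a PipelineParameter to one supplied when the pipeline is run, for example from a command-line argument.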

The type of a component, the names and cardinalities of its ports, and the names of its parameters form the signature of the component.
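A signature as just defined could be represented as an immutable value so that two signatures can be compared directly. The class name Signature, the helper make_signature, and the use of plain strings for cardinalities are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    type: str
    inputs: frozenset      # of (port name, cardinality) pairs
    outputs: frozenset     # of (port name, cardinality) pairs
    parameters: frozenset  # of parameter names


def make_signature(type, inputs, outputs, parameters):
    """Build a Signature from dicts of port name to cardinality
    and an iterable of parameter names."""
    return Signature(
        type,
        frozenset(inputs.items()),
        frozenset(outputs.items()),
        frozenset(parameters),
    )
```

Because ports and parameters are held in sets, the order in which they are declared does not affect whether two signatures are equal.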

Pipelines

A pipeline is a kind of component, and therefore has a name (its component type), inputs, outputs, and parameters. The name provides a mechanism for subroutine-like reuse of pipelines.

It is convenient to refer to "sub-pipelines" within another pipeline, but these have no status different from other components. Sub-pipelines are likely to appear as the branches of conditionals, so an implementation will probably not require the user to explicitly provide a name for them.

An implementation will have to provide a way to run a pipeline and provide it with inputs, outputs, and parameters. There is no reason why it should not also allow non-pipeline components to be run in this way.

A pipeline contains zero or more components. Each port of a pipeline must be connected to a port of a contained component.

Control structures

Except as described under Conditionals, every component in a pipeline is run; in particular, whether a component runs does not depend on input arriving or output being requested. The conditional is the only explicit control structure.
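The proposal does not prescribe an execution order. Since the graph of pipes is acyclic, one possible strategy (an assumption of this sketch, not a requirement) is to run every component once, in an order in which each component follows the components that feed it. The function name run_order is hypothetical; graphlib is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter


def run_order(pipes):
    """Return every component once, upstream components first.

    pipes: iterable of (upstream_component, downstream_component) pairs.
    Raises graphlib.CycleError if the pipes form a cycle.
    """
    predecessors = {}
    for upstream, downstream in pipes:
        predecessors.setdefault(upstream, set())
        predecessors.setdefault(downstream, set()).add(upstream)
    return list(TopologicalSorter(predecessors).static_order())
```

Every component appears in the result whether or not anything downstream consumes its output, which matches the rule that all components are run.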

Conditionals

A conditional is a kind of component. It consists of two components, called the "if branch" and the "else branch". These two components must have the same signature. The signature of the conditional is the same as that of the branches, augmented by an input named "test-document" and a parameter named "test-xpath". The test-xpath is evaluated with the root node of the test-document as the current node. If its value converted to a boolean is true, the if branch is used; otherwise the else branch is used. The chosen branch is run with the ports and parameters of the conditional, except for the test-document input and the test-xpath parameter.
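The behaviour of a conditional might be sketched as below, under several stated assumptions. Branches are modelled as Python callables taking a dict of inputs and a dict of parameters; the function name run_conditional is hypothetical. The test uses the limited XPath subset of Python's xml.etree.ElementTree, which can only select nodes, so "converted to a boolean" is approximated by the XPath 1.0 rule that a non-empty node-set is true. A full implementation would need a complete XPath evaluator.

```python
import xml.etree.ElementTree as ET


def run_conditional(if_branch, else_branch, inputs, parameters):
    """Run whichever branch the test selects.

    inputs must contain "test-document" (an Element, the document element);
    parameters must contain "test-xpath". Neither is passed to the branch.
    """
    test_document = inputs["test-document"]
    test_xpath = parameters["test-xpath"]

    # ElementTree has no document node, so wrap the document element in a
    # placeholder to stand in for the root node as the current node.
    root_node = ET.Element("root-node-placeholder")
    root_node.append(test_document)

    # A non-empty node-set counts as true, an empty one as false.
    matches = root_node.findall(test_xpath)
    chosen = if_branch if matches else else_branch

    branch_inputs = {k: v for k, v in inputs.items() if k != "test-document"}
    branch_parameters = {k: v for k, v in parameters.items() if k != "test-xpath"}
    return chosen(branch_inputs, branch_parameters)
```

With a test-document whose document element is an order element carrying urgent="yes", a test-xpath of order[@urgent='yes'] would select the if branch, and order[@urgent='no'] the else branch.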

An alternative approach would allow the two branches to have different ports. But this would mean that components downstream of the unselected branch would receive no input, and those upstream of it would have their output unread. Should such components be run at all? The obvious answer is no, but then each conditional would have effects pervading the whole pipeline, which seems confusing.

In some cases where the branches would require different inputs a natural solution is to move some upstream components into the branch. In other cases, for example where the branches use different outputs of the same upstream component, the signatures of the branches can be unified, with each branch discarding some of its inputs. In the case of output, unifying the signatures is less useful, since a common downstream component would have to somehow know which of its inputs would receive a document.

Viewports

Viewports can be implemented using two components, without any additional control structure. The viewport start component has a single document input and an XPath parameter. It has two outputs: on one it passes through the original document, and on the other it produces a sequence of documents corresponding to the nodes matched by the XPath. This sequence forms the input to the "body" of the viewport. The viewport end component has the same XPath parameter and two inputs, one connected to the passed-through document and the other to the output of the body. It reads the passed-through document, replaces the nodes matching the XPath with successive documents read from the body, and outputs the reconstituted document. It is an error if the body does not produce the same number of documents as it receives.
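The pair of viewport components might look roughly like this. The sketch uses Python's xml.etree.ElementTree, whose XPath support is a limited subset and is evaluated relative to the document element, so the document element itself cannot be matched. It also assumes that matched nodes are elements and are not nested inside one another. The function names viewport_start and viewport_end are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def viewport_start(document, xpath):
    """Return the original document and a sequence of copies of matched nodes."""
    matches = document.findall(xpath)
    return document, [copy.deepcopy(node) for node in matches]


def viewport_end(document, body_output, xpath):
    """Return a copy of document with each matched node replaced, in order,
    by the corresponding document produced by the body."""
    result = copy.deepcopy(document)
    matches = result.findall(xpath)
    if len(matches) != len(body_output):
        raise ValueError(
            f"body produced {len(body_output)} documents for {len(matches)} nodes"
        )
    # ElementTree elements have no parent pointer, so build a lookup table.
    parent_of = {child: parent for parent in result.iter() for child in parent}
    for old, new in zip(matches, body_output):
        parent = parent_of[old]
        index = list(parent).index(old)
        replacement = copy.deepcopy(new)
        replacement.tail = old.tail  # keep the text that followed the old node
        parent[index] = replacement
    return result
```

A body that upper-cases the text of each p element would be applied to the sequence returned by viewport_start, and its results handed to viewport_end together with the passed-through document and the same XPath.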