[Ocaml-pxp-users] enable_namespace_info: good or bad?

Wed Oct 22 08:47:44 PDT 2003

On Tue, 21.10.2003 at 14:49, Stefano Zacchiroli wrote:
> Hi all,
>   I'm working with PXP on XML Schema documents.
> 
> Such documents usually begin with something like:
> 
>   <xsd:schema
>     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>     xmlns:target="http://www.example.org/foo"
>     targetNamespace="http://www.example.org/foo"
>     >
> 
> then in the rest of the document you can have references to previously
> defined items in this way:
> 
>   <xsd:element ref="target:elt1" />
> 
> Here the target prefix matches the prefix used in the xmlns:target
> declaration ... and here there's the problem!

Just to understand the problem better: Would it be allowed to have a
second namespace prefix, e.g.

<xsd:schema
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:target="http://www.example.org/foo"
    xmlns:target2="http://www.example.org/foo"
    targetNamespace="http://www.example.org/foo"
    >

and then to refer to the target namespace by either prefix?

For core XML, the prefixes have only the role of a notational
abbreviation. You can change them without changing the semantics of the
document. Schemas seem to violate this principle, as the prefixes (and
not the namespace URIs) occur within the Schema definition, so any
processor that modifies the XML prefixes without adjusting their
occurrences within the definition in the same way runs into trouble.

My question is whether we have at least the (weaker) principle that
several prefixes may be bound to the same namespace URI.
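
To make the question concrete, here is a minimal sketch in plain OCaml
(not PXP code) of how a QName-valued attribute such as ref="target:elt1"
would be resolved against the xmlns declarations of the example above.
The names decls and expand_qname are made up for illustration. If the
weaker principle holds, both prefixes must yield the same expanded name.

```ocaml
(* The xmlns declarations of the xsd:schema element from the example
   above, as (prefix, URI) pairs. *)
let decls =
  [ "xsd", "http://www.w3.org/2001/XMLSchema";
    "target", "http://www.example.org/foo";
    "target2", "http://www.example.org/foo" ]

(* Hypothetical helper: split "prefix:local" at the first colon and look
   the prefix up. Returns None for an unprefixed value or an undeclared
   prefix. *)
let expand_qname decls qname =
  match String.index_opt qname ':' with
  | None -> None
  | Some i ->
      let prefix = String.sub qname 0 i in
      let local = String.sub qname (i + 1) (String.length qname - i - 1) in
      Option.map (fun uri -> (uri, local)) (List.assoc_opt prefix decls)

(* Both prefixes name the same namespace, so both references expand to
   the same (URI, local name) pair. *)
let same_target =
  expand_qname decls "target:elt1" = expand_qname decls "target2:elt1"
```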

> I need to use PXP with XML Namespace support for various reasons. Using
> PXP with namespace implies that xmlns: attributes are removed from the
> representation of the schema element. Then prefix normalization is
> applied and the only way to know the prefix used by the user is to use
> enable_namespace_info parser configuration option.

Which works only for elements, not attributes. namespace_info was a bad
idea from the very beginning.

> I'm a bit scared about using that option because the comments say that
> it requires a lot of memory and that is "very very very very very
> experimental" ... Furthermore I just need to have namespace_info on the
> root element, not on all elements of the document.
> 
> Solutions that come into my mind are:
> 
> - ask the namespace manager which prefix got bound to
>   http://www.example.org/foo ==>>
>     not safe because namespace normalization could have chosen a prefix
>     different from the one used by the user ...

This happens only in two cases: (1) Before calling the parser there is
already some mapping from the namespace URI to the prefix, (2) the root
element defines several prefixes for the same namespace URI (as shown in
my question above).

You can explicitly check for (1), but case (2) may be a problem.
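
Case (2) can at least be detected once the xmlns declarations of the
root element are available as (prefix, URI) pairs. The following is only
a sketch in plain OCaml; prefixes_for and user_prefix are hypothetical
helpers, not PXP functions, and how the declaration list is obtained is
left open here.

```ocaml
(* Hypothetical helper: all prefixes that a list of (prefix, URI)
   declarations binds to the given URI, in declaration order. *)
let prefixes_for uri decls =
  List.filter_map (fun (p, u) -> if u = uri then Some p else None) decls

(* The prefix can be identified unambiguously only if exactly one prefix
   is bound to the URI. Zero or several candidates give None. *)
let user_prefix uri decls =
  match prefixes_for uri decls with
  | [ p ] -> Some p
  | _ -> None

(* Example declaration lists: one unambiguous, one showing case (2). *)
let single = [ "xsd", "http://www.w3.org/2001/XMLSchema";
               "target", "http://www.example.org/foo" ]
let double = ("target2", "http://www.example.org/foo") :: single
```

With these lists, user_prefix returns the prefix "target" for single,
and None for double because two prefixes compete for the same URI.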

> - first perform a non-namespace-aware parsing phase and look at the
>   xmlns: attributes ==>>
>     terribly slow
> 
> Other ideas?

For the moment, I would recommend using namespace_info and evaluating
what the [declaration] method returns (which reflects the xmlns
declarations). To save memory, you can override the set_namespace_info
method for all objects except xsd:schema such that it does nothing.

The problem of the namespace_info object is that it stores the list of
declarations for every element separately (actually, there is some
memory sharing on the representation level, but the [declaration] method
processes the representation before it returns the declaration list).
Furthermore, it does not store the original prefixes of the attributes.

To fix these problems, I have the following proposal:

First, namespace_info would be dropped. The information it contains
describes the structure of a namespace, and the right place to keep that
structure is the namespace manager. (I.e. by storing the structure in a
single place we can optimize the representation.)

Currently, the namespace manager is a bijective mapping from
normprefixes to namespace URIs (if we ignore the alias URIs). This can
be extended by what I would call the namespace scope, i.e. the list of
prefixes that refer to the namespace. There would be a scope_id to
quickly identify scopes. For example:

<a1:x xmlns:a1="A" xmlns:b1="B">
  <b1:y xmlns:a2="A" xmlns:b2="B">
    <a2:z/>
  </b1:y>
</a1:x>

In the current implementation, the namespace manager would only contain:
{ (normprefix "a1" <-> nsuri "A"), 
  (normprefix "b1" <-> nsuri "B") }

In the extended version:
{ (normprefix "a1" <-> nsuri "A", scope 1, scope 2), 
  (normprefix "b1" <-> nsuri "B", scope 3, scope 4) }

where scope 1 = { ("a1" -> "A") },
      scope 2 = { ("a1" -> "A", "a2" -> "A") }
      scope 3 = { ("b1" -> "B") },
      scope 4 = { ("b1" -> "B", "b2" -> "B") }

The point is that the elements can now simply list the scopes that are
currently active. (The real point is that you can even modify the XML
tree, and the scopes are kept intact; if one simply stored the
original xmlns attributes, the namespace structure would be destroyed
every time the XML tree is modified across namespace boundaries.) In our
example:

a1:x has scopes 1,3
b1:y has scopes 2,4
a2:z has scopes 2,4
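
As a rough illustration of the data model, the example can be written
down in plain OCaml. This is only a sketch of the idea, not PXP code:
the types scope and ns_entry and the functions find_scope and
declared_prefixes are invented for this illustration.

```ocaml
(* A scope: the (prefix, URI) bindings visible for one namespace. *)
type scope = { scope_id : int; bindings : (string * string) list }

(* One entry of the extended namespace manager: the normprefix <-> URI
   pair plus the scopes known for that namespace. *)
type ns_entry = { normprefix : string; uri : string; scopes : scope list }

let scope1 = { scope_id = 1; bindings = [ "a1", "A" ] }
let scope2 = { scope_id = 2; bindings = [ "a1", "A"; "a2", "A" ] }
let scope3 = { scope_id = 3; bindings = [ "b1", "B" ] }
let scope4 = { scope_id = 4; bindings = [ "b1", "B"; "b2", "B" ] }

let manager =
  [ { normprefix = "a1"; uri = "A"; scopes = [ scope1; scope2 ] };
    { normprefix = "b1"; uri = "B"; scopes = [ scope3; scope4 ] } ]

(* Elements only carry the ids of their active scopes. *)
let active_scopes = [ "a1:x", [ 1; 3 ]; "b1:y", [ 2; 4 ]; "a2:z", [ 2; 4 ] ]

(* Look a scope up by its id across all manager entries. *)
let find_scope id =
  List.find_opt (fun s -> s.scope_id = id)
    (List.concat_map (fun e -> e.scopes) manager)

(* All prefixes declared for [ns] at the element named [elt]. *)
let declared_prefixes elt ns =
  match List.assoc_opt elt active_scopes with
  | None -> []
  | Some ids ->
      ids
      |> List.filter_map find_scope
      |> List.concat_map (fun s -> s.bindings)
      |> List.filter_map (fun (p, u) -> if u = ns then Some p else None)
```

For instance, declared_prefixes "a2:z" "A" collects the prefixes a1 and
a2, while for "a1:x" only a1 is found.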

The memory requirements would be moderate. Keeping the scoping structure
allows us to query the prefixes that are declared for a namespace URI
for every element of the XML tree. Of course, these may be several
prefixes, not just one. To fix this problem, I propose to add the notion
of a display prefix to the document model. After parsing, the display
prefix is set to the prefix that has been found in the XML text. It may
then be changed to any other prefix that is a member of
the same scope. This operation would be possible for both elements and
attributes.
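
The rule for the display prefix could look roughly like this. Again this
is a sketch in plain OCaml under my own assumptions; the record type
node and the function set_display_prefix are hypothetical and do not
exist in PXP.

```ocaml
(* Hypothetical model of an element or attribute: its namespace URI, the
   prefixes that its active scope declares for that URI, and the prefix
   currently used for output. *)
type node = {
  ns_uri : string;
  scope_prefixes : string list;
  mutable display_prefix : string;
}

(* Change the display prefix. Prefixes that are not members of the scope
   are rejected and the node is left unchanged. *)
let set_display_prefix node p =
  if List.mem p node.scope_prefixes then begin
    node.display_prefix <- p;
    Ok ()
  end else
    Error (Printf.sprintf "prefix %S is not declared for %s" p node.ns_uri)

(* The element a2:z from the example: after parsing, the display prefix
   is the one found in the XML text. *)
let z = { ns_uri = "A"; scope_prefixes = [ "a1"; "a2" ]; display_prefix = "a2" }
```

Here switching z to "a1" would succeed, whereas "b1" would be refused
because it belongs to a different namespace.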

What do you think about this proposal?

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd at gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------