Adding Structure to an XQuery Comparison

In my first post on Comparing XQuery with DeltaXML Core, we used a flat sequence of span elements, where each element represented an XQuery language token and a class attribute showed the token type. This proved simple and effective, but in this post I take things a stage further: an additional XSLT template in the input XSLT filter adds some structure to this set of tokens, allowing a more accurate and robust comparison of the XQuery.

The original XQuery source (input ‘A’) used for this exercise is shown below. Each recognized token type is rendered in a different foreground color. Note that there are also many ‘whitespace’ tokens interspersed with the visible tokens.

[Image: original XQuery source (input ‘A’)]

In the code above I’ve used rectangles to highlight the structure that we are going to add so that they enclose each set of tokens to be wrapped. Specifically, the language structures that we’re looking to wrap are:

  • function declarations
  • element constructors with separate start and end tags
  • self-closed element constructor tags

The new ‘wrap-spans’ XSLT template is added to the existing XSLT filter, xquery2xml.xsl, and is then called from the entry-point template, as shown below:

<xsl:template match="/">
    <xsl:variable name="text-file-uri" select="f:path-to-uri(normalize-space(text-file))"/>
    <xsl:message>xquery2xml transform on: <xsl:value-of select="$text-file-uri"/></xsl:message>
    <xsl:variable name="file-content" as="xs:string" select="unparsed-text($text-file-uri)"/>
    <xsl:variable name="tokens" as="element()*" select="xqf:show-xquery($file-content)"/>

    <pre>
        <xsl:call-template name="wrap-spans">
            <xsl:with-param name="spans" select="$tokens"/>
            <xsl:with-param name="index" select="1"/>
        </xsl:call-template>
    </pre>
</xsl:template>

The required output XML for this template is the same as the input XML but with added wrapper spans, as shown below. There are a large number of tokens so, to save space, I’ve only shown the tokens for the first function and haven’t shown the input XML. Each wrapper span has a class attribute with the value ‘wrapper’ and a data-open-id attribute with a value that starts with ‘ey’ for an element wrapper or ‘fy’ for a function declaration wrapper. The number appended to this value is the position of the start token. I’ve added this number, and the ‘pos’ attributes on the other tokens, purely for diagnostic purposes (these are removed later).

<?xml version="1.0" encoding="UTF-8"?>
<pre xmlns="http://www.w3.org/1999/xhtml">
    <span pos="1" class="open"/>
    <span class="wrapper" data-open-id="fy2">
        <span class="prolog">declare function</span>
        <span pos="3" class="whitespace"> </span>
        <span pos="4" class="function">local:summary-full</span>
        <span pos="5" class="parenthesis">(</span>
        <span pos="6" class="variable">$emps</span>
        <span pos="7" class="whitespace"> </span>
        <span pos="8" class="op">as</span>
        <span pos="9" class="whitespace"> </span>
        <span pos="10" class="node-type">element</span>
        <span pos="11" class="parenthesis">(</span>
        <span pos="12" class="qname">employee</span>
        <span pos="13" class="parenthesis">)</span>
        <span pos="14" class="quantifier">*</span>
        <span pos="15" class="parenthesis">)</span>
        <span pos="16" class="whitespace">     </span>
        <span pos="17" class="op">as</span>
        <span pos="18" class="whitespace"> </span>
        <span pos="19" class="node-type">element</span>
        <span pos="20" class="parenthesis">(</span>
        <span pos="21" class="qname">dept</span>
        <span pos="22" class="parenthesis">)</span>
        <span pos="23" class="quantifier">*</span>
        <span pos="24" class="whitespace">     </span>
        <span pos="25" class="op">{</span>
        <span pos="26" class="whitespace">     </span>
        <span pos="27" class="higher">for</span>
        <span pos="28" class="whitespace"> </span>
        <span pos="29" class="variable">$d</span>
        <span pos="30" class="whitespace"> </span>
        <span pos="31" class="op">in</span>
        <span pos="32" class="whitespace"> </span>
        <span pos="33" class="function">fn:distinct-values</span>
        <span pos="34" class="parenthesis">(</span>
        <span pos="35" class="variable">$emps</span>
        <span pos="36" class="step">/</span>
        <span pos="37" class="qname">deptno</span>
        <span pos="38" class="parenthesis">)</span>
        <span pos="39" class="whitespace">     </span>
        <span pos="40" class="higher">let</span>
        <span pos="41" class="whitespace"> </span>
        <span pos="42" class="variable">$e</span>
        <span pos="43" class="whitespace"> </span>
        <span pos="44" class="op">:=</span>
        <span pos="45" class="whitespace"> </span>
        <span pos="46" class="variable">$emps</span>
        <span pos="47" class="filter">[</span>
        <span pos="48" class="qname">deptno</span>
        <span pos="49" class="whitespace"> </span>
        <span pos="50" class="op">=</span>
        <span pos="51" class="whitespace"> </span>
        <span pos="52" class="variable">$d</span>
        <span pos="53" class="filter">]</span>
        <span pos="54" class="whitespace">     </span>
        <span pos="55" class="op">return</span>
        <span pos="56" class="whitespace">     </span>
        <span class="wrapper" data-open-id="ey57">
            <span class="es">&lt;</span>
            <span pos="58" class="en">dept</span>
            <span pos="59" class="z">></span>
            <span pos="60" class="txt">       </span>
            <span class="wrapper" data-open-id="ey61">
                <span class="es">&lt;</span>
                <span pos="62" class="en">full</span>
                <span data-close-id="sz63" class="z">/></span>
            </span>
            <span pos="64" class="txt">       </span>
            <span class="wrapper" data-open-id="ey65">
                <span class="es">&lt;</span>
                <span pos="66" class="en">deptno</span>
                <span pos="67" class="z">></span>
                <span pos="68" class="op">{</span>
                <span pos="69" class="variable">$d</span>
                <span pos="70" class="op">}</span>
                <span pos="71" class="sc">&lt;/</span>
                <span pos="72" class="cl">deptno</span>
                <span data-close-id="ez73" class="z">></span>
            </span>
            <span pos="74" class="txt">       </span>
            <span class="wrapper" data-open-id="ey75">
                <span class="es">&lt;</span>
                <span pos="76" class="en">headcount</span>
                <span pos="77" class="z">></span>
                <span pos="78" class="txt"> </span>
                <span pos="79" class="op">{</span>
                <span pos="80" class="function">fn:count</span>
                <span pos="81" class="parenthesis">(</span>
                <span pos="82" class="variable">$e</span>
                <span pos="83" class="parenthesis">)</span>
                <span pos="84" class="op">}</span>
                <span pos="85" class="txt"> </span>
                <span pos="86" class="sc">&lt;/</span>
                <span pos="87" class="cl">headcount</span>
                <span data-close-id="ez88" class="z">></span>
            </span>
            <span pos="89" class="txt">       </span>
            <span class="wrapper" data-open-id="ey90">
                <span class="es">&lt;</span>
                <span pos="91" class="en">payroll</span>
                <span pos="92" class="z">></span>
                <span pos="93" class="txt"> </span>
                <span pos="94" class="op">{</span>
                <span pos="95" class="function">fn:sum</span>
                <span pos="96" class="parenthesis">(</span>
                <span pos="97" class="variable">$e</span>
                <span pos="98" class="step">/</span>
                <span pos="99" class="qname">salary</span>
                <span pos="100" class="parenthesis">)</span>
                <span pos="101" class="op">}</span>
                <span pos="102" class="txt"> </span>
                <span pos="103" class="sc">&lt;/</span>
                <span pos="104" class="cl">payroll</span>
                <span data-close-id="ez105" class="z">></span>
            </span>
            <span pos="106" class="txt">       </span>
            <span pos="107" class="sc">&lt;/</span>
            <span pos="108" class="cl">dept</span>
            <span data-close-id="ez109" class="z">></span>
        </span>
        <span pos="110" class="open"/>
        <span pos="111" class="whitespace">     </span>
        <span pos="112" class="op">}</span>
        <span data-close-id="fz113" class="op">;</span>
    </span>
    <span pos="114" class="whitespace">        </span>
</pre>

In the XML above we can see that the start of each function declaration is marked by a ‘prolog’ token with the value ‘declare function’ (in practice, the whitespace separator may vary). The start of an element constructor is marked by an ‘es’ or ‘esx’ token (note that the token types used by the XMLSpectrum tokenizer do not correspond directly to the language specification for XQuery). The ‘wrap-spans’ template uses this information to set a ‘wrap-open’ variable that detects the start of a wrapped sequence of tokens. The XSLT excerpt for this is shown below:

<xsl:variable name="is-fn-declaration" as="xs:boolean"
              select="$span/@class eq 'prolog' and $prolog-tokens[1] eq 'declare' and $prolog-tokens[2] eq 'function'"/>

<xsl:variable name="wrap-open" as="xs:string"
              select="if ($span/@class = ('es','esx'))
                       then 'ey' (: ey - element start :)
                       else if ($is-fn-declaration)
                       then 'fy' (: fy - function declaration start :)
                       else ''"/>

A ‘wrap-close’ variable detects the close of a wrapped token sequence in a similar manner to ‘wrap-open’. The full code for the template is shown below:

<xsl:template name="wrap-spans">
    <xsl:param name="spans" as="node()*"/>
    <xsl:param name="index" as="xs:integer"/>

    <xsl:variable name="span" as="node()?" select="$spans[$index]"/>
    <xsl:variable name="prev-span" as="node()?" select="$spans[$index - 1]"/>
    <xsl:variable name="next-span" as="node()?" select="$spans[$index + 1]"/>

    <xsl:variable name="prolog-tokens" as="xs:string*" select="tokenize($span, '\s+')"/>

    <xsl:variable name="is-fn-declaration" as="xs:boolean"
                  select="$span/@class eq 'prolog' and $prolog-tokens[1] eq 'declare' and $prolog-tokens[2] eq 'function'"/>

    <xsl:variable name="wrap-open" as="xs:string"
                  select="if ($span/@class = ('es','esx'))
                           then 'ey' (: ey - element start :)
                           else if ($is-fn-declaration)
                           then 'fy' (: fy - function declaration start :)
                           else ''"/>

    <xsl:variable name="wrap-close" as="xs:string"
                  select="if ($span/@class eq 'op' and $span eq ';'
                               and $prev-span/@class eq 'op'
                               and $prev-span eq '}')
                           then 'fz' (: fz - closed function declaration :)
                           else if (ends-with($span, '/>') and $prev-span/@class = ('en', 'atn'))
                           then 'sz' (: sz - self-closed element :)
                           else if ($span eq '>' and $prev-span/@class eq 'cl')
                           then 'ez' (: ez - closed element :)
                           else ''"/>

    <xsl:choose>
        <!-- Past the end of the token sequence: stop recursing -->
        <xsl:when test="empty($span)"/>
        <!-- Start of a wrapped sequence: collect the tokens up to and including the close token -->
        <xsl:when test="$wrap-open ne ''">
            <xsl:variable name="span-children" as="node()*">
                <xsl:call-template name="wrap-spans">
                    <xsl:with-param name="spans" select="$spans"/>
                    <xsl:with-param name="index" select="$index + 1"/>
                </xsl:call-template>
            </xsl:variable>
            <span class="wrapper" data-open-id="{$wrap-open}{$index}">
                <xsl:sequence select="$span"/>
                <xsl:sequence select="$span-children"/>
            </span>
            <!-- Resume after the close token; its index follows the two-letter prefix of data-close-id -->
            <xsl:if test="exists($span-children[last()]/@data-close-id)">
                <xsl:variable name="prev-index" select="xs:integer(substring($span-children[last()]/@data-close-id, 3))"/>
                <xsl:call-template name="wrap-spans">
                    <xsl:with-param name="spans" select="$spans"/>
                    <xsl:with-param name="index" select="$prev-index + 1"/>
                </xsl:call-template>
            </xsl:if>
        </xsl:when>
        <!-- End of a wrapped sequence: mark the close token and return without recursing -->
        <xsl:when test="$wrap-close ne ''">
            <span data-close-id="{$wrap-close}{$index}">
                <xsl:copy-of select="$span/@*|$span/node()"/>
            </span>
        </xsl:when>
        <!-- Ordinary token: emit it and continue with the next token -->
        <xsl:otherwise>
            <xsl:apply-templates select="$span" mode="wrapping">
                <xsl:with-param name="pos" select="$index"/>
            </xsl:apply-templates>
            <xsl:call-template name="wrap-spans">
                <xsl:with-param name="spans" select="$spans"/>
                <xsl:with-param name="index" select="$index + 1"/>
            </xsl:call-template>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

The most important aspect of this template is its recursive nature. An xsl:choose instruction controls the recursive template call according to three cases: a ‘wrap-open’ is detected and a new wrapper element must be added; a ‘wrap-close’ is detected and the currently wrapped sequence must be closed; or no open/close marker is found and the current sequence simply continues.

Note that on a ‘wrap-open’ event, we need to continue processing tokens immediately after the ‘wrap-close’ event, which occurs within the recursive call. The ‘wrap-close’ event therefore adds a ‘data-close-id’ attribute to the closing token. This attribute is then read within the calling ‘wrap-open’ event to get the index of the next token.
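
The template applied in ‘wrapping’ mode from the xsl:otherwise branch is not shown in this post. A minimal sketch of what it might look like, assuming it only needs to copy the token and record the diagnostic ‘pos’ attribute seen in the output above (the actual template in xquery2xml.xsl may well differ):

```xml
<!-- Hypothetical sketch: copies a token and adds the diagnostic 'pos' attribute -->
<xsl:template match="*" mode="wrapping">
    <xsl:param name="pos" as="xs:integer"/>
    <xsl:copy>
        <!-- Index of this token in the flat token sequence -->
        <xsl:attribute name="pos" select="$pos"/>
        <!-- Existing attributes (such as class) and the token text are kept unchanged -->
        <xsl:copy-of select="@* | node()"/>
    </xsl:copy>
</xsl:template>
```

Matching on * and using xsl:copy keeps each token in whatever namespace the tokenizer produced, so the sketch makes no assumption about the stylesheet's default namespace.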

Orderless Elements

Now that we’ve added some structure to our XQuery code tokens, we can make code comparisons more resilient. In XQuery, the order of functions declared in the prolog is not significant. Because each declared function now has a dedicated ‘wrapper’ element, we can exploit DeltaXML Core’s ‘orderless’ comparison feature (described in Comparing Orderless Elements). Orderless comparison is applied where a deltaxml:ordered attribute with a value of ‘false’ is found (here, on the parent ‘pre’ element). A deltaxml:key attribute can also be added to elements to ensure that alignment is achieved efficiently and reliably. We would normally use the QName for the function (represented using {uri}local-name) together with an arity value as the key for each function declaration but, to simplify things slightly, I will use the prefixed function name only. To add these attributes we need just one further template but, because this functionality is separate from the initial token wrapping, I’ll write a separate XSLT filter for this, key-xquery.xsl:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns="http://www.w3.org/1999/xhtml"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0"
                xpath-default-namespace="http://www.w3.org/1999/xhtml">

    <xsl:output method="xml" indent="no"/>

    <xsl:template match="/pre">
        <pre xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" deltaxml:ordered="false">
            <xsl:apply-templates select="*" mode="top"/>
        </pre>
    </xsl:template>

    <xsl:template match="@* | node()" mode="lower">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()" mode="lower"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="span" mode="top">
        <span deltaxml:key="p-{position()}">
            <xsl:apply-templates select="@class | node()" mode="lower"/>
        </span>
    </xsl:template>

    <xsl:template match="span" mode="lower">
        <span>
            <xsl:apply-templates select="@class | node()" mode="lower"/>
        </span>
    </xsl:template>

    <xsl:template match="span[@class eq 'wrapper'][starts-with(@data-open-id, 'fy')]" mode="top">
        <span deltaxml:key="{span[@class eq 'function'][1]}">
            <xsl:apply-templates select="@class | node()" mode="lower"/>
        </span>
    </xsl:template>

</xsl:stylesheet>
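
As an illustration, applying this filter to the single-function output shown earlier should give top-level children of ‘pre’ along the following lines. This is a hand-worked sketch rather than captured tool output, and the child tokens of the function wrapper are omitted. The ‘pre’ element itself carries the deltaxml:ordered="false" attribute and the deltaxml namespace declaration:

```xml
<span deltaxml:key="p-1" class="open"/>
<span deltaxml:key="local:summary-full" class="wrapper">
    <!-- child tokens, copied with only their class attributes -->
</span>
<span deltaxml:key="p-3" class="whitespace">        </span>
```

Only the function wrapper receives a name-based key. The other top-level tokens are keyed by position, and the diagnostic ‘pos’, ‘data-open-id’ and ‘data-close-id’ attributes are dropped because only @class is copied.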

The key-xquery.xsl filter adds the required deltaxml:ordered="false" attribute to the ‘pre’ element and then adds a key to each top-level element. All that’s left is to add this ‘key-xquery’ filter to our Pipeline Configuration file (DXP) and run the comparison. Here’s the new DXP file with the added filter:

<!DOCTYPE comparatorPipeline SYSTEM "../dxp/dxp.dtd">
<!-- comparatorPipeline description="compare xquery" id="xquery" -->
<comparatorPipeline description="compare xquery" id="xquery">
    <inputFilters>
        <filter>
            <file path="xquery2xml.xsl" relBase="dxp"/>
        </filter>
        <filter>
            <file path="key-xquery.xsl" relBase="dxp"/>
        </filter>
    </inputFilters>
    <outputFilters>
        <filter>
            <file path="xquery-tokens2html.xsl" relBase="dxp"/>
        </filter>
    </outputFilters>
    <outputProperties>
        <property name="indent" literalValue="no"/>
    </outputProperties>
    <comparatorFeatures>
        <feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
        <feature name="http://deltaxml.com/api/feature/enhancedMatch1" literalValue="true"/>
    </comparatorFeatures>
</comparatorPipeline>

It’s now time to compare this XQuery with a modified version. These are the changes I made:

  1. Swapped the order of the local:summary-full and local:summary-short function declarations
  2. In local:summary-full(): Added ‘company’ to $emps/deptno to give $emps/company/deptno
  3. In local:summary-full(): Added a ‘serial’ attribute to the ‘dept’ element constructor
  4. In local:summary-full(): Removed the ‘full’ child element from the ‘dept’ element constructor
  5. In local:summary-short(): Changed ‘$emps[deptno = $d]’ to be ‘$emps[deptno ne $d]’

And here is the result of using DeltaXML Core to compare this modified version with the original – using orderless comparison with keys:

[Image: result of using DeltaXML Core with orderless comparison]

As we can see above, the changes have been marked clearly and the different order of the function declarations causes no problems (the result shows the order of the original by default). It’s now time to look at the output from DeltaXML Core for the same inputs, but this time when keyed comparison is not used:

[Image: result of using DeltaXML Core without keyed comparison]

The above result shows that, when using ordered comparison, genuine changes are missed and spurious changes are reported, simply because the corresponding functions have not been aligned and therefore the wrong functions have been compared with each other.

Conclusion

By creating a filter to add wrapper elements with unique keys to XQuery function declarations, we were able to exploit DeltaXML Core’s orderless comparison and ignore changes in the order of function declarations within the XQuery prolog – the example comparison showed that this allowed an accurate result where other comparators would fail. It would be relatively straightforward to extend the ‘wrap-spans’ template and ‘key-xquery.xsl’ filter to handle other orderless language constructs such as variable declarations.
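
One refinement worth noting: XQuery allows functions with the same name but different arity, so the name-only key used above is unique only while no function name is overloaded. Below is a minimal sketch of a replacement for the ‘fy’ template in key-xquery.xsl that appends the arity to the key. It assumes that every ‘variable’ token before the first top-level ‘{’ token is a parameter name (an assumption about the tokenizer output, not something verified here), and it uses ‘#’ as an arbitrary separator:

```xml
<!-- Sketch only: variant of the 'fy' wrapper template that keys on name plus arity -->
<xsl:template match="span[@class eq 'wrapper'][starts-with(@data-open-id, 'fy')]" mode="top">
    <!-- Assumption: the first '{' operator that is a direct child opens the function body -->
    <xsl:variable name="body-start" as="element()?"
                  select="span[@class eq 'op'][. eq '{'][1]"/>
    <!-- Assumption: each 'variable' token before the body is a parameter name;
         if there is no body (an external function), all such tokens are counted -->
    <xsl:variable name="arity"
                  select="count(span[@class eq 'variable'][empty($body-start) or . &lt;&lt; $body-start])"/>
    <span deltaxml:key="{span[@class eq 'function'][1]}#{$arity}">
        <xsl:apply-templates select="@class | node()" mode="lower"/>
    </span>
</xsl:template>
```

For the local:summary-full declaration shown earlier, this would produce the key local:summary-full#1, since $emps is the only ‘variable’ token before the opening ‘{’ of the function body.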

Future Enhancements

One feature that I think could be useful is a ‘ghost’ image showing a faint rendering of the place where the code block occurred in the other version (see below). This could probably best be done with an extra input/output filter that uses a ‘placeholder’ element as a reference. Hopefully I will get the chance to try this out in a further blog post.

[Image: ‘ghost’ image showing a faint rendering of the place where the code block occurred]
