Comparing XQuery with DeltaXML Core

Typically, the DeltaXML Core product is used to compare XML content, but have you ever considered using the Core product to compare non-XML documents? In this blog post, I experiment with using Core to compare (non-XML) XQuery code.

The objective here is to use a DXP Pipeline Configuration file to define a pipeline for the Core comparator. Two simple XSLT filters are used, one for input and one for output. The input filter converts XQuery to XML with one element for each token; it relies on an imported tokenizer function for XQuery from the open source XMLSpectrum project (for which I’m currently the sole contributor).

A high-level view of the XQuery pipeline I developed is shown below:

A view of the configured DXP pipeline


Running the XQuery comparison

For this experiment I’m using the Java version of Core and invoking the comparison from an Ant build file. Within this, the run target invokes the DeltaXML command.jar with 5 command-line arguments:

  1. compare: the Core method to invoke
  2. xquery: the id attribute of the Pipeline Configuration file
  3. input-file1.xml: input XML file 1, which holds the URI of the 1st XQuery file to compare
  4. input-file2.xml: input XML file 2, which holds the URI of the 2nd XQuery file to compare
  5. result.html: the destination file

<project name="compare-xquery" default="run" basedir=".">

    <target name="run">
        <java jar="../../command.jar" fork="yes" failonerror="yes">
            <arg value="compare"/>
            <arg value="xquery"/>
            <arg value="input-file1.xml"/>
            <arg value="input-file2.xml"/>
            <arg value="result.html"/>
        </java>
    </target>

    <target name="clean">
        <delete>
            <fileset dir="." includes="result.html"/>
        </delete>
    </target>

</project>

The Ant build file: build.xml
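
As a rough illustration of the two input files named in the build file: each one is expected to be a tiny XML document whose single text-file element holds the location of an XQuery file (this shape is taken from the comment in the input filter later in this post). The file name below is the one used in that comment; the second input file would presumably look the same but point at the other XQuery file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- input-file1.xml: holds only the location of the first XQuery file.
     The input filter normalizes the surrounding whitespace, so the
     value can also be placed on a line of its own. -->
<text-file>xqdoc-display1.xqy</text-file>
```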

The Comparison Result

Before looking at how the comparison pipeline is defined, let’s first have a look at the result of an XQuery comparison performed on 2 small test files. Each file defines the same XQuery function, but three minor changes were made to the second file. The HTML output from the comparison of these files is rendered below:

declare function display:print-modules($local as xs:boolean) as element()+ {
    (
        <div class="homehomeyyy">
            {
                if (fn:exists(fn:collection($display:XQDOC_COLLECTION)/xq:xqdocxq:xqdocxq:/modulexq:module[@type="library"])) then
                    (
                        <h4>Library ModulesLibrary Module</h4>,
                        <br/>,
                        <br/>,
                        for $x in fn:collection($display:XQDOC_COLLECTION)/xq:xqdoc[xq:module/@type="library"]
                        order by $x/xq:module/xq:uri
                        return
                            (
                                display:build-link("get-module",
                                    $local,
                                    (fn:string($x/xq:module/xq:uri)),
                                    display:decode-uri(fn:string($x/xq:module/xq:uri))
                                ),
                                <br/>
                            )
                    )
                else
                    ()
            }
        </div>
    )
};

Syntax highlighted result with differences

The output, as shown above, is HTML that renders a syntax-highlighted version of the result of comparing the two XQuery files, with the background color indicating changes – deletions are in red and additions in green. A couple of things can be observed from this: 1) the granularity for the changes is at the ‘token’ level, and 2) the tokens are syntax-highlighted as they would have been in the two input XQuery files.

CSS used to style the HTML is also generated by the output filter. The HTML produced has class attributes that allow the CSS to be used to render the background and foreground colors as required. To illustrate this, here’s a small part of the rendered HTML:

/xq:xqdocxq:xqdocxq:/modulexq:module

Extracted part of the HTML output

And here is the HTML code used to render the above:

<pre>
    <span class="step">/</span>
    <span class="partA qname">xq:xqdoc</span>
    <span class="partB qname">xq:xqdocxq:</span>
    <span class="partA step">/</span>
    <span class="partB qname">module</span>
    <span class="partA qname">xq:module</span>
</pre>

HTML code with class attributes used for CSS styling

Note: The only change to the XQuery in this extract was the deletion of the ‘step’ operator. This rendered the XQuery invalid, because we’re left with an invalid QName, ‘xq:xqdocxq:module’. Unsurprisingly, XMLSpectrum doesn’t do too well tokenizing XQuery that won’t compile, hence the unexpected output where the invalid QName is split into two.

Every span element in the HTML source represents an XQuery token. Each span element has a class attribute that holds up to 2 space-separated values:

  1. Token Type – always present; the type of XQuery token. For example, step denotes an XQuery step operator.
  2. Part Identifier – possible values are partA or partB, indicating the A or B origin of the token; only present when no match is found for the token in the other file.

Now that we’ve previewed the output, it’s time to look at the pipeline configuration and the filters used to produce it:

The Pipeline Configuration

The Pipeline Configuration file, referenced in the Ant file by its ‘xquery’ id attribute, declares the input and output filters for the comparison. In this case there is just one input filter and one output filter:

<!DOCTYPE comparatorPipeline SYSTEM "../dxp/dxp.dtd">
<comparatorPipeline description="compare xquery" id="xquery">
    <inputFilters>
        <filter>
            <file path="xquery2xml.xsl" relBase="dxp"/>
        </filter>
    </inputFilters>
    <outputFilters>
        <filter>
            <file path="xquery-tokens2html.xsl" relBase="dxp"/>
        </filter>
    </outputFilters>
    <outputProperties>
        <property name="indent" literalValue="no"/>
    </outputProperties>
    <comparatorFeatures>
        <feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
        <feature name="http://deltaxml.com/api/feature/enhancedMatch1" literalValue="true"/>
    </comparatorFeatures>
</comparatorPipeline>

The Core Pipeline Configuration file: compare-xquery.xml

The Input Filter

The Input Filter first fetches the XQuery file content as a string by invoking the unparsed-text XPath function, using the URI contained within the input XML file. The resulting string is then passed as an argument to the xqf:show-xquery function (imported from XMLSpectrum), which returns a sequence of span elements. These are wrapped in a pre element, which is output as the principal result of the filter.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xqf="urn:xq.internal-function"
    xmlns:f="internal"
    version="2.0"
    exclude-result-prefixes="f xs xqf">

    <xsl:import href="xmlspectrum-xsl/xq-spectrum.xsl"/>

    <!-- Input XML is a single 'text-file' element containing the file URI: eg.
        <text-file>
            xqdoc-display1.xqy
        </text-file>
    -->
    <xsl:template match="/">
        <xsl:variable name="text-file-uri" select="f:path-to-uri(normalize-space(text-file))"/>
        <xsl:message>xquery2xml transform on: <xsl:value-of select="$text-file-uri"/></xsl:message>
        <xsl:variable name="file-content" as="xs:string" select="unparsed-text($text-file-uri)"/>
        <xsl:variable name="tokens" as="element()*" select="xqf:show-xquery($file-content)"/>
        <pre xmlns="http://www.w3.org/1999/xhtml">
            <xsl:sequence select="$tokens"/>
        </pre>
    </xsl:template>

    <xsl:function name="f:path-to-uri">
        <xsl:param name="path"/>
        <xsl:choose>
            <xsl:when test="matches($path, '^[A-Za-z]:.*')">
                <xsl:value-of select="concat('file:/', $path)"/>
            </xsl:when>
            <xsl:when test="starts-with($path, '/')">
                <xsl:value-of select="concat('file://', $path)"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="$path"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>

</xsl:stylesheet>

The Input Filter: xquery2xml.xsl
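
To make the filter’s output more concrete, here is a hand-written sketch of what the tokenized XML might look like for the path fragment /xq:xqdoc/xq:module. The step and qname class values are the ones that appear in the HTML extract earlier in this post; the exact token boundaries and any further class values are determined by XMLSpectrum, so treat this as an approximation rather than captured output.

```xml
<!-- Illustrative only: line breaks and indentation are added here for
     readability; the filter itself emits the span elements back to back. -->
<pre xmlns="http://www.w3.org/1999/xhtml">
    <span class="step">/</span>
    <span class="qname">xq:xqdoc</span>
    <span class="step">/</span>
    <span class="qname">xq:module</span>
</pre>
```

Because every token is its own element, Core can align the two documents token by token rather than line by line.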

The Output Filter

The output from Core conforms to the deltaV2 format. From the point of view of our XSLT filter, the most significant part of this is the deltaxml:deltaV2 attribute, which indicates the origin of the content within the associated element. The filter must also take into account additional elements in the deltaxml namespaces that represent changes within a single element and changes to attributes. The XSLT for handling the deltaV2 format is relatively straightforward, requiring just a handful of short templates.
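
As a simplified, hand-written sketch (not captured Core output), a delta for the extract discussed earlier might look something like the following. The first six span elements mirror the HTML extract shown above. The final span is a hypothetical case, with made-up text values, included only to show the textGroup and text elements that the output filter handles; whether Core reports a change as separate A and B elements or as a textGroup inside one element depends on how it aligns the tokens.

```xml
<pre xmlns="http://www.w3.org/1999/xhtml"
     xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
     deltaxml:deltaV2="A!=B">
    <!-- present and identical in both inputs -->
    <span class="step" deltaxml:deltaV2="A=B">/</span>
    <!-- present only in input A, or only in input B -->
    <span class="qname" deltaxml:deltaV2="A">xq:xqdoc</span>
    <span class="qname" deltaxml:deltaV2="B">xq:xqdocxq:</span>
    <span class="step" deltaxml:deltaV2="A">/</span>
    <span class="qname" deltaxml:deltaV2="B">module</span>
    <span class="qname" deltaxml:deltaV2="A">xq:module</span>
    <!-- hypothetical: the same element matched in both inputs,
         but with different text content -->
    <span class="qname" deltaxml:deltaV2="A!=B">
        <deltaxml:textGroup deltaxml:deltaV2="A!=B">
            <deltaxml:text deltaxml:deltaV2="A">xq:uri</deltaxml:text>
            <deltaxml:text deltaxml:deltaV2="B">xq:name</deltaxml:text>
        </deltaxml:textGroup>
    </span>
</pre>
```

The output filter maps the A and B values to the partA and partB classes, and passes A=B elements through with their original class.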

The only goal remaining for the XSLT is to enclose the pre element within HTML content to make a valid HTML document, and to generate the CSS file (linked to from the HTML) by calling f:get-css, which uses the color-theme parameter to generate the appropriate CSS for rendering.

<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
    xmlns:f="internal"
    xmlns:dxx="http://www.deltaxml.com/ns/xml-namespaced-attribute"
    exclude-result-prefixes="xs deltaxml dxa dxx f">

    <xsl:import href="xmlspectrum-xsl/highlight-file.xsl"/>

    <xsl:output method="html"/>

    <xsl:param name="title" select="'HTML Result'"/>
    <xsl:param name="color-theme" select="'pg-light'"/>
    <xsl:variable name="css-name" select="'theme.css'"/>

    <xsl:template match="/">
        <html>
            <head>
                <title>
                    <xsl:value-of select="$title"/>
                </title>
                <!-- for dark background style:
                <style type="text/css">
                    span.partA {background-color:#501010} span.partB {background-color:#105010}
                </style>
                -->
                <style type="text/css">
                    span.partA {background-color:#ffdada} span.partB {background-color:#daffda}
                </style>
                <link rel="stylesheet" type="text/css" href="{$css-name}"/>
                <xsl:if test="$font-name eq 'scp' and $css-inline eq 'yes'">
                    <style>
                        @import url(https://fonts.googleapis.com/css?family=Source+Code+Pro);
                    </style>
                </xsl:if>
            </head>
            <body>
                <div>
                    <pre class="spectrum">
                        <xsl:apply-templates select="xhtml:pre/*"/>
                    </pre>
                </div>
            </body>
        </html>
        <xsl:result-document href="{$css-name}" method="text" indent="no">
            <xsl:sequence select="f:get-css()"/>
        </xsl:result-document>
    </xsl:template>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="xhtml:span[not(contains(@deltaxml:deltaV2, '!='))]">
        <span>
            <xsl:apply-templates select="@* | node()"/>
        </span>
    </xsl:template>

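    <!-- the explicit priority resolves the overlap with the previous template for A-only and B-only spans -->
    <xsl:template match="xhtml:span[@deltaxml:deltaV2 = ('A','B')]" priority="1">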
        <xsl:variable name="part" select="if (@deltaxml:deltaV2 eq 'A') then 'partA' else 'partB'"/>
        <span class="{$part, @class}">
            <xsl:value-of select="."/>
        </span>
    </xsl:template>

    <xsl:template match="xhtml:span[contains(@deltaxml:deltaV2, '!=')]">
        <xsl:apply-templates select="@* | node()" mode="not-equal"/>
    </xsl:template>

    <xsl:template match="@deltaxml:deltaV2"/>
    <xsl:template match="@deltaxml:deltaV2" mode="not-equal"/>
    <xsl:template match="@class" mode="not-equal"/>
    <xsl:template match="deltaxml:attributes" mode="#default not-equal"/>
    <xsl:template match="deltaxml:textGroup" mode="not-equal">
        <xsl:apply-templates mode="group"/>
    </xsl:template>

    <xsl:template match="deltaxml:text" mode="group">
        <xsl:variable name="part" select="if (@deltaxml:deltaV2 eq 'A') then 'partA' else 'partB'"/>
        <xsl:variable name="class" select="f:get-class(.)"/>
        <span class="{$part, $class}">
            <xsl:value-of select="."/>
        </span>
    </xsl:template>

    <xsl:template match="text()" mode="not-equal">
        <span class="{../@class}">
            <xsl:value-of select="."/>
        </span>
    </xsl:template>

    <xsl:function name="f:get-class">
        <xsl:param name="text-element" as="element(deltaxml:text)"/>
        <xsl:variable name="class" select="$text-element/../../@class"/>
        <xsl:variable name="part" select="$text-element/@deltaxml:deltaV2"/>
        <xsl:value-of select="if (exists($class)) then
                                $class
                                else $text-element/../preceding-sibling::deltaxml:attributes/
                                dxa:class/deltaxml:attributeValue[@deltaxml:deltaV2 eq $part]"/>
    </xsl:function>

</xsl:stylesheet>

The Output Filter: xquery-tokens2html.xsl

Conclusion

I’ve shown here that Core can perform a ‘token by token’ comparison of XQuery and return the result as syntax-highlighted XQuery rendered in HTML and CSS, using just 2 very simple XSLT filters and some simple pipeline configuration. The tokenisation of XQuery was handled separately by an XSLT function imported by the input filter.

This experiment does show that converting a non-XML language, in this case XQuery, into XML to achieve a more semantic/intelligent display of changes is not too difficult. These initial results show that it is worth the effort. Because the resultant differences are represented in XML it would also be possible to generate reports on changes rather than just display a red-lined document. Such reports could be useful for documentation or audit.

Future Enhancements

As it stands we’ve produced an XQuery code comparator that provides much better granularity than a text-based, line-by-line comparison tool. However, we could enhance functionality considerably with some fairly simple updates to the input and output filters. I hope to look at these enhancements in future blog posts, but here are some that I’ve identified:

  • Ignore differences in whitespace tokens used only for XQuery formatting
  • Ignore changes to the location of variable and function definitions – provided they remain in scope
  • Provide more granular matching (using the ‘word-by-word’ feature) within certain token types, such as literal-text tokens
