JSON array diff handling – heuristic matching

We have significantly improved the way data in JSON arrays is compared. Aligning array items in the best way is a problem we have been working on for some time, and we have developed a new heuristic algorithm to do it. The new algorithm is used when you select the Array Alignment type of orderless, which is not the default, and it replaces the previous orderless implementation.

A look into JSON arrays

JSON arrays are often used to hold data that has been dumped out of an application or database, so the order of items in the array is not important and needs to be ignored. So how then do we align the items in two arrays, one from document A and one from document B, that we are comparing?

An item in the A array that is equal to an item found in the B array should be aligned with that item. In the previous version of JSON Compare, alignment stopped there, so two items that were almost equal were not aligned.
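As a rough illustration of equality-only alignment (not the actual JSON Compare code), the Python sketch below pairs each A item with an identical, not yet used B item. The function name `align_exact` and the use of canonical JSON text as a lookup key are assumptions made for this example.

```python
import json

def align_exact(a_items, b_items):
    """Pair items of A and B that are exactly equal; the rest stay unmatched."""
    # Canonical JSON text lets objects and arrays act as dictionary keys.
    def canon(value):
        return json.dumps(value, sort_keys=True)

    unused_b = {}
    for j, item in enumerate(b_items):
        unused_b.setdefault(canon(item), []).append(j)

    pairs, only_a = [], []
    for i, item in enumerate(a_items):
        candidates = unused_b.get(canon(item))
        if candidates:
            pairs.append((i, candidates.pop(0)))
        else:
            only_a.append(i)

    matched_b = {j for _, j in pairs}
    only_b = [j for j in range(len(b_items)) if j not in matched_b]
    return pairs, only_a, only_b

a = [{"name": "Kaye Byrd", "age": 22}, {"name": "Rivera Myers", "age": 23}]
b = [{"name": "Rivera Myers", "age": 21}, {"name": "Kaye Byrd", "age": 22}]
pairs, only_a, only_b = align_exact(a, b)
print(pairs, only_a, only_b)
```

Here the two "Rivera Myers" items differ only in age, yet they are left unaligned, which is the limitation the heuristic approach addresses.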

In this new version, we take a heuristic approach to the alignment so that as well as aligning items that are exactly equal, we also align items that are similar. The heuristic algorithm seeks out the items that are most similar to one another and aligns them first, then moves on to items that have less data in common. In this way, we determine the ‘best match’ between the items in the two arrays, regardless of ordering.
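The idea of aligning the most similar items first can be sketched as a greedy pairing. This is only an illustration under stated assumptions: the similarity measure (Jaccard overlap of key/value pairs in flat objects), the `threshold` parameter, and the function names are all invented for this example and are not taken from the product.

```python
def similarity(a, b):
    """Jaccard similarity over the (key, value) pairs of two flat JSON objects."""
    pairs_a = {(k, repr(v)) for k, v in a.items()}
    pairs_b = {(k, repr(v)) for k, v in b.items()}
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

def align_greedy(a_items, b_items, threshold=0.3):
    """Align the most similar pairs first; never reuse an item on either side."""
    scored = sorted(
        ((similarity(a, b), i, j)
         for i, a in enumerate(a_items)
         for j, b in enumerate(b_items)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining candidates are too dissimilar to align
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(pairs)

a = [
    {"age": 23, "eyeColor": "blue", "name": "Rivera Myers", "favoriteFruit": "apple"},
    {"age": 22, "eyeColor": "brown", "name": "Annie Sanders", "favoriteFruit": "strawberry"},
]
b = [
    {"_id": "5d49a97fabca02eb56a3a8be", "age": 22, "eyeColor": "brown",
     "name": "Annie Sanders", "favoriteFruit": "strawberry"},
    {"_id": "5d49a97fdd9fea5ec1888bc3", "age": 21, "eyeColor": "blue",
     "name": "Rivera Myers", "favoriteFruit": "apple"},
]
print(align_greedy(a, b))
```

In this sketch "Annie Sanders" is paired first (four of five key/value pairs shared), and "Rivera Myers" is paired next even though its age differs and the B item carries an extra “_id”.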

The heuristic prefers to align items that contain common ‘unique’ data, and the best type of unique data occurs when an item in the A array has the same identifier or key as an item in the B array. However, if the data comes from two data sources which each use their own keys (so they do not match up at all!), the algorithm will still align the items based on their other data.
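One possible way to favour ‘unique’ data is to weight each shared key/value pair by how rarely it occurs across both arrays. The sketch below assumes a simple inverse-frequency weight; the scoring actually used by the product is not described here, and the function names are hypothetical.

```python
from collections import Counter

def rarity_weights(a_items, b_items):
    """Weight each (key, value) pair by 1 / number of items that contain it."""
    counts = Counter(
        (k, repr(v))
        for item in list(a_items) + list(b_items)
        for k, v in item.items()
    )
    return {pair: 1.0 / n for pair, n in counts.items()}

def weighted_overlap(a, b, weights):
    """Sum the weights of the key/value pairs that two flat objects share."""
    shared = ({(k, repr(v)) for k, v in a.items()}
              & {(k, repr(v)) for k, v in b.items()})
    return sum(weights[p] for p in shared)

a = [{"age": 22, "eyeColor": "green", "name": "Clarice Roman",
      "favoriteFruit": "banana"}]
b = [
    {"_id": "5d49a97f7841450901913a78", "age": 22, "eyeColor": "green",
     "name": "Shannon Wallace", "favoriteFruit": "strawberry"},
    {"_id": "5d49a97fe7968c2c220a1776", "age": 22, "eyeColor": "green",
     "name": "Clarice Roman", "favoriteFruit": "banana"},
]
w = rarity_weights(a, b)
print(weighted_overlap(a[0], b[0], w), weighted_overlap(a[0], b[1], w))
```

A shared name counts for more than a shared age or eye colour because fewer items carry it, so the two "Clarice Roman" items score higher than the pairing with "Shannon Wallace" even though the “_id” keys contribute nothing.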

The output preserves the original ordering as far as possible, using the order of the data in the B input as its guide.
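A minimal sketch of B-guided ordering follows. It takes an alignment as a list of (A index, B index) pairs and walks B in order. Where items found only in A should go is not stated in this article; the sketch assumes each one is placed just before the next aligned item that follows it in A, which is only a guess that happens to agree with the example output later in this post.

```python
def order_output(a_items, b_items, pairs):
    """Return (a_index, b_index) rows in B order; None marks a missing side."""
    a_for_b = {j: i for i, j in pairs}
    matched_a = set(a_for_b.values())

    # Assumption: an A-only item is attached to the next matched item in A.
    pending = {}  # matched a_index -> A-only indexes to emit just before it
    run = []
    for i in range(len(a_items)):
        if i in matched_a:
            if run:
                pending[i] = run
                run = []
        else:
            run.append(i)
    trailing = run  # A-only items with no matched successor go last

    rows = []
    for j in range(len(b_items)):
        i = a_for_b.get(j)
        for extra in pending.get(i, []):
            rows.append((extra, None))
        rows.append((i, j))
    rows.extend((extra, None) for extra in trailing)
    return rows

a = ["Rivera Myers", "Justice Talley", "John Added", "Annie Sanders"]
b = ["Rivera Myers", "Annie Sanders", "Key Patrick", "Justice Talley"]
rows = order_output(a, b, pairs=[(0, 0), (1, 3), (3, 1)])
print(rows)
```

The rows come out in B order (Rivera Myers, Annie Sanders, Key Patrick, Justice Talley), with "John Added" slotted in just before "Annie Sanders", its successor in A.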

The intention is to get the best possible matching without the need to provide further information to the comparison/diff process. That said, there are ways in which the alignment can be tuned and we are exploring the best way to control this.

Please do let us have any comments or suggestions about this new heuristic JSON array comparison. We are determined to provide the best possible result for unordered JSON array comparison.

Example

In this example, we have some data from a spreadsheet which has been loaded into a database. The database has allocated each item an identifier, shown in the JSON as “_id”. The spreadsheet data has been updated to make a few changes, delete one item and add another item. We now want to compare the revised spreadsheet with the database to see what changes have been made.

To use the new algorithm, make sure you select the Array Alignment type of Orderless, for example in the Web Client.

Thanks to JSON Generator for their website that generates JSON data.

First, we have the revised spreadsheet where the items do not have identifiers:

[
  {
    "age": 23,
    "eyeColor": "blue",
    "name": "Rivera Myers",
    "favoriteFruit": "apple"
  },
  {
    "age": 22,
    "eyeColor": "green",
    "name": "Clarice Roman",
    "favoriteFruit": "banana"
  },
  {
    "age": 22,
    "eyeColor": "green",
    "name": "Shannon Wallace",
    "favoriteFruit": "banana"
  },
  {
    "age": 22,
    "eyeColor": "blue",
    "name": "Lott Moran",
    "favoriteFruit": "banana"
  },
  {
    "age": 22,
    "eyeColor": "green",
    "name": "Kaye Byrd",
    "favoriteFruit": "strawberry"
  },
  {
    "age": 23,
    "eyeColor": "brown",
    "name": "Justice Talley",
    "favoriteFruit": "banana"
  },
  {
    "age": 21,
    "eyeColor": "green",
    "name": "John Added",
    "favoriteFruit": "apple"
  },
  {
    "age": 22,
    "eyeColor": "brown",
    "name": "Annie Sanders",
    "favoriteFruit": "strawberry"
  },
  {
    "age": 22,
    "eyeColor": "green",
    "name": "Conner Galloway",
    "favoriteFruit": "strawberry"
  },
  {
    "age": 21,
    "eyeColor": "blue",
    "name": "Joseph Neal",
    "favoriteFruit": "strawberry"
  }
]

The database items, with identifiers and in a different order, are shown next:

[
  {
    "_id": "5d49a97fdd9fea5ec1888bc3",
    "age": 21,
    "eyeColor": "blue",
    "name": "Rivera Myers",
    "favoriteFruit": "apple"
  },
  {
    "_id": "5d49a97fabca02eb56a3a8be",
    "age": 22,
    "eyeColor": "brown",
    "name": "Annie Sanders",
    "favoriteFruit": "strawberry"
  },
  {
    "_id": "5d49a97f60bd09add005eaf7",
    "age": 20,
    "eyeColor": "blue",
    "name": "Key Patrick",
    "favoriteFruit": "banana"
  },
  {
    "_id": "5d49a97fd21a56116e9e1aee",
    "age": 22,
    "eyeColor": "green",
    "name": "Kaye Byrd",
    "favoriteFruit": "strawberry"
  },
  {
    "_id": "5d49a97fe7968c2c220a1776",
    "age": 22,
    "eyeColor": "green",
    "name": "Clarice Roman",
    "favoriteFruit": "banana"
  },
  {
    "_id": "5d49a97f7841450901913a78",
    "age": 22,
    "eyeColor": "green",
    "name": "Shannon Wallace",
    "favoriteFruit": "strawberry"
  },
  {
    "_id": "5d49a97ff41896986ca6a46d",
    "age": 22,
    "eyeColor": "green",
    "name": "Conner Galloway",
    "favoriteFruit": "strawberry"
  },
  {
    "_id": "5d49a97f129f4099b11e99c0",
    "age": 22,
    "eyeColor": "blue",
    "name": "Lott Moran",
    "favoriteFruit": "banana"
  },
  {
    "_id": "5d49a97f03231b741528f24e",
    "age": 22,
    "eyeColor": "brown",
    "name": "Justice Talley",
    "favoriteFruit": "banana"
  },
  {
    "_id": "5d49a97f46f0a109d25231d9",
    "age": 21,
    "eyeColor": "blue",
    "name": "Joseph Neal",
    "favoriteFruit": "strawberry"
  }
]

All the items are different because of the addition of the “_id”, and there are also some other changes. If we compare these using the heuristic alignment, by selecting the Orderless Array Alignment type, then we can see that the changes are identified: each one appears as a “dx_delta” entry. There is some metadata at the start of the delta file; the data itself is shown as an array after the “dx_deltaJSON_delta” name.

{
    "dx_deltaJSON": {
        "dx_data_sets": "A!=B",
        "dx_deltaJSON_type": "diff",
        "dx_deltaJSON_metadata": {
            "operation": {
                "type": "compare",
                "input-format": "multi_part_strings",
                "output-format": "JSON"
            },
            "parameters": {
                "output": "full-context",
                "arrayalignment": "orderless",
                "wordbyword": false
            }
        },
        "dx_deltaJSON_delta": [
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97fdd9fea5ec1888bc3"}
                },
                "age": {
                    "dx_delta": {
                        "A": 23,
                        "B": 21
                    }
                },
                "eyeColor": "blue",
                "name": "Rivera Myers",
                "favoriteFruit": "apple"
            },
            {
                "dx_delta": {
                    "A": {
                        "age": 21,
                        "eyeColor": "green",
                        "name": "John Added",
                        "favoriteFruit": "apple"
                    }
                }
            },
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97fabca02eb56a3a8be"}
                },
                "age": 22,
                "eyeColor": "brown",
                "name": "Annie Sanders",
                "favoriteFruit": "strawberry"
            },
            {
                "dx_delta": {
                    "B": {
                        "_id": "5d49a97f60bd09add005eaf7",
                        "age": 20,
                        "eyeColor": "blue",
                        "name": "Key Patrick",
                        "favoriteFruit": "banana"
                    }
                }
            },
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97fd21a56116e9e1aee"}
                },
                "age": 22,
                "eyeColor": "green",
                "name": "Kaye Byrd",
                "favoriteFruit": "strawberry"
            },
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97fe7968c2c220a1776"}
                },
                "age": 22,
                "eyeColor": "green",
                "name": "Clarice Roman",
                "favoriteFruit": "banana"
            },
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97f7841450901913a78"}
                },
                "age": 22,
                "eyeColor": "green",
                "name": "Shannon Wallace",
                "favoriteFruit": {
                    "dx_delta": {
                        "A": "banana",
                        "B": "strawberry"
                    }
                }
            },
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97ff41896986ca6a46d"}
                },
                "age": 22,
                "eyeColor": "green",
                "name": "Conner Galloway",
                "favoriteFruit": "strawberry"
            },
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97f129f4099b11e99c0"}
                },
                "age": 22,
                "eyeColor": "blue",
                "name": "Lott Moran",
                "favoriteFruit": "banana"
            },
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97f03231b741528f24e"}
                },
                "age": {
                    "dx_delta": {
                        "A": 23,
                        "B": 22
                    }
                },
                "eyeColor": "brown",
                "name": "Justice Talley",
                "favoriteFruit": "banana"
            },
            {
                "_id": {
                    "dx_delta": {"B": "5d49a97f46f0a109d25231d9"}
                },
                "age": 21,
                "eyeColor": "blue",
                "name": "Joseph Neal",
                "favoriteFruit": "strawberry"
            }
        ]
    }
}
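To read a delta like this programmatically, something along the following lines could be used. It is a sketch that relies only on the layout visible in the output above (a whole-item “dx_delta” with an A or B member for items present on one side, and per-member “dx_delta” objects for changed values); the function name `summarise` and the status labels are made up for the example, and the sample is a trimmed copy of three items from the output.

```python
import json

def summarise(delta_doc):
    """List (name, status, changed_members) for each item in the delta array."""
    rows = []
    for item in delta_doc["dx_deltaJSON"]["dx_deltaJSON_delta"]:
        whole = item.get("dx_delta")
        if isinstance(whole, dict):
            # The entire item exists on one side only.
            side = "A" if "A" in whole else "B"
            rows.append((whole[side].get("name"), "only in " + side, []))
        else:
            # Aligned item: collect the members that carry their own dx_delta.
            changed = [k for k, v in item.items()
                       if isinstance(v, dict) and "dx_delta" in v]
            rows.append((item.get("name"), "aligned", changed))
    return rows

sample = json.loads("""
{"dx_deltaJSON": {"dx_deltaJSON_delta": [
  {"_id": {"dx_delta": {"B": "5d49a97fdd9fea5ec1888bc3"}},
   "age": {"dx_delta": {"A": 23, "B": 21}},
   "eyeColor": "blue", "name": "Rivera Myers", "favoriteFruit": "apple"},
  {"dx_delta": {"A": {"age": 21, "eyeColor": "green",
                      "name": "John Added", "favoriteFruit": "apple"}}},
  {"dx_delta": {"B": {"_id": "5d49a97f60bd09add005eaf7", "age": 20,
                      "eyeColor": "blue", "name": "Key Patrick",
                      "favoriteFruit": "banana"}}}
]}}
""")
summary = summarise(sample)
for row in summary:
    print(row)
```

With the spreadsheet as input A and the database as input B, "only in A" corresponds to the item added to the spreadsheet and "only in B" to the item deleted from it.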
