Automating dataset updates

Data.govt.nz has adopted data.json, an international open standard for data harvesting, to automate agency dataset updates and additions. Tools are available to help generate the correct format, and the schema is detailed below with examples.

What is a data.json?

A data.json file is a machine readable stocktake of data used to exchange metadata between data catalogues (like data.govt.nz). It allows for a distributed model of keeping a centralised catalogue up to date from many contributors (this is also called a "Federated" model).

The data.json file conforms to open standards used widely around the world in most government open data portals. It originated from the Project Open Data initiative in the United States and is a derivative of the W3C DCAT (Data CATalog Vocabulary) interoperability standard.

How do we use a data.json to keep data.govt.nz up to date?

If an agency maintains one or more data.json files containing the complete metadata for the datasets it wishes to list on data.govt.nz, the data.govt.nz support team (info@data.govt.nz) can set these up to be harvested on a regular basis. The harvest frequencies available are:

Monthly
Weekly
Daily

The data.json file should exist at a publicly accessible URL (usually on an agency website), and this URL should not change. The agency can, however, update the file to add, remove or update datasets; data.govt.nz will reflect these changes the next time the file is harvested.

Some open data repositories can be harvested onto data.govt.nz without having to upload a data.json file. Get in touch with the data.govt.nz support team for more information.

How to generate a data.json harvest file

Option 1: CSV file

A simple method for creating a data.json harvest file is to use the provided CSV template and populate with your dataset metadata.

Populating the CSV stocktake

The names of the columns in the CSV are mostly the same as in the data.json schema (see the schema below for the names, descriptions and examples of the metadata to supply).

To make filling out your stocktake easier, some of the CSV columns can be entered in plain English rather than in the more technical standard formats. If you would rather hold your stocktake in the ISO standards, the conversion tool will respect this.

See the example.csv to get an idea of how to prepare this file for conversion.

CSV column dot notation for nested metadata

The conversion tool uses a dot (.) notation to store nested values as they appear in the data.json file. For example:

"publisher": {"name": ""} in the json file would be stored in the CSV file column with the heading publisher.name.

Entering many related dataset files into the CSV

The CSV format can handle datasets which have one or more data files or related URLs. These are entered on the same row in the CSV file as the dataset metadata.

For each individual file or API endpoint you will add a series of columns using the dot notation mentioned above. You will also add a number reference starting at 0 to ensure each column has a unique name (this is important!).

For example, you are required to provide the downloadURL and title and, optionally, the format of each dataset file.

If you had two files relating to your dataset, these would be expressed using the following CSV headings within one row:

distribution.0.downloadURL
distribution.0.title
distribution.0.format
distribution.1.downloadURL
distribution.1.title
distribution.1.format

If your URL relates to an API endpoint you can replace downloadURL with accessURL (refer to the metadata schema at the end of this guidance for other properties and values you can make use of).
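
The dot notation and numbered columns described above can be sketched in Python as follows. This is a minimal illustration and not the conversion tool's actual code: the function name `expand_row`, the sample row and the API URL are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical stocktake row: one dataset with a file and an API endpoint.
HEADINGS = [
    "title", "publisher.name",
    "distribution.0.downloadURL", "distribution.0.title", "distribution.0.format",
    "distribution.1.accessURL", "distribution.1.title", "distribution.1.format",
]
VALUES = [
    "Chief Executive Expenses", "Publisher Agency",
    "https://agency.govt.nz/link/to/ce_expenses_2016.csv",
    "Chief Executive Expenses for 2016", "csv",
    "https://agency.govt.nz/api/ce_expenses",  # assumed URL, illustration only
    "Chief Executive Expenses API", "",        # format left blank (optional)
]
SAMPLE_CSV = ",".join(HEADINGS) + "\n" + ",".join(VALUES) + "\n"


def expand_row(row):
    """Turn one flat CSV row into a nested dataset dict."""
    dataset = {}
    for heading, value in row.items():
        if not heading or not value:
            continue  # skip blank cells and stray columns
        parts = heading.split(".")
        target = dataset
        for part in parts[:-1]:
            # Each segment before the last names a nested object.
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    # Numbered groups (distribution.0, distribution.1, ...) become a list.
    numbered = dataset.get("distribution")
    if isinstance(numbered, dict):
        dataset["distribution"] = [numbered[key] for key in sorted(numbered, key=int)]
    return dataset


for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    print(json.dumps(expand_row(row), indent=2))
```

The sketch assumes that every column under `distribution` carries a number; a heading such as `distribution.title` would need separate handling.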

Option 2: Implement the data.json standard in your own system

As long as the open standard is followed, how and where you populate the final data.json is up to you. There are a number of often used geospatial open data portal systems that have already adopted this standard, or you may write your own implementation for your public facing data system or content management system.

What does a data.json look like?

The data.json file is in JavaScript Object Notation (JSON) format and is structured as an array "[ ... ]" of dataset objects "{ ... }".

e.g.

    [
     {"title": "Chief Executive Expenses",
      "license": "https://creativecommons.org/licenses/by/4.0/",
      "publisher": {"name": "Publisher Agency", "mbox": "test@test.com"},
      "distribution": [
        {
          "downloadURL": "https://agency.govt.nz/link/to/ce_expenses_2016.csv",
          "title": "Chief Executive Expenses for 2016",
          "format": "csv"
        },
        {
          "downloadURL": "https://agency.govt.nz/link/to/ce_expenses_2015.csv",
          "title": "Chief Executive Expenses for 2015",
          "format": "csv"
        }
      ]
     },
     {"title": "Roadworks locations",
       ...
     }
    ]
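
Before publishing, the overall shape of the file can be checked with a few lines of Python. This is a sketch only; the function name `load_catalogue` is an assumption and the check is not the harvester's own validation.

```python
import json


def load_catalogue(text):
    """Parse data.json text and confirm it is an array of dataset objects."""
    # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON,
    # including trailing commas before a closing bracket.
    datasets = json.loads(text)
    if not isinstance(datasets, list):
        raise ValueError("top level must be an array [ ... ]")
    for position, dataset in enumerate(datasets):
        if not isinstance(dataset, dict):
            raise ValueError(f"item {position} must be an object {{ ... }}")
    return datasets


sample = '[{"title": "Chief Executive Expenses", "distribution": []}]'
print(len(load_catalogue(sample)))
```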

Where to put your data.json file

Ideally this should reside at https://YOURORGANISATION.govt.nz/data.json, however, as long as the URL is public and conforms to the schema standard, it can be harvested into data.govt.nz.

Once you have this file in place, contact the data.govt.nz support team and let them know the location and how often you will likely update this file with new and updated datasets.

This file needs to be accessible to our harvesters - you may need to allowlist access to this file in any WAFs (web application firewalls) that you may have.

Production harvester IP: 52.64.170.227
Test harvester IP: 52.62.153.137

Character encoding

The data.json file should have ASCII or UTF-8 character encoding (as per the JSON standard). The harvester detects the encoding and reports an error if it is not acceptable.
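
One way to test the encoding before publishing is to attempt a strict UTF-8 decode of the file's bytes. The sketch below is an assumption about how such a check might look, not a description of how the harvester performs it; `check_encoding` is a hypothetical name.

```python
def check_encoding(raw_bytes):
    """Return the decoded text, or raise ValueError if the bytes are not UTF-8."""
    try:
        # ASCII is a subset of UTF-8, so one strict decode covers both cases.
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError as error:
        raise ValueError(f"not valid UTF-8 at byte {error.start}") from error


# Macrons and other non-ASCII characters are fine when saved as UTF-8.
print(check_encoding("Whangārei".encode("utf-8")))
```

In practice the bytes would come from reading the file in binary mode, for example `open("data.json", "rb").read()`.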

Data.json schema (for data.govt.nz)

Dataset

data.json field	Required?	Example value	Comments
title	Yes	"New Zealand Public Sector Websites"	A good descriptive title of your dataset.
description	Yes	"List of websites owned and administered by the New Zealand Public Sector. The Department of Internal Affairs acknowledges this list has been compiled to the best of their knowledge, but it is not a complete list of all Public Sector websites. This list will be updated as the Department becomes aware of required updates."	A longer description of the dataset, which may include methodology, caveats and other related information to help others use it appropriately.
identifier	Yes	For example `https://webtoolkit.govt.nz/guidance/domain-names/new-zealand-public-sector-websites/` or `f572a794d5aa323824ccbc72f138fc2233b54ad141a00eba`	A string that identifies the dataset now and in the future, ideally even if the dataset's title changes. If the dataset is already in a data catalogue, supply the URL of the dataset page, or unique catalogue identifier. If it is not catalogued already you can assign it a random hexadecimal string of 24 digits or more. Ideally the identifier should be globally unique - not just unique to the publisher - so a URI is highly recommended.
license	Yes	`https://creativecommons.org/licenses/by/4.0/`	Must be a license URI from those recommended in NZGOAL or empty string if not licensed.
keyword	Optional	`"keyword": ["websites", "open government", "url"]`	Keywords help to connect related datasets. Each keyword should only include numbers and letters (alphanumeric).
issued	Optional	`2011-08-26`	Date that the data was first published. Formats allowed are: 'YYYY-MM-DD', 'YYYY-MM', 'YYYY' or 'YYYY-MM-DDTHH:MM:SS.mmmmmm' (according to ISO8601)
modified	Optional	`2015-04-01`	Date that the data was most recently updated. Formats allowed are: 'YYYY-MM-DD', 'YYYY-MM', 'YYYY' or 'YYYY-MM-DDTHH:MM:SS.mmmmmm' (according to ISO8601)
publisher	Yes	`"publisher": {"@type": "org:Organization", "name": "Department of Internal Affairs"},`	Organization schema. See https://schema.org/Organization. `name` and `email` are common values to provide.
contactPoint	Yes	`"contactPoint": {"@type": "vcard:Contact","fn": "Jane Doe","hasEmail": "mailto:jane.doe@agency.gov", "hasPhone": "1234567890"}`	Contact for the specific dataset in vCard format, including the full name (`fn`), email (`hasEmail`) and, optionally, phone (`hasPhone`) of the contact person.
distribution	Yes	See "Distribution" table below.	Location for accessing/downloading the data files or accessing APIs.
landingPage	No	`https://webtoolkit.govt.nz/guidance/domain-names/new-zealand-public-sector-websites/`	URL of a web page that is specifically about the dataset and includes links to data files and supplementary information about the dataset.
references	No	`["https://webtoolkit.govt.nz/guidance/domain-names/new-zealand-public-sector-websites/"]` OR in data.json you can specify more than one reference like a distribution e.g. `[ {"url": "", "title": "", "format": ""}, {"url": "", "title": "", "format": ""} ]`. Useful for providing links to data dictionaries and vocabularies.	URL of a web page, PDF or other documentation that gives more information about the dataset. Note: Use landingPage instead for a resource URL if that is more appropriate. Should be an array, to allow multiple references to be specified.
language	No	`["en"]`	Language of the data in ISO 639-1 format. Should be an array of values `["en", "mi", ...]`.
accrualPeriodicity	No	`R/P1Y` (=annual) `R/P1W` (=weekly)	The frequency at which the dataset is published. Format: ISO 8601 Repeating Duration (or `irregular`). See the Project Open Data guidance on human readable names mapped to the ISO data standard.
temporal	No	`2000-01-15/2000-01-20` `2010-01/2010-03` `2010/2010`	The date period that the data applies to. Formatted as two ISO 8601 dates (or datetimes) separated by a slash. If the period in question is a whole year or whole month, just put the same value for start and finish - eg `2010/2010` or `2010-06/2010-06`.
spatial	No	`{\"type\":\"Polygon\",\"coordinates\":[[[2.072, 49.943],[2.072, 55.816], [-6.236, 55.816], [-6.236, 49.943], [2.072, 49.943]]]}`	The geographic location that the data applies to. Formatted as a GeoJSON point, bounding box or polygon.
theme	No	`"theme": "Fiscal, tax and economics",` or `"theme": ["Fiscal, tax and economics", "Health"],`	The main group(s) in data.govt.nz you would like to classify your dataset under to improve discoverability. Can be a single group or list. See https://catalogue.data.govt.nz/group
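
The required fields and allowed date formats in the Dataset table can be turned into a simple pre-publication check. The sketch below is illustrative only: the function names are assumptions, and the real harvester may apply different or stricter rules. Note that `datetime.strptime` is slightly lenient (it accepts an unpadded month such as `2011-8`).

```python
from datetime import datetime

# Fields marked "Yes" in the Dataset table.
REQUIRED = ("title", "description", "identifier", "license",
            "publisher", "contactPoint", "distribution")

# 'YYYY-MM-DD', 'YYYY-MM', 'YYYY' and 'YYYY-MM-DDTHH:MM:SS.mmmmmm'.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y", "%Y-%m-%dT%H:%M:%S.%f")


def is_allowed_date(value):
    """Return True if value matches one of the allowed date formats."""
    if not isinstance(value, str):
        return False
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def dataset_problems(dataset):
    """List the problems found in one dataset object (empty list if none)."""
    # A required key must be present; license may still be an empty string.
    problems = [f"missing required field: {name}"
                for name in REQUIRED if name not in dataset]
    for name in ("issued", "modified"):
        if name in dataset and not is_allowed_date(dataset[name]):
            problems.append(f"{name} is not an allowed date format: {dataset[name]!r}")
    return problems


print(dataset_problems({"title": "Roadworks locations", "issued": "26/08/2011"}))
```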

Distribution

These are for direct links to downloadable data files or access points for APIs, etc.

data.json field	Required?	Example value	Comments
downloadURL	Yes, or accessURL (see below)	`https://webtoolkit.govt.nz/files/PublicSectorWebsites01April2015.csv`	The direct URL that downloads a file with the data.
accessURL	No, unless no downloadURL	`https://webtoolkit.govt.nz/guidance/domain-names/new-zealand-public-sector-websites/`	If there is no `downloadURL` to a downloadable file, specify the `accessURL`. This is the URL of an API or other non-downloadable data location.
title	Yes	`Exposure to second hand smoke`	A descriptive title of the data resource. 50 - 70 characters preferred (CKAN concatenates long title strings).
description	No	`A study from 2015 on the effects of exposure to second hand smoke ...`	If you need to describe a particular data file further you can supply a description. 100 - 200 characters preferred.
format	No	`text/csv` or `csv`	Mime-types or file extensions.
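
The rule that each distribution needs a title plus either a `downloadURL` or an `accessURL` can be expressed as a short check. This is a sketch under the assumptions of the table above; `distribution_problems` is a hypothetical helper, not part of any data.govt.nz tool.

```python
def distribution_problems(distribution):
    """List the problems found in one distribution object (empty list if none)."""
    problems = []
    # Either URL satisfies the requirement; downloadURL is for direct files,
    # accessURL for APIs and other non-downloadable locations.
    if not (distribution.get("downloadURL") or distribution.get("accessURL")):
        problems.append("needs a downloadURL or an accessURL")
    if not distribution.get("title"):
        problems.append("needs a title")
    return problems


api_only = {"accessURL": "https://agency.govt.nz/api", "title": "Example API"}
print(distribution_problems(api_only))          # no problems expected
print(distribution_problems({"format": "csv"}))  # both problems reported
```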

Content last reviewed: 27 Mar 2019