Assessing the effect of geographic information on data linkage quality using Vocational Education and Training in Schools data, 2014

28 Oct 2014

This research paper compares the quality of linking two Vocational Education and Training in Schools datasets, each containing a different level of geographic detail, to 2011 Census of Population and Housing data.

Executive summary

As more governments and research sectors look towards using data integration to better answer policy questions, solving the problem of how to maximise linkage quality becomes paramount. This research paper compares two datasets to determine the impact that geographic information has on data linkage quality.

Previous ABS research examined National Centre for Vocational Education Research data with large-area geography and investigated what effect this geography had on data linkage quality. The research recommended improvements to linkage methods and to the data itself to create a suitable linked dataset. Further work showed that improved linkage methods, such as basing linkage on the geographic information available and weighting or calibrating the linked dataset to the input dataset, produced a linked dataset of sufficient quality for statistical purposes. This improved method was then used to successfully link Vocational Education and Training in Schools data to 2011 ABS Census of Population and Housing (Census) data for the publication Outcomes from Vocational Education and Training in Schools, experimental estimates (ABS cat. no. 4260.0).
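The idea of weighting a linked dataset to its input dataset can be illustrated with a minimal post-stratification sketch. It assumes, purely for illustration, that each linked record in a group receives a weight equal to the input count divided by the linked count for that group. The function name, group labels and counts are hypothetical and are not taken from the ABS method.

```python
from collections import Counter


def calibration_weights(input_groups, linked_groups):
    """Return one weight per group so weighted linked counts match input counts.

    Assumption: simple post-stratification (weight = input count / linked count).
    Groups with no linked records are not handled in this sketch.
    """
    input_counts = Counter(input_groups)
    linked_counts = Counter(linked_groups)
    return {g: input_counts[g] / linked_counts[g] for g in linked_counts}


# Hypothetical group labels for illustration only (not real data).
input_groups = ["major city"] * 6 + ["regional"] * 4
linked_groups = ["major city"] * 5 + ["regional"] * 2

weights = calibration_weights(input_groups, linked_groups)

# After weighting, each group's linked total matches its input total.
weighted_totals = {g: weights[g] * n for g, n in Counter(linked_groups).items()}
```

In this toy case the under-linked "regional" group receives the larger weight, which is how calibration restores the group balance of the input dataset.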

In this current research, Queensland Curriculum and Assessment Authority data containing small-area geography was linked to the Census. The linkage quality of this data was compared to the results of the previous research to examine the effects of using small-area geography on data linkage quality.

The results showed that linking on small-area geography increased linkage rates compared to linking on large-area geography, even after improvements in linkage methods were considered. Linkage rates were 12.4 percentage points higher than when using large-area geography with improved linkage methods, and 22.9 percentage points higher than when using large-area geography without improved linkage methods. Further analysis showed that the more detailed geography also increased the representativeness of the integrated dataset, particularly for variables based on geography.
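The two percentage-point figures can be combined with simple arithmetic. The short sketch below assumes both comparisons are made against the same small-area linkage rate. Under that assumption, their difference is the gain attributable to improved methods alone on large-area geography. The derived figure is not stated in the paper, and the variable names are illustrative.

```python
# Percentage-point gains from small-area geography, as quoted in the text.
gain_vs_large_improved = 12.4   # versus large-area geography, improved methods
gain_vs_large_original = 22.9   # versus large-area geography, original methods

# Assumption: both gains are measured from the same small-area linkage rate,
# so their difference is the implied gain from improved methods alone.
implied_method_gain = round(gain_vs_large_original - gain_vs_large_improved, 1)
print(implied_method_gain)  # 10.5
```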

Lastly, for the dataset with small-area geography, more links were made on variables with lower duplicate rates, so these links were likely to be more accurate than links made on variables with higher duplicate rates.
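To make the notion of a duplicate rate concrete, here is a minimal sketch. It assumes the duplicate rate of a set of linking variables is the share of records whose combined key value is shared with at least one other record. This definition, the function name and the records are all hypothetical, for illustration only.

```python
from collections import Counter


def duplicate_rate(records, key_fields):
    """Share of records whose key value is shared with at least one other record."""
    keys = [tuple(r[f] for f in key_fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] > 1) / len(keys)


# Hypothetical records (not real data): same people, two levels of geography.
records = [
    {"sex": "F", "age": 17, "large_area": "A", "small_area": "A1"},
    {"sex": "F", "age": 17, "large_area": "A", "small_area": "A2"},
    {"sex": "M", "age": 16, "large_area": "A", "small_area": "A1"},
    {"sex": "M", "age": 16, "large_area": "A", "small_area": "A1"},
]

rate_large = duplicate_rate(records, ["sex", "age", "large_area"])
rate_small = duplicate_rate(records, ["sex", "age", "small_area"])
```

In this toy example the key built on small-area geography has the lower duplicate rate, so fewer records are ambiguous candidates for a link.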

While linked datasets of sufficient quality can be created using large-area geography, more detailed geography improves the quality of linked datasets in multiple areas. Improvements to the detail of geographical and other information on administrative data should be sought to deliver these enhancements. However, improved data linkage methods have also made it possible to integrate data even with less detailed geographical information, and greater use can be made of existing datasets to create a richer and more informative picture of Australia.
