Using census, social security and tax data from the Multi-Agency Data Integration Project (MADIP) to impute the complete Australian income distribution

Nicholas Biddle; Dinith Marasinghe

Working paper

14 Apr 2021

Publisher

Tax and Transfer Policy Institute

Data processing; Machine learning; Income distribution; Australia

Description

Abstract:

The distribution of income, income dynamics, and how observable characteristics predict an individual’s position in the income distribution are core topics in economics and social science research, and of keen interest to policy makers. Researchers approach these topics using a combination of cross-sectional surveys, panel studies, and administrative datasets. In Australia, all three types of dataset have been used to help answer such questions, but no single dataset is free of limitations in sample size, sample representation, quality of income data, or longitudinal availability. A relatively new dataset, the Multi-Agency Data Integration Project (MADIP) Basic Longitudinal Extract (BLE), has the potential to extend our knowledge of income in Australia by combining income-related data from a targeted survey (the 2011 or 2016 Census of Population and Housing), income tax records at the individual level, and information on access to social security. Each of these sources has limits when used on its own. One way to overcome those limits is to combine the sources on the MADIP BLE to create a synthetic income measure for each individual. For 2011, this is relatively straightforward, as there are three sources of information for each individual. For the other years, though, there are only two sources of information: personal income tax (PIT) records and Social Security and Related Information (SSRI). To overcome this limitation, we borrow information from the first wave of data (2011) to help estimate income for the remaining years (2012-16). After testing nine machine-learning approaches using training and test datasets from the MADIP BLE 2011, we generated a synthetic income measure that matched the income distribution in the Household, Income and Labour Dynamics in Australia (HILDA) survey far better than either tax or census data alone. The measure also captured income dynamics reasonably well, albeit understating them somewhat. This new synthetic income data is available for further analysis for over 15 million individuals, compared with only around 17,000 for HILDA and even fewer for other sample surveys.

Publication Details

Copyright:

The authors 2021

Access Rights Type:

open

Series:

TTPI Working Paper 8/2021

Post date:

14 Apr 2021
