Income Analysis – US Census Data

A couple months back, I worked on analysis and predictive modeling of US salary given census data. Full Jupyter notebook here, below are some details and some of the more interesting findings.

In general, metadata is below and contains lots of null values (as you might suspect of census data).

[Screenshot: dataset metadata summary]

Handling of null values is worth some attention for this dataset. According to the census website, rather than just a null or missing value, “not in universe” (the most common NaN-style value) actually means the question does not apply to the respondent. For example, asking about a spouse’s income after having determined that the respondent is a 5-year-old is not an applicable question, so the response is “not in universe.” Thus, contrary to what I first thought, “not in universe” is not the same thing as a missing/null value; it actually contains important information about the survey respondent, namely “this question does not apply.” The relevance is obvious if one considers a feature like “reason for unemployment,” where an answer of “not in universe” effectively means “is employed.” In summary, we’ll keep these values for the prediction phase.
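
To make that concrete, here’s a minimal sketch of loading the data and measuring how prevalent these answers are per feature (the file name and exact category string are assumptions, not the notebook’s actual values):

import pandas as pd

# Load the census extract; keep "Not in universe" as a real category
# rather than converting it to NaN (file name is hypothetical).
df = pd.read_csv("census_income.csv")

# Share of "Not in universe" answers per column, to see which features
# mostly encode "question does not apply."
niu_share = (df == "Not in universe").mean().sort_values(ascending=False)
print(niu_share.head(10))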

However, I’m fairly skeptical about the integrity of all of our “not in universe”-type answers, because without further validating the census methodology (the website appears to have been taken down) it seems possible that things like “other,” “choose not to respond,” “no response,” etc could all have been mapped to a general “not in universe”-type answer.

Anyway, let’s look at age.

[Figure: age distribution]

A curious dip in the 13-30 year old range, probably too dramatic of a dip to be purely accounted for by the idea that “young folks” have better things to do than fill out a census. Especially given that census methodology involves a householder providing information on dependents, which should account for the 13-18 year old group. Unfortunately, there didn’t seem to be an obvious answer.

[Figure: income distribution]

Distribution of income. Specifically, 93.8% of the population makes below $50k.

 income       count  percentage
 - 50000.    187141   93.794199
 50000+.      12382    6.205801
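
For reference, a table like this is a couple of lines in pandas (assuming the label column is named income):

# Counts and percentages per income class (column name assumed).
summary = pd.DataFrame({
    "count": df["income"].value_counts(),
    "percentage": df["income"].value_counts(normalize=True) * 100,
})
print(summary)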

[Figure: income class by race]

Somewhat surprisingly, a significant but not overwhelming difference when looking at race (y-axis is count, not percentage).

[Figure: income class by sex]

Males are overwhelmingly more likely to make $50k+.

[Figure: age distribution of $50k+ earners]

In terms of age, skews towards 40-50 year olds. Presumably late-career, pre-retirement individuals.

After some preprocessing using sklearn LabelEncoder and feature selection, logistic regression gives 94.5% accuracy with 10-fold CV. Not a great score given that the null accuracy (the percentage of those making under 50k) is around 93%. More interesting is a look at the coefficient values:
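
A minimal sketch of that pipeline, under assumptions about column names and the exact label string (the actual notebook may differ):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Features and binary target (label string is an assumption).
X = df.drop(columns=["income"]).copy()
y = (df["income"] == "50000+.").astype(int)

# LabelEncoder maps each category to an integer code -- note this treats
# categoricals as if they were ordered, a problem revisited below.
for col in X.select_dtypes(include="object").columns:
    X[col] = LabelEncoder().fit_transform(X[col])

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=10).mean())  # ~0.945 reported above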

Parameter coefficients:

    feature                                   coefficient value
 0  wage per hour                                     -0.000196
 1  dividends from stocks                              0.000244
 2  age                                                0.033676
 3  capital gains                                      0.000142
 4  weeks worked in year                               0.066085
 5  education                                          0.052389
 6  major occupation code                              0.016101
 7  race                                               0.073159
 8  sex                                                1.431965
 9  full or part time employment stat                  0.009928
10  detailed household summary in household            0.110222
11  citizenship
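
A table like this can be built by pairing the fitted model’s coef_ with the column names (a sketch continuing from the variables defined above):

# Fit on the full dataset and tabulate coefficients per feature.
clf.fit(X, y)
coefs = pd.DataFrame({
    "feature": X.columns,
    "coefficient value": clf.coef_[0],
})
print(coefs)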

So what does a high-income individual look like?

Based on coefficients in our model, univariate analysis, and distribution comparisons of under- and over-50K earner populations, the person most likely to make more than $50K a year is:

  • Male
  • Over (roughly) 25 years old
  • A householder

…with additional features like race, education, and citizenship playing a less significant role. Note that in this analysis I am characterizing “significance” as a combination of predictive power and how much of the general population a given property extends to. For example, it is absolutely true that holding a PhD is highly correlated with a high income, but because the population of PhD holders is very small relative to the entire population, this factor is less significant. It is entirely possible that under another analysis with binarized features, or under an analysis that examined only the proportional distribution of a given feature against the target classes, features like “holds a PhD” would be deemed very significant.

However, there are obviously a few odd things going on here: age has a relatively small coefficient even though it plainly has a large effect on income (consider people under 25 and those over retirement age), and wage per hour actually has a negative coefficient! This points to a few lackluster properties of our model that affect these results:

Considerations

1) The coefficients are going to be confused by the fact that our categorical variables are treated like continuous ones. For something like sex, which is a binary category in our survey, that poses no problem (which could be part of the reason it has the highest coefficient score). However, age, education, race, etc. are going to be treated as continuous variables and so will provide much less useful information to the classifier than they would if split out into multiple columns and binarized, as sketched below.
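
A sketch of the fix, using pandas get_dummies to binarize the categoricals (continuing from the data loaded above; whether this was tried in the actual notebook is not shown here):

# One-hot encode categoricals so each category becomes its own binary
# column instead of an arbitrary integer code.
X_onehot = pd.get_dummies(df.drop(columns=["income"]))
print(cross_val_score(LogisticRegression(max_iter=1000),
                      X_onehot, y, cv=10).mean())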

2) The impact of “not in universe” responses on our data. For example, household status has the second highest coefficient value, possibly in large part because that feature contains no “not in universe” values. This feature is probably well-suited for a classification task not because it’s necessarily more correlated with high income than other features in the real world, but because it contains more complete information about respondents.

As another example, the “wage per hour” feature should be very highly correlated with whether or not someone makes more than $50K/year. In fact this feature, combined with “weeks worked per year,” should be THE pair of features that predicts how much you make, but because the data was incomplete, this feature was mostly “0” in our survey data, and it is therefore of less consequence to our classifier. Frankly, I’m not sure why its coefficient came out negative, but with more time, diagnostics could perhaps tell us where the model went wrong.
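
The incompleteness is easy to verify (column name assumed):

# Fraction of respondents whose recorded hourly wage is zero.
print((df["wage per hour"] == 0).mean())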

The lesson is that the classifier balances features that do have underlying relationships with the target variable against the information-content of those features, and we should keep this in mind when attempting to use classifier results to interpret the real world.

Here’s a normalized confusion matrix for debugging classifier performance; the second row shows that the model correctly identifies only about 15% of the 50k+ earners:

Normalized confusion matrix
[[ 0.99718394  0.00281606]
 [ 0.84566306  0.15433694]]
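
For reference, a sketch of how such a matrix might be produced, assuming the variables from the sketches above and a simple held-out split (split parameters are assumptions):

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hold out a test set, then row-normalize so each true class sums to 1.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf.fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test)).astype(float)
print(cm / cm.sum(axis=1, keepdims=True))
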
Random forest seems like a better fit for this problem, and indeed scores 98.5% accuracy on a test set.
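
A sketch of that comparison, with hyperparameters that are my assumptions rather than the notebook’s:

from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on the same split and score on the held-out set.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # ~0.985 reported above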
