Multi-label Classification: A Guided Tour

Introduction

I recently undertook some work that looked at tagging academic papers with one or more labels based on a training set.

A preliminary look through the data revealed about 8000 examples, 2750 features, and…650 labels. For clarification, that’s 2750 sparse binary features (keyword indices for the articles), and 650 labels, not classes. Label cardinality (average number of labels per example) is about 2, with the majority of labels only occurring a few times in the dataset…doesn’t look good, does it? Nevertheless, more data wasn’t available and label reduction wasn’t on the table yet, so I spent a good amount of time in the corners of academia looking at multi-label work. 

Continue reading “Multi-label Classification: A Guided Tour”

Income Analysis – US Census Data

A couple months back, I worked on analysis and predictive modeling of US salary given census data. Full Jupyter notebook here, below are some details and some of the more interesting findings.

In general, metadata is below and contains lots of null values (as you might suspect of census data).

screen-shot-2017-01-22-at-11-21-22-pm Continue reading “Income Analysis – US Census Data”