Text Classification at Data Science Hackathon with DataKind

Last weekend I attended a DataKind data science hackathon. It was a lot of fun and a great way to meet people in the space and share some ideas. If it sounds the least bit interesting, I encourage you to join a DataKind event. Here’s what my team worked on, which should give a good indication of what you might do over the course of the weekend. My code is here: project folder, supervised classification (the most interesting part), and topic modeling.

Our team of five or six people set out to create a text classification tool that could map a short piece of text (a couple of sentences at the least) to SDGs. What are SDGs? A few years ago, the UN created a taxonomy of 17 Sustainable Development Goals (SDGs) in order to help standardize the classification of the many thousands of philanthropic organizations out there. Where there was once a large fragmented ecosystem of philanthropic organizations working independently, a standardized taxonomy lets workers more easily identify and connect to organizations that share common goals.

The obvious path is to train a classifier on a labeled dataset, but unfortunately a labeled dataset does not seem to exist. As a result our team improvised a few paths to data creation and classification, with the result looking something like this (made with draw.io, a site I stumbled upon yesterday).

990_flow

The interesting part is the bottom half of the chart dealing with data collection, where we had a few streams (L to R):

Creation of labeled set (~24,000 examples):
- Tax forms filed by US philanthropic organizations (those that qualify fill out Form 990), provided by DataKind in an S3 bucket.
- Another file that matched organization mission statements to NTEE codes (National Taxonomy of Exempt Entities, an outdated and messy classification system that SDGs aim to replace).
- A join of the two files above, plus cleaning to extract organization mission statements, and finally a mapping of some of the NTEEs to SDGs via a manually created dictionary.
Small (~300 examples) labeled dataset discovered at the 11th hour, provided by a DataKind member.
Text descriptions of SDGs (17 descriptions)
A web scraping script created to query an existing (but closed) text-to-SDG online tool.
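The join-and-map step for the large labeled set can be sketched roughly as follows. This is a plain-Python illustration only: the record layout, the organization ids, and the `NTEE_TO_SDG` entries are all hypothetical stand-ins, not the actual files or the dictionary we built.

```python
# Hypothetical records; the real 990 and NTEE files have different schemas.
missions = {  # organization id -> mission statement (from the 990 filings)
    "org-1": "  Provide free meals to families in need. ",
    "org-2": "Protect coastal wetlands and marine habitats.",
    "org-3": "Run a youth soccer league.",
}
ntee_codes = {  # organization id -> NTEE code (from the second file)
    "org-1": "K31",
    "org-2": "C32",
    "org-3": "N64",
}
# Illustrative hand-made mapping from NTEE major group (first letter) to SDG
# numbers. Only a few groups map cleanly, so unmapped organizations are dropped.
NTEE_TO_SDG = {
    "K": [2],
    "C": [14, 15],
}


def build_labeled_set(missions, ntee_codes, mapping):
    """Join missions to NTEE codes, then map each code to a list of SDGs."""
    labeled = []
    for org_id, text in missions.items():
        code = ntee_codes.get(org_id)
        if code is None:
            continue  # no NTEE code for this organization
        sdgs = mapping.get(code[0])
        if not sdgs:
            continue  # NTEE group has no SDG counterpart in the dictionary
        labeled.append((text.strip(), sorted(sdgs)))
    return labeled


labeled = build_labeled_set(missions, ntee_codes, NTEE_TO_SDG)
for text, sdgs in labeled:
    print(sdgs, text)
```

Dropping every organization whose NTEE group has no entry in the dictionary is one way such a mapping ends up producing a skewed label distribution.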

Of course, the above diagram represents the final shape of the project, and hindsight is 20/20.

Early on, with no labeled data, we tried an unsupervised approach. One teammate tried LDA topic modeling while I did the same with K-means (tokenizing, no stemming or lemmatizing, tf-idf). K-means yielded better clusters (controlling the max/min document frequency on tf-idf is quite important for this task), but neither of us had much luck converting these clusters to predictive tools.
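For reference, here is a minimal sketch of the tf-idf plus K-means setup using scikit-learn. The toy documents, the number of clusters, and the `min_df`/`max_df` values are assumptions chosen for illustration, not the settings I actually used.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for mission statements.
docs = [
    "clean water access for rural villages",
    "water sanitation and clean wells",
    "safe drinking water and sanitation programs",
    "school scholarships for girls education",
    "teacher training and school education",
    "education scholarships and teacher support",
]

# min_df drops terms seen in fewer than 2 documents (too rare to define a
# cluster); max_df drops terms seen in more than 80% of documents (too common
# to separate clusters).
vectorizer = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.8)
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

for label, doc in zip(labels, docs):
    print(label, doc)
```

On a real corpus the two document-frequency cutoffs decide which vocabulary the clusters are built from, which is why they matter so much here.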

My teammate used bag of words and LDA on text descriptions of the 17 SDGs themselves. Here’s the dot product of that matrix with its transpose:

dotmatrix

So not great: each SDG description should be most similar to itself, so we should be seeing a strong diagonal. Of course this could be developed, but with off-the-shelf results like this I switched attention to creating a labeled dataset for supervised approaches.
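The kind of check described above can be sketched like this. I am assuming here that the matrix in question is the document-topic matrix from LDA; the placeholder descriptions, the topic count, and the row normalization (which turns the dot products into cosine similarities) are my own additions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder texts standing in for the 17 SDG descriptions.
descriptions = [
    "end poverty in all its forms everywhere",
    "end hunger and achieve food security and improved nutrition",
    "ensure healthy lives and promote well being for all ages",
    "ensure inclusive and quality education and lifelong learning",
]

# Bag-of-words counts, then one topic mixture per description.
counts = CountVectorizer(stop_words="english").fit_transform(descriptions)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = lda.fit_transform(counts)

# Normalize rows so the matrix product gives cosine similarities.
unit = doc_topics / np.linalg.norm(doc_topics, axis=1, keepdims=True)
similarity = unit @ unit.T

# A useful topic model gives small off-diagonal values: each description
# resembles itself far more than it resembles any other description.
off_diagonal = similarity[~np.eye(len(descriptions), dtype=bool)]
print(np.round(similarity, 2))
print("mean off-diagonal similarity:", off_diagonal.mean())
```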

After spending some time joining, cleaning, and standardizing the 4 or so datasets we had available, I was able to get about 24,000 labeled examples of text blocks and their associated SDGs. Note the pretty large class imbalance. It is not representative of the actual data out there, just a product of the fact that my NTEE-to-SDG mapping dictionary only covered the few categories that matched cleanly:

SDG distribution

One text block can have multiple SDGs associated with it, which makes this a multi-label problem. As I’ve written about before, that comes with plenty of quirks and directions for optimization. Here are candidate classifiers with binary relevance (one vs rest) on 7 CV folds and F1 micro scoring:

classifiers_sdg

Considering the huge SDG class imbalances in the dataset, this wasn’t bad at all! Even under subset accuracy (which rewards only exact matches, not partial ones), scores are similarly around 0.8.
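A minimal sketch of this evaluation with scikit-learn follows. The toy texts and labels are invented, and I use 3 folds instead of 7 only because the toy set is tiny; the pipeline shape (tf-idf features into a one-vs-rest linear SVM) is an assumption about a reasonable setup, not a copy of my notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Invented examples; labels are SDG numbers (2, 4, 6).
texts = [
    "food bank serving hungry families",
    "clean water wells for villages",
    "school meals for hungry children",
    "water sanitation in schools",
    "tutoring and school supplies",
    "community food pantry and meals",
    "safe drinking water and sanitation",
    "teacher training for rural schools",
    "nutrition classes and food aid",
    "water filters for school children",
    "meals for hungry seniors",
    "scholarships for school children",
]
sdgs = [[2], [6], [2, 4], [4, 6], [4], [2], [6], [4], [2, 4], [4, 6], [2], [4]]

# One binary column per SDG: this is the multi-label indicator matrix.
Y = MultiLabelBinarizer().fit_transform(sdgs)

# Binary relevance: one independent linear SVM per SDG column.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))

f1_scores = cross_val_score(model, texts, Y, cv=3, scoring="f1_micro")
# With multi-label targets, "accuracy" is subset accuracy (exact match only).
exact_scores = cross_val_score(model, texts, Y, cv=3, scoring="accuracy")

print("micro F1 per fold:", f1_scores)
print("subset accuracy per fold:", exact_scores)
```

Micro-averaged F1 pools true and false positives across all labels, so partially correct predictions still earn credit, whereas subset accuracy counts a prediction only when every label matches.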

I even made a quick function that takes in text and outputs SDGs based on your classifier of choice:

Screen Shot 2017-03-12 at 7.28.16 PM
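A function along these lines might look like the sketch below. The name `predict_sdgs`, the training examples, and the choice of a one-vs-rest linear SVM are all illustrative assumptions rather than the exact code in the screenshot.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Invented training data; labels are SDG numbers.
train_texts = [
    "food bank serving hungry families",
    "clean water wells for villages",
    "school meals for hungry children",
    "water sanitation in schools",
    "tutoring and school supplies",
    "safe drinking water for all",
]
train_sdgs = [[2], [6], [2, 4], [4, 6], [4], [6]]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(train_sdgs)

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(train_texts, Y)


def predict_sdgs(text, model=model, binarizer=binarizer):
    """Return the list of SDG numbers the fitted model assigns to `text`."""
    indicator = model.predict([text])  # one row of 0/1 flags, one per SDG
    return [int(s) for s in binarizer.inverse_transform(indicator)[0]]


print(predict_sdgs("free meals for hungry families"))
```

Any fitted scikit-learn estimator with a `predict` method that returns an indicator matrix can be passed in as `model`, which is what makes swapping classifiers easy.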

Here’s a look at SVM performance, where you can probably guess which SDG classes had almost 0 training data.

Oddly, even on the classes with enough training data, the SVM was consistently not sensitive enough, giving more false negatives than false positives on each class. Odd but good, since this can be addressed by adjusting the decision thresholds, whereas over- and underprediction all over the place would indicate noise.
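To make the threshold idea concrete, here is a small plain-Python sketch. The scores and labels are made up; in practice the scores would come from something like a linear SVM's `decision_function` for a single SDG, where the default cutoff is 0.

```python
def confusion_counts(scores, y_true, threshold):
    """Count TP/FP/FN/TN when predicting positive for score >= threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, truth in zip(scores, y_true):
        predicted = score >= threshold
        if predicted and truth:
            counts["tp"] += 1
        elif predicted and not truth:
            counts["fp"] += 1
        elif truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Made-up decision scores and true labels for one SDG.
scores = [1.2, 0.4, -0.1, -0.3, -0.2, -0.8, -1.1, 0.1, -0.4, -1.5]
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

default = confusion_counts(scores, y_true, threshold=0.0)
lowered = confusion_counts(scores, y_true, threshold=-0.35)

print("threshold  0.00:", default)  # 3 false negatives, 1 false positive
print("threshold -0.35:", lowered)  # 1 false negative, 2 false positives
```

Lowering the cutoff trades a few extra false positives for fewer false negatives, which is the adjustment a consistently under-sensitive classifier needs. The threshold should be picked on held-out data, not on the test set.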

There’s a confusion matrix for each class, but to get an idea of classifier performance here’s the CM for the most prevalent SDG in the training set:

Screen Shot 2017-03-12 at 7.47.06 PM
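Per-class confusion matrices for a multi-label problem can be computed in one call with scikit-learn's `multilabel_confusion_matrix` (available in newer scikit-learn releases). The label arrays below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Rows are text blocks, columns are three hypothetical SDG classes.
y_true = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
])
y_pred = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])

# Result has shape (n_classes, 2, 2); each 2x2 block is [[TN, FP], [FN, TP]].
cms = multilabel_confusion_matrix(y_true, y_pred)

for class_index, cm in enumerate(cms):
    tn, fp, fn, tp = cm.ravel()
    print(f"class {class_index}: TN={tn} FP={fp} FN={fn} TP={tp}")
```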

Lastly, there is lots of room for improvement in taking this project forward:

More data
More data for SDGs that are underrepresented in this set – there’s very high class imbalance
Tune models and diagnose for overfitting/underfitting, ROC scores
Supervised ensemble
Squeeze out some more performance with multilabel-specific algorithms and metrics (https://nickcdryan.wordpress.com/2017/01/23/multi-label-classification-a-guided-tour/)
Ensemble with unsupervised approach like LDA or k-means