Last weekend I attended a DataKind data science hackathon. It was a lot of fun and a great way to meet people in the space and share some ideas. If it sounds the least bit interesting, I encourage you to join a DataKind event. Here’s what my team worked on, which should give you a good indication of what you might do over the course of the weekend. My code is here: project folder, supervised classification (the most interesting part), and topic modeling.
Our team (5-6 people) set out to create a text classification tool that could map a short block of text (a couple of sentences at minimum) to SDGs. What are SDGs? A few years ago, the UN created a taxonomy of 17 Sustainable Development Goals (SDGs) to help standardize the classification of the many thousands of philanthropic organizations out there. Where there was once a large, fragmented ecosystem of philanthropic organizations working independently, a standardized taxonomy lets workers more easily identify and connect with organizations that share common goals.
The obvious path is to train a classifier on a labeled dataset, but unfortunately no such dataset seems to exist. As a result, our team improvised a few paths to data creation and classification, with the result looking something like this (diagram made with draw.io, a site I stumbled upon yesterday).
The interesting part is the bottom half of the chart dealing with data collection, where we had a few streams (L to R):
- Creation of a labeled set (~24,000 examples):
  - Tax forms filed by US philanthropic organizations (those who qualify fill out Form 990), provided by DataKind in an S3 bucket.
  - Another file matching organization mission statements to NTEE codes (an outdated and messy classification system that the SDGs aim to replace).
  - A join of the two files above, plus cleaning to extract organization mission statements, and finally a mapping of some NTEE codes to SDGs via a manually created dictionary (see the sketch after this list).
- A small labeled dataset (~300 examples) discovered at the 11th hour, provided by a DataKind member.
- Text descriptions of the SDGs themselves (17 descriptions).
- A web scraping script written to query an existing (but closed) text-to-SDG online tool.
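To make that first stream concrete, here’s a minimal sketch of the join-and-map step in pandas. The file contents, column names, and mapping entries are all stand-ins I’ve invented for illustration, not the actual hackathon data:

```python
import pandas as pd

# Toy stand-ins for the two source files (all columns are assumptions)
form990 = pd.DataFrame({"ein": [1, 2, 3], "name": ["Org A", "Org B", "Org C"]})
ntee = pd.DataFrame({
    "ein": [1, 2, 3],
    "mission": ["Food bank serving families", "College prep tutoring", "Art museum"],
    "ntee_code": ["K31", "B25", "A51"],
})

# Manually created NTEE -> SDG mapping; only a subset of codes mapped
# cleanly, which is the source of the class imbalance discussed below
ntee_to_sdg = {"K31": [2], "B25": [4]}  # SDG 2 = Zero Hunger, SDG 4 = Quality Education

merged = form990.merge(ntee, on="ein", how="inner")
merged["sdgs"] = merged["ntee_code"].map(ntee_to_sdg)
labeled = merged.dropna(subset=["sdgs"])  # unmapped codes (A51 here) fall out
print(labeled[["mission", "sdgs"]])
```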
Of course, the above diagram represents the final shape of the project, and hindsight is 20/20.
Early on, with no labeled data, we tried unsupervised approaches. One teammate tried LDA topic modeling while I did the same with K-means (tokenizing, no stemming or lemmatizing, tf-idf features). K-means yielded better clusters (controlling the max/min document frequency on tf-idf is quite important for this task), but neither of us had much luck converting these clusters into predictive tools.
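Here’s a minimal sketch of that K-means pass in scikit-learn. The documents and parameter values are toy placeholders; on the real corpus, tuning the max_df/min_df bounds made the biggest difference:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Providing clean water and sanitation to rural communities.",
    "Scholarships and tutoring for low-income students.",
    "Food banks fighting hunger in urban neighborhoods.",
    "After-school education programs for underserved youth.",
]

# max_df drops near-ubiquitous boilerplate terms; min_df drops rare noise.
# On a large corpus you'd raise these (e.g. max_df=0.5, min_df=5).
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.9, min_df=1)
X = vectorizer.fit_transform(documents)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
print(kmeans.fit_predict(X))
```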
My teammate used bag of words and LDA on the text descriptions of the 17 SDGs themselves; here’s the dot product of that matrix with its transpose:
Not great, since we should be seeing a strong diagonal. This could certainly be developed further, but with off-the-shelf results like this I switched my attention to creating a labeled dataset for supervised approaches.
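For reference, here’s roughly what that experiment looks like, with abbreviated stand-ins for the SDG descriptions (the real run used all 17). A clean diagonal in the similarity matrix would mean each description gets its own distinct topic:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

sdg_descriptions = [
    "End poverty in all its forms everywhere.",
    "End hunger, achieve food security and improved nutrition.",
    "Ensure healthy lives and promote well-being for all at all ages.",
    # ...the remaining SDG descriptions
]

counts = CountVectorizer(stop_words="english").fit_transform(sdg_descriptions)
lda = LatentDirichletAllocation(n_components=len(sdg_descriptions), random_state=42)
doc_topic = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

similarity = doc_topic @ doc_topic.T  # strong diagonal = well-separated topics
print(similarity.round(2))
```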
After spending some time joining, cleaning, and standardizing the four or so datasets we had available, I was able to get around 24,000 labeled examples of text blocks and their associated SDGs. Note the pretty large class imbalance (not representative of the actual data out there, just a product of the fact that my NTEE-to-SDG mapping dictionary only covered a handful of NTEE codes that mapped cleanly):
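For what it’s worth, tallying that imbalance is a one-liner once the labels are in a DataFrame; the column names and toy rows here are assumptions:

```python
import pandas as pd

labeled = pd.DataFrame({
    "text": ["mission one", "mission two", "mission three"],
    "sdgs": [[1, 4], [4], [3]],  # each example can carry multiple SDG labels
})

# Flatten the label lists and count occurrences per SDG
print(labeled["sdgs"].explode().value_counts().sort_index())
```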
One text block can have multiple SDGs associated with it, which makes this a multi-label problem; as I’ve written about before, multi-label classification has plenty of quirks and directions for optimization. Here are the candidate classifiers with binary relevance (one vs. rest) on 7 CV folds and F1 micro scoring:
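As a sketch of how that evaluation can be set up in scikit-learn (synthetic features stand in for the tf-idf text vectors, five toy classes stand in for the 17 SDGs, and LinearSVC stands in for the full candidate list):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic multi-label data; in the real project, X was tf-idf vectors
# of mission statements and Y had 17 SDG columns
X, Y = make_multilabel_classification(
    n_samples=200, n_classes=5, n_labels=2, random_state=42
)

# Binary relevance: one independent binary classifier per label
clf = OneVsRestClassifier(LinearSVC())
scores = cross_val_score(clf, X, Y, cv=7, scoring="f1_micro")
print(scores.mean().round(3), scores.std().round(3))
```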
Considering the huge SDG class imbalances in the dataset, this wasn’t bad at all! Even under an exact-match accuracy metric (which rewards only exact label-set matches, not partial ones), scores were similarly around 0.8.
I even made a quick function that takes in text and outputs SDGs based on your classifier of choice:
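The original function isn’t reproduced here, but a hedged reconstruction might look like this (the names predict_sdgs, pipeline, and binarizer are my own placeholders):

```python
def predict_sdgs(text, pipeline, binarizer):
    """Return the SDG labels predicted for a block of text.

    `pipeline` is a fitted text pipeline (e.g. tf-idf + one-vs-rest
    classifier); `binarizer` is the fitted MultiLabelBinarizer used to
    encode the training labels.
    """
    indicator = pipeline.predict([text])  # shape (1, n_sdgs), 0/1 entries
    return list(binarizer.inverse_transform(indicator)[0])

# Usage (assuming fitted `model` and `mlb` objects):
# predict_sdgs("Providing clean water to rural villages", model, mlb)
```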
Here’s a look at SVM performance, where you can probably guess which SDG classes had almost 0 training data.
Oddly, even on the classes with enough training data, the SVM was consistently not sensitive enough, giving more false negatives than false positives on each class. Odd but good, since this can be addressed with decision thresholds, whereas over- and under-prediction scattered across classes would indicate noise.
There’s a confusion matrix for each class, but to get an idea of classifier performance, here’s the CM for the most prevalent SDG in the training set:
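Per-class confusion matrices like this come straight out of scikit-learn; a toy sketch with made-up indicator arrays:

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Toy true/predicted label indicators (rows = examples, columns = SDGs)
Y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
Y_pred = np.array([[1, 0], [0, 1], [0, 1], [0, 0]])

# One 2x2 matrix per class, laid out as [[TN, FP], [FN, TP]]
cms = multilabel_confusion_matrix(Y_true, Y_pred)
most_prevalent = Y_true.sum(axis=0).argmax()  # class with the most positives
print(cms[most_prevalent])
```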
Lastly, lots of room for improvement in taking this project forward:
- More data
- More data for SDGs that are underrepresented in this set – there’s very high class imbalance
- Tune models and diagnose overfitting/underfitting; look at ROC scores
- Supervised ensemble
- Squeeze out some more performance with multilabel-specific algorithms and metrics (https://nickcdryan.wordpress.com/2017/01/23/multi-label-classification-a-guided-tour/)
- Ensemble with unsupervised approach like LDA or k-means