Here’s another post I co-authored with Chris McCormick on how to quickly and easily create a SOTA text classifier by fine-tuning BERT in PyTorch. This transfer learning approach is well worth a look if you’re interested in creating a high-performance NLP model.
Please check out the post I co-authored with Chris McCormick on BERT Word Embeddings here. In it, we take an in-depth look at the word embeddings produced by BERT, show you how to create your own in a Google Colab notebook, and share tips on how to implement and use these embeddings in your production pipeline. Check it out!
I recently undertook some work that looked at tagging academic papers with one or more labels based on a training set.
A preliminary look through the data revealed about 8000 examples, 2750 features, and…650 labels. For clarification, that’s 2750 sparse binary features (keyword indices for the articles), and 650 labels, not classes. Label cardinality (average number of labels per example) is about 2, with the majority of labels only occurring a few times in the dataset…doesn’t look good, does it? Nevertheless, more data wasn’t available and label reduction wasn’t on the table yet, so I spent a good amount of time in the corners of academia looking at multi-label work.
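For anyone unfamiliar with the term, label cardinality is easy to compute by hand. Here is a minimal sketch in Python, assuming each example's labels are stored as a set; the function name `label_cardinality` and the toy labels are made up for illustration and are not the dataset described above.

```python
from collections import Counter


def label_cardinality(label_sets):
    """Average number of labels per example."""
    if not label_sets:
        raise ValueError("need at least one example")
    return sum(len(labels) for labels in label_sets) / len(label_sets)


# Toy, made-up label sets (an assumption for illustration only).
examples = [
    {"physics", "optics"},
    {"biology"},
    {"physics", "statistics", "biology"},
]

cardinality = label_cardinality(examples)

# Count how often each label occurs, then list the labels seen only once.
label_counts = Counter(label for labels in examples for label in labels)
rare_labels = sorted(label for label, count in label_counts.items() if count == 1)

print(cardinality)
print(rare_labels)
```

Running the same two checks on a real dataset is a quick way to see how many labels have too few examples to learn from.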
How does front page news track a single topic over a period of time? What’s the media’s attention span for a given story?
In general, many find it surprising how quickly major media outlets shift their attention from one story to another. This is partly a reflection of our own attention spans and appetites, and is partly due to the fact that media organizations are incentivized to be the first to break news; as a result readers are more likely to be bombarded with what’s novel instead of what’s important. Continue reading “Article Classification and News Headlines Over Time”
The purpose of this quick tutorial is to get you a very big, very useful neural network up and running in just a few hours. The goal is that anyone with a computer, some free time, and little-to-no knowledge of what neural networks are or how they work can easily begin playing with this technology as soon as possible. Technical explanations of what RNNs are abound on the internet, so this tutorial will skip explanation and focus solely on building. Continue reading “Building a Recurrent Neural Network to Generate Novel Text”
What is regularization? Regularization, as it is commonly used in machine learning, is an attempt to correct for model overfitting by introducing additional information to the cost function. In this post we will review the logic and implementation of regression and discuss a few of the most widespread forms of regularization: ridge, lasso, and elastic net. For simplicity, we’ll discuss regularization within the context of least squares linear regression, and I assume that you have some familiarity with linear regression. Onward! Continue reading “Introduction to Regularization”
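To make "additional information in the cost function" concrete: ridge adds a squared L2 penalty on the weights to the least squares cost, lasso adds an L1 penalty, and elastic net mixes the two. Below is a rough sketch of ridge only, using its closed-form solution in NumPy. The function name `ridge_fit`, the penalty strength `lam`, and the synthetic data are all assumptions for illustration, and the sketch leaves out the intercept term for brevity.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge weights: minimize ||y - Xw||^2 + lam * ||w||^2.

    With lam = 0 this reduces to ordinary least squares
    (assuming X.T @ X is invertible).
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

w_ols = ridge_fit(X, y, lam=0.0)     # no penalty
w_ridge = ridge_fit(X, y, lam=10.0)  # penalized

# The penalty shrinks the weight vector toward zero.
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

Lasso has no closed form like this, which is why it is usually fit with an iterative solver.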
In Principal Component Analysis (PCA), we would like to project our high-dimensional dataset onto a lower-dimensional space while keeping as much information as possible. Typically, this is done to avoid the effects of the curse of dimensionality or for the purposes of data visualization.
In broad strokes, PCA reduces the dimensionality of our dataset by projecting our high-dimensional feature set onto a lower-dimensional subspace, chosen so as to minimize (certain aspects of) the amount of information we throw away. Continue reading “Short Introduction to PCA”
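The projection described above can be sketched in a few lines of NumPy. This version centers the data and takes a singular value decomposition, which is one common way to compute PCA and not necessarily the approach taken in the full post. The function name `pca_project` and the toy data are assumptions for illustration.

```python
import numpy as np


def pca_project(X, n_components):
    """Project X onto its top principal components via SVD.

    Returns the projected data, the component directions (as rows),
    and the fraction of total variance kept.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    variance = S ** 2
    explained = variance[:n_components].sum() / variance.sum()
    return X_centered @ components.T, components, explained


# Toy data: three features, where the third is nearly a copy of the first,
# so two components should capture almost all of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack(
    [base[:, 0], base[:, 1], base[:, 0] + 0.01 * rng.normal(size=200)]
)

Z, components, explained = pca_project(X, n_components=2)
print(Z.shape, explained)
```

Here "information" means variance: the fraction returned as `explained` is the share of total variance that survives the projection.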