The Box-Cox Transformation

The Box-Cox transformation is a family of power transforms used to stabilize variance and make a dataset look more like a normal distribution. Many useful statistical tools assume roughly normal data in order to be effective, so by applying the Box-Cox transformation to your wonky-looking dataset you can bring those tools into play.

Here’s the transformation in its basic form. For a positive value x and parameter \lambda:

\displaystyle \frac{x^{\lambda}-1}{\lambda} \quad \text{if} \quad \lambda\neq 0

\displaystyle \log(x) \quad \text{if} \quad \lambda=0
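As a minimal sketch taken directly from the definition above (the function name is my own; in practice `scipy.stats.boxcox` will also estimate \lambda for you):

```python
import math

def boxcox(x, lmbda):
    """Box-Cox transform of a positive value x with parameter lmbda."""
    if lmbda == 0:
        return math.log(x)
    return (x ** lmbda - 1) / lmbda

# lmbda = 1 just shifts the data by -1; lmbda = 0 is the log transform
boxcox(2, 1)    # 1.0
boxcox(4, 0.5)  # 2.0
```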

Continue reading “The Box-Cox Transformation”

Text Classification at Data Science Hackathon with DataKind

Last weekend I attended a DataKind data science hackathon. It was a lot of fun and a great way to meet people in the space and share some ideas. If it sounds the least bit interesting, I encourage you to join a DataKind event. Here’s what my team worked on, which should serve as a good indication of what you might do over the course of the weekend. My code here: project folder, supervised classification – most interesting, and topic modeling. Continue reading “Text Classification at Data Science Hackathon with DataKind”

A Few Nice Coding Challenges

A recent interview process required passing some coding challenges.

When I first started programming I spent a decent amount of time on Project Euler, but since then I rarely do these crack-the-interview coding challenges. I find project-based work more interesting, I work mostly with data, and – based on what I understand from experienced interviewers – facility with brain teasers and coding challenges is a weaker predictor of good programming than time spent programming is. Anyway, I spent a few afternoons working through coding challenges on Codility to get a feel for the types of questions that get asked of software engineering candidates. Continue reading “A Few Nice Coding Challenges”

Understanding Facebook Ads: Pros and Cons

I recently did some A/B testing work through the Facebook advertising platform, and gave a quick presentation on the pros and cons of the platform. Here’s a summary.

Pros:

  • Microtargeting
  • Optimization
  • Inexpensive, low ceilings
  • Demonstrated to work at scale, sophisticated distribution

Cons:

  • Click bots
  • Opaque
To clarify my perspective on the platform, some background on the work we did:

We ran some A/B tests through the platform targeting a specific population, evaluating different levels of resulting engagement for statistical significance. I assure you, nothing fancy. Continue reading “Understanding Facebook Ads: Pros and Cons”
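Evaluating engagement differences for statistical significance, as described above, can be done with a two-proportion z-test. A sketch using only the standard library (the function name and the numbers below are made up for illustration):

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# e.g. 120 clicks out of 1000 impressions for variant A, 150/1000 for B
z, p = two_proportion_z(120, 1000, 150, 1000)
```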

Decorators and Metaprogramming in Python


Decorators are intuitive and extremely useful. To demonstrate, we’ll look at a simple example. Let’s say we’ve got some function that sums all numbers 0 to n:

def sum_0_to_n(n):
    # accumulate n + (n-1) + ... + 1
    count = 0
    while n > 0:
        count += n
        n -= 1
    return count

and we’d like to time the performance of this function. Of course we could just modify the function like so:

Continue reading “Decorators and Metaprogramming in Python”
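Picking up the thread from the excerpt above: a timing decorator in this spirit might look like the following sketch (`timed` is my own name; the full post’s implementation may differ):

```python
import functools
import time

def timed(func):
    """Decorator that prints how long each call to func takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.6f}s")
        return result
    return wrapper

@timed
def sum_0_to_n(n):
    count = 0
    while n > 0:
        count += n
        n -= 1
    return count

sum_0_to_n(100)  # prints the elapsed time, returns 5050
```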

Shallow Parsing for Entity Recognition with NLTK and Machine Learning

Getting Useful Information Out of Unstructured Text

Let’s say that you’re interested in performing a basic analysis of the US M&A market over the last five years. You don’t have access to a database of transactions and don’t have access to tombstones (public advertisements announcing the minimal details of a closed deal, e.g. ABC acquires XYZ for $500mm). What you do have access to is a large corpus of financial news articles that contain within them – somewhere – the basic transactional details of M&A deals.

What you need to do is design a system that takes in this large database and outputs clean fields containing M&A transaction details. In other words, map an excerpt like this: Continue reading “Shallow Parsing for Entity Recognition with NLTK and Machine Learning”
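As a toy illustration of the target mapping (the full post develops NLTK chunking, not regexes; the pattern and field names here are my own), a tombstone-style sentence like the one above could map to clean fields as follows:

```python
import re

# Toy pattern for sentences like "ABC acquires XYZ for $500mm"
DEAL = re.compile(
    r"(?P<acquirer>\w+) acquires (?P<target>\w+) for \$(?P<amount>\d+)mm"
)

m = DEAL.search("ABC acquires XYZ for $500mm")
deal = m.groupdict() if m else None
# deal == {'acquirer': 'ABC', 'target': 'XYZ', 'amount': '500'}
```

A regex like this breaks as soon as the phrasing varies, which is exactly why the post turns to part-of-speech tagging and shallow parsing instead.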

Multi-label Classification: A Guided Tour


I recently undertook some work that looked at tagging academic papers with one or more labels based on a training set.

A preliminary look through the data revealed about 8000 examples, 2750 features, and…650 labels. For clarification, that’s 2750 sparse binary features (keyword indices for the articles), and 650 labels, not classes. Label cardinality (average number of labels per example) is about 2, with the majority of labels only occurring a few times in the dataset…doesn’t look good, does it? Nevertheless, more data wasn’t available and label reduction wasn’t on the table yet, so I spent a good amount of time in the corners of academia looking at multi-label work. 
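Label cardinality as used above is straightforward to compute; a sketch with made-up data:

```python
def label_cardinality(label_sets):
    """Average number of labels per example."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)

# Toy data: each inner set is one paper's labels
y = [{"nlp", "ml"}, {"stats"}, {"ml", "optimization", "stats"}]
label_cardinality(y)  # 2.0
```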

Continue reading “Multi-label Classification: A Guided Tour”