Suppose we are given a dataset of outcomes from some distribution parameterized by θ. How do we estimate θ?
For example, given a bent coin and a series of heads and tails outcomes from that coin, how can we estimate the probability of the coin landing heads? Continue reading “MLE, MAP, and Naive Bayes”
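For the bent-coin question, the MLE answer has a clean closed form: under a Bernoulli model, the maximum-likelihood estimate of the heads probability is just the fraction of heads observed. A minimal sketch (the function name is mine, not from the post):

```python
def mle_heads(flips):
    """MLE of P(heads) for a Bernoulli coin.

    flips: iterable of outcomes, 1 for heads, 0 for tails.
    The MLE is simply (number of heads) / (number of flips).
    """
    flips = list(flips)
    return sum(flips) / len(flips)

print(mle_heads([1, 0, 1, 1, 0]))  # 0.6
```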
Last weekend I attended a DataKind data science hackathon. It was a lot of fun and a great way to meet people in the space and share some ideas. If it sounds the least bit interesting, I encourage you to join a DataKind event. Here’s what my team worked on, which should serve as a good indication of what you might do over the course of the weekend. My code is here: the project folder, supervised classification (the most interesting part), and topic modeling. Continue reading “Text Classification at Data Science Hackathon with DataKind”
A recent interview process required passing some coding challenges.
When I first started programming I spent a decent amount of time on Project Euler, but since then I rarely do these crack-the-interview coding challenges. I find project-based work more interesting, I work mostly with data, and – based on what I understand from experienced interviewers – facility with brain teasers and coding challenges is a weaker predictor of good programming than time spent programming is. Anyway, I spent a few afternoons working through coding challenges on Codility to get a feel for the types of questions that get asked of software engineering candidates. Continue reading “A Few Nice Coding Challenges”
I recently did some A/B testing work through the Facebook advertising platform, and gave a quick presentation on the pros and cons of the platform. Here’s a summary.
- Inexpensive, low ceilings
- Demonstrated to work at scale, sophisticated distribution
To clarify my perspective on the platform, some background on the work we did:
We ran some A/B tests through the platform targeting a specific population, evaluating different levels of resulting engagement for statistical significance. I assure you, nothing fancy. Continue reading “Understanding Facebook Ads: Pros and Cons”
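For the significance evaluation mentioned above, a standard tool is the two-proportion z-test on engagement rates between the A and B cells. The sketch below is illustrative only – the function name and the numbers are made up, not taken from the actual work:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value).

    conv_a / n_a: engagements and impressions in cell A (likewise B).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # pooled standard error
    z = (p_a - p_b) / se
    # Normal CDF via erf; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical cells: 120/1000 vs 150/1000 engagements.
z, p_value = two_proportion_z(120, 1000, 150, 1000)
```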
Decorators are intuitive and extremely useful. To demonstrate, we’ll look at a simple example. Let’s say we’ve got some function that sums all numbers 0 to n:
def sum_to(n):
    count = 0
    while n > 0:
        count += n
        n -= 1
    return count
and we’d like to time the performance of this function. Of course we could just modify the function like so:
Continue reading “Decorators and Metaprogramming in Python”
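The decorator route the post title points at wraps the function instead of editing its body. A minimal sketch of a timing decorator, assuming the usual `functools.wraps` pattern (names here are mine, not necessarily the post’s):

```python
import time
from functools import wraps

def timed(func):
    """Decorator that reports how long func takes to run."""
    @wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f}s")
        return result
    return wrapper

@timed
def sum_to(n):
    count = 0
    while n > 0:
        count += n
        n -= 1
    return count

sum_to(100)  # prints the elapsed time, returns 5050
```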
Getting Useful Information Out of Unstructured Text
Let’s say that you’re interested in performing a basic analysis of the US M&A market over the last five years. You don’t have access to a database of transactions and don’t have access to tombstones (public advertisements announcing the minimal details of a closed deal, e.g. ABC acquires XYZ for $500mm). What you do have is access to a large corpus of financial news articles that contain within them – somewhere – the basic transactional details of M&A deals.
What you need to do is design a system that takes in this large database and outputs clean fields containing M&A transaction details. In other words, map an excerpt like this: Continue reading “Shallow Parsing for Entity Recognition with NLTK and Machine Learning”
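As a toy illustration of that mapping – not the NLTK chunking and machine learning approach the post actually develops – a regex over the tombstone-style phrasing from the example above can pull out clean fields. The pattern and field names here are hypothetical:

```python
import re

# Toy pattern for tombstone-style phrasing ("ABC acquires XYZ for $500mm").
# Real news text needs POS tagging, chunking, and a learned classifier.
DEAL = re.compile(
    r"(?P<acquirer>[A-Z][\w&. ]+?)\s+acquires\s+"
    r"(?P<target>[A-Z][\w&. ]+?)\s+for\s+\$(?P<amount>[\d.]+)(?P<unit>mm|bn)"
)

def extract_deal(text):
    """Return a dict of deal fields, or None if no match."""
    m = DEAL.search(text)
    return m.groupdict() if m else None

print(extract_deal("ABC acquires XYZ for $500mm"))
```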
I recently undertook some work that looked at tagging academic papers with one or more labels based on a training set.
A preliminary look through the data revealed about 8000 examples, 2750 features, and…650 labels. For clarification, that’s 2750 sparse binary features (keyword indices for the articles), and 650 labels, not classes. Label cardinality (average number of labels per example) is about 2, with the majority of labels only occurring a few times in the dataset…doesn’t look good, does it? Nevertheless, more data wasn’t available and label reduction wasn’t on the table yet, so I spent a good amount of time in the corners of academia looking at multi-label work.
Continue reading “Multi-label Classification: A Guided Tour”
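Label cardinality, as used above, is straightforward to compute: sum the label-set sizes and divide by the number of examples. A minimal sketch with made-up label sets:

```python
def label_cardinality(label_sets):
    """Average number of labels per example."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)

# Four hypothetical examples tagged with 2, 1, 3, and 2 labels.
y = [{"nlp", "ml"}, {"stats"}, {"ml", "stats", "bio"}, {"nlp", "ml"}]
print(label_cardinality(y))  # 2.0
```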
A useful snippet for visualizing decision trees with pydotplus. It took some digging to find the proper output and viz parameters among different documentation releases, so I thought I’d share it here for quick reference.
Continue reading “Decision Tree Visualization with pydotplus”
A couple months back, I worked on analysis and predictive modeling of US salary given census data. The full Jupyter notebook is here; below are some details and some of the more interesting findings.
The dataset metadata is below; it contains lots of null values (as you might expect of census data).
Continue reading “Income Analysis – US Census Data”
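Checking per-column null counts is typically the first step with data like this. A sketch using a toy frame standing in for the census extract – the column names and values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the census extract; the real columns differ.
df = pd.DataFrame({
    "age": [39, 50, np.nan, 28],
    "workclass": ["State-gov", None, "Private", "Private"],
    "salary": ["<=50K", ">50K", "<=50K", None],
})

# Null counts per column -- the first thing to look at with census-style data.
null_counts = df.isnull().sum()
print(null_counts)
```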