How does front page news track a single topic over a period of time? What’s the media’s attention span for a given story?
In general, many find it surprising how quickly major media outlets shift their attention from one story to another. This partly reflects our own attention spans and appetites, and partly stems from the fact that media organizations are incentivized to break news first; as a result, readers are bombarded with what’s novel rather than what’s important.
Election coverage in 2016 seems like as good a place as any to track the bouncing attention of the front pages: during that period the public was hungry for updates, and breaking news carried high value.
A brief overview of the process (Python script available here):
- Given a search query term and a date range (e.g. “Election,” October 1 – November 8, 2016)…
- We’ll pull Google News results for the query and, for each day in the date range, scrape the headlines from the first page of results (a sketch of this step appears after this list)
- Parse and clean data, filtering out news sources and coupling each headline with a date
- Remove stopwords, apply tf-idf and vectorize headline tokens
- Apply k-means clustering to generate news topic clusters related to the given search query (a sketch follows this list). Generating a useful set of clusters is more art than science; it depends on your goal, your knowledge of the subject matter, and how well you can translate both into algorithm parameters. Clusters for “Election,” for example, were initially diffuse, but after parameter tuning we can more or less generate clusters with discrete sets of news topics (Cluster 5 is about Clinton’s emails, Cluster 2 about Russian hacking, and so on):
Cluster 1: donald, presidential, cnn, 2016, rigged, just, party
Cluster 2: elections, russia, officials, presidential, rigged, russian, say
Cluster 3: week, billion, let, cost, disrupt, process, cancel
Cluster 4: voting, american, presidential, days, early, stressed, conway
Cluster 5: clinton, hillary, october, fbi, email, likely, probe
Cluster 6: day, voter, vote, donald, early, voters, registration
- cluster_count determines (unsurprisingly) the number of clusters. Too many or too few and each cluster’s topics end up too diffuse or too similar to one another, so finding the right number takes some practice
- max_df, a value in [0, 1], sets the ceiling on document frequency in the tf-idf vectorizer. A low value helps keep terms and clusters exclusive of one another and prevents words that appear across nearly all headlines from showing up in the clusters
- num_terms doesn’t affect the algorithm, but it matters for creating useful clusters: too many terms per cluster and they tend to bleed into one another; too few and they don’t capture enough information about the story they should represent
- Create a dictionary of cluster topic terms and match article headline tokens against it. Compute the ratio of matched cluster terms to total tokens, then group by date (see the sketch after this list)
- Prepare for visualization, applying a spline for smooth curves in matplotlib
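For reference, here is a minimal sketch of the fetch-and-scrape step, assuming requests and BeautifulSoup. The Google search parameters and the h3 headline selector are assumptions on my part (Google’s markup and parameters change regularly), so treat this as illustrative rather than a drop-in scraper:

```python
import requests
from bs4 import BeautifulSoup

def headlines_for_date(query, date_str):
    """Fetch the first page of news results for `query` restricted to one date."""
    params = {
        'q': query,
        'tbm': 'nws',                                          # news vertical
        'tbs': 'cdr:1,cd_min:{0},cd_max:{0}'.format(date_str)  # custom date range
    }
    resp = requests.get('https://www.google.com/search', params=params,
                        headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(resp.text, 'html.parser')
    # 'h3' is a guess at the headline element; inspect the live page to confirm
    return [h.get_text(strip=True) for h in soup.find_all('h3')]
```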
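The clustering step is essentially a standard scikit-learn pipeline; the sketch below illustrates it under that assumption. build_headline_corpus is a hypothetical helper standing in for the scraping and cleaning steps above, and cluster_count, max_df, and num_terms are the parameters described in the list:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

cluster_count = 6   # number of clusters (k)
max_df = 0.4        # ignore terms appearing in more than 40% of headlines
num_terms = 7       # terms to report per cluster (display only)

headlines = build_headline_corpus()  # hypothetical: list of cleaned headline strings

# Vectorize headlines with tf-idf, dropping stopwords and overly common terms
vectorizer = TfidfVectorizer(stop_words='english', max_df=max_df)
X = vectorizer.fit_transform(headlines)

# Cluster the tf-idf vectors
km = KMeans(n_clusters=cluster_count, random_state=0)
km.fit(X)

# Print the highest-weighted terms in each cluster centroid
terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
order = km.cluster_centers_.argsort()[:, ::-1]
for i in range(cluster_count):
    top = [terms[j] for j in order[i, :num_terms]]
    print('Cluster {}: {}'.format(i + 1, ', '.join(top)))
```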
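Finally, a rough sketch of the matching and smoothing steps. Here daily_tokens (date to headline tokens) and cluster_terms (cluster label to its top terms) are hypothetical structures produced by the earlier steps, and the spline comes from scipy’s make_interp_spline, one reasonable way to get the smooth curves used in the plots:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline

def daily_cluster_ratios(daily_tokens, cluster_terms):
    """For each date, the fraction of headline tokens that match each cluster's terms."""
    ratios = {}
    for date, tokens in sorted(daily_tokens.items()):
        total = max(len(tokens), 1)
        ratios[date] = {label: sum(t in terms for t in tokens) / float(total)
                        for label, terms in cluster_terms.items()}
    return ratios

def plot_cluster(dates, values, label, points=300):
    """Spline-smooth one cluster's daily ratios for a cleaner matplotlib curve."""
    x = np.arange(len(dates))                        # one point per day
    xs = np.linspace(x.min(), x.max(), points)
    smooth = make_interp_spline(x, values, k=3)(xs)  # cubic B-spline through the daily points
    plt.plot(xs, smooth, label=label)
    plt.xticks(x[::7], dates[::7], rotation=45)      # roughly one tick per week
```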
Let’s look at results for “Election,” “Clinton,” and “Trump” from October 1 to November 7.
The “Election” results are fairly even, without many spikes. Relative to other topic clusters, Russia was a major topic throughout. Similarly for Clinton: the relative frequency of cluster topics stays mostly constant through the last six weeks of the election, save for a spike in the final week corresponding to Comey’s decision to reopen the investigation into Clinton’s emails.
By contrast, the Trump data separated readily into a number of mutually exclusive topics thanks to the wide variety of stories. Because these unrelated stories arrived in sequence over time, the Trump data also generated the most interesting visualization and best demonstrates the shifting focus of the front page.
Here’s a subset of Trump clusters:
The leaked tape of Billy Bush and Trump took over the news around October 8, but the specifics of that scandal soon lost their position on the front page, making way for Trump’s accusations of Bill Clinton’s sexual misconduct (as well as allegations of his own) and, later, headlines covering the smashing of Trump’s Hollywood star. After that, the leaked tape all but disappeared from the news for the remainder of the election cycle.
Stepping up to the time-honored soapbox of speculation, this raises interesting points about how the media relays information to the public. To some degree, the public entrusts the media not only with breaking news but also with curating it, where curation means weighting the relative importance of stories. The public correlates coverage with importance; when a story disappears from the news cycle, we tend to assume its relevance has diminished. These graphs highlight a disjoint: front page news often has more to do with novelty than with what most people, presumably including editorial staff, would regard as importance, and outlets focus on novelty at the expense of relevance (admittedly for good reasons). Media organizations can draw out a story they deem important with further coverage, but they know that ratings aren’t sustained unless they continue to free up front page space for breaking news.
The upshot is an opportunity: the editorial arm of a media organization could create another product, or another form of communication with the public, that stresses curation and doesn’t depend on breaking news the way the front page does.