Niall Ferguson’s Mistake Makes the Case for Metadata

Harvard historian Niall Ferguson goofed on Bloomberg TV yesterday. Arguing that the 2009 stimulus had little effect, he said:

The point I made in the piece [his controversial cover story in Newsweek] was that the stimulus had a very short-term effect, which is very clear if you look, for example, at the federal employment numbers. There’s a huge spike in early 2010, and then it falls back down.  (This is slightly edited from the transcription by Invictus at The Big Picture.)

That spike did happen. But as every economic data jockey knows, it doesn’t reflect the stimulus; it’s temporary hiring of Census workers.

Ferguson ought to know that. He’s trying to position himself as an important economic commentator, and that role requires basic familiarity with key data.

But Ferguson is just the tip of the iceberg. For every prominent pundit, there are thousands of other people—students, business analysts, congressional staffers, and interested citizens—who use these data and sometimes make the same mistakes. I’m sure I do as well—it’s hard to know every relevant anomaly in the data. As I said in one of my first blog posts back in 2009:

Data rarely speak for themselves. There’s almost always some folklore, known to initiates, about how data should and should not be used. As the web transforms the availability and use of data, it’s essential that the folklore be democratized as much as the raw data themselves.

How would that democratization work? One approach would be to create metadata for key economic data series. Just as your camera attaches time, date, GPS coordinates, and who knows what else to each digital photograph you take, so could each economic data point be accompanied by a field identifying any special issues and providing a link for users who want more information.
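To make the idea concrete, here is one hypothetical shape such annotated data could take, sketched in Python. The field names and employment figures are invented for illustration; they are not any agency’s actual schema or numbers:

```python
# A hypothetical sketch of per-observation metadata for an economic series.
# Field names and values are invented for illustration only.

federal_employment = {
    "series": "Federal government employment, thousands",
    "observations": [
        {"date": "2010-01", "value": 2850},
        {"date": "2010-05", "value": 3410,
         "flags": [{
             "note": "Federal employment boosted by temporary Census hiring",
             "link": "https://example.gov/census-hiring-note",  # placeholder URL
         }]},
        {"date": "2010-09", "value": 2840},
    ],
}

def flagged_observations(series):
    """Return (date, note) pairs for every observation carrying a flag."""
    return [(obs["date"], flag["note"])
            for obs in series["observations"]
            for flag in obs.get("flags", [])]
```

A charting tool could then scan for flagged observations and render a footnote marker next to each one, with the note and link surfaced on hover.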

When Niall Ferguson calls up a chart of federal employment statistics at his favorite data provider, such metadata would allow the provider to display something like this:


Clicking on or hovering over the “2” would then reveal text: “Federal employment boosted by temporary Census hiring; for more information see link.” And the stimulus mistake would be avoided.

I am, of course, skimming over a host of practical challenges. How do you decide which anomalies should be included in the metadata? When should a chart show a single flag for a metadata issue, even when the underlying data carry one for each affected data point?

And, perhaps most important, who should do this? It would be great if the statistical agencies could do it, so the information could filter out through the entire data-using community. But their budgets are already tight. Failing that, perhaps the fine folks at FRED could do it; they’ve certainly revolutionized access to the raw data. Or even Google, which already does something similar to highlight news stories on its stock price charts, but would need to create the underlying database of metadata.

Here’s hoping that someone will do it. Democratizing data folklore would reduce needless confusion about economic facts so we can focus on real economic challenges. And it just might remind me what happened to federal employment in early 2010.

Online Education and Self-Driving Cars

Last week, I noted that former Stanford professor Sebastian Thrun enrolled 160,000 students in an online computer science class. That inspired him to set up a new company, Udacity, to pursue online education. A new article in Bloomberg BusinessWeek adds color to the story.

Barrett Sheridan and Brendan Greeley answer a question many folks asked about the students: how many actually finished? Answer: 23,000 finished all the assignments.

Second, they note that professor Thrun is also at the forefront of another potentially transformative technology: self-driving cars:

Last fall, Stanford took the idea further and conducted two CS courses entirely online. These included not just instructional videos but also opportunities to ask questions of the professors, get homework graded, and take midterms—all for free and available to the public.

Sebastian Thrun, a computer science professor and a Google fellow overseeing the search company’s project to build driverless cars, co-taught one of the courses, on artificial intelligence. It wasn’t meant for everyone; students were expected to get up to speed with topics like probability theory and linear algebra. Thrun’s co-teacher, Peter Norvig, estimated that 1,000 people would sign up. “I’m known as a crazy optimist, so I said 10,000 students,” says Thrun. “We had 160,000 sign up, and then we got frightened and closed enrollment. It would have been 250,000 if we had kept it open.” Many dropped out, but 23,000 students finished all 11 weeks’ worth of assignments. Stanford is continuing the project with an expanded list of classes this year. Thrun, however, has given up his tenured position to focus on his work at Google and to build Udacity, a startup that, like Codecademy, will offer free computer science courses on the Web.

I wish Thrun success in both endeavors. Perhaps one day soon, commuters will settle in for an hour of online learning while their car drives them to work.

P.S. In case you missed it, Tom Vanderbilt has a fun article on self-driving cars in the latest Wired.

Zanran: Google for Data?

Zanran is a new search engine, now in beta testing, that focuses on charts and tables. As its website says:

Zanran helps you to find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs and tables and charts. For example, the data could be a graph in a PDF report, or a table in an Excel spreadsheet, or a barchart shown as an image in an HTML page. This huge amount of information can be difficult to find using conventional search engines, which are focused primarily on finding text rather than graphs, tables and bar charts.

Put more simply: Zanran is Google for data.

This is a stellar idea. The web holds phenomenal amounts of data that are hard to find buried inside documents. And Zanran offers a fast way to find and scan through documents that may have relevant material. Particularly helpful is the ability to hover your cursor over each document to see the chart Zanran thinks you are interested in before you click through to the document.

Zanran is clearly in beta, however, and has some major challenges ahead. Perhaps most important are determining which results should rank high and identifying recent data. If you type “united states GDP” into Zanran, for example, the top results are rather idiosyncratic and there’s nothing on the first few pages that directs you to the latest data from the Bureau of Economic Analysis. Google, in contrast, has the BEA as its third result. And its first result is a graphical display of GDP data via Google’s Public Data project. Too bad, though, it goes up only to 2009. For some reason, both Google and Zanran think the CIA is the best place to get U.S. GDP data. It is a good source for international comparisons, but it falls out of date.

Here’s wishing Zanran good luck in strengthening its search results as it competes with Google, Wolfram Alpha, and others in data search.

Curation versus Search

I love Twitter (you can find me at @dmarron). Indeed, I spend much more time perusing my Twitter feed than I do on Facebook. But it’s not because I care about Kanye West’s latest weirdness (I followed him for about eight hours) or what Katy Perry had for lunch. No, the reason I love Twitter is that I can follow people who curate the web for me. News organizations, journalists, fellow bloggers, and others provide an endless stream of links to interesting stories, facts, and research. For me, Twitter is a modern day clipping service that I can customize to my idiosyncratic tastes.

Several of my Facebook friends are also remarkable curators, as are many of the blogs that I follow (e.g., Marginal Revolution and Infectious Greed, to name just two).  So curation turns out to be perhaps the most important service I consume on the web. In the wilderness of information, skilled guides are essential.

Of course, I also use Google dozens of times each day. Curation is great, but sometimes what you need is a good search engine. But as Paul Kedrosky over at Infectious Greed notes, search sometimes doesn’t work. That’s one reason that Paul sees curation gaining on search, at least for now:

Instead, the re-rise of curation is partly about crowd curation — not one people, but lots of people, whether consciously (lists, etc.) or unconsciously (tweets, etc) — and partly about hand curation (JetSetter, etc.). We are going to increasingly see nichey services that sell curation as a primary feature, with the primary advantage of being mostly unsullied by content farms, SEO spam, and nonsensical Q&A sites intended to create low-rent versions of Borges’ Library of Babylon. The result will be a subset of curated sites that will re-seed a new generation of algorithmic search sites, and the cycle will continue, over and over.

Google More Popular Than Wikipedia … in 1900

Google unveiled a new toy yesterday. The Books Ngram Viewer lets users see how often words and phrases were used in books from 1500 to 2008. Other bloggers have already run some fun economics comparisons. Barry Ritholtz, for example, has done inflation vs. deflation, Main Street vs. Wall Street, and Gold vs. Oil.

In the humorous glitch department, I tried out the names of two Internet services I use every day, Google and Wikipedia. For some reason, the Ngram viewer defaults to the time period 1800 to 2000 (rather than 2008), and this was the chart I got (click to see a larger version):

It’s amazing to see references to Wikipedia as far back as the 1820s. Impressive foresight. Google overtook Wikipedia in the late 1800s and, with the exception of a brief period in the 1970s, has led ever since.

An Unusual Battle Between Amazon and Publishers

Over at the New Yorker, Ken Auletta has a fascinating piece about the future of publishing as the book world goes digital. Highly recommended if you are a Kindle lover, an iPad enthusiast, or a Google watcher (or, like me, all three).

The article also describes an unusual battle between book publishers and Amazon about the pricing of electronic books:

Amazon had been buying many e-books from publishers for about thirteen dollars and selling them for $9.99, taking a loss on each book in order to gain market share and encourage sales of its electronic reading device, the Kindle. By the end of last year, Amazon accounted for an estimated eighty per cent of all electronic-book sales, and $9.99 seemed to be established as the price of an e-book. Publishers were panicked. David Young, the chairman and C.E.O. of Hachette Book Group USA, said, “The big concern—and it’s a massive concern—is the $9.99 pricing point. If it’s allowed to take hold in the consumer’s mind that a book is worth ten bucks, to my mind it’s game over for this business.”

As an alternative, several publishers decided to push for

an “agency model” for e-books. Under such a model, the publisher would be considered the seller, and an online vender like Amazon would act as an “agent,” in exchange for a thirty-per-cent fee.

That way, the publishers would be able to set the retail price themselves, presumably at a higher level than the $9.99 favored by Amazon.

Ponder that for a moment. Under the original system, Amazon paid the publishers $13.00 for each e-book. Under the new system, publishers would receive 70% of the retail price of an e-book. To net $13.00 per book, the publishers would thus have to set a price of about $18.50 per e-book, well above the norm for electronic books. Indeed, so far above the norm that it generally doesn’t happen:

“I’m not sure the ‘agency model’ is best,” the head of one major publishing house told me. Publishers would collect less money this way, about nine dollars a book, rather than thirteen; the unattractive tradeoff was to cede some profit in order to set a minimum price.

The publisher could also have noted a second problem with this strategy: publishers will sell fewer e-books because of the increase in retail prices.

Through keen negotiating, the publishers have thus forced Amazon to (a) pay them less per book and (b) sell fewer of their books. Not something you see every day.
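The arithmetic behind those numbers is simple enough to check directly. A back-of-the-envelope sketch; the $12.99 agency price below is my illustrative assumption for the “about nine dollars” figure in the quoted passage:

```python
# Back-of-the-envelope e-book pricing arithmetic from the discussion above.
wholesale_price = 13.00   # roughly what Amazon paid publishers per e-book
publisher_share = 0.70    # publishers' cut under the agency model

# Retail price at which publishers would net the same $13.00 per book:
breakeven_retail = wholesale_price / publisher_share
print(f"${breakeven_retail:.2f}")   # → $18.57, i.e. "about $18.50"

# Publishers' take at an illustrative $12.99 agency price:
agency_net = publisher_share * 12.99
print(f"${agency_net:.2f}")         # → $9.09, i.e. "about nine dollars"
```

So unless agency prices climbed well above typical e-book prices, the model was guaranteed to net publishers less per copy than the old wholesale arrangement.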

All of which yields a great topic for a microeconomics or business strategy class: Can the long-term benefit (to publishers) of higher minimum prices justify the near-term costs of lower sales and lower margins?

Now Available in Dozens of Languages

Good news for international readers: Thanks to Google Translate, you can now read this blog in several dozen languages. Just click on the language you want in the box to the right.

(For those of you reading this via email, Google Reader, etc., here are some example links: German and Spanish.)

P.S. Kudos to the WordPress member who wrote the code for this.

Google’s Public Data: Much Improved

Google recently released some major improvements in its public data efforts. If you click on over to Public Data, you will find a much broader range of data sets including economic information from the OECD and World Bank, key economic statistics for the United States, and some education statistics for California. Google has also included more tools for visualizing these data, from standard line charts to the evolving bubble charts that have made Hans Rosling such a hit at TED.

As an example, I made a flash chart of state unemployment rates from 1990 to the present. Puerto Rico (which counts as a state for these purposes), Michigan, Nevada, and Rhode Island currently have the highest unemployment rates, so I thought it would be interesting to see how they stacked up against the other states over the past twenty years.

WordPress doesn’t allow me to embed Flash, but if you click on the image above and then click play, you will see the evolution of state unemployment rates over time. (Spoiler alert: All those colored bars move sharply upward toward the end of the “movie”.)

Long-time readers may recall my series of posts criticizing Google for directing its users to unemployment data that have not been seasonally adjusted. Happily, Google now allows the user to choose either seasonally adjusted or unadjusted data. Two cheers for Google.

Why only two cheers rather than three? Because Google still directs unsuspecting users to unadjusted data–without the ability to switch to seasonally adjusted–if they do a Google search on “unemployment rate United States“. That’s a big deal, particularly for February 2010 when the official unemployment rate was 9.7%, but the unadjusted figure reported by Google was 10.4%.

Clearly, the two parts of Public Data need to integrate a bit more.

Google and Me, Part II

My existential crisis is over. As of last Thursday, Google is again including this blog in its search results. So, welcome to all the new readers who’ve come here after Googling information on the Eggo shortage and the debate about whether kids should get one H1N1 shot or two.

This is probably of interest only to other bloggers, but for the record: When I first started this blog, it took about six weeks for it to appear regularly in Google search results. After several months, the blog inexplicably (to me, at least) disappeared from Google’s results. As in *really* disappeared; as one friend pointed out, you couldn’t even find it if you searched for “Donald Marron blog”.  About eight weeks elapsed before it reappeared regularly in the first few pages of Google’s results.

My eight-week exile provided a nice natural experiment for evaluating Google’s importance. Not surprisingly, Google drives a good amount of traffic; readership is larger when Google knows about the blog. The more interesting impact, though, is a version of the Long Tail: with Google’s help, more posts find readers on any given day.


Insight on Google and Unemployment

In a series of posts (here, here, and here), I have expressed concern that Google directs its users to what I think is the “wrong” measure of unemployment. For example, if you search for “unemployment rate United States” today, it will tell you that the U.S. unemployment rate in August was 9.6%, when the actual figure is 9.7%.

This discrepancy arises because Google directs users to data that haven’t been adjusted for seasonal variations. Almost all discussions of the national economy, however, use data that have been seasonally-adjusted. Why? Because seasonally-adjusted data (usually) make it easier to figure out what’s actually happening in the economy. The unemployment rate always spikes up in January, for example, because retailers lay off their Christmas help. But that doesn’t mean that we should get concerned about the economy every January. Instead, we should ask how the January increase in the unemployment rate compares to a typical year. That’s what seasonal adjustment does.
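For readers curious how that works mechanically, here is a deliberately naive sketch of seasonal adjustment in Python. The rates are invented, and real statistical agencies use far more sophisticated methods (the BLS uses the X-13ARIMA-SEATS family of procedures), but the core idea is the same:

```python
# Toy seasonal adjustment: subtract each calendar month's typical deviation
# from the overall average. The rates below are invented for illustration.

raw = {  # (year, month) -> unemployment rate, percent
    (2007, "Jan"): 5.4, (2007, "Jul"): 4.6,
    (2008, "Jan"): 5.8, (2008, "Jul"): 5.0,
    (2009, "Jan"): 8.6, (2009, "Jul"): 7.8,
}

overall_mean = sum(raw.values()) / len(raw)

def month_factor(month):
    """How far this calendar month typically sits above or below the mean."""
    values = [v for (_, m), v in raw.items() if m == month]
    return sum(values) / len(values) - overall_mean

seasonally_adjusted = {
    (year, month): round(rate - month_factor(month), 2)
    for (year, month), rate in raw.items()
}

# January always looks worse in the raw data, but after adjustment the
# January 2009 and July 2009 rates come out identical: the January
# spike was purely seasonal, while the 2007-2009 climb was real.
print(seasonally_adjusted[(2009, "Jan")], seasonally_adjusted[(2009, "Jul")])  # → 8.2 8.2
```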

My concern about Google’s approach is that many (if not most) data users know nothing about seasonal adjustment. They simply want to know what the unemployment rate is and how it has changed over time. Directing those users to the non-seasonally-adjusted data thus seems like a form of search malpractice.

I’ve wondered why Google has chosen this approach, and thus was thrilled when reader Jonathan Biggar provided the answer in a recent comment. Jonathan writes:

Continue reading “Insight on Google and Unemployment”