Thursday, August 27, 2009

Google News highlights 3 Ted Kennedys

I know, I know, I'm obsessed with how Google News algorithms sometimes do goofy things. This time the robot finds 3 Teddy Kennedys.

So let's think about how the Google News algorithm might have done better. If it somehow knew that "Ted Kennedy" and "Edward Kennedy" and "Edward M. Kennedy" were the same person, it could collapse those entries into one. A conventional approach would be for Google News to have a thesaurus that included those equivalencies. The classic example of a thesaurus (in the U.S. anyhow) is a table that shows that "Mark Twain" and "Samuel Clemens" are the same author.
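To make the thesaurus idea concrete, here is a minimal sketch in Python. The table, the choice of "Edward M. Kennedy" as the canonical form, and the function name are all my own assumptions for illustration; nothing here reflects how Google News actually works.

```python
# Hypothetical hand-built thesaurus: each name variant maps to one canonical form.
# The canonical choices are assumptions made for this illustration.
THESAURUS = {
    "Ted Kennedy": "Edward M. Kennedy",
    "Edward Kennedy": "Edward M. Kennedy",
    "Edward M. Kennedy": "Edward M. Kennedy",
    "Samuel Clemens": "Mark Twain",
    "Mark Twain": "Mark Twain",
}


def collapse_entries(entries):
    """Merge highlighted entries that the thesaurus says are the same entity.

    Names missing from the thesaurus pass through unchanged, and the
    order of first appearance is preserved.
    """
    collapsed = []
    for name in entries:
        canonical = THESAURUS.get(name, name)
        if canonical not in collapsed:
            collapsed.append(canonical)
    return collapsed


highlights = ["Ted Kennedy", "Edward Kennedy", "Hyannis Port", "Edward M. Kennedy"]
print(collapse_entries(highlights))  # ['Edward M. Kennedy', 'Hyannis Port']
```

The lookup itself is trivial. The cost is in keeping the table current, which is exactly the manual labor discussed next.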

But traditional thesauri are built by hand, and Google abhors services that require manual labor. It would require a pretty fancy algorithm to understand that the 3 Teddys are the same guy. But the folks who craft Google algorithms are pretty clever, and I bet it's doable.

Now, what about Hyannis Port? Obviously all of those articles are about Teddy's death. But they may have a different angle: the effect of his death on the town and its citizens. And in fact the top several articles on the hit list are mainly about the town, not the man. Yet intermixed in the results are many articles that are primarily about the man.

This is a toughie. From past cases I'm pretty sure that the Google News highlighter relies heavily on capitalization. It teases out people and place names when it sees they are capitalized. So how might it know that Hyannis Port is not a person?
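Here is roughly what a capitalization-driven extractor could look like. This is only my guess at the general idea, sketched as a Python regular expression; the pattern and function name are assumptions, not anything Google has published.

```python
import re

# Assumed heuristic: a run of two or more capitalized words, where later
# words may also be a single initial such as "M.", is a candidate name.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+(?:[A-Z]\.|[A-Z][a-z]+))+")


def candidate_names(text):
    """Return capitalized multi-word runs, in the order they appear."""
    return CAPITALIZED_RUN.findall(text)


sentence = "Residents of Hyannis Port mourned Edward M. Kennedy on Wednesday."
print(candidate_names(sentence))  # ['Hyannis Port', 'Edward M. Kennedy']
```

Notice that the pattern pulls out "Hyannis Port" and "Edward M. Kennedy" in exactly the same way. Capitalization alone says nothing about which one is a person and which is a place.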

Elementary, my dear Watson. Consult the Google Maps database. Google as a collective knows that Hyannis Port is a town. The tricky part would be to figure out which articles are about the town, and which are just about Teddy, tangentially mentioning his family's outpost.
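A rough sketch of both steps, again in Python. The small set below is a hypothetical stand-in for a real place-name database (I am not calling any actual Google Maps interface), and the mention-counting rule is a deliberately crude assumption about how one might guess an article's main subject.

```python
# Hypothetical stand-in for a place-name database lookup.
GAZETTEER = {"Hyannis Port", "Boston", "Cape Cod"}


def entity_type(name):
    """Label a capitalized name as a place if the gazetteer knows it."""
    return "place" if name in GAZETTEER else "person or other"


def primary_topic(article_text, place, person_term):
    """Crude guess: whichever is mentioned more often is the main subject.

    Ties go to the person. This counting rule is an assumption made
    for illustration only.
    """
    place_mentions = article_text.count(place)
    person_mentions = article_text.count(person_term)
    return "place" if place_mentions > person_mentions else "person"


town_story = "Hyannis Port fell quiet. Shops in Hyannis Port closed early after Kennedy died."
obituary = "Kennedy served in the Senate for decades. Kennedy died at home in Hyannis Port."

print(entity_type("Hyannis Port"))                            # place
print(primary_topic(town_story, "Hyannis Port", "Kennedy"))   # place
print(primary_topic(obituary, "Hyannis Port", "Kennedy"))     # person
```

The first function is the easy part. The second is where the real difficulty lives, and counting mentions would surely misfire on plenty of real articles.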

Ideally, the news highlighter would list 1 Teddy Kennedy along with 1 Hyannis Port -- and the latter would be about the town, not the man.

I predicted Google News in an article in 2001: The Effects of September 11 on the Leading Search Engine.
I'm predicting now that the Google News robot is only going to get smarter.