Friday, February 06, 2004

Google News and People Names

Google News usually does a remarkable job of pulling together headlines and stories from a panoply of news sources worldwide. Sometimes, though, since the news page is assembled robotically, goofy things happen.

Here's an example: the News home page always has a little box labeled "In the News" where Google pulls up stories about people who are making headlines. Not surprisingly, Janet Jackson and Justin Timberlake are newsmakers right now, along with Howard Dean and Martha Stewart.

But one "person" in the news is a bit of a surprise: County Sheriff. Hmm, is this some rap star I've never heard of? Clicking on the link reveals that there happen to be a bunch of stories in the news about various county sheriffs around the country, and the Google robot has somehow lumped them all into one melange.

Again, it is remarkable how well Google News does its job. It's pretty clear that Google has entered into explicit partnerships with some major news providers, which presumably means the content gets gathered the "right" way: the provider uses RSS or some other XML scheme to feed marked-up content to the harvester.
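To make the "right" way concrete, here is a minimal sketch of what harvesting a marked-up feed might look like. I have no idea how Google actually does it; the sample feed, the example.com links, and the `headlines_from_rss` function are all made up for illustration. The point is only that with RSS the headline and link arrive already labeled, so nothing has to be guessed.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed; the titles and example.com URLs are hypothetical.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Wire</title>
    <item>
      <title>County sheriff announces retirement</title>
      <link>http://news.example.com/stories/1</link>
    </item>
    <item>
      <title>Council approves new budget</title>
      <link>http://news.example.com/stories/2</link>
    </item>
  </channel>
</rss>"""


def headlines_from_rss(xml_text):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    # Each story is an <item>; its headline and URL are explicitly tagged.
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, link in headlines_from_rss(SAMPLE_FEED):
    print(title, "->", link)
```

The channel's own title ("Example Wire") is not mistaken for a story, because only `<item>` elements are read.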

But I bet in many cases Google News still "screen scrapes" to gather content. This means parsing HTML to try to grok what a news source's page means -- and that means getting it wrong from time to time.
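Here is a toy illustration of why scraping goes wrong. This is just a sketch under my own assumptions, not anything Google is known to run: the sample page and the rule "whatever sits inside an `<h2>` is a headline" are invented for the example. HTML describes how a page looks, not what it means, so the scraper has to guess.

```python
from html.parser import HTMLParser


class HeadlineScraper(HTMLParser):
    """Guess that any text inside an <h2> is a headline (a heuristic only)."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headlines.append("")  # start collecting a new headline

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headlines[-1] += data


# A made-up news page: two real headlines and one promo box styled like one.
SAMPLE_PAGE = """
<html><body>
<h2>County sheriff announces retirement</h2>
<p>Story text...</p>
<h2>Subscribe to our newsletter</h2>
<h2><a href="/stories/2">Council approves new budget</a></h2>
</body></html>
"""


def scrape_headlines(html_text):
    """Run the heuristic scraper over a page and return its guesses."""
    scraper = HeadlineScraper()
    scraper.feed(html_text)
    scraper.close()
    return [h.strip() for h in scraper.headlines]


for guess in scrape_headlines(SAMPLE_PAGE):
    print(guess)
```

The scraper dutifully reports "Subscribe to our newsletter" as a news story, because the page happens to style it the same way as the real headlines.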