Google's Abandoned Library of 700 Million Titles (UPDATED)

(Update: Google has begun fixing the Usenet archive in response to this article)

Imagine a world where Google sucks.

It might seem a stretch. The Google logo is practically an icon of functionality. Google’s search engine and other tools are the company’s strongest, if unstated, argument in favor of the Google Books Settlement, which would give the internet the largest and most comprehensive library in history, at the cost of granting Google a de facto monopoly. It’s hard to imagine any company better equipped to scan, catalog and index millions of books than Google.

But a few geeks with long memories remember the last time Google assembled a giant library that promised to rescue orphaned content for future generations. And the tattered remnants of that online archive are a cautionary tale in what happens when Google simply loses interest.

That library is Usenet, a vast internet- and dial-up-based message board system erected in 1980. Though moribund today, for decades Usenet was the paper of record for the online world, and its hundreds of millions of “newsgroup” postings chronicle everything from the birth of the web to the rise of Microsoft, as well as more trivial matters.

In February 2001, Google rescued that history when it acquired the New York-based Deja.com, and with it a Usenet archive going back to 1995. It turned the archive into Google Groups, in a move that was cheered by net geeks who had seen Deja’s reliability declining, and were certain that the supremely competent Google would save it.

“Taking on Deja has to be considered an overwhelming accomplishment,” wrote one Slashdot commenter. “There is simply no way for any other party to supersede this. Essentially, Google has the Usenet Monopoly.”

Later that year, Google deepened its archive with millions of posts that had been saved on aging magtape by a veteran Unix guru named Henry Spencer. The combined archives gave Google a library of 700 million articles from 35,000 newsgroups, spanning two decades.

Salon hailed the accomplishment in an article headlined “The geeks who saved Usenet.” “Google gets the credit for making these relics of the early net accessible to anyone on the web, bringing the early history of Usenet to all.”

Flash forward nearly eight years, and visiting Google Groups is like touring ancient ruins.

On the surface, it looks as clean and shiny as every other Google service, which makes its rotting interior all the more jarring — like visiting Disneyland and finding broken windows and graffiti on Main Street USA.

Searching within a newsgroup, even one with thousands of posts, produces no results at all. Confining a search to a range of dates also fails silently, bulldozing the most obvious path to exploring an archive.

Want to find Marc Andreessen’s historic March 14, 1993, announcement in alt.hypertext of the Mosaic web browser? “Your search – mosaic – did not match any documents.”

Flat searches of the entire archive still work, but they aren’t very useful: there are 1.42 million hits on “mosaic.” The rise of Microsoft, the first Usenet review of the IBM PC in 1981, early rumblings of a Y2K problem in 1985 — it’s all locked in Google Groups, virtually irretrievable if you don’t already have a direct link.

“The search results are extremely poor,” says network pioneer Brad Templeton. “Like nobody cares.”

Spencer, whose Usenet archive forms much of Google Groups, is troubled by the company’s curatorship. “Google does get a lot of credit for putting it together and making it available,” Spencer says. “But search capabilities are important for such a large collection of data. The archive’s value to the community is considerably reduced if it’s not conveniently searchable.”

A year after Slashdot called attention to the bugs, the problems with the archive not only haven’t been fixed, but they aren’t reflected in the Google Groups “known issues” page.

Asked if the bugs are documented anywhere, or if Google plans to repair its library, a company spokesman was noncommittal. “We’re aware of some problems with the way search is working in Google Groups,” said Jason Freidenfelds, in an e-mail. “We’re always working to improve our products.”

Templeton, who helped Google compile an index of historically significant Usenet articles when it first launched its archive, thinks Google’s neglect is a simple matter of economics.

“I presume they find that the volume of searches is too low for them to put people on it, or the ad revenue results are too poor,” Templeton says. “The ads don’t seem to match the pages well.”

In the end, then, the rusting shell of Google Groups is a reminder that Google is an advertising company — not a modern-day Library of Alexandria.

Image: Dennis Crothers/Wired.com