What should we do about
Internet "cruft"?
Toward knowledge-rich websites

Larry Sanger

Keynote delivered at the Consortium of Liberal Arts Colleges Annual Meeting, Reed College, June 13, 2007.

I’m really glad to be back here at Reed College, from which I graduated in 1991. I attended many lectures here in Vollum Auditorium. Thanks very much to the CLAC for inviting me.

Someone suggested that I talk about what to do about all the low-quality information on the Internet. So I am going to talk about something about which I am by no means an expert—how to improve Internet search.

In the beginning—I mean, the 1990s—there was Yahoo! and Excite and Infoseek, and confusion moved over the face of the Web. Finding a really good website meant either using increasingly obsolete directories, or taking your chances with Web search, and ignoring page after page of completely irrelevant cruft.

Then, around 1999 or 2000 or so, people started discovering Google, and suddenly, Internet searching became relevant. And we all saw that it was good. In fact, Google still is good. If you’re looking for simple factual information, and you know how to use it, you can find an answer using Google within seconds. I can, anyway, and probably most of you can, too.

But hardly anybody would say that search is a “solved problem.” Some people complain a lot about the vast quantities of garbage online—especially educators, librarians, and journalists.  These are the traditional gatekeepers of information. Many of them are dismayed that so much biased, misleading, and outright wrong information is available so easily. Many also are troubled by the amount of "junk information": they find too much that is trifling, unimportant, and insubstantial, the mental equivalent of junk food. 

These defenders of reliable and substantial information have a new champion, by the way, in the form of entrepreneur and journalist Andrew Keen, whose new book, The Cult of the Amateur, argues that empowered amateurs are driving the creation of too much bad information, and that in various ways, the bad information is driving out the good. Well, unlike Keen, I am a bit of a champion of the amateur. I think it is, on balance, a positive thing that all of us, not just professionals, can reach a mass audience.

But I still have to agree that there is a problem about Internet “cruft”—the aforementioned unreliable and insignificant information. What I want to talk about with you this morning is what we should try to do, if anything, about Internet cruft.

I mean this in a very broad way. We don’t like all the information we see on the Internet. So, should we try to do something about it, other than complain? I don’t mean we should censor anyone, of course. I’m a strong proponent of freedom of speech online.

Of course, you might think, how could we do anything about Internet cruft, anyway? You can’t stop it; that’s like trying to stop the tide. You might not be able to stop the existence of cruft, but there are certainly ways to avoid seeing it.

But I’m getting ahead of myself. The first step in deciding what to do about cruft is to ask whether there really is a problem in the first place—and if so, what precisely the problem is.

After all, a lot of people would say there’s no problem at all. I can hear you saying—and I’d agree with you—that the existence of so much wrong and lightweight information on the net is an inevitable and not particularly troubling byproduct of a wonderful new development, that everyone is now empowered to reach a worldwide audience. You might as well complain that people say all sorts of false things in the privacy of their own homes. That’s just a consequence of freedom of speech.

I think there is a problem, but the problem isn’t with the expanding scope for free speech that the Internet makes possible. That, I think, is a very positive thing. So what is the problem about “Internet cruft”?

We might say that free speech online has two functions: participatory and informational. First, it provides enlightenment and entertainment for those who engage in it; in other words, there is the participation aspect. Second, it also provides a source of information for people when they use the Internet more strictly as an information tool. Blogs, wikis, Internet forums, and most other community Web projects have both aspects, participatory and informational. They allow people to participate, but they also output a product that other people, nonparticipants, sometimes consult as a resource.

So I want to say that the problem about Internet cruft is mainly a problem about using the Internet as a source of information. Cruft is so far from being a problem for the participation aspect that we can say that cruft is a necessary part of a healthy online participatory community.

So Internet cruft is not a problem merely because it exists, but because of the use to which we put it. Often, a lot of what you and I might call cruft isn’t meant to serve as a definitive presentation of some information. When somebody makes a YouTube video of the funny noises her cat makes, she isn’t acting as a serious documentary producer. She’s just some random person who wants to share a funny video. Usually, what we see online is just the humble perspective of one person, or of a group of amateurs having fun together—and there’s nothing wrong with that, in itself.

The problem arises when we sit down to use our search engine of choice and actually look for facts, to get a balanced and well-informed perspective on things. The problem is about finding good information, not about the mere existence of bad information.

Thing is, there’s tons of good information online. So maybe we can put the problem this way: why is good information so hard to find?

“Well,” you might say, “just a second—like you were saying before, I can find good information lickety-split using Google.” And this is true. It really is easy to find good information quickly using Google.

Maybe, you might say, the problem is about signal-to-noise ratio. For every hit, you might say, we’re apt to find a miss. I can demonstrate this even with a simple Google search, say a search for “lion”. Google ought to get this right, because there is loads of good information about lions online.

But what’s the first result? The Wikipedia article. The second result is from the African Wildlife Foundation, and I would consider that a “hit” (as opposed to a “miss”), in terms of credibility anyway. Result three is “Literature Online” (get it?—“LiOn”). Result four is a company called “Lion Technology Inc.” Then finally the next few results are from various educational websites, including Kids’ Planet, the Smithsonian, and the San Diego Zoo, which all seem pretty authoritative.

Unsurprisingly, it wasn’t all that hard to find credible articles from authoritative sources about lions. If what we wanted was a general introduction to lions, we found several. But there are two problems with the search results, which are problems you will find with most search results.

First, the most credible information is not necessarily placed front and center. The Wikipedia article is in first place. For all I know, this could be a fantastically reliable article. It is certainly the longest of the lion articles on the first page of results. If we knew that the information in the article were reliable, then I’d have no trouble with the Wikipedia article being in the first rank. But, pending a review by a biologist, we don’t know that the information in the article is reliable.

Second, another problem with the search results is that there are a lot of pages that aren’t about lions, but about other things that happen to include the word “lion” in their name. This reduces the signal-to-noise ratio.

One has to admit, however, that neither of these is a terribly huge problem. If, somehow, Google put the most reliable results first, and only had articles about lions when one searched for lions, then Google would be only marginally better than it is right now.

At this point, you might well be wondering if there really is any problem about Internet cruft at all. So let me tell you, finally, what I really think.

It’s this. The problem with Internet cruft isn’t that it exists, or that the existence of cruft makes it harder to find reliable information. The problem, rather, is simply that there isn’t as much fantastic information online as there could be. The problem with Internet cruft is that it keeps us thinking small. I want us to think big.

Let me explain. Right now, to rise to the top of the search rankings, all you have to do is be popular. And to be popular, all you have to do is be minimally credible, and have some interesting information. People will link to your website, which will cause it to rise in the rankings, and then, if your website seems to have quick answers to a quick Web search, people will click on it.

To be popular, therefore, all you have to have is quick, credible answers. There’s no pressure to be maximally useful in addition.

Let’s take the “Lion” search as an example again. Most of the informational articles there are rather short, and have a few pictures of lions. But imagine, if you will, the ideal website about lions. There is a big long meaty encyclopedia article. There’s a shorter and simpler article for children. There’s an annotated bibliography. There’s an annotated set of Web links. There’s plenty of immediately-accessible multimedia. There is a gallery with thousands of free lion pictures. There are recordings of lion roars. There are educational and other videos all collected together. There are maps. There are links to scientific studies. There is educational material.

That’s what I mean by a “maximally useful” website. I want the whole enchilada all in one place.

So I think the trouble isn’t with Internet cruft at all. It’s not that there’s too much bad information out there. The trouble is that the search engines do not reward people for creating maximally useful websites, websites that would push all the cruft off the first page of search results. Google, Yahoo, and MSN are all responsible for their own “soft bigotry of low expectations.”

What I would like to see, therefore, are search engines that rank most highly those websites that have the highest quantity and quality of information, of all types. They should specifically reward the websites that are most knowledge-rich. I choose this phrase, “knowledge-rich,” carefully. To contain “knowledge,” an information resource has to be authoritative and credible. To be “rich,” there has to be a huge amount of information and a large variety of information types. A knowledge-rich website excels in three distinct dimensions of information: quality, quantity, and variety.
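To make those three dimensions a little more concrete, here is a toy scoring sketch. Everything in it is my own illustration, not a real ranking formula: the weights, the caps, and the idea of counting media types are all assumptions.

```python
def knowledge_richness(quality, n_items, n_media_types,
                       max_items=10000, max_types=8):
    """Toy score combining the three dimensions of a knowledge-rich
    site: quality in [0, 1], quantity capped and normalized, and
    variety as the fraction of media types (text, images, audio,
    video, ...) present. Weights are purely illustrative."""
    quantity = min(n_items, max_items) / max_items
    variety = min(n_media_types, max_types) / max_types
    # Multiply by quality so a low-quality site cannot win on sheer volume.
    return quality * (0.5 * quantity + 0.5 * variety)
```

Note the design choice: quantity and variety are averaged, but quality multiplies the whole thing, so an unreliable site scores near zero no matter how much it hosts.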

So now my question is: how could search engines reward knowledge-rich websites—or, how could we help them do so?

Now, this is very far from trivial. The single biggest problem keeping us from implementing this idea is that, at present, there is no easy way to identify which websites are in fact knowledge-rich. I think we can say—based on our common experience with Google—that mere links, page click data, and whatever other proprietary Google algorithms we don’t know about, none of that actually predicts knowledge-richness. If it did, then the most knowledge-rich websites would appear at the top of our search results. But they don’t, not always. Sometimes, the most knowledge-rich websites are buried deep in the search results.

Imagine you’re a webmaster, and you know that the way to get to the top of search results is to have information on some topic in the greatest quality, quantity, and variety. Then the whole art of “search engine optimization” would change; it would be simplified. Merely improve your information, and your ranking will rise. Suddenly, people would have an economic incentive to make information on the Web better.

“That’s a nice dream,” I can hear you say, “but the idea would work only if your search engine had a huge amount of data about the knowledge-richness of websites. How on Earth can you possibly evaluate all the websites that you have to in terms of knowledge-richness?” That’s a very difficult problem. The sheer vastness of the Internet—there are billions and billions of Web pages—surely makes it impossible for any human-built search engine to do the job.

If you are so naïve as to believe that a hand-collected, human-built search engine has the slightest chance of competing with Google, I would simply point you to the experience of Yahoo!—which has lots of money to put into its directory. But who uses its directory to find stuff? I don’t know when the last time was that I used it; probably years ago. Similar remarks go for the biggest open-content Web directory, DMOZ, also known as the Open Directory Project. It has had huge numbers of participants adding Web links, but of course it hasn’t got a prayer of keeping up with all the new websites that have come online. Consequently, DMOZ is of relatively little use, or at least, I virtually never use it.

Well, I don't propose that anyone catalog every website that's out there by hand, since that’s impossible. But what we can do is create a list of websites that are of unquestionably high quality. We can then use data about what those websites link to, and what websites those websites in turn link to, to seed an otherwise human-free search engine. Actually, I don't presume to have an interesting opinion about how a good search engine might make use of data about which websites really are of high quality. But a bit more about that later.
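One way to picture that seeding step, purely as a sketch: start from the hand-vetted sites, follow their outgoing links a hop or two, and let trust decay with distance. The depth limit, the halving per hop, and the data shapes here are all my own illustrative assumptions, not a worked-out design.

```python
from collections import deque

def seed_scores(trusted_sites, outlinks, max_depth=2):
    """Breadth-first walk from hand-vetted sites, assigning each
    discovered site a score that decays with link distance.
    `outlinks` maps a site to the sites it links to (crawl data)."""
    scores = {site: 1.0 for site in trusted_sites}
    queue = deque((site, 0) for site in trusted_sites)
    while queue:
        site, depth = queue.popleft()
        if depth == max_depth:
            continue
        for linked in outlinks.get(site, []):
            if linked not in scores:  # first (shortest) path wins
                scores[linked] = scores[site] / 2  # halve trust per hop
                queue.append((linked, depth + 1))
    return scores
```

A search engine could then fold these scores in as one signal among many; as I said, I don't presume to know how the ranking itself should work.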

So how then should we create this list of websites that are “of unquestionably high quality”?

I’ve got a proposal. Before I explain it, I should give some credit to British columnist Simon Stuart who, in a short piece a few months ago, argued: “the web needs quality control.” So, he said, we should set up an “Internet-Wide Accuracy Commission,” or “iWac.” This would produce “a list of approved websites,” and even give approved websites a logo they could stick on their site, like the Good Housekeeping Seal of Approval.

Well, I don’t exactly want to propose all that, but I do want to propose something like it.

As you probably know, I have started a new wiki encyclopedia project called Citizendium. I like to describe it as Wikipedia with editors and real names. In the last seven months we have added some 2,000 articles, and we have about 240 expert editors and 1,700 authors. We’re growing very nicely, and we’re getting ready to expand even further.

Right now, Citizendium is just an encyclopedia project, but we have added some other kinds of data, such as images, lists, and bibliographies. In the coming months, I hope to kick off a lot of ancillary projects. I hope to announce actual projects, with fleshed-out policies and project leaders, devoted to bibliographies; information catalogs (in other words, almanac-type lists); galleries; debate guides; and possibly other things too.

Well, one of these ancillary projects will be a Web directory. Let me explain a little bit how this might work, and then say a bit about why you should care about yet another Web directory.

The idea is this. When you go to a Citizendium article, on the upper right of the article there will be a box with links to various pieces of supporting content. You’ll see such things as:

  • categories
  • catalogs
  • gallery
  • debate guide
  • annotated bibliography
  • links

On Citizendium, people are already adding links to credible websites at the bottom of articles, but we haven’t been doing this systematically or according to any clear rules. Well, we’re going to move what links we have now to their own separate links pages, and we’re going to start asking contributors to make more extensive sets of links, divided into sections, about every aspect of a topic. In short, we’re going to start something that is more of a serious, and seriously useful, Web directory.

For example, on the links page attached to our Biology article, we’ll have links to introductory articles; free textbooks; Biology image databases; general Biology encyclopedias; important and interesting essays about Biology in general; and, essentially, every type of information about Biology in general on which it’s possible to have credible websites.

So you’ll be able to quickly click from an encyclopedia article to various other genres of information on the same topic. If you’re looking at an article about Biology, then, you’ll be able to click on “links” and come to a whole page devoted to the best Web content about Biology in general.

Using a wiki to compile this information will, I think, offer several advantages. For one thing, it will be easier to collect credible websites simply because “many hands make light work.” Anyone can pitch in on any page. For another thing, wiki pages are flexible. So it is possible to arrange links into various categories, rearrange the categories, and rearrange the links within the categories.

Another thing that I will be encouraging our contributors to do is give a little description of the websites they link to—that will make the links page like an annotated Web bibliography. This will be useful for end-users, of course, but it will also make the link data more useful for search engines. More about that in a bit.

We’ll also establish a method of checking links on a regular basis—making sure they still exist. Link rot is one of the most serious problems with hand-edited Web directories. We can easily set up a template that says, “These links were last completely checked on March 1, 2007.” We can then automatically compile lists of link pages that haven’t been checked in over, say, six months.
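The staleness report itself is a simple job. A minimal sketch, assuming each link page records its last-checked date in such a template (the page names, data shape, and the roughly-six-month cutoff of 183 days are my own illustrative choices):

```python
from datetime import date, timedelta

def stale_link_pages(last_checked, today, max_age_days=183):
    """Return link pages whose links haven't been verified in roughly
    six months. `last_checked` maps a page title to the date recorded
    in its "last checked" template."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(page for page, when in last_checked.items()
                  if when < cutoff)
```

A bot could run this periodically and post the resulting list where contributors will see it.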

That’s the proposal in outline. What I haven’t answered yet, however, is why the world needs another Web directory when, as I said, the Web directories that exist, like Yahoo! and DMOZ, don’t really add much to what Google and other sophisticated search engines offer.

There are two truly exciting advantages of this proposal. First, we might build, perhaps for the first time ever, a free and enormous general collection of credible, expert-approved links. The Web is full of link lists compiled by amateurs. It could use one that is managed by actual experts.

Second, because Citizendium requires real names and identities, we have had virtually no vandalism. The likelihood of link spam is very low, because we have a policy against self-promotion. You can ask someone else to put up a link to your website, but you can’t put one up yourself.

Given these two advantages—that the directory will be managed by experts, and that it will have a low rate of link spam—I think we can expect the signal-to-noise ratio to be very high. Our directory should have a very low incidence of cruft.

But still, you might wonder, even so, what would the directory be good for? It would be silly to think that Citizendium’s Web directory might replace Google. Our directory will never be nearly as complete as Google’s.

Well, imagine what it will be like if we had one million encyclopedia articles—considering that Wikipedia has nearly two million, we think this is possible within a few years—and ten links per topic, on average. That would be ten million links, compiled collaboratively under the guidance of experts.

Imagine, then, a search engine taking this free information and re-ranking its search results based on whether a website appeared in the Citizendium Web directory. This would, I think, solve the problem I described earlier. If search engines were to use the data we collect to re-rank their results, they would in effect be rewarding knowledge-rich websites.
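In the simplest version of this idea, directory membership becomes a score boost. A minimal sketch, under my own assumptions (the boost size and the result format are illustrative, and any real engine would use far richer signals):

```python
def rerank(results, directory_urls, boost=0.25):
    """Re-rank search results, where each result is (url, base_score).
    URLs that appear in the expert-compiled directory get their score
    multiplied up; ties keep the engine's original order, since
    Python's sort is stable."""
    in_dir = set(directory_urls)
    scored = [(url, score * (1 + boost) if url in in_dir else score)
              for url, score in results]
    return [url for url, _ in sorted(scored, key=lambda p: -p[1])]
```

So a marginally less popular page that experts have vouched for can leapfrog a more popular one that no one has.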

If a website is actually credible and useful enough that the Citizendium Web directory links to it, then that simple data can be used in all sorts of interesting ways. Remember also that it isn’t just a URL that is associated with a particular topic. In addition, we’ll file different links under their proper data type—such as essays, images, or textbooks—and the links will (I hope) be annotated.

In all humility, I have to admit that I am myself a complete amateur when it comes to Web directories. It’s entirely possible—in fact, maybe it’s likely—that there’s something deeply wrong with this proposal. It’s entirely possible that, even if there were 10 million expert-approved links on a million different topics, that wouldn’t significantly improve how search engines operate. In all likelihood, we won’t solve the problem of Internet cruft with our directory project. But if we keep at it, I think at the very least we’ll create an actually useful directory, one that will be helpful to people doing research on the Citizendium website, if for no one else.

Thank you.