Open Source Search Technology

Pattern number within this pattern set: 116
Douglas Schuler
Public Sphere Project (CPSR)
Problem: 

People rely on search engines to find the information they need on the web. However, the primary motivation of the companies that provide search engines is securing profits for their owners; other motives necessarily and inevitably take a back seat. The negative implications of relying solely on commercial search engines, though vast, are generally not recognized. If the enormous gatekeeping potential of commercial search engines is not balanced with open and accountable public approaches, the ability to find non-commercial information, including information that doesn't appeal to broad audiences or that is critical of governments and other powerful institutions, could conceivably disappear. The privatization of the means of accessing information could also allow advertisements and other "sponsored" information to crowd out non-commercial information.

Context: 

People in their daily lives need, search for, and find a tremendous amount of information. Increasingly, they are looking for this information in cyberspace. While Internet technology has opened up an unbelievably vast amount of information and opportunities for communication for millions of people worldwide, the very fact that we are relying on technology that is out of our control is cause for concern, if not alarm. Although this pattern is relevant to any system that people use to find information, our immediate attention is drawn to the Internet, which is poised to become increasingly dominant in the years ahead.

Discussion: 

Access to information can be made easier; barriers to obtaining the information that people need can, at least in theory, be anticipated and circumvented. But, like the chain whose ultimate strength is determined by its weakest link, access to information can be thwarted at many levels. Although non-public (commercial and otherwise) providers of information and communication services can be "good citizens" who prioritize the needs of their users, the temptation to become less civil may prove irresistible if and when the "market" suggests that uncivic behavior would result in higher revenue. In such circumstances, they may decide to relax their current high standards accordingly. Big web portals, for example, are becoming increasingly cooperative with the Chinese government, presumably because of the huge market that potentially exists there.

One approach to addressing this problem is an open source, public domain classification system similar to those used in public libraries in the U.S. and elsewhere, coupled with open source, community owned and operated search engines. This approach is simultaneously defensive and forward looking. Defensive, because it could serve as a hedge against information deprivation and commodification. Forward looking, because it could help usher in an exciting new wave of experimentation in access to information. As the development of the Internet itself has demonstrated, an "open source" approach can motivate and spur participation in the complementary tasks of classifying information and retrieving it easily.

Existing classification approaches like the Dewey Decimal System have limitations (Anglo-centrism, for example), and approaches like Dewey are not, strictly speaking, in the public domain (although Dewey is readily licensable). Nevertheless, the Dewey system might serve as at least a partial model. Well-known schemes such as the Dewey Decimal System allow everybody to communicate more quickly and at less cost. It is the open protocol nature of the Internet that has allowed and promoted easy and inexpensive ways not only to get connected, but to develop new applications that rely on the underlying, license-fee-free protocols.

Computing and the potentially ubiquitous availability of online environments provide intriguing possibilities that older approaches didn't need or anticipate. The Dewey Decimal System, for example, tacitly assumes a physical arrangement of books: the code assigned by the librarians or technicians using the system declares both the book's classification and the location it will occupy in the library. Although having a single value is not without advantages, an online environment opens the door to multiple tags for a single web page, for finer-grained elements (a paragraph on a web page, for example, or the results of a database query), or for broader-grained collections of elements. A federated collection of link servers (Poltrock and Schuler, 1995) could assist in this.
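
To make the contrast concrete, here is a minimal sketch, in Python, of an index that attaches many tags to one resource or to a fragment of one. Everything in it (the class, the tags, the URLs) is a hypothetical illustration, not part of any existing system:

    # Minimal sketch of multi-tag classification for online resources.
    from collections import defaultdict

    class TagIndex:
        """Maps classification tags to resources at arbitrary granularity."""

        def __init__(self):
            self._by_tag = defaultdict(set)

        def tag(self, resource_id, *tags):
            # Unlike a shelf location, a resource may carry many tags:
            # a Dewey-style code, a folksonomy label, a metadata field.
            for t in tags:
                self._by_tag[t].add(resource_id)

        def lookup(self, tag):
            return sorted(self._by_tag[tag])

    index = TagIndex()
    # A whole page, and a single paragraph within it, tagged independently.
    index.tag("example.org/essay", "025.04", "information-retrieval")
    index.tag("example.org/essay#para3", "025.04", "classification")
    print(index.lookup("025.04"))

Note how the same Dewey-style code ("025.04") can point both to a whole page and to a paragraph within it, something a single shelf location cannot do.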

As far as search engines are concerned, civil society can hardly be expected to compete with Google's deep pockets and its acres of server farms. Yet it may be possible to distribute expertise, knowledge, and computational capacity in such a way that a competitive "People's Google" ("Poogle"?) becomes conceivable. The idea of a single organization within civil society that could even remotely approach Google's phenomenal computing resources is of course absurd. But so, in general, is the idea of civil society "taming" the most powerful and entrenched forces and institutions. The problem here, though chiefly technological, is very similar to the one that civil society faces every day: how can a large number of people sharing similar (though not identical) visions work together voluntarily, without central authority or centralized support, and succeed at large, complex undertakings? The "answer," though diffuse, incomplete, and sub-optimal, is for the "workload" (from identifying, discussing, and analyzing problems to devising responses to them) to be divvied up as "intelligently" as possible, so that people doing only "pieces" of the whole job can succeed in their collective enterprise.

This strategy is much easier to define and implement in the technological realm. One very successful example is the SETI@home project, which employs the "idle" cycles of users' computers all over the world to analyze radio telescope data in a search for extraterrestrial intelligence. If, for example, one million computers each devoted some processing power and storage to a people's search project, the concept might suddenly become more feasible.
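
The coordination pattern behind SETI@home can be sketched quite simply: a coordinator hands out small, independent "work units," and volunteer machines report results back whenever they happen to be online. The Python sketch below illustrates only that idea; the queue, URLs, and function names are hypothetical:

    # A minimal sketch of SETI@home-style work distribution, assuming a
    # central coordinator that hands out "work units" (here, pages to index).
    import queue

    work_units = queue.Queue()
    for url in ["example.org/a", "example.org/b", "example.org/c"]:
        work_units.put(url)

    results = {}

    def record(url, result):
        results[url] = result

    def volunteer_worker(fetch, report):
        """Runs on a volunteer machine whenever it has idle cycles."""
        while True:
            try:
                url = fetch(timeout=1)     # ask the coordinator for a work unit
            except queue.Empty:
                break                      # no work left; go back to being idle
            report(url, "indexed:" + url)  # send the finished result back

    volunteer_worker(work_units.get, record)
    print(results)

In a real system the queue would live on a coordinating server and the results would be merged into a shared index, but the division of labor is the same.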

Although it would be possible for every participating computer to run the same software, breaking up the tasks and distributing them across a large number of computers (thus allowing us to "divide and conquer") is likely to provide the most suitable architecture for a People's Search Engine. For one thing, this allows dynamic re-apportioning of tasks: changing the type of specialization that a computer performs to make the overall approach more effective. At the beginning of "Poogle's" life, for example, half of the computers might be devoted to finding (or "spidering") and indexing websites while the other half might work on identifying which web sites meet users' search criteria and presenting a list of pertinent results. After a week or so, it may become clear that the first task (identifying and indexing sites) requires less attention overall while the second task (handling user search requests) desperately needs more processing power. In that situation, some of the computers working on the first task could be re-assigned to the second. Of course, the situation might be reversed the following week, and another adjustment would be necessary. In a similar way, the contents of indexes could be shifted from computer to computer to make more effective use of available disk space, while providing enough redundancy that the entire system keeps working even though individual computers are shut down or come online all the time, without advance notice.
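
One way to picture this re-apportioning is as a small scheduling loop that compares the backlogs of the two tasks and shifts nodes toward whichever is busier. The sketch below is only an illustration under that assumption; the node names, the two roles, and the factor-of-two threshold are all hypothetical:

    # A minimal sketch of dynamic role re-apportioning between "spider"
    # (crawl and index) and "search" (answer queries) duties.
    def rebalance(nodes, spider_backlog, search_backlog):
        """Reassign one node per pass toward the deeper backlog.

        nodes: dict mapping node id -> current role ("spider" or "search")
        """
        spiders = [n for n, role in nodes.items() if role == "spider"]
        searchers = [n for n, role in nodes.items() if role == "search"]
        if search_backlog > 2 * spider_backlog and spiders:
            nodes[spiders[0]] = "search"    # queries piling up: add a searcher
        elif spider_backlog > 2 * search_backlog and searchers:
            nodes[searchers[0]] = "spider"  # crawl falling behind: add a spider
        return nodes

    nodes = {"n1": "spider", "n2": "spider", "n3": "search", "n4": "search"}
    # Week two: search requests pile up, so one spider is reassigned.
    print(rebalance(nodes, spider_backlog=10, search_backlog=50))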

The People's Search Engine (PSE) would make all of its ordering and searching algorithms public. Google's page-ranking algorithm is fairly widely known, yet Google has adjusted it over the years to prevent it from being "gamed" in various ways by people who hope to increase the visibility of their web pages by "tricking" the algorithm into granting a higher page rank than the Google gods would bestow. Ideally, the PSE would offer users a variety of search approaches of arbitrary complexity. Thus people could use an existing, institutionalized classification scheme like the Dewey Decimal System; a personalized, socially-tagged "folksonomy" approach; a popularity approach a la Google; a social link approach like Amazon's ("People who searched for X also searched for Y"); or searches based on (and/or constrained by) "meta-information" about the pages, such as author, domain, publisher, or date last edited.
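
Because every algorithm would be public, each search approach could be an ordinary, auditable function that users select by name. The Python sketch below illustrates such a pluggable design; the two strategies and the document fields are hypothetical examples, not a specification:

    # A minimal sketch of pluggable, fully public ranking strategies.
    def rank_by_popularity(docs):
        # A link-counting approach, loosely a la Google.
        return sorted(docs, key=lambda d: d["inbound_links"], reverse=True)

    def rank_by_date(docs):
        # A metadata-based approach: most recently edited first.
        return sorted(docs, key=lambda d: d["last_edited"], reverse=True)

    STRATEGIES = {"popularity": rank_by_popularity, "recency": rank_by_date}

    def search(docs, query, strategy="popularity"):
        matching = [d for d in docs if query in d["text"]]
        return STRATEGIES[strategy](matching)

    docs = [
        {"text": "open source search", "inbound_links": 12,
         "last_edited": "2006-01-05"},
        {"text": "open protocols and search", "inbound_links": 3,
         "last_edited": "2007-03-10"},
    ]
    print([d["last_edited"] for d in search(docs, "search", strategy="recency")])

Adding a Dewey-based or folksonomy-based strategy would be a matter of registering one more function, which anyone could inspect, audit, or fork.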


[Photo: card catalog, Duke University, http://www.lib.duke.edu/libtour/perkins/cardcat.htm]
Solution: 

The development of "open source," public domain approaches to information access is essential for equity and progress among the people of the world. The possibility of credible competition will serve as a reminder to for-profit concerns that access to information is a sacred human right. It would also help to maintain and extend the patterns of innovation that open protocols have made possible. Among other things, researchers and members of civil society need to work on classification systems for Internet resources. It is imperative that civil society focus attention on open source approaches to searching, archiving, and other information access needs. For many reasons, this will help in the evolving process of opening up the world of information to people everywhere.

Pattern status: 
Released