The Opaque Library of Babel

As much as we live in a ‘post-9/11 world,’ many of its key features aren’t new; rather, they are the logical extrapolation, with perhaps different implicit or explicit motivations, of already existing tendencies. That’s particularly true with regard to the increased emphasis in recent years, from governments, businesses, and consumers alike, on surveillance, information aggregation, data fusion, pervasive intelligence, and other forms of the old goal of complete situational awareness.

The technologies might be new, but the impulse behind them isn’t. Years before 9/11, the US’ National Security Agency was already wiretapping communications. ‘Spiders’ (automated computer programs that systematically explore and download vast amounts of online documents) were already crawling the web as early as the mid-’90s. Thirty years before that, ECHELON, nowadays the formerly secret intelligence project most often mentioned in movies and fiction, was intercepting and analyzing Soviet radio communications. And twenty years earlier still, cryptologists and analysts at Bletchley Park during WWII were collating information from decrypted German communications and distributing it as ULTRA, widely credited as a significant factor in the Allied victory.

It can be fairly said that for as long as we have had information technology of any sort, we have wanted to use it to be as close to omniscient as possible.

There have been advances and setbacks in this unspoken, planet-wide information race. ULTRA was, by all technical and practical measures, an incredible success, while the NSA as an organization has often failed to, as the now-proverbial phrase goes, ‘connect the dots.’ Google often finds useful information, but it only gives answers for which a webpage already exists. Wholly new questions, the most interesting ones, are a different matter.

The main reason for this limitation is that, when it comes to the topics we care about, we still communicate almost exclusively to and for other humans, and hardly ever for computers. The world’s immense and quickly growing stock of digital information is encoded in texts (webpages, papers, books), in images and videos, or in databases that computers can read but cannot make sense of.

The web might exist on computers, but by and large, it’s opaque to them. It’s somewhat like being in charge of a library of books written in a language you don’t understand. You can find every book containing certain words without knowing what those words mean, but how can you answer a question that isn’t already explicitly answered in one of the books?

Google managed to redefine the expectations of web searching by leveraging a piece of information that computers are able to understand and process, the mathematical graph of links between pages. This happens to encode useful information about what pages in a set of possible results are more important than others, resulting in a search engine that was nothing short of revolutionary at the time.
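
To make that intuition concrete, here is a minimal sketch of the link-analysis idea (widely known as PageRank): repeatedly redistribute each page’s score along its outgoing links until the scores settle. The four-page ‘web’ below is invented purely for illustration, and this is a simplified sketch, not Google’s actual implementation.

```python
# A minimal sketch of the idea behind PageRank: ranking pages from the
# shape of the link graph alone, with no understanding of page content.
# The tiny "web" below is made up for illustration.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
print(sorted(pagerank(web).items(), key=lambda kv: -kv[1]))
```

Nothing in that computation looks at what the pages say; the ranking comes entirely from the structure of the graph, which is exactly the kind of information computers can process without understanding.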

But Google, for all the data it accumulates and indexes, is far from all-knowing (or, to be more precise, all-understanding). We just tend to think it is because we unconsciously define the web as “things we can find using Google,” and finding information as “using Google.” That’s a triumph of marketing, but a dead end on its own.

So the question becomes: how do we make computers understand more about everything we keep storing on them? This isn’t a new question either. For all the velocity of progress in IT, especially when compared with other fields, it’s not devoid of history and structure. Artificial Intelligence was always a dream and a goal for scientists and engineers, and among the subproblems involved was how to give computers the “common sense” necessary to communicate with humans and understand our texts. One of the projects that attempted to tackle this, and still does, was Cyc (its name taken from ‘encyclopedia’), launched in 1984 as an attempt to build a machine-readable, and machine-understandable, database of common-sense facts that could allow a computer to understand natural language.

Of course, back in the ’70s and ’80s almost everybody was more optimistic about machine intelligence than later events warranted, and also, it turned out, too conservative about the future number of computers, their power and connectivity, and the amount of data that would be available to them. Thus, the main engineering focus shifted over time towards giving structure to the mass of online information.

In contemporary terms, that’s the long-hoped-for Semantic Web: a sort of parallel web embodied in and behind the human-readable documents, codifying both information and the context necessary for a computer to understand it (concepts, relationships, etc). The idea was, and is, that website and database creators would continuously add to it, making everything online, to a greater or lesser degree, understandable by computers.
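
As a toy illustration of what ‘understandable by computers’ means here, consider storing knowledge as subject-predicate-object triples that a program can query directly. Plain Python tuples stand in for real Semantic Web tooling, and the facts are made up from the examples earlier in this piece:

```python
# Facts as subject-predicate-object triples a program can query directly,
# instead of prose it would have to interpret. Toy data, not a real dataset.

triples = {
    ("ULTRA", "producedAt", "Bletchley Park"),
    ("ULTRA", "isA", "SignalsIntelligence"),
    ("ECHELON", "isA", "SignalsIntelligence"),
    ("ECHELON", "operatedBy", "NSA"),
}

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# "What do we know of that counts as signals intelligence?"
print(query(predicate="isA", obj="SignalsIntelligence"))
```

A conventional search engine could only return documents that happen to contain those words; a store of explicit relationships can answer the question itself.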

Results so far are mixed. The idea of the Semantic Web lies behind some of the defining characteristics of the modern Internet, like “format-free” versions of site contents (the popular RSS and Atom formats) and remote interfaces that allow software to access other websites’ services (e.g., the way Facebook can retrieve contacts from GMail).
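
The practical payoff of such “format-free” feeds is that a few lines of standard-library code can pull out titles and links without any interpretation of the prose they point to. The feed below is a fabricated example, not any real site’s output:

```python
# Extracting structured data from an RSS feed with the standard library.
# The feed string is invented for illustration.

import xml.etree.ElementTree as ET

feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(feed)
for item in root.iter("item"):
    print(item.findtext("title"), "->", item.findtext("link"))
```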

Social networks, and “social websites” in general, approach the task of making sense of online information from a very different angle. Instead of making information machine-readable so that computers can process it, they encourage humans to do the processing: tagging and digging websites, twittering links, editing Wikipedia articles, refining SearchWiki queries, and so on. In the most direct application of this idea, Amazon’s Mechanical Turk service allows you to buy highly commoditized human information processing in bulk for specific tasks. Existing examples include identifying products in photographs and annotating research results.
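
A conceptual sketch of how such bulk human processing is typically consumed: each photograph gets labeled by several workers, and the answers are aggregated by majority vote. The worker responses below are invented; a real service like Mechanical Turk would deliver them through its own API.

```python
# Aggregating redundant human labels by majority vote. Invented data.

from collections import Counter

responses = {
    "photo-001.jpg": ["coffee maker", "coffee maker", "espresso machine"],
    "photo-002.jpg": ["headphones", "headphones", "headphones"],
}

def consensus(labels, min_agreement=2):
    """Return the most common label if enough workers agree, else None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_agreement else None  # None -> re-queue the task

for photo, labels in responses.items():
    print(photo, "->", consensus(labels))
```

Redundancy plus a simple consensus rule is the usual way to buy reliable answers from individually unreliable ones.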

Unless (or perhaps rather until) radically new elements like full-fledged Artificial Intelligence or greatly enhanced human cognition come into play, neither of these approaches, semantic machine understanding and distributed human processing, is likely to drive the other out. They have very different, and arguably complementary, economic and capability profiles, so it’s natural to expect that the organizations best able to extract novel and useful meaning from the vast mass of information (one of the key skills for surviving and thriving) will be those that master both modes of information analysis.
