Cruisin' on the Information Superhighway
Pixels & Tags, Lifeblue (Digital Agency in Dallas, TX)

There's a lot of information out there (duh)! So much, in fact, that it's kind of a problem. How do we figure out which information is what we're looking for? Better yet, how do we tell a computer to figure out which information we're looking for? Let's start by breaking the problem down: What do we consider information, what types of information exist, and how much do we need to get what we want?

A good definition of information is "the thing that distinguishes some stuff from other stuff." (Before coming to Lifeblue, I was rejected by three dictionary companies.) Suppose you're in a room with two basketballs. One is in the corner, and the other is in the middle of the room. If I said, "I want to play with the basketball," you might be confused! But if I said, "I want to play with the basketball in the corner," we'd be on the same page. Here, the basketball's position is the key piece of information for figuring out exactly which one I'm after.

Now let's consider a word like "water" and see how much information is in the word itself. We might guess we have five pieces of information: the letters 'w', 'a', 't', 'e', and 'r'. What about their arrangement? The 'w' is first, the 'a' second, 't' third, 'e' fourth, and 'r' fifth. So, maybe we really have 10 pieces of information (five specific letters, each with a specific position), but how much is necessary to describe the word "water"? A Scrabble helper tells me the only other word I could make with all five of those letters is "tawer" (someone who taws?). Hmmm. Maybe if we aren't worried about batting a thousand, we can get by with just five pieces of information here: the letters.
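Here's a rough sketch of the "letters only, no positions" idea in Python. The tiny word list is made up for illustration (a real Scrabble helper would load a full dictionary), and the function names are my own. It sorts each word's letters to throw away their order, then checks which words end up looking identical.

```python
from collections import defaultdict

# Hypothetical mini word list; a real helper would load a full dictionary.
WORDS = ["water", "tawer", "waiter", "later", "wader", "tread"]

def letter_signature(word):
    """Forget letter order: keep only which letters appear and how often."""
    return "".join(sorted(word.lower()))

def group_by_letters(words):
    """Map each letter signature to every word that shares it."""
    groups = defaultdict(list)
    for word in words:
        groups[letter_signature(word)].append(word)
    return dict(groups)

groups = group_by_letters(WORDS)
same_letters_as_water = groups[letter_signature("water")]
print(same_letters_as_water)  # ['water', 'tawer']
```

With only one rival word sharing the signature, dropping the positions costs us very little.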

Alright, pop quiz! Fill in each blank with a letter: "I'm thirsty and would like a glass of wa_ _ _." If you put (in order) 't', 'e', and 'r', congratulations! Out of the 17,576 possible ways to fill in the blanks, you guessed the exact one I intended. What happened? We just decided you need five pieces of information to describe the word "water", and here you are getting away with only two of the letters! Granted, you were given the position of the letters 'w' and 'a' and that there were three letters missing, but something much bigger is at work here: context.
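If you'd like to check that 17,576 figure, here's a small Python sketch that builds every possible way to fill the three blanks (26 letters, three times over). The mini dictionary is invented for the example, as are the function and variable names; it only shows how few of those fill-ins are real words at all.

```python
from itertools import product
from string import ascii_lowercase

# Hypothetical mini dictionary, just for illustration.
KNOWN_WORDS = {"water", "wafer", "wager", "waste", "watch", "waves"}

def fill_blanks(prefix, blanks):
    """Yield every string made by adding `blanks` lowercase letters to `prefix`."""
    for letters in product(ascii_lowercase, repeat=blanks):
        yield prefix + "".join(letters)

candidates = list(fill_blanks("wa", 3))
real_words = sorted(w for w in candidates if w in KNOWN_WORDS)

print(len(candidates))  # 17576, i.e. 26 ** 3
print(real_words)       # the handful of candidates that are in our mini dictionary
```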

Information theorists have determined that, with a little context, someone is choosing between about two or three letters for each blank, rather than the full alphabet. So while there were 17,576 ways to fill in the blanks for your pop quiz, a person would usually decide which makes the most sense out of somewhere between eight (two choices per blank) and 27 (three choices per blank) of those possibilities.
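The arithmetic behind that comparison is just "choices per blank, raised to the number of blanks." A quick Python sketch (the helper names are mine) also shows the same idea measured in bits, which is how information theorists usually count it:

```python
import math

def possibilities(choices_per_blank, blanks):
    """Number of ways to fill `blanks` slots with `choices_per_blank` options each."""
    return choices_per_blank ** blanks

def bits_per_letter(choices_per_blank):
    """Information carried by one letter when all choices are equally likely."""
    return math.log2(choices_per_blank)

print(possibilities(26, 3))           # 17576: no context, full alphabet
print(possibilities(2, 3))            # 8: two plausible letters per blank
print(possibilities(3, 3))            # 27: three plausible letters per blank
print(round(bits_per_letter(26), 2))  # 4.7 bits per letter without context
print(bits_per_letter(2))             # 1.0 bit per letter with context
```

The "equally likely" assumption is a simplification; real letters aren't evenly distributed, so treat these numbers as ballpark figures.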

Now let's apply this to a slightly bigger scale: statements!

"My friend ate an apple." In that statement, there are 22 letters and spaces, composing five words. What information is needed about that sentence to understand its meaning? This is important for search engines that want to figure out what someone is searching for and then provide relevant results.

One generally effective approach for getting the user what she wants can be imagined as tossing all the words into a bag and asking "Where have I seen this collection of words before?" A lot of the time, not much information about the original message is lost in the process: "My friend ate an apple," "An apple ate my friend," and "friend an ate apple my" have the same bag of words, but if all we know is that a person's statement corresponded to that particular bag, we can reasonably guess that she (hopefully) intended only the first statement's meaning, even though there are 120 ways to arrange the words. Google Translate relies on a fancier, fine-tuned version of this approach to skip over nuances in grammar and syntax when hopping between languages, providing a great improvement over the quality of automated translations from the late 90s.
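To make the bag idea concrete, here's a minimal Python sketch. It is only an illustration of the basic concept (certainly not how any real search engine or translator is built), and the function name is my own. It lowercases each statement, strips punctuation, and counts the words while ignoring their order.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip surrounding punctuation, and count words, ignoring order."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return Counter(w for w in words if w)

statements = [
    "My friend ate an apple.",
    "An apple ate my friend.",
    "friend an ate apple my",
]

bags = [bag_of_words(s) for s in statements]
all_same = all(bag == bags[0] for bag in bags)

# Five distinct words can be arranged in 5! different orders.
orderings = math.factorial(sum(bags[0].values()))

print(all_same)   # True
print(orderings)  # 120
```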

Let's go back to our statement: "My friend ate an apple." Not worrying about certain words ("the", "an") lets us associate our word-bag with the one for "My friend ate the apple." If we say extra words are allowed, toss in: "My friend ate a red apple," and even "My friend ate a banana and an apple." We can even group singular and plural forms together: "My friend ate apples." Is it all coming together now?
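Those three relaxations (ignore filler words, allow extra words, lump singular and plural together) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the filler-word list is tiny, the plural rule just chops a trailing "s", and the function names are mine.

```python
from collections import Counter

# Hypothetical filler-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "and"}

def normalize(word):
    """Lowercase, strip punctuation, and crudely fold plurals into singulars."""
    word = word.strip(".,!?\"'").lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]  # naive: "apples" -> "apple"
    return word

def word_bag(text):
    """Bag of normalized words with filler words removed."""
    words = (normalize(w) for w in text.split())
    return Counter(w for w in words if w and w not in STOP_WORDS)

def matches(query, document):
    """True if every word in the query's bag also appears in the document's bag."""
    query_bag, doc_bag = word_bag(query), word_bag(document)
    return all(doc_bag[word] >= count for word, count in query_bag.items())

query = "My friend ate an apple."
print(matches(query, "My friend ate the apple."))             # True
print(matches(query, "My friend ate a red apple."))           # True
print(matches(query, "My friend ate a banana and an apple.")) # True
print(matches(query, "My friend ate apples."))                # True
print(matches(query, "My friend ate a banana."))              # False
```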

Unfortunately, there's a pitfall here: our word-bag is now also associated with "My friend ate a banana and hates apples." Clearly our friend is very different from the one described here! A famous example of this kind of pitfall came when IBM's Watson, playing Jeopardy!, answered "What is Toronto?" to a final clue in the category "U.S. Cities." To overcome this, perhaps we should worry about words and word pairs! "Apple" is great, but let's skip "hates apples." These sorts of tricks and doodads are the things that push good search engines above the rest, and there are plenty more where they came from!
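Here's one hedged way to sketch the "words and word pairs" fix in Python. The filler-word list, the crude plural rule, and especially the list of pairs to skip are all made up for this example; note that the plural rule turns "hates apples" into the pair ("hate", "apple"). A real engine would learn which pairs matter rather than hard-coding them.

```python
# Hypothetical lists, for illustration only.
STOP_WORDS = {"a", "an", "the", "and"}
BLOCKED_PAIRS = {("hate", "apple")}  # "hates apples" after crude normalization

def tokens(text):
    """Ordered list of normalized words with filler words removed."""
    result = []
    for raw in text.split():
        word = raw.strip(".,!?\"'").lower()
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]  # naive plural folding: "apples" -> "apple"
        if word and word not in STOP_WORDS:
            result.append(word)
    return result

def word_pairs(words):
    """Set of adjacent word pairs, so a little word order is kept."""
    return set(zip(words, words[1:]))

def bag_match(query, document):
    """Words only: every query word appears somewhere in the document."""
    return set(tokens(query)) <= set(tokens(document))

def pair_aware_match(query, document):
    """Words plus pairs: also reject documents containing a blocked pair."""
    doc_words = tokens(document)
    has_blocked_pair = bool(word_pairs(doc_words) & BLOCKED_PAIRS)
    return set(tokens(query)) <= set(doc_words) and not has_blocked_pair

query = "My friend ate an apple."
tricky = "My friend ate a banana and hates apples."

print(bag_match(query, tricky))         # True: the pitfall
print(pair_aware_match(query, tricky))  # False: the pair check catches it
print(pair_aware_match(query, "My friend ate a red apple."))  # True
```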

Well, there you have it! You've got yourself a fine search engine, if you ask me. All that's left is programming it.