How I learned to stop worrying and love statistics

Created on Aug. 9, 2012, 12:25 p.m.

I've always hated statistics. Nothing smells more like accountancy, rimmed glasses, bookkeeping, and horrible little news reports than statistics. A career in statistics was the image of compiling endless financial reports to a stony board of directors in an attempt to squeeze out those few more dollars from the public. It was the lowest. It was the selling of a mathematical mind to the machine and the end of all beauty and expanse. There was no doubt in my mind that statistics was simply evil.

So it was a mysterious change when it happened, and it all began with a search engines module at University. This was easily one of the best courses I took in my time at University and from the beginning of the course what became most clear was that making an effective search engine had nothing to do with understanding the English language, with extracting semantic meaning from queries or documents, with logic, reason, or human experience. It was all to do with raw, unadulterated statistics.

And suddenly I saw the glint of gold. I saw a promise in statistics. Hiding beneath dusty logarithm lookup tables and hypothesis testing was the promise of an Oracle Machine. Something that could be queried and provide answers in milliseconds. This was knowledge like nothing seen before, and yet it had nothing to do with knowledge, logic, semantics, or meaning. It was just numbers, just data and statistics and a query box. Ultimately the question in my mind was "how can this be?", and secondly, against my better judgement, "how can I get it?".

An internet search engine relies on the systematic deconstruction and processing of text. The text is crippled; stripped of meaning until it is completely void and will fit into nice neat data structures for processing. Only then can the data shine through. And once the numbers are ready, the statistical algorithms can roll along and process the data. Finally the questions we all have can be answered in the blink of an eye. Building a search engine is not re-inventing the wheel, it is rediscovering the holy grail.

The first thing to go is syntax. The hierarchy of language, which structures and subjugates words into a towering tree, is unimportant under statistics. All web pages, documents, and queries are reformed and stored as jumbled lists of words. Context is not truly lost. Those words which often appear together are still in association via their combined presence in a list. Everything is just a little more anarchic. The words have been freed of their sentences. There is no longer a primary verb, or a root pronoun. Under the statistical system all words are equal, and as you would expect, some are more equal than others.
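The "jumbled list of words" described above is usually called a bag of words. As a minimal sketch (the function name `bag_of_words` and the crude tokenising rule are my own assumptions, not anything a particular search engine is known to use), it might look like this:

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, pull out runs of letters, and count each word.

    Word order and sentence structure are thrown away on purpose.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


bag = bag_of_words("The dog chased the cat. The cat ran.")

# Two texts with the same words in a different order give the same bag.
same = bag_of_words("cat the dog") == bag_of_words("dog the cat")
```

Here `bag` only records how often each word occurred, so "the" is counted three times and "cat" twice, while the fact that the dog did the chasing is gone.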

The important words are those which do not occur often. "The" is largely a useless citizen; syntactic glue. No room is left in our system for such common words and where possible they are removed. The "aardvarks" and "armamentaria" are king, because you can be sure if they exist in a query then they must be key. So how are these statuses assigned? Not by some governing hand. We look toward the Laws of Text, Zipf's law and Heaps' law. These laws tell you, in beautiful fairness and balance, the relative importance of words in a language. Even the numbers and numerals can be governed using Benford's law. Nothing is left to chance, all is mathematical.
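One common way to turn "rare words matter more" into a number is inverse document frequency, the "idf" half of tf-idf. The sketch below is an illustration under my own assumptions (a hypothetical `idf_weights` helper, whitespace tokenising, and three made-up documents); it is not derived from Zipf's or Heaps' law directly:

```python
import math
from collections import Counter


def idf_weights(documents):
    """Weight each word by log(N / number of documents containing it).

    A word that appears in every document gets weight 0; a word that
    appears in only one document gets the largest weight.
    """
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # set() so a word counts once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log(n / df) for word, df in doc_freq.items()}


# Made-up example documents.
docs = [
    "the aardvark ate the ants",
    "the cat sat on the mat",
    "the dog ate the bone",
]
weights = idf_weights(docs)
```

With these documents "the" appears everywhere and is weighted to nothing, "ate" appears in two of three and gets a small weight, and "aardvark" appears once and gets the largest.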

But all this raises the question: do we really need words in the first place? Is this bureaucracy? Can something smaller suffice - say, a symbol, a letter? In languages such as Japanese, with no spaces to separate words, we can simply assume that each overlapping pair of symbols, as well as each individual symbol, is a word in its own right. As we accumulate more and more web pages and documents, the pairs which are actually words will continue to appear, while those which are not words will not. It soon becomes clear what is, and what isn't, a word. This system of taking N-grams, overlapping sets of N symbols, is effective. It is even more effective than just splitting via words, even in European languages. The reason is that we can match policeman with policemen, even with no idea of the semantic relationship. Words are not required. A good search engine can use just symbols.
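The policeman and policemen example can be made concrete with overlapping pairs of characters. This is only a sketch: the helper names `char_ngrams` and `overlap` are hypothetical, and measuring similarity as shared pairs divided by total distinct pairs (Jaccard similarity) is my choice of measure, not something the text specifies.

```python
def char_ngrams(text, n=2):
    """Return the set of overlapping runs of n characters in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap(a, b, n=2):
    """Fraction of distinct n-grams that the two strings share."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


close = overlap("policeman", "policemen")   # share po, ol, li, ic, ce, em
far = overlap("policeman", "aardvark")      # share no pairs at all
```

The two police words share six of their ten distinct pairs without the code knowing anything about plurals, while "aardvark" shares none.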

We note that in using N-grams, the more documents you have the better. This is another devilish aspect of statistics. In statistics, more is more. You can never have too much data. The reason is simple. Signal adds up and noise cancels out. More precisely, when you have more data, the probability of something becoming statistically significant via chance is lessened, while the probability of something becoming statistically significant via actuality is increased. In a logical formula, the smaller the formula the better. But in our search engine, the more websites scanned the better - even if what they contain is largely junk.
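"Signal adds up and noise cancels out" can be checked with a small simulation. The setup below is entirely assumed for illustration: a true signal of 1.0, Gaussian noise with standard deviation 1.0, a fixed seed, and a hypothetical `average_error` helper that measures how far the sample average lands from the true signal.

```python
import random


def average_error(sample_size, trials=200, signal=1.0, noise=1.0, seed=0):
    """Average distance between the sample mean and the true signal.

    Each trial draws sample_size noisy readings of the same signal,
    averages them, and records how far off the average is.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sample = [signal + rng.gauss(0.0, noise) for _ in range(sample_size)]
        estimate = sum(sample) / len(sample)
        errors.append(abs(estimate - signal))
    return sum(errors) / len(errors)


small = average_error(10)     # few readings: the noise still dominates
large = average_error(1000)   # many readings: the noise mostly cancels
```

Under these assumptions the error shrinks roughly with the square root of the sample size, so a hundred times more data should give an estimate about ten times closer to the signal.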

So now our documents are simply jumbled lists of words and their relative importance. We have spiders crawling the web and accumulating more data for us, and we have an index slowly ticking over and processing the document data. All that remains now is to design the statistical models via which we rate our documents for a given query. Because of our destruction of the text we can build effective data structures and feed them into a huge database. The final step is just to turn it on.

A human-free system. A system of knowledge automated by the cold clicking hands of a computer.

The secret in statistics is rather simple. The power it provides can be stated as a concise mantra. In logic, deduction, and mathematical proof one can derive true answers to precise questions. Statistics, on the other hand, can provide compelling answers to all questions. Statistics focuses on the question rather than the answer. As Douglas Adams revealed, if you wish to know the answer to the meaning of life, the universe, and everything, you must first know the question.

The difference can be shown with a trick. When queried with the question...

"How many legs does the average person have?"

  • A logical system will answer ~1.99
  • A statistical system will answer 2

A good logical system will know the answer to a question.

A good statistical system will know what question you are asking.
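The trick rests on two different readings of "average": the mean and the mode. A quick sketch with a made-up population (ninety-nine people with two legs and one person with one, chosen purely so the mean comes out at 1.99) shows both answers side by side:

```python
import statistics

# Hypothetical population: 99 people with two legs, 1 person with one leg.
legs = [2] * 99 + [1]

mean_legs = statistics.mean(legs)   # sum of all values divided by the count
mode_legs = statistics.mode(legs)   # the single most common value
```

Here `mean_legs` is 1.99 and `mode_legs` is 2. Both are correct answers; they are just answers to different questions.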

This is both the beauty and the danger of statistics. Much like a search engine, the goal of a statistical system is to tell you exactly what you want to hear. A statistical system does not answer a question with the precision and truth of a logical system, but it should capture the absolute intuition of what you are asking from it. It will know when "average" is translated to "mean" and when one really intends for the "mode". When a CEO asks his statistician "is the company doing good?", a good statistician will formalise and calculate the exact notion the CEO holds of "company doing good", and present it to the CEO.

The danger comes when the intuitive notion of "company doing good" differs from person to person. Perhaps the CEO is unconcerned with the variable counting toxic waste dumped, while a citizen rates this variable highly in their intuition of that evaluation. The power of statistics comes from its subjectiveness and lack of true meaning, but that is also its Achilles' heel.

What really is the "mean" or the "standard deviation" other than the formalisation of some human intuition? In exact terms the mean is not "the average" because, as we discussed above, that is a subjective and relative notion. The mean is only itself - that is, the sum of all data points divided by the count of data points. The same is true for search engines. If I search for "The Best Page In The Universe" Google does not return the best page in the universe. It returns the tf-idf weighted sum of my query terms against its index, including user-ranked weights, individual behaviour weights, and PageRank.
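To show what a "tf-idf weighted sum of query terms" means in practice, here is a minimal sketch. It is not how Google actually ranks pages: the `tfidf_scores` function, the whitespace tokenising, and the three toy documents are all assumptions of mine, and the user weights, behaviour weights, and PageRank mentioned above are left out entirely.

```python
import math
from collections import Counter


def tfidf_scores(query, documents):
    """Score each document as the sum over query terms of tf * idf.

    tf  = how many times the term occurs in the document
    idf = log(N / number of documents containing the term)
    """
    tokenised = [doc.lower().split() for doc in documents]
    n = len(tokenised)
    doc_freq = Counter()
    for words in tokenised:
        doc_freq.update(set(words))

    scores = []
    for words in tokenised:
        tf = Counter(words)
        score = sum(
            tf[term] * math.log(n / doc_freq[term])
            for term in query.lower().split()
            if term in doc_freq  # terms never seen in the index are ignored
        )
        scores.append(score)
    return scores


# Made-up index of three documents.
docs = [
    "the best page in the universe",
    "the worst page in the city",
    "the aardvark ate the ants",
]
scores = tfidf_scores("the best page", docs)
```

The first document wins because it contains the rare word "best", the second scores a little for "page", and the third scores nothing even though it contains "the" twice. Nowhere does the code ask whether a page is actually the best.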

Statistics is not boring. Far from it. At its heart it is the beautiful and twisted cousin of logic and reason. It is deceptively powerful. It gives you the chance to throw your pennies in the well and get an answer back. Most of all, statistics is agnostic, subjective and human. Unlike the godlike sentience of logic and reason, statistics is the devil inside. For that reason I love it.
