DADAMETER - Christophe Bruno (2002

THE DADAMETER
Global index of the decay of the aura of language


Home	Concept	Dadamap	R.R.Engine	Methodology	Documentation	Credits - En / Fr

Mass extraction of textual data from Google

Analysis in terms of homophony and semantics

Visualization
and interpretation

Methodology

Search engines hold in their dadabase the sum of everything that is being written on the Web, and, in the next future, with the ever increasing development of surveillance technologies and global pervasiveness, they might just be able to store everything that is being said by human beings. The question is then what to do with this huge amount of information. The exploitation of this information provided one of the most impressive post-911 IPO, namely Google's, thanks to its famous Adwords system, in 2004, after what they became the most famous brand in the world. It is done thanks to extremely sophisticated mathematical tools. Recentely there have been huge breakthroughs in graph theory, like the theory of scale-free networks by the mathematician Albert-László Barabási around the year 2000. Since then, scientists have been using these revivified theories in the framework of language and quantitative linguistics, renewing a field that had not made much progress since the advent of information and communication theory by Claude Shannon in the 1950's.

Maybe you will find these theories criticisable since they tend to turn language into a dead body which is being studied postmortem... an attempt to carry out the autopsy of an ever-dying language. You would be right. However, in the framework of late capitalism, they do play an ever increasing strategic role and it seemed natural to give a closer look - if you like to know what fate has in store for us. The aim of these mathematical and computational tools is simply to predict our thoughts and behaviours for commercial or political purposes, a new kind of trend prediction that plays a crucial role in the optimization of the Adwords system.

It is also worth to be noted that Shannon's hobbies were quite interesting conceptually. One of the most humorous device he built was a box called the "Ultimate Machine", based on an idea by Marvin Minsky. Otherwise featureless, the box possessed a single switch on its side. When the switch was flipped, the lid of the box opened and a mechanical hand reached out, flipped off the switch, then retracted back inside the box. In a way the Dadameter is about the same idea, but at the level of language at the age of globalisation... as if the cause of language lied in its own death.

Step 1: Extracting massive amount of textual information from Google

The Dadameter deals with language at large scale and it needs some of these information Google has in her belly. More precisely, what we are looking for are the number of pages that contain two different words, for any two words. These numbers allow to compute the correlations between two different words, ie. the probalility to go from one word to the other if we consider language as a global hypergraph of words and pages. These numbers are actually given by Google as shown in the picture.

We start by building a list of words. For the present case we started with a list of about 7000 current english words (and also some non-english as well as a few surnames). We have then to extract these figures for about 2.5 millions couples of words (7000 x 7000 / 2). Since we were mainly interested in words that look like a bit each other, this quantity is reduced to about 800 000 couples.

These numbers are given for free by search engines, however if we need a lot of them, we have to build a program that scans this huge number of different pages of a given search engine. If we do that, they detect us easily and tend to consider our program as a virus, blocking the IP (as I said these data allow search engines to make predictions about our behaviour and desires and have a strategic interest if we consider them in their globality, although they are worthless taken one-by-one). Therefore we needed to slow down the scan to make believe the program is actually a human being searching on the web (and in fact, if you want to know everything, we did not even use Google but another search engine, because Google was really too smart in detecting us. But results are independant of this choice to a large extent. I keep mentionning Google anyway since it's in the current language now).

Conceptually this inverse-turing-test situation is quite interesting, but in practice it considerably delayed our schedule! It took us about two months to download all the information we needed, with 5 different computers connected to the Web with different IP adresses. Search engines were the first to scan the sum of all the pages of mankind and now, mankind cannot have access to it's own data! In principle these data are not property of the search engine, since they still belong to their authors (each one of us), but the refined substance that is extracted from this raw material is most certainly their property, according to the usual laws of capitalistic societies (one may say this is a co-property but this argument would probably be ruled out in court).

Step 2: Analyzing this information using recent breakthroughs in the field of graph theory in order to understand the large scale structure of language in terms of homophony and semantics

The information extracted from Google as described above allow us to classify our couples of words according to semantic relatedness. As you will see there are mathematical formulae that allow to do that in the framework of graph theory. But first, remember that our starting point is Raymond Roussel's work and let's have a look at the other aspect of the question: homophony, which is quite easier and well known.

1 - in terms of homophony: Damerau-Levenshtein distance

The question of homophony within information theory has been investigated since the 1950's. Damerau-Levenshtein distance allows to measure the amount of difference between two sequences (written words) and therefore is a measure of their homophonic resemblance (not all the question of homophony is described by this quantity since it stays on the level of written letters and not real phonetics but it will be enough for us and matches quite well with the method of Roussel - although he also made a broader use of homophony). For instance the Damerau-Levenshtein distance between «billard» and «pillard» is low (it is 1 according to the definition below). The Damerau-Levenshtein distance between «kitten» and «sitting» is 3, since these three edits change one into the other, and there is no way to do it with fewer than three edits:

kitten --> sitten (substitution of 's' for 'k')
sitten --> sittin (substitution of 'i' for 'e')
sittin --> sitting (insert 'g' at the end)

Precisely, In information theory and computer science, Damerau–Levenshtein distance is a «distance» (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters.

A first part of our work was then to compute all the Damerau–Levenshtein distances between the 7000 x 7000 / 2 couples. That was the easy part.

2 - in terms of semantic relatedness: Google Similarity distance

Very recentely there have been important progress in the field of graph theory and quantitative linguistics. A recent paper from 2005 by Rudi L. Cilibrasi and Paul M.B. Vit´anyi, has defined a (pseudo-)distance called Google Similarity Distance (or Normalized Google Distance) that reflects the semantic relatedness of two words, using information extracted from search engines (cf Step 1). The distance between two words is smaller when both words are closely related, ie. when they tend to appear more often in the same webpages.

This Normalized Google Distance is actually an improvement of a simpler and more understandable formula defining the similarity between two words a and b:

s(a, b) = f(a,b) / sqrt (f(a) x f(b)) = Number of pages containing both a and b divided by the squareroot of (Number of pages containing a × Number of pages containing b).

Then d(a, b) = 1 — s(a,b) is the disimilarity between both words.

This distance depends on the figures that we have extracted from Google. After having scanned about 800 000 pages of search results, we were able to get all these quantities for our 7000 words and 800 000 couples of words. Results are impressive and seem to match quite well with semantic intuition, as it is argued in the paper by Cilibrasi and Vit´anyi.

3 - in terms of equivocation: clustering coefficient

An even more complex concept is the clustering coefficient on a graph. It is interesting for our study because on a semantic graph, it helps at defining the degree of polysemy of a word, or equivocation (as it is used by Roussel in his method). Intuitively if a word belong to two clusters it tends to have two different meanings. If a word belong to only one cluster, its meaning will most certainly be univocal.

Recent papers have studied clustering coefficient on a semantic network, by relating it to the graph curvature. In our case we had to consider the problem of an hypergraph and we used in fact a paper from 2007 dealing with weighted graphs. Technically we take the minimum of this coefficient for a given couple of words.

Here, results become very difficult to check and to interprete. This coefficient is function of the entire neighborhood of a word and calculating its full value was just impossible for us. We still lack a lot of information, and we had to make some approximations whose validity is not very well settled yet. However we expect that our study shed some lights on the subject.

At least qualitatively it was important to take this aspect into account since it is part of Roussel's method and since it allowed us to conceptualize our approach in terms of map in an unexpected and interesting way. If you look at the Dadamap section, the Utilitarianism zone and the Equivocation line, are directely related to this clustering coefficient (or curvature) approach.

Step 3: Visualizing these structures with maps, graphs or global indexes and interpreting the results

We are now in possession of huge lists of numbers and we need to interprete and visualize this information. This is a diffcult problem but not hopeless. There are two basic technics here: self-organizing maps and graphs visualization.

1) A self-organizing map map is a standard and very helpful technique used for visualizing low-dimensional views of high-dimensional data by preserving the topological properties of the high-dimensional initial space. It is generated with a neural network that gathers together pixels wich are of similar colours. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen, and is sometimes called a Kohonen map. Here is an example of a generation of such a map with a very small set of couples. At the beginning, coloured pixels are randomly picked and they organize themselves. In the Dadamap section you will discover our interpretation of the different zones, which is the main result of our study.

Click on the map to relaunch

2) Finally our data have been used for building an interactive graph. It is described in the R.R.Engine section.

A project by Christophe Bruno (2002 - 2008) - Programming Valeriu Lacatusu