## Google Correlate : How does it work?

Impressed by the amazing Google Correlate engine, I read the Google Correlate white paper generously published by its team. In this post I will start by going over the technical details provided in the white paper, quoting from it and explaining what the quotes mean. At the end of the post I will write up a short analysis of the query process, which hopefully can be used in other contexts.

**UPDATE:** Google released a complete description of their algorithm around October 16th, 2013 (a few days after this post was published). It is well written and very easy to follow. You can find it here.

# Google Correlate : What does it do?

With Google Correlate you can type a search query and get a list of other queries from Google's own analytics, in descending order of their Pearson correlation with it. For example, matching "iPhone" brings up other correlated terms, which one may expect.

The engine provides two more parameters for the query: a time shift in the data (the default is 0) and a time window on which the correlation is calculated (the default is the whole dataset). For example, if we shift the data by 2 weeks for "iPhone", we get different results, which are reasonable as well.

You can also use your own data: put in your own time series and find queries that match its pattern. The time series may also be provided as a drawing. For example, I drew winter peaks for each year (first screenshot), which resulted in winter-related activities and products (second screenshot):

# Google Correlate : How does it work?

In this section I will use quotes from the white paper in an attempt to understand how Google built its correlation engine.

In our Approximate Nearest Neighbor (ANN) system, we achieve a good balance of precision and speed by using a two-pass hash-based system. In the first pass, we compute an approximate distance from the target series to a hash of each series in our database. In the second pass, we compute the exact distance function on the top results returned from the first pass.

So the general idea is to compare the query series with hashes of the real data in order to focus the search on a part of the dataset. Once focused, the search computes the actual distance (Euclidean, see below) from the top results and returns the search terms found to the user.

## Indexing

Before one can query the data, hashes must be calculated for the entire dataset (i.e. the search queries) so they can be used in the first stage of the query. The indexing (or training) is done in two sequential parts.

### Correlation and Euclidean Distance

Before diving into the method, we first establish the relation between the Pearson correlation and the Euclidean distance.

**Definition.** For vectors \(u,v \in \mathbb{R}^N \) the **Euclidean Distance** is \(d(u,v) = \sqrt{\sum\limits_{i=1}^N {\left ( u_i - v_i \right )}^2} \)

**Notation.** For a vector \(x \in \mathbb{R}^N \), \(\mu_x = \frac{1}{N}\sum\limits_{i=1}^N x_i \) is the mean of the vector and \(\underline{x}_i = x_i - \mu_x \) is the centralized vector. The normalized vector is \(\bar{x} = \frac{\underline{x}}{\left \Vert \underline{x} \right \Vert} \).

**Definition.** For vectors \(u,v \in \mathbb{R}^N \), the **Pearson Coefficient (PC)** is defined by \(r_{u,v} = \langle \bar{u}, \bar{v} \rangle \).

The PC ranges between -1 and 1. A value of zero means no correlation; 1 means a perfect positive correlation and -1 a perfect negative one. Note that if the vectors in the dataset are normalized, the Pearson coefficient is simply the dot product between them. We can also use the Euclidean distance to calculate the PC of two vectors:

$$\begin{aligned}
d^2(\bar{u},\bar{v}) &= \sum\limits_{i=1}^N {\left ( \bar{u}_i - \bar{v}_i \right )}^2 \\
&= \sum\limits_{i=1}^N \bar{u}_i^2 + \sum\limits_{i=1}^N \bar{v}_i^2 - 2\sum\limits_{i=1}^N \bar{u}_i \bar{v}_i \\
&= {\left \Vert \bar{u} \right \Vert}^2 + {\left \Vert \bar{v} \right \Vert}^2 - 2\langle \bar{u}, \bar{v}\rangle \\
&= 2 - 2\langle \bar{u}, \bar{v}\rangle \\
&= 2 - 2r_{u,v}
\end{aligned}$$

so

$$ r_{u,v} = 1- \frac{1}{2}d^2(\bar{u}, \bar{v}) $$

Since normalized vectors lie on the unit sphere, the maximum distance between any two is \(2\) (and \(4\) squared). In the formula above this is halved, so the RHS indeed ranges between -1 and 1, like the PC. In order to find two vectors whose PC is close to 1 we should minimize the Euclidean distance between them, which is exactly the task of this engine.

As a preprocessing stage, the vectors in the dataset are normalized to a mean of \(0\) and a variance of \(1\) as above. The advantage of calculating the PC this way is that research on Nearest Neighbour (and ANN) search under the Euclidean distance is abundant.
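The identity \(r_{u,v} = 1 - \frac{1}{2}d^2(\bar{u}, \bar{v})\) is easy to check numerically. A minimal NumPy sketch, with arbitrary illustrative vectors:

```python
import numpy as np

def normalize(x):
    """Center to zero mean, then scale to unit Euclidean norm."""
    x = x - x.mean()
    return x / np.linalg.norm(x)

u = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
v = np.array([2.0, 7.0, 1.0, 8.0, 2.0, 8.0])

u_bar, v_bar = normalize(u), normalize(v)

# Pearson coefficient as a dot product of the normalized vectors...
r_dot = np.dot(u_bar, v_bar)
# ...and recovered from the Euclidean distance: r = 1 - d^2 / 2
d2 = np.sum((u_bar - v_bar) ** 2)
r_dist = 1.0 - d2 / 2.0

assert np.isclose(r_dot, r_dist)
assert np.isclose(r_dot, np.corrcoef(u, v)[0, 1])  # matches the textbook definition
```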

### Curse of Dimensionality

Traditional tree-based nearest neighbors search methods are not appropriate for Google Correlate due to the high dimensionality which results in sparseness. Most of these methods reduce to brute force linear search with such data.

This quote refers to the curse of dimensionality: what happens when the data resides in a high-dimensional space and some search, partition, or other kind of processing has to take place there. Some techniques which work with a handful of dimensions are simply not appropriate for hundreds or thousands of dimensions, because the computational complexity of their operations is usually exponential in the dimension (in time, space, or both). For example, for a k-d tree, the general rule states that the number of points (search terms in this case) must be orders of magnitude larger than \(2^k\), where \(k \) is the dimensionality of the data; otherwise the search collapses to a linear scan. This “curse” is a recognized problem for which, in the context of nearest neighbour search, many methods are available in the form of Approximate Nearest Neighbour (ANN) search, which returns results that are not exact but very close. These solutions, however, tend to assume that the data can be fitted with a lower-dimensional manifold on which more common algorithms may operate efficiently.

For Google Correlate, the amount of data is massive. For each search term, the search volume is recorded per week. As there are years of data and tens of millions of terms, the dimensionality is in the hundreds and the number of vectors is the number of search terms. A different approach is required.

### The Algorithm

For Google Correlate, we used a novel asymmetric hashing technique which uses the concept of projected quantization… Each set of projections is further quantized using a clustering method such as K-means. K-means is appropriate when the distance between two series is given by Euclidean distance. Since Pearson correlation can be easily converted into Euclidean distance by normalizing each series to be a standard Gaussian (mean of zero, variance of one) followed by a simple scaling.

The process described here is pretty straightforward. The algorithm selects \(m\) subspaces of \(\mathbb{R}^N \) and each vector in the dataset is projected onto them. Each projection yields a vector with the dimension of the subspace, so for each vector in the dataset we now have \(m\) projected vectors. Then, within each subspace, a partitioning is performed with k-means. In k-means, the goal is to find \(k\) vectors (here, in each of the subspaces) called centroids; each centroid represents a cluster in the subspace. Formally, the goal is to minimize this function

$$ Err(S, Q) = \frac{1}{|S|}\sum\limits_{i=1}^{|S|} \min_{1 \leq j \leq k} {\left \Vert x_i - q_j \right \Vert}^2 $$

where \(S \subset \mathbb{R}^N\) is the dataset and \(Q = {\lbrace q_j \rbrace}_{j=1}^k \) are the centroids of the \(k \) clusters. Finding the optimal centroids is, in general, an NP-hard problem; it is typically approximated with Lloyd's algorithm, a heuristic which may only find a local minimum.
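As a sketch of my reading of this indexing stage: take each (normalized) series, use \(m\) contiguous chunks as the "projections", and run plain Lloyd's k-means independently in each subspace. Everything here (chunking as the projection, the parameter values, the names) is my own illustration, not Google's code:

```python
import numpy as np

def build_index(data, m=4, k=8, iters=20, seed=0):
    """Projected-quantization sketch: k-means per subspace.

    data: (n_series, N) array of normalized series.
    Returns per-subspace centroids and each series' m cluster ids.
    """
    rng = np.random.RandomState(seed)
    chunks = np.array_split(data, m, axis=1)        # m "projections"
    centroids, codes = [], []
    for chunk in chunks:
        cent = chunk[rng.choice(len(chunk), k, replace=False)].copy()
        for _ in range(iters):                      # Lloyd's iterations
            dist = ((chunk[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for j in range(k):                      # move centroids to cluster means
                members = chunk[assign == j]
                if len(members):
                    cent[j] = members.mean(0)
        # final assignment against the updated centroids
        assign = ((chunk[:, None, :] - cent[None, :, :]) ** 2).sum(-1).argmin(1)
        centroids.append(cent)
        codes.append(assign)
    return centroids, np.stack(codes, axis=1)       # codes: (n_series, m)

# Toy usage: 100 series of 32 weekly points, m=4 subspaces, k=8 clusters each
rng = np.random.RandomState(1)
data = rng.randn(100, 32)
centroids, codes = build_index(data)
assert codes.shape == (100, 4) and centroids[0].shape == (8, 8)
```

With \(k = 256\) each per-subspace code fits in one byte, which lines up with the 8-bit codes mentioned in the querying quote below; the tiny values here are only to keep the sketch readable.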

## Querying

During the online search, given the target series, the most correlated database series are retrieved by asymmetric matching. The key concept in asymmetric matching is that the target query is not quantized but kept as the original series. It is compared against the quantized version of each database series. For instance, in our example, each database series is represented as an 8m bit code. While matching, this code is expanded by replacing each of the 8 bits by the corresponding K-means center obtained at training time, and Euclidean distance is computed between the target series and the expanded database series. The sum of the Euclidean distances between the target series and the database series in m subspaces represents the approximate distance between the two.

So, from the first step we have \(mk\) centroids in total, one per cluster in each subspace. I have to take a leap here, as this is not specified, but it seems that for each subspace the closest cluster is picked (by computing the distance from each of the \(k\) centroids in the subspace). The sum of these distances is used to focus on a subset of the dataset: all vectors which fall in the same clusters in each of the \(m \) projected subspaces. If there are not enough of these, the algorithm can choose the next best set of \(m \) clusters, and so on.

To further improve the precision, we take the top one thousand series from the database returned by our approximate search system (the first pass) and reorder those by doing exact correlation computation (the second pass).

This quote supports the suggestion above that one set of \(m \) clusters does not produce even 1000 search terms, and that several cluster sets are eventually used. The actual search-term volume vectors corresponding to these sets are retrieved, and the exact distance to the query vector is computed. The results are ordered and sent back to the user (see screenshots above).
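A sketch of how I imagine the asymmetric two-pass query, reusing the per-subspace centroids and codes produced by the indexing stage. The names and the candidate-selection detail are my own guesses, per the leap above:

```python
import numpy as np

def query(q, data, centroids, codes, top=1000):
    """Two-pass asymmetric query sketch (names and details are my own).

    q stays unquantized; codes[i, s] is series i's K-means cluster in
    subspace s. Assumes every series (and q) is normalized as above.
    """
    m = codes.shape[1]
    q_chunks = np.array_split(q, m)
    # Pass 1: distance from each query chunk to every centroid of its
    # subspace; expanding each code to its centroid distance and summing
    # over subspaces approximates d^2(q, series).
    approx = np.zeros(len(data))
    for s, qc in enumerate(q_chunks):
        d = ((centroids[s] - qc) ** 2).sum(axis=1)
        approx += d[codes[:, s]]
    cand = np.argsort(approx)[:top]          # focus on the best candidates
    # Pass 2: exact distances on the candidates only, hence exact correlations.
    exact = ((data[cand] - q) ** 2).sum(axis=1)
    order = np.argsort(exact)
    return cand[order], 1.0 - exact[order] / 2.0   # r = 1 - d^2/2

# Toy usage: 10 normalized series, with a degenerate one-cluster-per-subspace index
rng = np.random.RandomState(0)
data = rng.randn(10, 8)
data -= data.mean(axis=1, keepdims=True)
data /= np.linalg.norm(data, axis=1, keepdims=True)
chunks = np.array_split(data, 4, axis=1)
centroids = [c.mean(axis=0, keepdims=True) for c in chunks]
codes = np.zeros((10, 4), dtype=int)
order, corr = query(data[3], data, centroids, codes)
assert order[0] == 3 and np.isclose(corr[0], 1.0)  # the query itself ranks first
```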

## Indexing and Query Computational Cost

**Indexing**

With this engine, Google had to trade off indexing cost against search cost. The indexing in this case is rather expensive due to the k-means stage, but this makes the search very fast because the list of the top 1000 results can be fetched very quickly. Methods for comparing data in a hash space can be divided into two subclasses.

The first is data-dependent, meaning that the hashing process itself is based in some way on the data. Google Correlate is such an algorithm: it takes the indexed data into account during the k-means stage, and perhaps in the projection stage as well. As it happens, the Google philosophy of doing things is about online latency, not the cost of offline processes. I suppose that if someone needs 10K or so machines for something important enough, this should not be a problem. Users, however, cannot wait, and perhaps this engine is used in other places in Google where latency counts (e.g. semantic expansion of query results). If you are interested, I came across a very interesting paper while writing this post which does both clustering and projection simultaneously by fitting a lower-dimensional manifold.

The second subclass is insensitive, or data-independent, and includes methods such as Locality Sensitive Hashing. Such hashing is usually very cheap, but the search pays the price, so to speak, in space, time, or both.

**Search**

The search phases are:

1. Standardize the query vector
2. Compute the distance from each centroid (first projecting the query onto the corresponding subspace)
3. Sort “cluster vectors” by their distance sums. By “cluster vector” I mean a collection of clusters, one from each projected subspace.
4. Fetch the top 1000 vectors from storage
5. Compute the exact distances to the top vectors
6. Sort the vectors
7. Return the results

There are \(mk \) centroids in total; I would suppose that this number is low enough to keep them in memory. This means stages 1-3 can be done entirely in-memory, and stages 2-3 can even be parallelized. For stage 4, the actual search-term volume vectors are probably too big to fit in memory, so fetching them requires IO access. However, that should be relatively cheap, especially if the data is stored sequentially on disk with the vectors of each “cluster vector” grouped together: that means one seek and then a sequential read per “cluster vector”. Stages 5-6 can again be computed in parallel with the fetching of data from disk, which makes them very fast.

# Summary

The method presented in the white paper, though lacking in details, is quite inspiring in its robustness and relative ease of implementation. Two problems are left unanswered in the white paper: (1) How would querying a partial time interval be implemented? (2) As time passes, the time series must be extended; it is not clear how that can be achieved once the vectors have already been clustered. I hope Google will share more details about this system in the future so we can all build better systems.

## The Discrete Fourier Transform Provides a Good Description of Stock Data

In this post I am going to describe a little experiment I conducted with stock data and the Discrete Fourier Transform (DFT). Quoting Bernard Baruch,

The main purpose of the stock market is to make fools of as many men as possible.

And so it has: it has made many people rich and poor. I do not claim to have a winning model for predicting the stock market. I do claim that, as in many fields other than Digital Signal Processing (DSP), the DFT has valuable applications, as has been demonstrated by many others before.

Compression is much more than cramming more images and videos onto a hard drive. Compression tells us how much information exists in the data. This is a very important distinction made by Shannon when he developed Information Theory, and one of the most beautiful concepts in modern Computer Science IMHO. Given a model of the data we can encode the data in such a way that it requires less space to describe (see Kolmogorov Complexity for a more in-depth discussion). The compression in this post will be lossy, which means that it will not describe the information perfectly, but well enough. Well, “well enough” is a somewhat abstract term; a popular estimator is the Mean Squared Error (MSE).

**Definition.** For vectors \( x, \hat{x} \in \mathbb{R}^N \) the **MSE** is \( \frac{1}{N}\sum\limits_{i=1}^N {(x_i -\hat{x}_i)}^2 = \frac{1}{N} {\left \Vert x - \hat{x} \right \Vert}^2 \).

The DFT is a bijective transform from a complex vector to a complex vector of the same size. That means it doesn't lose any information: once the transform is applied, the original vector can be reconstructed easily. The DFT is the finite case of the Fourier Transform (FT), which takes a (sufficiently nice) function and represents it as a series of sine and cosine functions.

**Definition.** For a vector \( v \in \mathbb{C}^N \) the **DFT** is \( \mathcal{F}(v)_k =\sum\limits_{n=1}^N v_n e^{-i\frac{2\pi}{N}(n-1)(k-1)}\) for \( 1 \leq k \leq N \). \( f_1, f_2, \ldots, f_N \) are called the Fourier coefficients.

The inverse DFT takes \( \mathcal{F}(v) \) back to \( v \)

**Definition.** For a vector \( f \in \mathbb{C}^N \) the **Inverse DFT** is \( v_n = \mathcal{F}^{-1}(f)_n =\frac{1}{N}\sum\limits_{k=1}^N f_k e^{i\frac{2\pi}{N}(n-1)(k-1)}\) for \( 1 \leq n \leq N \).

Note that the transform is linear. Another nice fact:

**Parseval's Theorem.** For \( v \in \mathbb{C}^N \) and \( f =\mathcal{F}(v) \) the following holds: \( {\left \Vert v \right \Vert}^2 = \frac{1}{N}{\left \Vert f \right \Vert}^2 \)

When \( v \in \mathbb{R}^N \) the DFT requires only about \(\lceil N/2 \rceil\) complex numbers to encode the original vector (which makes sense, because that is about the same amount of numbers), specifically the first ones; the rest are determined by conjugate symmetry. So from here on, unless otherwise stated, \( v \in \mathbb{R}^N_+ \) (just positive real numbers). Stock prices are positive real numbers.
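These facts are easy to confirm with NumPy (note that NumPy indexes coefficients from 0 and puts the \(1/N\) factor in the inverse, matching the definitions above; the sample vector is arbitrary):

```python
import numpy as np

v = np.array([5.0, 3.0, 8.0, 1.0, 9.0, 2.0])     # positive reals, like prices
f = np.fft.fft(v)
N = len(v)

# Bijectivity: no information is lost by transforming.
assert np.allclose(np.fft.ifft(f).real, v)

# Parseval: ||v||^2 = (1/N) ||f||^2
assert np.isclose(np.sum(v ** 2), np.sum(np.abs(f) ** 2) / N)

# For real input the spectrum is conjugate-symmetric, so the first
# N//2 + 1 coefficients suffice; numpy's rfft keeps exactly those.
assert np.allclose(np.fft.rfft(v), f[: N // 2 + 1])
```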

# The Compression

The basic idea is to take the largest coefficients from the transform and use them to encode the data. We would also need to decide how many coefficients to use. Luckily we can bound the MSE on this lossy compression.

Let \( x\in \mathbb{R}^N_+ \) and \(\alpha \in (0, 1] \). Let \(f = \mathcal{F}(x) \) and let \(s:[0..N-1] \rightarrow [0..N-1] \) be a permutation such that \( | f_{s(i)} | \geq | f_{s(i+1)} | \) for all \( 0 \leq i < N-1 \), i.e. it orders the coefficients by descending magnitude. Let \(i_0\) be the smallest index at which the retained energy reaches the fraction \(\alpha\):

$$ i_0 = \min \left\{ i \; : \; \sum\limits_{k=0}^{i}|f_{s(k)}|^2 \geq \alpha\sum\limits_{k=0}^{N-1} |f_k|^2 \right\} $$

and write \(\beta \geq \alpha \) for the energy fraction actually retained, \( \beta = \frac{\sum_{k=0}^{i_0}|f_{s(k)}|^2}{\sum_{k=0}^{N-1}|f_k|^2} \). Now let \(\hat{f}_{s(k)} = {1}_{\{k \leq i_0\}} f_{s(k)} \) and let \(\hat{x} = \mathcal{F}^{-1}(\hat{f})\). So

$$ \begin{aligned}
MSE &= \frac{1}{N}\| x-\hat{x} \|^2 \\
&= \frac{1}{N^2} \| f-\hat{f} \|^2 \\
&= \frac{1}{N^2}\sum\limits_{k=0}^{N-1} {1}_{\{k > i_0\}} |f_{s(k)} |^2 \\
&= \frac{1-\beta}{N^2}\sum\limits_{k=0}^{N-1} | f_{s(k)}|^2 \\
&= \frac{1-\beta}{N^2}\| f \|^2 \\
&= \frac{1-\beta}{N} \| x \| ^2 \\
&\leq \frac{1-\alpha}{N} \| x \|^2
\end{aligned}$$

So the MSE of \(\hat{x} \) as an estimator of \(x \) is bounded by \(\frac{1-\alpha}{N}\|x \|^2 \). If we set a maximum acceptable error of \(MSE_{max}\), we get \(\alpha = 1-\frac{N}{\| x \|^2}MSE_{max} \). The parameter \(\alpha \) now fully specifies \(i_0 \) as a function of the maximum acceptable MSE: it is the smallest index at which the partial sum of the squared magnitudes of the permuted Fourier coefficients reaches \(\alpha \| f \|^2 = N \|x \|^2 \left ( 1-\frac{N}{ \| x \|^2}MSE_{max} \right ) \). Note that this selection of coefficients is optimal in the sense that it minimizes the MSE for a given number of retained coefficients.

# The Data

I used stocks from the NYSE, NASDAQ and AMEX stock exchanges. The list of tickers was obtained from http://www.nasdaq.com/screening/company-list.aspx and the actual stock data was retrieved from Yahoo! Finance, which provides complete historical prices for these exchanges. A simple Python downloader based on the ticker list can be found in this gist. In total, 5792 stocks were downloaded; in the analysis I used only closing prices. One thing to note about the DFT is that it requires uniformly spaced samples. There are two ways to look at that in the context of stock data. First, we can take the calendar point of view and say that the 2-day weekend stop in trade is a gap which should be filled before analysis; this procedure adds around 31% more points to the data, and I filled the gaps with the previous closing price. The second view suggests that it doesn't matter, since we should only count trading days. Both approaches are defensible and I have produced results for both (see below).

# Measures

Since the MSE bound is determined before the compression, I measure the compression and its characteristics using 3 histogram metrics.

- **Number of coefficients used.** I.e. \(i_0 \) from above, without normalization to the length of the data (each stock has a different IPO date).
- **Compression rate.** Similar to the one above, only normalized by the size of the series.
- **Coefficients used.** The actual indices of the coefficients found in the compression process.

The last metric is of special interest, as coefficient \(k \) denotes the \(k/N \) frequency. The higher the frequency, the more oscillations it encodes in the time range processed; the lower the frequency (down to 0), the wider the trend it describes. Before I ran the analysis, I hypothesized that the frequencies which carry most of the information would be closer to 0, with higher frequencies looking more like noise. This turned out to be very true.

In order for the results to be comparable among the different stocks, each vector was normalized. This also makes the threshold formula above somewhat simpler: \(N\left ( 1-MSE_{max}N \right ) \).

# Algorithm and Implementation

The compression algorithm, in Python, is as follows.
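A minimal NumPy sketch of the procedure derived above: transform, keep the largest-magnitude coefficients until the energy fraction \(\alpha\) is reached, and reconstruct. The function names and the sample data are mine, for illustration:

```python
import numpy as np

def compress(x, mse_max):
    """Keep the largest-magnitude coefficients until the alpha energy bound is met."""
    N = len(x)
    f = np.fft.fft(x)
    alpha = 1.0 - N * mse_max / np.sum(x ** 2)   # alpha from the MSE bound above
    s = np.argsort(-np.abs(f))                   # the permutation s: largest first
    energy = np.cumsum(np.abs(f[s]) ** 2)
    i0 = np.searchsorted(energy, alpha * energy[-1])   # smallest index reaching alpha
    keep = s[: i0 + 1]
    return keep, f[keep]                         # indices + values encode the series

def decompress(keep, values, N):
    f_hat = np.zeros(N, dtype=complex)
    f_hat[keep] = values
    return np.fft.ifft(f_hat).real

# Illustrative usage on a normalized random walk standing in for a price series
rng = np.random.RandomState(0)
x = 100.0 + np.cumsum(rng.randn(256))
x /= np.linalg.norm(x)
keep, values = compress(x, mse_max=1e-6)
x_hat = decompress(keep, values, len(x))
assert np.mean((x - x_hat) ** 2) <= 1e-6         # the derived bound holds
```

Typically only a small fraction of the coefficients survives, since a random walk (like most price series) concentrates its energy in the low frequencies.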

The complexity of the algorithm is \(\mathrm{O}(N\log N) \), due to the FFT and the sorting of the coefficients.

# Results

The following histogram is given in logarithmic scale. It provides the counts for each compression ratio (ratios were rounded up to the nearest integer). Clearly the compression is very good: 5050 stocks using non-uniform dates and 5043 stocks using uniform dates were compressed with a ratio of 10% or lower.

The scatter graph below, which also shows a negligible difference between the calendar-uniform and non-uniform data, displays the number of coefficients used to achieve the desired MSE of \(10^{-6}\).

This graph is of most interest. It shows how many times each coefficient was used in the compression, regardless of series length. There is a definite tendency to use the first, low-frequency, coefficients.

# Summary

Stock data can be efficiently represented using a small portion of the original via the Discrete Fourier Transform. There are other encodings (not tested here), such as arithmetic coding, Huffman coding and others, which may yield better compression results. Furthermore, there are probably many other data domains, such as climate, health, economic indicators and meters, sociology, transportation, communications, server monitoring, biology and medicine, where this type of compression may provide good results.

The advantage of this method over bit-count-only methods is that the data is stored in an analytical form. This may prove most productive, as many tools exist for frequency-domain analysis (see Coherence, Cross-Correlation and Phase Correlation as examples).

## Setting up a PostGIS db with OpenStreetMap data

In this post I will provide a recipe for installing and loading a Postgres 9.1/PostGIS 2.0 database with OpenStreetMap data. I am using a Windows 7 machine, and after some trials I opted to use a Linux virtual machine. Steps 1-3 are redundant if you have a Linux machine ready, but you might need to make some adjustments.

**Step 1**

Install VMware Player from http://www.vmware.com/products/player/.

**Step 2**

Download the Ubuntu 12.10 64-bit ISO or later from http://www.ubuntu.com/download/desktop. Preferably use the torrent version: it is faster, more fault tolerant and preferred by the distributors of Ubuntu.

**Step 3**

Create a new machine and use the ISO you downloaded in Step 2 to create it. I allocated 300GB, 4 cores and 8GB RAM. Later I found that to be a bit excessive for basic usage but required if you want to do more with the machine later. Update the OS with all of the latest updates.

**Step 4**

Install Postgres and PostGIS. Run the following in the terminal (from http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1210src)

$ sudo apt-get install build-essential postgresql-9.1 postgresql-server-dev-9.1 libgeos-c1 libxml2-dev libproj-dev libjson0-dev xsltproc docbook-xsl docbook-mathml

$ sudo apt-get install libgdal1-dev

$ wget http://download.osgeo.org/postgis/source/postgis-2.0.3.tar.gz

$ tar xfvz postgis-2.0.3.tar.gz

$ cd postgis-2.0.3

$ ./configure

$ make

$ sudo make install

$ sudo ldconfig

$ sudo make comments-install

$ sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/shp2pgsql

$ sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/pgsql2shp

$ sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/raster2pgsql

$ sudo apt-get install pgadmin3

$ sudo apt-get install postgresql-contrib

**Step 5**

Download the planet OSM file (http://planet.openstreetmap.org/) or any of your choice. I used the London map from http://download.bbbike.org/osm/ for testing, it takes less time.

**Step 6**

Tune the db (from http://workshops.opengeo.org/postgis-intro/tuning.html#section-21-tuning-postgresql-for-spatial)

Your config file is /etc/postgresql/9.1/main/postgresql.conf

change the following:

shared_buffers = 4GB

work_mem = 64MB

maintenance_work_mem = 512MB

checkpoint_segments = 6

add to /etc/sysctl.conf the following:

kernel.shmmax = 4418322432

kernel.shmall = 1078692

and restart your machine

**Step 7**

Create a spatially enabled db. Create your db, add the extensions ‘postgis’, ‘postgis_topology’ and ‘hstore’, and test your installation with

select postgis_full_version();

select ST_Point(1, 2) AS MyFirstPoint;

select ST_SetSRID(ST_Point(-77.036548, 38.895108),4326); -- using WGS 84

**Step 8**

Import the data. We will be using osmosis

$ sudo apt-get install openjdk-6-jdk git

$ git clone https://github.com/openstreetmap/osmosis.git

$ cd osmosis

$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64/

$ ./gradlew build

$ export JAVACMD_OPTIONS="-Xmx2G -server"

Execute pgsnapshot_schema_0.6.sql from ~/osmosis/package/scripts in your db, and you are now ready to import the data:

$ cd ~/osmosis/package/bin

$ ./osmosis --read-xml file="/home/bar/London.osm" --buffer --write-pgsql host="127.0.0.1" database="GISDB" user="username" password="password"

That’s it! Now you can teach yourself how to use the DB using, for example, the tutorials at http://trac.osgeo.org/postgis/wiki/UsersWikiTutorials.

## Cyclic Tag System : 1 Line of Turing-Complete Code

The following line of Python code is able to simulate a Universal Turing Machine:

```python
while len(word) > S : pIndex, word = (pIndex + 1) % len(C), word[1:] + C[pIndex] if (word[0] == "1") else word[1:]
```

It is 114 characters long and can probably be reduced further using some clever Python tricks. It implements one of the most extraordinarily simple Turing-complete models: a Cyclic Tag System, or CTS. The CTS was first introduced by Matthew Cook in 2004, when he published his paper on the universality of Elementary Cellular Automata and specifically Rule 110. This rule, simple in its own right but out of scope for this post, was long suspected to be able to simulate a Universal Turing Machine (and thus be Turing-complete). It was Cook who proved this conjecture while working for Stephen Wolfram at Wolfram Research.

For those willing to read the whole thing, I have included a bonus: how the Collatz Conjecture can be simulated using a CTS.

# How does a Cyclic Tag System work?

If you think of a cyclic tag system as your CPU, then your code is an array of binary words; let's call it \( C = {\alpha}_0{\alpha}_1..{\alpha}_{p-1} \) (same as in the code above). By binary words I mean strings of 0s and 1s. This is your Assembly code, machine language if you like. The \( \alpha_i\)'s are called appendants. Your data, the input parameter, is a binary word; let's call it \( W = {w}_{0}{w}_{1}..{w}_{n} \) (the variable ‘word’ in the code above). \( C \) never changes during execution, but we do keep track of which part of the program we are executing with a pointer; let's call it \( k \) (pIndex in the code above). The executor cycles through the program in a never-ending manner. Execution halts when the word \( W\), which is being manipulated during execution (bear with me), is shorter than or equal in length to \( S\), some fixed integer. Each execution step consists of the following actions:

- while \( |W| > S \):
  - if \( w_0 = 1 \) then \( W = {w}_{1}{w}_{2}..{w}_{n} + {\alpha}_{k} \)
  - else \( W = {w}_{1}{w}_{2}..{w}_{n} \)
  - \( k = (k + 1) \mod p\)

**Example 1:**

\( C = 0, 1\) , \( W = 001 \) , \( S = 1 \)

| Step # | k | W |
|---|---|---|
| 1 | 0 | 001 |
| 2 | 1 | 01 |
| 3 | 0 | 1 |

The execution halts when the length of the word is 1.
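Plugging Example 1 into the one-liner from the top of the post (the variable bindings are as described above):

```python
# C = 0,1 ; W = 001 ; S = 1, as in Example 1
C, word, S, pIndex = ["0", "1"], "001", 1, 0

while len(word) > S : pIndex, word = (pIndex + 1) % len(C), word[1:] + C[pIndex] if (word[0] == "1") else word[1:]

print(word)  # "1", matching the last row of the table
```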

**Example 2:**

\( C = 10, 101 \), \( W = 10 \), \( S = 1 \)

| Step # | k | W |
|---|---|---|
| 1 | 0 | 10 |
| 2 | 1 | 010 |
| 3 | 0 | 10 |
| ... | | |

Here, in every odd step the first appendant is appended and in every even step the second appendant is ignored, so we have an endless loop which never halts.

# Turing Completeness

In order to prove that something is Turing-complete, one should show that it can emulate a Universal Turing Machine (UTM). If it can, it can do anything a UTM does and is thus universal; this is the Church-Turing Thesis. Intuitively, it will be as general-purpose as your desktop, iPhone, or favorite programming language. Now, you can directly show that something can emulate a UTM, but it is sometimes more convenient to show that a model can emulate another Turing-complete model, since computational capacity is a transitive relation: if A emulates a UTM and B emulates A, then B emulates a UTM.

Cook showed that Rule 110 can emulate a UTM by going through 2 steps. He first showed that a construction by Emil Post called a 2-tag system can be emulated by a CTS, and then that Rule 110 can emulate a CTS, thus proving the claim. There is actually another proof of CTS universality, published by Neary and Woods in 2006, 2 years after Cook's first proof. Instead of a 2-tag system they use a novel model, a Clockwise Turing Machine, which is Turing-complete. Cook's original proof has an exponential time scale-up; using a Clockwise Turing Machine only a polynomial scale-up was incurred, which is quite remarkable. The proof by Neary and Woods is quite complex and the eager reader is referred to their paper. I will review Cook's very intuitive and short proof.

## 2-Tag System

A 2-tag system has an alphabet \( A \) and a set of production rules \( P:A\rightarrow A^* \). The computation is very similar to the cyclic tag system's. We have an initial word \( W={w}_{0}{w}_{1}{w}_{2}..{w_n} \). At each step we check whether the word is 2 symbols or more in length. If it is, we delete the 2 leftmost letters \( {w}_{0}{w}_{1} \) and append \( P({w}_{0}) \) to the right side of the word; otherwise we halt.

**Example 1:**

\( A = \lbrace a, b \rbrace \), \( P(a) = \text{'bb'} \), \( P(b) = \text{''} \), with \( W = \text{'aaa'} \)

| Step # | W |
|---|---|
| 0 | aaa |
| 1 | abb |
| 2 | bbb |
| 3 | b |

**Example 2:**

\( A = \lbrace a, b \rbrace \), \( P(a) = \text{'ab'} \), \( P(b) = \text{''} \), with \( W = \text{'ab'} \)

| Step # | W |
|---|---|
| 0 | ab |
| 1 | ab |
| ... | |

Just like in the second CTS example, the execution here never halts and the state of the system cycles forever.

Python code of the first example:

```python
P = {'a' : 'bb', 'b' : ''}
word = 'aaa'
S = 1
while len(word) > S:
    word = word[2:] + P[word[0]]
```

The implementation here is even simpler than the CTS one.

## Cyclic Tag System Can Emulate 2-Tag System

Suppose we have a 2-tag system with \( A = \lbrace {a}_{1}, {a}_{2}, {a}_{3}, .., {a}_{k} \rbrace \). We can encode each character in binary in the following way: \( T({a}_{i}) = 0..010..0 \), where the \( 1\) is in position \( i\). The program \( C = {\alpha}_0{\alpha}_1..{\alpha}_{2k-1}\) is constructed in the following manner: if \( P(a_i) = a_{i1}a_{i2}..a_{in_i}\), then for each \( a_i\) we set \( \alpha_{i-1} =T(a_{i1})T(a_{i2})..T(a_{in_i}) \). We also add \( k\) empty appendants to \( C\), so overall we have \( |C| = 2k\).

The proof: first, notice that each time the CTS starts reading a sequence of \( k\) characters from the word, it is either at the beginning of \( C\) or halfway through it, at the beginning of the \( k\) additional empty appendants we added to \( C\).

If \( pIndex = 0\), i.e. it points to the beginning of \( C\), the CTS will encounter a single 1 while going through the word, at position \( i-1\), which corresponds to the \( i^{th}\) symbol of the 2-tag system, and it appends \( T(a_i)\) to the word. The CTS then goes through 0s until it has consumed all \( k\) positions and \( pIndex\) reaches \( k\).

If \( pIndex = k\), the CTS will delete \( k\) letters from the word while appending only empty strings, in essence ignoring the character represented by those \( k\) binary symbols in the word. It will then cycle back to the beginning of \( C\) and we find ourselves in the previous paragraph.

If we set \( S = k\), it is now easy to see that the constructed CTS emulates the 2-tag system perfectly.

The following Python code implements the conversion above, with the parameters and input of the 2-tag system from Example 1.

```python
# Conversion
def TwoTagSymbolToBinaryWord(A, s):
    idx = A.index(s)
    return ('0' * idx) + '1' + '0' * (len(A) - idx - 1)

def TwoTagWordToBinaryWord(A, w):
    t = ''
    for i in xrange(len(w)):
        t += TwoTagSymbolToBinaryWord(A, w[i])
    return t

def TwoTagToCTS(A, P):
    C = []
    for i in xrange(len(A)):
        C.append('')
        C[i] += TwoTagWordToBinaryWord(A, P[A[i]])
    for i in xrange(len(A)):
        C.append('')
    return C, len(A)

# 2-Tag System
A = ['a', 'b']
P = {'a' : 'bb', 'b' : ''}

# Input for 2-tag
word = "aaa"

# Convert
C, S = TwoTagToCTS(A, P)
wordCTS = TwoTagWordToBinaryWord(A, word)

print "S: " + str(S)
print "C: " + str(C)
print "Input: " + wordCTS
print "step #\tpIndex\tword"

# Run and print CTS
step = 1
pIndex = 0

# CTS Step
def DoCTSStep(word, C, pIndex):
    w0 = word[0]
    word = word[1:]
    if (w0 == '1'):
        word = word + C[pIndex]
    pIndex = (pIndex + 1) % len(C)
    return word, pIndex

print str(step) + "\t" + str(pIndex) + "\t" + wordCTS
while len(wordCTS) > S:
    wordCTS, pIndex = DoCTSStep(wordCTS, C, pIndex)
    step += 1
    print str(step) + "\t" + str(pIndex) + "\t" + wordCTS
```

The code outputs the following (with added comments):

```
S: 2
C: ['0101', '', '', '']
Input: 101010
step #  pIndex  word
1       0       101010     = aaa
2       1       010100101
3       2       10100101   deletes the second symbol
4       3       0100101
5       0       100101     = abb
6       1       001010101
7       2       01010101   deletes the second symbol
8       3       1010101
9       0       010101     = bbb
10      1       10101
11      2       0101       deletes the second symbol
12      3       101
13      0       01         = b
```

As expected, the code halts and simulates the 2-tag system run.

## 2-Tag System Can Emulate A Turing Machine

There are actually at least 3 different proofs for the universality of tag systems. The first was provided by Minsky in 1961, the second by Cocke and Minsky 2 years later in 1963, and the third by Cook in his 2004 paper on the universality of Rule 110.

The two latter proofs are the simplest, in my opinion. The reader may follow both in the original papers; they are not long, but they are too technical, however inspiring, to cover here.

# Bonus: The Collatz Conjecture with 2-Tag System

Using the right configuration, Liesbeth De Mol showed in her 2008 paper how the Collatz conjecture can be reproduced in a tag system. A reminder: the conjecture asserts that the following sequence always terminates at 1. If \( n\) is odd, the next element in the sequence is \( 3n+1\); if even, the next element is \( n/2\). It has been a long-standing open problem since Lothar Collatz first proposed it in 1937.
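The sequence itself is a one-liner; here is a minimal sketch, for reference:

```python
def collatz_sequence(n):
    """Iterate the Collatz map (3n+1 if odd, n/2 if even) until reaching 1."""
    seq = [n]
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        seq.append(n)
    return seq

print(collatz_sequence(3))  # [3, 10, 5, 16, 8, 4, 2, 1]
```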

We encode a natural number \( n\) with \( n\) 'a's; 'b' and 'c' are also included in the alphabet. The production rules are as follows: \( P(a) = bc\), \( P(b) = a\), \( P(c) = aaa\). The run should terminate with 'a', or '100' in CTS jargon. Voila!

```python
A = ['a', 'b', 'c']
P = {'a': 'bc', 'b': 'a', 'c': 'aaa'}
```

```
S: 3
C: ['010001', '100', '100100100', '', '', '']
Input: 100100100
step #  pIndex  word
1    0  100100100
2    1  00100100010001
3    2  0100100010001
4    3  100100010001
5    4  00100010001
6    5  0100010001
7    0  100010001
8    1  00010001010001
9    2  0010001010001
10   3  010001010001
11   4  10001010001
12   5  0001010001
13   0  001010001
14   1  01010001
15   2  1010001
16   3  010001100100100
17   4  10001100100100
18   5  0001100100100
19   0  001100100100
20   1  01100100100
21   2  1100100100
22   3  100100100100100100
23   4  00100100100100100
24   5  0100100100100100
25   0  100100100100100
26   1  00100100100100010001
27   2  0100100100100010001
28   3  100100100100010001
29   4  00100100100010001
30   5  0100100100010001
31   0  100100100010001
32   1  00100100010001010001
33   2  0100100010001010001
34   3  100100010001010001
35   4  00100010001010001
36   5  0100010001010001
37   0  100010001010001
38   1  00010001010001010001
39   2  0010001010001010001
40   3  010001010001010001
41   4  10001010001010001
42   5  0001010001010001
43   0  001010001010001
44   1  01010001010001
45   2  1010001010001
46   3  010001010001100100100
47   4  10001010001100100100
48   5  0001010001100100100
49   0  001010001100100100
50   1  01010001100100100
51   2  1010001100100100
52   3  010001100100100100100100
53   4  10001100100100100100100
54   5  0001100100100100100100
55   0  001100100100100100100
56   1  01100100100100100100
57   2  1100100100100100100
58   3  100100100100100100100100100
59   4  00100100100100100100100100
60   5  0100100100100100100100100
61   0  100100100100100100100100
62   1  00100100100100100100100010001
63   2  0100100100100100100100010001
64   3  100100100100100100100010001
65   4  00100100100100100100010001
66   5  0100100100100100100010001
67   0  100100100100100100010001
68   1  00100100100100100010001010001
69   2  0100100100100100010001010001
70   3  100100100100100010001010001
71   4  00100100100100010001010001
72   5  0100100100100010001010001
73   0  100100100100010001010001
74   1  00100100100010001010001010001
75   2  0100100100010001010001010001
76   3  100100100010001010001010001
77   4  00100100010001010001010001
78   5  0100100010001010001010001
79   0  100100010001010001010001
80   1  00100010001010001010001010001
81   2  0100010001010001010001010001
82   3  100010001010001010001010001
83   4  00010001010001010001010001
84   5  0010001010001010001010001
85   0  010001010001010001010001
86   1  10001010001010001010001
87   2  0001010001010001010001100
88   3  001010001010001010001100
89   4  01010001010001010001100
90   5  1010001010001010001100
91   0  010001010001010001100
92   1  10001010001010001100
93   2  0001010001010001100100
94   3  001010001010001100100
95   4  01010001010001100100
96   5  1010001010001100100
97   0  010001010001100100
98   1  10001010001100100
99   2  0001010001100100100
100  3  001010001100100100
101  4  01010001100100100
102  5  1010001100100100
103  0  010001100100100
104  1  10001100100100
105  2  0001100100100100
106  3  001100100100100
107  4  01100100100100
108  5  1100100100100
109  0  100100100100
110  1  00100100100010001
111  2  0100100100010001
112  3  100100100010001
113  4  00100100010001
114  5  0100100010001
115  0  100100010001
116  1  00100010001010001
117  2  0100010001010001
118  3  100010001010001
119  4  00010001010001
120  5  0010001010001
121  0  010001010001
122  1  10001010001
123  2  0001010001100
124  3  001010001100
125  4  01010001100
126  5  1010001100
127  0  010001100
128  1  10001100
129  2  0001100100
130  3  001100100
131  4  01100100
132  5  1100100
133  0  100100
134  1  00100010001
135  2  0100010001
136  3  100010001
137  4  00010001
138  5  0010001
139  0  010001
140  1  10001
141  2  0001100
142  3  001100
143  4  01100
144  5  1100
145  0  100
```

**References**

M. Minsky, "Recursive Unsolvability of Post's Problem of 'Tag' and Other Topics in Theory of Turing Machines", Annals of Mathematics, vol. 74, no. 3, pp. 437-455; November 1961.

J. Cocke and M. Minsky, "Universality of Tag Systems with P=2", AI Memo 52, MIT, Cambridge, MA; 1963.

M. Cook, "Universality in Elementary Cellular Automata", Complex Systems, vol. 15, pp. 1-40; 2004.

T. Neary and D. Woods, "P-completeness of Cellular Automaton Rule 110", Proceedings of ICALP 2006, Lecture Notes in Computer Science, vol. 4051, pp. 132-143, Springer, Berlin; 2006.

L. De Mol, "Tag Systems and Collatz-Like Functions", Theoretical Computer Science, vol. 390, issue 1, pp. 92-101; January 2008.

## Memory-Efficient String Compression

In this post I will show how to do memory-efficient string compression. C# and .NET in general have a managed memory model, which means that memory cannot be accessed directly. In some cases, such as this one, that might be too restrictive. Fortunately, .NET enables direct memory access, as in unmanaged runtimes such as C and C++. While compression is a good example of the technique presented, any byte-oriented processing of strings may benefit from it.

I will use .NET's GZipStream for the example, but any compression stream can be used (I have actually used the 7-Zip LZMA library with it). Before we get to the efficient implementation, let's check the naive approach:

```csharp
byte[] CompressStringNaive(string str, Encoding encoding)
{
    //Get the string's bytes - a second, full copy of the string data
    byte[] strBytes = encoding.GetBytes(str);

    //Wrap the string's bytes with a memory stream
    MemoryStream memoryStream = new MemoryStream(strBytes);

    //Compress into a second memory stream
    MemoryStream compressedBytesStream = new MemoryStream();
    using (GZipStream compressionStream = new GZipStream(compressedBytesStream, CompressionLevel.Fastest))
    {
        memoryStream.CopyTo(compressionStream);
    }

    //Voila! the compressed bytes
    return compressedBytesStream.ToArray();
}
```

We have two copies of the string bytes: once in the string and once in the byte array. This is a waste of memory. .NET encodes its strings as UTF-16. The following compresses the string's UTF-16 bytes directly from memory:

```csharp
unsafe byte[] CompressString(string str)
{
    fixed (char* strPtr = str)
    {
        long length = Encoding.Unicode.GetByteCount(strPtr, str.Length);

        //View the string's bytes as a stream - no copy is made
        byte* bytePtr = (byte*)strPtr;
        var strStream = new UnmanagedMemoryStream(bytePtr, length);

        //Pre-size the output to ~2% of the input to avoid reallocations
        MemoryStream outStream = new MemoryStream((int)(length / 50));

        //Compress directly from the string's memory
        using (GZipStream compressionStream = new GZipStream(outStream, CompressionLevel.Fastest, leaveOpen: true))
        {
            strStream.CopyTo(compressionStream);
        }

        //Close str stream
        strStream.Close();

        //To Array
        return outStream.ToArray();
    }
}
```

Notice that the method is marked as unsafe; this tells .NET that the method uses unmanaged operations, and it is necessary for the code to compile. The fixed keyword tells the garbage collector not to move the string around in memory, because we are accessing it by address. UnmanagedMemoryStream lets us view the string bytes as a stream, much like the more common MemoryStream. Notice that we also initialize outStream to at least 2% of the original size – this can be configured and tuned. It prevents reallocations while compressing.

We lost control of the encoding, and that's a shame. We could wrap the UnmanagedMemoryStream with an encoding-conversion stream (I will cover the implementation of such a thing in a future post – stay tuned). In any case, we have cut the memory requirements in half! This will have an impact on both time and space (think of LOH collections for long strings).

Happy Compression

**EDIT**

I got some constructive comments on HN (here) suggesting that I explain the benefits of the second version of the code in more depth (thank you tkellogg for your constructive criticism). So I will! Two points arose in the comments: (1) why are you using unsafe code? It is not a recommended practice. (2) Why bother with unsafe code and pointers at all, when the byte array is going to be collected very quickly either way? Both of these are excellent points.

First, we must understand that compression is something we do on large strings, not on a few KBs. Recall that every object larger than 85K does not go into the first generation of .NET's memory heaps – it is allocated in a separate heap called the Large Object Heap (LOH). Why, you ask? Because if an object survives a first-generation collection, it moves to the second generation and then to the third. These moves are very expensive and should be avoided for large objects, even by the garbage collector. LOH collections are expensive both in time (only in .NET 4.5 do you have a server GC with background mode) – meaning your application halts – and in CPU (check the "% time in GC" counter with long strings and high throughput; it is nasty). So the use case for compression is large objects, large objects are expensive to collect, and we should therefore avoid allocating them in the first place. The improved version of the code does just that, and it actually uses less memory, which is always nice.

Secondly, unsafe code is considered bad practice primarily because the generation shifts I mentioned earlier are prevented, and defragmentation of the heaps (something the GC does as well) is interrupted – this is the effect of the fixed keyword. In our case, however, these operations do not happen often, as they are expensive on the LOH, and they become even less frequent because we keep the LOH smaller – all in all, a good practice. Also, if you check the implementation of Encoding.GetByteCount or Encoding.GetBytes with a decompiler (such as dotPeek, which I blogged about before), you will see the same use of pointers – so that's MS code for you. I hope that helps explain the rationale further.

## Queue Stream

I have written a little stream I call *queue stream* which I want to share with you. It is a stream you can both read from and write to. Internally it keeps a queue of the buffers you wrote; when reading, the stream collects buffers from the queue up to your requested length.

Some optimizations:

- The buffers committed to the queue have a maximum size of 64KB, so nothing gets into the LOH.
- Buffers in the queue are all taken from a buffer manager, courtesy of the WCF developers. The difference will be noticeable for long streams, where buffers are frequently reused.

Not implemented: if you want the stream to be **thread-safe** – which may prove quite useful when writing on one thread and reading on another – use a ConcurrentQueue instead of a Queue; it is lock-free and very fast.

The code can be found here: https://gist.github.com/3980608
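The linked implementation is in C#; as a language-neutral illustration, here is a minimal Python sketch of the same idea – a queue of written chunks, with reads that may span several chunks (the buffer-manager and 64KB-cap optimizations are omitted):

```python
from collections import deque

class QueueStream:
    """A FIFO byte stream: write() enqueues chunks, read() dequeues bytes."""

    def __init__(self):
        self._chunks = deque()

    def write(self, data):
        if data:
            self._chunks.append(bytes(data))

    def read(self, n):
        """Return up to n bytes, consuming queued chunks as needed."""
        out = bytearray()
        while n > 0 and self._chunks:
            chunk = self._chunks.popleft()
            if len(chunk) > n:
                # Put the unread tail back at the front of the queue
                self._chunks.appendleft(chunk[n:])
                chunk = chunk[:n]
            out += chunk
            n -= len(chunk)
        return bytes(out)

stream = QueueStream()
stream.write(b"hello ")
stream.write(b"world")
print(stream.read(8))  # b'hello wo'
print(stream.read(8))  # b'rld'
```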

## Working with Bitfields in C#

Bitfields are a great way to maintain an arbitrary collection of bit flags using a single numeric value (usually an integer). For example, when using a System.Thread object in .NET you may want to use the ThreadState property. The property is of type System.Threading.ThreadState, which is an enum decorated with FlagsAttribute. The integer values of the enum are powers of 2 because we want to use the actual bits of the integer to mark different states (see my previous post on .NET integer representation). Note that the ThreadState states are not mutually exclusive, so several states can be flagged at the same time.

```csharp
[Flags]
public enum ThreadState
{
    Running = 0,
    StopRequested = 1,
    SuspendRequested = 2,
    Background = 4,
    Unstarted = 8,
    Stopped = 16,
    WaitSleepJoin = 32,
    Suspended = 64,
    AbortRequested = 128,
    Aborted = 256
}
```

With Int32 we have 32 bits we can use. An integer indicating that the state is both WaitSleepJoin and AbortRequested (it can happen) will have the value 160 (32 | 128).

Working with bitfields is all about bitwise operations. To assert that the thread is in these two states, and exactly these two states, we can use the following:

```csharp
t.ThreadState == (ThreadState.AbortRequested | ThreadState.WaitSleepJoin)
```

Checking if a bit is on:

```csharp
(t.ThreadState & ThreadState.AbortRequested) == ThreadState.AbortRequested
```

Turning on a bit:

```csharp
t.ThreadState |= ThreadState.AbortRequested
```

What happens if the FlagsAttribute is not used? Consider the following:

```csharp
enum ColorsEnum
{
    None = 0,
    Red = 1,
    Green = 2,
    Blue = 4,
}

[Flags]
public enum ColorsFlaggedEnum
{
    None = 0,
    Red = 1,
    Green = 2,
    Blue = 4,
}
```

We have two enums with the same values and the same names, but one is marked with the flags attribute and the other is not.

```csharp
Console.WriteLine("Without Flags:");
for (int i = 0; i <= 8; i++)
{
    Console.WriteLine("{0,3}: {1}", i, ((ColorsEnum)i).ToString());
}
Console.WriteLine("With Flags:");
for (int i = 0; i <= 8; i++)
{
    Console.WriteLine("{0,3}: {1}", i, ((ColorsFlaggedEnum)i).ToString());
}
```

Running the code above will produce the following output:

```
Without Flags:
  0: None
  1: Red
  2: Green
  3: 3
  4: Blue
  5: 5
  6: 6
  7: 7
  8: 8
```

```
With Flags:
  0: None
  1: Red
  2: Green
  3: Red, Green
  4: Blue
  5: Red, Blue
  6: Green, Blue
  7: Red, Green, Blue
  8: 8
```

Seems that the FlagsAttribute does have some effect. But is it really necessary?

```csharp
ColorsEnum y = ColorsEnum.Blue | ColorsEnum.Green;
Console.WriteLine(y.ToString());
```

Running this code will print "6" to the console. Using dotPeek, we have the following for Enum.ToString():

```csharp
public override string ToString()
{
    return Enum.InternalFormat((RuntimeType) this.GetType(), this.GetValue());
}

private static string InternalFormat(RuntimeType eT, object value)
{
    if (!eT.IsDefined(typeof (FlagsAttribute), false))
        return Enum.GetName((Type) eT, value) ?? value.ToString();
    else
        return Enum.InternalFlagsFormat(eT, value);
}

private static string InternalFlagsFormat(RuntimeType eT, object value)
{
    ulong num1 = Enum.ToUInt64(value);
    Enum.HashEntry hashEntry = Enum.GetHashEntry(eT);
    string[] strArray = hashEntry.names;
    ulong[] numArray = hashEntry.values;
    int index = numArray.Length - 1;
    StringBuilder stringBuilder = new StringBuilder();
    bool flag = true;
    ulong num2 = num1;
    for (; index >= 0 && (index != 0 || (long) numArray[index] != 0L); --index)
    {
        if (((long) num1 & (long) numArray[index]) == (long) numArray[index])
        {
            num1 -= numArray[index];
            if (!flag)
                stringBuilder.Insert(0, ", ");
            stringBuilder.Insert(0, strArray[index]);
            flag = false;
        }
    }
    if ((long) num1 != 0L)
        return value.ToString();
    if ((long) num2 != 0L)
        return ((object) stringBuilder).ToString();
    if (numArray.Length > 0 && (long) numArray[0] == 0L)
        return strArray[0];
    else
        return "0";
}
```

This is the route a Flags-decorated enum goes through when ToString is invoked. So the ToString implementation is sugar only – no actual meat involved. In fact, one can use simple integers; the enum keyword is only sugar.

# Types of Enums

An enum is not necessarily an integer; we can use any of the C# integral types except char. From the C# documentation:

The approved types for an enum are byte, sbyte, short, ushort, int, uint, long, or ulong.

http://msdn.microsoft.com/en-us/library/sbbt4032(v=vs.110).aspx

There is absolutely no need to use a signed integer, but it doesn't hurt either if you work with bitwise operations. You should use only the number of bits required for your state: use byte if you need 8 flags or fewer, Int16 if you need 9-16, and so on. The all-bits-off value is used for a "None" state or something similar.

# Performance

The best thing about bitfields is the performance gain compared with naïve alternatives (several Booleans, a dictionary, etc.). The performance increase is twofold. First, memory access: if you use Booleans to store information about an object, you will most likely want to assert the values of several of them in conjunction. An often overlooked optimization is the CPU cache. The assembly:

```
if (flag1 && flag2 && flag3)
0000011d  cmp   dword ptr [ebp-0Ch],0
00000121  je    0000013A
00000123  cmp   dword ptr [ebp-10h],0
00000127  je    0000013A
00000129  cmp   dword ptr [ebp-14h],0
0000012d  je    0000013A
{
    Console.WriteLine("true");
0000012f  mov   ecx,dword ptr ds:[03662088h]
00000135  call  049BD3E4
}
Console.ReadLine();
0000013a  call  04F84F64
```

When the CPU executes a cmp instruction, it will try to get the operand from the CPU cache, and if it isn't there it will go to main memory and fetch it. When the data is not in the cache, we experience a cache miss. A rule of thumb is 10 cycles for data in the cache and 100 cycles for data from memory. If you are lucky (or extremely careful), all your bits will be in the same cache line – the block in which data is fetched from memory into the CPU cache. In the example above you might experience 3 cache misses. If you use, say, a byte as your bitfield, the entire state will be loaded into the cache and will require only one memory access. Modern CPUs have 32-64 byte cache lines. The assembly:

```
if (myColor == (ColorsEnum.Blue | ColorsEnum.Green))
000000f3  cmp   dword ptr [ebp-0Ch],6
000000f7  jne   00000104
{
    Console.WriteLine("true");
000000f9  mov   ecx,dword ptr ds:[032F2088h]
000000ff  call  0485D3E4
}
Console.ReadLine();
00000104  call  04E24F64
```

Not only do we have 2 instructions instead of 6, we also make only one memory access. These numbers hold for any number of bits and any combination of states the enum is in. 'Nough said – bitfields rock.

## .NET Integer Representation

When using bitwise operations, one must understand the way integral types are stored in memory. The following code generates strings for Int32 values where the MSB (most significant bit) is printed on the left:

```csharp
static string BinaryStringOfInt32(Int32 number)
{
    char[] s = new char[32];
    for (int shiftIndex = 0; shiftIndex < 32; shiftIndex++)
        s[32 - shiftIndex - 1] = (number & (1 << shiftIndex)) == 0 ? '0' : '1';
    return new string(s);
}
```

For example:

```
0                            | 00000000000000000000000000000000
1                            | 00000000000000000000000000000001
-1                           | 11111111111111111111111111111111
2                            | 00000000000000000000000000000010
-2                           | 11111111111111111111111111111110
2^6 (1 << 6)                 | 00000000000000000000000001000000
Int32.MinValue (-2147483648) | 10000000000000000000000000000000
Int32.MaxValue (2147483647)  | 01111111111111111111111111111111
```

We can see that the two's complement scheme is used. Ones' complement would have +0 and -0 values, where +0 has all bits off and -0 has all bits on. With two's complement we have the following relation for signed integral types:

```
~n = -(n+1)
```

Where ~ is the bitwise negation operator. The relation holds for MinValue and MaxValue; that's why Int32.MinValue = 1 << 31 and Int32.MaxValue = ~(1 << 31).
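The same relation can be checked quickly in Python, whose ~ behaves the same way on its arbitrary-precision integers; masking with 0xFFFFFFFF gives the 32-bit view (the helper name is mine, mirroring the C# method above):

```python
def binary_string_of_int32(number):
    """Render the 32-bit two's complement pattern of an integer, MSB first."""
    return format(number & 0xFFFFFFFF, '032b')

n = 5
assert ~n == -(n + 1)  # ~5 == -6

print(binary_string_of_int32(-1))      # 32 ones
print(binary_string_of_int32(-2))      # 31 ones followed by a 0
print(binary_string_of_int32(1 << 6))  # bit 6 set, all others off
```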

Arithmetic operations are handled entirely by the processor. Disassembly:

```
x = x - 1;
000000c7  dec   dword ptr [ebp-44h]
```

Unsigned integral types do not have the sign bit:

```csharp
static string BinaryStringOfUInt32(UInt32 number)
{
    char[] s = new char[32];
    for (int shiftIndex = 0; shiftIndex < 32; shiftIndex++)
        s[32 - shiftIndex - 1] = (number & (1 << shiftIndex)) == 0 ? '0' : '1';
    return new string(s);
}
```

For example:

BinaryStringOfUInt32(UInt32.MaxValue) is "11111111111111111111111111111111" (32 ones) and BinaryStringOfUInt32(UInt32.MinValue) is "00000000000000000000000000000000".

## Google N-Gram Downloader

Google's N-Gram Project is an amazing data source for many NLP projects. Google produced a very extensive database of n-grams from its Books project and made it available to the general public under Creative Commons. I felt that the resolution of the data was too high for my current need. Their download page specifies the file format:

File format: Each of the numbered files below is zipped tab-separated data. (Yes, we know the files have .csv extensions.) Each line has the following format:

```
ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
```

As an example, here are the 30,000,000th and 30,000,001st lines from file 0 of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):

```
circumvallate   1978   313   215   85
circumvallate   1979   183   147   77
```

The first line tells us that in 1978, the word "circumvallate" (which means "surround with a rampart or other fortification", in case you were wondering) occurred 313 times overall, on 215 distinct pages and in 85 distinct books from our sample.

The resulting data spans over 200 files, each approximately 450MB compressed and approximately 3.1GB decompressed, for 3-grams of the complete collection of English books. You can do the math. Now, I don't care about year-by-year data, and I don't care about the number of books or pages each 3-gram appeared in; I only care about the count of each 3-gram in the entire collection. Unfortunately it is not possible, for me, to download all the files, decompress them and then manipulate the data. So I have written a small Python script which handles this problem. It downloads file by file, decompresses each one and sums the data, deleting the old files before downloading the next. The only caveat is that the summed values are held in memory; you may want to update a database or make in-place changes to the summed file if you don't have enough memory for it. The resulting file has the format:

ngram TAB match_count NEWLINE

and each 3.1GB file adds another 45MB to the summed data, totaling 9GB of data for the 200 files.

I have used the 7-zip command line to extract the csv content but you may change it if you prefer to use something else.

The script (tested with Python 2.6.6):
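A minimal sketch of the script's aggregation step – summing match_count per n-gram across its year rows – might look like this (the function name is mine; the download and 7-zip extraction steps are omitted):

```python
from collections import defaultdict

def sum_ngram_counts(lines):
    """Sum match_count per n-gram from 'ngram TAB year TAB match_count ...' rows."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        ngram, match_count = fields[0], int(fields[2])
        totals[ngram] += match_count
    return totals

rows = [
    "circumvallate\t1978\t313\t215\t85",
    "circumvallate\t1979\t183\t147\t77",
]
totals = sum_ngram_counts(rows)
print(totals["circumvallate"])  # 496
```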

## Syntactic

Syntactic is a visualization project by Omer Shapira and me. *Syntactic* helps gather insights into the internal workings of a classification algorithm proposed by Alexander Clark. The algorithm classifies lexical categories such as inflections of "to be", place names, and verbs with similar syntactic use. It is completely unsupervised and achieves very good results. The input to the algorithm is a very large sentence corpus from which 3-grams are produced. Just like in k-means, the number of clusters is predefined. In the first stage it takes the k most common words and assigns a cluster to each of them. After that it plays a what-if game: for each of the unsorted words (on the left in the visualization) it calculates the distance using a mutual entropy "metric" and then assigns some words to their appropriate clusters. The general idea is that two words belong to the same cluster if their left and right contexts are similar enough – hence the 3-grams. The algorithm is iterative, and the number of iterations is not deterministic; it runs until all the words are sorted. The algorithm was implemented by Omer in Roni Katzir's seminar on linguistic learnability, which we both attended at Tel Aviv University. The Java implementation and further information are available here.

The visualization of the context distributions was the main issue. We used a canvas where each of the contexts (there are \( (k+1)^2 \) of them) is represented by a square, lit with respect to the metric. We tested this and found it unsatisfactory: the data was too sparse, and linear luminosity was too dark. First we normalized the data so that a minimum value is considered 0 – that covers most of the data – and divided all values by the maximum, giving values ranging from 0 to 1. This was the linear part. We then calculated the median value \( m \) of the non-zero values and used the function \( f(x) = x^{-1/\log_{2}{m}} \), which preserves the values 0 and 1 but is concave and equals 0.5 at the median.
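The brightness mapping described above can be sketched in a few lines of Python (the function name is mine; it assumes the median of the non-zero normalized values lies strictly between 0 and 1). Note that \( f(m) = m^{-1/\log_2 m} = 2^{-1} = 0.5 \) by construction:

```python
import math

def brightness_map(values):
    """Normalize sparse values to [0, 1], then brighten with f(x) = x^(-1/log2(m))."""
    lo, hi = min(values), max(values)
    normalized = [(v - lo) / (hi - lo) for v in values]
    nonzero = sorted(v for v in normalized if v > 0)
    m = nonzero[len(nonzero) // 2]   # median of the non-zero values, assumed in (0, 1)
    exponent = -1.0 / math.log2(m)   # positive, since log2(m) < 0
    return [v ** exponent for v in normalized]
```

The exponent is chosen so that 0 and 1 map to themselves while the median is pulled up to 0.5, brightening the sparse low end of the distribution.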

Omer designed the visualization and did a great job. I especially like the help feature – it is extremely intuitive and I have never seen a similar solution. When you click on the question mark, this is what you get:

Working with Omer was great fun. Thank you, Omer – live long and prosper.