Google N-Gram Downloader
0Google’s N-Gram Project is an amazing data source for many NLP projects. Google produced a very extensive database of n-grams from its Books project and made it available for the general public under Creative Commons. I felt that the resolution of the data is too high for my current need. In their download page they specify the file format:
File format: Each of the numbered files below is zipped tab-separated data. (Yes, we know the files have .csv extensions.) Each line has the following format:
ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINEAs an example, here are the 30,000,000th and 30,000,001st lines from file 0 of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):
circumvallate 1978 313 215 85 circumvallate 1979 183 147 77The first line tells us that in 1978, the word “circumvallate” (which means “surround with a rampart or other fortification”, in case you were wondering) occurred 313 times overall, on 215 distinct pages and in 85 distinct books from our sample.
The resulting data spans over 200 files each approximately 450MB compressed and approximately 3.1GB decompressed, for 3-grams of the complete collection of English books. You can do the math. Now, I don’t care for a year by year data and I don’t care for the number of books or pages each 3-gram was in. I only care to know the count of each 3-gram in the entire collection. Unfortunately it is not possible, for me, to download all the files and decompress them and then manipulate the data. So I have written a small python script which handles this problem. It will download file by file, decompress it and sum the data. Before downloading the next file, It will delete the old ones. The only thing is that the summarized values are held in memory, You may want to update a database or make in-place changes to the summed file if you don’t have enought memory for it. The resulting file has the format:
ngram TAB match_count NEWLINE
and each 3.1GB file adds another 45MB to the summed data. Totals in 9GB of data for 200 files.
I have used the 7-zip command line to extract the csv content but you may change it if you prefer to use something else.
Download the script: googleNGramDownloader (tested with python 2.6.6).
Syntactic
0Syntactic is a visualization project by Omer Shapira and I. Syntactic helps to gather insights on the internal workings of a classification algorithm proposed by Alexander Clark. The algorithm is used in classifying lexical categories such as inflections of to be, place names and verbs with similar syntactic use. It is completely unsupervised and achieves very good results. The input for the algorithm is a very large sentence corpus from which 3-grams are produced. Just like k-means, the amount of clusters is predefined. In the first stage it takes the k most common words and assigns a cluster for each of them. After that it plays a what-if game: for each of the unsorted words (on the left in the visualization) it calculates the distance using a mutual entropy “metric” and then assigns some words to their appropriate clusters. The general idea is that two words belong to the same cluster if their left and right contexts are similar enough – hence the 3-grams. The algorithm is iterative and the number of iterations is not deterministic. It will run until all the words are sorted. The algorithm was implement by Omer in Roni Katzir’s seminar on linguistic learnability we both attended in Tel-Aviv University. Implementation in java and further information is avaliable here.
The visualization of the context distributions is the main issue. We used a canvas where each of the contexts (there are (k+1)^2) is represented by a square which it lit with respect to the metric. We have tested the results and found it unsatisfactory since the data was too sparse, linear luminosity was too dark. First we normalized the data so a minimum value is considered as 0 – that is most of the data. We divided all the values by the maximum value so we will have an array of values ranging from 0 to 1. This was the linear part. We calculated the median value from the values which were not 0. Then we used the function
which preserve the values 0 and 1 but is concave and equals 0.5 for the median value.
Omer designed the visualization and did a great job. I especially liked the help feature, it is extremely intuitive and I have never seen a similar solution. When you click on the question mark, this is what you get:
Working with Omer was a great fun, thank you Omer, live long and prosper.
dotPeek – JetBrains’ free alternative for Red Gate’s .NET Reflector
0dotPeek has many of the basic features of .NET Reflector. When you fire it up for the first time it might take a few minutes until it warms up. It will load with the main assemblies of .NET.
Code View
The System.IO.Stream class:

On the left is the list of classes and other .NET objects group by Assemblies and then by like in Reflector. On the right you will find the disassembled code. I really like the tabbed view, it is really nice when you drill down or compare different classes.
Download Symbols
With dotPeek you can download symbols from Microsoft’s symbol server:
Find Usages
dotPeek, like .NET Reflector has a “Find Usages” feature:
At this point .NET reflector is better, it groups the results based on the type of usage which is often the what we actually want from this view while dotPeek is more dependency oriented. That one I didn’t like. This view in .NET Reflector was one of the best features and I think dotPeek did it the wrong way.
Loading Assemblies
You can choose to load assemblies by browsing or loading it from GAC: 
You can also see the file structure view on the right. I think it is a great bonus. The file structure shows the same data as in the tree on the left but it is much easier to use and it’s also automatically refreshes when you go through tabs. Nice!
Summary
One other thing dotPeek is missing is a plugin framework. .NET Reflector had some great plugins and that it is a something I definitely want to see in the future.
All in all I am very pleased with the new dotPeek and JetBrains. This is a great example for a supply and demand situation, JetBrains will definitely earn some points for that.
Keep in mind that this is an early access version and much will change and improve on the final version.
Tech Hipsters Explained : A Socratic Approach
0A few days ago I have stumbled upon a series of excellent youtube videos. The series is about NoSQL databases, Ruby, Outsourcing and other trendy things going on in web development. Nothing more to say, enjoy, or don’t.
Episode 1 – NoSQL
Episode 2 – Ruby vs. PHP
Episode 3 – A narcissistic CTO and Outsourcing
JavaScript Brain Teaser
0While debugging a certain piece of code I found a small bug which I found annoying enough to share with you as this little brain teaser.
What does the following code do?
var result = 0; function f1() { for (i = 0; i < 10; i++) { result += f2(); } alert(result); } function f2() { var count = 0; for (i = 0; i < 10; i++) { count += i; } return count; } f1();
Trivially, f2 always returns 0 + 1 +2 +3 +4 +5 + 6 + 7 + 8 + 9 = 45. f2 actually runs only once since i is defined globally. The code will show a message box with 45. After f2 runs for the first time it sets i = 10 and the next loop condition in f1 (i < 10) will fail and the loop will end. The solution is to define i locally for each function. This little thing can show up in recursion as well.
Online Function Grapher – Expression Evaluation (Part 3)
11. Online Function Grapher – Planning (Part 1)
2. Online Function Grapher – Formula Parser (Part 2)
In the previous post in this series I showed how to build a lexer/parser for a function expression using ANTLR. The parser was implemented completely in JavaScript and the result was a tree representing the function’s formula as a workable data structure. Workable in the sense that we can evaluate the function’s value at a certain point.
A quick reminder about the features supported by our function grapher:
- Functions are of one variable, x.
- Composed of the following binary operations: Multiplication (*), Subtraction (-), Addition (+), Raising to power (^).
- Negation is allowed: -.
- We allow all rational numbers and two constants: pi and e (Euler constant).
- It is possible to use special functions like: cos, sin and log.
- Parenthesis are allowed to bound a certain expression in order to break precedence order.
The parsing made two things easier for us in this stage:
- Parenthesis are not a part of the tree but the expression is parsed with respect to them.
- Precedence order is inherently built into the structure so we don’t have to worry about that either.
As an object a tree node has the following structure (with sample values)
{ text: "5", type: 11, typeName: "INTEGER", children //tree node array }
We have a string and a type and the output of the evaluator is a number. We will write an evaluator for each node type. The types and their possible values are
| ID | NAME | |
|---|---|---|
| 4 | NEGATE | ‘-’ |
| 5 | PLUS | ‘+’ |
| 6 | MINUS | ‘-’ |
| 7 | MULT | ‘*’ |
| 8 | DIV | ‘/’ |
| 9 | POW | ‘^’ |
| 10 | INTEGER | ’5′, ’893298′, ’12′, ‘-14′,… |
| 11 | FLOAT | ’4.1244′, ’12.0004′,… |
| 12 | VARIABLE | ‘x’ |
| 13 | CONST | ‘e’, ‘pi’ |
| 14 | IDENT | ‘sin’, ‘cos’, ‘tan’, ‘cot’, ‘asin’, ‘acos’, ‘atan’, ‘acot’,
‘log’, ‘ln’, ‘sqrt’, ‘max’, ‘min’, ‘abs’, ‘ceil’, ‘floor’ |
The end result of this post should enable the following code:
N = 300 domainStart = -5; domainEnd = 5; dataSet = new Array(); //Create data set delta = Math.abs(domainEnd - domainStart) / (N - 1); for (n = 0; n < N; n++) { //Calculate evaluation point x0 = domainStart + n * delta; //Evaluate tree y0 = evaluateNode(topNode, x0); //Add to data set dataSet.push({ x: x0, y: y0 }); }
The code above evaluates the function at N points with equal steps over a domain (In the above example, between -5 and 5 in 300 points).
Now with the final target code in mind we can implement evaluateNode. We named it evaluateNode because this function is going to be recursive. A tree is a top node, topNode has the same structure as its children.
We need a different evaluation function for each node type. We will create a map of functions with the type ID (from the table above) as key. The function parameters are identical to evaluateNode’s. The function map:
funcEvalMap = new Array(); //Negate funcEvalMap[4] = function (node, x) { //Expecting exactly one child return -evaluateNode(node.children[0], x); }; //Plus funcEvalMap[5] = function (node, x) { //Expecting exactly two children return evaluateNode(node.children[0], x) + evaluateNode(node.children[1], x); }; //Minus funcEvalMap[6] = function (node, x) { //Expecting exactly two children return evaluateNode(node.children[0], x) - evaluateNode(node.children[1], x); }; ... //Integer funcEvalMap[11] = function (node, x) { //Expecting no children return parseInt(node.text); }; //Float funcEvalMap[12] = function (node, x) { //Expecting no children return parseFloat(node.text); }; //Variable funcEvalMap[13] = function (node, x) { return x; }; //Const funcEvalMap[14] = function (node, x) { switch(node.text) { case "pi": return Math.PI; case "e": return Math.E; } }; //Ident (Special Functions) funcEvalMap[15] = function (node, x) { switch(node.text) { case "sin": return Math.sin(evaluateNode(node.children[0], x)); case "cos": return Math.cos(evaluateNode(node.children[0], x)); ... case "max": //Calculate value for each parameter for (i in node.children) node.children[i].value = evaluateNode(node.children[i], x); //Create an array vals = new Array(); for (j in node.children) vals[j] = node.children[j].value; return Math.max.apply(Math, vals); ... } }; function evaluateNode(node, x) { return funcEvalMap[node.type] (node, x); }
While it seems that we are done, we are not. Consider the function , there are two possible implementations:
//First Math.Pow(Math.E, x); //Second Math.Exp(x);
Both supposedly returning same result. I am skeptic. Time to write a small tester. Same goes for cot. I have also written a small tester for that one here. We get different numeric result for identical analytic representations. However, the differences are so small that for the purspose of rendering we will use the most convenient one to code. We will stick with the Math.Pow(Math.E, x) and not use a special case for e as the base and. cot is not a part of the Math object so we can use the more accurate implementation of the variety in the tester, cos(x)/sin(x).
Another function not implemented in the Math object is acot - the inverse of cot. We will use the identity . log and ln are going to be the same, the natural logarithm (base e)
You can play and get the full source code and tester here.
That’s it. We are done and ready to continue to the next step, rendering.
Resources
W3 Schools Math Object Reference
Online Function Grapher – Formula Parser (Part 2)
5This is the second post in this series about building a function grapher. In the first post I described the anatomy and stages of building. This part will provide a solution for parsing the formula in which a function is described. For example:
or
These examples were rendered using a rendering engine. The first one is represented by: sin(5x)*7^x and the second is represented by: e^{cos(x^{sin(x^2)})}. These strings are sent to a LaTeX rendering engine and returned as a gif image. The function grapher job is not that different. It takes a string, converts it into a workable data structure, the expression is evaluated for some points and the set of values is rendered on a canvas element. So instead of rendering the math symbols we evaluate the expression.
Before diving into the implementation, lets define what formulas we are supporting:
- Functions are of one variable, x.
- Composed of the following binary operations: Multiplication (*), Subtraction (-), Addition (+), Raising to power (^).
- Negation is allowed: -.
- We allow all rational numbers and two constants: pi and e (Euler constant).
- It is possible to use special functions like: cos, sin and log.
- Parenthesis are allowed to bound a certain expression in order to break precedence order.
Parsing
As required in the first post, the parser should be implemented to use only client-side resources. I will use a LALR. The acronym translates to Look Ahead Left to Right which means that the input is parsed from left to right and looks ahead in order to understand what token it is parsing. I started by checking out JS/CC library. The library accepts a grammar in Bachus-Naur form styled syntax which is really intuitive and easy to learn. Manual and syntax specification are detailed in the library’s documentation. After playing around with it a bit I found its API is not mature enough and eventually decided to continue with ANTLR library which is much more mature. ANTLR did not support JavaScript output up until version 3.1 and currently in version 3.3 it does the job, although the JavaScript output is not fully supported in the UI. Like JS/CC it takes a ENBF grammar as an input and outputs a JavaScript lexer/parser code. The API is very good and based on the Java implementation. There are many languages for which ANTLR can generate lexer/parser code to, consult its website for current supported languages. ANTLR also ships with a Java based UI enviorment for writing and testing the grammars. This is not a post about ANTLR and if you wish to learn more about it you can consult its excellent documentation.
After some googling I came across the following http://www.codeproject.com/KB/recipes/sota_expression_evaluator.aspx. The article presents a grammar which is very close to what a function grapher needs. In addition to numeric math expressions the grammar in the article also supports equalities, inequalities, logical expression and some scripting. I have modified the grammar to suit the needs of a function grapher. Here it is:
grammar Formula;
options
{
language=JavaScript;
output=AST;
ASTLabelType=CommonTree;
}
tokens
{
NEGATE;
}
expression
: additiveExpression EOF!
;
additiveExpression
: multiplicativeExpression ( (PLUS|MINUS)^ multiplicativeExpression )*
;
PLUS : '+';
MINUS : '-';
multiplicativeExpression
: powerExpression ( (MULT|DIV)^ powerExpression )*
;
MULT : '*';
DIV : '/';
powerExpression
: unaryExpression ( POW^ unaryExpression )*
;
POW : '^';
unaryExpression
: primaryExpression
| MINUS primaryExpression -> ^(NEGATE primaryExpression)
;
primaryExpression
: '('! additiveExpression ')'!
| value
;
value
: INTEGER
| FLOAT
| VARIABLE
| CONST
| function
;
VARIABLE
: 'x'
;
CONST : 'pi'
| 'e'
;
INTEGER
: ('0'..'9')+
;
FLOAT
: ('0'..'9')* '.' ('0'..'9')+
;
function
: IDENT '(' ( additiveExpression (',' additiveExpression)* )? ')' -> ^(IDENT additiveExpression*)
;
IDENT
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
WS
: (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;}
;Lets take a step back and break the grammar into smaller pieces. First we have the grammar declaration:
grammar Formula;
This tells ANTLT that the grammar’s name is Formula (the file name must be Formula.g). Next we have the options section:
options
{
language=JavaScript;
output=AST;
ASTLabelType=CommonTree;
}Under options we define that the output language should be javascript and that the output of the parser is AST = Abstract Syntax Tree. It means that we want the output to be a tree. It will be easier to evaluate the function using a tree structure.
The grammar has two parts that are merged in this grammar file: lexicon and rules. The lexer uses the lexicon and the parser uses the rules. The lexer takes the input string and parse it to lexemes (terminals in a grammar language) e.g. INTEGER, FLOAT, MINUS. For example:
VARIABLE
: 'x'
;
CONST : 'pi'
| 'e'
;
INTEGER
: ('0'..'9')+
;All lexemes names are in upper case and rule definitions are in lower case. After the lexer creates a lexemes stream from the input it passes the stream to the parser which builds the syntax tree. This process will be more clear once we get to the code.
ANTLR must have a start rule. It is defined by a template which ends with EOF which is a built in lexeme used to denote the end of the input.
expression : additiveExpression EOF! ;
We also don’t want the EOF to be a node in our final tree so we exlude it by placing a ! symbol after it. We can also rewrite rules like we did for the unary expression:
unaryExpression
: primaryExpression
| MINUS primaryExpression -> ^(NEGATE primaryExpression)
;An unary expression has two options, it is either a primaryExpression or a negated primaryExpression. If it is the second we don’t want to have MINUS in the tree, this will be reserved for a binary operation. So we can rewrite the rule using the above syntax. Note that NEGATE has no lexeme definition, it is declared it the tokens { … } section at beginning of the file. The rest of the grammar is pretty straight forward.
After generating code for the grammar we get two files FormulaParser.js and FormulaLexer.js. We need them both. I found a small bug in the javascript code generator. When generating code with ANTLRWorks (the UI for ANTLR) for this grammar we have two lines which are not necessary and raises an error.
set7=input.LT(1); set7=this.input.LT(1);
The first line is reduntent and we get an error saying that “input is undefined” – delete the first line. There is another line just like that with 4 instead of 7 – delete that line also. Other than that I did not find any problem with the code.
User Code
Finally we get to the user code. In order to use our grammar the following should be included:
- antlr3-all-min.js – The runtime framework for ANTLR javascript output. You can download it from http://code.google.com/p/antlr-javascript/.
- FormulaLexer.js – The generated lexer of our grammar.
- FormulaParser.js – The generated parser of our grammar.
Once everything is setup we can write the following code:
var input = "5*x+7"; var cstream = new org.antlr.runtime.ANTLRStringStream(bla); var lexer = new FormulaLexer(cstream); var tstream = new org.antlr.runtime.CommonTokenStream(lexer); var parser = new FormulaParser(tstream); var foo = parser.expression();
First we initialize a stream from the input. Passing it on to the lexer and from there to the parser. Nothing special about it. In order to parse the input we call to parser.expression which is the start rule for our grammar. Each rule has a corresponding function, so we can use only a part of the grammar each time. Now foo is the parsed tree object. For the purposes of the function grapher, I am converting the librarie’s structure to a much simpler one:
function convertToObject(node, parser) { var current = { text: node.getToken().getText(), type: node.getToken().getType(), typeName: parser.getTokenNames()[node.getToken().getType()], children: new Array() }; //Add children if (node.getChildCount() > 0) { var children = node.getChildren(); for (i in children) { current.children.push(convertToObject(children[i], parser)); } } return current; }
Using that structure I have created a test page for the grammar using prettyPrint library which visualize javascript/JSON objects with an interactive interface. You can play here.
Resources
Online Function Grapher – Planning (Part 1)
0In this series I will provide a step-by-step guide for building a function grapher. As always, it is best to start from the requirements. We want our function grapher:
- To have a function input of some sort
- To draw and view the function graph
- To use only client-side resources.
The third requirement is obviously not mandatory, but I find that the coding will be much cooler and challenging than using server-side resources for the first two.
For function formula parsing input we will use an LALR parser. Plotting the graph is traditionally accomplished by using java, flash or server-side image generation. We will use the HTML5 canvas element.
In each subsequent post in this series I will provide detailed requirements for each part. We will review the following in this series:
- Formula parsing
- Expression evaluation
- Rendering the graph
- User Interface
- Putting it all together
Stay tuned
HTML Embedded Images
4Page load time has a huge impact on user experience. Search engines also begin to consider page load time as a factor when evaluating the quality of a page. There are many ways to improve load time when the two most prominent and effective methods are
- Reduce resources (html, js, images, css) size. This is accomplished by many different methods depending on the resource type and the specific context.
- Making fewer requests.
We are always told to separate code and html, put css in a different file and avoid using inline styles – and that is a good idea. Making fewer requests is achieved by combining resources of the same type. We put all our css in one file (or several but carefully), we consolidate our script files and minifying them.
In this post I will review a method that violates the first directive and satisfying the second in a non traditional way. When embedding images inside the html the actual image data is downloaded as a string with the html. If used correctly, this method can speed the page load time significantly. Embedding images to html looks like that:
<img src="data:image/gif;base64,[data]" alt="" />
The [data] should be replaced with base64 encoded gif file but you can use any other supported format.
Concrete Example and its html:
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAkAAAAKCAAAAABDbitiAAAACXZwQWcAAAAJAAAACgDIMgRLAAAAL0lEQVQI122MsQ0AMAyDULpE6v/30iHOVk8MYFSuKuqBkEDIBvnNFaeoFNS+jPIA0nwx6TCiCtAAAAAASUVORK5CYII=" alt="" />
When the browser renders an image element that has embedded data in its src attribute it will not issue a request to the server. Instead it will render the image embedded in the string thus preventing the additional request-response to server. This is a huge performance gain. However embedding images in that way has its downsides:
- Base64 encoding is about 30 percent longer than the original and the html carries that burden.
- The page does not begin to render until the html file is fully downloaded so embedding many or large images in the page may result in a much slower loading.
- SEO – If SEO is important this is method should be used very carefully. The code-text ratio is worse when embedding images and the whole page size grows.
- The browser cache does not cache the image locally, the image will be downloaded with the page every time it is rendered
- Serving static content is easier for the server. When embedding static content, it might be a good idea to cache the Base64 form in memory or alternatively embed the encoded image statically.
If you want to use embedded images but the html gets too dirty, a possible workaround is to put the encoded images in a separate JavaScript file.
The HTML
<img id="img1" src="#" alt="" />
The Javascript (using jQuery)
img1data = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAkAAAAKCAAAAABDbitiAAAACXZwQWcAAAAJAAAACgDIMgRLAAAAL0lEQVQI122MsQ0AMAyDULpE6v/30iHOVk8MYFSuKuqBkEDIBvnNFaeoFNS+jPIA0nwx6TCiCtAAAAAASUVORK5CYII=';
$(document).ready(function()
{
$("#img1").attr("src", img1data);
});
After the page loads, the javascript will replace the src tag in the img element.
You should consider consolidate the encoded images into several files based on usage and function so you won’t have too many requests. Separating the html from the image and is not exacltly what we were aiming for, after all, we wanted make fewer requests. Consolidating it may remind you of CSS Sprites while making the response size larger (x1.3 larger). These arguments are correct but there are cases in which is does makes sense to do it.
Another workaround which makes more sense in some cases is to embed the image in the css. For example, you may have a span which is used as a place holder for a small icon:
.arrow { background: transparent url(data:image/gif;base64, [data]); }
the url and the src attribute have the same syntax.
Note that in IE the length of the content of the src attribute is limited to 32K.
when to use?
Scenario A – Generated Images
The classic example of a generated image is a CAPTCHA image. The image only used once and is completely random. There is no reason to make an additional request to the server since caching is never an issue and the images are used only once in the page. You may have other generated images for numbers, random or user generated content.
If you have more than say 10 generated images you might want to consider using a separate css or javascript file. This is where css sprites might become a bit messy. Generating sprite images and then rendering the page accordingly can be quite annoying to code. Maybe I will write something in the future about it. Using css or javascript to accomplish that with embedded images inside is much simpler.
Scenario B – Tiny Images
Just like the little arrow above which weighs 125 bytes, it might be worthwhile to embed the image in html or better yet in a separate css file. I haven’t checked it but a nice pattern might be using css sprites when embedding the image in the css. You would have one class with the background image and other classes only changing background-position property.
Scenario C – AJAX
Ajax is all about responsiveness. We use ajax mostly because of performance, we don’t want to load the whole page just because we want to change a small thing. The simple case is when we make an ajax call and the response is html. The html is placed in its proper place using javascript and only then the images the html reference begin to load. Excluding large images, the returned html graphics should be completely embedded if images are not in common use. If the images are common they should be included in a css. If JSON is returned, the encoded image can be embedded in the message.
Resources
http://en.wikipedia.org/wiki/Data_URI_scheme – Wiki article about the schema used to embed the data.
http://www.motobit.com/util/base64-decoder-encoder.asp – Convert files and text to Base64 online.




