Ask HN: What kind of index is Google using here? (google.com)
116 points by pc on July 7, 2009 | hide | past | favorite | 45 comments


AFAIK, Google Code Search was written by Russ Cox, who also wrote this: http://swtch.com/~rsc/regexp/regexp1.html

My guess, back when I was doing code search at Apple, was that he was actually scanning all the docs with his fast regexp, which would be do-able with the number of machines at his disposal + the expected traffic.

If anyone knows any differently, I'm all ears. :)


If Russell Cox wrote it, it must be good. I programmed competitively with him at USACO and in college (we got to 8th best college team in the world), and he was a consistent ass-kicker. Hardcore and disciplined and fast.


Other (open source) projects by Russ people might find interesting:

* Plan 9 from User Space - http://plan9.us

* vx32/9vx - http://pdos.csail.mit.edu/~baford/vm/

* libtask - http://swtch.com/libtask/

(Plus quite a few bits of Plan 9 itself, obviously.)

Edit: forgot this really cool storage system: http://doc.cat-v.org/plan_9/misc/foundation/


That searches the text directly, without building an index. While it's fast, you still have search time linear in the size of your searchable text.

If you used a suffix tree as an index, you'd get time complexity strongly sublinear in text length.

In summary, if the regex matching complexity dwarfs the corpus size, you want Thompson NFA. If your corpus size dwarfs the regex complexity, you want suffix tree indexes. It's not clear to me if the two approaches can be combined, but if they are, I'd like to know.
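
To make the sublinear claim concrete, here's a toy sketch (my own, assuming nothing about Google's implementation): a suffix array answers literal-substring queries with O(m log n) comparisons, independent of where in the text the matches fall.

```python
# Toy suffix-array lookup: binary search instead of a linear scan.
# (Illustrative only; a real index would use a compressed structure.)
def build_suffix_array(text):
    # Naive O(n^2 log n) construction; fine for a demo.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    m = len(pattern)
    # Binary search for the first suffix whose m-char prefix >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # Collect the contiguous run of suffixes that start with pattern.
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

text = "banana"
sa = build_suffix_array(text)
print(find(text, sa, "ana"))  # [1, 3]
```

Each query touches O(log n) suffixes, so it never reads the whole corpus the way a scan does.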


If anyone knows any differently, I doubt they'd be able to tell you. ;-)


I agree that Google probably just throws HW at the problem, but you could also build something custom if it really needed to scale. Just record what people look for and keep an index for the common chunks of logic.

So, if people often search for someone's name or a date, you keep a list of what satisfies that pattern so you can quickly eliminate large numbers of documents without doing a full-text search. You could also reduce each document to a list of tokens and search that list before the file itself, etc.


Reverse Aho-Corasick matching.

Basically, you store all of your searchable text in a suffix tree. You then do a depth-first search of the tree for matching strings.

http://www.cs.brandeis.edu/~mfc/cs120/papers/ptrees.pdf

http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

http://en.wikipedia.org/wiki/Suffix_tree


How does that help here? My query doesn't contain any directly-searchable strings, unless you start abstracting things by treating all alpha chars as the same, etc.


A nice thing about regular expressions is that you always know where you are in the expression (no backtracking, at least in theory).

Say your query is \d+(th|nd|rd), and your searchable text is "The 3rd planet from the sun is 20000000 miles away." Your tree search will hit 2, then fail on 20, then 3, then 3r, then 3rd, and then terminate successfully. So we followed 5 pointers, and tested another 20 or so for nullity.

This was independent of the length of the searchable text (but not the number of found matches). It would have performed equally well on "The 3rd planet from the sun is 20000000 miles away. Lorem ipsum (insert 10mb of random Latin here)".
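
The walk described above can be sketched in code. This is my own toy version (a hand-compiled NFA for \d+(th|nd|rd) driving a DFS over a suffix trie), not anything Google published:

```python
# Build a trie containing every suffix of the text.
def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

# Hand-compiled NFA transitions for \d+(th|nd|rd).
def step(states, ch):
    out = set()
    for s in states:
        if s in ('start', 'digits') and ch.isdigit():
            out.add('digits')
        if s == 'digits':
            if ch == 't': out.add('t')
            if ch == 'n': out.add('n')
            if ch == 'r': out.add('r')
        if s == 't' and ch == 'h':
            out.add('accept')
        if s in ('n', 'r') and ch == 'd':
            out.add('accept')
    return out

# DFS over the trie, pruning branches where the NFA dies.
def search(trie, states, prefix, matches):
    for ch, child in trie.items():
        nxt = step(states, ch)
        if 'accept' in nxt:
            matches.add(prefix + ch)
        if nxt - {'accept'}:
            search(child, nxt - {'accept'}, prefix + ch, matches)

text = "The 3rd planet from the sun is 20000000 miles away."
matches = set()
search(build_suffix_trie(text), {'start'}, '', matches)
print(matches)  # {'3rd'}
```

The DFS only visits trie paths where some NFA state survives, so "20000000 " dies as soon as the space follows the digits, just like the walkthrough above.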


How much RAM do you think it'd take to store all the world's source code in a suffix tree, and how would you partition it across multiple machines?


Well, there's always suffix arrays, so two 64-bit pointers (plus the original char) per char?

Assuming their claim of 116,000,000 matches for /./, and 5k mean document size, 70TB, uncompressed.

Papers like "Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching" [1] lead me to believe that the state of the art could be as low as 4 bits/char, in which case the answer would be a much more tractable 270GB.

As far as partitioning goes, now I'm getting really out of my league. At a first pass, I imagine that you could keep the root and neighbors on a primary server, and as big of subtrees as you can fit in RAM on a bunch of slaves. The first couple chars get processed in the master, and then however many subtrees are alive get delegated in parallel to the secondary servers. The secondary servers return the answers to the primary, which returns to the client.

I feel like I'm answering hypothetical questions at a job interview :p

[1] http://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf


The math doesn't work. 116,000,000 matches x 5 kB is 580 GB. It fits on one big disk, and with 5800 machines holding 100 MB of plain text each you could search all of it naively in real time. (Grep searches 100 MB in 50 ms.)

5800 machines is practical for a real product. Hardware probably costs Google $40/month/CPU, so $230k/month, about equal to 10 engineers. So if a more clever solution needs a whole team to maintain it, then it's probably better to just use grep.
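
Redoing the back-of-envelope arithmetic with the figures above (nothing new assumed):

```python
# Cost check using the comment's own figures.
docs = 116_000_000          # claimed match count for /./
doc_size = 5_000            # assumed 5 kB mean document size
corpus_bytes = docs * doc_size
machines = corpus_bytes // (100 * 10**6)   # 100 MB of plain text per machine
monthly_cost = machines * 40               # $40 / month / CPU
print(corpus_bytes, machines, monthly_cost)  # 580000000000 5800 232000
```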


10 engineers to program one data structure? I'd be greatly surprised if 1 PhD and 2 engineers couldn't easily do the additional work beyond distributed grep in less than six months, with maybe one engineer to maintain it. That saves ~$200k/mo -- more if you go over the 20 queries/second it takes to max out the 5800-CPU limit.


If I understand correctly (a big if) and the underlying structure is something along the lines of a giant suffix tree, then searching for any particular string takes time asymptotically proportional to the length of the query string. What if there were a corresponding "metadata" suffix tree? Perhaps that would allow fast abstract search: the nodes could contain metadata such as length/type instead of merely containing the data itself.


Neat! I hadn't seen that method before. Time to get reading. :)


Is your question "How do you create an index that allows regexes?"

In that case, I'd like to know the answer to the question, as well.


Perl has a "study" function that examines a string in anticipation of many pattern matches on it. I'm not sure exactly what it does, but the kind of structures it creates could be serialized and stored along with an input file (or perhaps replace it), and used as a kind of regular-expression-optimized index. It isn't necessarily an "index" in the normal sense, but then neither is a full-text index. Really, any data structure optimized for a specific kind of processing is an index.


study does not do anything really interesting. It looks up characters in a table of character -> average frequency, then scans for the rarest character of the pattern first, so that it can fail quickly for most input.

From the docs:

(The way "study" works is this: a linked list of every character in the string to be searched is made, so we know, for example, where all the 'k' characters are. From each search string, the rarest character is selected, based on some static frequency tables constructed from some C programs and English text. Only those places that contain this "rarest" character are examined.)
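
As a toy model of that strategy (my own sketch, with a made-up frequency table, not Perl's actual implementation):

```python
# Scan only the positions of the needle's rarest character, then verify
# the full needle there. The frequency table here is invented for the demo.
def find_all(haystack, needle, freq):
    # Pick the needle character with the lowest expected frequency.
    rare = min(needle, key=lambda c: freq.get(c, 0.0))
    offset = needle.index(rare)
    hits = []
    for i, ch in enumerate(haystack):
        if ch == rare:
            start = i - offset
            if start >= 0 and haystack[start:start + len(needle)] == needle:
                hits.append(start)
    return hits

freq = {'e': 12.7, 't': 9.1, 'a': 8.2, 'u': 2.8, 'q': 0.1, 'z': 0.07}
print(find_all("the quick quartz", "qu", freq))  # [4, 10]
```

Only the 'q' positions are examined; every other character is skipped without a comparison against the needle.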

Basically, this is not what Google is doing for code search.


study doesn't do anything really interesting, but neither is creating a keyword index for a keyword-based search all that interesting.

Google Code Search is not intended to search English (their regular search technology would most likely suffice for that) but to search with regular expressions, so I agree that the naive study implementation you've quoted is, basically, not what Google is using for code search. However, there's no reason the frequency tables need to be static, nor constructed solely from C programs and English text. Frequency tables can be continuously updated from the input corpus, and the input corpus should be the code that is being searched. Part of search is ruling out significant portions of the search space in the name of getting positive results as fast as possible, so being able to fail quickly for most inputs matters (considering the example regular expression only returns three hits, it seems the wide majority don't match). A study-like implementation would appear to be a possibility.


This is off-topic, but I just realized Google Code Search is great for seeing where code you open source ends up. Came across a few things I had no idea were using my code. Kinda cool.


Tom, did you know there are extra filters you can use to search Github.com?

Try "tlrobinson package:github.com lang:javascript"


This presentation talks about using n-grams to make a regex-searchable index:

http://www.informatik.hu-berlin.de/forschung/gebiete/wbi/tea...


No complexity analysis. Random student results.

It's viable to build an index with n-grams, but I don't think you'd get any benefit over suffix-tree search. There might be a way to select n-grams cleverly some of the time, but you can also easily run into pathological cases.
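
For what it's worth, here's a minimal sketch of the n-gram idea (a toy trigram index; documents and query handling are invented): record which documents contain each trigram, and for a query with a literal substring, intersect posting lists to get candidates before running the real regex on them.

```python
from collections import defaultdict

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    # Posting lists: trigram -> set of doc ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for t in trigrams(text):
            index[t].add(doc_id)
    return index

def candidates(index, literal, docs):
    # Any doc containing `literal` must contain all of its trigrams.
    tris = trigrams(literal)
    if not tris:
        return set(docs)  # literal too short to narrow anything
    return set.intersection(*(index.get(t, set()) for t in tris))

docs = {1: "int main(void)", 2: "def main():", 3: "print('hi')"}
idx = build_index(docs)
print(sorted(candidates(idx, "main", docs)))  # [1, 2]
```

The pathological cases mentioned above show up when the query has no usable literal (e.g. a pure character-class regex), where this filter degenerates to "all documents".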


IIRC, this was a rather high-compute-cost project inside of Google, far more so than other search projects. It was obviously a favorite of the eng decision makers, though, so it was well cared for resource-wise.


I'd (probably) consider building and searching an index of ASTs (http://en.wikipedia.org/wiki/Abstract_syntax_tree).

Not only would it allow searching with regexp, but it could allow more sophisticated search queries like "Search all my Ruby code for blocks that rescue StandardError and are not followed by an ensure clause"


But you'd have to do that separately for every language.

Edit: the OP is such a great question. I've been saving up the joke of commenting "How is this Hacker News?" on the most literally appropriate HN post I run across... and almost used it on this one. Sadly, I probably never will.


> But you'd have to do that separately for every language.

That's correct. It's one of the challenging parts.


It's one of those ideas that would be best tried out on one instance first. It's not obvious that the ASTs of programs in different languages would be "A" enough for a generalized querying system to be useful. Still, I don't mean to shoot the idea down - it would make for an interesting experiment.


Something like this?

- Regular expression searching on compressed text, http://portal.acm.org/citation.cfm?id=967612


The cost of that algorithm still grows linearly with respect to the length of the compressed text. It's possible Google is doing something like this (searching over the entire corpus until they find a match), but is there any work on precomputing an actual "index" to speed up this type of search?


I'm certain there are many approaches to be found via citation-following.

But consider that you can use the linked algorithm, and a bit of trickiness, to create an index already:

- Step (a): prepare another version of the document where every digit is replaced with 0, every word char with 'a', etc.

- Step (b): note how much better it compresses. :)

- Step (c): Perform a similar 'weakening' on the search regex.

- Step (d): Done! Not an exact index, but it will allow you to quickly eliminate a large number of documents.

You can also combine this approach with text-based ones for the cases where there are parts of the search regex that contain literal text. (This is almost certainly the most common use case).

Or you could make it hierarchical if you need to by creating several watered-down texts for each input with different levels of substitution.
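
A minimal sketch of steps (a)-(d) (my own code; weaken_pattern handles only a tiny regex subset, just enough for the demo):

```python
import re

# 'Weaken' text: every digit becomes '0', every letter becomes 'a'.
def weaken_text(text):
    return re.sub(r'[a-zA-Z]', 'a', re.sub(r'\d', '0', text))

# Apply the same weakening to the pattern (toy subset: \d and [a-z] only).
def weaken_pattern(pattern):
    p = pattern.replace(r'\d', '0')
    return re.sub(r'\[a-z\]', 'a', p)

docs = ["call 555-1234 now", "no numbers here"]
query = r'\d\d\d-\d\d\d\d'
weak_q = weaken_pattern(query)  # '000-0000'
survivors = [d for d in docs if re.search(weak_q, weaken_text(d))]
print(survivors)  # ['call 555-1234 now']
```

The weakened corpus compresses far better and the weakened match is a cheap necessary condition: documents it rejects can never match the real regex, and survivors get the full regex run on them.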

---------------------------------------------------

Since this is the sort of thing that's fun to think about, here's a whole different approach:

- Break each document apart into code lines.

- Create a hierarchical tree: leaves are the code snippets (with a pointer back to source doc), each node represents a clustering-by-text-similarity of all its children. (Appropriate distance measure left as an exercise for the reader. :) )

- Let's say that for two regular expressions, RA & RB, RA < RB iff any text matched by RA will also be matched by RB.

- At each node, create a regular expression R so that for each child regex C, C < R. (The leaves become a regex that simply matches the source line literally.) Along the way, make a best effort to restrict the length of R by introducing generalizations (R can't simply be Ca|Cb|Cc...) without making it so loose that it matches everything.

- Then, searching for a regex S can be done recursively: For each child regex R of the current node, if you can construct a new regex T such that T < S and T < R, descend into that node.

n.b. I typed that out in a hurry and without being rigorous... so it's probably not quite right, but an interesting path to think about exploring. Also: testing if one regex is less than another (as defined above) is actually fairly straightforward, even though it seems hard.

(And a quick edit to add: the last step is definitely wrong. Unfortunately I need to hop onto a conference call, so it'll have to wait until the indefinite future.)

(Edit to that: I think I fixed it for now. Shorter version: if the intersection of S and R is nonempty, descend.)
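
Testing whether the intersection of two regular languages is nonempty is indeed straightforward in principle: run a reachability search over the product of the two automata. A toy sketch with hand-built DFAs (not compiled from regex syntax):

```python
from collections import deque

def intersects(dfa_a, dfa_b, alphabet):
    # Each DFA is (start, accepting_states, transitions{(state, ch): state}).
    (sa, fa, ta), (sb, fb, tb) = dfa_a, dfa_b
    seen = {(sa, sb)}
    queue = deque([(sa, sb)])
    while queue:
        a, b = queue.popleft()
        if a in fa and b in fb:
            return True  # reached a state pair accepting in both
        for ch in alphabet:
            na, nb = ta.get((a, ch)), tb.get((b, ch))
            if na is not None and nb is not None and (na, nb) not in seen:
                seen.add((na, nb))
                queue.append((na, nb))
    return False

# a*b vs ab*: the languages share exactly "ab", so they intersect.
a_star_b = (0, {1}, {(0, 'a'): 0, (0, 'b'): 1})
a_b_star = (0, {1}, {(0, 'a'): 1, (1, 'b'): 1})
print(intersects(a_star_b, a_b_star, 'ab'))  # True

# a* vs b+ share nothing, so the search never reaches a doubly-accepting pair.
a_star = (0, {0}, {(0, 'a'): 0})
b_plus = (0, {1}, {(0, 'b'): 1, (1, 'b'): 1})
print(intersects(a_star, b_plus, 'ab'))  # False
```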


Replying to myself, since editing was getting out of control.

A neat property of the 2nd approach is that if you ever reach a node R where R < S you're done: all children will be matched.

A less neat property is that it won't allow multi-line matches, but I'm not sure those are allowed anyway.


It might be possible to adapt a regex state machine to use an AVL TDAWG ( http://www.ssp.isee.kyushu-u.ac.jp/~inenaga/papers/tcs-terna... )


I am 100% postulating here, but this is what seems most likely to me:

Google is known for MapReduce, and it seems like regex searches would be manageable when cast in that model. All Google would have to do is take whatever key system they use for Bigtable and use MapReduce to expand the regex over the keys, then return the data associated with those keys.


..what?

I would love to see you try to explain how you think this works :)


to quote http://infolab.stanford.edu/~backrub/google.html

Indexing Documents into Barrels -- After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordID's, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel and then the small log file of extra words can be processed by one final indexer.

So we have a hash table of words/searches -> results

Because we have this, we can regex search the keys, and end up with the result pages.

We do have other options, though. It could be:

a) magic, or b) a trade secret (think PigeonRank :-) )


Quote is unrelated to what you are saying!

> use MapReduce to expand the regex over the keys

> So we have a hash table of words/searches -> results

> Because we have this, we can regex search the keys, and end up with the result pages.

...

Looking forward to further explanations!


Imagine the following code:

money = 5;

It could be tokenized and mapped to a few regex keywords, stored as key-value pairs in a datastore. Since Google is big on MapReduce, which can be seen as computation on sets:

[a-z] = (DOCID_14566, DOCID_15999, DOCID_888)

\d{5} = (DOCID_15, DOCID_15999, DOCID_552)

Then when someone searches for a "number with five digits", the value for the \d{5} key is retrieved. It lists the documents where the search appears, and this set/list is combined with the other sets/lists that match other search keywords, and you have your results.

On a normal web search, the document is split into words and you make these words into keys, so you know where they appear (the keys' values), right?

Well, if you think that the user will type a regex expression "\d{5}", then you can map it to documents too.

It's like pre-processing the possible regex that would match a document token, and searching these regexes instead of words like a normal web search.

It's "just" an extra step: splitting words (tokens) from a document, and assigning possible regexes to those tokens.

Not saying Google does it :p But I guess that's what the OP meant...

Edited: before anyone notices how inefficient my example is: I wouldn't implement every possible \d as a key, or [a-b], [a-c], [a-d], etc. as other keys. This is just an example of how sets of precomputed regexes (of any depth) can be (ab)used.
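
A toy version of the precomputed-regex-bucket idea (bucket patterns and documents invented for illustration):

```python
import re
from collections import defaultdict

# Coarse regex 'buckets' classified at index time (patterns invented for the demo).
BUCKETS = {
    r'^[a-z]+$': 'lower_word',
    r'^\d{5}$': 'five_digits',
    r'^\d+$': 'number',
}

def index_docs(docs):
    # Map bucket name -> set of docs containing a token matching that bucket.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for pat, bucket in BUCKETS.items():
                if re.match(pat, token):
                    index[bucket].add(doc_id)
    return index

docs = {14566: "money = 5", 15999: "zipcode 90210", 888: "hello world"}
idx = index_docs(docs)
print(sorted(idx['five_digits']))  # [15999]
```

A query regex that reduces to one of the precomputed buckets becomes a plain key lookup; anything outside the bucket set still needs a fallback scan, which is the weakness noted above.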


This makes sense and it could work :) I imagine you would have to spend time tuning the tokens you map out. I think this would break down pretty quickly with moderate or complicated regexes, but those might not be the majority of searches.

Is there any trickery you could do with the inner nodes on the b+trees? I spent a few minutes thinking about tokens mapped out and intermediate reductions but didn't come up with anything.


Hey, it's all postulation :)


how would a regexp match across keys in this scenario?


Easynews also has a global search where you can search through all the newsgroup posts with a regular expression (> 150 days).

You might ask them? They are certainly not as secretive as Google is ;)


Is the global search really a regex? I thought it was just substrings.


The labels are links to regularexpressions.info so I'd guess yes.

I'm pretty sure EN just greps subject + the output of something like 'strings', though; it chugs pretty hard on broad searches.


I've always wanted to know that.



