My guess, back when I was doing code search at Apple, was that he was actually scanning all the docs with his fast regexp, which would be do-able with the number of machines at his disposal + the expected traffic.
If Russell Cox wrote it, it must be good. I programmed competitively with him at USACO and in college (we got to 8th best college team in the world), and he was a consistent ass-kicker. Hardcore and disciplined and fast.
That does searching of text without generating an index. While this is fast, you still have search time linear in the size of your searchable text.
If you used a suffix tree as an index, you'd get time complexity strongly sublinear in text length.
In summary, if the regex matching complexity dwarfs the corpus size, you want Thompson NFA. If your corpus size dwarfs the regex complexity, you want suffix tree indexes. It's not clear to me if the two approaches can be combined, but if they are, I'd like to know.
I agree that Google probably just throws HW at the problem, but you could also build something custom if it really needed to scale. Just record what people look for and keep an index for the common chunks of logic.
So, if people often search for someone's name or a date, you keep a list of what satisfies that, so you can quickly eliminate large numbers of documents without doing a full text search. You could also reduce the document to a list of tokens and search that list before the file, etc.
How does that help here? My query doesn't contain any directly-searchable strings, unless you start abstracting things by treating all alpha chars as the same, etc.
A nice thing about regular expressions is that you always know where you are in the expression (no backtracking, at least in theory).
Say your query is \d+(th|nd|rd), and your searchable text is "The 3rd planet from the sun is 20000000 miles away." Your tree search will hit 2, then fail on 20, then 3, then 3r, then 3rd, and then terminate successfully. So we followed 5 pointers, and tested another 20 or so for nullity.
This was independent of the length of the searchable text (but not the number of found matches). It would have performed equally well on "The 3rd planet from the sun is 20000000 miles away. Lorem ipsum (insert 10mb of random Latin here)".
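The walkthrough above can be sketched in code. This is a toy illustration under stated assumptions (a naive O(n^2) suffix trie and a hand-rolled matcher hard-coded to the equivalent of \d+(th|nd|rd)), not anyone's actual implementation; the point is that the work tracks the number of surviving trie branches, not the text length.

```python
def build_suffix_trie(text):
    # Naive suffix trie: insert every suffix, char by char.
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def search_ordinals(root):
    """Find all substrings matching \\d+(th|nd|rd) by walking the trie."""
    results = set()

    def walk(node, prefix, seen_digit):
        # Extend the run of digits along every surviving digit edge.
        for ch, child in node.items():
            if ch.isdigit():
                walk(child, prefix + ch, True)
        # After at least one digit, try each literal suffix.
        if seen_digit:
            for suffix in ("th", "nd", "rd"):
                n = node
                for c in suffix:
                    if c not in n:
                        break
                    n = n[c]
                else:
                    results.add(prefix + suffix)

    walk(root, "", False)
    return results

text = "The 3rd planet from the sun is 20000000 miles away."
print(search_ordinals(build_suffix_trie(text)))  # {'3rd'}
```

The same call on the text padded with 10 MB of Latin follows exactly the same pointers, since the lorem-ipsum suffixes never branch off a digit edge.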
Well, there's always suffix arrays, so two 64-bit pointers (plus the original char) per char?
Assuming their claim of 116,000,000 matches for /./, and 5k mean document size, 70TB, uncompressed.
Papers like "Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching" [1] lead me to believe that the state of the art could be as low as 4 bits/char, in which case the answer would be a much more tractable 270GB.
As far as partitioning goes, now I'm getting really out of my league. At a first pass, I imagine that you could keep the root and neighbors on a primary server, and as big of subtrees as you can fit in RAM on a bunch of slaves. The first couple chars get processed in the master, and then however many subtrees are alive get delegated in parallel to the secondary servers. The secondary servers return the answers to the primary, which returns to the client.
I feel like I'm answering hypothetical questions at a job interview :p
The math doesn't work. 116,000,000 matches x 5k is 580 GB. It fits on one big disk, and with 5800 machines holding 100 MB of plain text each you could search all of it naively in real time. (Grep searches 100 MB in 50 ms.)
5800 machines is practical for a real product. Hardware probably costs Google $40/month/CPU, so $230k / month, about equal to 10 engineers. So if a more clever solution needs a whole team to maintain it then it's probably better to just use grep.
10 engineers to program one data structure? I'd be greatly surprised if 1 PhD and 2 engineers couldn't easily do the additional work beyond distributed grep in less than six months. Maybe one engineer to maintain it? That saves ~$200k/mo, and more if you go over the 20 queries/second it takes to max out the 5800-CPU setup.
If I understand correctly (a big if) and the underlying structure is something along the lines of a giant suffix tree, then searching for any particular string takes time asymptotically proportional to the length of the query string. What if there were another corresponding "meta-data" suffix tree? Perhaps this would allow the fast abstract search: what if the nodes contained meta-data such as length/type etc. instead of merely containing the data itself?
perl has a "study" function that examines a string in anticipation of doing many pattern matches on it. I'm not sure what it does exactly, but the kind of structures it creates could be serialized and stored along with an input file (or perhaps in place of it), and used as a kind of regular-expression-optimized index. This isn't necessarily an "index" in the normal sense, but then, neither is a full-text index. Really, any data structure that is optimized for a specific kind of processing is an index.
study does not do anything really interesting. It looks up substrings in a table of substring -> average frequency. It then scans for the rarest character first, so that it can fail quickly for most input.
From the docs:
(The way "study" works is this: a linked list of every character in the string to be searched is made, so we know, for example, where all the 'k' characters are. From each search string, the rarest character is selected, based on some static frequency tables constructed from some C programs and English text. Only those places that contain this "rarest" character are examined.)
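A toy version of that rarest-character prefilter, to make the strategy concrete. The frequency table here is a made-up English-ish sample, not perl's actual static tables, and this handles only a literal needle:

```python
from collections import defaultdict

# Assumed character frequencies (illustrative only; perl's tables differ).
FREQ = {'e': 0.127, 't': 0.091, 'a': 0.082, 'o': 0.075, 'i': 0.070,
        'n': 0.067, 's': 0.063, 'h': 0.061, 'r': 0.060, 'u': 0.028,
        'c': 0.028, 'k': 0.008, 'q': 0.001, 'z': 0.001}

def study(text):
    # Record every position of every character, like study's linked lists.
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return positions

def find_literal(text, positions, needle):
    # Pick the needle's rarest character, then test only at its positions.
    offset, rare = min(enumerate(needle),
                       key=lambda t: FREQ.get(t[1], 0.05))
    hits = []
    for pos in positions[rare]:
        start = pos - offset
        if start >= 0 and text.startswith(needle, start):
            hits.append(start)
    return hits

text = "the quick brown fox jumps over the lazy quokka"
pos = study(text)
print(find_literal(text, pos, "quick"))  # [4]
```

Searching for "quick" probes only the two 'q' positions instead of all 46 characters, which is the "fail quickly for most input" property described above.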
Basically, this is not what Google is doing for code search.
Study doesn't do anything really interesting, but then, creating a keyword index for a keyword-based search isn't all that interesting either.
Considering that Google Code Search is not intended to search English (their regular search technology would most likely suffice for that) but to search code with regular expressions, I agree that the naive study implementation you've quoted is, basically, not what Google is using. However, there's no reason the frequency tables need to be static, nor constructed solely from C programs and English text: they can be continuously updated from the input corpus, which here should be the code being searched. Part of search is ruling out significant portions of the search space as fast as possible, and failing quickly covers most inputs (the example regular expression only returns three hits, so the wide majority of documents don't match), so a study-like implementation would appear to be a possibility.
This is off-topic, but I just realized Google Code Search is great for seeing where code you open source ends up. Came across a few things I had no idea were using my code. Kinda cool.
It's viable to build an index with n-grams, but I don't think you'd get any benefit over suffix-tree search. There might be a way to select n-grams cleverly some of the time, but you can also easily run into pathological cases.
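For what an n-gram index looks like in practice, here's a minimal trigram sketch (trigram size and the all-grams-intersection policy are my assumptions, not a claim about Google's system). Documents missing any trigram of a query literal cannot contain it, so the postings intersection prunes the candidate set before any scanning:

```python
def trigrams(s):
    # All overlapping 3-char substrings; empty for strings shorter than 3.
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    index = {}
    for doc_id, text in enumerate(docs):
        for gram in trigrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index

def candidates(index, literal):
    # Intersect the posting sets of every trigram in the literal.
    grams = trigrams(literal)
    if not grams:
        return None  # literal too short to prune anything; must scan all
    return set.intersection(*(index.get(g, set()) for g in grams))

docs = ["hello world", "goodbye cruel world", "hello there"]
idx = build_index(docs)
print(candidates(idx, "hello"))  # {0, 2}
```

The pathological cases mentioned above show up exactly when a query has no usable literal (e.g. /./ or \d+), at which point the candidate set is everything.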
IIRC, this was a rather high-compute-cost project inside of Google, far more so than other search projects. Obviously it was a favorite though of the eng decision makers, so was well cared for resource-wise.
Not only would it allow searching with regexp, but it could allow more sophisticated search queries like "Search all my Ruby code for blocks that rescue StandardError and are not followed by an ensure clause"
But you'd have to do that separately for every language.
Edit: the OP is such a great question. I've been saving up the joke of commenting "How is this Hacker News?" on the most literally appropriate HN post I run across... and almost used it on this one. Sadly, I probably never will.
It's one of those ideas that would be best tried out on one instance first. It's not obvious that the ASTs of programs in different languages would be "A" enough for a generalized querying system to be useful. Still, I don't mean to shoot the idea down - it would make for an interesting experiment.
The cost of that algorithm still grows linearly with respect to the length of the compressed text. It's possible Google is doing something like this (searching over the entire corpus until they find a match), but is there any work on precomputing an actual "index" to speed up this type of search?
I'm certain there are many approaches to be found via citation-following.
But consider that you can use the linked algorithm, and a bit of trickiness, to create an index already:
- Step (a): prepare another version of the document where every digit is replaced with 0, every word char with 'a', etc.
- Step (b): note how much better it compresses. :)
- Step (c): Perform a similar 'weakening' on the search regex.
- Step (d): Done! Not an exact index, but it will allow you to quickly eliminate a large number of documents.
You can also combine this approach with text-based ones for the cases where there are parts of the search regex that contain literal text. (This is almost certainly the most common use case).
Or you could make it hierarchical if you need to by creating several watered-down texts for each input with different levels of substitution.
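Steps (a) through (d) can be sketched directly. This is only the single-level version with a literal query fragment; the substitution rules ('0' for digits, 'a' for letters) are the ones proposed above:

```python
import re

def weaken(text):
    # Step (a): collapse every digit to '0' and every letter to 'a'.
    text = re.sub(r"\d", "0", text)
    return re.sub(r"[A-Za-z]", "a", text)

docs = ["released version 2.0 today", "no numbers here at all"]
weakened = [weaken(d) for d in docs]

# Steps (c)/(d): a literal fragment of the query, weakened the same way,
# acts as a filter -- any document whose weakened form lacks it can be
# eliminated without ever running the full regex.
fragment = weaken("2.0")  # -> "0.0"
survivors = [i for i, w in enumerate(weakened) if fragment in w]
print(survivors)  # [0]
```

As step (b) notes, the weakened texts are also massively more compressible, since distinct identifiers and numbers all collapse into runs of 'a's and '0's.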
Since this is the sort of thing that's fun to think about, here's a whole different approach:
- Break each document apart into code lines.
- Create a hierarchical tree: leaves are the code snippets (with a pointer back to source doc), each node represents a clustering-by-text-similarity of all its children. (Appropriate distance measure left as an exercise for the reader. :) )
- Let's say that for two regular expressions, RA & RB, RA < RB iff any text matched by RA will also be matched by RB.
- At each node, create a regular expression R so that for each child regex C, C < R. (The leaves become a regex that simply matches the source line literally.) Along the way, make a best effort to restrict the length of R by introducing generalizations (R can't simply be Ca|Cb|Cc...) without making it so loose that it matches everything.
- Then, searching for a regex S can be done recursively: For each child regex R of the current node, if you can construct a new regex T such that T < S and T < R, descend into that node.
n.b. I typed that out in a hurry and without being rigorous... so it's probably not quite right, but an interesting path to think about exploring. Also: testing if one regex is less than another (as defined above) is actually fairly straightforward, even though it seems hard.
(And a quick edit to add: the last step is definitely wrong. Unfortunately I need to hop onto a conference call, so it'll have to wait until the indefinite future.)
(Edit to that: I think I fixed it for now. Shorter version: if the intersection of S and R is nonempty, descend.)
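The nonempty-intersection test the fixed version relies on is indeed straightforward: build the product of the two automata and look for a reachable pair of accepting states. A sketch over toy DFAs with hand-written transition tables (real regexes would first be compiled to DFAs, which is omitted here):

```python
from collections import deque

def intersect_nonempty(dfa_a, dfa_b):
    # A DFA is (start, accepting_states, transitions), where transitions
    # maps (state, char) -> state and may be partial (missing = dead).
    (sa, acc_a, ta), (sb, acc_b, tb) = dfa_a, dfa_b
    alphabet = {ch for _, ch in ta} | {ch for _, ch in tb}
    seen = {(sa, sb)}
    queue = deque([(sa, sb)])
    while queue:
        a, b = queue.popleft()
        if a in acc_a and b in acc_b:
            return True  # some string is accepted by both
        for ch in alphabet:
            pair = (ta.get((a, ch)), tb.get((b, ch)))
            if None not in pair and pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return False

A = (0, {1}, {(0, 'a'): 1, (1, 'a'): 1})            # a+
B = (0, {2}, {(0, 'a'): 1, (1, 'a'): 2})            # exactly "aa"
C = (0, {1}, {(0, 'b'): 1, (1, 'b'): 1})            # b+

print(intersect_nonempty(A, B))  # True  ("aa" is in both)
print(intersect_nonempty(A, C))  # False
```

In the tree search above, a False result means the whole subtree under R can be skipped for query S.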
I am 100% postulating here, but this is what seems most likely to me:
Google is known for MapReduce. Regex searches seem manageable when put into such an algorithm: all Google would have to do is take whatever key system they use for their Bigtable and use MapReduce to expand the regex over the keys. Then they just return the data associated with those keys.
Indexing Documents into Barrels -- After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordID's, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel and then the small log file of extra words can be processed by one final indexer.
So we have a hash table of words/searches -> results
Because we have this, we can regex search the keys, and end up with the result pages.
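At its most minimal, "regex search the keys" is just a scan over the key set. The key/posting data below is hypothetical, purely to show the shape of the idea (the scan itself is what MapReduce would parallelize):

```python
import re

# Hypothetical keyword index: key -> list of document ids.
index = {"apple": [1, 2], "apply": [3], "banana": [4]}

def regex_lookup(index, pattern):
    # Match the regex against every key, union the postings of the hits.
    rx = re.compile(pattern)
    hits = set()
    for key, doc_ids in index.items():
        if rx.fullmatch(key):
            hits.update(doc_ids)
    return sorted(hits)

print(regex_lookup(index, r"appl."))  # [1, 2, 3]
```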
We do have other options, though. It could be:
a) magic
b) a trade secret (think PigeonRank :-) )
use MapReduce to expand the regex over the keys ... So we have a hash table of words/searches -> results ... Because we have this, we can regex search the keys, and end up with the result pages.
Could be tokenized and mapped to a few regex keywords, with one or another as key-value in a datastore. Since Google is big on MapReduce, which could be seen as computation on sets:
[a-z] = (DOCID_14566, DOCID_15999, DOCID_888)
\d{5} = (DOCID_15, DOCID_15999, DOCID_552)
Then when someone searches for a "number with five digits", the value for the \d{5} key is retrieved; it lists the documents where the search appears, and this set/list is combined with the other sets/lists that match other search keywords, and you have your results.
On a normal web search, the document is split into words and you can make these words into keys, so you know where they appear, right? (the keys' values).
Well, if you think that the user will type a regex expression "\d{5}", then you can map it to documents too.
It's like pre-processing the possible regex that would match a document token, and searching these regexes instead of words like a normal web search.
It's "just" an extra step, splitting words (tokens) from a document, and assigning possible regexes for these tokens.
Not saying Google does it :p But I guess that's what the OP meant...
Edited: Before anyone notices how inefficient my example is: I wouldn't implement every possible \d as a key, or [a-b], [a-c], [a-d] etc. as other keys; this is just an example of how sets of precomputed regexes (of any depth) can be (ab)used.
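A toy version of the tokens-to-pattern-keys step being described. The two pattern keys here are placeholders; presumably a real system would mine them from query logs rather than hard-code them:

```python
import re

# Hypothetical precomputed pattern keys (illustrative, not exhaustive).
PATTERNS = {
    r"\d{5}": re.compile(r"\d{5}"),
    r"[a-z]+": re.compile(r"[a-z]+"),
}

def index_docs(docs):
    # Map each pattern key to the set of docs containing a matching token.
    index = {key: set() for key in PATTERNS}
    for doc_id, text in enumerate(docs):
        for token in text.split():
            for key, rx in PATTERNS.items():
                if rx.fullmatch(token):
                    index[key].add(doc_id)
    return index

docs = ["zip 90210 ok", "Hello World", "id 12345"]
idx = index_docs(docs)
print(idx[r"\d{5}"])   # {0, 2}
print(idx[r"[a-z]+"])  # {0, 2}
```

A query for \d{5} then becomes a single key lookup instead of a corpus scan, at the cost of indexing time proportional to tokens x patterns.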
This makes sense and it could work :) I imagine you would have to spend time tuning the tokens you map out. I think this would break down pretty quickly with moderate or complicated regexes, but those might not be the majority of searches.
Is there any trickery you could do with the inner nodes of the B+ trees? I spent a few minutes thinking about mapped-out tokens and intermediate reductions but didn't come up with anything.
If anyone knows any differently, I'm all ears. :)