A Web Search Engine, SQL or NOSQL?
I am trying to build a Web Search Engine with modest resources and targets. Just using one dedicated server running on LAMP environment I would like to reach the target of indexing 5 million web pages, limited to one language (Spanish), and deploying search results in less than 0.1 seconds.
Probably the most important strategy to consider in this enterprise is how to store the information. Search results speed and insertion times depend on it. I find there are two main options to consider: An SQL engine and a NOSQL engine.
In either case I am implementing both on MySQL. In the first case relational structure of MySQL is in use, meanwhile in second case I use MySQL simply as an store engine, capable of saving records and serving them to a client.
In the first case, SQL orthodoxy tells us to create a table for storing pages, another one for storing keywords and finally a big one to store the instance of a keyword on a page. We may call that table as “keypag”. Considering we are indexing 5 million web pages that contain an average of 100 distinct words each one, it means a cardinality of 500 millions for keypag.
In the second case, NOSQL schema, we just need one table called “keywords” with a row for every possible keyword on the corpus (the set of crawled web pages). We may apply here an estimation of cardinality for this table around 1 million rows, taking into account than we are dealing with just one language, and filtering contents through a dictionary before indexing. Table keywords has a big field (around 60 Kbytes long) that contains the ordered matches in web pages for that keyword. Around 100 URLs and snippets can be stored on that field.
In SQL schema, a search means an SQL query returning between 10 and 100 rows from a table “keypag” that has a cardinality around 1 billion. In NOSQL schema, a search means an SQL query returning just one row from a table “keyword” that has a cardinality around 1 million.
So, theoretically we could expect best search performance in NOSQL than in SQL schema. Doing some benchmarks, we may proof that to be correct. Let’s see the resulting graphs:
* SQL performance (someone should reach 1 billion rows if patient enough. I stopped at 11 millions):
* NOSQL performance:
Pink dots represent the time it takes to MySQL to resolve a search query in miliseconds. Blue dots represent how may inserts per second is capable of doing MySQL.