Web Search Engine project
This is a new Web Search Engine project I am developing. Yes, yet another one. I have some new ideas to make it different from the rest, but before getting to those, I would like to build a simple search engine capable of running on a single server. It should store and index at least 1 million web pages, restricted to a single language (Spanish), and return search results in under one second for any possible query.
It sounds simple, but I would say it is actually quite an ambitious project. Since this is a proof-of-concept prototype, I don't want to spend too much time optimizing code; I am focusing mostly on functionality. That is why I am not developing in C, but in PHP on a WAMP/LAMP environment.
Here we see an important part of the project in action: the Crawler. In this video you can watch 4 concurrent web robots fetching content from the Internet. On a domestic connection, each robot retrieves about 1 web page per second, so roughly 4 lines per second scroll past in this real-time log page, which corresponds to the backend of the search engine.
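The robot pool described above can be sketched with a thread pool: a fixed number of workers pulling URLs from a shared list. The real crawler is written in PHP; this is just a minimal Python sketch, and `fetch` is a hypothetical placeholder standing in for the actual HTTP download.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_ROBOTS = 4  # number of concurrent robots, as in the video

def fetch(url):
    # Placeholder for the real HTTP download; the actual robots
    # take around 1 second per page on a domestic connection.
    return f"<html>contents of {url}</html>"

def crawl(urls):
    # With 4 robots at ~1 page/second each, the real-time log
    # advances around 4 lines per second.
    with ThreadPoolExecutor(max_workers=NUM_ROBOTS) as pool:
        return list(pool.map(fetch, urls))

pages = crawl([f"http://example.com/page{i}" for i in range(8)])
print(len(pages))  # 8 pages fetched
```

The thread pool keeps all 4 robots busy without any explicit locking, since each URL is handled independently.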
Currently, the MySQL database where these robots store the information holds half a million links, of which 300,000 have already been downloaded and saved. Random SELECT queries are my first benchmark as the database keeps growing. On a standard laptop, MySQL returns 1 row of data in about 15 milliseconds. This latency corresponds mainly to the time the hard disk drive spends on a single Input/Output operation (the major bottleneck), because I am assuming the index file fits in RAM. The data file cannot be cached, since it is 8 GB in size and the laptop has only 4 GB of RAM.
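A benchmark like the one above boils down to timing random primary-key lookups in a loop. As a hedged sketch, the snippet below uses an in-memory SQLite table as a stand-in for the MySQL table (the table name and schema are assumptions); note that an in-memory database will not show the ~15 ms disk-seek latency the post measures, only the shape of the test.

```python
import random
import sqlite3
import time

# Stand-in for the MySQL "pages" table (name and schema are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [(i, f"page {i}") for i in range(1, 10001)])

# Time N random single-row SELECTs, as in the benchmark.
N = 1000
start = time.perf_counter()
for _ in range(N):
    page_id = random.randint(1, 10000)
    conn.execute("SELECT content FROM pages WHERE id = ?",
                 (page_id,)).fetchone()
elapsed_ms = (time.perf_counter() - start) * 1000 / N
print(f"average lookup: {elapsed_ms:.3f} ms per row")
```

Against the real on-disk MySQL table, the average should settle near the cost of one disk I/O, which is where the 15 ms figure comes from.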
According to my search engine design, a search query should check around 100 random rows in the database, which implies it could take around 1.5 seconds on average (100 I/O operations × 15 ms each). My next goal is to keep feeding the database with more web pages until it reaches 50 GB of data, and then run the benchmarks again. I hope they stay the same, unless the index file grows too large to fit in RAM. Right now the index file is 20 MB, so in theory it could still handle 100 or more times the current number of links, meaning around 50 million.
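The back-of-the-envelope figures above can be checked directly; note that 100 rows × 15 ms actually comes out slightly above one second. The RAM budget left for the index on a 4 GB laptop is my assumption here, not a figure from the post.

```python
# Numbers taken from the post.
io_time_ms = 15           # one random row lookup (one disk I/O)
rows_per_query = 100      # rows a query must check

query_time_s = rows_per_query * io_time_ms / 1000
print(query_time_s)       # seconds per query at the current I/O cost

index_size_mb = 20        # current index file size
current_links = 500_000   # links currently in the database
ram_budget_mb = 2_000     # assumed RAM left for the index (4 GB laptop)

growth_factor = ram_budget_mb // index_size_mb
print(growth_factor * current_links)  # link capacity before the index outgrows RAM
```

With a ~2 GB budget the index could grow about 100×, which matches the ~50 million link estimate.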