
Web Search Engine project

This is a new Web Search Engine project I am developing. Yes, yet another one. I have some new ideas to make it different from the rest, but before that I would like to build a simple search engine capable of running on a single server. It should store and index at least 1 million web pages, restricted to a single language, Spanish, and return search results in less than one second for every possible query.


It sounds simple, but I would say this is quite an ambitious project. Since this is a proof-of-concept prototype, I don't want to spend too much time optimizing code; I prefer to focus on functionality. That is why I am not developing in C, but simply in PHP on a WAMP/LAMP environment.

Here we can see an important part of the project in action: the crawler. The video shows 4 concurrent web robots fetching content from the Internet. Over a domestic connection, each robot retrieves on average 1 web page per second, so about 4 lines per second scroll past in this real-time log page, which corresponds to the backend of the search engine. A rough sketch of one robot's main loop is shown below.
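
For reference, a robot's main loop might look roughly like the following sketch (PHP with cURL and mysqli). The pages table and its columns are made-up names for illustration, not the actual schema of the project.

    <?php
    // Minimal sketch of one crawler "robot": pull a pending URL from the queue,
    // fetch it over HTTP, store the page, and log one line per fetch.
    // Table/column names (pages: id, url, status, html) are assumptions.

    $db = new mysqli('localhost', 'user', 'password', 'searchengine');

    while (true) {
        // Pick one URL that has not been downloaded yet.
        $row = $db->query("SELECT id, url FROM pages WHERE status = 'pending' LIMIT 1")->fetch_assoc();
        if (!$row) {
            sleep(5);           // queue empty: wait for new links
            continue;
        }

        // Fetch the page with cURL (follow redirects, give up after 10 seconds).
        $ch = curl_init($row['url']);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 10,
            CURLOPT_USERAGENT      => 'MyCrawler/0.1',
        ]);
        $html = curl_exec($ch);
        $ok   = ($html !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200);
        curl_close($ch);

        // Save the result and mark the URL as done (or failed).
        $stmt = $db->prepare("UPDATE pages SET status = ?, html = ? WHERE id = ?");
        $status = $ok ? 'done' : 'failed';
        $stmt->bind_param('ssi', $status, $html, $row['id']);
        $stmt->execute();

        // One log line per page, as seen in the real-time log.
        echo date('H:i:s') . "  {$status}  {$row['url']}\n";
    }

Running 4 of these scripts in parallel gives the 4 concurrent robots seen in the video.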

Currently the MySQL database where these robots store the information holds half a million links, of which 300,000 pages have already been downloaded and saved. Random SELECT queries are my first benchmark as the database keeps growing. On a standard laptop, MySQL returns 1 row of data in 15 milliseconds. That latency corresponds mainly to the time the hard disk drive spends on a single input/output operation (this is the major bottleneck), since I am assuming the index file fits in RAM. The data file cannot be cached because it is 8 GB in size, and the laptop has only 4 GB of RAM.
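
The benchmark itself is easy to reproduce. A sketch along these lines (again PHP/mysqli; the pages table, its columns and the id range are assumptions) measures the average cost of one random row lookup:

    <?php
    // Rough version of the random-SELECT benchmark: time N lookups of random ids
    // and report the average milliseconds per row.

    $db = new mysqli('localhost', 'user', 'password', 'searchengine');

    $maxId   = 500000;   // roughly the current number of links
    $samples = 1000;

    $stmt  = $db->prepare("SELECT url, html FROM pages WHERE id = ?");
    $start = microtime(true);

    for ($i = 0; $i < $samples; $i++) {
        $id = mt_rand(1, $maxId);
        $stmt->bind_param('i', $id);
        $stmt->execute();
        $stmt->get_result()->fetch_assoc();   // force the row to actually be read
    }

    $elapsed = microtime(true) - $start;
    printf("%.2f ms per random row\n", 1000 * $elapsed / $samples);

On a cold disk cache this reports numbers close to the 15 ms per row mentioned above.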

According to my search engine design, a search query should check around 100 random rows in the database, which implies it would take on the order of one second to complete (100 lookups at 15 ms each is about 1.5 s). My next goal is to keep feeding the database with more web pages until it reaches 50 GB of data and then run the benchmarks again. I expect them to stay the same unless the index file grows too much and no longer fits in RAM. Right now the index file is 20 MB, so in theory it could handle 100 or more times the current number of links, that is, around 50 million. The back-of-envelope numbers are summarized below.
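
These are the rounded figures behind that estimate, written out as a tiny script; all values are approximate and taken from the measurements above.

    <?php
    // Back-of-envelope figures behind the paragraph above (illustrative only).

    $ioTimeMs   = 15;     // measured cost of one random row (essentially one disk seek)
    $rowsPerQry = 100;    // rows a search query is expected to touch
    printf("expected query time ~ %d ms\n", $ioTimeMs * $rowsPerQry);   // 1500 ms, on the order of one second

    $indexMB = 20;        // current index size for ~0.5 million links
    $ramMB   = 4096;      // total laptop RAM; the index must stay well below this
    // Growing the link count ~100x keeps the index around 2 GB, still cacheable
    // alongside the OS and MySQL buffers, hence the ~50 million link estimate.
    printf("index at 50 million links ~ %d MB\n", $indexMB * 100);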

  1. 2014/07/09 at 2:53 am

    Hello. I have similar project. A spider which can traverse URLs from base domain or jump around the web. Can you provide your email i wanna ask you few questions ?

  2. Andrea
    2013/01/16 at 7:59 pm

    interesting project. we can collaborate? I’m Italian.
