Syntax: ALTER TABLE mytable ADD INDEX idx_a_b (a, b);
This MySQL command creates a very useful combined index on this table:
CREATE TABLE IF NOT EXISTS mytable (
id INT(32) UNSIGNED NOT NULL AUTO_INCREMENT,
a INT(32) UNSIGNED NOT NULL,
b INT(32) UNSIGNED NOT NULL,
PRIMARY KEY (id)
);
Using that index, queries like this one will run faster:
SELECT b FROM mytable WHERE a='value';
This is because a full table scan is avoided: MySQL resolves the WHERE clause from the index, and since b is also part of the index, it never has to touch the table rows at all (a covering index).
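The covering-index effect is easy to see with a query planner. A minimal sketch, using SQLite instead of MySQL so it is self-contained (the table and index names follow the example above; the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL)")
con.execute("CREATE INDEX idx_a_b ON mytable (a, b)")
con.executemany("INSERT INTO mytable (a, b) VALUES (?, ?)",
                [(i % 100, i) for i in range(10_000)])

# The planner reports that the combined index answers the query
# without touching the table itself.
plan = con.execute("EXPLAIN QUERY PLAN SELECT b FROM mytable WHERE a = 42").fetchone()
print(plan[-1])  # e.g. "... USING COVERING INDEX idx_a_b (a=?)"
```

MySQL shows the equivalent information ("Using index" in the Extra column) when you prefix the query with EXPLAIN.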
Trying to install the open source Sphinx Search Engine on my CentOS Linux server, I got these errors:
rpm -i sphinx-2.1.5-1.rhel6.x86_64.rpm
error: Failed dependencies:
libodbc.so.2()(64bit) is needed by sphinx-2.1.5-1.rhel6.x86_64
libpq.so.5()(64bit) is needed by sphinx-2.1.5-1.rhel6.x86_64
This means some shared libraries are missing from the system. To find out which packages provide them, I used this command:
yum provides **/libpq.so.5
This way I could install the missing libraries:
yum install unixODBC unixODBC-devel postgresql-libs
And then I tried the Sphinx Search install again, successfully this time:
# rpm -i sphinx-2.1.5-1.rhel6.x86_64.rpm
Now create a full-text index, start the search daemon, and you're all set.
To manage indexes:
To rebuild all disk indexes:
sudo -u sphinx indexer --all --rotate
To start/stop search daemon:
service searchd start/stop
To query search daemon using MySQL client:
mysql -h 0 -P 9306
mysql> SELECT * FROM test1 WHERE MATCH('test');
See the manual at /usr/share/doc/sphinx-2.1.5 for details.
For commercial support please contact Sphinx Technologies Inc at
I hope this tip can help someone …
How to install PHP6 on CentOS 6: these are the Linux commands required to install the latest PHP source code on a CentOS 6 LAMP server, along with several useful libraries such as GD, cURL, Tidy, JSON, SQLite, PSpell, …
yum install wget
// Download the php-trunk-201111071530.tar.bz2 snapshot, then extract it:
tar -jxvf php-trunk-201111071530.tar.bz2
yum groupinstall "Development Tools"
yum install libxml2 libxml2-devel
yum install httpd-devel
yum install pcre pcre-devel
yum install bzip2 bzip2-devel
yum install gmp gmp-devel
yum install tidy libtidy libtidy-devel
yum install curl libcurl libcurl-devel
yum install libjpeg libjpeg-devel
yum install libpng libpng-devel
yum install libXpm libXpm-devel
yum install freetype freetype-devel
yum install aspell aspell-devel
./configure --build=x86_64-unknown-linux-gnu --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --cache-file=../config.cache --with-libdir=lib64 --with-config-file-path=/etc --with-config-file-scan-dir=/etc/php.d --disable-debug --with-pic --disable-rpath --without-pear --with-bz2 --with-exec-dir=/usr/bin --with-freetype-dir=/usr --with-png-dir=/usr --with-xpm-dir=/usr --enable-gd-native-ttf --without-gdbm --with-gettext --with-gmp --with-iconv --with-jpeg-dir=/usr --with-openssl --with-pcre-regex=/usr --with-zlib --with-layout=GNU --enable-exif --enable-ftp --enable-magic-quotes --enable-sockets --enable-sysvsem --enable-sysvshm --enable-sysvmsg --with-kerberos --enable-ucd-snmp-hack --enable-shmop --enable-calendar --with-sqlite3 --with-libxml-dir=/usr --enable-xml --with-system-tzdata --with-apxs2=/usr/sbin/apxs --with-mysql --with-gd --disable-dom --disable-dba --without-unixODBC --disable-pdo --disable-xmlreader --disable-xmlwriter --disable-phar --disable-fileinfo --enable-json --with-pspell --disable-wddx --with-curl --disable-posix --disable-sysvmsg --disable-sysvshm --disable-sysvsem --enable-mbstring --with-mysqli --with-tidy
make
make install
sed -i '/^LoadModule php5_module/ s/^/#/' /etc/httpd/conf/httpd.conf
service httpd restart
I am trying to build a web search engine with modest resources and targets. Using just one dedicated server running a LAMP environment, I would like to reach the target of indexing 5 million web pages, limited to one language (Spanish), and returning search results in less than 0.1 seconds.
Probably the most important strategic decision in this enterprise is how to store the information: both search speed and insertion times depend on it. I find there are two main options to consider: an SQL engine and a NOSQL engine.
In either case I am implementing the storage on MySQL: in the first case the relational structure of MySQL is used, while in the second I use MySQL simply as a storage engine, capable of saving records and serving them to a client.
In the first case, SQL orthodoxy tells us to create a table for storing pages, another for storing keywords, and finally a big one to store each occurrence of a keyword on a page; we may call that table “keypag”. Considering we are indexing 5 million web pages containing an average of 100 distinct words each, that means a cardinality of 500 million rows for keypag.
In the second case, the NOSQL schema, we just need one table called “keywords”, with a row for every possible keyword in the corpus (the set of crawled web pages). We may estimate the cardinality of this table at around 1 million rows, taking into account that we are dealing with just one language and filtering contents through a dictionary before indexing. The keywords table has a big field (around 60 KB long) that contains the ordered matches in web pages for that keyword; around 100 URLs and snippets can be stored in that field.
In the SQL schema, a search means an SQL query returning between 10 and 100 rows from the “keypag” table, which has a cardinality of around 500 million. In the NOSQL schema, a search means an SQL query returning just one row from the “keywords” table, which has a cardinality of around 1 million.
So, theoretically, we could expect better search performance from the NOSQL schema than from the SQL one. Some benchmarks prove that to be correct. Let's look at the resulting graphs:
* SQL performance (one could reach 500 million rows with enough patience; I stopped at 11 million):
* NOSQL performance:
Pink dots represent the time it takes MySQL to resolve a search query, in milliseconds. Blue dots represent how many inserts per second MySQL can perform.
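The two schemas compared above can be sketched in a few lines. This is a minimal illustration using SQLite for portability; the table names follow the text, while the column names and the packed-matches format are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# SQL schema: one row per (keyword, page) occurrence -- cardinality grows
# as pages x distinct-words-per-page (500 million at the stated scale).
con.executescript("""
CREATE TABLE pages    (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE keypag   (keyword_id INTEGER, page_id INTEGER);
CREATE INDEX idx_keypag ON keypag (keyword_id);
""")

# NOSQL-style schema: one row per keyword, all matches packed into one big field.
con.execute("CREATE TABLE kw_nosql (word TEXT PRIMARY KEY, matches TEXT)")

# A search is 10-100 index lookups plus a join in the first schema...
con.execute("INSERT INTO pages VALUES (1, 'http://example.com/')")
con.execute("INSERT INTO keywords VALUES (1, 'hola')")
con.execute("INSERT INTO keypag VALUES (1, 1)")
rows = con.execute("""
    SELECT p.url FROM keypag k JOIN pages p ON p.id = k.page_id
    WHERE k.keyword_id = 1
""").fetchall()

# ...but a single primary-key lookup in the second.
con.execute("INSERT INTO kw_nosql VALUES ('hola', 'http://example.com/|snippet')")
row = con.execute("SELECT matches FROM kw_nosql WHERE word = 'hola'").fetchone()
print(rows, row)
```

The single-row lookup is what keeps the NOSQL variant close to one disk seek per query, at the cost of rewriting a large field on every insertion.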
Assuming you have installed CentOS 6 Linux and the network is running correctly, run the following commands as root to install the MySQL server. I ran them on a fresh Full Desktop CentOS 6 (64-bit) install inside VM VirtualBox, but they should also be useful on other configurations.
yum install mysql mysql-server
service mysqld start
chkconfig --levels 235 mysqld on
/usr/bin/mysql_secure_installation
mysql -u root -p
CREATE DATABASE mydb;
CREATE USER 'mydbuser'@'%' IDENTIFIED BY 'mydbpass';
GRANT ALL ON mydb.* TO 'mydbuser'@'%';
FLUSH PRIVILEGES;
// Granting remote access to MySQL Server on port 3306:
vi /etc/sysconfig/iptables
// Write this line before the last line "COMMIT"
-A INPUT -m state --state NEW -m tcp -p tcp --dport 3306 -j ACCEPT
service iptables restart
Done! Now try to connect to your database.
This is a new web search engine project I am developing. Yes, yet another one. Well, I have some new ideas to make it different from the rest, but before that I would like to develop a simple search engine capable of running on just one single server. It would store and index at least 1 million web pages, strictly reduced to one language (Spanish), and yield search results in less than one second for every possible query.
Apparently simple, but I would actually say this is quite an ambitious project. As this is a proof-of-concept prototype, I don't want to spend too much time optimizing code; I am mostly focusing on functionality. That is why I am not developing in C, but simply in PHP on a WAMP/LAMP environment.
Here we see an important part of the project in action: the crawler. In this video, 4 concurrent web robots are fetching content from the Internet. On a domestic connection, each robot gets about 1 web page per second on average, so around 4 lines per second pass through this real-time log page, which corresponds to the backend of this search engine project.
Currently the MySQL database where these robots store the information holds half a million links, of which 300,000 are already downloaded and saved. Random SELECT queries are my first benchmark test as the database keeps growing. On a standard laptop, MySQL yields 1 row of data in about 15 milliseconds. This time corresponds mainly to the time the hard disk drive spends performing a single input/output operation (the major bottleneck), because I am assuming the index file fits in RAM. The data file cannot be cached, because it is 8 GB in size and the laptop has just 4 GB of RAM.
According to my search engine design, a search query should check around 100 random rows in the database, which implies it could take around 1.5 seconds on average. My next goal is to keep feeding the database with more web pages until it reaches 50 GB of data and then check the benchmarks again. I hope they remain the same, unless the index file grows too much and no longer fits into RAM. Right now the index file is 20 MB in size, so in theory it could still handle 100 or more times the current number of links, i.e. around 50 million.
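The back-of-the-envelope numbers above can be checked in a few lines. The 15 ms seek time, 100 lookups per query, 20 MB index and 4 GB RAM figures come from the text; the headroom calculation is deliberately rough:

```python
# Rough capacity estimate for the benchmark figures quoted above.
SEEK_MS = 15             # one random I/O on the laptop HDD
LOOKUPS_PER_QUERY = 100  # rows a search must check, per the design

query_time_s = SEEK_MS * LOOKUPS_PER_QUERY / 1000
print(f"expected query time: {query_time_s:.1f} s")  # 1.5 s

INDEX_MB = 20            # current MYI index file size
RAM_MB = 4096            # laptop RAM
CURRENT_LINKS = 500_000
headroom = RAM_MB // INDEX_MB  # theoretical growth factor before the index outgrows RAM
print(f"index could grow ~{headroom}x, i.e. ~{headroom * CURRENT_LINKS / 1e6:.0f} million links")
```

In practice not all RAM is available for the index (the OS and data cache take their share), which is why the text rounds the headroom down to roughly 100x, around 50 million links.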
Running a WAMP server on 64-bit Windows 7 (CPU U7300 @ 1.3 GHz, 4 GB RAM), I tried to benchmark search performance on a big MyISAM MySQL database: 520k rows, with a 3.4 GB MYD file for data and a 20 MB MYI file for the index.
In the first test, the MySQL files are stored on a Momentus ST9500325AS SATA 3Gb/s 500 GB HDD with a random read seek time of 14 ms. The test consists of querying 1,000 random rows of a big table and measuring the time MySQL takes to deliver the information. This requires MySQL to perform a lot of disk I/O operations, so disk latency and random read seek time are determinant. The first benchmark on the HDD gives a result of 5.23 seconds, meaning about 5 milliseconds per query on average.
The second test runs on a 32 GB LaCie iamaKey USB drive acting as an SSD device. This benchmark takes 2.15 seconds to extract 1,000 random rows from the same table, meaning that for these MySQL search queries the SSD storage is roughly 2.4 times faster than the HDD.
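The test loop itself is trivial. Here is a sketch with SQLite standing in for MySQL, so it runs anywhere; the 1,000-random-row pattern follows the benchmark described above, while the table contents are illustrative (and an in-memory database will of course show no disk latency at all):

```python
import random
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT)")
con.executemany("INSERT INTO pages VALUES (?, ?)",
                [(i, "x" * 100) for i in range(520_000)])

# Query 1,000 random rows by primary key and time the whole batch,
# as in the benchmark described above.
ids = [random.randint(0, 519_999) for _ in range(1000)]
start = time.perf_counter()
for i in ids:
    con.execute("SELECT body FROM pages WHERE id = ?", (i,)).fetchone()
elapsed = time.perf_counter() - start

per_query_ms = elapsed / len(ids) * 1000
print(f"{elapsed:.2f} s total, {per_query_ms:.3f} ms per query on average")
```

Pointing the same loop at an on-disk file on the HDD and then on the USB drive reproduces the comparison: with a cold cache, the per-query time is dominated by one random seek per lookup.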