Archive

Posts Tagged ‘mysql’

MySQL add index ALTER TABLE example

2014/03/04 Leave a comment

Syntax: ALTER TABLE mytable ADD INDEX idx_a_b (a, b);

That ALTER TABLE statement creates a very useful composite index on a table like this one:

CREATE TABLE IF NOT EXISTS mytable (
id INT(32) UNSIGNED NOT NULL AUTO_INCREMENT,
a INT(32) UNSIGNED NOT NULL,
b INT(32) UNSIGNED NOT NULL,
PRIMARY KEY (id)
);

Using that index, queries like this one will run faster:

SELECT b FROM mytable WHERE a=123;

The query avoids a full table scan thanks to the index. In fact, since both a and b are part of idx_a_b, MySQL can resolve it from the index alone (a covering index).
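As a quick check (an illustrative sketch, not part of the original example), once the table has some data EXPLAIN should confirm that the query is resolved through the new index:

EXPLAIN SELECT b FROM mytable WHERE a=123;
-- The key column should show idx_a_b, and Extra should include "Using index",
-- because both a and b are contained in the index.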


A Web Search Engine, SQL or NOSQL?

2011/08/23 1 comment

I am trying to build a Web Search Engine with modest resources and targets. Using just one dedicated server running a LAMP environment, I would like to reach the target of indexing 5 million web pages, limited to one language (Spanish), and returning search results in less than 0.1 seconds.

Probably the most important strategy to consider in this enterprise is how to store the information: search speed and insertion times depend on it. I find there are two main options to consider: an SQL schema and a NOSQL schema.

In either case I am implementing both on MySQL. In the first case the relational structure of MySQL is used, while in the second case I use MySQL simply as a storage engine, capable of saving records and serving them to a client.

In the first case, SQL orthodoxy tells us to create a table for storing pages, another one for storing keywords, and finally a big one to store each instance of a keyword on a page. We may call that table “keypag”. Considering we are indexing 5 million web pages that contain an average of 100 distinct words each, that means a cardinality of 500 million rows for keypag.
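As a rough illustration (a sketch only; the column names and sizes are my own assumptions, not taken from the actual project), that relational schema could look like this:

CREATE TABLE pages (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  url VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
);

CREATE TABLE keywords (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  word VARCHAR(64) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY idx_word (word)
);

-- One row per occurrence of a keyword on a page: around 500 million rows.
CREATE TABLE keypag (
  keyword_id INT UNSIGNED NOT NULL,
  page_id INT UNSIGNED NOT NULL,
  INDEX idx_keyword (keyword_id)
);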

In the second case, the NOSQL schema, we just need one table called “keywords” with a row for every possible keyword in the corpus (the set of crawled web pages). We may estimate the cardinality of this table at around 1 million rows, taking into account that we are dealing with just one language and filtering contents through a dictionary before indexing. The keywords table has a big field (around 60 Kbytes long) that contains the ordered matches in web pages for that keyword. Around 100 URLs and snippets can be stored in that field.
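A minimal sketch of that single-table alternative (again, column names and types are my own assumptions):

CREATE TABLE keywords (
  word VARCHAR(64) NOT NULL,
  matches MEDIUMTEXT NOT NULL, -- around 60 KB of ordered URLs and snippets per keyword
  PRIMARY KEY (word)
);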

In the SQL schema, a search means an SQL query returning between 10 and 100 rows from the “keypag” table, which has a cardinality of around 500 million. In the NOSQL schema, a search means an SQL query returning just one row from the “keywords” table, which has a cardinality of around 1 million.
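Using the illustrative schemas sketched above, the two kinds of search query would look roughly like this:

-- SQL schema: pull matching pages out of the huge keypag table
SELECT p.url
FROM keypag k
JOIN pages p ON p.id = k.page_id
WHERE k.keyword_id = 12345
LIMIT 100;

-- NOSQL-style schema: read one pre-built row by primary key
SELECT matches FROM keywords WHERE word = 'example';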

So, theoretically, we could expect better search performance from the NOSQL schema than from the SQL schema. Some benchmarks confirm this to be correct. Let’s look at the resulting graphs:

* SQL performance (someone could reach 1 billion rows if patient enough; I stopped at 11 million):

* NOSQL performance:

Pink dots represent the time it takes MySQL to resolve a search query, in milliseconds. Blue dots represent how many inserts per second MySQL is capable of performing.

Easy install MySQL on CentOS 6

2011/08/08 1 comment
CentOS

Install MySQL on CentOS 6

Assuming you have installed Linux CentOS 6 and the network is running correctly, run the following commands to install the MySQL server. I ran these commands on a fresh Full Desktop CentOS 6 (64-bit) install inside VM VirtualBox. Nevertheless, the following commands will also be useful in other configurations. Run them with root permissions.

yum install mysql mysql-server
service mysqld start
chkconfig --level 235 mysqld on
/usr/bin/mysql_secure_installation

mysql -u root -p
	CREATE DATABASE mydb;
	CREATE USER 'mydbuser'@'%' IDENTIFIED BY 'mydbpass';
	GRANT ALL ON mydb.* TO 'mydbuser'@'%';
	FLUSH PRIVILEGES;

// Granting remote access to MySQL Server on port 3306:
vi /etc/sysconfig/iptables
// Add this line just before the final "COMMIT" line
-A INPUT -m state --state NEW -m tcp -p tcp --dport 3306 -j ACCEPT

service iptables restart

Done! Now try to connect to your database.
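As a quick sanity check (assuming the user was created exactly as above), you can verify the remote user's privileges from the MySQL prompt:

SHOW GRANTS FOR 'mydbuser'@'%';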

Web Search Engine project

2011/08/06 3 comments

This is a new Web Search Engine project I am developing. Yes, yet another one. Well, I have some new ideas to make it different from the rest, but before that, I would like to develop a simple search engine capable of running on just one single server. It would store and index at least 1 million web pages, strictly limited to one language, Spanish, and yield search results in less than one second for every possible query.

 

Apparently simple, but actually I would say this is quite an ambitious project. As a proof-of-concept prototype, I don’t want to spend too much time optimizing code; I am mostly focusing on functionality. This is why I am not developing in C, but simply in PHP on a WAMP/LAMP environment.

Here we see an important part of the project in action: the crawler. In this video we actually see 4 concurrent web robots getting contents from the Internet. Using a domestic connection, on average each robot gets 1 web page per second, so around 4 lines per second pass through this real-time log web page, corresponding to the backend of this search engine project.

Currently the MySQL database where these robots store the information has half a million links, of which 300,000 are already downloaded and saved. Random SELECT queries are my first benchmark test as the database keeps growing. MySQL, on a standard laptop computer, yields 1 row of data in 15 milliseconds. This time lapse corresponds mainly to the time the hard disk drive spends performing one single input/output operation (this is the major bottleneck), because I am assuming the index file fits in RAM. The data file cannot be cached because it is 8 Gbytes in size and the laptop has just 4 Gbytes of RAM.

According to my search engine design, a search query should check around 100 random rows in the database. This implies it could take on the order of a second to be performed (100 rows × 15 ms each is about 1.5 seconds). My next goal is to continue feeding the database with more web pages, up to 50 Gbytes of data, and check the benchmarks again. I hope they remain the same unless the index file grows too much and no longer fits into RAM. Right now the index file is 20 Mbytes in size, so theoretically it could still handle 100 or more times the current number of links, meaning around 50 million.

MySQL benchmark SSD versus HDD

2011/08/02 Leave a comment


Running a WAMP server on Windows 7 64-bit, with a U7300 CPU @ 1.3 GHz and 4 GB of RAM, I tried to benchmark the performance of searches in a big MyISAM MySQL database: 520k rows, a 3.4 GB MYD file for the data and a 20 MB MYI file for the index.

In the first test the MySQL files are stored on a Momentus ST9500325AS SATA 3 Gb/s 500 GB HDD with a random read seek time of 14 ms. The test consists of querying 1,000 random rows of a big table and measuring the time elapsed until MySQL delivers the information. This test requires MySQL to perform a lot of disk I/O operations, so disk latency and random read seek time are decisive. The first benchmark on the HDD gives a result of 5.23 seconds, roughly 5 milliseconds per query on average.
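For reference, a minimal sketch of one iteration of such a benchmark (table and column names are my own illustration, assuming ids run contiguously from 1 to about 520,000; the real test loops over 1,000 of these lookups):

SET @rid = FLOOR(1 + RAND() * 520000); -- pick a random primary key value
SELECT * FROM pages WHERE id = @rid;   -- one indexed lookup, one random read from the data file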

The second test runs on a 32 GB LaCie iamaKey USB stick acting as an SSD device. The benchmark yields a result of 2.15 seconds to extract 1,000 random rows from the same table, meaning that SSD storage is roughly twice as fast as the HDD for MySQL search queries.

How to dump MySQL database on WAMP

2011/08/02 3 comments

In order to make a backup or a server transfer, you may need to dump a MySQL database into an SQL file. If you try to do it through phpMyAdmin, you may get a timeout error or a PHP out-of-memory error with big databases. In WAMP server this is properly done through a program called mysqldump, which you will find in the WAMP binary folder. The best way to proceed is to invoke the mysqldump application from the Windows command line, like this:

C:\wamp\bin\mysql\mysql5.5.8\bin\mysqldump -uroot -p dbname > c:\path\dbname.sql

The root password will be asked for at the prompt. This password is probably still empty after a fresh WAMP install; in that case just press Return and your SQL dump file will be ready soon. Mine took several minutes because I was dumping a 3 GB database, so be patient with big databases.
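To load the dump back later (a hedged sketch reusing the same placeholder path), connect with the mysql client, create or select the target database, and source the file; forward slashes in the path work fine on Windows:

CREATE DATABASE IF NOT EXISTS dbname;
USE dbname;
SOURCE c:/path/dbname.sql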

Fastest PHP/MySQL algorithm to get random rows from a huge table

2011/08/01 2 comments

If “ORDER BY RAND()” is spoiling your database performance, here is a much faster alternative algorithm:

This is actually a 3-query solution:

1.- SELECT COUNT(*) FROM pages
2.- SELECT id FROM pages WHERE RAND()*400000<20 ORDER BY RAND() LIMIT 0,10
3.- SELECT id,url FROM pages WHERE id IN(24381, 80642, 127498, 131855, 169079, 187588, 190216, 217914, 238591, 253789, 311604, 329684, 348518, 377633, 392199, 399577, 427122)

My table is called “pages”. It contains around 400,000 records. MySQL uses 2 GB of disk for storing the data and 20 MB for the index. It is a MyISAM table.

  • First, it finds how many rows are in the table. This is a very fast step.
  • Second, it chooses a set of random ids. This set does not contain a fixed number of elements; it is around 20 on average, and the query then keeps at most 10 of them.
  • Third, we extract the contents from the non-indexed data by explicitly querying those ids.
<?php
// BD_Ejecutar(), logger() and microtime_float() are the author's own helpers:
// a mysql_query() wrapper, a logging function and a float timer (PHP 5, mysql extension).
$time = microtime_float();

// 1.- Count the rows in the table (very fast on MyISAM).
$sql = 'SELECT COUNT(*) FROM pages';
$rquery = BD_Ejecutar($sql);
list($num_records) = mysql_fetch_row($rquery);
mysql_free_result($rquery);

// 2.- Pick around 20 random ids and keep at most 10 of them.
$sql = "SELECT id FROM pages WHERE RAND()*$num_records<20
ORDER BY RAND() LIMIT 0,10";
$rquery = BD_Ejecutar($sql);
$id_in = '';
while (list($id) = mysql_fetch_row($rquery)) {
    if ($id_in) $id_in .= ",$id"; else $id_in = "$id";
}
mysql_free_result($rquery);

// 3.- Fetch the full rows by primary key.
$sql = "SELECT id,url FROM pages WHERE id IN($id_in)";
$rquery = BD_Ejecutar($sql);
while (list($id, $url) = mysql_fetch_row($rquery)) {
    logger("$id, $url", 1);
}
mysql_free_result($rquery);

$time = microtime_float() - $time;

logger("num_records=$num_records", 1);
logger("$id_in", 1);
logger("Time elapsed: <b>$time seconds</b>", 1);
?>

Running this snippet of code produces output like this:

LOG: 71162, http://www.defensordelmenor-and.es/index.html
LOG: 134476, http://www.frikipedia.es/friki/Ostio
LOG: 163562, http://www.asambleaex.es/pagina-3
LOG: 278046, http://www.revistaviajar.es/noticias/articulos/(tag)/fauna/(listado)/galerias
LOG: 359377, http://www.ficod2009.es/ficod/ad/redirect/13881/retransmision7.php
LOG: 365387, http://www.ceeiburgos.es/novedades/noticia
LOG: 412265, http://www.seguros.es/derecho-de-poliza.html
LOG: 428675, http://profesionaleshoy.es/interoute
LOG: 434035, http://www.2ch.es/
LOG: 463519, http://www.cef.es/node/2916
LOG: num_records=451,820
LOG: 365387,463519,134476,71162,434035,412265,359377,278046,163562,428675
LOG: Time elapsed: 0.38 seconds

This code is customized for getting 10 random rows. The second query tries to pick around 20 random ids, limited to a maximum of 10. If you need more than 10, increase both numbers and keep the same ratio, with the first number being double the second.
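For example, a rough adaptation of the second query to return 50 random rows (keeping the ~400,000 row count used above as the multiplier):

-- aim for about 100 candidate ids, then keep 50 of them
SELECT id FROM pages WHERE RAND()*400000<100 ORDER BY RAND() LIMIT 0,50;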