Home > MySQL, PHP, Snippets of Code > Fastest PHP/MySQL algorithm to get random rows from a huge table

Fastest PHP/MySQL algorithm to get random rows from a huge table

If “ORDER BY RAND()” is spoiling your database performance, here it is a much faster alternative algorithm:

This is actually a 3 query solution:

1.- SELECT COUNT(*) FROM pages
2.- SELECT id FROM pages WHERE RAND()*40000
3.- SELECT id,url FROM pages WHERE id IN(24381, 80642, 127498 ,131855, 169079,  187588, 190216, 217914 ,238591, 253789, 311604, 329684, 348518, 377633, 392199, 399577, 427122)

My table calls “pages”. It contains around 400,000 records. MySQL uses 2 Gb of disk for storing the data and 20Mb for the index. It is a MyISAM table.

  • First, it finds how many rows are in the table. This is a very fast step.
  • Second, it chooses a set of random ids. This set does not contain a fixed number of elements but they are around 20.
  • Third, we extract contents from non-indexed data explicitly querying id’s.
<?php
$time= microtime_float();

$sql='SELECT COUNT(*) FROM pages';
$rquery= BD_Ejecutar($sql);
list($num_records)=mysql_fetch_row($rquery);
mysql_free_result($rquery);

$sql="SELECT id FROM pages WHERE RAND()*$num_records<20
ORDER BY RAND() LIMIT 0,10";
$rquery= BD_Ejecutar($sql);
while(list($id)=mysql_fetch_row($rquery)){
if($id_in) $id_in.=",$id"; else $id_in="$id";
}
mysql_free_result($rquery);

$sql="SELECT id,url FROM pages WHERE id IN($id_in)";
$rquery= BD_Ejecutar($sql);
while(list($id,$url)=mysql_fetch_row($rquery)){
logger("$id, $url",1);
}
mysql_free_result($rquery);

$time= microtime_float()-$time;

logger("num_records=$num_records",1);
logger("$id_in",1);
logger("Time elapsed: <b>$time segundos</b>",1);
?>

Running this snippet of code …:

LOG: 71162, http://www.defensordelmenor-and.es/index.html
LOG: 134476, http://www.frikipedia.es/friki/Ostio
LOG: 163562, http://www.asambleaex.es/pagina-3
LOG: 278046, http://www.revistaviajar.es/noticias/articulos/(tag)/fauna/(listado)/galerias
LOG: 359377, http://www.ficod2009.es/ficod/ad/redirect/13881/retransmision7.php
LOG: 365387, http://www.ceeiburgos.es/novedades/noticia
LOG: 412265, http://www.seguros.es/derecho-de-poliza.html
LOG: 428675, http://profesionaleshoy.es/interoute
LOG: 434035, http://www.2ch.es/
LOG: 463519, http://www.cef.es/node/2916
LOG: num_records=451,820
LOG: 365387,463519,134476,71162,434035,412265,359377,278046,163562,428675
LOG: Time elapsed: 0.38 seconds

This code is customized for getting 10 random rows. Second query tries to guess 20 random ids limited by a maximum of 10. If you need more than 10, increase both numbers and keep their ratio being the first the double of the second.

About these ads
  1. 2011/08/02 at 12:18 am

    Just forgot to mention that second query here also uses the “ORDER BY RAND()” mechanism, but take into account that a WHERE clause precedes it, severely limiting the amount of rows to order randomly. So, it doesn’t spoil the performance of the query.

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: