Fastest PHP/MySQL algorithm to get random rows from a huge table
If “ORDER BY RAND()” is spoiling your database performance, here it is a much faster alternative algorithm:
This is actually a 3 query solution:
1.- SELECT COUNT(*) FROM pages
2.- SELECT id FROM pages WHERE RAND()*40000
3.- SELECT id,url FROM pages WHERE id IN(24381, 80642, 127498 ,131855, 169079, 187588, 190216, 217914 ,238591, 253789, 311604, 329684, 348518, 377633, 392199, 399577, 427122)
My table calls “pages”. It contains around 400,000 records. MySQL uses 2 Gb of disk for storing the data and 20Mb for the index. It is a MyISAM table.
- First, it finds how many rows are in the table. This is a very fast step.
- Second, it chooses a set of random ids. This set does not contain a fixed number of elements but they are around 20.
- Third, we extract contents from non-indexed data explicitly querying id’s.
<?php $time= microtime_float(); $sql='SELECT COUNT(*) FROM pages'; $rquery= BD_Ejecutar($sql); list($num_records)=mysql_fetch_row($rquery); mysql_free_result($rquery); $sql="SELECT id FROM pages WHERE RAND()*$num_records<20 ORDER BY RAND() LIMIT 0,10"; $rquery= BD_Ejecutar($sql); while(list($id)=mysql_fetch_row($rquery)){ if($id_in) $id_in.=",$id"; else $id_in="$id"; } mysql_free_result($rquery); $sql="SELECT id,url FROM pages WHERE id IN($id_in)"; $rquery= BD_Ejecutar($sql); while(list($id,$url)=mysql_fetch_row($rquery)){ logger("$id, $url",1); } mysql_free_result($rquery); $time= microtime_float()-$time; logger("num_records=$num_records",1); logger("$id_in",1); logger("Time elapsed: <b>$time segundos</b>",1); ?>
Running this snippet of code …:
LOG: 71162, http://www.defensordelmenor-and.es/index.html
LOG: 134476, http://www.frikipedia.es/friki/Ostio
LOG: 163562, http://www.asambleaex.es/pagina-3
LOG: 278046, http://www.revistaviajar.es/noticias/articulos/(tag)/fauna/(listado)/galerias
LOG: 359377, http://www.ficod2009.es/ficod/ad/redirect/13881/retransmision7.php
LOG: 365387, http://www.ceeiburgos.es/novedades/noticia
LOG: 412265, http://www.seguros.es/derecho-de-poliza.html
LOG: 428675, http://profesionaleshoy.es/interoute
LOG: 434035, http://www.2ch.es/
LOG: 463519, http://www.cef.es/node/2916
LOG: num_records=451,820
LOG: 365387,463519,134476,71162,434035,412265,359377,278046,163562,428675
LOG: Time elapsed: 0.38 seconds
This code is customized for getting 10 random rows. Second query tries to guess 20 random ids limited by a maximum of 10. If you need more than 10, increase both numbers and keep their ratio being the first the double of the second.
Just forgot to mention that second query here also uses the “ORDER BY RAND()” mechanism, but take into account that a WHERE clause precedes it, severely limiting the amount of rows to order randomly. So, it doesn’t spoil the performance of the query.