
Divide-and-conquer and parallel processing in PHP #

In my previous post Easy Parallel Processing in PHP, I showed you how to implement parallel batch processing using PHP and a web server. In this post, I want to discuss partitioning your tasks so that they become easily parallelized.

The strategy I prefer is divide-and-conquer. This works by recursively breaking down a problem into two or more sub-problems of the same type, until these become simple (and fast) enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.

To illustrate with an example, let's say you have millions of financial payment records in a database that you want to process in parallel using PHP:

  1. First, group your data into logical chunks that each need to be processed in one transaction. If you are processing payments for many accounts, then grouping them by account number makes sense.
  2. Decide how many parallel child processes you want to run. For this example, assume we are on a single dual-core CPU server, so it makes sense to run only two concurrent child processes.
  3. Split the records into two halves at the median account number (the median is a statistical term that means the middle value in a sorted range). A helper function for finding the median is shown below.
  4. From the parent process, pass the first child process the range from the first record up to median-1, and pass the second child process the range from the median to the final record, as $_GET parameters.
  5. For simplicity's sake, both child processes run the same code but receive different parameters (see the worker sketch after this list). The results of the processing can either be stored in the database or returned to the parent process by echoing them in the child process.
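
To make step 5 concrete, here is a minimal sketch of what such a worker script might look like, using ADOdb (introduced below). The script name (process_payments.php), the connection details and the lo/hi parameter names are assumptions for illustration; only the pattern of reading the account range from $_GET and echoing a result back to the parent matters:

	// process_payments.php -- hypothetical worker script (step 5)
	// It receives its share of the account range as $_GET parameters.
	$lo = (integer) $_GET['lo'];   // first account number in this child's range
	$hi = (integer) $_GET['hi'];   // first account number beyond this child's range

	$db = NewADOConnection('mysqli');               // placeholder connection details
	$db->Connect('localhost', 'user', 'pwd', 'payments');

	$rs = $db->Execute("select * from PAYMENTS where ACCOUNTNO >= $lo and ACCOUNTNO < $hi order by ACCOUNTNO");
	$count = 0;
	while ($rs && !$rs->EOF) {
		// ... process one payment record here ...
		$count += 1;
		$rs->MoveNext();
	}
	echo "$count payments processed for accounts $lo to $hi";   // returned to the parent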

To find the median of a set of records in a database, I have extended ADOdb, the popular PHP open source database abstraction library I maintain, with the following function defined in the ADOConnection class:

	// Returns the median value of $field in $table, or false on failure.
	// $where is optional and should include the "where" keyword if used.
	function GetMedian($table, $field, $where = '')
	{
		$total = $this->GetOne("select count(*) from $table $where");
		if (!$total) return false;
	
		// fetch the single row sitting in the middle of the ordered result set
		$midrow = (integer) ($total/2);
		$rs = $this->SelectLimit("select $field from $table $where order by 1", 1, $midrow);
		if ($rs && !$rs->EOF) return reset($rs->fields);
		return false;
	}

If you have a quad-core CPU, then you can call GetMedian 3 times to break up the data into 4 approximately equal parts, and pass them to 4 child processes:

  $mid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO');
  if (!$mid) return 'Error';
  $lomid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO', "where ACCOUNTNO < $mid");
  $himid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO', "where ACCOUNTNO >= $mid");
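
Here is a rough sketch of how the parent could then hand these four ranges to the workers. It assumes the hypothetical process_payments.php worker sketched earlier and the JobStartAsync() helper from the Easy Parallel Processing post:

  // Sketch only: dispatch four children, one per quarter of the account range.
  // 0 and PHP_INT_MAX are stand-ins that assume positive integer account numbers.
  $ranges = array(
    array(0,      $lomid),
    array($lomid, $mid),
    array($mid,   $himid),
    array($himid, PHP_INT_MAX),
  );

  $handles = array();
  foreach ($ranges as $r) {
    list($lo, $hi) = $r;
    $handles[] = JobStartAsync('localhost', "/jobs/process_payments.php?lo=$lo&hi=$hi");
  }
  // poll $handles with JobPollAsync() until all of them return false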

The above GetMedian function is not particularly optimal when you need to run it multiple times on the same dataset. Improvements are left to the reader (or to a future blog entry).
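
For the curious, one possible direction (just a sketch, not part of ADOdb) is a helper that computes all the split points in one call, so the count(*) query runs only once rather than once per GetMedian() call:

	// Sketch: return $n-1 boundary values that split $table into $n roughly equal parts.
	// $db is an ADOConnection; $where, if supplied, should include the "where" keyword.
	function GetSplitPoints($db, $table, $field, $n, $where = '')
	{
		$total = $db->GetOne("select count(*) from $table $where");
		if (!$total) return false;
	
		$points = array();
		for ($i = 1; $i < $n; $i++) {
			$offset = (integer) ($total * $i / $n);
			$rs = $db->SelectLimit("select $field from $table $where order by 1", 1, $offset);
			if ($rs && !$rs->EOF) $points[] = reset($rs->fields);
		}
		return $points;   // e.g. for $n = 4: the 25th, 50th and 75th percentile values
	}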

PS: Another strategy for parallelization, popularised by Google, is MapReduce.

Easy Parallel Processing in PHP #

The proliferation of multicore CPUs and the inability of our learned CPU vendors to squeeze many more GHz into their designs mean that often the only way to get additional performance is by writing clever parallel software.

One problem we were having was that some of our batch processing jobs were taking too long to run. To speed up the processing, we tried splitting the processing file in half and letting a separate PHP process run each half. Given that we were using a dual-core server, each process would be able to run close to full speed (subject to I/O constraints).

Here is our technique for running multiple parallel jobs in PHP. In this example, we have two job files, j1.php and j2.php, that we want to run. The sample jobs don't do anything fancy. The file j1.php looks like this:

$jobname = 'j1';
set_time_limit(0);   // let the job run longer than PHP's default time limit
$secs = 60;

while ($secs) {
        echo $jobname,'::',$secs,"\n";
        flush(); @ob_flush();  ## make sure that all output is sent in real-time
        $secs -= 1;
        sleep(1); // pause for one second
}

The file j2.php is identical, except that $jobname is set to 'j2'. The reason we call flush(); @ob_flush(); is that when we echo or print, the strings are sometimes buffered by PHP and not sent until later. These two functions ensure that all output is sent immediately.

We then have a third file, control.php, which coordinates jobs j1.php and j2.php. This script calls j1.php and j2.php asynchronously using fsockopen() in JobStartAsync(), so we are able to run them in parallel. The output from j1.php and j2.php is returned to control.php using JobPollAsync().

#
# control.php
#
// starts an HTTP GET request to $url on $server without waiting for the response;
// returns a non-blocking socket handle to poll with JobPollAsync(), or false on failure
function JobStartAsync($server, $url, $port=80, $conn_timeout=30, $rw_timeout=86400)
{
	$errno = 0;
	$errstr = '';
	
	set_time_limit(0);
	
	$fp = fsockopen($server, $port, $errno, $errstr, $conn_timeout);
	if (!$fp) {
	   echo "$errstr ($errno)<br />\n";
	   return false;
	}
	$out = "GET $url HTTP/1.1\r\n";
	$out .= "Host: $server\r\n";
	$out .= "Connection: Close\r\n\r\n";
	
	stream_set_blocking($fp, false);    // reads from this socket will not block
	stream_set_timeout($fp, $rw_timeout);
	fwrite($fp, $out);
	
	return $fp;
}

// returns false if HTTP disconnect (EOF), or a string (could be empty string) if still connected
function JobPollAsync(&$fp) 
{
	if ($fp === false) return false;
	
	if (feof($fp)) {
		fclose($fp);
		$fp = false;
		return false;
	}
	
	return fread($fp, 10000);
}

###########################################################################################

 
if (1) {  /* SAMPLE USAGE BELOW */

	$fp1 = JobStartAsync('localhost','/jobs/j1.php');
	$fp2 = JobStartAsync('localhost','/jobs/j2.php');
	
	
	while (true) {
		sleep(1);
		
		$r1 = JobPollAsync($fp1);
		$r2 = JobPollAsync($fp2);
		
		if ($r1 === false && $r2 === false) break;
		
		echo "<b>r1 = </b>$r1<br>";
		echo "<b>r2 = </b>$r2<hr>";
		flush(); @ob_flush();
	}
	
	echo "<h3>Jobs Complete</h3>";
}

And the output could look like this:

r1 = HTTP/1.1 200 OK
Date: Wed, 03 Sep 2008 07:20:20 GMT
Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.8d
X-Powered-By: Zend Core/2.5.0 PHP/5.2.5
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

7 j1::60


r2 = HTTP/1.1 200 OK
Date: Wed, 03 Sep 2008 07:20:20 GMT
Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.8d
X-Powered-By: Zend Core/2.5.0 PHP/5.2.5
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

7 j2::60
----
r1 = 7 j1::59

r2 = 7 j2::59
----
r1 = 7 j1::58

r2 = 7 j2::58

----

Note that "7 j2::60" is returned by JobPollAsync(). The leading 7 appears because the response uses HTTP chunked transfer encoding, which prefixes each chunk of the payload with its length in hexadecimal ("j2::60\n" is 7 bytes long).
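
If you only want to see the job output, you can strip these chunk-size lines from whatever JobPollAsync() returns. Here is a rough helper (not part of the original scripts, and a simplification: it does not remove the HTTP response headers from the first read, and a real client would parse the chunk sizes properly rather than pattern-match them):

function StripChunkSizes($s)
{
	// drop any line that consists solely of hexadecimal digits (a chunk-size line)
	return preg_replace('/^[0-9a-fA-F]+\r?\n/m', '', $s);
}

// e.g. echo StripChunkSizes($r1);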

I hope this was helpful. Have fun!

PS: Also see Divide-and-conquer and parallel processing in PHP, and see popen for an alternative technique.

Microsoft contributes to LGPL project for first time: ADOdb mssqlnative drivers #

Last week, I got an email from Garrett Serack, a Microsoft Open Source Community Developer. Microsoft have been kind enough to donate a set of ADOdb drivers for the new MSSQL Native Extension for PHP. You can download the extension here and the ADOdb drivers here.

Garrett also mentions that ADOdb is the first LGPL project that Microsoft has ever contributed to. I quote from his email to me:

ADODB is actually the first LGPL Open Source project that Microsoft has ever contributed to. 
We've got a dozen or so others lined up and ready to go to other open source PHP projects 
(GPL, BSD and others), But ADODB was the *FIRST*. You could say that contributing to ADODB 
is Microsoft going from zero to one.

We announced it at OSCON, (see the post at http://port25.technet.com/archive/2008/07/25/oscon2008.aspx )
along with Microsoft becoming a platinum sponsor of the Apache Software Foundation. Either of
these two steps is such a good move for Microsoft, and both together, is a good sign that the 
Company is learning.

Thanks Garrett.

Story in The Register.

PS: ADOdb is dual licensed as LGPL and BSD. Choose which license you want.