Update: Here’s a link to the project page on drupal.org
Every Drupal installation requires regular actions to perform maintenance tasks such as cleaning up log files, checking for updates, and updating the site’s search index. More often than not, the Unix-based cron daemon is used to run these actions, which is why the task is often referred to as a cron run. Since larger sites have more maintenance tasks to perform, the cron run often times out or hangs on a particular function, preventing some operations from completing. A common, albeit hacky, solution is to create custom cron implementations to separate out the different tasks. With the release of the Cron Multi-Threaded module I developed, the need for custom implementations is eliminated. This post will explain the inspiration behind the module as well as the technical details of how it increases efficiency and adds reliability to the Drupal cron run.
I recently attended an inspiring meeting sponsored by bostonphp.org that centered on the ground-breaking technology behind the Barack Obama presidential campaign. More specifically, the company “Blue State Digital” spoke of how they used PHP and MySQL to digitize the canvassing process and send a staggering number of emails to voters. They addressed the various scaling issues that came along with storing over one billion rows of data in a MySQL database and gave some insight into how PHP was scaled in the unique environments it was running in.
Although websites were the interface canvassers and supporters used to organize and send donations, a lot of applications were built outside of the web space to handle various tasks such as personalizing emails, sending batch messages, and aiding in database replication. All of the daemons were written purely in PHP and utilized the “process control” extension to create true SMP solutions providing near-linear scalability (in other words, doubling their hardware allowed them to process roughly twice as much data).
The acronym SMP stands for “symmetric multiprocessing”. In systems that have multiple CPUs and use an SMP architecture, tasks can be moved between processors to balance the workload efficiently. Since webpage scripts usually run for less than a second, SMP systems can distribute the requests across their processors. However, the load of a single PHP process cannot be split across processors since the language has no native support for multi-threading. In processes such as Drupal’s cron run, which may take minutes to complete, a single processor could be tied up for some time while the others remain idle.
As Blue State Digital did with their applications, the Cron Multi-Threaded module for Drupal utilizes the process control extension to fork the process running the PHP script. In computing terms, forking refers to a process making a copy of itself. The resulting replica is called the child process, and it is free to be distributed to another CPU by the system. This technique allows Cron Multi-Threaded to assign different tasks to the child processes, enabling the system to handle the Drupal cron run much more efficiently.
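The fork-and-wait pattern described above can be sketched in a few lines. This is a minimal illustration rather than the module's code: it uses Python's `os.fork`, which wraps the same Unix system call as PHP's `pcntl_fork()`, and the helper name `run_in_child` is hypothetical.

```python
import os


def run_in_child(task):
    """Fork, run task() in the child, and report in the parent whether it succeeded.

    Hypothetical helper for illustration; Unix-only, like pcntl_fork().
    """
    pid = os.fork()
    if pid == 0:
        # Child process: a copy of the parent that runs only the given task.
        code = 1
        try:
            task()
            code = 0
        finally:
            # Leave immediately so the child never continues the parent's code.
            # Note that os._exit() does not flush buffered output.
            os._exit(code)
    # Parent process: wait for this child and inspect its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

Here the parent blocks until the child finishes, so nothing runs in parallel yet; the point is only that the task executes in a separate process whose failure cannot take the parent down with it.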
Cron MT first compiles a list of the installed modules that have maintenance tasks to perform. It then takes a module off the stack and forks itself. The child process executes the maintenance operation while the parent process pulls another module from the stack, repeating the cycle. The site administrator can configure the number of processes that are allowed to run at once so as not to overload the system, but conceivably the individual tasks can be processed by separate CPUs at the same time. If one operation hangs, it will not prevent the other ones from running since they are executed separately. The only job of the parent process is to dispatch tasks to its children, thus eliminating the Achilles’ heel of the Drupal cron run.
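The dispatch cycle above might look roughly like the following sketch. It is an assumption-laden illustration, not Cron MT's actual implementation: it is written in Python with `os.fork` standing in for `pcntl_fork()`, and the names `dispatch`, `tasks`, and `max_children` are hypothetical.

```python
import os


def dispatch(tasks, max_children=2):
    """Run each task in its own forked child, at most max_children at a time.

    tasks maps a (hypothetical) module name to a callable. Returns a dict
    mapping each name to True if its child exited cleanly, else False.
    """
    pending = list(tasks.items())  # the stack of work still to hand out
    running = {}                   # child pid -> task name
    results = {}

    def reap_one():
        # Block until any child exits, then record how it ended.
        pid, status = os.wait()
        name = running.pop(pid, None)
        if name is not None:
            results[name] = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

    while pending:
        # Respect the configured limit before forking another child.
        while len(running) >= max_children:
            reap_one()
        name, task = pending.pop()
        pid = os.fork()
        if pid == 0:
            # Child: run exactly one task, then exit without returning.
            code = 1
            try:
                task()
                code = 0
            finally:
                os._exit(code)
        running[pid] = name  # parent only keeps track and dispatches

    while running:
        reap_one()
    return results
```

A failing task only affects its own child, so the remaining tasks still run. This sketch has no timeout, so a task that hangs forever would keep occupying one of the `max_children` slots.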
Blue State Digital has proved that PHP can yield enterprise-level scalability in the most critical environments. With the scope of PHP applications expanding, there is room for Drupal to emerge as a platform used to build applications outside of the web space. By implementing the techniques used in Cron Multi-Threaded, the performance increases gained will allow Drupal to compete in areas currently monopolized by other traditional programming languages.




11 responses so far ↓
1 Carl McDade // Apr 20, 2009 at 4:07 pm
Err, This is not a web oriented technique which is what the greater percentage of Drupalers would be using. There is no module that can be run in a browser that will do this nor is there anything in mod_PHP or PHP CGI.exe that would support this since pcntl_fork and exec() are not compiled on most servers.
You did not describe the technique in detail but I can guess that you are just calling the cron.php script using PHP CLI then forking the process with pcntl_fork().
Another technique would be to run cron.php in a distributed FastCGI process on another machine.
2 Carl McDade // Apr 20, 2009 at 4:21 pm
Hmm,
I forgot one other technique which might be run as a module but is not really stable or popular. This is to create a daemon script using PHP and then run some code on it.
I would not recommend this to everyone though as it most likely requires connecting to the Server API/OS.
3 Robert Douglass // Apr 20, 2009 at 4:41 pm
Is there a link to this module? Sounds awesome.
4 nirad // Apr 20, 2009 at 4:41 pm
can you link to the project page on drupal.org? thanks.
5 Benjamin Melançon // Apr 20, 2009 at 5:22 pm
Sounds great! Have a link to the project?
6 Jim // Apr 21, 2009 at 7:24 am
Added to DrupalSightings.com
7 Chris Pliakas // Apr 21, 2009 at 9:51 am
Sorry. Dropped the ball on that one. The project page is located at http://drupal.org/project/cron_mt.
8 Chris Pliakas // Apr 21, 2009 at 10:56 am
Carl,
Thanks for your post, and you raise valid points. If you check out the project page, Cron MT never claims to be run within the web space nor does it ignore the need for the PCNTL extension. In fact, it will display an error notifying the user that it must be run via CLI if you try to do so otherwise.
I will disagree with you that the CLI nature of the module is a barrier for most Drupalers. The majority of websites set up a cron job to call cron.php through some utility that executes HTTP requests, such as wget or lynx. Cron MT comes bundled with a script that may be called directly by cron without the need for these utilities, actually making it easier to set up. Furthermore, Cron MT fully integrates with Drush, which is a CLI for Drupal.
Also, the module does not work as illustrated in your post. Cron MT does not simply fork the cron.php script (which wouldn’t work anyways), but rather executes the different hook_cron() implementations in separate processes allowing each function call to run in parallel on different CPUs. Running a distributed cron on a separate machine would still execute in a single threaded environment, again running into the same limitations of core cron. Each cron run would have to process one hook implementation at a time on a single CPU eliminating the scalability and fault tolerance that Cron MT provides. In addition, this module allows you to take advantage of the hardware you have without having to purchase a separate machine making it a more cost-effective solution. It also avoids the overhead of setting up and maintaining the distributed cron implementation.
In terms of writing a custom PHP daemon… isn’t that what the Linux cron utility accomplishes? If you have to write a PHP daemon to execute certain tasks outside of Drupal, then you are really operating outside of the scope of what Drupal cron does.
With that being said, Cron MT will not satisfy everyone’s needs, but it addresses common problems we have encountered in our larger sites. This solution uses techniques that have been tried and tested in extremely CPU intensive environments, so I am trying to bring it into the Drupal community to improve upon the code base.
9 Carl McDade // Apr 21, 2009 at 1:36 pm
I like the idea of this but as a module it may lead people to think that this is a simple and common solution. It has a list of things that could go wrong that would require in-depth knowledge of the webserver, OS and hardware.
For a short list.
pcntl_fork is not good at handling child processes so doing this on an Apache server using mod_php will cause some instabilities that need to be watched or it may not work at all.
If you use FastCGI and pcntl_fork you may have to adjust your available fastcgi children so that pcntl_fork does not try to push past the set FastCGI limit and cause your system to freeze.
pcntl_fork is only recommended for CGI which is slower than FastCGI. The trade off may not be worth it in most cases.
pcntl_fork is a *nix-only PHP construct so Windows users will not be able to use the module.
I guess what I am trying to say is, do you think that it is worth all the warning labels and support calls if it becomes popular?
There are many on shared hosting and VPS that will be tempted to try this in an effort to speed things up.
Other than those things the code is good!
10 Chris Pliakas // Apr 21, 2009 at 3:08 pm
Hi Carl.
Thank you for replying, and again your points are well taken. However, I am unsure as to what your argument is. Cron MT doesn’t claim to be a universal solution. The requirements are stated clearly on the project page.
To respond to your list:
1) Cron MT will never be run as an Apache module. It won’t let you. Cron MT also manages the child processes very effectively in my experience.
2) If you look at the code, you will see that Cron MT places a configurable limit on how many processes may be forked. If the limit is reached, the module waits for a process to become available before forking another one. No need to worry about configuring your server. You can set the limits via Drupal.
3) This sounds very theoretical. Blue State Digital created the entire technology stack behind the Barack Obama campaign entirely in PHP using the same techniques utilized by Cron MT. Performance and scalability are two completely separate items.
4) You are absolutely correct. This is stated clearly on the project page.
Again, I agree with you 100% that this is not a universal solution. It is not marketed as such, but I sincerely apologize if it doesn’t come off this way and appreciate you bringing attention to it. Part of the reason why we develop on top of open source software is because it often provides us with the flexibility to utilize functionality not available on proprietary platforms. That is a completely separate conversation, though, and I don’t want to start a holy war.
11 Carl McDade // Apr 21, 2009 at 3:38 pm
I am curious about one last thing that I could not find in the PHP docs or in your code. When an SMP machine is involved how are you setting the child process on another CPU? Since all child processes automatically inherit the parent’s affinity mask they will be locked on the same CPU.
I know that you can change the affinity mask in ASP.NET or C (Linux) but I could not find anything on POSIX functions that would attach to the kernel’s CPU scheduler or use CPU_SET. How are you guys accomplishing this?