Oct 2004 - sections on stateless architectures and n-tier design added.
PHP is an open source programming language that is widely popular on the web. However, because PHP is so popular in shared hosting environments, many people have the impression that PHP is only for small-scale web sites. This is patently untrue: PHP is in use at many large-scale web sites such as Yahoo and Lufthansa Online Ticketing, and for the creation of large web applications such as IMP. This article is an attempt to redress the balance and show how PHP is used in the enterprise.
Enterprises want to have specific assurances about a web technology they use in the following areas:
- performance and fast development
- reliability and security
- extensibility - able to use industry standards to communicate with other software systems.
- scalability - able to add additional servers as the load increases.
- load balancing - ability to distribute the load so no single server is overloaded
- high availability - ability to survive failure of server components transparently.
- easy maintenance and deployment - as the number of servers increases, enterprises want to be able to automate the management of the software
Let us address each of these issues with respect to PHP.
Performance and Fast Development
PHP is well known for fast performance. Benchmarks such as those done by Chamas and the list at PHPKitchen show that PHP is competitive with, and often faster than, other well-known web technologies. For further information on developing high performance PHP software, refer to the following articles: Optimizing PHP, and Tuning PHP and Apache.
Template systems are also available for content management to separate code from data and its presentation. This normally involves having a template file which contains presentation information (e.g. HTML and macro variables), then extracting some data from a database in PHP and merging the data into the template's macro variables. Templates allow your design team to concentrate on the HTML, while the programmers work independently on the PHP code and the SQL for data extraction. There are many such systems available in PHP; I can personally recommend the Smarty template engine.
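The merge step described above can be illustrated with a minimal sketch. This is not Smarty's implementation or API; the `{name}` macro syntax, the `render_template` function and the sample data are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: merge data into a template's macro variables.
// The {name} placeholder syntax is an assumption, not Smarty's syntax.
function render_template(string $template, array $data): string
{
    $replacements = [];
    foreach ($data as $name => $value) {
        // Escape each value so that data cannot inject markup into the page.
        $replacements['{' . $name . '}'] = htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    }
    // strtr() replaces every known placeholder in a single pass.
    return strtr($template, $replacements);
}

// The template would normally live in its own file, maintained by designers.
$template = '<h1>{title}</h1><p>Welcome, {user}.</p>';

// The data would normally come from a database query.
$row = ['title' => 'Orders', 'user' => 'Ann & Bob'];

echo render_template($template, $row), "\n";
```

The point of the separation is that the template string can change without touching `render_template` or the query that produces `$row`.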
PHP IDEs are available for rapid software development. I personally use DreamWeaver and HomeSite for PHP development. Other IDEs with PHP support include Zend Studio, PHPEdit, Maguma, and NuSphere. Many open source and commercial packages and components are available for PHP. They include components such as phpLens (of which I am the author) for database access and form generation, jpGraph for graphing, vBulletin and phpBB for forum software, and portal development toolkits such as LogiCreate and PostNuke.
Reliability and Security
According to Netcraft, PHP was in use on over 9 million web sites in June 2002. This widespread use is a testimony to the reliability of the core language. However, care should be taken in choosing which extensions to use in your system. As the extensions are developed by a wide range of people, their quality assurance also varies widely. Extensions developed by the core PHP team, such as the MySQL and Oracle oci8 extensions, tend to be the most reliable, as the core team knows PHP best.
PHP has a security record that is as good as that of other famous open source projects such as Apache. Several security groups have performed source code reviews of PHP. Furthermore, PHP can operate in safe mode, where operations on directories, files and permissions are restricted, and PHP functions can be selectively disabled for security reasons. The memory usage of each PHP process can also be capped using the memory_limit configuration option.
If you use Linux, you can also read this overview on Securing Apache on Linux.
Extensibility
Multi-tier architectures are now quite common in enterprise web applications. This normally means that the application is designed with multiple layers: typically the presentation layer, one or more layers representing the business rules or objects, and the database layer. Naturally PHP is strongest at the presentation tier, as it provides extensive support for manipulating HTML and managing HTTP sessions, and there are many templating systems to choose from. I have used PHP to implement the business tiers of an application, but if the business objects have already been developed as Java Beans, COM+ servers or PL/SQL packages, PHP can talk to them also.
PHP has an extensive library of extensions that allow you to connect to LDAP, Java Beans, CORBA, COM+ and .NET servers. A wide range of database clients is also available, from generic ODBC drivers to clients for high-end databases such as Sybase, Informix, Microsoft SQL Server and Oracle. For portability, several database abstraction libraries are available. The two most popular are PEAR DB and ADOdb.
PHP's Stateless Architecture is Scalable
Section added Oct 2004
One key design feature of PHP is that all state and session information can be kept in an external store: a database, or a shared caching mechanism such as msession or memcached. The main bottleneck in a web application is typically the management of state and session information, so by moving this data to a dedicated external store, we are able to make PHP very scalable, as there are no bottlenecks in the core PHP design. In contrast, systems which store information in global variables in memory tend to be less scalable, because every change to these global variables has to be propagated to all servers in the server farm.
In PHP's stateless design, the database or shared store will still remain a bottleneck (as all PHP application servers rely on it), but techniques for designing scalable databases are fairly well known. For example, for the heaviest loads, you could use Oracle Real Application Clusters. For lighter loads, you could use MySQL configured with one central master server that receives all writes, with all data changes replicated to MySQL slave servers that handle read-only transactions.
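On the application side, the master/slave arrangement means deciding, per statement, which server to talk to. The following is a minimal sketch of that routing decision only; the `pick_db_host` function and the host names are hypothetical, and replication itself is configured in MySQL, not in PHP.

```php
<?php
// Hypothetical router: decide which database host a statement is sent to.
// Host names are placeholders for illustration.
function pick_db_host(string $sql, string $master, array $replicas): string
{
    // Treat plain SELECT statements as reads; everything else is a write.
    $isRead = preg_match('/^\s*SELECT\b/i', $sql) === 1;
    if (!$isRead || count($replicas) === 0) {
        return $master;
    }
    // Spread reads across the read-only servers at random.
    return $replicas[array_rand($replicas)];
}

$master   = 'db-master.example.internal';
$replicas = ['db-replica1.example.internal', 'db-replica2.example.internal'];

// Writes always go to the master.
echo pick_db_host('UPDATE users SET name = ? WHERE id = ?', $master, $replicas), "\n";
// Reads go to one of the read-only servers.
echo pick_db_host('SELECT name FROM users WHERE id = ?', $master, $replicas), "\n";
```

A read that must see a write made a moment earlier may still need to go to the master, since a slave can lag behind it; the sketch does not handle that case.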
Care should be taken to ensure that the amount of state and session information stored is minimized. I would recommend no more than 10-20K of session data. Instead of storing all user information in the session variables, store only the most commonly used data (such as user name and permissions). For all other data, just store the primary key to the relevant records in the database, and retrieve the required data on demand.
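As a rough illustration of this advice, the sketch below keeps only the primary key and the most commonly used fields, then checks the serialized size against the upper end of the 10-20K guideline. The `build_session_data` function and the field names are assumptions, not part of any library.

```php
<?php
// Hypothetical helper: keep only frequently used fields plus the primary key.
function build_session_data(array $userRow): array
{
    return [
        'user_id'     => (int) $userRow['id'],   // primary key for on-demand lookups
        'user_name'   => $userRow['name'],
        'permissions' => $userRow['permissions'],
    ];
}

// Sample row standing in for a database record with many more columns.
$userRow = [
    'id'            => 42,
    'name'          => 'ann',
    'permissions'   => ['read', 'write'],
    'order_history' => str_repeat('x', 50000), // bulky data that stays in the database
];

$session = build_session_data($userRow);
$bytes   = strlen(serialize($session));

// Warn when the session grows past the upper end of the guideline.
if ($bytes > 20 * 1024) {
    trigger_error("Session data is {$bytes} bytes", E_USER_WARNING);
}
echo "Session size: {$bytes} bytes\n";
```

Anything not kept in the session, such as the order history here, is fetched by `user_id` only on the pages that need it.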
I would advise caution when using object-relational layers. These tend to impose a high overhead in a dynamic language such as PHP, and result in very large amounts of traffic between the database and the PHP process. This could be acceptable in some systems, but is not advisable if you want your application to enjoy exceptional scalability.
Scalability and Load Balancing
When your first application server becomes overloaded, it is typically because too many users are accessing that one server. The logical move is to upgrade the server. But as the load increases, eventually the need will arise to purchase additional servers, creating a server farm. We then have a choice of configuring the servers identically, or partitioning the servers in the server farm by function, so that certain parts such as forums are handled by one set of servers, authentication by another group of servers, etc.
No matter how our server farm is configured, we will need to distribute the HTTP requests evenly to ensure all servers are given work to do. In the early days of the Internet, round-robin DNS was a popular way of distributing the load. But as with any round-robin system, some servers can become overloaded because some requests require more work than others. So weighted load balancing algorithms were developed, where numeric weights can be dynamically assigned to specific servers to indicate how much load each can handle relative to the other servers.
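To make the idea of weights concrete, here is a small sketch of one possible weighted round-robin scheduler. It is an illustration only, not the algorithm used by any particular load balancer; the class name, server names and weights are assumptions, and in practice this logic runs in the load balancer rather than in PHP.

```php
<?php
// Hypothetical weighted round-robin scheduler ("smooth" variant):
// servers with higher weights are picked proportionally more often.
final class WeightedRoundRobin
{
    /** @var array<string,int> configured weight per server */
    private array $weights;
    /** @var array<string,int> running score per server */
    private array $current = [];

    public function __construct(array $weights)
    {
        if (count($weights) === 0) {
            throw new InvalidArgumentException('At least one server is required');
        }
        $this->weights = $weights;
        foreach ($weights as $server => $weight) {
            $this->current[$server] = 0;
        }
    }

    public function next(): string
    {
        $total = array_sum($this->weights);
        $best  = null;
        foreach ($this->weights as $server => $weight) {
            // Each round, every server earns its weight in score.
            $this->current[$server] += $weight;
            if ($best === null || $this->current[$server] > $this->current[$best]) {
                $best = $server;
            }
        }
        // The chosen server pays back the total, so others get their turn.
        $this->current[$best] -= $total;
        return $best;
    }
}

// web1 is assumed to handle three times the load of web2.
$balancer = new WeightedRoundRobin(['web1' => 3, 'web2' => 1]);
for ($i = 0; $i < 4; $i++) {
    echo $balancer->next(), "\n";
}
```

Over any four consecutive picks, `web1` is chosen three times and `web2` once, which matches the 3:1 weights.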
These load-balancing solutions can be implemented using Cisco's Local Director or an open source solution such as Linux Virtual Server (LVS). LVS can be configured to use one of many different scheduling algorithms to load-balance the server farm.
The Red Hat LVS manual includes a diagram of a typical setup. We configure 2 routers, the active and backup routers, in front of the real servers. By using a polling mechanism called the Heartbeat Channel, LVS monitors the state of the routers. If one router goes down, the other router takes over.
Other PHP issues to consider in a server farm include:
- A global authentication mechanism needs to be in place. You can consider using one based on LDAP, or storing the userids and passwords in a database. I personally find it easier to manage the data when it is stored in a database, but LDAP is supposedly more scalable. You can use the login system of one of the portal toolkits such as PostNuke, or PHP Lib Login, an authentication library with support for groups.
- A mechanism to store server state variables persistently. For example, remembering whether to connect to the main database server or the backup. This can be implemented using application variables. In PHP, application variables are implemented in user-space. See this article on Application Variables in PHP for more information.
- PHP defaults to using files to store session variables. In a server farm, this is not practical, so session variables need to be maintained using a database-backed session handler. The following article shows an example using ADOdb's session handler.
- Ensuring that all server times are synchronized. This can be done by scheduling a regular cron job that runs the rdate command against a time server.
A common question asked in the PHP newsgroups is how scalable PHP is. Unfortunately this sort of question is impossible to answer unless you define the hardware, the skill of your web programmers, how complex your web pages are, and whether you use caching and compression techniques. But in general PHP is competitive with all the other technologies that are commonly used. This case study about the Indianapolis 500, which serves 2.5 million page views on peak days, is impressive.
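The database-backed session handler mentioned in the list above can be sketched with PHP's own `SessionHandlerInterface`. This is not ADOdb's session handler; the `DbSessionHandler` class, the `sessions` table and its columns are assumptions, PHP 8 syntax is assumed, and PDO with an in-memory SQLite database is used only so that the example is self-contained. A real server farm would point every server at the same shared database.

```php
<?php
// Sketch of a database-backed session handler.
// Table and column names are assumptions made for illustration.
final class DbSessionHandler implements SessionHandlerInterface
{
    public function __construct(private PDO $db)
    {
        $this->db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        // Created here only to keep the sketch self-contained.
        $this->db->exec(
            'CREATE TABLE IF NOT EXISTS sessions (
                 id VARCHAR(128) PRIMARY KEY,
                 data TEXT NOT NULL,
                 updated_at INTEGER NOT NULL)'
        );
    }

    public function open(string $path, string $name): bool { return true; }
    public function close(): bool { return true; }

    public function read(string $id): string|false
    {
        $stmt = $this->db->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute([$id]);
        $data = $stmt->fetchColumn();
        // An unknown session id reads as empty session data.
        return $data === false ? '' : (string) $data;
    }

    public function write(string $id, string $data): bool
    {
        // Delete-then-insert inside a transaction acts as a portable upsert.
        $this->db->beginTransaction();
        try {
            $this->db->prepare('DELETE FROM sessions WHERE id = ?')->execute([$id]);
            $this->db->prepare('INSERT INTO sessions (id, data, updated_at) VALUES (?, ?, ?)')
                     ->execute([$id, $data, time()]);
            return $this->db->commit();
        } catch (Throwable $e) {
            if ($this->db->inTransaction()) {
                $this->db->rollBack();
            }
            return false;
        }
    }

    public function destroy(string $id): bool
    {
        return $this->db->prepare('DELETE FROM sessions WHERE id = ?')->execute([$id]);
    }

    public function gc(int $max_lifetime): int|false
    {
        // Remove sessions that have not been written to within the lifetime.
        $stmt = $this->db->prepare('DELETE FROM sessions WHERE updated_at < ?');
        $stmt->execute([time() - $max_lifetime]);
        return $stmt->rowCount();
    }
}

// Register the handler before session_start(); the DSN is a placeholder.
$handler = new DbSessionHandler(new PDO('sqlite::memory:'));
session_set_save_handler($handler, true);
```

Because no session data lives on the local disk, any server in the farm can serve the next request from the same user.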
High Availability
High availability is often defined as having no single point of failure in the system. This is why we have backup routers in LVS or Local Director. Each specialized form of server would need to be configured to have a backup, including your database, LDAP and time servers. This is also why you should install two network cards on each server, and set up a parallel network so that even if one network fails, you have a backup. According to this excellent PDF article on Giant-Scale Web Services, staff tripping over cables is a common cause of network failure!
LVS provides load balancing and high availability for the routers. However, it does not provide failover support for the web (or database) servers; that is, it does not decide what we do when a failure occurs. LVS detects when a server goes down and removes it from the active server list. But LVS does not notify all the other servers that a specific server (such as your main database server) is down.
There are clustering solutions (such as Red Hat Clusters) that provide such failover support, but it is probably simpler to configure this in your PHP application rather than let a clustering solution handle it. I recommend that each application server be configured identically. Then whenever an application server fails, it is easy for LVS to select another application server to take over because (a) all servers are identical, and (b) the state information is not stored locally in any application server.
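Handling failover inside the PHP application can be as simple as trying the main database server first and falling back to the backup. The sketch below shows only that decision; the `connect_with_failover` function, the host names and the stand-in connector are hypothetical, and a real connector would call whichever database library the application uses.

```php
<?php
// Hypothetical failover helper: try each database host in order and return
// the first connection that succeeds. $connect is any callable that returns
// a connection or throws on failure.
function connect_with_failover(array $hosts, callable $connect): array
{
    $errors = [];
    foreach ($hosts as $host) {
        try {
            return ['host' => $host, 'conn' => $connect($host)];
        } catch (Throwable $e) {
            // Remember why this host failed, then move on to the next one.
            $errors[] = "{$host}: {$e->getMessage()}";
        }
    }
    throw new RuntimeException('All database hosts failed: ' . implode('; ', $errors));
}

// Stand-in connector that simulates the main server being down.
$connect = function (string $host) {
    if ($host === 'db-main.example.internal') {
        throw new RuntimeException('connection refused');
    }
    return "connected to {$host}";
};

$result = connect_with_failover(
    ['db-main.example.internal', 'db-backup.example.internal'],
    $connect
);
echo $result['host'], "\n"; // prints "db-backup.example.internal"
```

Since every application server runs the same code with the same host list, each one reaches the same backup independently, without needing to be told that the main server is down.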
One limitation of LVS or Local Director is that it does not provide data synchronization services between servers. So you need to configure your database to replicate your data (or transaction logs) from your main database server to your backup server. It is also important to use the commit and rollback features of your database when updating multiple records to ensure data integrity.
Red Hat's Advanced Server documentation is an excellent overview of setting up an LVS cluster for high availability.
Multiple Tier Development
Section added Oct 2004.
Some web application architectures recommend a 3-tier system, consisting of web/proxy servers, application servers which hold the business logic, and database servers. This gives better security (as external users are not aware of the location of the application servers with the sensitive code), and enhances scalability.
In this scenario, the web/proxy servers do not have PHP configured, but merely serve as proxies to the actual application servers, which have Apache/IIS and PHP installed. One common proxy server is Squid; another alternative is to use Apache with mod_proxy installed. Proxy servers work by forwarding HTTP requests to the application server, and sending the responses back to the web client.
Because PHP works best in a stateless environment, I do not recommend having more than 3 tiers, because extra tiers add complexity, making it harder to ensure that your web application is really scalable.
Maintenance and Deployment
We need to set up a maintenance infrastructure once we have a server farm. Things to monitor include:
- DNS and router status.
- Database status and response time
- File and/or directory sizes. Ensure that given files (e.g. log files) don't grow too large, and that directories that should be cleaned out periodically (e.g. mail server queues) are handled properly.
- Application server performance. Measure bytes transferred/minute, hits/minute, and other performance measures.
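As one example from the list above, the file size check can be sketched in a few lines of PHP and run from cron. The `find_oversized_files` function, the directory and the size limit are assumptions chosen for illustration.

```php
<?php
// Hypothetical check: report files in a directory that exceed a size limit.
// Returns an array of path => size in bytes.
function find_oversized_files(string $dir, int $maxBytes): array
{
    $oversized = [];
    // glob() may return false on error, so fall back to an empty list.
    foreach (glob(rtrim($dir, '/') . '/*') ?: [] as $path) {
        if (is_file($path) && filesize($path) > $maxBytes) {
            $oversized[$path] = filesize($path);
        }
    }
    return $oversized;
}

// The directory and the 100 MB limit are placeholders.
foreach (find_oversized_files('/var/log/httpd', 100 * 1024 * 1024) as $path => $bytes) {
    echo "WARNING: {$path} is {$bytes} bytes\n";
}
```

The same pattern extends to the other items: each check returns a list of problems, and the caller decides whether to log them or raise an alert.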
There are many commercial tools available to perform this monitoring, for example SiteScope. And now with SOAP over SSL, it is possible to connect your monitoring system to a response system that remotely executes shell commands to perform system maintenance. Application deployment can be done using scripts that automate ftp or scp to your servers. The Giant-Scale Web Services PDF has good advice on issues to consider.
You can use tools such as rsync, scp or sftp to synchronize your files. The PEAR group has also developed a cross-platform installer that can be used for deploying PHP packages. At this point in time it is still in beta, but by late 2002, it should be in general use.
Comparisons with J2EE are Futile
The goals of PHP are different from J2EE. J2EE is a general software environment suitable for many non-web applications. PHP is designed specifically to work in a shared-nothing web-specific environment. If you have a check-list of features for comparison between the two systems, it is inevitable that J2EE will win the check-list competition. Using PHP well doesn't mean translating J2EE patterns into PHP, but using a simpler, stateless architecture to achieve your goals.
If you believe in the KISS principle, I think you will find PHP is a better fit than J2EE for your needs.
Why should IT managers consider using PHP? Price is not the only issue for enterprises. Other factors we have discussed above play an important part, and I hope I have shown you how PHP can work in a scalable, high-availability environment. PHP is a very productive programming environment that requires minimal training for programmers familiar with C-style languages, and many high quality PHP components are available. Enterprise PHP also does not mean that you must do everything in PHP. You should leverage your existing infrastructure, be it Java, PL/SQL, or VB. But from my experience, you'd be surprised how much PHP can do.
Software Engineering Practices for Large Scale PHP Projects, presented at PHPCon 2002 by J. Scott Johnson
Microsoft's advice on Building a Highly Available and Scalable Web Farm
Overspending on Java?
Is Java really more scalable than PHP? The advantages of a shared-nothing approach to scalability.
Advanced PHP Caching Strategies