On Windows 2000, it's easy to look at statistics with the Performance Monitor. If you're into watching pretty graphs, it can even get pretty addictive. Unix also provides some decent tools to do the analysis you need to tune your system. You just need to learn how to read the hieroglyphics.
I had to learn to diagnose performance bottlenecks quickly because we were doing an Oracle data migration for a client on a Sun server. Based on our initial benchmarks, the server performed so slowly that we figured we would be a month behind schedule. So we got deep into the system and managed to figure out the bottlenecks. It turned out one of the hard disks was over-utilized and was going crazy with the workload. Spreading the load made everything run just fine. This document will try to show you some of the tools you can use to tune your system.
First the tools:
ps provides process information
vmstat provides paging and CPU utilization info. It also provides disk utilization data for 4 devices, but if you have many hard disks on your system, I prefer iostat.
iostat provides disk i/o info.
netstat provides network utilization data.
ab is apachebench which simulates multiple web browsers. A good networking and application server test.
PS
On Solaris, try /usr/ucb/ps uax. Similarly on Linux, use ps uax. This command gives you the percentage CPU and Memory used.
USER       PID %CPU %MEM   SZ  RSS TT      S    START  TIME COMMAND
root     16755  0.1  1.0 1448 1208 pts/0   O 17:33:35  0:00 /usr/ucb/ps uax
root         3  0.1  0.0    0    0 ?       S   May 24  6:19 fsflush
root         1  0.1  0.6 2232  680 ?       S   May 24  3:10 /etc/init -
root       167  0.1  1.3 3288 1536 ?       S   May 24  1:04 /usr/sbin/syslogd
root         0  0.0  0.0    0    0 ?       T   May 24  0:16 sched
root         2  0.0  0.0    0    0 ?       S   May 24  0:00 pageout
gdm      14485  0.0  0.9 1424 1088 pts/0   S 16:17:57  0:00 -csh
VMSTAT
Run vmstat to get the following on a SunOS server:
 procs      memory             page             disk        faults       cpu
 r b w    swap  free  re  mf  pi  po  fr de sr f0 s0 s1 s2   in   sy   cs us sy id
 0 1 0 2011344 50640  23   1 381 192 194  0  9  0  2 18  3  339   52  403  5  2 93
and on Linux:
   procs                  memory     swap       io     system        cpu
 r  b  w  swpd  free   buff  cache  si  so   bi   bo   in   cs  us  sy  id
 0  0  0     0  5220 103956 752848   0   0    0    0  139   68   0   0 100
The key figures for CPU utilization are (us sy id), with "id" being the percentage CPU that is idle. A value below 10% is great. If you have multiple CPUs, this is an average.
On SunOS, pi and po are the amounts of memory paged in and out (in KB/s), while on Linux si and so report the memory swapped in and out per second. High po values indicate that there is not enough real RAM, and memory is being paged out; this is bad of course. In this example on SunOS, our server is paging out 192 KB a second, or about 0.2 MB, which is still OK. The Sun server has 2 GB of memory, of which 50 MB is still free; the rest has been pre-allocated for Oracle. If the free memory were higher, we could allocate more of it to Oracle.
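As a rough illustration of reading these columns in a script, the following Python sketch pairs the header names from the SunOS sample above with its values. The function name parse_vmstat and the "sy_2" key used for the second sy column are my own choices for this example, not anything vmstat itself provides.

```python
def parse_vmstat(header: str, values: str) -> dict:
    """Pair vmstat column names with the numbers from one data line."""
    names = header.split()
    numbers = [int(v) for v in values.split()]
    if len(names) != len(numbers):
        raise ValueError("header and data have different column counts")
    row = {}
    for name, number in zip(names, numbers):
        # "sy" appears twice (system calls, then system CPU time).
        # The first occurrence keeps its name; later ones become "sy_2", etc.
        key = name
        count = 2
        while key in row:
            key = f"{name}_{count}"
            count += 1
        row[key] = number
    return row


# Header and data line copied from the SunOS sample in this article.
HEADER = "r b w swap free re mf pi po fr de sr f0 s0 s1 s2 in sy cs us sy id"
SAMPLE = "0 1 0 2011344 50640 23 1 381 192 194 0 9 0 2 18 3 339 52 403 5 2 93"

row = parse_vmstat(HEADER, SAMPLE)
busy = row["us"] + row["sy_2"]  # user + system CPU percentage
print(f"CPU busy {busy}%, idle {row['id']}%, po {row['po']}")
```

The same function should work on the Linux sample, since it only relies on the header and data line having the same number of columns.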
You can run vmstat every n seconds by entering vmstat [n] on the command line, where [n] is a number. vmstat only gives you an average for all processors. If you need to analyze per processor, use mpstat on Solaris. On Linux, you can also use the really excellent command top d 1, which provides a full-screen text display that refreshes every second, showing processes, memory usage, and CPU usage on a per-processor basis.
IOSTAT
To understand disk i/o, first an introduction to Sun hard disk conventions: On SunOS, to list all hard disk partitions, use df -k.
Filesystem            kbytes     used    avail capacity  Mounted on
/dev/dsk/c0t0d0s0     192790    61959   111552    36%    /
/dev/dsk/c0t0d0s6    1191020   905172   226297    80%    /usr
/proc                      0        0        0     0%    /proc
fd                         0        0        0     0%    /dev/fd
mnttab                     0        0        0     0%    /etc/mnttab
/dev/dsk/c0t0d0s4     674047    61569   551814    11%    /var
swap                 1948096        8  1948088     1%    /var/run
/dev/dsk/c0t1d0s3    6049124  5816355   172278    98%    /u01
/dev/dsk/c0t1d0s4    6049124  1557316  4431317    27%    /u02
/dev/dsk/c0t0d0s7   19110978  7748692  1071177    62%    /home0
/dev/dsk/c0t1d0s5    6049124  5122531   866102    86%    /u03
/dev/dsk/c0t0d0s5    1688242     2132  1635463     1%    /opt
/dev/dsk/c0t1d0s6    6049124       13  5988620     1%    /u04
/dev/dsk/c0t3d0s6   16516485  9229999  7121322    57%    /u08
/dev/dsk/c0t1d0s7   10812598  8196075  2508398    77%    u05
/dev/dsk/c0t2d0s6   16516485 11013540  5337781    68%    /u06
This will give you a listing like the above. Each hard disk is identified by a 6 letter name cNtNdN where N is a digit, e.g. c0t2d0. A hard disk can be partitioned, and each formatted hard disk will have at least one partition such as s0. The partition name is appended to the disk name, e.g. c0t0d0s0 in the above listing.
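To make the naming convention concrete, here is a small Python sketch that splits such a name into its parts. The letters c, t, d and s are conventionally read as controller, target, disk and slice (partition). The function name parse_device is hypothetical, and the pattern assumes every name has all of c, t and d, which holds for the listing above but may not hold on every system.

```python
import re

# cNtNdN with an optional sN partition suffix, e.g. c0t2d0 or c0t2d0s6.
DEVICE_RE = re.compile(r"^c(\d+)t(\d+)d(\d+)(?:s(\d+))?$")


def parse_device(name: str) -> dict:
    """Split a name such as /dev/dsk/c0t1d0s3 into its numbered parts."""
    base = name.rsplit("/", 1)[-1]  # drop any leading /dev/dsk/ path
    match = DEVICE_RE.match(base)
    if match is None:
        raise ValueError(f"not a cNtNdN device name: {name!r}")
    controller, target, disk, slice_ = match.groups()
    return {
        "controller": int(controller),
        "target": int(target),
        "disk": int(disk),
        # None means the name refers to the whole disk, not a partition.
        "slice": None if slice_ is None else int(slice_),
    }


print(parse_device("/dev/dsk/c0t1d0s3"))
print(parse_device("c0t2d0"))
```

Grouping df -k output by the controller, target and disk numbers is one way to see which mount points share a physical disk.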
On SunOS, use iostat -xn to get an i/o listing by hard disk. If you want to get a listing by partition, use iostat -xnp. On Linux, iostat -x is sufficient.
The following are some SunOS statistics. The r/s and w/s columns are the number of reads and writes per second. A measure of throughput is kr/s and kw/s, which stand for KB read per second and KB written per second.
bash-2.03$ iostat -xn
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 fd0
0.3 2.0 14.6 18.5 0.0 0.0 0.0 7.9 0 1 c0t0d0
1.2 16.5 66.5 135.1 0.0 0.2 0.0 10.8 0 12 c0t1d0
0.3 3.0 26.6 24.5 0.0 0.0 0.0 7.0 0 2 c0t2d0
0.2 2.7 19.0 22.8 0.0 0.0 0.0 8.0 0 2 c0t3d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t6d0
1.9 6.5 43.9 58.3 0.0 0.2 0.0 29.3 0 7 c2t0d0
7.9 24.2 211.0 197.9 0.0 0.8 0.0 25.4 0 25 c2t1d0
Perhaps the two most important columns in the above display are the %w and %b columns. %w is the percentage of time there are transactions waiting for service, and is a measure of how many jobs are contending to use the hard disk (i.e. high is bad), while %b is the percentage of time the hard disk is busy (i.e. one high device relative to the other devices is bad).
From the above figures, we can see that hard disk c2t1d0 is the most highly utilized. Although this is not serious (only 25% utilization), it should probably be investigated further to find out which applications or databases are using that hard disk.
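Picking out the busiest disk by eye gets tedious with many devices. The sketch below, in Python, parses the iostat -xn sample from this article and reports the device with the highest %b. The names parse_iostat and COLUMNS are mine, and the parser assumes the 11-column SunOS layout shown above.

```python
# Column names as printed by the iostat -xn sample in this article.
COLUMNS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
           "wsvc_t", "asvc_t", "%w", "%b", "device"]

SAMPLE = """\
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 fd0
0.3 2.0 14.6 18.5 0.0 0.0 0.0 7.9 0 1 c0t0d0
1.2 16.5 66.5 135.1 0.0 0.2 0.0 10.8 0 12 c0t1d0
0.3 3.0 26.6 24.5 0.0 0.0 0.0 7.0 0 2 c0t2d0
0.2 2.7 19.0 22.8 0.0 0.0 0.0 8.0 0 2 c0t3d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t6d0
1.9 6.5 43.9 58.3 0.0 0.2 0.0 29.3 0 7 c2t0d0
7.9 24.2 211.0 197.9 0.0 0.8 0.0 25.4 0 25 c2t1d0
"""


def parse_iostat(text: str) -> list:
    """Return one dict per device line, skipping headers and other text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        try:
            numbers = [float(f) for f in fields[:-1]]
        except ValueError:
            continue  # the header line is not numeric
        row = dict(zip(COLUMNS[:-1], numbers))
        row["device"] = fields[-1]
        rows.append(row)
    return rows


rows = parse_iostat(SAMPLE)
busiest = max(rows, key=lambda r: r["%b"])
print(f"busiest disk: {busiest['device']} at {busiest['%b']:.0f}% busy")
```

On this sample it selects c2t1d0, the same disk singled out in the discussion above.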
Network Analysis
The problem with netstat is there is a staggering amount of information available. Try netstat -s for network statistics.
RAWIP
rawipInDatagrams = 0 rawipInErrors = 0
rawipInCksumErrs = 0 rawipOutDatagrams = 0
rawipOutErrors = 0
UDP
udpInDatagrams =17227479 udpInErrors = 0
udpOutDatagrams =17210476 udpOutErrors = 0
TCP tcpRtoAlgorithm = 4 tcpRtoMin = 400
tcpRtoMax = 60000 tcpMaxConn = -1
tcpActiveOpens =121449 tcpPassiveOpens =123593
tcpAttemptFails = 1198 tcpEstabResets = 469
tcpCurrEstab = 89 tcpOutSegs =99321141
tcpOutDataSegs =94129579 tcpOutDataBytes =1737821124
tcpRetransSegs = 6771 tcpRetransBytes =4047173
tcpOutAck =5190950 tcpOutAckDelayed =2633452
tcpOutUrg = 134 tcpOutWinUpdate = 12831
tcpOutWinProbe = 26 tcpOutControl =491560
tcpOutRsts = 2614 tcpOutFastRetrans = 352
tcpInSegs =106046901
tcpInAckSegs =94153690 tcpInAckBytes =1737971355
tcpInDupAck =363643 tcpInAckUnsent = 0
tcpInInorderSegs =100156017 tcpInInorderBytes =278299936
tcpInUnorderSegs = 2077 tcpInUnorderBytes =2975624
tcpInDupSegs = 16862 tcpInDupBytes =1618160
tcpInPartDupSegs = 6 tcpInPartDupBytes = 3844
tcpInPastWinSegs = 8 tcpInPastWinBytes = 47840
tcpInWinProbe = 884 tcpInWinUpdate = 26
tcpInClosed = 20 tcpRttNoUpdate = 2988
tcpRttUpdate =93909335 tcpTimRetrans = 5873
tcpTimRetransDrop = 22 tcpTimKeepalive = 47189
tcpTimKeepaliveProbe= 15890 tcpTimKeepaliveDrop = 31
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpHalfOpenDrop = 0 tcpOutSackRetrans = 497
IPv4 ipForwarding = 2 ipDefaultTTL = 255
ipInReceives =101796067 ipInHdrErrors = 0
ipInAddrErrors = 0 ipInCksumErrs = 0
ipForwDatagrams = 0 ipForwProhibits = 0
ipInUnknownProtos = 0 ipInDiscards = 0
ipInDelivers =123153971 ipOutRequests =94246439
ipOutDiscards = 0 ipOutNoRoutes = 0
ipReasmTimeout = 60 ipReasmReqds = 0
ipReasmOKs = 0 ipReasmFails = 0
ipReasmDuplicates = 0 ipReasmPartDups = 0
ipFragOKs = 0 ipFragFails = 0
ipFragCreates = 0 ipRoutingDiscards = 0
tcpInErrs = 1 udpNoPorts =1343350
udpInCksumErrs = 0 udpInOverflows = 80
rawipInOverflows = 0 ipsecInSucceeded = 0
ipsecInFailed = 0 ipInIPv6 = 0
ipOutIPv6 = 0 ipOutSwitchIPv6 = 3360
The statistics shown here are only a small portion of the first few screens of data! I am not an expert in all the parameters, but a quick look at the error parameters such as tcpInErrs gives me an idea of the overall health of the network.
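With so many counters, a short script can do the "quick look" for you. This Python sketch pulls name = value pairs out of a few lines taken from the output above and lists the non-zero error counters. Treating any counter with "Err" in its name as an error counter is my own rough heuristic, and parse_counters is a hypothetical helper name.

```python
import re

# Matches pairs such as "tcpInErrs = 1" or "udpNoPorts =1343350".
PAIR_RE = re.compile(r"(\w+)\s*=\s*(-?\d+)")


def parse_counters(text: str) -> dict:
    """Collect every 'name = value' pair found in netstat -s style text."""
    return {name: int(value) for name, value in PAIR_RE.findall(text)}


# A few lines copied from the netstat -s output in this article.
SAMPLE = """\
rawipInDatagrams = 0 rawipInErrors = 0
udpInDatagrams =17227479 udpInErrors = 0
ipInReceives =101796067 ipInHdrErrors = 0
tcpInErrs = 1 udpNoPorts =1343350
udpInCksumErrs = 0 udpInOverflows = 80
"""

counters = parse_counters(SAMPLE)

# Heuristic (an assumption): names containing "Err" are error counters.
errors = {name: value for name, value in counters.items()
          if "Err" in name and value > 0}
print(errors)
```

Counters such as udpInOverflows do not match this simple filter, so you may want to extend the list of names you watch.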
You can also use netstat -a to examine for each port the number of bytes still waiting in the queue for transmission and number of received bytes not copied to the application process by your server:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address            Foreign Address      State
tcp        0      0 *:netbios-ssn            *:*                  LISTEN
tcp        0      0 *:webcache               *:*                  LISTEN
tcp        0      0 *:x11                    *:*                  LISTEN
tcp        0      0 *:http                   *:*                  LISTEN
tcp        0      0 *:ssh                    *:*                  LISTEN
tcp        0      0 S34KLJ142:smtp           *:*                  LISTEN
tcp        0      0 *:https                  *:*                  LISTEN
tcp        0      0 200.7.1.142:netbios-ssn  200.7.1.26:1352      ESTABLISHED
tcp        0      0 200.7.1.142:33568        200.1.34.117:1521
tcp        0    180 200.7.1.142:ssh          200.7.1.25:1404      ESTABLISHED
Note that ssh still has 180 bytes of data to transmit in its queue. Not a problem, but this could be a problem if the value is very large (meaning that the ssh data cannot be transmitted fast enough, so it has to be queued).
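If the listing is long, you can filter it for connections with data sitting in either queue. The following Python sketch does this for a few rows from the listing above. The function name queued_connections and the min_bytes threshold are assumptions for illustration, and the parser expects the Linux-style column order shown here.

```python
# Rows copied from the netstat -a listing in this article.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:http *:* LISTEN
tcp 0 0 200.7.1.142:netbios-ssn 200.7.1.26:1352 ESTABLISHED
tcp 0 0 200.7.1.142:33568 200.1.34.117:1521
tcp 0 180 200.7.1.142:ssh 200.7.1.25:1404 ESTABLISHED
"""


def queued_connections(text: str, min_bytes: int = 1) -> list:
    """Return (local, foreign, recv_q, send_q) for rows with queued data."""
    found = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything without numeric queue columns.
        if len(fields) < 5 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        if recv_q >= min_bytes or send_q >= min_bytes:
            found.append((fields[3], fields[4], recv_q, send_q))
    return found


for local, foreign, recv_q, send_q in queued_connections(SAMPLE):
    print(f"{local} -> {foreign}: Recv-Q={recv_q} Send-Q={send_q}")
```

Raising min_bytes lets you ignore small, transient queues like the 180 bytes seen here.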
I also find it quite useful to use simple real-world benchmarks to analyze network performance, such as the time required to transfer a 10 megabyte file (should be 10-15 seconds on a 10 Mbit network), or using the apachebench ab -c10 -n1000 [url] command to simulate 10 clients sending a total of 1000 http requests to a server.
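The file transfer figure can be sanity-checked with a little arithmetic. This Python sketch computes the best-case transfer time, assuming decimal units (1 megabyte = 8 megabits) and no protocol overhead. The function name ideal_transfer_seconds is my own.

```python
def ideal_transfer_seconds(size_megabytes: float, link_megabits_per_s: float) -> float:
    """Best-case transfer time: convert megabytes to megabits, divide by link speed."""
    return size_megabytes * 8 / link_megabits_per_s


# 10 megabyte file over a 10 Mbit network.
print(ideal_transfer_seconds(10, 10))
```

The ideal figure here is 8 seconds, so a measured 10-15 seconds leaves room for protocol overhead and disk access. A much larger number suggests something is wrong.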
Current versus Historical Statistics
If you run "vmstat 1" you will get the latest statistics once a second. If you run "vmstat" alone, you will get the averages since the machine booted up. Similarly, "iostat -xn 1" will give you the current machine performance, once a second, while "iostat -xn" gives you the average load since the machine booted up.

