This document describes the information the orcallator.se tool measures on Solaris hosts. Each description of a particular measurement is accompanied by an example of some data. At Yahoo!/GeoCities I have observed almost every measurement exhibit problems somewhere on the site, and I have pulled these particular pieces of data for this page for illustration purposes.
For more information on Orca and orcallator.se, read the following articles:
Watching your systems in realtime, SunWorld, July 1999.
Viewing your network in realtime, SunWorld, September 1999.
There are many other sources of information for the following measurements. An excellent starting point is Adrian Cockcroft's SunWorld Online Performance Columns. There is also Sun's support system, including Answerbook Online and SunSolve Online. A large number of books have been written on Unix performance and tuning. Two publishers worth checking out are O'Reilly and Prentice Hall.
At the bottom of this page are all the URLs referenced on this page.
This is version 0.26beta1 of this document, and this version number corresponds to the version of orcallator.se whose data is represented here. This document was initially written by Duncan Lawie, was updated by Blair Zajac, and is an unfinished piece of work. If you have any additions, suggestions, feedback, or corrections, please email them to the Orca discussion mailing list.
The disk measurements presented below can be slightly confusing, as there are many different types of measurements and it is not always clear what the different system utilities, such as iostat and sar, are measuring. I recommend reading the following articles written by Adrian Cockcroft. The orcallator.se tool measures the values according to the concepts in these articles.
How do disks really work?, SunWorld.com, June, 1996.
Clarifying disk measurements and terminology, SunWorld.com, September, 1997.
What does 100 percent busy mean?, SunWorld.com, August, 1999.
This graph is based on the SE classes that examine 11 different components of a system, such as disks, net, RAM, etc. The SE classes use rules determined from experience to represent the health of each component. This is all described very nicely in an article by Adrian Cockcroft, although the rules in that 1995 article do not necessarily reflect the rules in any other SE release. The orcallator.se script goes beyond determining a state and assigns a numerical value to the state, which is then plotted. The numerical values grow exponentially, so that as the state gets worse the component is represented as being in a much worse state than a linear progression would suggest.
Any value over 1 warrants a look. Note: the colors in the plot have nothing to do with the colors representing state. The values recorded by orcallator.se are twice the values plotted here. After I released orcallator.se, I decided that having a component operating in an acceptable state (white, blue, or green) result in values ranging from 0 to 2 was not natural to system administrators, for whom system loads above 1 mean the system is busy. The division by two makes this more intuitive.
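As a minimal sketch of this halving, the mapping below assumes a commonly used exponential table (white 0 through black 16); this table is an assumption, so consult your orcallator.se source for the authoritative values:

    # Assumed exponential values recorded by orcallator.se for each SE state;
    # the acceptable states (white, blue, green) span the 0..2 range noted above.
    RECORDED_STATE_VALUES = {
        "white": 0, "blue": 1, "green": 2,    # acceptable states
        "amber": 4, "red": 8, "black": 16,    # increasingly unhealthy states
    }

    def plotted_value(state):
        """Half the recorded value, so any plotted value over 1 means the
        component has left the acceptable range."""
        return RECORDED_STATE_VALUES[state] / 2.0

    for state, recorded in RECORDED_STATE_VALUES.items():
        print(f"{state:5s} recorded={recorded:2d} plotted={plotted_value(state):4.1f}")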
This graph shows the system's uptime.
This graph plots the average number of jobs in the run queue over the last 1, 5, and 15 minute intervals. On Solaris systems this is how the system load is calculated, and these are the same values displayed as the system load. Older versions of orcallator.se recorded two sets of data, one labeled {1,5,15}runq and the other {1,5,15}load; both sets were calculated identically from the same system variables. When I was made aware of this, orcallator.se was changed to record only the {1,5,15}runq values, and orcallator.cfg now makes only one plot, not two. If long-term trends indicate increasing figures, more or faster CPUs will eventually be necessary unless the load can be displaced. For ideal utilization of your CPUs, the maximum value here should be equal to the number of CPUs in the box.
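For a rough cross-check of these figures outside of Orca, here is a minimal sketch using Python's standard os.getloadavg(), which reads the same 1, 5, and 15 minute averages on Unix hosts; the comparison against the CPU count follows the rule of thumb above:

    import os

    # 1, 5 and 15 minute load averages, the same values orcallator.se
    # records as 1runq, 5runq and 15runq.
    load1, load5, load15 = os.getloadavg()

    # For ideal utilization the run queue should peak near the CPU count.
    ncpus = os.cpu_count() or 1

    for label, load in (("1min", load1), ("5min", load5), ("15min", load15)):
        print(f"{label}: {load:.2f} ({'busy' if load > ncpus else 'ok'}, {ncpus} CPUs)")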
This graph shows the percentage of CPU time consumed by user and system processes, with the remainder being idle time. If idle time is always low, check the number of processes in the run queue; more, or faster, CPUs may be necessary. If user CPU time is commonly less than system CPU time, there may be problems with the system setup.
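As a quick spot check of the user/system split from the command line, the following sketch parses vmstat output; it assumes the Solaris vmstat layout, where the final three columns are us, sy, and id:

    import subprocess

    # Take two one-second vmstat samples; the final line reflects current
    # activity rather than the since-boot averages in the first sample.
    out = subprocess.run(["vmstat", "1", "2"], capture_output=True,
                         text=True, check=True).stdout
    us, sy, idle = (int(f) for f in out.strip().splitlines()[-1].split()[-3:])

    print(f"user {us}%  system {sy}%  idle {idle}%")
    if sy > us:
        print("system time exceeds user time; check the system setup")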
This graph is a simple count of the number of processes currently on the system and the number of web server processes. This includes sleeping, defunct, and otherwise inactive processes in addition to runnable processes. If there seems to be an excessive number of processes on the system relative to the run queue, check for defunct (zombie) processes.
This is very similar to the Number of System & Web Server Processes plot, except that only web server processes are counted. In the default orcallator.cfg file, the number of web server processes is plotted along with the total number of processes on the system, and, in a separate plot, just the number of web server processes is plotted. Two distinct plots are created since some hosts do more than just serve pages, and the number of web server processes may be a small fraction of the system total. This plot is useful for determining whether your web server is being pushed to its limits; for Apache web servers, compare the count against the configured MaxClients limit. The number of web server processes is determined by counting all the processes whose names contain the string defined in the WEB_SERVER environment variable, which defaults to httpd.
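Here is a minimal sketch of that counting approach; the WEB_SERVER variable and httpd default follow the description above, while the use of ps -e -o comm= is simply one portable way to list process names:

    import os
    import subprocess

    # Process-name substring to match; orcallator.se takes this from the
    # WEB_SERVER environment variable and falls back to "httpd".
    web_server = os.environ.get("WEB_SERVER", "httpd")

    # "-e" selects every process and "-o comm=" prints just the command
    # name with no header line.
    comms = subprocess.run(["ps", "-e", "-o", "comm="], capture_output=True,
                           text=True, check=True).stdout.splitlines()

    web = sum(1 for comm in comms if web_server in comm)
    print(f"{web} {web_server} processes out of {len(comms)} total")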
When orcallator.se reads the web server access log, it calculates the average number of bits served per second. This number does not include TCP/IP packet header overhead or retransmitted data.
This measures the number of requests per second that resulted in 4XX or 5XX return codes.
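The following sketch illustrates both of the web log measurements above over a Common Log Format access log; the 300 second interval is an assumption for illustration, and orcallator.se's own incremental log parsing is more involved:

    import sys

    INTERVAL = 300.0   # assumed measurement interval in seconds
    bytes_served = errors = requests = 0

    # Common Log Format lines end with: "REQUEST" STATUS BYTES
    for line in open(sys.argv[1]):
        fields = line.split()
        if len(fields) < 2:
            continue
        status, size = fields[-2], fields[-1]
        requests += 1
        if status.startswith(("4", "5")):
            errors += 1
        if size.isdigit():           # size is "-" when no body was sent
            bytes_served += int(size)

    # Content bits per second; excludes TCP/IP headers and retransmissions,
    # matching the measurement described above.
    print(f"{bytes_served * 8 / INTERVAL:.0f} bits/sec")
    print(f"{errors / INTERVAL:.2f} 4XX/5XX errors/sec over {requests} requests")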
This set of graphs shows the number of input and output bits per second on the given interface. It counts all bits from each protocol, including headers.
This set of graphs shows the number of input and output packets per second on the given interface.
This set of graphs shows the number of input and output Ethernet errors per second on the given interface.
This is the number of times a transmission was deferred to a future time at the interface level. This slows down the transmission time of a packet.
The number of segments per second transferred via TCP.
This plot graphs the percentage of incoming and outgoing bytes that were retransmitted. High values for either of these are an indication of either a congested network dropping packets or long delays. Retransmission occurs when no confirmation of receipt has arrived for the original packet and the system has to resend it. Duplicates are received when another system retransmits and a packet is received more than once. For sites with large retransmission percentages, such as web sites serving international content, you will probably want to tune your TCP stack with the ndd command.
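These percentages can be approximated from the since-boot TCP counters; this sketch assumes Solaris netstat -s -P tcp output and its tcpRetransBytes, tcpOutDataBytes, tcpInDupBytes, and tcpInInorderBytes counters:

    import re
    import subprocess

    # Dump the TCP MIB counters; Solaris prints them as name = value pairs.
    out = subprocess.run(["netstat", "-s", "-P", "tcp"], capture_output=True,
                         text=True, check=True).stdout
    counters = {m.group(1): int(m.group(2))
                for m in re.finditer(r"(\w+)\s*=\s*(\d+)", out)}

    # Since-boot percentages; orcallator.se computes these per interval.
    out_pct = 100.0 * counters["tcpRetransBytes"] / counters["tcpOutDataBytes"]
    dup_pct = 100.0 * counters["tcpInDupBytes"] / counters["tcpInInorderBytes"]
    print(f"retransmitted: {out_pct:.2f}%  duplicates received: {dup_pct:.2f}%")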
The number of current TCP connections. This includes "long" connections such as ssh/rlogin as well as "short" ones like scp/ftp.
A mutex is a mutual exclusion lock on kernel resources. If multiple CPUs try to acquire the same lock at the same time, all but one CPU will sleep. This graph shows the rate at which these sleeps occur.
This graph plots the rate of NFS reads and writes services broken down into NFS v2 and v3 reads and writes. The sum of v2reads, v2writes, v3reads and v3writes will be less than nfss_calls as we're not plotting all the other types of NFS calls (getattr, lookup, etc). This plot is identical to the previous plot except that the total number of calls is not shown. This is done to show the distribution of calls that may not otherwise be visible. |
This plot measures the number of read and write operations across all the disks on the system.
This plot measures the average number of bytes read from and written to all the disks on the system per second.
The run percent is measured as the percentage of time in a given interval that the disk is working on a request. This is the iostat %b measurement and is sometimes called the active/run queue utilization or run busy percentage. It varies between 0 and 100%. See the articles listed in the disk resources section for more information on this measurement. This plot displays each disk's run percent and can be used to gauge the load imbalance across all of your disks. There is a known bug in Orca version 0.27 and below that causes it to generate multiple versions of this plot. The problem is due to the way Orca does internal bookkeeping of the different disks it has seen; it ends up generating different lists of disks from different source files that are not placed in the same plot.
This plot is now replaced by the plot displaying each disk's run percent. It shows how busy all of your disks are and how much busier your busiest disk is than the average of all the disks on the system. This is used to show a usage imbalance on your disks that may warrant moving some data or partitions around. The maximum value is the largest run percentage across all the disks on your system, and the average is the average over all disks. Orcallator.se 1.12 and before had a bug in calculating the mean disk busy percentage that made it too small.
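A sketch of computing the per-disk run percent, its mean, and its maximum from live iostat output; it assumes the classic Solaris iostat -x layout, in which the device name is the first column and %b the last:

    import subprocess

    # Two five-second iostat -x samples; the second reflects current activity.
    out = subprocess.run(["iostat", "-x", "5", "2"], capture_output=True,
                         text=True, check=True).stdout
    lines = out.strip().splitlines()

    # Keep only the final sample: the data rows after the last header row.
    last_header = max(i for i, line in enumerate(lines) if "%b" in line)
    busy = {}
    for line in lines[last_header + 1:]:
        fields = line.split()
        if fields:
            busy[fields[0]] = float(fields[-1])   # device name, %b column

    mean = sum(busy.values()) / len(busy)
    worst = max(busy, key=busy.get)
    print(f"mean busy {mean:.1f}%  max {busy[worst]:.1f}% on {worst}")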
This plot measures the number of read and write operations across all the st* tape drives on the system.
This plot measures the average number of bytes read from and written to all the st* tape drives on the system per second.
The run percent is measured as the percentage of time in a given interval that an st* tape drive is working on a request. This is the iostat %b measurement and is sometimes called the active/run queue utilization or run busy percentage. It varies between 0 and 100%. This plot displays each st* tape drive's run percent and can be used to gauge the load imbalance across all of your st* tape drives. There is a known bug in Orca version 0.27 and below that causes it to generate multiple versions of this plot. The problem is due to the way Orca does internal bookkeeping of the different tape drives it has seen; it ends up generating different lists of tape drives from different source files that are not placed in the same plot.
There are two figures in this graph. The values are percentages showing the proportion of time that the data sought is found in the cache. The Directory Name Lookup Cache (DNLC) caches file names. The inode cache contains full inode information for files, such as file size and datestamps. Low values are likely when the caches are too small for the number of files being accessed.
This plot shows how many times per second the Directory Name Lookup Cache (DNLC) and the inode cache are referenced.
Any significant rate of inode steals in this plot indicates that the inode cache may not be sufficiently large. This is related to the value of the ufs_ninode kernel tunable.
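The DNLC hit rate can be checked directly from the kernel counters; this sketch assumes the unix:0:ncstats kstat with hits and misses statistics, as found on common Solaris releases, and computes a since-boot rate rather than the per-interval rate orcallator.se plots:

    import subprocess

    # Dump the DNLC kstat; each line is "module:instance:name:stat value".
    out = subprocess.run(["kstat", "-p", "unix:0:ncstats"], capture_output=True,
                         text=True, check=True).stdout
    stats = {}
    for line in out.strip().splitlines():
        key, value = line.split(None, 1)
        if value.strip().isdigit():    # skip class/crtime/snaptime entries
            stats[key.split(":")[-1]] = int(value)

    hits, misses = stats["hits"], stats["misses"]
    print(f"DNLC hit rate: {100.0 * hits / (hits + misses):.1f}%")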
This plot shows how many bytes of physical memory are free. This is expected to decrease over time, as pages are not freed until physical memory is full.
This graph indicates the rate at which the system is scanning memory for pages which can be reused. Consistently high rates of scanning (over 200 pages per second) indicate a shortage of memory. Check page usage and page residence times.
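To spot-check the scan rate against the 200 pages per second rule of thumb, the following sketch reads the sr column from Solaris vmstat:

    import subprocess

    # Two one-second samples; the last line reflects current paging activity.
    out = subprocess.run(["vmstat", "1", "2"], capture_output=True,
                         text=True, check=True).stdout
    lines = out.strip().splitlines()

    # Locate the "sr" (scan rate) column from the header row.
    header = next(line for line in lines if " sr " in f" {line} ")
    scan_rate = int(lines[-1].split()[header.split().index("sr")])

    print(f"scan rate: {scan_rate} pages/sec"
          + ("  (memory shortage likely)" if scan_rate > 200 else ""))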
Page residence time is the amount of time that a memory page for a particular process remains in memory. The maximum measured time is 600 seconds. Low values for page residence indicate memory shortages, particularly in combination with high page scan rates.
This is a further breakdown of the "other" section of page usage into pages used for I/O and locked pages.
These are the URLs referenced on this page.
Viewing your network in realtime, SunWorld, September 1999.
Watching your systems in realtime, SunWorld, July 1999.
How do disks really work?, SunWorld.com, June, 1996.
Clarifying disk measurements and terminology, SunWorld.com, September, 1997.
What does 100 percent busy mean?, SunWorld.com, August, 1999.
Orca 0.265 by Blair Zajac blair@orcaware.com
Funding for Orca provided by the founder of The Rothschild Image, renowned fashion image consultant, Ashley Rothschild. Graphs made available by RRDtool.