SG24-4778-00 RS/6000 SP: Problem Determination Guide December 1996 This soft copy for use by IBM employees only.
International Technical Support Organization
Figures
Control Workstation Connected to the SP Frame
Supervisor Cards
Diagram of the Different Phases on the High Performance Switch
The High Performance Switch Board
107. setup_authent Script Flow Chart (7/7)
108. install_cw Script Flow Chart (1/3)
The redbook also includes appendices with useful reference material about RS/6000 SP script files, the SDR structure, and how to change IP addresses and hostnames for the SP. This redbook is a valuable tool for system administrators and other technical support personnel who deal with SP problems.
• Chapter 7, “Isolating Problems on the SP System” The RS/6000 SP provides several tools that help to identify and isolate problems. In this chapter, these tools are used along with the Symptom Index in Chapter 3 of the Diagnosis and Messages Guide, GC23-3899, to help isolate problems.
IBM UK Ltd. He joined IBM in 1989 and began supporting the RS/6000 software in 1992. During the last 18 months he has focused on RS/6000 SP products and his expertise has been gained by working closely with customers resolving installation and post-installation issues.
Comments Welcome We want our redbooks to be as helpful as possible. Should you have any comments about this or other redbooks, please send us a note at the following address: redbook@vnet.ibm.com...
1.1 RS/6000 SP: Hardware and Software Although RS/6000 SP is built with standard AIX and RS/6000 parts, it has its own hardware components and special software to make managing it easier.
Figure 1. Control Workstation Connected to the SP Frame The serial link connects the CW with the Supervisor Card in the Frame. In this way, the CW can manage and monitor every major hardware event produced either by the Frame itself or by the nodes.
1.1.3 The High Performance Switch The High Performance Switch is one of the most distinctive pieces of hardware and software developed for the RS/6000 SP. The hardware portion of the switch provides low-latency, high-bandwidth communication between nodes.
1.2 Problem Determination Basically, there are two kinds of problems encountered on the RS/6000 SP: those related to AIX and those related to the POWERparallel System Support Programs. For those related to AIX, the approach to solve them is the standard AIX way, which means you can use errlog and trace facilities to find the problem.
The internal file structure and the syntax are specific to each component. When you have problems on the RS/6000 SP, it is sometimes difficult to determine the source of the problem, especially when the problem or malfunction involves several components within AIX and PSSP. One of the most difficult things to do is to isolate the problem to one component.
not considered suitable to include in a chapter but are useful as reference material.
Installing the RS/6000 SP system includes installing both hardware and software. Before RS/6000 SP system hardware and software are installed, a detailed plan should be created. For details of RS/6000 SP site planning, refer to RS/6000 SP: Site Planning , GC23-3905.
2.2 Prepare the Control Workstation This section is covered in steps 0 through 11 in the Installation Guide and in sections 3.1, 3.2.1, and 3.2.2 in RS/6000 SP: PSSP Version 2 Technical Presentation, SG24-4542. Completing steps 0 through 11 means that the following have been successfully achieved: 1.
PATH=/usr/lpp/ssp/rcmd/bin:
PATH=$PATH:/usr/lpp/ssp/bin:/usr/lpp/ssp/kerberos/bin:/usr/local
PATH=$PATH:/usr/bin:/etc/:/usr/sbin:/usr/ucb
PATH=$PATH:/usr/dt/bin:/usr/lpp/X11/bin:/sbin:...
2.2.2 AIX Software Components The RS/6000 you are using as the Control Workstation (CWS) for the RS/6000 SP system must have the following software installed: • AIX Version 4.1 Base Operating System •...
Figure 9. Output of lsvg rootvg Command to Check Disk Space 2.2.4 RS-232 Control Lines Diagnostics Each frame in the RS/6000 SP system requires a serial port on the Control Workstation to accommodate an RS-232 connection between them. As there is...
of the hardware. The splogd daemon, which is also a client of the hardware monitor daemon, will not be able to log any changes in hardware state. 2.2.5 Changing IP Addresses and Hostnames The Control Workstation, in our examples, has a Token Ring and an Ethernet adapter card.
Any decrease in this number will not become effective until the next system boot. Installing your RS/6000 SP system as a root user means that your system will not be affected by this limit. However, it is a good idea to increase this value now, based on your application requirements, to allow for later requirements.
2.2.8 Tunable Values After the initial installation of the RS/6000 SP system, the network tunable values are set to AIX 4.1 defaults. Change these values to the optimum values for SP systems. This is achieved by placing no -o commands in the tuning.cust file under the tftpboot directory.
Table 2 on page 17 provides a brief overview of various components of PSSP software, what the minimum required options are, and also what is recommended by IBM to be installed. Our recommendations will be overridden by your requirements.
Other components may be required based on your environment. If you are using MIT Version 4 or AFS authentication services, then ssp.authent is not required. For an RS/6000 SP authentication server, you must have ssp.authent. If you have a High Performance Switch or an SP Switch, then you have to install the Switch Device Driver on your nodes.
PSSP 1.2 was IBM's first implementation of POWERparallel System Support Programs (PSSP) software. This level of PSSP requires AIX Version 3.2.5. PSSP 1.2 is still supported by IBM developers and Support Centers. PTFset23 is the latest level of PTF available for PSSP 1.2. PTFset23 should be installed, because it resolves various known defects and also provides some enhancements.
IBM Parallel Environment for AIX Version 2.1 provides support for parallel applications on AIX Version 4.1. − IBM Parallel ESSL for AIX 4.1.4 improves performance of engineering and scientific applications on the RS/6000 SP Systems. − IBM PVMe for AIX supports parallel execution of applications on AIX version 4.1.4.
Remote Diagnosis Support has been added in PSSP 2.1. 2.2.14 PSSP Software Strategy The PSSP updates and fixes are available on a regular basis from IBM as PTF sets. Some of the PSSP concepts and terminology are described as follows: •...
If you do not have direct access to the IBM network, then you may get these PTFs from the IBM Support Center by calling them and providing them with requirements, media type, delivery address, and urgency.
It is important to find out what software service level you have because you may, for example, need it to report a problem to the IBM Support Center. The support staff needs this information to check various databases to ascertain if such a problem is already known on the software service level in question.
# cd /usr/lpp/ssp/ssp.basic/2.1.0.10
# ls -las
total 40
4 drwxr-xr-x 3 root system 512 May 14 15:43 .
4 drwxr-xr-x 3 root system 512 Apr 24 18:30 ..
4 -r--r--r-- 1 root system...
Much of the output from the lslpp command is understandable without an explanation. Other fields contain data that needs to be defined. The following table defines terms used in several of the output fields of the lslpp...
Output field    Term       Definition
                BROKEN     The fileset was left in a broken state after the specified action.
Status          CANCELED   The specified action was canceled before it completed.
history         COMPLETE   The commitment of the fileset has completed successfully.
results of...
• README file Always read the README file under directory /usr/lpp/ssp for information about various PSSP components, including product information, installation information, restrictions, advisories, important APAR information, and other allied information. This can be very useful information in resolving problems even before they arise.
• Loss of sysman configuration files after installing PTFset11 PTFset11 for RS/6000 SP Version 2 Release 1 is a full replacement PTF set and will therefore replace some configuration files that you have on your system. To retain the contents of these files, you need to save them before the installation of PTFset11 and restore them afterward.
984064 Table 4. Components and Sizes of the PSSP PTFset11 2.3 Authentication Services Diagnostics To initialize your primary authentication server on the RS/6000 SP Control Workstation, you need to run the setup_authent command from the command line on your Control Workstation. The Kerberos administrator needs to define a...
If required, run the kinit command to get a valid ticket from RS/6000 SP authentication services, specifying the administrative principal name that was used when authentication was set up. The full path of the install_cw script is /usr/lpp/ssp/bin/install_cw.
• For a complete flow chart of what install_cw does, refer to Appendix A.2, “The install_cw Script” on page 236. Figure 14 shows a sample output of running the install_cw script.
• You should have a valid ticket from the RS/6000 SP authentication services. Use the klist command to check authentication and the kinit command to request authentication (if required). • If the install_cw script does not complete successfully, then refer to Appendix A.2, “The install_cw Script”...
Figure 17. Snapshot of System Monitor The hardware monitor consists of a daemon, named hardmon, and a set of client commands. The hardmon daemon executes on the Control Workstation and, using the RS-232 lines to each frame, polls the frames for the state of the hardware within the frame.
This result indicates that the ssp.basic option is installed. • If the verification in the previous step succeeds, then check your PATH environment variable. You can achieve this by using echo $PATH.
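As a rough aid for the PATH check described above, the following sketch lists which PSSP directories are absent from a PATH-style string. The function name missing_ssp_dirs is hypothetical, and the three directories are assumed from the sample PATH settings shown earlier in this guide; adjust the list if your installation differs.

```shell
#!/bin/sh
# Hypothetical helper: print each PSSP directory that is NOT in the given
# PATH-style string. The directory list is an assumption taken from the
# sample PATH settings in this guide.
missing_ssp_dirs() {
    path_value=$1
    for dir in /usr/lpp/ssp/bin /usr/lpp/ssp/rcmd/bin /usr/lpp/ssp/kerberos/bin; do
        case ":$path_value:" in
            *":$dir:"*) ;;            # directory is present, print nothing
            *) echo "$dir" ;;         # directory is missing
        esac
    done
}

# Usage: prints nothing when all three directories are on the current PATH.
missing_ssp_dirs "$PATH"
```

Wrapping the string in colons before matching avoids false hits on directories that merely share a prefix, such as /usr/lpp/ssp/bin2.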
• Verify the RS-232 connection. Ensure that the RS-232 line is connected to the correct frame and the correct serial port on the Control Workstation. Remember that the hardmon daemon on the Control Workstation requires an RS-232 connection to each frame to poll the frame for the state of hardware within the frame.
DIAGNOSTIC EXPLANATION Information; Node 1:8; powerLED; Power is on. This log provides full details of all SP hardware errors. • If you cannot determine the cause of the problem, call the IBM Support Center.
If the CPU utilization rate is very high and cannot be attributed to the hardmon or logging daemon, then look for other processes which are consuming the CPU resources. Contact the IBM Support Center if you cannot resolve the problem.
Check for a core dump. If available, run to identify the reason for the core dump. If required, send the core dump and /unix file to the IBM Support Center for investigation. • If the hardware monitor daemon (hardmon) dies: −...
Figure 18. The hardmon Daemon Is Not Running ps -eaf | grep hardmon shows that it is running, but another ps -eaf shows it is continually respawning. fuser /dev/tty0 reconfirms this.
If the sdr daemon is not running, enter the following: startsrc -g sdr If the problem persists, move on to the next check. 3. Check the /spdata/sys1/spmon/hmacls file. Figure 20. The hmacls File In this example, /spdata/sys1/spmon/hmacls has incomplete entries.
For all configured IP interfaces, the name resolution has to be set up correctly by using the /etc/hosts file or DNS. This also includes interfaces other than the SP internal Ethernet.
You should be logged in as a valid authenticated user of the system management commands. • Valid ticket You should have a valid ticket from the RS/6000 SP authentication services. Use the klist command to check authentication, and the kinit command to request authentication (if required).
sdrd:2:once:/usr/bin/startsrc -g sdr
sp:2:wait:/etc/rc.sp > /dev/console 2>&1
hardmon:2:once:/usr/bin/startsrc -s hardmon
hr:2:once:/usr/bin/startsrc -g hr
hb:2:once:/usr/bin/startsrc -g hb >/dev/null 2>/dev/console
splogd:2:once:/usr/bin/startsrc -s splogd
Figure 21. /etc/inittab File after Running install_cw Script
The next figure shows what gets added to the /etc/services by running the install_cw script.
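To confirm that the entries listed in Figure 21 are actually present after install_cw has run, a check along the following lines may help. This is a minimal sketch: the function name missing_inittab_ids is hypothetical, and the identifier list is taken only from the figure, so treat it as an assumption for other PSSP levels.

```shell
#!/bin/sh
# Hypothetical helper: print each inittab identifier from Figure 21 that
# has no record in the given file (default: /etc/inittab).
missing_inittab_ids() {
    inittab_file=${1:-/etc/inittab}
    for id in sdrd sp hardmon hr hb splogd; do
        # An inittab record starts with "<identifier>:" in column one,
        # so "^sp:" does not match the splogd record.
        grep -q "^$id:" "$inittab_file" 2>/dev/null || echo "$id"
    done
}

# Usage: prints nothing when all six records are present.
missing_inittab_ids /etc/inittab
```

Passing a file name as the first argument makes it possible to run the same check against a saved copy of /etc/inittab.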
As part of PSSP 2.1, NIM was introduced as the network installation support for the RS/6000 SP. NIM is a general AIX product, and its introduction to the SP environment was intended to centralize network installation and keep it consistent with the rest of the RS/6000 products.
• dataless NIM client and other information is stored on the master. Within the RS/6000 SP environment, there is a machine object for each node of the RS/6000 SP. They will run as standalone systems after the installation. # lsnim -l sshps01...
The complete list of resources is available in the NIM Installation Guide and Reference, SC23-2627. Within the RS/6000 SP environment, the following resource objects are created: • lppsource (type lpp_source) This resource points to the LPPs which are located in the directory /spdata/sys1/install/lppsource in the RS/6000 SP environment.
# cat /spdata/sys1/install/pssp/bosinst_data
control_flow:
CONSOLE = /dev/tty0
INSTALL_METHOD = overwrite
PROMPT = no
EXISTING_SYSTEM_OVERWRITE = yes
INSTALL_X_IF_ADAPTER = no
RUN_STARTUP = no
RM_INST_ROOTS = yes
ERROR_EXIT =
CUSTOMIZATION_FILE =
TCB = no...
(SRC-controlled) which is used for communication purposes between NIM Master and NIM Clients. In the RS/6000 SP environment, this command is called by the setup_server script.
Without any parameters, the lsnim command displays all NIM objects that have been created on this NIM master. In the RS/6000 SP environment, we see the previously mentioned resources. The -l option displays the set of attributes that are associated with a specific NIM object.
, the later host name of the system, the MAC address of the network adapter that is used for network booting, and the type of network adapter.
cable_type1 This is bnc or dix in the RS/6000 SP environment.
prev_state This attribute displays the previous state of the NIM object.
cpuid Stores the ID of the CPU.
# nim -o <operation> -a <attribute>=<value> ... <object>
# lsnim -O sphps01
sphps01:
define = define an object
change = change an object's attributes
remove = remove an object
allocate...
This operation is used to install and update software or to execute scripts on standalone clients. In the RS/6000 SP environment this operation is not used. It requires the nodes to be configured as NIM Clients (access to root user granted by entry in .rhosts file on the node).
This operation is used to perform operations on installed software, such as removing or cleaning up after an interrupted installation. It is not used in the RS/6000 SP environment because nodes are not configured as NIM Clients. The following example shows how to uninstall the Performance Agent software from a workstation.
# nim -Fo reset <machine object>
# nim -o deallocate -a subclass=all <machine object>
Or you can do it by using PSSP commands:
# spbootins -r disk -l <node number> ...
# exportfs sp2n03
/var/adm/acct -root=SP2CW0:sp2tr0:sp2cw0,access=SP2CW0:sp2tr0:sp2cw0=machines
/spdata/sys1/install/pssplpp
/export/nim/scripts/sp2n08.script -ro,root=sp2n08,access=sp2n08
/usr -ro,root=sp2n08,access=sp2n08
/spdata/sys1/install/pssp/bosinst_data -ro,root=sp2n08,access=sp2n08
/spdata/sys1/install/pssp/pssp_script -ro,root=sp2n08,access=sp2n08
/spdata/sys1/install/images/bos.obj.ssp.41 -ro,root=sp2n08,access=sp2n08
However, sometimes a node will appear in exportfs output from the NIM master even if you never defined that particular node as a NIM client. For example, this could happen if the NIM database is out of sync in a situation where NIM has not successfully removed the NIM client with a previous NIM command.
The most likely reason for this is that some filesets are missing in the /spdata/sys1/install/lppsource directory. Compare all filesets in this directory with the list given in Step 8 of the installation process in the RS/6000 SP Installation Guide, GC23-3898.
room for eight 2.5 MB images or a total of 20 MB. However, once the images have been created, you only need psspspot.rs6k.ent. The other images could be deleted. Accordingly, one may say that you only need one file of 2.5 MB.
# setup_server
0042-001 nim: processing error encountered on "master":
0042-001 m_mkbosi: processing error encountered on "master":
0042-154 c_stat: the file "/spdata/sys1/install/images/bos.obj.ssp.41" does not exist
setup_server: 0016-072 Error detected processing the following nim command:
/usr/sbin/nim -o define -t mksysb -a server=master -a location=/spdata/sys1/install/images/bos.obj.ssp.41 mksysb_1...
0042-001 nim: processing error encountered on "master":
0042-001 m_instspot: processing error encountered on "master":
0042-062 m_ckspot: "psspspot" is missing something which is required
3. If the problem persists, you can remove the SPOT.
1. Installing nodes Typically no external devices like tape drives or CD-ROM drives are connected to the nodes of an RS/6000 SP. Therefore, it is necessary to install the nodes over the network, including the boot phase. This also applies to the two following tasks.
If this fails, you see the LED error code 609. 4. A more likely step to fail is the mount which is performed now. The SPOT is mounted in the RS/6000 SP environment by the command mount bootserver:/usr /SPOT/usr. This step sometimes fails due to incorrect NFS configuration on the installation server.
/usr/lib/boot/network/rc.diag is copied and sourced. The last possible script is /usr/lib/boot/network/rc.dd_boot, but it is not used in the RS/6000 SP environment. It would perform a normal network boot (for instance, in a diskless/dataless environment).
. Another is, for instance, rte, which leads to a basic BOS installation. In the RS/6000 SP environment, this variable is always set to mksysb for the installation process. In case of a diagnostic network boot, it contains the value .
Following is an example of a host.info file created for a NIM installation:
#------------------ Network Install Manager ---------------
# warning - this file contains NIM configuration information
# and should only be updated by NIM
export NIM_NAME=sphps03
export NIM_HOSTNAME=speth03.aixedu...
#------------------ Network Install Manager ---------------
# warning - this file contains NIM configuration information
# and should only be updated by NIM
export NIM_NAME=sphps03
export NIM_HOSTNAME=speth03.aixedu
export NIM_CONFIGURATION=standalone
export NIM_MASTER_HOSTNAME=spcntl.aixedu
export NIM_MASTER_PORT=1058
export RC_CONFIG=rc.diag...
• Determine the NIM master host name.
• Log in to the NIM master.
• List the objects in the NIM database.
• While still on the NIM master, list the NIM client definition for the node having a problem.
telnet sp2n01 was used to log in to the boot/install server node. /etc/bootptab was checked and found to have no entry for sp2n08. So you now need to check the NIM information.
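The /etc/bootptab check described above can be sketched as a small function that looks up one node and prints its fields one per line. This is an illustration only: the function name show_bootptab_entry is hypothetical, and it assumes each entry sits on a single colon-separated line that begins with the node name, as in the sample entry shown in this section.

```shell
#!/bin/sh
# Hypothetical helper: print the tag=value fields of one bootptab entry,
# or return non-zero when the node has no entry.
# Assumption: one entry per line, starting with "<node>:".
show_bootptab_entry() {
    node=$1
    bootptab_file=${2:-/etc/bootptab}
    entry=$(grep "^$node:" "$bootptab_file" 2>/dev/null) || return 1
    # Split the colon-separated fields and keep only the tag=value pairs.
    echo "$entry" | tr ':' '\n' | grep '='
}

# Usage: report a missing entry for the node used in this example.
show_bootptab_entry sp2n08 || echo "no bootptab entry for sp2n08"
```

A missing entry points back to the NIM client definition, which is why the procedure moves on to checking the NIM information next.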
/etc/bootptab was checked, and now it had an entry for sp2n08 as follows:
sp2n08:bf=/tftpboot/sp2n08:ip=9.12.20.8:ht=ethernet:ha=10005AFA1B12:sa=9.12.20.1:sm=255.255.255.0:
7. Net Boot was tried again and now it worked.
2.11 Node Customization Problems When a node is installed, its bootp_response option in the SDR is set to install.
The problem is that the perl script mkinstall, called from setup_server, creates the install_info file using the initial_hostname. However, pssp_script on the other end expects the reliable_hostname. The mkinstall and mkconfig scripts were changed to use the reliable_hostname instead.
4. Kerberos functions as a third party to authenticate the identities of clients and servers. Kerberos on the RS/6000 SP is used to initially authenticate the identity of the user and to provide information through which the server can authenticate the identity of clients in a distributed environment.
A realm name 3.2.1 Principal Kerberos defines a name space of authenticated users and services. Each different client and service has a unique principal name. An RS/6000 SP user who wishes to use any Kerberos-authenticated service must be registered to kadmin...
3.2.3 Realm A realm is the set of principals sharing the same authentication database and authentication server. The realm name identifies each independently administered Kerberos site. Kerberos does not specify any constraints on the form of the realm name. setup_authent...
It is installed on the primary authentication server (typically the Control Workstation). This component is not installed on the nodes. 3.3.2 ssp.clients 2.1.0.5 This component is installed on the Control Workstation, the RS/6000 SP nodes, and other RS/6000 hosts where Kerberos is used. It includes: •...
For further details, see Appendix A, “RS/6000 SP Script Files” on page 229. 3.4.2 install_cw This creates the hardware monitor ACLs in /spdata/sys1/spmon/hmacls. For further details, see Appendix A, “RS/6000 SP Script Files” on page 229.
3.5.3 kpropd Daemon The kpropd daemon only runs on secondary authentication servers (if one or more have been set up). The authentication databases used by the secondary authentication servers are copies of the primary database. The databases are maintained by the kpropd daemon, which receives the database content in encrypted form from a program, kprop, which runs on the primary server.
The Kerberos database contains the name of the authentication realm and all the principals' names and their keys. The database files can be converted to an ASCII file by running /usr/lpp/ssp/kerberos/etc/kdb_util dump. Use kdb_util load to convert the ASCII file back to binary.
3.6.4 /etc/krb-srvtab The server key file, /etc/krb-srvtab, contains the names and private keys of the local instances of Kerberos-protected services. During the setup of the Control Workstation or the nodes, the keys for service principals are stored in the authentication database (for use by the authentication server) and in the file /etc/krb-srvtab (for use by the services themselves).
3.6.6 /etc/krb.realms This file maps a host name to an authentication realm for the services provided by that host. Each line in the file must be in one of the following forms: •...
26-Apr-96 14:48:04 Kerberos started, PID=17802
26-Apr-96 14:48:25 Kerberos started, PID=17818
26-Apr-96 14:48:25 kerberos: 2503-000 Could not read master key.
26-Apr-96 14:48:25 Kerberos will pause so as not to loop init
26-Apr-96 14:59:21 Kerberos started, PID=41620
26-Apr-96 15:19:11 Kerberos started, PID=57046
Figure 30.
root@sp21cw0 / > klist
Ticket file: /tmp/tkt0
Principal: root.admin@SP21CW0
Issued Expires Principal
Apr 26 14:59:45 May 26 14:59:45 krbtgt.SP21CW0@SP21CW0
Apr 26 15:00:10 May 26 15:00:10 hardmon.sp21cw0@SP21CW0
Apr 26 15:38:50 May 26 15:38:50 rcmd.sp21n01@SP21CW0
Apr 26 15:38:50 May 26 15:38:50 rcmd.sp21n02@SP21CW0...
3.7.5 dsh and p* Commands The dsh command uses rsh to execute a specific AIX command on any group of nodes or other remote RS/6000 hosts within the authentication realm, in parallel. The group of target hosts may be specified by the WCOLL variable, which points to a file containing the hostname of each target host.
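As an illustration of the WCOLL mechanism, the sketch below writes a list of hostnames to a file, one per line, and exports WCOLL so that it points at that file. The function name make_wcoll and the file location are hypothetical, the hostnames are borrowed from the klist sample in this chapter, and the final dsh invocation is left commented out because it only makes sense on an SP system with valid Kerberos tickets.

```shell
#!/bin/sh
# Hypothetical helper: write the given hostnames, one per line, to a
# working-collective file and export WCOLL so that it points at the file.
make_wcoll() {
    wcoll_file=$1
    shift
    : > "$wcoll_file"                  # create or truncate the file
    for host in "$@"; do
        echo "$host" >> "$wcoll_file"
    done
    WCOLL=$wcoll_file
    export WCOLL
}

# Usage: the file location and hostnames are examples only.
make_wcoll /tmp/wcoll.$$ sp21n01 sp21n02
# dsh date     # assumption: run on an SP system; targets the hosts in $WCOLL
```

Call the function directly rather than inside a command substitution, otherwise the exported variable is lost with the subshell.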
3.8.2 Tickets Confirm that the user is known to Kerberos and that the user has a valid ticket for the requested service by running the klist command. Check that the user's ticket cache file exists.
3.8.5 TCP/IP Check that TCP/IP is functioning correctly by using the ping command to confirm communication with the various adapters on each node. Also use the host <hostname> and host <IP address> commands to ensure that hostname resolution is correct (both commands must return the same output).
3.8.8 PTF Levels Ensure that the Control Workstation and all the nodes are at the same PTF level. 3.8.9 Rebuild the Kerberos Database If all else fails, the Kerberos authentication database may be completely re-created.
Chapter 4. The Switch Most of the switch-related problems require a clear understanding of the switch components. This chapter explains those components, gives examples of configuration and topology files, and explains how to handle switch problems and interpret switch log files.
Figure 41. The HiPS Showing the External Connections It is worth stating at this point that the Control Workstation is not part of the switch network, and putting a switch adapter in the Control Workstation is not supported in the current environment.
Actually, for 128-way systems, while all 16 available connections on the node switch boards are cabled to the intermediate switch boards (4 to each of the ISBs), only half of these cables are actually used to transfer data. It is by taking out these redundant cables and using the connection on this ISB to connect to other ISBs that it is possible to go beyond the 128-way to larger switch sizes.
Figure 43 on page 90 gives an example of the cabling for a 3-switch system without using intermediate switch boards. Figure 43. Example of the Cabling on a 48-Way System The exact cabling configuration will be addressed later (that is, which jack connection goes to which node or other jack connectors on other switches), but first the internal cabling on the switch board will be covered.
In the current environment, it is not possible for the High Performance Switch and the SP Switch to coexist within the same RS/6000 SP system. If there is a requirement to install a new SP Switch without migrating the existing High Performance Switch, then a new system, which has its own Control Workstation and its own SP environment, is required.
The problem that you have with global synchronous clocks is that if the master fails, the entire system goes down. SP Switch has improved recoverability by putting two oscillators on each board. However, the switchover will probably still be a manual process for a while.
With newer technology, the SP Switch is more reliable than HiPS. Also, fewer parts improve the reliability. The function of the 32 HiPS driver/receiver chips is embedded in the SP Switch chips. This is also true of most of the clocking logic.
SW0 to SW7 Logical identification of switch chips
• In output files, add 10,000 to the logical number.
• The frame number is used to build the logical number: multiply the frame number by 10.
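The numbering rule for switch chips can be sketched as a short calculation. Note the assumption: the text only says that the frame number is multiplied by 10 and that 10,000 is added in output files, so the sketch assumes that the chip index (0 to 7, from SW0 to SW7) is added to the frame number times 10 to form the logical number. The function name switch_chip_output_id is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compute the number shown for a switch chip in
# output files.
# Assumption: logical number = (frame * 10) + chip index, chip index 0..7.
switch_chip_output_id() {
    frame=$1
    chip=$2
    if [ "$chip" -lt 0 ] || [ "$chip" -gt 7 ]; then
        echo "chip index must be between 0 and 7" >&2
        return 1
    fi
    logical=$(( frame * 10 + chip ))
    echo $(( logical + 10000 ))        # output files add 10,000
}

# Usage: chip SW5 on frame 1.
switch_chip_output_id 1 5
```

Under these assumptions, SW5 on frame 1 has logical number 15 and appears as 10015 in output files.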
For example, Node 14 (N14) is connected to the switch board at J4 on the High Performance Switch, but on the SP Switch it is connected at J34. Look at Figure 45 on page 95 and compare it with Figure 41 on page 88 for the High Performance Switch to discover how it is cabled differently.
Note: This file must not be changed; otherwise the configuration is not supported. All RS/6000 SP systems are cabled in a standard way that follows the wiring set out in the topology files. This pattern of cabling or the contents of the topology files should only be changed for diagnostic purposes on the advice of an IBM engineer.
• The board, switch chip and port at which each node is attached • The chip-to-chip connections within each board, including port data • The interboard connections, including chip and port data...
Figure 47. Topology File - Example Comparing the HiPS and SP Switch boards, as shown, we see that the node-to-switch port connections for SP Switch are the same as for HiPS. The intra-switch connections are also the same.
4.3.2 Topology File Nomenclature Figure 48. Topology File Nomenclature Logical Notation The logical notation has three fields that define one end of the cable and three fields that define the other end. An end of the cable is defined as <...
4.3.3 HiPS Clock Subsystem Understanding the switch clock subsystem will help you to get a better appreciation of how various failures in the clock tree could cause certain patterns of errors to arise in the log files.
jacks J19 to J33 are on a different branch. If none of the jacks have a clock, then the problem most likely stems back to level A, or back to the clock card, or even back to the switch that is supposed to be driving that clock card.
Figure 51. The High Performance Switch Board Further, each phase-locked loop provides a clock redrive function, and any selected clock signal is passed through one of these redrives before being distributed to all switch chips on the board.
powered. A synchronous reset is performed by the switch supervisor microcode each time the clock source (PLL or mux) is set. • Power-on reset (POR) − Flush registers (zero all bits) −...
Page 124
token, it will verify if it is still the backup before it takes over for the primary. The backup daemon cancels this thread when it is no longer the backup. 4.4.3 Secondary Node Behavior...
Page 125
For the High Performance Switch the only reason one of these files should exist is if an IBM engineer wishes to temporarily override the topology file in the SDR to assist in the diagnosis of a possible hardware failure on the switch.
Page 126
/etc/SP on the primary. This topology file is further distributed on to the nodes (through the boot/install servers, if there are any).
Eclock - Controls the clock source for each switch board within an RS/6000 SP. The clock source topology file (located in /etc/SP on the Control Workstation) is selected during installation of the system.
The next fault seen is likely to be caused by this timeout. If there is a problem with switch faulting and there are symptoms like those described above, then contact the local IBM Support Center and mention APARs IX53234 and IX54543.
Page 128
Run the following command to resolve this on that node:
# /usr/lpp/ssp/css/ifconfig css0 <switch IP address> arp
A detailed flow chart of this script can be found in Appendix A, “RS/6000 SP Script Files” on page 229 under A.4, “The rc.switch Script” on page 262.
Page 129
Notice that for the SP Switch, Estart_sw calculates the timeout based on the number of switches:
LIMIT = 180 + ((NUM_SWITCH - 2) * 60) where NUM_SWITCH > 2
The following steps are referred to as Switch Initialization:
1.
10. The primary node updates the SDR switch_responds class for its partition. For each node in the partition, the (autojoin, isolated) pair is set to one of the following:
(0,0) Initial...
Estart can all create a switch fault. Either the Worm on the nodes or the switch chips themselves can report a fault to the Worm on the primary node. The Worm process verifies the switch connections beginning at a node designated as the primary node.
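The Estart_sw timeout rule quoted earlier for the SP Switch can be worked through with a small shell function. This is only a sketch: estart_limit is a hypothetical helper name, not a PSSP command, and the value returned for two or fewer switches is an assumption, because the text gives the formula only for NUM_SWITCH > 2.

```shell
# Sketch of the Estart_sw timeout rule quoted in the text.
# estart_limit is a hypothetical helper, not part of PSSP.
estart_limit() {
    num_switch=$1
    if [ "$num_switch" -gt 2 ]; then
        # LIMIT = 180 + ((NUM_SWITCH - 2) * 60), as stated in the text
        echo $((180 + (num_switch - 2) * 60))
    else
        # Assumption: the base value of 180 applies when NUM_SWITCH <= 2;
        # the text states the formula only for NUM_SWITCH > 2.
        echo 180
    fi
}
```

For example, estart_limit 4 adds two increments of 60 to the base value of 180.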
Page 132
If a node does not come up on the SP Switch, always check this attribute using the following command:
# SDRGetObjects switch_responds adapter_config_status
This may give a good indication of where the problems lie. To assist with the...
Page 133
There is an undocumented command called that runs first the unconfigure method and then the configure method. This command should only be run at the request of an IBM engineer as the results can be unpredictable in certain circumstances. 4.7 Switch Log Files There are numerous log files associated with the switch, some of which can provide essential data for resolving switch problems.
Page 134
• /var/adm/SPlogs/css/daemon.stdout
• /spdata/sys1/logtables/css.tab
rc.switch.log
This log file contains specific information about the node that relates to the switch. If you do not know which type of switch is installed on the SP system, look at this log file.
Page 135
The format of this file is considerably different between the SP Switch and the High Performance Switch. This in part reflects the difference between the two switches in the way they detect and handle switch faults.
Page 136
fs_dump
This will format fault_service kernel extension traces. This command should be run on the primary node and any of the failing nodes. To run the command, issue:
fs_dump > /tmp/fs_dump.out &
4.7.1 The out.top Log File...
Page 137
6. -8 is the most common indicator of a miswire. You should look for a cable_miswire file on the primary node. 7. -9 could mean that the switch element (chip or adapter) was failing.
Page 138
For the most part, if an internal port reports an error, the FRU is the card that contains that port. You may discern patterns that indicate clock problems, but you will quickly deduce that the FRU is the same because the clock problem is a broken driver on the switch card.
Page 139
6. All the ports on a chip are reporting a problem If all of the ports on a chip are reporting a problem, and they are not all connected to the same switch board, it is most likely a problem with the chip.
Page 140
administrators to power things off and on. Such events will cause switch faults. • If many devices report errors at the same time, it is very possible that there is something strange going on. You should delve back into your knowledge of how the clocks are distributed and how the boards are connected before you determine if this is a real problem or not.
Figure 55. The flt File
The flt file has some nomenclature that differentiates device types, but it also lists the device number in a field that is generically labeled “id.” With this in mind, it was decided to differentiate switch IDs from node IDs by adding 100,000 to them.
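The ID convention described for the flt file can be illustrated with a short shell sketch. classify_flt_id is a hypothetical helper, not a PSSP tool, and it assumes that the 100,000 offset is applied to switch IDs, so that any id of 100000 or more denotes a switch and anything below denotes a node.

```shell
# classify_flt_id is a hypothetical helper, not part of PSSP.
# Assumption: the 100,000 offset is applied to switch IDs, so an id
# of 100000 or more is a switch and anything below is a node.
classify_flt_id() {
    id=$1
    if [ "$id" -ge 100000 ]; then
        # Remove the offset to recover the switch number
        echo "switch $((id - 100000))"
    else
        echo "node $id"
    fi
}
```

Under that assumption, an id field of 100016 would refer to switch 16, while an id of 12 would refer to node 12.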
Page 142
because port 2 corresponds to bit 2. If a DO was found on port 5, the value will be 20, because port 5 corresponds to bit 5. If you are unfamiliar with such notation, you may find it useful to translate the hexadecimal values to binary first.
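The port-to-bit notation can be checked with two small shell functions. Both names are hypothetical helpers for illustration, not PSSP commands, and the sketch assumes a POSIX shell whose arithmetic expansion accepts 0x-prefixed hexadecimal constants (older ksh versions may need the 16#value form instead).

```shell
# mask_for_port and ports_from_mask are hypothetical helpers.

# Print the hexadecimal value that has only the bit for the given port set.
mask_for_port() {
    printf '%x\n' $((1 << $1))
}

# Print the port numbers whose bits are set in a hexadecimal value.
ports_from_mask() {
    mask=$((0x$1))      # interpret the argument as hexadecimal
    port=0
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out }$port"   # append, space separated
        fi
        mask=$((mask >> 1))
        port=$((port + 1))
    done
    echo "$out"
}
```

With these definitions, mask_for_port 5 gives the value 20 quoted in the text, and ports_from_mask 24 reports ports 2 and 5, because hexadecimal 24 is binary 100100.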
4.7.5 Fault Syndrome on the flt File The fault syndrome register is the register that collects error information in the adapter. There are two classes of error with which we should be concerned: •...
4.8 Additional Problem Determination The following sections discuss determining additional system problems. 4.8.1 Debugging Scripts If the log files do not reveal the problem, it is useful to debug the failing command on its own.
Page 145
The way in which the RS/6000 SP can be partitioned is dependent on the topology of the High Performance Switch, and so this is where the topic is started after a brief introduction.
Page 146
5.1.2 Some Rules for Partitioning There are some basic rules that should always be followed when you set up the partitions which impose some limitations on the configurations that can be selected for the partitions.
Page 147
carry out this task. The underlying reason behind this is that PSSP 2.1 uses a product called the Network Installation Manager (NIM) to carry out many of the tasks associated with the installation of the nodes. NIM is not compatible with AIX Version 3.2.5.
Page 148
5.2.1 Partitioning a Single Switch/Frame System Figure 59 on page 131 shows a single switch board and the four switch chips that service the nodes. On the basis that all nodes on a single switch chip must be in the same partition, the maximum number of partitions that are possible is four.
Nodes 5, 7, 13, 15, 17, 19, 21, 23, 25, 27, 29, and 31 Using the topology files in this way is the only accurate method that can be used to assess the impact of System Partitioning on all possible RS/6000 SP configurations. However, using the slot number method is perfectly valid for configurations where a single switch services nodes in a single frame only.
Page 150
5.2.2 Partitioning a Two Switch/Frame System A maximum of two partitions is supported in these configurations. However, any combination is supported, giving the following options: Nodes Description The default single partition...
5.2.4 Switch Topologies The effect of partitioning the High Performance Switch is that there is no interference between the partitions that have been set up. Anything that happens in one partition will have no effect on the others. If a switch fault is generated in one partition, this will not affect the switch in the others.
but not by 50%. Testing has shown that the expected performance reduction is usually within the 10 to 15% range. Much will depend on the specifics of the particular system, and the type of network traffic generated on the switch.
IP address aliasing will be dealt with in detail later in the chapter, but it is sufficient to know at this stage that each partition is identified by having a different alias. The topology file for the second partition is shown in Figure 61 on page 133 for completeness.
Page 154
performance across the switch is of great importance. By partitioning the system, the nodes may have to be concentrated on the switches, therefore reducing bandwidth. 5.2.5 Configuration Files All the switch configuration files at PSSP 2.1 are in the following directory:...
Figure 62. Directory Structure for the Topology Files in PSSP 2.1
Returning to the example of an 8_8 configuration, following is a list of all the associated topology files:
/spdata/sys1/syspar_configs/topologies/any.l1.8way1.0isb
/spdata/sys1/syspar_configs/topologies/any.l1.8way2.0isb
/spdata/sys1/syspar_configs/topologies/any.l1.8way3.0isb...
Page 156
this information is essential during Estart if the Worm on the primary node is to successfully initialize the switch in that partition. In cases where the switch will not initialize within a partition, check that the...
Page 157
Further detail about the setup_server script can be found in Appendix A, “RS/6000 SP Script Files” on page 229 under A.3, “The setup_server Script” on page 239. If all the nodes are to be installed at AIX Version 4.1, then the installation can proceed normally and the partitioning can be set up as required when the time is appropriate.
Page 158
In addition, ensure that there will not be a conflict in the future should you add any additional nodes. Choose an IP address that will definitely be outside the range even if you upgrade your RS/6000 SP by adding a large number of nodes.
5.3.2 Process Overview If everything is successful so far, the next steps can all be carried out from the SMIT menus by running:
# smit syspar
A menu will appear, as shown in Figure 63.
Figure 64. Flow Chart of the System Partitioning Process The 2nd, 3rd, and 4th steps in the SMIT menu (Select System Partition Configuration, Display Information for Given Configuration or Layout, and Select System Partition Layout) are represented by the single box list/select configs in the flow chart.
5.3.4 Customizing the Partitions Having selected the configuration (based on the number of switch boards or frames in the system), chosen the layout you require (based on which nodes are required in which partition), and selected that layout, the task of customizing each of the partitions should be carried out.
Page 162
The next field, Backup Primary Node, may not be present on the SMIT screen. If recent maintenance has been applied to the system, it may be there because this functionality was introduced with the new code for the SP Switch. If this field is present but the High Performance Switch is installed on the system, leave this field blank;...
Apply System Partition Configuration

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                        [Entry Fields]
  System Partition Apply Option         Verify only.
  Correct VSD configuration?            No. Discontinue
  System Partition Path                 config.8_8/layout.1...
Page 164
4. The SDR is not up or is not responding. 5. The current data in the SDR that relates to the partitioning has inconsistencies or is incorrect. Check and correct any problems associated with points 1, 2, and 3. If everything is correct in these areas, then proceed to the next section to find out how to resolve problems with the SDR.
Page 165
5.4.1 SDR Daemons The SDR contains all the RS/6000 SP-specific configuration information. It is a central repository that only resides on the Control Workstation. Access to this information is essential for all the nodes to function properly. If there are any SDR or SDR error messages, it is a good idea to check that the SDR daemon (sdrd) is running.
Page 166
startsrc uses the flag in order to start the multiple daemons. It runs the following script to start up the daemons and ensure that multiple instances of the same daemon do not get created:...
Page 167
then the node can still reach the SDR by using this IP address (after failing to connect to the primary ). In the example used, if there was an AIX Version 3.2.5 partition, the contents of...
Figure 67. Data Organization of the SDR in PSSP 2.1 The SDR object classes have been split into two types: object classes that are systemwide: /spdata/sys1/sdr/system and object classes that are specific to the partitions:...
Figure 68. PSSP 2.1 SDR Directory Structure 5.4.3 New Object Classes Two completely new object classes have been created for System Partitioning. The Syspar_map object class has already been mentioned earlier. It is a systemwide or global class that describes which nodes are in which partition.
Page 170
the object class with a program such as , but be sure to copy the object class to a safe place first should the file become inadvertently corrupted. Also, because the SDR daemon was bypassed during this operation, it is...
Page 171
• The host listed in each of the custom files has the same interface as the hostname of the Control Workstation. • The SDR is up and can answer a call.
Page 172
5.5 Heartbeat Reorganization The heartbeat has been mentioned several times already because it is an important part of the administration of an RS/6000 SP and it is also affected by System Partitioning. Now we will look at the heartbeat in more depth.
Page 173
node with the next highest IP address (and so on), until the ring is completed by the node with the lowest IP address pinging to the Control Workstation. If there is no response to several retries of the ping packet within the timeout period, that heartbeat daemon will notify the group leader that connectivity has been lost over en0 , which will then take that node out of the ring.
Page 174
If the node has crashed, then there will be either an 888 LED displayed, or an associated crash code (such as 0c9 , for example). In this case, reboot the node after ensuring that it has finished taking a system dump and contact IBM support for assistance in analyzing the dump data.
partition by looking at the process table. However, you can run the following commands to find out this information:
# lssrc -g hb
# lssrc -g hr
Following is an example of this output for the heartbeat:...
Page 176
If there are problems with the heartbeat daemons after System Partitioning, use the same techniques that have already been described in this chapter. If there is a problem in one partition, it is possible to work just with the relevant subsystem, rather than with all of them.
Page 177
sigforce = 0
display = 1
waittime = 20
grpname = "hb"
Turning on debug merely changes the ODM attribute for stdout and refreshes the relevant daemon. Follow the next example, substituting the partition name in order to redirect output to a file:
# odmget -q "subsysname=hb.spcws"...
Page 178
Page 179
Chapter 6. Error Logging The RS/6000 SP uses the AIX and BSD error logging mechanisms to handle error generation and error reporting. The trend is to use AIX error logging exclusively, but PSSP 2.1 uses some public domain code, such as AMD, NTP, Supper, and so on.
Figure 70. Error Logging Components To create an entry in the error log, the errdemon retrieves the appropriate template from the repository, the resource name of the unit that caused the error, and detailed data.
Page 181
recommended actions. Collectively, the templates comprise the Error Record Template Repository. 6.1.2 Error Logging Commands errclear Deletes entries from the error log. This command can erase the entire error log. Removes entries with specified error ID numbers, classes, or types.
Page 182
6.2 SP Error Logging The RS/6000 SP uses both the AIX Error Logging facilities and the BSD syslog, as well as a number of function-specific log files to record error events on each node.
Page 183
In a regular RS/6000 system, a battery is installed to maintain NVRAM. On an RS/6000 SP system, there is no battery and NVRAM may be lost when the node is powered off. AIX writes the last error log entry to NVRAM. During system startup, the last entry is read from NVRAM and placed in the error log when the error daemon is started.
Page 184
The following is an example of the /etc/sysctl.conf file:
/etc/sysctl.conf
# (C) COPYRIGHT IBM CORP. 1993
# ALL RIGHTS RESERVED
# US GOVERNMENT USERS RESTRICTED RIGHTS - USE, DUPLICATION
# OR DISCLOSURE RESTRICTED BY GSA ADP SCHEDULE CONTRACT WITH
# IBM CORP.
Page 185
smit perrdemon_shw
You can alter one or more of the configuration parameters for the AIX Error Log on a set of nodes. Because the additional entries are generated by SP System software, the AIX Error Log file size should be a minimum of 4 MB.
Page 186
The logged process ID of the failing resource (optional)
A free form error message
Important: Note that syslogd does not log the year in a record's timestamp. The comparisons for start and end times are done on a per record basis and can cause unexpected results if the log is allowed to span more than one year.
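The effect of the missing year can be seen with a minimal shell sketch. The helpers month_number and stamp_key are hypothetical and exist only for illustration; they are not how the SP log commands compare records. They build a sortable MMDD key from the month and day that a syslog-style timestamp carries.

```shell
# month_number and stamp_key are hypothetical helpers for illustration.

# Map a three-letter month name to a two-digit number.
month_number() {
    case "$1" in
        Jan) echo 01 ;; Feb) echo 02 ;; Mar) echo 03 ;; Apr) echo 04 ;;
        May) echo 05 ;; Jun) echo 06 ;; Jul) echo 07 ;; Aug) echo 08 ;;
        Sep) echo 09 ;; Oct) echo 10 ;; Nov) echo 11 ;; Dec) echo 12 ;;
        *)   echo 00 ;;
    esac
}

# Build an MMDD key from a month name and a day; no year is available.
stamp_key() {
    day=${2#0}          # drop a leading zero so printf does not read octal
    printf '%s%02d\n' "$(month_number "$1")" "$day"
}
```

A record stamped Dec 31 yields the key 1231 and a record stamped Jan 1 yields 0101, so the January record sorts first even if it was written a day after the December one. This is why a log that spans a year boundary can produce unexpected start and end time comparisons.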
# Licensed Materials - Property of IBM
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
# /etc/syslog.conf - control output of syslogd
# Each line must consist of two parts:...
Page 188
The psyslrpt Command
You can use the psyslrpt command to generate reports of log entries in the log files generated on a set of nodes by syslogd. Options allow you to select the files and records that will be reported.
Page 189
On the Control Workstation, psyslclr is used to trim daemon facility messages older than 6 days. This is done in /usr/lpp/ssp/bin/cleanup.logs.ws, which is run from the Control Workstation's crontab file. The SP system uses the crontabs file to periodically update file collections and clean up log files.
Page 190
The psyslclr Command
You can use the psyslclr command to delete log entries in the log files generated by syslogd. There are options that allow you to select the files and...
Page 191
splm -a archive -t /spdata/sys1/logtables/weekly.tab -c -d /var/archives
The fast path to invoke the Remove Archives menu is:
smit spremove_archive
To remove all files and directories in and including /var/archives/arc_weekly.tab, issue from the command line:
splm -a archive -t /spdata/sys1/logtables/weekly.tab -r -d /var/archives...
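The relationship between the -t table file, the -d directory, and the archive directory in the example above can be sketched in shell. archive_dir is a hypothetical helper, and the arc_ plus table name pattern is inferred from this single example (weekly.tab and /var/archives giving /var/archives/arc_weekly.tab); treat it as an assumption rather than a documented splm rule.

```shell
# archive_dir is a hypothetical helper, not part of splm.
# Assumption: the archive directory is <dir>/arc_<table file name>,
# as suggested by the single example in the text.
archive_dir() {
    dir=$1      # value given to -d
    table=$2    # value given to -t
    echo "$dir/arc_$(basename "$table")"
}
```

For instance, archive_dir /var/archives /spdata/sys1/logtables/weekly.tab reproduces the path named in the text, which can be handy for checking what a remove operation is about to delete.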
Page 192
The splm Command
With the splm command, you can execute a number of log management functions on a single host or a set of hosts in parallel. The functions are driven by a log table file that contains the target node designations and associated files and the commands to execute.
Service
The service function first calls the snap command to gather system data to the top level directory if the option is used. The command creates a set of subdirectories based on the arguments.
Page 194
In the following example, we will remove the service collections created using the table ssp.tab:
splm -a service -t /spdata/sys1/logtables/ssp.tab -r
Gather service collections
The fast path to invoke for the Gather Service Collections menu is:...
Page 195
workstation for convenience. You can do this using the @hostname parameter in the /etc/syslog.conf file. 6.3 Error Notification Facility Following is a description of the actions to be taken by the notification method EN_pend, located in the directory /spdata/sys1/err_methods.
Page 196
hdisk0 errors on the nodes sp21n7, sp21n8, sp21n9, and sp21n10, and performing the pre- and post-actions: 1. Create .pre and .post scripts on one of the nodes. For example, create EN_hdisk0.pre and EN_hdisk0.post in the directory /spdata/sys1/err_methods on node sp21n7.
The following is an example of the /spdata/sys1/err_methods/EN_pend file:
#!/bin/ksh
#####################################################################
# Module: EN_pend
#CPRY
# 5765-296 (C) Copyright IBM Corporation 1995
# Licensed Materials - Property of IBM
# All rights reserved.
The following is an example of the /spdata/sys1/err_methods/EN_pend.envs file:
#!/bin/ksh
#####################################################################
# Module: EN_pend.envs
#CPRY
# 5765-296 (C) Copyright IBM Corporation 1995
# Licensed Materials - Property of IBM
# All rights reserved.
Page 199
From the command line: To add, remove, or show error notification objects in parallel on the SP system, enter:
penotify -f show
Useful options with the penotify command are: Executes on all nodes in the system partition.
Page 200
5. The AIX Error Labels KERN_PANIC and DOUBLE_PANIC KERN_PANIC and DOUBLE_PANIC error log entries are generated when a kernel panic occurs. 6.3.2 Error Notification Object Class The error notification object class allows applications to be notified when particular errors are recorded in the system error log.
Page 201
TRUE Matches alertable errors.
FALSE Matches non-alertable errors.
en_resource Identifies the name of the failing resource. For the hardware error class, a valid resource name is the device name.
en_rtype Identifies the type of the failing resource. For the hardware error class, a valid resource type is the device type a resource is known by in the devices object.
Page 202
#!/bin/ksh
#########################################################################
# Run errpt to get the full error report for the error that
# was written and redirect it to a unique errnot.$$ file.
# $$ will expand to the PID of this script.
Page 203
-------------------------------------------------------------------------
ERROR LABEL: HPS_DIAG_ERROR2_ER
ERROR ID: 323C48A0
Date/Time: Fri Aug 10 11:29:03
Sequence number: 18282
Machine Id: 000005911800
Node Id: sp21n12
Error Class:
Error Type: PERM
Resource Name: Worm
Resource Class: NONE...
Page 204
#!/bin/ksh
#########################################################################
# Run errpt to get the full error report for the error that
# was written and redirect it to a unique errnot.$$ file.
# $$ will expand to the PID of this script.
Page 205
6.3.5 Notification on Boot Device The following example shows how to mail the error report to root@controlworkstation when an error on the boot device hdisk0 occurs. Adding the dsh -a command to the ODM commands will perform the action on all nodes of the RS/6000 SP.
Page 206
6.3.6 Notification Power Loss and PANIC The following example shows how to mail the error report to root@controlworkstation when an unexpected power loss and kernel panic occur. Adding the dsh -a command to the ODM commands will perform the action on all nodes of the RS/6000 SP.
Page 207
odmdelete -o errnotify -q "en_name = power.obj"
odmdelete -o errnotify -q "en_name = panic.obj"
odmdelete -o errnotify -q "en_name = dbl_panic.obj"
To view these objects in the ODM database, enter:
odmget -q "en_name = power.obj" errnotify
odmget -q "en_name = panic.obj"...
Page 208
When this situation occurs, errdemon creates an error log entry to inform you about the problem. You can easily correct this problem by enlarging the buffer. Note: On the RS/6000 SP system, thin and wide nodes behave differently regarding NVRAM and power loss. errclear...
Page 209
log the end-signal information and terminate immediately. Each message is one line. A message can contain a priority code, marked by a digit enclosed in <> (angle brackets) at the beginning of the line.
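The leading priority code described above can be pulled out of a message with plain shell pattern matching. priority_of is a hypothetical helper written for illustration, not part of syslogd, and it prints an empty string when the line does not start with a numeric code in angle brackets.

```shell
# priority_of is a hypothetical helper, not part of syslogd.
# Print the numeric priority code at the start of a message, if any.
priority_of() {
    case "$1" in
        "<"[0-9]*">"*)
            rest=${1#"<"}           # drop the opening bracket
            code=${rest%%">"*}      # keep everything before the first ">"
            case "$code" in
                *[!0-9]*) echo "" ;;        # not purely numeric: no code
                *)        echo "$code" ;;
            esac
            ;;
        *)
            echo ""                 # message carries no priority code
            ;;
    esac
}
```

Calling priority_of "<3>disk error" prints 3, while a line without the bracketed prefix prints nothing.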
Facility Names
kern     Kernel
user     User Level
mail     Mail subsystem
daemon   System daemons
auth     Security or authorization
syslog   syslogd daemon
lpr      Line-printer subsystem
news     News subsystem
uucp     uucp subsystem
*        All facilities
LOG_EMERG emerg .
Page 211
6.4.2 SP Log Daemons setup_logd sets up the SP logging daemon (splogd) and is called by installation scripts when the Control Workstation is installed. It can also be run by root on a different workstation to have splogd spawned by the SRC (System Resource Controller).
Page 212
The hwevents file contains state change actions that are to be performed by the splogd logging daemon. The fields in this file are:
frame Specifies the frame number (1-n) or * (asterisk) for all frames.
Page 213
Starting and Stopping the splogd Daemon The splogd daemon is under System Resource Controller (SRC) control. It uses the signal method of communication in the SRC. The splogd daemon is a single subsystem and not associated with any SRC group.
Page 214
Chapter 7. Isolating Problems on the SP System Most of the problems covered in previous chapters can be grouped into four main categories. Each of those categories has different types of errors.
/etc/tftpboot directory. They keep the information and resources to network boot or customize the nodes. More information about this process and what files are created can be found in Chapter 2, “The Installation Process” on page 7.
LED 260-262 These numbers indicate that the node is displaying the boot menu, but there must be something wrong with the serial link, because nodecond cannot get messages from it.
LED 231-239 This is a booting problem.
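The two LED ranges described here can be turned into a quick lookup for use while watching a node boot. led_hint is a hypothetical helper, not an SP command, and it covers only the ranges quoted in this passage; any other value falls through to a generic message.

```shell
# led_hint is a hypothetical helper covering only the LED ranges
# quoted in the text; it is not an exhaustive LED table.
led_hint() {
    case "$1" in
        26[0-2]) echo "boot menu displayed; check the serial link" ;;
        23[1-9]) echo "booting problem" ;;
        *)       echo "not covered by this sketch" ;;
    esac
}
```

Because the lookup uses shell patterns rather than numeric comparison, it also accepts non-numeric codes such as 0c9 without error.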
Page 218
Problems with System Monitor GUI
• The spmon command fails to start the System Monitor GUI. Check the following:
− Verify authorization - Run the klist command to verify that Kerberos tickets have not expired.
− Check that the following entry is in the /etc/syslog.conf configuration file:
daemon.notice /var/adm/SPlogs/SPdaemon.log
− Check that the /var/adm/SPlogs/SPdaemon.log file has the correct permissions (rw-r--r--).
− Check that the error record templates for SPMON exist:
# errpt -t | grep SPMON
7.3 Isolating Switch Problems...
These files define how the nodes will communicate. The topology files define the links between nodes, and the clock files define how the clock is sourced into each switch board and adapter. If your topology or clock file is wrong, you will not be able to use the switch.
Page 221
Chapter 8. Producing a System Dump There is not much difference in the way system dumps are handled in an SP environment. The only exception to this is in the primary dump device used by the nodes when they are running AIX Version 4.1.
Page 222
Following are some of the kernel structures that are captured by the system dump: System Variables and Statistics This comprises the kernel parameters either set by the user or hardcoded into the kernel. Such statistics...
Page 223
Make sure you know your system and know what your primary and secondary dump devices are set to.
• To list the current dump destination:
# sysdumpdev -l
primary /dev/hd6
secondary...
Page 224
Make change permanent
-d directory Specifies the directory where the dump is copied at boot time. If the copy fails, the system continues to boot.
-D directory Specifies the directory where the dump device is copied at boot time.
Page 225
information to the primary dump device. If you start your own dump before copying the information in your dump device, your new dump will overwrite the existing information. A user-initiated dump is different from a dump initiated by an unexpected system halt because the user can designate which dump device to use.
This indicates a network dump is in progress, and the host is waiting for the server to respond. The value in the three-digit display should alternate between 0c7 and 0c2 or 0c9 . If the value does not change, then the dump did not complete due to an unexpected error.
Page 227
Copy a System Dump to Removable Media The system dump is 16957952 bytes and will be copied from /dev/hd6 to media inserted into the device from the list below. Please make sure that you have sufficient blank, formatted media before you continue.
Page 228
#!/usr/bin/ksh
# Usage: mkdump [filename] [block size]
# Description: Copy the system dump from the primary dump device to file name (/var/adm/ras/dump_file by default) using the block size specified (4096 by default).
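When choosing a block size for a copy such as the one mkdump performs, it helps to know how many blocks a dump of a given size occupies. The following sketch uses a hypothetical helper, blocks_needed, which is not part of the mkdump script; it simply rounds the byte count up to whole blocks, with the 4096 default taken from the usage text above.

```shell
# blocks_needed is a hypothetical helper, not part of mkdump.
# Print the number of whole blocks needed to hold a given byte count.
blocks_needed() {
    bytes=$1
    bs=${2:-4096}                       # default block size from the usage text
    echo $(( (bytes + bs - 1) / bs ))   # integer division, rounded up
}
```

For the 16957952-byte dump shown in the earlier SMIT example, blocks_needed 16957952 reports 4141 blocks at the default size, since the dump is 4140 full blocks plus a 512-byte remainder.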
Page 229
The information gathered with the snap command can be used to identify and resolve system problems. You must have root authority to execute this command. If you use the -a flag, then you need approximately 8 MB of temporary disk space to collect all the system information, including the content of the error log.
8.1.4 Using crash to Analyze a Dump
Figure 82. Using crash
The dump can be analyzed by using the crash command. The crash command can be used to check the validity of the dump:
# crash /var/adm/ras/vmcore.x /unix...
> stat (to show machine status at time of crash)
    sysname: AIX
    nodename: sp21n02
    release: 1
    version: 4
    machine: 000168205700
    time of crash: Tue Aug 6 15:38:38 1996
    age of system: 7hr, 22min...
after executing the crash command, you know that you have a full dump that can be analyzed. Avoid sending a dump to the Support Center only to find out that the Center cannot do anything with it because it is a partial dump.
9.1.5 Network Time Protocol (NTP)
The RS/6000 SP system uses NTP to synchronize the time-of-day clocks on the Control Workstation and the nodes.
The following sections of this chapter assume that the user has chosen to implement SP User Management together with file collections and the automount daemon.
9.2 Components
The following options of the ssp installp image are relevant to User and Services Management. Component version numbers included are for PTF 11.
9.2.1 ssp.basic 2.1.0.10
This component includes the login control functions whereby users can either be allowed or denied access to specific nodes.
Site Environment Information
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP]                                    [Entry Fields]
  Default Network Install Image          [bos.obj.ssp.41]
  Remove Install Image after Installs    false...
root@sp21cw0 / > splstdata -e | pg
List Site Environment Database Information
attribute                value
-------------------------------------------------------------------------
control_workstation      sp21cw0
cw_ipaddrs               9.12.0.137:9.12.60.98:9.12.60.99:
install_image            bos.obj.ssp.41
remove_image             false
primary_node
ntp_config               consensus
ntp_server               ""
ntp_version
amd_config               true...
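When a script needs a single value from this listing (for example ntp_config or amd_config), the attribute/value layout can be filtered with awk. The following is a minimal sketch; the function name site_attr is hypothetical, and it assumes each attribute appears at the start of a line followed by its value, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: read "attribute value" lines on standard input and
# print the value of the named attribute.
# Returns non-zero if the attribute does not appear at all.
site_attr() {
    awk -v name="$1" '
        $1 == name { print $2; found = 1 }
        END        { exit (found ? 0 : 1) }
    '
}
```

On the Control Workstation it could be used as splstdata -e | site_attr ntp_config, which for the listing above would print consensus.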
If using the SMIT panels to add a user, only the user's name needs to be entered. The other fields will default to values contained in the System Data Repository and in the /usr/lpp/ssp/bin/spmkuser.default file. After the user has been successfully added, use the Change/Show Characteristics of a User SMIT panel to check the user's attributes.
maintenance. The /var/sysman/supper command uses the Software Update Protocol (SUP) to manage RS/6000 SP file collections and transfer them across the system. For further information on file collections and the supper command, see Chapter 14, “Managing File Collections,”...
9.4.1 Using File Collections
If filec = true (use splstdata -e to confirm), then the following sequence will start the supfilesrv daemon on the Control Workstation.
1. /etc/rc.sp is started from /etc/inittab
2.
SUP 7.24 (4.3 BSD) for file /tmp/.sf10668 at May 13 12:02:12
SUP Upgrade of user.admin at Mon May 13 12:02:12 1996
SUP Fileserver 7.12 (4.3 BSD) 27302 on sp21cw0
SUP Locked collection user.admin for exclusive access...
For further information on the use of the Amd tool, see Chapter 15, “Managing Amd,” in the IBM RS/6000 Scalable POWERparallel Systems: Administration Guide, GC23-3897. RS/6000 SP System Management: Easy, Lean and Mean, GG24-2563 is also a good source of information.
For further information on Amd options, refer to the man pages in /usr/lpp/ssp/public/amd920824upl75.tar.Z. The man pages can be extracted with the commands:
    zcat amd920824upl75.tar.Z | tar -xvf - amd920824upl75/amd/amd.8
    zcat amd920824upl75.tar.Z | tar -xvf - amd920824upl75/amq/amq.8
    mv amd920824upl75/amd/amd.8 /usr/man/man8...
# This file contains the definition of the amd maps for /auto.
# /etc/amd/amd-maps/amd.auto
/defaults type:=nfs;opts:=rw,soft;sublink:=${key}
host==sp21cw0;type:=link;fs:=/tony \
    host!=sp21cw0;type:=nfs;rhost:=sp21cw0;rfs:=/tony
host==sp21cw0;type:=link;fs:=/tony \
    host!=sp21cw0;type:=nfs;rhost:=sp21cw0;rfs:=/tony
three host==sp21cw0;type:=link;fs:=/tony \
    host!=sp21cw0;type:=nfs;rhost:=sp21cw0;rfs:=/tony
four host==sp21cw0;type:=link;fs:=/tony \
    host!=sp21cw0;type:=nfs;rhost:=sp21cw0;rfs:=/tony
Figure 94. Sample Amd Map File
The example in Figure 94 will mount the directory /tony (and its subdirectories) on the Control Workstation and on the nodes as /auto.
6. Sometimes problems with Amd may be caused by the Network File System (NFS) subsystem. Use the lssrc -g nfs command to check the status of the NFS subsystem, as shown in Figure 97 on page 224.
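When the status has to be checked on many nodes, reading the lssrc listing by eye becomes tedious. The sketch below filters the listing for subsystems that are not active. The function name inactive_subsystems is hypothetical, and it assumes the usual lssrc layout: a header line beginning with Subsystem, and one line per subsystem whose last field is the status.

```shell
#!/bin/sh
# Hypothetical helper: read "lssrc -g <group>" output on standard input and
# print the name of every subsystem whose status is not "active".
inactive_subsystems() {
    awk '$1 != "Subsystem" && NF > 0 && $NF != "active" { print $1 }'
}
```

A possible use is lssrc -g nfs | inactive_subsystems; empty output would mean that every listed NFS subsystem is active.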
Print management tools are supplied with PSSP 2.1 to shift printing functions from the nodes to a print server, eliminating the need to maintain and support print queues on a large number of nodes. An RS/6000 SP node is designed to use a unique remote host, PRINT_HOST, as a print server.
9.7.1 Using NTP
If ntp_config = [consensus/timemaster/internet] (use splstdata -e to confirm), then the following sequence will start the xntp daemon on the Control Workstation and the nodes:
1. /etc/rc.sp is started from /etc/inittab
2.
    for i in `hostlist -av`
    do
        /usr/bin/rsh $i 'kill -15 `ps -e | grep xntp | grep -v grep | cut -c1-6`'
        /usr/bin/rsh $i /etc/rc.ntp
    done
This script will stop the NTP daemon on each node and then restart it.
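The script above finds the daemon's process ID with grep and a fixed column range. As an alternative sketch, not taken from the original, the same selection can be done with awk on the command-name field, which does not depend on column widths and needs no grep -v grep. The function name xntp_pids is hypothetical; it assumes the process ID is the first field of ps -e output and the command name is the last.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -e" output on standard input and print the
# process ID of every process whose command name contains "xntp".
xntp_pids() {
    awk '$NF ~ /xntp/ { print $1 }'
}
```

On a node this could be used as pids=$(ps -e | xntp_pids), followed by kill -15 $pids only when pids is not empty.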
Appendix A. RS/6000 SP Script Files
Many of the SP tasks are carried out by script files. This appendix provides flow charts for the main PSSP scripts, which can be used as a reference for problem determination.
A.2 The install_cw Script
This script is invoked during the installation procedure. It installs the CW, creating the directory structure and setting up the process and subsystem infrastructure that will manage the RS/6000 SP.
A.3 The setup_server Script
This script is run to check and configure boot/install servers. It can be run from the Control Workstation or from the nodes that will be used as boot/install servers.
Appendix B. The SDR Structure
The SDR is the central data repository for the RS/6000 SP. Communication with the SDR follows a client/server model. The server is a process (daemon) running on the Control Workstation; the client portion is...
C.2 RS/6000 SP System Files
The following files, which exist on RS/6000 SP nodes and the CW, contain IP addresses or hostnames. It is best to look through these files when completing the procedures for changing hostnames and IP addresses for your RS/6000 SP system.
    Works with hostnames for Resource Management pools (CW)
/tftpboot/<host>.config_info
    Contains IP address and hostname for each RS/6000 SP node, and is found on CW and boot servers
/tftpboot/<host>.install_info
    Contains IP address and hostname for each RS/6000 SP node, and is found on CW and boot servers
/tftpboot/<host>-new-srvtab...
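Looking through these files by hand is error-prone, so a search for the old name or address can help. The following is a minimal sketch, assuming a plain fixed-string match is enough; the function name find_name_refs and the name oldcw in the usage note are hypothetical, and the file list has to be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: find_name_refs NAME FILE...
# Prints every listed file that still contains NAME (a hostname or an IP
# address) as a fixed string. Files that do not exist are skipped.
find_name_refs() {
    name=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue
        grep -F -l -e "$name" "$f"
    done
    return 0
}
```

For example, find_name_refs oldcw /etc/krb.conf /etc/krb.realms /tftpboot/*.config_info /tftpboot/*.install_info would list the files that still mention the old name oldcw.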
C.3 Procedures Used When Changing IP Addresses/Hostnames
Changing an IP address or hostname follows the same procedure for an RS/6000 SP node and for the CW, except that the Control Workstation requires additional network and RS/6000 SP configuration steps.
You may need to update the /etc/krb.conf file to point to the correct Kerberos server. The update is now complete on the RS/6000 SP node. You need to make the changes at the Control Workstation and then reboot the RS/6000 SP nodes.
This step should only be executed for hostname changes for the CW. It is not required for changes made to IP addresses, or for RS/6000 SP node updates. See Step 12 in the RS/6000 SP Installation Guide. Manually check that the authentication files /etc/krb.conf, /etc/krb.realms, and /etc/krb-srvtab reference the new hostname.
IP address/hostname changes. Instead of hardcoding hostnames, you may reference the $SERVER and $CW variables.
14. You can now execute a “reboot” on each of the RS/6000 SP nodes. This can be accomplished using the System Monitor GUI. Follow the same install sequence as during installation: customize the boot servers first, and then customize the remaining RS/6000 SP nodes.
4. It is best to reboot all of the RS/6000 SP nodes to reflect the IP address and hostname changes. When the RS/6000 SP nodes initialize, your RS/6000 SP system should be activated using the new IP addresses and hostnames.
IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.
The following terms are trademarks of other companies: C-bus is a trademark of Corollary, Inc. PC Direct is a trademark of Ziff Communications Company and is used by IBM Corporation under license.
• PSSP 2.1 Technical Presentation, SG24-4542
• RS/6000 SP System Management: Easy, Lean, and Mean, GG24-2563
E.2 Redbooks on CD-ROMs
Redbooks are also available on CD-ROMs. Order a subscription and receive updates 2-4 times a year at significant savings.
How To Get ITSO Redbooks
This section explains how both customers and IBM employees can find out about ITSO redbooks, CD-ROMs, workshops, and residencies. A form for ordering books and CD-ROMs is also provided.
IBM Direct Publications Catalog http://www.elink.ibmlink.ibm.com/pbl/pbl
• Internet Listserver
With an Internet E-mail address, anyone can subscribe to an IBM Announcement Listserver. To initiate the service, send an E-mail note to announce@webster.ibmlink.ibm.com with the keyword subscribe in the body of the note (leave the subject line blank).
IBM Redbook Order Form
Please send me the following:
Title    Order Number    Quantity
• Please put me on the mailing list for updated versions of the IBM Redbook Catalog.
First name    Last name
Company
Address
City...
List of Abbreviations
Access Control List
Advanced Interactive Executive
Application Programming Interface
Boot-Install Server
Berkeley Software...
Management Information Base
Message Passing Interface
Message Passing Library
Massively Parallel Processors
Network Installation Manager
Printed in U.S.A. SG24-4778-00