I. INTRODUCTION

Oracle Cluster Health Monitor - OS Tool (CHM/OS)
--------------------------------------------------------------------------------

CHM/OS is designed to detect and analyze operating system (OS) and cluster
resource related degradation and failures in order to improve the
diagnosability of Oracle Clusterware and Oracle RAC issues such as a node
eviction. It continuously tracks OS resource use at the node, process, and
device level, and it collects and analyzes cluster-wide data. When run in
real-time mode, an alert is shown to the operator when built-in thresholds are
exceeded. For root cause analysis, historical data can be replayed to examine
what was occurring at the time of a failure.

II. LINUX INSTALLATION

NOTE: If CHM/OS v1.x is already installed on your cluster, you will have to
remove the existing version before installing the new one.

The installation requires setting the CRFPERLBIN environment variable to the
location of a perl version greater than or equal to 5.6.0.

Bash Example:
  export CRFPERLBIN=/usr/bin

CHM/OS requires a Linux kernel version greater than or equal to 2.6.9 and x86
architecture. It will also work on x86_64 if the kernel is configured to run
32-bit binaries.

The CHM/OS v1.x standalone installation can only be performed on systems which
do not have the v11.2+ integrated version of CHM/OS that is part of the Oracle
Grid Infrastructure installation beginning in version 11.2.0.2.

NOTE: References to the install home in this README assume /usr/lib/oracrf for
the standalone install.

The following are the steps for the CHM/OS v1 standalone install.

1. Create a user '<user>:<group>' (e.g. crfuser:oinstall) on all the nodes
   where CHM/OS is being installed. Make sure the user's home directory is the
   same on all nodes.

   Example command while logged in as root:
   # useradd -d /opt/crfuser -s /bin/sh -g oinstall crfuser

2. Set up passwordless ssh for the user created in step 1 between each pair of
   nodes. Confirm from each node that this user can ssh to all other nodes,
   including the local node, using only the hostname (without the domain). No
   password or other user intervention, such as acknowledging prompts, should
   be required. This can be done by generating the key on one node for
   '<user>', adding the public key to ~/.ssh/authorized_keys, and copying the
   whole ~/.ssh directory to the '<user>' home directory on all nodes. Then
   confirm by ssh'ing into all nodes from all nodes (see the sketch after this
   list). For an ssh-key reference see http://fedoranews.org/dowen/sshkeys/

3. If there is a previous installation of CHM/OS v1 (or IPD/OS), delete it
   from all nodes prior to installation. Perform the following steps:

   a. Log in as root and disable the daemon start script:
      # /etc/init.d/init.crfd disable
   b. Run the uninstall script:
      # $CRFPERLBIN /usr/lib/oracrf/install/crfinst.pl -d
   c. Delete all BDB databases from all nodes.
   d. Manually delete the install home if it still exists.

4. Log in as '<user>' and unzip crfpack.zip into a new subdirectory of the
   home directory. This subdirectory (Ex: /tmp/crfinstall) must be easily
   recreatable by ssh on all nodes. This subdirectory will be deleted when the
   installation is completed.

5. Run the ./install/crfinst.pl script on one node with the desired node list,
   specified as a comma-separated list, for a cluster-wide install.

   Example:
   # $CRFPERLBIN crfinst.pl -i node1,node2,node3 -b /opt/oracrfdb -m node1

6. Once step 5 completes, you will be prompted to run the crfinst.pl script
   again with -f and optionally -b on each node, while logged in as root, to
   finalize the install on that node.

7. Once the finalize operation is complete, run the following while logged in
   as root to enable CHM/OS on all nodes.

   Example:
   # /etc/init.d/init.crfd enable
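The following is a minimal sketch of the passwordless ssh setup from step 2,
assuming OpenSSH and the example names used above (user 'crfuser' from step 1
and nodes node1, node2, node3 from step 5); substitute your own user and node
names.

   On node1, while logged in as crfuser:
   $ ssh-keygen -t rsa                    (accept the defaults, empty passphrase)
   $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
   $ chmod 700 ~/.ssh
   $ chmod 600 ~/.ssh/authorized_keys
   $ scp -r ~/.ssh node2:~/               (repeat for node3)

   Then, from every node, verify that commands such as 'ssh node1 hostname',
   'ssh node2 hostname' and 'ssh node3 hostname' complete without a password
   prompt. Accept each host key once during this verification so that no
   further prompts remain.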
DO NOT bypass any of the above steps or try other ways to install, because the
daemons will not work correctly. Also remember that only root can de-install.
DO NOT attempt a re-install without first de-installing with
'$CRFPERLBIN crfinst.pl -d'.

All node names in the nodelist should be specified without the domain name to
maintain uniformity and ease of use. Entering node names with domain names may
lead to a failed installation.

Usage: crfinst.pl -a <nodelist> [<bdbpath>]
                  -c <nodelist> [<bdbpath>]
                  -d
                  -f [-b <bdbpath>]
                  -h
                  -i <nodelist> -b <bdbpath> [-m <master>] -N ClusterName

 -a : Add nodes for OS resource metrics collection from a configured node. The
      proper mechanism for adding a node is to stop CHM/OS on all the nodes,
      add the new node, and then restart on all nodes.

 -b : Specify the path where a Berkeley DB can be created to store OS metrics.
      This location MUST be outside of the location where the ZIP file was
      unpacked, because all of the directories under that location which were
      created by unzip will be removed. BDB files can be kept as-is for later
      usage. The location should be a path on a volume with at least 2GB of
      space available per node and be writable only by the root user. The BDB
      should not be on the root filesystem. Its location is required to be the
      same on all nodes. If that cannot be done, specify a different location
      during the finalize (-f) operation on each node, following the above
      size requirements. The path MUST NOT be on a shared disk. If a shared
      BDB path is provided to multiple hosts, BDB corruption will occur.

 -c : Perform the user equivalence and other pre-install checks for the nodes.
      The installation steps will be prompted for, but the actual installation
      will not occur.

 -d : De-install the software on the node where the script is run (only the
      root user may perform this operation).

 -f : Finalize the install on a node (only the root user may perform this
      operation).

 -h : Show this help.

 -i : Install the software on the nodes listed in the <nodelist>.

 -N : Specify the name of the cluster.

 <bdbpath>  : Path for the Berkeley DB used to store OS metrics.
 <master>   : Name of the node for the master logger daemon.
 <nodelist> : Comma-separated list of nodes on which to install CHM/OS.

III. THE DAEMONS

CHM/OS consists of three daemons: ologgerd, oproxyd and osysmond. There is one
ologgerd master daemon on only one node in the installed set of nodes, and
there is one osysmond on every node. If there is more than one node in the
installed set of nodes, an additional node is chosen as the standby for the
master ologgerd. If the master daemon terminates because it is not able to
restart after a fixed number of retries, or if the node where the master was
running goes down, the standby becomes the master and selects a new standby.
The master manages the OS metric repository in a Berkeley DB and interacts
with the standby to maintain a replica of the master OS metrics database on
the standby node.

The osysmond daemon is the monitoring and OS metric collection daemon that
sends the data to ologgerd for storing in the master and standby repositories.

The oproxyd daemon is the proxy that handles connections on the public
network. If CHM/OS is configured with private node names, only oproxyd listens
on the public network for external clients such as oclumon and chmosg. It runs
on all the nodes and is highly available.

NOTE: If the oclumon or chmosg clients are run remotely, all data is sent in
clear text, including process names. Additionally, clients do not authenticate
to ologgerd. Therefore, if the security of this data is a concern, it is
recommended to run the clients from a local node.
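As an illustration of how the collected and historical data can be examined
with the oclumon client from a local node, the following is a hedged sketch;
the exact oclumon options shipped with your CHM/OS version may differ, so
check 'oclumon -h' on your installation. The node name and timestamps below
are placeholders.

   Show the most recent node views for node1 (here, the last five minutes):
   $ oclumon dumpnodeview -n node1 -last "00:05:00"

   Replay a past interval on all nodes for root cause analysis:
   $ oclumon dumpnodeview -allnodes -s "2012-01-15 09:00:00" -e "2012-01-15 09:30:00"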
Copyright 2008, 2012 Oracle and/or its affiliates. All rights reserved.