Guide to Advanced Linux Command Mastery, Part 4: Managing the Linux Environment


by Arup Nanda Oracle ACE Director

Published May 2009

In this installment, learn how to manage the Linux environment effectively through these commonly used commands.

ifconfig

The ifconfig command shows the details of the network interface(s) defined in the system. The most common option is -a, which shows all the interfaces.

# ifconfig -a

The usual name of the primary Ethernet network interface is eth0. To find out the details of a specific interface, e.g. eth0, you can use:

# ifconfig eth0

The output is shown below, with explanations:

Figure 1: output of ifconfig eth0, with explanations


Here are some key parts of the output:

  • Link encap: the type of the hardware physical medium supported by this interface (Ethernet, in this case)
  • HWaddr: the unique hardware identifier of the NIC. Every NIC has a unique identifier assigned by the manufacturer, called the MAC address. The IP address is attached to the server through this interface. If the IP address is changed, or the card is moved from this server to a different one, the MAC does not change.
  • Mask: the netmask
  • inet addr: the IP address attached to the interface
  • RX packets: the number of packets received by this interface
  • TX packets: the number of packets sent
  • errors: the number of errors in sending or receiving

The command is not just used to check the settings; it’s used to configure and manage the interface as well. Here is a short list of parameters and options for this command:

up/down – enables or disables a specific interface. You can use the down parameter to shut down (or disable) an interface:

# ifconfig eth0 down

Similarly, to bring it up (or enable it), you would use:

# ifconfig eth0 up

media – sets the type of the Ethernet media. Common values for the media parameter are 10base2, 10baseT, and AUI. If you want Linux to sense the media automatically, you can specify “auto”, as shown below:

# ifconfig eth0 media auto

add – sets a specific IP address for the interface. To set an IP address of 192.168.1.101 to the interface eth0, you would issue:

# ifconfig eth0 add  192.168.1.101

netmask – sets the netmask parameter of the interface. Here is an example where you can set the netmask of the eth0 interface to 255.255.255.0

# ifconfig eth0 netmask  255.255.255.0

In an Oracle Real Application Clusters environment you have to set the netmask in a certain way, using this command.

In some advanced configurations, you can change the MAC address assigned to the network interface. The hw parameter accomplishes that. The general format is:

ifconfig <Interface> hw <TypeOfInterface> <MAC>

The <TypeOfInterface> shows the type of the interface, e.g. ether, for Ethernet. Here is how the MAC address is changed for eth0 to 12:34:56:78:90:12 (Note: the MAC address shown here is fictional. If it matches any actual MAC, it’s purely coincidental.):

# ifconfig eth0 hw ether 12:34:56:78:90:12

This is useful when you add a new card (with a new MAC address) but do not want to change the Linux-related configuration such as network interfaces.

Usage for the Oracle User

This command, along with netstat (described below), is one of the most widely used in managing Oracle RAC. Oracle RAC’s performance depends heavily on the interconnect used between the nodes of the cluster. If the interconnect is saturated (that is, it can no longer carry any additional traffic) or is failing, you may see reduced performance. The best course of action in this case is to look at the ifconfig output for any failures. Here is a typical example:

# ifconfig eth9
eth9      Link encap:Ethernet   HWaddr 00:1C:23:CE:6F:82  
          inet addr:10.14.104.31   Bcast:10.14.104.255   Mask:255.255.255.0
          inet6 addr: fe80::21c:23ff:fece:6f82/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST   MTU:1500  Metric:1
          RX packets:1204285416 errors:0 dropped:560923 overruns:0 frame:0
          TX packets:587443664 errors:0 dropped:623409 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1670104239570 (1.5 TiB)  TX bytes:42726010594 (39.7 GiB)
          Interrupt:169 Memory:f8000000-f8012100
                            

Note the dropped values on the RX and TX lines. The dropped count is extremely high; the number should ideally be 0 or close to it. A count of more than half a million suggests a faulty interconnect that drops packets, forcing them to be resent, which should be a clue in diagnosing the issue.
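If you check the interconnect regularly, it may help to pull out just the dropped counts instead of reading the whole output. The following is a minimal sketch, not a definitive tool: the function name dropped_counts is made up for illustration, and it assumes the classic ifconfig layout shown above, where RX and TX each sit on one line with a dropped:N field. Newer versions of ifconfig format these lines differently.

```shell
# Print the RX and TX dropped counts from ifconfig-style text on stdin.
# Assumption: classic layout, e.g. "RX packets:... errors:0 dropped:N ...".
dropped_counts() {
    awk '
        /RX packets|TX packets/ {
            dir = $1                       # "RX" or "TX"
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^dropped:/) {
                    split($i, kv, ":")     # kv[2] holds the count
                    print dir, kv[2]
                }
            }
        }
    '
}

# Usage on a live system (the interface name is only an example):
#   ifconfig eth9 | dropped_counts
```

Running it twice a few minutes apart and comparing the numbers tells you whether packets are still being dropped or the count is merely historical.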

netstat

The status of the input and output through a network interface is assessed via the netstat command. This command can provide complete information on how the network interface is performing, down to the socket level. Here is an example:

# netstat
Active Internet connections  (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address  State      
tcp        0      0 prolin1:31027 prolin1:5500     TIME_WAIT 
tcp        4      0 prolin1l:1521 applin1:40205    ESTABLISHED 
tcp        0      0 prolin1l:1522 prolin1:39957    ESTABLISHED 
tcp        0      0 prolin1l:3938 prolin1:31017    TIME_WAIT
tcp        0      0 prolin1l:1521 prolin1:21545    ESTABLISHED
                               
… and so on …
                            

The above output goes on to show all the open sockets. In very simplistic terms, a socket is akin to a connection between two processes. [Please note: strictly speaking, “sockets” and “connections” are technically different. A socket could exist without a connection. However, a discussion on sockets and connections is beyond the scope of this article. Therefore I have merely presented the concept in an easy-to-understand manner.] Naturally, a connection has to have a source and a destination, called local and remote address. The end points could be on the same server; or on different servers.

In many cases, the programs connect to the same server. For instance, if two processes communicate with each other, the local and remote addresses will be the same, as you can see in the first line, where the local and remote addresses are both the server “prolin1”. However, the processes communicate over a port, which will be different. This port is shown next to the host name after the “:” (colon) mark. The user program sends the data to be sent across the socket to a queue and the receiver reads from a queue at the remote end. Here are the columns of the output:

  1. The leftmost column “ Proto” shows the type of the connection – tcp in this case.
  2. The column Recv-Q shows the bytes of data in the queue to be sent to the user program that established the connection. This value should be as close to 0 as possible. In busy servers this value will be more than 0 but shouldn’t be very high. A higher number may not mean much, unless you see a large number in Send-Q column, described below.
  3. The Send-Q column denotes the bytes in the queue to be sent to the remote program, i.e. the remote program has not yet acknowledged receiving it. This should be close to 0. A large number may indicate a network bottleneck.
  4. Local Address is source of the connection and the port number of the program.
  5. Foreign Address is the destination host and port number. In the first line, both the source and destination are on the same host: prolin1. The connection is simply waiting. The second line shows an established connection between port 1521 of prolin1 and port 40205 of the host applin1. It’s most likely an Oracle connection coming from the client applin1 to the database server prolin1. The Oracle listener on prolin1 runs on port 1521; so the port of the source is 1521. In this connection, the server is sending the requested data to the client.
  6. The column State shows the status of the connection. Here are some common values.
    • ESTABLISHED – that the connection has been established. It does not mean that any data is flowing between the end points; merely that the end points have talked to each other.
    • CLOSED – the connection has been closed, i.e. not used now.
    • TIME_WAIT – the connection is being closed but there are still packets in the network that are being handled.
    • CLOSE_WAIT – the remote end has shut down and has asked to close the connection.
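On a busy server the netstat listing runs to hundreds of lines, so a per-state tally is often easier to read. Here is a small sketch, assuming the plain netstat layout shown earlier, where State is the sixth column; the function name state_summary is hypothetical.

```shell
# Count TCP connections per state from netstat-style text on stdin.
# Assumption: the State value is the sixth whitespace-separated column.
state_summary() {
    awk '$1 == "tcp" { count[$6]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Usage on a live system:
#   netstat | state_summary
```

An unusually large TIME_WAIT or CLOSE_WAIT count stands out immediately in this form.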

From the foreign and local addresses, especially the port numbers, you can probably guess that the connections are Oracle related, but wouldn’t it be nice to know that for sure? The -p option shows the process information as well:

#  netstat -p
Proto  Recv-Q Send-Q Local Address Foreign Address State       PID/Program name   
tcp        0       0 prolin1:1521   prolin1:33303   ESTABLISHED  1327/oraclePROPRD1  
tcp        0       0 prolin1:1521   applin1:51324   ESTABLISHED 13827/oraclePROPRD1 
tcp        0       0 prolin1:1521   prolin1:33298   ESTABLISHED  32695/tnslsnr       
tcp        0       0 prolin1:1521   prolin1:32544   ESTABLISHED  15251/oracle+ASM    
tcp        0       0 prolin1:1521   prolin1:33331   ESTABLISHED  32695/tnslsnr    

This clearly shows the process ID and the process name in the last column, confirming that these are Oracle server processes, the listener process, and ASM server processes.

The netstat command can have various options and parameters. Here are some key ones:

To find out the network statistics for various interfaces, use the -i option.

#  netstat -i
Kernel  Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0  6860659      0      0      0  2055833      0      0      0 BMRU
eth8       1500   0     2345      0      0      0      833      0      0      0 BMRU
lo        16436   0 14449079      0      0      0 14449079      0      0      0 LRU

This shows the different interfaces present in the server (eth0, eth8, etc.) and the metrics associated with the interface.

  • RX-OK shows the number of packets successfully received (on this interface)
  • RX-ERR shows the number of errors
  • RX-DRP shows packets that were dropped and had to be re-sent (either successfully or not)
  • RX-OVR shows packets overrun

The next set of columns (TX-OK, TX-ERR, etc.) shows the corresponding stats for sent data.

Flg column is a composite value of the property of the interface. Each letter indicates a specific property being present. Here is an explanation of the letters.

B – Broadcast
M – Multicast
R – Running
U – Up
O – ARP Off
P – Point to Point Connection
L – Loopback
m – Master
s – Slave
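Since the Flg value is just a string of these letters, it can be expanded mechanically. The sketch below does that for the letters listed above; the function name decode_flags is made up for this example, and any letter not in the list is reported as unknown rather than guessed.

```shell
# Expand a netstat -i Flg value (e.g. BMRU) into readable property names.
decode_flags() {
    flags=$1
    out=""
    while [ -n "$flags" ]; do
        rest=${flags#?}              # everything after the first character
        c=${flags%"$rest"}           # the first character itself
        case $c in
            B) name=Broadcast ;;
            M) name=Multicast ;;
            R) name=Running ;;
            U) name=Up ;;
            O) name="ARP Off" ;;
            P) name="Point to Point" ;;
            L) name=Loopback ;;
            m) name=Master ;;
            s) name=Slave ;;
            *) name="unknown($c)" ;;
        esac
        out="${out:+$out, }$name"    # comma-separate after the first entry
        flags=$rest
    done
    printf '%s\n' "$out"
}

# Usage:
#   decode_flags BMRU
```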

You can use the --interface option (note: there are two hyphens, not one) to display the same information for a specific interface.

# netstat --interface=eth0 
Kernel Interface table
Iface       MTU Met    RX-OK  RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR  TX-DRP TX-OVR Flg
eth0       1500   0 277903459      0      0      0 170897632      0      0      0 BMsRU

Needless to say, the output is wide and is a little difficult to grasp at one shot. If you are comparing across interfaces, it makes sense to have a tabular output. If you want to examine the values in a more readable format, use the -e option to produce an extended output:

# netstat -i -e
Kernel Interface table
eth0      Link encap:Ethernet   HWaddr 00:13:72:CC:EB:00  
          inet addr:10.14.106.0   Bcast:10.14.107.255   Mask:255.255.252.0
          inet6 addr: fe80::213:72ff:fecc:eb00/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6861068 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2055956 errors:0 dropped:0 overruns:0  carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3574788558 (3.3 GiB)  TX bytes:401608995 (383.0 MiB)
          Interrupt:169

Does the output seem familiar? It should; it’s the same as the output of ifconfig.

If you’d rather see the output showing IP addresses instead of host names, use the -n option.

The -s option shows the summary statistics of each protocol, rather than showing the details of each connection. This can be combined with the protocol specific flag. For instance -u shows the stats related to the UDP protocol.

# netstat -s -u
Udp:
    12764104 packets received
    600849 packets to unknown port received.
    0 packet receive errors
    13455783 packets sent

Similarly, to see the stats for TCP, use -t and for raw sockets, -w.

One of the really useful options is the display of the routing table, the -r option.

#  netstat -r
Kernel  IP routing table
Destination     Gateway         Genmask          Flags   MSS Window  irtt Iface
10.20.191.0     *               255.255.255.128  U         0 0          0 bond0
172.22.13.0     *               255.255.255.0    U         0 0          0 eth9
169.254.0.0     *               255.255.0.0      U         0 0          0 eth9
default         10.20.191.1     0.0.0.0          UG        0 0          0 bond0

The second column of the netstat output, Gateway, shows the gateway to which the routing entry points. If no gateway is used, an asterisk is printed instead. The third column, Genmask, shows the “generality” of the route, i.e., the network mask for this route. When given an IP address to find a suitable route for, the kernel steps through each of the routing table entries, taking the bitwise AND of the address and the netmask before comparing it to the target of the route (the Destination column).
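The bitwise AND comparison is easy to try out by hand. The following sketch reproduces only that one matching step for IPv4 addresses, using the first entry of the routing table above; it is an illustration, not how the kernel is actually implemented, and the function names ip_to_int and route_matches are hypothetical.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1                        # split the address on the dots
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1, ANDed with netmask $3, equals route target $2.
route_matches() {
    addr=$(ip_to_int "$1")
    dest=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( addr & mask )) -eq "$dest" ]
}

# Usage, with values from the routing table above:
#   route_matches 10.20.191.5   10.20.191.0 255.255.255.128 && echo "via bond0"
#   route_matches 10.20.191.200 10.20.191.0 255.255.255.128 || echo "no match"
```

The second address fails because 200 AND 128 leaves 128 in the last octet, which does not equal the target 10.20.191.0. The default route, with a Genmask of 0.0.0.0, matches every address for the same reason: anything ANDed with 0 is 0.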

The fourth column, Flags, displays the following flags that describe the route:

  • G means the route uses a gateway.
  • U means the interface to be used is up (available).
  • H means only a single host can be reached through the route. For example, this is the case for the loopback entry 127.0.0.1.
  • D means this route is dynamically created.
  • ! means the route is a reject route and data will be dropped.

The next three columns show the MSS, Window, and irtt that will be applied to TCP connections established via this route.

  • MSS stands for Maximum Segment Size – the size of the largest datagram for transmission via this route.
  • Window is the maximum amount of data the system will accept in a single burst from a remote host for this route.
  • irtt stands for Initial Round Trip Time. It needs a slightly longer explanation, which follows.

TCP has a built-in reliability check. If a data packet fails during transmission, it’s retransmitted. The protocol keeps track of how long it takes for the data to reach the destination and for the acknowledgement to be received. If the acknowledgement does not come within that timeframe, the packet is retransmitted. The amount of time the protocol has to wait before retransmitting is set once for the interface (and can be changed); that value is known as the initial round trip time. A value of 0 means the default value is used.

Finally, the last field displays the network interface that this route will use.

nslookup

Every reachable host in a network should have an IP address, which identifies it uniquely in the network. In the internet, which is a big network anyway, IP addresses allow the connections to reach servers running Websites, e.g. www.oracle.com. So, when one host (such as a client) wants to connect to another (such as a database server) using its name and not the IP address, how does the client browser know which IP address to connect to?

The mechanism of translating host names to IP addresses is known as name resolution. At the most rudimentary level, the host has a special file called hosts, which stores the IP Address – Hostname pairs. Here is an example file:

# cat /etc/hosts
# Do not remove the following  line, or various programs
# that require network  functionality will fail.
127.0.0.1       localhost.localdomain       localhost
192.168.1.101   prolin1.proligence.com      prolin1
192.168.1.102   prolin2.proligence.com      prolin2

This shows that the hostname prolin1.proligence.com is translated to 192.168.1.101. The special entry with the IP address 127.0.0.1 is called a loopback entry, which points back to the server itself via a special network interface called lo (which you saw earlier in the ifconfig and netstat commands).
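To see exactly which address a name maps to in a hosts-format file, you can search it the way a resolver would read it: one address per line, followed by any number of names. This is a minimal sketch; the function name hosts_lookup is made up, and it reads only the file it is given, not DNS or any other name service.

```shell
# Print the first IP address mapped to name $1 in a hosts-format file.
# $2 is the file to search; it defaults to /etc/hosts.
hosts_lookup() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                     # drop comments
        NF >= 2 {
            for (i = 2; i <= NF; i++)          # field 1 is the address
                if ($i == name) { print $1; exit }
        }
    ' "${2:-/etc/hosts}"
}

# Usage:
#   hosts_lookup prolin1
```

On most Linux systems the getent hosts command gives a similar answer while also honoring the system’s configured lookup order.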

Well, this is good, but you can’t possibly put all the IP addresses in the world in this file. There should be another mechanism to perform the name resolution. A special purpose server called a nameserver performs that role. It’s like a phonebook that your phone company provides; not your personal phonebook. There may be several nameservers available either inside or outside the private network. The host contacts one of the nameservers first, gets the IP address of the destination host it wants to contact, and then attempts to connect to the IP address.

How does the host know what these nameservers are? It looks into a special file called /etc/resolv.conf to get that information. Here is a sample resolv file.

; generated by  /sbin/dhclient-script
search proligence.com
nameserver 10.14.1.58
nameserver 10.14.1.59
nameserver 10.20.223.108

How do you make sure that the name resolution is working fine for a specific host name? In other words, you want to make sure that when the Linux system tries to contact a host called oracle.com, it can find the IP address on the nameserver. The nslookup command is useful for that. Here is how you use it:

# nslookup oracle.com
Server:         10.14.1.58
Address:        10.14.1.58#53
                              
Non-authoritative answer:
Name:   oracle.com
Address: 141.146.8.66

Let’s dissect the output. The Server output is the address of the nameserver. The name oracle.com resolves to the IP address 141.146.8.66. The name was resolved by the nameserver shown next to the word Server in the output.

If you put this IP address in a browser (http://141.146.8.66 instead of http://oracle.com), the browser will go to the oracle.com site.

If you made a mistake, or looked for a wrong host:

# nslookup oracle-site.com
Server:         10.14.1.58
Address:        10.14.1.58#53
                              
** server can't find oracle-site.com: NXDOMAIN

The message is quite clear: this host does not exist.

dig

The nslookup command has been deprecated. Instead, a newer, more powerful command, dig (domain information groper), should be used. On some newer Linux servers the nslookup command may not even be available.

Here is an example; to check the name resolution of the host oracle.com, you use the following command:

# dig oracle.com
                              
; <<>> DiG 9.2.4 <<>> oracle.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62512
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 8

;; QUESTION SECTION:
;oracle.com.                    IN      A

;; ANSWER SECTION:
oracle.com.             300     IN      A       141.146.8.66

;; AUTHORITY SECTION:
oracle.com.             3230    IN      NS      ns1.oracle.com.
oracle.com.             3230    IN      NS      ns4.oracle.com.
oracle.com.             3230    IN      NS      u-ns1.oracle.com.
oracle.com.             3230    IN      NS      u-ns2.oracle.com.
oracle.com.             3230    IN      NS      u-ns3.oracle.com.
oracle.com.             3230    IN      NS      u-ns4.oracle.com.
oracle.com.             3230    IN      NS      u-ns5.oracle.com.
oracle.com.             3230    IN      NS      u-ns6.oracle.com.

;; ADDITIONAL SECTION:
ns1.oracle.com.         124934  IN      A       148.87.1.20
ns4.oracle.com.         124934  IN      A       148.87.112.100
u-ns1.oracle.com.       46043   IN      A       204.74.108.1
u-ns2.oracle.com.       46043   IN      A       204.74.109.1
u-ns3.oracle.com.       46043   IN      A       199.7.68.1
u-ns4.oracle.com.       46043   IN      A       199.7.69.1
u-ns5.oracle.com.       46043   IN      A       204.74.114.1
u-ns6.oracle.com.       46043   IN      A       204.74.115.1

;; Query time: 97 msec
;; SERVER: 10.14.1.58#53(10.14.1.58)
;; WHEN: Mon Dec 29 22:05:56 2008
;; MSG SIZE  rcvd: 328

From this mammoth output, several things stand out. It shows that the command sent a query to the nameserver and got a response back. The AUTHORITY section lists the nameservers that are authoritative for the domain, such as ns1.oracle.com. The output also shows that the query took 97 milliseconds.

If the size of the output makes it less than useful, you can use the +short option to remove all that verbose output:

# dig +short oracle.com
141.146.8.66

You can also do a reverse lookup, from the IP address to the host name. The -x option is used for that.

# dig -x 141.146.8.66

The +domain parameter is useful when you are looking for a host inside a domain. For instance, if you are searching for the host otn in the oracle.com domain, you can either use:

# dig +short otn.oracle.com

Or you can use the +domain parameter:

# dig +short +tcp  +domain=oracle.com otn
www.oracle.com.
www.oraclegha.com.
141.146.8.66

Usage for the Oracle User

The connectivity is established between the app server and the database server. The TNSNAMES.ORA file, used by SQL*Net, may look like this:

prodb3 =
  (description =
    (address_list =
      (address = (protocol = tcp)(host = prolin3)(port = 1521))
    )
    (connect_data =
      (sid = prodb3)
    )
  )

The host name prolin3 must be resolvable by the app server: either it should be in the /etc/hosts file, or the host prolin3 should be defined in DNS. To make sure the name resolution works and points to the right host, you can use the dig command.
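When a TNSNAMES.ORA file holds many entries, checking each host by hand gets tedious. The sketch below pulls the host names out of such a file so that each can be passed to dig. It is only an illustration: the function name tns_hosts is made up, and it assumes every host is written as (host = name) as in the example above.

```shell
# Print the unique host names found in tnsnames.ora-style text on stdin.
# Assumption: hosts appear as "(host = name)", in any letter case.
tns_hosts() {
    tr -d ' \t' |                  # "(host = prolin3)" -> "(host=prolin3)"
        grep -io 'host=[^)]*' |    # keep only the host=... fragments
        cut -d= -f2 |
        sort -u
}

# Usage (the file location is an assumption; adjust it for your install):
#   tns_hosts < "$ORACLE_HOME/network/admin/tnsnames.ora" |
#       while read -r h; do
#           printf '%s -> %s\n' "$h" "$(dig +short "$h")"
#       done
```

Keep in mind that dig queries the nameservers only, so a host that is defined solely in /etc/hosts will come back empty here even though the app server can resolve it.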

With these commands you can handle most network-related tasks in a Linux environment. In the rest of this installment you will learn how to manage a Linux environment effectively.

uptime

You just logged on to the server and see some things that were supposed to be running are not. Perhaps the processes were killed or perhaps all processes were killed by a shutdown. Instead of guessing, find out if the server was indeed rebooted with the uptime command. The command shows the length of time the server has been up since the last reboot.

# uptime
 16:43:43 up 672 days, 17:46,   45 users,  load average: 4.45,  5.18, 5.38

The output shows much useful information. The first column shows the current time when the command was executed. The second portion – up 672 days, 17:46 – shows the amount of time the server has been up. The numbers 17:46 depict the hour and minutes. So this server has been up for 672 days, 17 hours, and 46 minutes as of now.

The next item – 45 users – shows how many users are logged in to the server right now.

The last part of the output shows the load average of the server over the last 1, 5, and 15 minutes respectively. “Load average” is a composite score that measures the load on the system based on CPU and I/O metrics. The higher the load average, the greater the load on the system. It’s not based on a scale; unlike a percentage, it does not end at a fixed number such as 100. In addition, the load averages of two systems can’t be compared: the number quantifies load on one system and is relevant to that system alone. This output shows that the load average was 4.45 over the last minute, 5.18 over the last 5 minutes, and so on.

The command does not have any options or accept any parameter other than -V, which shows the version of the command.

# uptime -V
procps version 3.2.3

Usage for Oracle Users

There is no clear Oracle-specific use of this command, except that you can find out the load on the system to explain some performance issues. If you see some performance issues on the database, and you trace it to high CPU or I/O load, you should immediately check the load averages using the uptime command. If you see a high load average, your next course of action is to dive down deep below the surface to find the root cause. To perform that deep dive, you have in your arsenal tools like mpstat, iostat, and sar (covered in this installment of this series).

Consider an output as shown below:

# uptime
 21:31:04 up 330 days,   7:16,  4 users,  load average: 12.90, 1.03, 1.00

This is interesting: the load average was very high (12.90) in the last minute but has been quite low, at 1.03 and 1.00, over the last 5 and 15 minutes respectively. What does it mean? It suggests that within the last 5 minutes some process started that caused the load average to jump. This process was not present earlier, because the longer-term load averages are so small. This analysis leads us to focus on the processes that kicked off during the last few minutes, speeding up the resolution process.
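The same comparison can be scripted if you want a quick yes-or-no answer. The sketch below reads a line of uptime output and reports a spike when the 1-minute load average exceeds the 5-minute figure by some factor. The function name load_spike and the default factor of 3 are arbitrary choices for illustration, and it assumes load averages printed with a period as the decimal separator.

```shell
# Read uptime output on stdin; report "spike" when the 1-minute load
# average is more than $1 times the 5-minute average (default factor: 3).
load_spike() {
    awk -v factor="${1:-3}" '
        /load average/ {
            sub(/.*load average: */, "")   # keep only "12.90, 1.03, 1.00"
            gsub(/,/, "")                  # drop the commas
            if (($1 + 0) > ($2 * factor)) print "spike", $1, $2
            else                          print "steady", $1, $2
        }
    '
}

# Usage on a live system:
#   uptime | load_spike
```

For the output above it prints spike 12.90 1.03; for the earlier example, where all three averages are close together, it prints steady 4.45 5.18.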

Of course, since it shows how long the server has been up, it can also explain why the database instance has been up only since that time.

who

Who is logged in the system right now? That’s a common question you might want to ask, especially when you are tracking down an errant user running some resource consuming commands.

The who command answers that question. Here is the simplest usage without any arguments or parameters.

# who
oracle   pts/2        Jan  8 15:57  (10.14.105.139)
oracle   pts/3        Jan  8 15:57  (10.14.105.139)
root     pts/1        Dec 26 13:42  (:0.0)
root     :0           Oct 23 15:32

The command can take several options. The -s option is the default; it produces the same output as the above.

Looking at the output, you might be straining your memory to remember what the columns are meant to be. Well, relax. You can use the -H option to display the header:

# who -H
NAME     LINE         TIME         COMMENT
oracle   pts/2        Jan  8 15:57  (10.14.105.139)
oracle   pts/3        Jan  8 15:57  (10.14.105.139)
root     pts/1        Dec 26  13:42 (:0.0)
root     :0           Oct 23  15:32

Now the meanings of the columns are clear. The column NAME shows the username of the logged-in user. LINE shows the terminal name. In Linux each connection is labeled as a terminal with the naming convention pts/<n>, where <n> is a number starting with 1. The :0 terminal is the label for the X terminal. TIME shows when the user first logged in. COMMENT shows the IP address the user logged in from.

What if you just want a list of names of users instead of all those extraneous details? The -q option accomplishes that. It displays the names of users on one line, sorted alphabetically. It also displays a count of total number of users at the end (45 in this case):

# who -q
ananda ananda jsmith klome  oracle oracle root root
… and so on for 45 names
# users=45

Some users may be logged on but actually doing nothing. You can check how long they have been idle, which is especially useful if you are the boss, by using the -u option.

# who -uH
NAME     LINE         TIME          IDLE          PID COMMENT
oracle   pts/2        Jan  8 15:57   .          18127 (10.14.105.139)
oracle   pts/3        Jan  8 15:57  00:26       18127 (10.14.105.139)
root     pts/1        Dec 26 13:42   old         6451 (:0.0)
root     :0           Oct 23 15:32    ?         24215

The new column IDLE shows how long they have been idle in hh:mm format. Note the value “old” in that column? It means that the user has been idle for more than 1 day. The PID column shows the process ID of their shell connection.

Another useful option is -b, which shows when the system was booted.

# who -b
         system boot  Feb 15  13:31

It shows the system was booted on Feb 15th at 1:31 PM. Remember the uptime command? It also shows you how long this system has been up. You can subtract the days shown in uptime to know the day of the boot. The who -b command makes it much simpler; it directly shows you the time of the boot.

Very Important Caveat: The who -b command shows the month and date only, not the year. So if the system has been up longer than a year, the output will not reflect the correct value. Therefore uptime is always a preferred approach, even if you have to do a little calculation. Here is an example:

# uptime
 21:37:49 up 675 days, 22:40,   1 user,  load average: 3.35,  3.08, 2.86
# who -b
         system boot   Mar  7 22:58

Note that the boot time shows as March 7. That’s in 2007, not 2008! The uptime command shows the correct figure: the server has been up for 675 days. If subtraction is not your forte, you can use a simple SQL query to get the date 675 days ago:

SQL> select sysdate - 675  from dual;
                              
SYSDATE-6
---------
07-MAR-07
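If there is no database handy, the same subtraction can be done in the shell, down to the minute. This is a sketch under two assumptions: it relies on GNU date (for the -d option and the @epoch form), and it treats all times as UTC, so daylight saving shifts are ignored. The function name boot_time is hypothetical, and the reference timestamp in the usage line is an assumed date for the uptime output above, which does not show the year.

```shell
# Print the boot time, given a reference timestamp and the uptime figures.
#   $1 = reference time, e.g. "2009-01-11 21:37:49"
#   $2 = days up, $3 = hours up, $4 = minutes up
# Assumption: GNU date; all times are treated as UTC.
boot_time() {
    now_epoch=$(date -u -d "$1" +%s)
    boot_epoch=$(( now_epoch - $2 * 86400 - $3 * 3600 - $4 * 60 ))
    date -u -d "@$boot_epoch" '+%Y-%m-%d %H:%M'
}

# Usage, with the uptime figures above and an assumed reference date:
#   boot_time "2009-01-11 21:37:49" 675 22 40
```

Unlike the who -b output, the result includes the year, so there is no ambiguity when the system has been up for more than twelve months.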

The -l option shows the logons to the system:

# who -lH 
NAME     LINE         TIME         IDLE          PID COMMENT
LOGIN    tty1         Feb 15  13:32              4081 id=1
LOGIN    tty6         Feb 15  13:32              4254 id=6

To find out the user terminals that have been dead, use the -d option:

# who -dH
NAME     LINE         TIME         IDLE          PID COMMENT  EXIT
                      Feb 15  13:31               489 id=si     term=0 exit=0
                      Feb 15  13:32              2870 id=l5     term=0 exit=0
         pts/1        Oct 10  14:53             31869 id=ts/1  term=0 exit=0
         pts/4        Jan 11  00:20             22155 id=ts/4  term=0 exit=0
         pts/3        Jun 29  16:01                 0 id=/3    term=0 exit=0
         pts/2         Oct 4  22:35              8371 id=/2    term=0 exit=0
         pts/5        Dec 30  03:15              5026 id=ts/5  term=0 exit=0
         pts/4        Dec 30  22:35                 0 id=/4    term=0 exit=0

Sometimes the init process (the process that starts first when the system is booted) kicks off other processes. The -p option shows those processes spawned by init that are still active.

# who -pH
NAME     LINE         TIME                PID COMMENT
                      Feb 15 13:32       4083 id=2
                      Feb 15 13:32       4090 id=3
                      Feb 15 13:32       4166 id=4
                      Feb 15 13:32       4174 id=5
                      Feb 15 13:32       4255 id=x
                      Oct  4 23:14      13754 id=h1

Later in this installment, you will learn about a command – write – that enables real time messaging. You will also learn how to disable others’ ability to write to your terminal (the mesg command). If you want to know which users do and do not allow others to write to their terminals, use the -T option:

# who -TH
NAME       LINE          TIME         COMMENT
oracle   + pts/2        Jan 11 12:08  (10.23.32.10)
oracle   + pts/3        Jan 11 12:08  (10.23.32.10)
oracle   - pts/4        Jan 11 12:08  (10.23.32.10)
root     + pts/1        Dec 26 13:42  (:0.0)
root     ? :0           Oct 23 15:32

The + sign before the terminal name means the terminal accepts write commands from others; the “-” sign means that it does not. The “?” in this field means the terminal does not support writing to it, e.g. an X-window session.

The current run level of the system can be obtained by the -r option:

# who -rH
NAME     LINE         TIME         IDLE          PID COMMENT
         run-level 5  Feb 15  13:31                   last=S

A more descriptive listing can be obtained by the -a (all) option. This option combines the -b -d -l -p -r -t -T -u options. So these two commands produce the same result:

# who  -bdlprtTu
# who -a

Here is a sample output (with the header, so that you can understand the columns better):

# who -aH
NAME       LINE          TIME         IDLE          PID COMMENT  EXIT
                        Feb 15 13:31               489 id=si     term=0 exit=0
           system boot  Feb 15 13:31
           run-level 5  Feb 15 13:31                   last=S
                        Feb 15 13:32              2870 id=l5     term=0 exit=0
LOGIN      tty1         Feb 15 13:32              4081 id=1
                        Feb 15 13:32              4083 id=2
                        Feb 15 13:32              4090 id=3
                        Feb 15 13:32              4166 id=4
                        Feb 15 13:32              4174 id=5
LOGIN      tty6         Feb 15 13:32              4254 id=6
                        Feb 15 13:32              4255 id=x
                        Oct  4 23:14             13754 id=h1
           pts/1        Oct 10 14:53             31869 id=ts/1  term=0 exit=0
oracle   + pts/2        Jan  8 15:57   .         18127 (10.14.105.139)
oracle   + pts/3        Jan  8 15:57  00:18      18127 (10.14.105.139)
           pts/4        Dec 30 03:15              5026 id=ts/4  term=0 exit=0
           pts/3        Jun 29 16:01                 0 id=/3    term=0 exit=0
root     + pts/1        Dec 26 13:42  old         6451 (:0.0)
           pts/2        Oct  4 22:35              8371 id=/2     term=0 exit=0
root     ? :0           Oct 23 15:32   ?         24215
           pts/5        Dec 30 03:15              5026 id=ts/5  term=0 exit=0
           pts/4        Dec 30 22:35                 0 id=/4    term=0 exit=0

To find out your own login, use the -m option:

# who -m
oracle   pts/2        Jan  8 15:57  (10.14.105.139)

Note the pts/2 value? That’s the terminal number. You can find your own terminal via the tty command:

# tty
/dev/pts/2

There is a special command structure in Linux to show your own login – who am i. It produces the same output as the -m option.

# who am i
oracle   pts/2        Jan  8 15:57  (10.14.105.139)

By convention, the two arguments are “am i” or “mom likes” (yes, believe it or not!). Both produce the same output.

The Original Instant Messenger System

With the advent of instant messaging or chat programs we seem to have conquered the ubiquitous challenge of maintaining a real time exchange of information while not getting distracted by voice communication. But are these only in the domain of the fancy programs?

The instant messaging or chat concept has been available on *nix for quite a while. In fact, you have a full fledged secure IM system built right into Linux. It allows you to securely talk to anyone connected to the system; no internet connection is required. The chat is enabled through the commands – write, mesg, wall and talk. Let’s examine each of them.

The write command can write to a user’s terminal. If the user has logged in on more than one terminal, you can address a specific terminal. Here is how you write the message “Beware of the virus” to the user “oracle” logged in on terminal “pts/3”:

# write oracle pts/3
Beware of the virus
ttyl 
<Control-D>
#

The Control-D key combination ends the message and returns the shell prompt (#) to the sender. The user “oracle” will see the following on terminal pts/3:

Beware of the virus
ttyl

Each line appears on the receiver’s terminal as the sender presses ENTER at the end of it. When the sender presses Control-D, marking the end of transmission, the receiver sees EOF on the screen. The message will be displayed regardless of the current action of the user. If the user is editing a file in vi, the message still appears and the user can clear it by pressing Control-L. If the user is at the SQL*Plus prompt, the message still appears but does not affect the keystrokes of the user.

What if you don’t want that slight inconvenience? You don’t want anyone to send a message to you – akin to “leaving the phone off the hook”. You can do that via the mesg command, which disables others’ ability to send you a message. The command without any arguments shows the current setting:

# mesg
is y

It shows that others can write to you. To turn it off:

# mesg n

Now to confirm:

# mesg 
is n
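If you only want to block messages while a particular task runs, the setting can be saved and restored around it. The sketch below is one possible approach, not a standard recipe: the function name with_messages_off is made up, it relies on the exit status of mesg (0 when messages are allowed, 1 when they are not), and it checks both standard input and standard error for a terminal because implementations of mesg may differ in which one they inspect.

```shell
# Run a command with incoming messages blocked, then restore the
# previous setting. with_messages_off is an illustrative name.
with_messages_off() {
    if [ -t 0 ] && [ -t 2 ]; then
        mesg >/dev/null            # exit status 0 means "is y", 1 means "is n"
        previous=$?
        mesg n
        "$@"
        status=$?
        if [ "$previous" -eq 0 ]; then
            mesg y                 # messages were allowed before, so allow again
        fi
        return "$status"
    fi
    # No terminal attached (for example under cron): nothing to block.
    "$@"
}

with_messages_off echo "critical work goes here"
```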

When you attempt to write to the users’ terminals, you may want to know which terminals have disabled this writing from others. The who -T command (described earlier in this installment) shows you that:

# who -TH
NAME       LINE          TIME         COMMENT
oracle   + pts/2        Jan 11 12:08 (10.23.32.10)
oracle   + pts/3        Jan 11 12:08 (10.23.32.10)
oracle   - pts/4        Jan 11 12:08 (10.23.32.10)
root     + pts/1        Dec 26 13:42 (:0.0)
root     ? :0           Oct 23 15:32

The + sign before the terminal name indicates that it accepts write commands from others; the “-“ sign indicates that it doesn’t. The “?” indicates that the terminal does not support writing to it, e.g. an X-window session.

What if you want to write to all the logged in users? Instead of typing to each user, use the wall command:

# wall
hello everyone

When sent, the following shows up on the terminals of all logged in users:

Broadcast message from oracle  (pts/2) (Thu Jan  8 16:37:25 2009):
                              
hello everyone

This is very useful for the root user. When you want to shut down the system, unmount a filesystem or perform similar administrative functions, you may want all users to log off. Use this command to send a message to all of them.

Finally, the program talk allows you to chat in real time.  Just type the following:

# talk oracle pts/2

If you want to talk to a user on a different server – prolin2 – you can use

# talk oracle@prolin2 pts/2

It brings up a chat window on the other terminal and now you can chat in real time. Is it that different from a “professional” chat program you are using now? Probably not. Oh, by the way, to make talk work, you should make sure the talkd daemon is running; it may not be installed by default.

w

Yes, it’s a command, even if it’s just one letter long! The command w is a combination of uptime and who commands given one immediately after the other, in that order. Let’s see a very common output without any arguments and options.

# w
 17:29:22 up 672 days, 18:31,   2 users,  load average: 4.52,  4.54, 4.59
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
oracle   pts/1     10.14.105.139    16:43    0.00s   0.06s  0.01s w
oracle   pts/2     10.14.105.139    17:26   57.00s   3.17s  3.17s sqlplus   as sysdba
                               
… and so on …
                            

The output has two distinct parts. The first part shows the output of the uptime command (described above in this installment) which shows how long the server has been up, how many users have logged in and the load average for last 1, 5 and 15 minutes. The parts of the output have been explained under the uptime command. The second part of the output shows the output of the who command with the option -H (also explained in this installment). Again, these various columns have been explained under the who command.

If you would rather not display the header, use the -h option.

#  w -h
oracle   pts/1     10.14.105.139    16:43    0.00s   0.02s  0.01s w -h

This removes the header from the output. It’s useful in shell scripts where you want to read and act on the output without the additional burden of skipping the header.
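As an illustration of that kind of script, the sketch below counts login sessions per user from the header-less output. The function name sessions_per_user is made up; the only assumption about the format is that the user name is the first column, as in the samples above.

```shell
# Count login sessions per user from w -h style output on stdin.
sessions_per_user() {
    awk '{ count[$1]++ } END { for (user in count) print user, count[user] }' | sort
}

w -h | sessions_per_user
```

Because there is no header to skip, every input line can be treated as a session.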

The -s option produces a compact (short) version of the output, removing the login time, JCPU and PCPU times.

# w -s
 17:30:07 up 672 days, 18:32,   2 users,  load average: 5.03,  4.65, 4.63
USER     TTY      FROM               IDLE WHAT
oracle   pts/1     10.14.105.139     0.00s w -s
oracle   pts/2     10.14.105.139     1:42  sqlplus   as sysdba

You might find that the “FROM” field is really not very useful. It shows the IP address of the same server, since the logins are all local. To save the space on the output, you may want to suppress that. The -f option disables printing of the FROM field:

# w -f
 17:30:53 up 672 days, 18:33,   2 users,  load average: 4.77,  4.65, 4.63
USER     TTY        LOGIN@   IDLE    JCPU   PCPU WHAT
oracle   pts/1      16:43    0.00s  0.06s   0.00s w -f
oracle   pts/2      17:26    2:28   3.17s   3.17s sqlplus   as sysdba

The command accepts only one parameter: the name of a user. By default w shows the process and logins for all users. If you put a username, it shows the logins for that user only. For instance, to show logins for root only, issue:

# w -h root
root     pts/1    :0.0             26Dec08 13days 0.01s   0.01s bash
root     :0       -                23Oct08 ?xdm?   21:13m  1.81s  /usr/bin/gnome-session

The -h option was used to suppress the header.

kill

A process is running and you want it terminated. What should you do? Perhaps the process runs in the background, so there is no going to the terminal and pressing Control-C; or the process belongs to another user (using the same userid, such as “oracle”) and you want to terminate it. The kill command comes to the rescue; it does what its name suggests – it kills the process. The most common use is:

# kill <Process ID of the Linux process>

Suppose you want to kill a process called sqlplus issued by the user oracle. You need to know its process ID, or PID:

# ps -aef|grep sqlplus|grep oracle
oracle    8728 23916  0 10:36 pts/3    00:00:00 sqlplus
oracle    8768 23896  0 10:36 pts/2    00:00:00  grep sqlplus

Now, to kill the PID 8728:

# kill 8728

That’s it; the process is killed. Of course, you have to be the same user (oracle) to kill a process kicked off by oracle. To kill processes kicked off by other users you have to be super user – root.
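Note that the ps listing above also includes the grep command itself, because its own command line contains the word sqlplus. One way to avoid picking up that extra line in a script is to ask ps for just the PID and command name and match the name exactly. This is a minimal sketch; pids_of is a made-up helper name, not a standard command.

```shell
# Print the PIDs of processes owned by USER whose command name is
# exactly NAME. pids_of is an illustrative helper.
pids_of() {
    ps -u "$1" -o pid=,comm= | awk -v name="$2" '$2 == name { print $1 }'
}

# Example: terminate every sqlplus process owned by oracle.
# for pid in $(pids_of oracle sqlplus); do kill "$pid"; done
```

The loop is left commented out so that the block can be pasted into a shell without killing anything.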

Sometimes you may want to merely halt the process instead of killing it. You can use the option -SIGSTOP with the kill command.

# kill -SIGSTOP 9790
# ps -aef|grep sqlplus|grep oracle
oracle    9790 23916   0 10:41 pts/3    00:00:00 sqlplus   as sysdba
oracle    9885 23896  0 10:41 pts/2    00:00:00  grep sqlplus

This is good for background jobs but with the foreground processes, it merely stops the process and removes the control from the user. So, if you check for the process again after issuing the command:

# ps -aef|grep sqlplus|grep oracle
oracle    9790 23916  0 10:41 pts/3    00:00:00 sqlplus   as sysdba
oracle   10144 23896  0 10:42 pts/2    00:00:00  grep sqlplus

You see that the process still exists; it has been stopped, not terminated. To kill this process, and any stubborn processes that refuse to be terminated, you have to pass a different signal, SIGKILL, instead of the default signal, SIGTERM.

# kill -SIGKILL 9790
# ps -aef|grep sqlplus|grep oracle
oracle   10092 23916  0 10:42 pts/3    00:00:00 sqlplus   as sysdba
oracle   10198 23896  0 10:43 pts/2    00:00:00  grep sqlplus
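A stopped process does not have to be killed; it can be resumed by sending it the SIGCONT signal. The following sketch demonstrates the stop and resume cycle on a harmless sleep process instead of a real sqlplus session. The helper name state_of is made up, and the -s STOP form used here is the portable spelling of the -SIGSTOP option shown above.

```shell
# Print the first letter of the state that ps reports for a PID.
# T means stopped; S means sleeping (waiting for an event).
state_of() {
    ps -o stat= -p "$1" | awk '{ print substr($1, 1, 1) }'
}

sleep 60 &                  # stand-in for a long-running job
pid=$!
sleep 1                     # give the job a moment to start

kill -s STOP "$pid"; sleep 1
echo "after SIGSTOP: $(state_of "$pid")"

kill -s CONT "$pid"; sleep 1
echo "after SIGCONT: $(state_of "$pid")"

kill "$pid"                 # clean up with the default SIGTERM
```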

Note the options -SIGSTOP and -SIGKILL, which pass a specific signal (stop and kill, respectively) to the process. Likewise there are several other signals you can use. To get a listing of all the available signals, you can use the -l (that’s the letter “L”, not the numeral “1”) option:

# kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
 5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
 9) SIGKILL     10) SIGUSR1     11) SIGSEGV     12) SIGUSR2
13) SIGPIPE     14) SIGALRM     15) SIGTERM     17) SIGCHLD
18) SIGCONT     19) SIGSTOP     20) SIGTSTP     21) SIGTTIN
22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO
30) SIGPWR      31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1
36) SIGRTMIN+2  37) SIGRTMIN+3  38) SIGRTMIN+4  39) SIGRTMIN+5
40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8  43) SIGRTMIN+9
44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13
52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9
56) SIGRTMAX-8  57) SIGRTMAX-7  58) SIGRTMAX-6  59) SIGRTMAX-5
60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2  63) SIGRTMAX-1
64) SIGRTMAX

You can also use the numeric equivalent of the signal in place of the actual signal name. For instance, instead of kill -SIGKILL 9790, you can use kill -9 9790.
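If you come across a signal number in a script and want to know which signal it is, the -l option also accepts a number and prints the matching name. A small sketch follows; signal_name is a made-up wrapper, and the number 19 maps to STOP only on systems whose listing matches the one shown above.

```shell
# Translate a signal number into its name (without the SIG prefix).
signal_name() {
    kill -l "$1"
}

for number in 9 15 19; do
    echo "$number is SIG$(signal_name "$number")"
done
```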

By the way, this is an interesting command. Remember, almost all Linux commands are executable files located in /bin, /sbin, /usr/bin and similar directories. The PATH environment variable determines where these command files can be found. Some other commands are actually “built-in” commands, i.e. they are part of the shell itself. One such example is kill. To demonstrate, give the following:

# kill -h 
-bash: kill: h: invalid signal  specification

Note the output that came back from the bash shell. The usage is incorrect since the -h argument was not expected. Now use the following:

# /bin/kill -h
usage: kill [ -s signal | -p ]  [ -a ] pid ...
       kill -l [ signal ]

Aha! This version of the command kill as an executable in the /bin directory accepted the option -h properly. Now you know the subtle difference between the shell built-in commands and their namesake utilities in the form of executable files.

Why is it important to know the difference? It’s important because the functionality varies significantly across these two forms. The kill built-in has less functionality than its utility equivalent. When you issue the command kill, you are actually invoking the built-in, not the utility. To get the additional functionality, you have to use the /bin/kill utility.
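To check which form a given name resolves to, you can ask the shell itself with the type command. The sketch below wraps that check in a function; is_builtin is a made-up name, and the test relies on the shell describing built-ins with the word “builtin”, which is how bash words it.

```shell
# Succeeds if NAME is a built-in command of the current shell.
is_builtin() {
    type "$1" 2>/dev/null | grep -q 'builtin'
}

for name in kill alias ls; do
    if is_builtin "$name"; then
        echo "$name: built-in"
    else
        echo "$name: not a built-in"
    fi
done
```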

The kill utility has many options and arguments. The most popular use is to kill processes by name rather than by PID. Here is an example where you want to kill all processes with the name sqlplus:

# /bin/kill sqlplus
[1]   Terminated              sqlplus
[2]   Terminated              sqlplus
[3]   Terminated              sqlplus
[4]   Terminated              sqlplus
[5]   Terminated              sqlplus
[6]   Terminated              sqlplus
[7]-  Terminated              sqlplus
[8]+  Terminated              sqlplus

Sometimes you may want to see all the process IDs kill will terminate. The -p option accomplishes that. It prints all the PIDs it would have killed, without actually killing them. It serves as a confirmation prior to action:

#  /bin/kill -p sqlplus
6798
6802
6803
6807
6808
6812
6813
6817

The output shows the PIDs of the processes it would have killed. If you reissue the command without the -p option, it will kill all those processes.

At this time you may be tempted to know which other commands are “built-in” in the shell, instead of being utilities.

# man -k builtin
. [builtins]         (1)   - bash built-in commands, see bash(1)
: [builtins]         (1)   - bash built-in commands, see bash(1)
[ [builtins]         (1)   - bash built-in commands, see bash(1)
alias [builtins]     (1)   - bash built-in commands, see bash(1)
bash [builtins]      (1)   - bash built-in commands, see bash(1)
bg [builtins]        (1)   - bash built-in commands, see bash(1)
                               
… and so on …
                            

Some entries seem familiar – alias, bg and so on. Some are purely built-ins, e.g. alias. There is no executable file called alias.

Usage for Oracle Users

Killing a process has many uses – mostly to kill zombie processes, processes that are in the background and others that have stopped responding to the normal shutdown commands. For instance, the Oracle database instance is not shutting down as a result of some memory issue. You have to bring it down by killing one of the key processes like pmon or smon. This should not be an activity to be performed all the time, just when you don’t have much choice.

You may want to kill all sqlplus sessions or all rman jobs using the utility kill command. Similarly, Oracle Enterprise Manager processes run as perl processes, and DBCA or DBUA processes may be running; you may want to kill all of them quickly:

# /bin/kill sqlplus rman perl dbca dbua java

There is also a more common use of the command. When you want to terminate a user session in Oracle Database, you typically do this:

  • Find the SID and Serial# of the session
  • Kill the session using ALTER SYSTEM command

Let’s see what happens when we want to kill the session of the user SH.

SQL> select sid, serial#,  status
  2  from v$session
  3* where username = 'SH';
       SID    SERIAL# STATUS
---------- ---------- --------
       116       5784  INACTIVE
 
SQL> alter system kill  session '116,5784'
  2  /
 
System altered.
 
It’s killed; but when you check the status of the session:
 
       SID    SERIAL# STATUS
---------- ---------- --------
       116       5784 KILLED

It shows as KILLED, not completely gone. It happens because Oracle waits until the user SH gets to his session and attempts to do something, during which he gets the message “ORA-00028: your session has been killed”. After that time the session disappears from V$SESSION.

A faster way to kill a session is to kill the corresponding server process at the Linux level. To do so, first find the PID of the server process:

SQL> select spid
  2  from v$process
  3  where addr =
  4  (
  5     select paddr
  6     from v$session
  7     where username =  'SH'
  8  );
SPID
------------------------
30986

The SPID is the Process ID of the server process. Now kill this process:

# kill -9 30986

Now if you check the view V$SESSION, it will be gone immediately. The user will not get a message immediately; but if he attempts to perform a database query, he will get:

ERROR at line 1:
ORA-03135: connection lost  contact
Process ID: 30986
Session ID: 125 Serial number:  34528

This is a faster method to kill a session but there are some caveats. The Oracle database has to perform a session cleanup--rollback changes and so on. So this should be performed only when the sessions are idle. Otherwise you can use one of the two other ways to kill a session immediately:

alter system disconnect session  '125,35447' immediate;
alter system disconnect session  '125,35447' post_transaction;
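If you do use the operating system route, it can be worth confirming that the PID really belongs to an Oracle server process before sending SIGKILL, since a mistyped SPID could take down something else. The following is a minimal sketch of such a guard. The helper name kill_if_matches is hypothetical, and the idea that the server process command name starts with “oracle” is an assumption you should verify on your own system with ps.

```shell
# Send SIGKILL to PID only if its command name starts with PREFIX.
# kill_if_matches is a hypothetical helper, not an Oracle or Linux tool.
kill_if_matches() {
    pid=$1
    prefix=$2
    case $pid in
        ''|*[!0-9]*) echo "not a PID: $pid" >&2; return 2 ;;
    esac
    [ -n "$prefix" ] || { echo "no name prefix given" >&2; return 2; }
    actual=$(ps -o comm= -p "$pid") || { echo "no such process: $pid" >&2; return 1; }
    case $actual in
        "$prefix"*) kill -9 "$pid" ;;
        *) echo "PID $pid is '$actual', not '$prefix'; not killing" >&2; return 3 ;;
    esac
}

# Example (assumed name prefix): kill_if_matches 30986 oracle
```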

killall

Unlike the dual nature of kill, killall is purely a utility, i.e. this is an executable program in the /usr/bin directory. The command is similar to kill in functionality but instead of killing a process based on its PID, it accepts the process name as an argument. For instance, to kill all sqlplus processes, issue:

# killall sqlplus

This kills all processes named sqlplus (which you have the permission to kill, of course). Unlike the kill built-in command, you don’t need to know the Process ID of the processes to be killed.

If the command does not terminate the process, or the process does not respond to a TERM signal, you can send an explicit SIGKILL signal as you saw in the kill command using the -s option.

# killall -s SIGKILL sqlplus

As with kill, you can use the -9 option in lieu of -s SIGKILL. For a list of all available signals, you can use the -l option.

# killall -l
HUP INT QUIT ILL TRAP ABRT IOT  BUS FPE KILL USR1 SEGV USR2 PIPE ALRM TERM
STKFLT CHLD CONT STOP TSTP TTIN  TTOU URG XCPU XFSZ VTALRM PROF WINCH IO PWR SYS
UNUSED

To get a verbose output of the killall command, use the -v option:

# killall -v sqlplus
Killed sqlplus(26448) with signal 15
Killed sqlplus(26452) with signal 15
Killed sqlplus(26456) with signal 15
Killed sqlplus(26457) with signal 15
                               
… and so on …
                            

Sometimes you may want to examine the process before terminating it. The -i option allows you to run killall interactively. This option prompts for your input before killing each process:

# killall -i sqlplus
Kill sqlplus(2537) ? (y/n) n
Kill sqlplus(2555) ? (y/n) n
Kill sqlplus(2555) ? (y/n) y
Killed sqlplus(2555) with signal 15

What happens when you pass a wrong process name?

# killall wrong_process
wrong_process: no process  killed

There is no running process called wrong_process, so nothing was killed and the output clearly showed that. To suppress the “no process killed” complaint, use the -q option. That option comes in handy in shell scripts where you don’t want to parse the output. Rather, you want to capture the return code from the command:

# killall -q wrong_process
# echo $?
1

The return code (shown by the shell variable $?) is “1” instead of “0”, meaning failure. You can check the return code to determine whether the killall command was successful, i.e. the return code was “0”.
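In a script, the return code is usually tested directly in an if statement rather than through $?. Here is a minimal sketch; the function name stop_by_name and the wording of the messages are made up for illustration.

```shell
# Stop all processes with the given name and report the result using
# only the exit status of killall.
stop_by_name() {
    if killall -q "$1"; then
        echo "sent SIGTERM to all $1 processes"
    else
        echo "no $1 process was signalled"
    fi
}

stop_by_name wrong_process
```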

One interesting thing about this command is that it does not kill itself. Of course, it kills other killall commands given elsewhere but not itself.

Usage for Oracle Users

Like the kill command, the killall command is also used to kill processes. The biggest advantages of killall are its ability to display the process ID and its interactive nature. Suppose you want to kill all perl, java, sqlplus, rman and dbca processes but do it interactively; you can issue:

# killall -i -g perl sqlplus java rman dbca
Kill sqlplus(pgid 7053) ? (y/n) n
Kill perl(pgid 31233) ? (y/n) n

... and so on ...

The -g option makes killall act on the process group of each matching process, which is why the prompt shows a process group ID (pgid). This allows you to view the ID before you kill the processes, which can be very useful.

Conclusion

In this installment you learned about these commands (shown in alphabetical order):

dig        A newer version of nslookup
ifconfig   To display information on network interfaces
kill       Kill a specific process
killall    Kill a specific process, a group of processes and names matching a pattern
mesg       To turn on or off the ability of others to display something on one’s terminal
netstat    To display statistics and other metrics on network interface usage
nslookup   To lookup a hostname for its IP address or lookup IP address for its hostname on the DNS
talk       To establish an Instant Message system between two users for realtime chat
uptime     How long the system has been up and its load average for 1, 5 and 15 minutes
w          Combination of uptime and who
wall       To display some text on the terminals of all the logged in users
who        To display the users logged into the system and what they are doing
write      To instantly display something on a specific user’s terminal session

As I have mentioned earlier, it is not my intention to present every available command in Linux systems. You need to master only a handful of them to effectively manage a system, and this series shows you those very important ones. Practice them in your environment to understand these commands – with their parameters and options – very well. In the next installment, the last one, you will learn how to manage a Linux environment – on a regular machine, in a virtual machine, and on the cloud.


Arup Nanda ( arup@proligence.com) has been exclusively an Oracle DBA for more than 12 years with experiences spanning all areas of Oracle Database technology, and was named "DBA of the Year" by Oracle Magazine in 2003. Arup is a frequent speaker and writer in Oracle-related events and journals and an Oracle ACE Director. He co-authored four books, including RMAN Recipes for Oracle Database 11g: A Problem Solution Approach .