Guide to Advanced Linux Command Mastery, Part 4: Managing the Linux Environment
by Arup Nanda
Published May 2009
In this installment, learn how to manage the Linux environment effectively through these commonly used commands.
The ifconfig command shows the details of the network interface(s) defined in the system. The most common option is -a, which shows all the interfaces.
# ifconfig -a
The usual name of the primary Ethernet network interface is eth0. To find out the details of a specific interface, e.g. eth0, you can use:
# ifconfig eth0
The output is shown below, with explanation:
Here are some key parts of the output:
The command is not just used to check the settings; it’s used to configure and manage the interface as well. Here is a short list of parameters and options for this command:
up/down – enables or disables a specific interface. You can use the down parameter to shut down (or disable) an interface:
# ifconfig eth0 down
Similarly, to bring it up (or enable it), you would use:
# ifconfig eth0 up
media – sets the type of the Ethernet media such as 10baseT, 10 Base 2, etc. Common values for the media parameter are 10base2, 10baseT, and AUI. If you want Linux to sense the media automatically, you can specify “auto”, as shown below:
# ifconfig eth0 media auto
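address – sets a specific IP address for the interface; the address is given directly after the interface name. (In the net-tools version of ifconfig, the add keyword is reserved for adding an IPv6 address.) To set an IP address of 192.168.1.101 on the interface eth0, you would issue:
# ifconfig eth0 192.168.1.101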
netmask – sets the netmask parameter of the interface. Here is an example where you can set the netmask of the eth0 interface to 255.255.255.0
# ifconfig eth0 netmask 255.255.255.0
In an Oracle Real Application Clusters environment you have to set the netmask in a certain way, using this command.
In some advanced configurations, you can change the MAC address assigned to the network interface. The hw parameter accomplishes that. The general format is:
ifconfig <Interface> hw <TypeOfInterface> <MAC>
The <TypeOfInterface> shows the type of the interface, e.g. ether, for Ethernet. Here is how the MAC address is changed for eth0 to 12:34:56:78:90:12 (Note: the MAC address shown here is fictional. If it matches any actual MAC, it’s purely coincidental.):
# ifconfig eth0 hw ether 12:34:56:78:90:12
This is useful when you add a new card (with a new MAC address) but do not want to change the Linux-related configuration such as network interfaces.
Usage for the Oracle User
The command, along with netstat described below, is one of the most widely used in managing Oracle RAC. Oracle RAC’s performance depends heavily on the interconnect used between the nodes of the cluster. If the interconnect is saturated (that is, it can no longer carry any additional traffic) or is failing, you may see reduced performance. The best course of action in this case is to look at the ifconfig output for any failures. Here is a typical example:
# ifconfig eth9
eth9      Link encap:Ethernet  HWaddr 00:1C:23:CE:6F:82
          inet addr:10.14.104.31  Bcast:10.14.104.255  Mask:255.255.255.0
          inet6 addr: fe80::21c:23ff:fece:6f82/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1204285416 errors:0 dropped:560923 overruns:0 frame:0
          TX packets:587443664 errors:0 dropped:623409 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1670104239570 (1.5 TiB)  TX bytes:42726010594 (39.7 GiB)
          Interrupt:169 Memory:f8000000-f8012100
Note the dropped counts on the RX and TX lines. They are extremely high; ideally the number should be 0 or close to it. A count of more than half a million suggests a faulty interconnect that drops packets, forcing packets to be resent, which should be a clue in diagnosing the issue.
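If you want to watch these counters from a script, a small filter along the following lines may help. This is only a sketch: the function name dropped_counts is hypothetical, and it assumes the older net-tools output layout shown above (RX packets:... dropped:N); newer ifconfig builds format these lines differently.

```shell
# dropped_counts: read ifconfig-style text on stdin and print the RX and TX
# dropped-packet counters. Assumes the "RX packets:... dropped:N" layout.
dropped_counts() {
    awk '
        ($1 == "RX" || $1 == "TX") && $2 ~ /^packets:/ {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^dropped:/) {
                    split($i, kv, ":")
                    print $1, "dropped", kv[2]
                }
        }'
}

# Typical use (the interface name is just an example):
#   ifconfig eth9 | dropped_counts
```

Running it at intervals and comparing the numbers shows whether packets are still being dropped or whether the count is left over from an old incident.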
The status of the input and output through a network interface is assessed via the netstat command. This command can provide complete information on how the network interface is performing, even down to the socket level. Here is an example:
# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address      Foreign Address    State
tcp        0      0 prolin1:31027      prolin1:5500       TIME_WAIT
tcp        4      0 prolin1l:1521      applin1:40205      ESTABLISHED
tcp        0      0 prolin1l:1522      prolin1:39957      ESTABLISHED
tcp        0      0 prolin1l:3938      prolin1:31017      TIME_WAIT
tcp        0      0 prolin1l:1521      prolin1:21545      ESTABLISHED
… and so on …
The above output goes on to show all the open sockets. In very simplistic terms, a socket is akin to a connection between two processes. [Please note: strictly speaking, “sockets” and “connections” are technically different. A socket could exist without a connection. However, a discussion on sockets and connections is beyond the scope of this article. Therefore I have merely presented the concept in an easy-to-understand manner.] Naturally, a connection has to have a source and a destination, called local and remote address. The end points could be on the same server; or on different servers.
In many cases, the programs connect to the same server. For instance, if two processes on one server communicate with each other, the local and remote host names will be the same, as you can see in the first line, where both are the server “prolin1”. However, the processes communicate over ports, which will be different. The port is shown next to the host name after the “:” (colon) mark. The sending program places the data to be sent across the socket in a queue, and the receiver reads from a queue at the remote end. Here are the columns of the output:
Well, from the foreign and local addresses, especially from the port numbers, you can probably guess that the connections are Oracle related, but wouldn’t it be nice to know that for sure? Of course. The -p option shows the process information as well:
# netstat -p
Proto Recv-Q Send-Q Local Address   Foreign Address   State        PID/Program name
tcp        0      0 prolin1:1521    prolin1:33303     ESTABLISHED  1327/oraclePROPRD1
tcp        0      0 prolin1:1521    applin1:51324     ESTABLISHED  13827/oraclePROPRD1
tcp        0      0 prolin1:1521    prolin1:33298     ESTABLISHED  32695/tnslsnr
tcp        0      0 prolin1:1521    prolin1:32544     ESTABLISHED  15251/oracle+ASM
tcp        0      0 prolin1:1521    prolin1:33331     ESTABLISHED  32695/tnslsnr
This clearly shows the process ID and the process name in the last column, which confirms that these are Oracle server processes, the listener process, and ASM server processes.
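When the connection list runs to hundreds of lines, a summary by state is often easier to read than the raw listing. The following is a minimal sketch; count_states is a hypothetical helper name, and it assumes the State value is the sixth whitespace-separated column, as in the listings above.

```shell
# count_states: read netstat output on stdin and print how many TCP
# connections are in each state (ESTABLISHED, TIME_WAIT, and so on).
count_states() {
    awk '$1 == "tcp" { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Typical use:
#   netstat -n | count_states
```

A sudden rise in one state, such as a large number of TIME_WAIT entries, is a hint about where to look next.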
The netstat command can have various options and parameters. Here are some key ones:
To find out the network statistics for various interfaces, use the -i option.
# netstat -i
Kernel Interface table
Iface    MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0    1500   0  6860659      0      0      0  2055833      0      0      0 BMRU
eth8    1500   0     2345      0      0      0      833      0      0      0 BMRU
lo     16436   0 14449079      0      0      0 14449079      0      0      0 LRU
This shows the different interfaces present in the server (eth0, eth8, etc.) and the metrics associated with the interface.
The next sets of columns (TX-OK, TX-ERR, etc.) show the corresponding stats for send data.
The Flg column is a composite value describing the properties of the interface. Each letter indicates that a specific property is present. Here is an explanation of the letters.
B – Broadcast
You can use the --interface option (note: there are two hyphens, not one) to display the same information for a specific interface.
# netstat --interface=eth0
Kernel Interface table
Iface    MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0    1500   0 277903459      0      0      0 170897632      0      0      0 BMsRU
Needless to say, the output is wide and is a little difficult to grasp at one shot. If you are comparing across interfaces, it makes sense to have a tabular output. If you want to examine the values in a more readable format, use the -e option to produce an extended output:
# netstat -i -e
Kernel Interface table
eth0      Link encap:Ethernet  HWaddr 00:13:72:CC:EB:00
          inet addr:10.14.106.0  Bcast:10.14.107.255  Mask:255.255.252.0
          inet6 addr: fe80::213:72ff:fecc:eb00/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6861068 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2055956 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3574788558 (3.3 GiB)  TX bytes:401608995 (383.0 MiB)
          Interrupt:169
Does the output seem familiar? It should; it’s the same as the output of the ifconfig command.
If you’d rather see the output showing IP addresses instead of host names, use the -n option.
The -s option shows the summary statistics of each protocol, rather than showing the details of each connection. This can be combined with the protocol specific flag. For instance -u shows the stats related to the UDP protocol.
# netstat -s -u
Udp:
    12764104 packets received
    600849 packets to unknown port received.
    0 packet receive errors
    13455783 packets sent
Similarly, to see the stats for TCP, use -t, and for raw sockets, use -w.
One of the really useful options is the display of the routing table, the -r option.
# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.20.191.0     *               255.255.255.128 U         0 0          0 bond0
172.22.13.0     *               255.255.255.0   U         0 0          0 eth9
169.254.0.0     *               255.255.0.0     U         0 0          0 eth9
default         10.20.191.1     0.0.0.0         UG        0 0          0 bond0
The second column of the netstat output, Gateway, shows the gateway to which the routing entry points. If no gateway is used, an asterisk is printed instead. The third column, Genmask, shows the “generality” of the route, i.e., the network mask for this route. When given an IP address to find a suitable route for, the kernel steps through each of the routing table entries, taking the bitwise AND of the address and the netmask before comparing it to the target of the route.
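The bitwise AND comparison can be tried out in the shell itself. The sketch below is for illustration only and does not reproduce the kernel’s route selection (which also prefers the most specific match); the function names ip_to_int and route_matches are hypothetical, and the code assumes bash and numeric addresses such as those printed by netstat -rn, where the default route appears as 0.0.0.0.

```shell
# ip_to_int: convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1                       # split the dotted quad into $1..$4
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# route_matches ADDRESS DESTINATION GENMASK
# Succeeds (exit status 0) when ADDRESS AND GENMASK equals DESTINATION.
route_matches() {
    local addr dest mask
    addr=$(ip_to_int "$1")
    dest=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( addr & mask )) -eq "$dest" ]
}

# Typical use, with the first route from the table above:
#   route_matches 10.20.191.5 10.20.191.0 255.255.255.128 && echo "uses bond0"
```

With the mask 255.255.255.128, the address 10.20.191.5 matches the first route, while 10.20.191.200 does not, because its last octet ANDed with 128 gives 128 rather than 0.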
The fourth column, Flags, displays the following flags that describe the route:
The next three columns show the MSS, Window, and irtt that will be applied to TCP connections established via this route.
The TCP protocol has a built-in reliability check. If a data packet fails during transmission, it is retransmitted. The protocol keeps track of how long it takes for the data to reach the destination and for the acknowledgement to be received. If the acknowledgement does not come within that timeframe, the packet is retransmitted. The amount of time the protocol has to wait before retransmitting is set once for the interface (and can be changed); that value is known as the initial round trip time (irtt). A value of 0 means the default value is used.
Finally, the last field displays the network interface that this route will use.
Every reachable host in a network should have an IP address, which identifies it uniquely in the network. In the internet, which is a big network anyway, IP addresses allow the connections to reach servers running Websites, e.g. www.oracle.com. So, when one host (such as a client) wants to connect to another (such as a database server) using its name and not the IP address, how does the client browser know which IP address to connect to?
The mechanism of translating host names to IP addresses is known as name resolution. At the most rudimentary level, the host has a special file called hosts, which stores the IP address and host name pairs. Here is an example file:
# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain   localhost
192.168.1.101   prolin1.proligence.com  prolin1
192.168.1.102   prolin2.proligence.com  prolin2
This shows that the hostname prolin1.proligence.com is translated to 192.168.1.101. The special entry with the IP address 127.0.0.1 is called a loopback entry, which points back to the server itself via a special network interface called lo (which you saw earlier in the ifconfig and netstat commands).
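To make the file format concrete, here is a rough sketch of a lookup against a hosts-format file. The function name lookup_host is hypothetical, and the code only imitates the file lookup; it is not how the system resolver is implemented.

```shell
# lookup_host NAME [FILE]: print the IP address listed for NAME in a
# hosts-format file (default /etc/hosts). Matches canonical names and aliases.
lookup_host() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                 # strip comments
        { for (i = 2; i <= NF; i++)
              if ($i == name) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# Typical use:
#   lookup_host prolin1
```

The first field of each line is the address; every remaining field is a name or alias for it, which is why both prolin1.proligence.com and prolin1 resolve to the same address.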
Well, this is good, but you can’t possibly put all the IP addresses in the world in this file. There should be another mechanism to perform the name resolution. A special purpose server called a nameserver performs that role. It’s like a phonebook that your phone company provides, not your personal phonebook. There may be several nameservers available either inside or outside the private network. The host contacts one of the nameservers first, gets the IP address of the destination host it wants to contact, and then attempts to connect to that IP address.
How does the host know what these nameservers are? It looks into a special file called /etc/resolv.conf to get that information. Here is a sample resolv file.
; generated by /sbin/dhclient-script
search proligence.com
nameserver 10.14.1.58
nameserver 10.14.1.59
nameserver 10.20.223.108
How do you make sure that the name resolution is working fine for a specific host name? In other words, you want to make sure that when the Linux system tries to contact a host called oracle.com, it can find the IP address on the nameserver. The nslookup command is useful for that. Here is how you use it:
# nslookup oracle.com
Server:         10.14.1.58
Address:        10.14.1.58#53
Let’s dissect the output. The Server output is the address of the nameserver. The name oracle.com resolves to the IP address 188.8.131.52. The name was resolved by the nameserver shown next to the word Server in the output.
If you put this IP address in a browser (http://184.108.40.206 instead of http://oracle.com), the browser will go to the oracle.com site.
If you made a mistake, or looked for a wrong host:
# nslookup oracle-site.com
Server:         10.14.1.58
Address:        10.14.1.58#53
The message is quite clear: this host does not exist.
The nslookup command has been deprecated. Instead, a newer, more powerful command, dig (domain information groper), should be used. On some newer Linux servers the nslookup command may not even be available.
Here is an example; to check the name resolution of the host oracle.com, you use the following command:
# dig oracle.com
From the mammoth output, several things stand out. It shows that the command sent a query to the nameserver and the host got a response back from the nameserver. The name resolution was also done at some other nameservers such as ns1.oracle.com. It shows that the query took 97 milliseconds.
If the sheer size of the output makes it less useful, you can use the +short option to remove the verbose output:
# dig +short oracle.com
220.127.116.11
You can also perform a reverse lookup, finding the host name from the IP address. The -x option is used for that.
# dig -x 18.104.22.168
The +domain parameter is useful when you are looking for a host inside a domain. For instance, suppose you are searching for the host otn in the oracle.com domain, you can either use:
# dig +short otn.oracle.com
Or you can use the +domain parameter:
# dig +short +tcp +domain=oracle.com otn
www.oracle.com.
www.oraclegha.com.
22.214.171.124
Usage for the Oracle User
The connectivity is established between the app server and the database server. The TNSNAMES.ORA file, used by SQL*Net may look like this:
prodb3 =
  (description =
    (address_list =
      (address = (protocol = tcp)(host = prolin3)(port = 1521))
    )
    (connect_data =
      (sid = prodb3)
    )
  )
The host name prolin3 should be able to be resolved by the app server. Either this should be in the /etc/hosts file; or the host prolin3 should be defined in the DNS. To make sure the name resolution works and works correctly to point to the right host, you can use the dig command.
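One way to run that check for every entry at once is to pull the host names out of the file first. The following is a sketch under assumptions: tns_hosts is a hypothetical helper, it only looks for simple (host = name) clauses, and the file location used in the usage comment is the conventional one but may differ on your system.

```shell
# tns_hosts [FILE]: print the unique host names referenced by (host = ...)
# clauses in a tnsnames.ora-style file, read from FILE or from stdin.
tns_hosts() {
    tr 'A-Z' 'a-z' < "${1:-/dev/stdin}" |
        grep -o 'host *= *[^) ]*' |
        sed 's/^host *= *//' |
        sort -u
}

# Typical use, checking each name with dig:
#   for h in $(tns_hosts "$ORACLE_HOME/network/admin/tnsnames.ora"); do
#       echo "$h -> $(dig +short "$h")"
#   done
```

Keep in mind that dig queries the nameservers only. A host that is defined solely in /etc/hosts will show an empty result here even though the application can resolve it; getent hosts follows the system’s configured lookup order and covers that case.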
With these two commands you can handle most of the tasks involved with network in a Linux environment. In the rest of this installment you will learn how to manage a Linux environment effectively.
You just logged on to the server and see some things that were supposed to be running are not. Perhaps the processes were killed or perhaps all processes were killed by a shutdown. Instead of guessing, find out if the server was indeed rebooted with the uptime command. The command shows the length of time the server has been up since the last reboot.
# uptime
 16:43:43 up 672 days, 17:46, 45 users,  load average: 4.45, 5.18, 5.38
The output shows much useful information. The first column shows the current time when the command was executed. The second portion – up 672 days, 17:46 – shows the amount of time the server has been up. The numbers 17:46 depict the hour and minutes. So this server has been up for 672 days, 17 hours, and 46 minutes as of now.
The next item – 45 users – shows how many users are logged in to the server right now.
The last part of the output shows the load average of the server over the last 1, 5, and 15 minutes, respectively. The “load average” is a composite score that reflects the load on the system based on CPU and I/O metrics. The higher the load average, the greater the load on the system. It is not based on a scale; unlike a percentage, it does not end at a fixed number such as 100. In addition, the load averages of two systems can’t be compared. It is a number that quantifies the load on one system and is relevant to that system alone. This output shows that the load average was 4.45 over the last minute, 5.18 over the last 5 minutes, and 5.38 over the last 15 minutes.
The command does not have any options or accept any parameter other than -V, which shows the version of the command.
# uptime -V
procps version 3.2.3
Usage for Oracle Users
There is no clear Oracle-specific use of this command, except that you can find out the load on the system to explain some performance issues. If you see some performance issues on the database, and you trace it to high CPU or I/O load, you should immediately check the load averages using the uptime command. If you see a high load average, your next course of action is to dive down deep below the surface to find the root cause. To perform that deep dive, you have in your arsenal tools like mpstat, iostat, and sar (covered in this installment of this series).
Consider an output as shown below:
# uptime
 21:31:04 up 330 days,  7:16,  4 users,  load average: 12.90, 1.03, 1.00
It’s interesting: the load average was very high (12.90) over the last minute but quite low, at 1.03 and 1.00, over the last 5 and 15 minutes respectively. What does it mean? It suggests that some process started within the last few minutes and caused the load average to jump. That process was most likely not running earlier, because the longer-term load averages are so small. This analysis leads us to focus on the processes that kicked off during the last few minutes, speeding up the resolution process.
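The same comparison can be automated. The sketch below flags a recent spike by comparing the 1-minute figure with the 15-minute figure; the function name load_spike and the default factor of 2 are arbitrary choices made for illustration, and the code assumes the load averages are printed with a period as the decimal separator.

```shell
# load_spike [FACTOR]: read uptime output on stdin and succeed when the
# 1-minute load average exceeds FACTOR times the 15-minute average.
# FACTOR defaults to 2, an arbitrary threshold chosen for illustration.
load_spike() {
    sed 's/.*load average: *//' |
        awk -F', *' -v f="${1:-2}" '
            { r = ($1 > f * $3); exit }
            END { exit !r }'
}

# Typical use:
#   uptime | load_spike && echo "load jumped in the last minute or so"
```

The first uptime output in this section (4.45, 5.18, 5.38) would not be flagged, while the one above (12.90, 1.03, 1.00) would.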
Of course, since it shows how long the server has been up, it also explains why the instance has been up since then.
Who is logged in to the system right now? That’s a common question you might want to ask, especially when you are tracking down an errant user running some resource-consuming commands.
The who command answers that question. Here is the simplest usage without any arguments or parameters.
# who
oracle   pts/2        Jan  8 15:57 (10.14.105.139)
oracle   pts/3        Jan  8 15:57 (10.14.105.139)
root     pts/1        Dec 26 13:42 (:0.0)
root     :0           Oct 23 15:32
The command can take several options. The -s option is the default; it produces the same output as the above.
Looking at the output, you might be straining your memory to remember what the columns are meant to be. Well, relax. You can use the -H option to display the header:
# who -H
NAME     LINE         TIME         COMMENT
oracle   pts/2        Jan  8 15:57 (10.14.105.139)
oracle   pts/3        Jan  8 15:57 (10.14.105.139)
root     pts/1        Dec 26 13:42 (:0.0)
root     :0           Oct 23 15:32
Now the meanings of the columns are clear. The column NAME shows the username of the logged in user. LINE shows the terminal name. In Linux each connection is labeled as a terminal with the naming convention pts/<n> where <n> is a number starting with 1. The :0 terminal is a label for the X terminal. TIME shows when they first logged in. COMMENT shows the IP address they logged in from.
What if you just want a list of names of users instead of all those extraneous details? The -q option accomplishes that. It displays the names of users on one line, sorted alphabetically. It also displays a count of total number of users at the end (45 in this case):
# who -q
ananda ananda jsmith klome oracle oracle root root … and so on for 45 names
# users=45
Some users could be just logged on but actually doing nothing. You can check how long they have been idle, a command especially useful if you are the boss, by using the -u option.
# who -uH
NAME     LINE         TIME         IDLE    PID COMMENT
oracle   pts/2        Jan  8 15:57   .   18127 (10.14.105.139)
oracle   pts/3        Jan  8 15:57 00:26 18127 (10.14.105.139)
root     pts/1        Dec 26 13:42  old   6451 (:0.0)
root     :0           Oct 23 15:32   ?   24215
The new column IDLE shows how long they have been idle in hh:mm format. Note the value “old” in that column? It means that the user has been idle for more than 1 day. The PID column shows the process ID of their shell connection.
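If you only care about the idle sessions, the IDLE column can be filtered in a script. This is a minimal sketch: idle_sessions is a hypothetical name, and the code assumes the date is printed as three fields (month, day, time) as in the listing above, so that IDLE is the sixth field. Some newer versions of who print the date in a different form, which shifts the column.

```shell
# idle_sessions: read `who -u` output on stdin and print the user, terminal,
# and idle time of every session whose IDLE column is "old" or an hh:mm value.
idle_sessions() {
    awk '$6 == "old" || $6 ~ /^[0-9][0-9]:[0-9][0-9]$/ { print $1, $2, $6 }'
}

# Typical use:
#   who -u | idle_sessions
```

Sessions showing “.” (active within the last minute) or “?” are skipped.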
Another useful option is -b that shows when the system was rebooted.
# who -b
         system boot  Feb 15 13:31
It shows the system was booted on Feb 15th at 1:31 PM. Remember the uptime command? It also shows you how long this system has been up. You can subtract the days shown in uptime to know the day of the boot. The who -b command makes it much simpler; it directly shows you the time of the boot.
Very Important Caveat: The who -b command shows the month and date only, not the year. So if the system has been up longer than a year, the output will not reflect the correct value. Therefore uptime is always a preferred approach, even if you have to do a little calculation. Here is an example:
# uptime
 21:37:49 up 675 days, 22:40,  1 user,  load average: 3.35, 3.08, 2.86
# who -b
         system boot  Mar  7 22:58
Note the boot time shows as March 7. That’s in 2007, not 2008! The uptime shows the correct time: the system has been up for 675 days. If subtraction is not your forte, you can use a simple SQL query to get the date 675 days ago:
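The same subtraction can be done in the shell without connecting to a database. The sketch below assumes GNU date (the -d "@seconds" form); boot_date is a hypothetical name, and the optional second argument exists only so that the calculation can be repeated for a fixed point in time.

```shell
# boot_date DAYS [NOW_EPOCH]: print the calendar date (UTC) DAYS days before
# NOW_EPOCH, which defaults to the current time in seconds since the epoch.
boot_date() {
    local now=${2:-$(date +%s)}
    date -u -d "@$(( now - $1 * 86400 ))" +%Y-%m-%d
}

# Typical use, with the day count taken from the uptime output:
#   boot_date 675
```

The result is a date in UTC and ignores the hours and minutes part of the uptime, so treat it as accurate to within a day.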
SQL> select sysdate - 675 from dual;
The -l option shows the logons to the system:
# who -lH
NAME     LINE         TIME         IDLE    PID COMMENT
LOGIN    tty1         Feb 15 13:32        4081 id=1
LOGIN    tty6         Feb 15 13:32        4254 id=6
To find out the user terminals that have been dead, use the -d option:
# who -dH
NAME     LINE         TIME         IDLE    PID COMMENT  EXIT
                      Feb 15 13:31         489 id=si    term=0 exit=0
                      Feb 15 13:32        2870 id=l5    term=0 exit=0
         pts/1        Oct 10 14:53       31869 id=ts/1  term=0 exit=0
         pts/4        Jan 11 00:20       22155 id=ts/4  term=0 exit=0
         pts/3        Jun 29 16:01           0 id=/3    term=0 exit=0
         pts/2        Oct  4 22:35        8371 id=/2    term=0 exit=0
         pts/5        Dec 30 03:15        5026 id=ts/5  term=0 exit=0
         pts/4        Dec 30 22:35           0 id=/4    term=0 exit=0
Sometimes the init process (the process that starts first when the system is booted) kicks off other processes. The -p option shows all those logins that are active.
# who -pH
NAME     LINE         TIME           PID COMMENT
                      Feb 15 13:32  4083 id=2
                      Feb 15 13:32  4090 id=3
                      Feb 15 13:32  4166 id=4
                      Feb 15 13:32  4174 id=5
                      Feb 15 13:32  4255 id=x
                      Oct  4 23:14 13754 id=h1
Later in this installment, you will learn about a command – write – that enables real time messaging. You will also learn how to disable others’ ability to write to your terminal (the mesg command). If you want to know which users do and do not allow others to write to their terminals, use the -T option:
# who -TH
NAME       LINE         TIME         COMMENT
oracle   + pts/2        Jan 11 12:08 (10.23.32.10)
oracle   + pts/3        Jan 11 12:08 (10.23.32.10)
oracle   - pts/4        Jan 11 12:08 (10.23.32.10)
root     + pts/1        Dec 26 13:42 (:0.0)
root     ? :0           Oct 23 15:32
The + sign before the terminal name means the terminal accepts write commands from others; the “-“ sign means that the terminal does not allow. The “?” in this field means the terminal does not support writing to it, e.g. an X-window session.
The current run level of the system can be obtained by the -r option:
# who -rH
NAME     LINE         TIME         IDLE    PID COMMENT
         run-level 5  Feb 15 13:31                   last=S
A more descriptive listing can be obtained by the -a (all) option. This option combines the -b -d -l -p -r -t -T -u options. So these two commands produce the same result:
# who -bdlprtTu
# who -a
Here is a sample output (with the header, so that you can understand the columns better):
# who -aH
NAME       LINE         TIME         IDLE    PID COMMENT  EXIT
                        Feb 15 13:31         489 id=si    term=0 exit=0
           system boot  Feb 15 13:31
           run-level 5  Feb 15 13:31                   last=S
                        Feb 15 13:32        2870 id=l5    term=0 exit=0
LOGIN      tty1         Feb 15 13:32        4081 id=1
                        Feb 15 13:32        4083 id=2
                        Feb 15 13:32        4090 id=3
                        Feb 15 13:32        4166 id=4
                        Feb 15 13:32        4174 id=5
LOGIN      tty6         Feb 15 13:32        4254 id=6
                        Feb 15 13:32        4255 id=x
                        Oct  4 23:14       13754 id=h1
           pts/1        Oct 10 14:53       31869 id=ts/1  term=0 exit=0
oracle   + pts/2        Jan  8 15:57   .   18127 (10.14.105.139)
oracle   + pts/3        Jan  8 15:57 00:18 18127 (10.14.105.139)
           pts/4        Dec 30 03:15        5026 id=ts/4  term=0 exit=0
           pts/3        Jun 29 16:01           0 id=/3    term=0 exit=0
root     + pts/1        Dec 26 13:42  old   6451 (:0.0)
           pts/2        Oct  4 22:35        8371 id=/2    term=0 exit=0
root     ? :0           Oct 23 15:32   ?   24215
           pts/5        Dec 30 03:15        5026 id=ts/5  term=0 exit=0
           pts/4        Dec 30 22:35           0 id=/4    term=0 exit=0
To find out your own login, use the -m option:
# who -m
oracle   pts/2        Jan  8 15:57 (10.14.105.139)
Note the pts/2 value? That’s the terminal number. You can find your own terminal via the tty command:
# tty
/dev/pts/2
There is a special command structure in Linux to show your own login – who am i. It produces the same output as the -m option.
# who am i
oracle   pts/2        Jan  8 15:57 (10.14.105.139)
The usual arguments are “am i” and “mom likes” (yes, believe it or not!); in the GNU version of who, any two arguments are treated the same way as -m. Both produce the same output.
The Original Instant Messenger System
With the advent of instant messaging or chat programs we seem to have conquered the ubiquitous challenge of maintaining a real time exchange of information while not getting distracted by voice communication. But are these only in the domain of the fancy programs?
The instant messaging or chat concept has been available on *nix for quite a while. In fact, you have a full-fledged secure IM system built right into Linux. It allows you to securely talk to anyone connected to the system; no internet connection is required. The chat is enabled through the commands write, mesg, wall, and talk. Let’s examine each of them.
The write command can write to a user’s terminal. If the user has logged in more than one terminal, you can address a specific terminal. Here is how you write a message “Beware of the virus” to the user “oracle” logged in on terminal “pts/3”:
# write oracle pts/3
Beware of the virus
ttyl
<Control-D>
#
The Control-D key combination ends the message, sends it to the user’s terminal, and returns the shell prompt (#) to the sender. When the above is sent, the user “oracle” will see the following messages on terminal pts/3:
Beware of the virus
ttyl
Each line will come up as the sender presses ENTER after the lines. When the sender presses Control-D, marking the end of transmission, the receiver sees EOF on the screen. The message will be displayed regardless of the current action of the user. If the user is editing a file in vi, the message comes and the user can clear it by pressing Control-L. If the user is on SQL*Plus prompt, the message still comes but does not affect the keystrokes of the user.
What if you don’t want that slight inconvenience? You don’t want anyone to send a message to you, akin to leaving the phone off the hook. You can do that via the mesg command. This command disables others’ ability to send you a message. The command without any arguments shows the current setting:
# mesg
is y
It shows that others can write to you. To turn it off:
# mesg n
Now to confirm:
# mesg
is n
When you attempt to write to the users’ terminals, you may want to know which terminals have disabled this writing from others. The who -T command (described earlier in this installment) shows you that:
# who -TH
NAME       LINE         TIME         COMMENT
oracle   + pts/2        Jan 11 12:08 (10.23.32.10)
oracle   + pts/3        Jan 11 12:08 (10.23.32.10)
oracle   - pts/4        Jan 11 12:08 (10.23.32.10)
root     + pts/1        Dec 26 13:42 (:0.0)
root     ? :0           Oct 23 15:32
The + sign before the terminal name indicates that it accepts write commands from others; the “-“ sign indicates that it doesn’t. The “?” indicates that the terminal does not support writing to it, e.g. an X-window session.
What if you want to write to all the logged in users? Instead of typing to each user, use the wall command:
# wall hello everyone
When sent, the following shows up on the terminals of all logged in users:
Broadcast message from oracle (pts/2) (Thu Jan 8 16:37:25 2009):
This is very useful for the root user. When you want to shut down the system, unmount a filesystem, or perform similar administrative functions, you may want all users to log off. Use this command to send a message to all of them.
Finally, the program talk allows you to chat in real time. Just type the following:
# talk oracle pts/2
If you want to talk to a user on a different server – prolin2 – you can use
# talk oracle@prolin2 pts/2
It brings up a chat window on the other terminal and now you can chat in real time. Is it that different from a “professional” chat program you are using now? Probably not. Oh, by the way, to make talk work, you should make sure the talkd daemon is running; it may not have been installed.
Yes, it’s a command, even if it’s just one letter long! The command w is a combination of uptime and who commands given one immediately after the other, in that order. Let’s see a very common output without any arguments and options.
# w
 17:29:22 up 672 days, 18:31,  2 users,  load average: 4.52, 4.54, 4.59
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
oracle   pts/1    10.14.105.139    16:43    0.00s  0.06s  0.01s w
oracle   pts/2    10.14.105.139    17:26   57.00s  3.17s  3.17s sqlplus   as sysdba
… and so on …
The output has two distinct parts. The first part shows the output of the uptime command (described above in this installment) which shows how long the server has been up, how many users have logged in and the load average for last 1, 5 and 15 minutes. The parts of the output have been explained under the uptime command. The second part of the output shows the output of the who command with the option -H (also explained in this installment). Again, these various columns have been explained under the who command.
If you would rather not display the header, use the -h option.
# w -h
oracle   pts/1    10.14.105.139    16:43    0.00s  0.02s  0.01s w -h
This removes the header from the output. It’s useful in shell scripts where you want to read and act on the output without the additional burden of skipping the header.
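As an illustration of that kind of script, the sketch below counts the sessions each user has open. The function name sessions_per_user is hypothetical; the code relies only on the user name being the first column of each line.

```shell
# sessions_per_user: read `w -h` output on stdin and print each user name
# followed by the number of sessions that user has open.
sessions_per_user() {
    awk '{ n[$1]++ } END { for (u in n) print u, n[u] }' | sort
}

# Typical use:
#   w -h | sessions_per_user
```

Because the header lines are already suppressed by -h, every input line is a session and no skipping logic is needed.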
The -s option produces a compact (short) version of the output, removing the login time, JCPU, and PCPU times.
# w -s
 17:30:07 up 672 days, 18:32,  2 users,  load average: 5.03, 4.65, 4.63
USER     TTY      FROM               IDLE WHAT
oracle   pts/1    10.14.105.139     0.00s w -s
oracle   pts/2    10.14.105.139     1:42  sqlplus   as sysdba
You might find that the “FROM” field is really not very useful. It shows the IP address of the same server, since the logins are all local. To save the space on the output, you may want to suppress that. The -f option disables printing of the FROM field:
# w -f
 17:30:53 up 672 days, 18:33,  2 users,  load average: 4.77, 4.65, 4.63
USER     TTY       LOGIN@   IDLE   JCPU   PCPU WHAT
oracle   pts/1    16:43    0.00s  0.06s  0.00s w -f
oracle   pts/2    17:26    2:28   3.17s  3.17s sqlplus   as sysdba
The command accepts only one parameter: the name of a user. By default w shows the process and logins for all users. If you put a username, it shows the logins for that user only. For instance, to show logins for root only, issue:
# w -h root
root     pts/1    :0.0             26Dec08 13days  0.01s  0.01s bash
root     :0       -                23Oct08 ?xdm?  21:13m  1.81s /usr/bin/gnome-session
The -h option was used to suppress the header.
A process is running and you want the process to be terminated. What should you do? The process runs in the background so there is no going to the terminal and pressing Control-C; or, the process belongs to another user (using the same userid, such as “oracle”) and you want to terminate it. The kill command comes to rescue; it does what its name suggests – it kills the process. The most common use is:
# kill <Process ID of the Linux process>
Suppose you want to kill a process called sqlplus issued by the user oracle. You need to know its process ID, or PID:
# ps -aef|grep sqlplus|grep oracle
oracle    8728 23916  0 10:36 pts/3    00:00:00 sqlplus
oracle    8768 23896  0 10:36 pts/2    00:00:00 grep sqlplus
Now, to kill the PID 8728:
# kill 8728
That’s it; the process is killed. Of course, you have to be the same user (oracle) to kill a process kicked off by oracle. To kill processes kicked off by other users you have to be super user – root.
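The ps listing above also matches the grep command itself, so a script that extracts PIDs has to filter that line out. Here is a rough sketch of one way to do it, assuming the standard ps -ef column layout (user in column 1, PID in column 2, command starting in column 8). The function name pids_for is made up for this example.

```shell
# Print the PIDs of processes owned by a given user whose command contains a
# pattern, reading `ps -ef` style lines on stdin.
# pids_for is a hypothetical helper name; $1 = user, $2 = pattern.
pids_for() {
    awk -v u="$1" -v p="$2" '
        $1 == u && $8 != "grep" {
            # Rebuild the command (column 8 onward) and search only that part.
            cmd = ""
            for (i = 8; i <= NF; i++) cmd = cmd " " $i
            if (index(cmd, p) > 0) print $2
        }'
}

# List the PIDs of all sqlplus processes owned by oracle.
ps -ef 2>/dev/null | pids_for oracle sqlplus
```

The resulting PIDs can then be passed to kill one at a time, after you have checked that they are the ones you intend to terminate.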
Sometimes you may want to merely halt the process instead of killing it. You can use the option -SIGSTOP with the kill command.
# kill -SIGSTOP 9790
# ps -aef|grep sqlplus|grep oracle
oracle 9790 23916 0 10:41 pts/3 00:00:00 sqlplus as sysdba
oracle 9885 23896 0 10:41 pts/2 00:00:00 grep sqlplus
This is good for background jobs but with the foreground processes, it merely stops the process and removes the control from the user. So, if you check for the process again after issuing the command:
# ps -aef|grep sqlplus|grep oracle
oracle 9790 23916 0 10:41 pts/3 00:00:00 sqlplus as sysdba
oracle 10144 23896 0 10:42 pts/2 00:00:00 grep sqlplus
You see that the process is still there. It has not been terminated. To kill this process, and any stubborn processes that refuse to be terminated, you have to pass a different signal, SIGKILL. (The default signal sent by kill is SIGTERM.)
# kill -SIGKILL 9790
# ps -aef|grep sqlplus|grep oracle
oracle 10092 23916 0 10:42 pts/3 00:00:00 sqlplus as sysdba
oracle 10198 23896 0 10:43 pts/2 00:00:00 grep sqlplus
PID 9790 no longer appears; the sqlplus process in the listing (PID 10092) is a different one.
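A common pattern in scripts is to try the default SIGTERM first and fall back to SIGKILL only if the process is still around after a short wait. The following is a minimal sketch of that idea, not a definitive implementation. The function name stop_gracefully and the default timeout of 5 seconds are assumptions made for this example.

```shell
# Send SIGTERM, wait up to a timeout, then fall back to SIGKILL.
# stop_gracefully is a hypothetical helper; $1 = PID, $2 = timeout in seconds.
stop_gracefully() {
    pid=$1
    timeout=${2:-5}

    # Returns 1 if the signal cannot be sent (no such process, no permission).
    kill -TERM "$pid" 2>/dev/null || return 1

    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only checks that the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # Still present after the timeout: send the signal that cannot be ignored.
    kill -KILL "$pid" 2>/dev/null
}

# Example: stop_gracefully 9790 10
```

Giving the process a chance to handle SIGTERM lets it clean up after itself, which SIGKILL never allows.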
Note the options -SIGSTOP and -SIGKILL, which pass a specific signal (stop and kill, respectively) to the process. Likewise there are several other signals you can use. To get a listing of all the available signals, you can use the -l (that’s the letter “L”, not the numeral “1”) option:
# kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
 5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
 9) SIGKILL     10) SIGUSR1     11) SIGSEGV     12) SIGUSR2
13) SIGPIPE     14) SIGALRM     15) SIGTERM     17) SIGCHLD
18) SIGCONT     19) SIGSTOP     20) SIGTSTP     21) SIGTTIN
22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO
30) SIGPWR      31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1
36) SIGRTMIN+2  37) SIGRTMIN+3  38) SIGRTMIN+4  39) SIGRTMIN+5
40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8  43) SIGRTMIN+9
44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13
52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9
56) SIGRTMAX-8  57) SIGRTMAX-7  58) SIGRTMAX-6  59) SIGRTMAX-5
60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2  63) SIGRTMAX-1
64) SIGRTMAX
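Among these signals, SIGSTOP and SIGCONT form a pair: the first suspends a process and the second lets it run again. Here is a small sketch of pausing and resuming a harmless background job. The helper names pause_process and resume_process are hypothetical, and sleep merely stands in for a real workload.

```shell
# pause_process and resume_process are hypothetical helper names.
pause_process()  { kill -STOP "$1"; }   # suspend; the process stays in the process table
resume_process() { kill -CONT "$1"; }   # let a stopped process run again

# Demonstration on a harmless background job.
sleep 30 &
pid=$!

pause_process "$pid"
kill -0 "$pid" && echo "process $pid is stopped but still exists"

resume_process "$pid"
kill -TERM "$pid"
wait "$pid" 2>/dev/null || true
echo "process $pid has been terminated"
```

A stopped process does not act on a SIGTERM until it is resumed, whereas SIGKILL takes effect even while the process is stopped.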
You can also use the numeric equivalent of the signal in place of the signal name. For instance, instead of kill -SIGKILL 9790, you can use kill -9 9790.
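If you cannot remember which number belongs to which signal, kill -l also accepts a number and prints the matching name. The short sketch below looks up three numbers that have the same meaning on practically every system; other numbers can differ between platforms, so it is safer to check than to assume.

```shell
# Translate signal numbers to names with the shell's own kill -l.
for num in 1 9 15; do
    echo "$num = SIG$(kill -l "$num")"
done

# A process killed by a signal exits with status 128 + the signal number,
# and kill -l accepts such an exit status as well.
echo "exit status 137 means SIG$(kill -l 137)"
```

On Linux the three numbers resolve to SIGHUP, SIGKILL and SIGTERM respectively.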
By the way, this is an interesting command. Remember, almost all Linux commands are executable files located in /bin, /sbin, /usr/bin and similar directories. The PATH environment variable determines where these command files can be found. Some other commands are actually “built-in” commands, i.e. they are part of the shell itself. One such example is kill. To demonstrate, give the following:
# kill -h
-bash: kill: h: invalid signal specification
Note the output that came back from the bash shell. The usage is incorrect since the -h argument was not expected. Now use the following:
# /bin/kill -h
usage: kill [ -s signal | -p ] [ -a ] pid ...
       kill -l [ signal ]
Aha! This version of kill, an executable in the /bin directory, accepted the option -h properly. Now you know the subtle difference between the shell built-in commands and their namesake utilities in the form of executable files.
Why is it important to know the difference? It’s important because the functionality varies significantly across these two forms. The kill built-in has less functionality than its utility equivalent. When you issue the command kill, you are actually invoking the built-in, not the utility. To get the additional functionality, you have to use the /bin/kill utility.
The kill utility has many options and arguments. The most popular use is to kill processes by process name rather than by PID. Here is an example where you want to kill all processes with the name sqlplus:
# /bin/kill sqlplus
 Terminated sqlplus
 Terminated sqlplus
 Terminated sqlplus
 Terminated sqlplus
 Terminated sqlplus
 Terminated sqlplus
- Terminated sqlplus
+ Terminated sqlplus
Sometimes you may want to see all the process IDs kill will terminate. The -p option accomplishes that. It prints all the PIDs it would have killed, without actually killing them. It serves as a confirmation prior to action:
# /bin/kill -p sqlplus
6798
6802
6803
6807
6808
6812
6813
6817
The output shows the PIDs of the processes it would have killed. If you reissue the command without the -p option, it will kill all those processes.
At this point you may be curious to know which other commands are “built-in” in the shell, instead of being utilities. You can list them by searching the man pages:
# man -k builtin
. [builtins]         (1)  - bash built-in commands, see bash(1)
: [builtins]         (1)  - bash built-in commands, see bash(1)
[ [builtins]         (1)  - bash built-in commands, see bash(1)
alias [builtins]     (1)  - bash built-in commands, see bash(1)
bash [builtins]      (1)  - bash built-in commands, see bash(1)
bg [builtins]        (1)  - bash built-in commands, see bash(1)
… and so on …
Some entries seem familiar – alias, bg and so on. Some are purely built-ins, e.g. alias. There is no executable file called alias.
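To check how your own shell resolves a particular name, you can ask it with the type command. The loop below is a small sketch; the exact wording of the output varies from shell to shell, and the /bin/kill path is the one used in this article, which may be different on your system.

```shell
# Report how the current shell resolves each command name.
for cmd in kill alias bg; do
    type "$cmd"
done

# The standalone kill utility, where installed, can still be called by path.
# /bin/kill is the location used in this article; it may differ on your system.
if [ -x /bin/kill ]; then
    echo "/bin/kill is also available as an executable"
fi
```

In bash, a built-in is typically reported with a line such as “kill is a shell builtin”.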
Usage for Oracle Users
Killing a process has many uses – mostly to clean up runaway or hung processes, processes that are in the background, and others that have stopped responding to the normal shutdown commands. For instance, suppose the Oracle database instance is not shutting down as a result of some memory issue. You have to bring it down by killing one of the key processes like pmon or smon. This should not be a routine activity; do it only when you don’t have much choice.
You may want to kill all sqlplus sessions or all rman jobs using the kill utility. Oracle Enterprise Manager processes run as perl processes, and you may also have DBCA or DBUA processes running that you want to kill quickly:
# /bin/kill perl rman perl dbca dbua java
There is also a more common use of the command. When you want to terminate a user session in Oracle Database, you typically use the ALTER SYSTEM KILL SESSION command. Let’s see what happens when we kill the session of the user SH.
SQL> select sid, serial#, status
  2  from v$session
  3* where username = 'SH';

       SID    SERIAL# STATUS
---------- ---------- --------
       116       5784 INACTIVE

SQL> alter system kill session '116,5784'
  2  /

System altered.

It’s killed; but when you check the status of the session:

       SID    SERIAL# STATUS
---------- ---------- --------
       116       5784 KILLED
It shows as KILLED, not completely gone. This happens because Oracle waits until the user SH returns to the session and attempts to do something, at which point the user gets the message “ORA-00028: your session has been killed”. After that, the session disappears from V$SESSION.
A faster way to kill a session is to kill the corresponding server process at the Linux level. To do so, first find the PID of the server process:
SQL> select spid
  2  from v$process
  3  where addr =
  4  (
  5    select paddr
  6    from v$session
  7    where username = 'SH'
  8  );

SPID
------------------------
30986
The SPID is the Process ID of the server process. Now kill this process:
# kill -9 30986
Now if you check the view V$SESSION, it will be gone immediately. The user will not get a message immediately; but if he attempts to perform a database query, he will get:
ERROR at line 1:
ORA-03135: connection lost contact
Process ID: 30986
Session ID: 125 Serial number: 34528
This is a faster method to kill a session but there are some caveats. The Oracle database has to perform a session cleanup (rolling back changes and so on), so this should be performed only when the sessions are idle. Otherwise you can use one of the two other ways to kill a session immediately:
alter system disconnect session '125,35447' immediate;
alter system disconnect session '125,35447' post_transaction;
Unlike the dual nature of kill, killall is purely a utility, i.e. this is an executable program in the /usr/bin directory. The command is similar to kill in functionality but instead of killing a process based on its PID, it accepts the process name as an argument. For instance, to kill all sqlplus processes, issue:
# killall sqlplus
This kills all processes named sqlplus (which you have the permission to kill, of course). Unlike with the kill built-in command, you don’t need to know the process IDs of the processes to be killed.
If the command does not terminate the process, or the process does not respond to a TERM signal, you can use the -s option to send an explicit SIGKILL signal, as you saw with the kill command.
# killall -s SIGKILL sqlplus
As with kill, you can use the -9 option in lieu of -s SIGKILL. For a list of all available signals, you can use the -l option.
# killall -l
HUP INT QUIT ILL TRAP ABRT IOT BUS FPE KILL USR1 SEGV USR2 PIPE ALRM TERM
STKFLT CHLD CONT STOP TSTP TTIN TTOU URG XCPU XFSZ VTALRM PROF WINCH IO PWR SYS
UNUSED
To get a verbose output of the killall command, use the -v option:
# killall -v sqlplus
Killed sqlplus(26448) with signal 15
Killed sqlplus(26452) with signal 15
Killed sqlplus(26456) with signal 15
Killed sqlplus(26457) with signal 15
… and so on …
Sometimes you may want to examine the process before terminating it. The -i option allows you to run the command interactively. This option prompts for your input before killing each process:
# killall -i sqlplus
Kill sqlplus(2537) ? (y/n) n
Kill sqlplus(2555) ? (y/n) n
Kill sqlplus(2555) ? (y/n) y
Killed sqlplus(2555) with signal 15
What happens when you pass a wrong process name?
# killall wrong_process
wrong_process: no process killed
There is no running process called wrong_process so nothing was killed and the output clearly showed that. To suppress the complaint “no process killed”, use the -q option. That option comes in handy in shell scripts where you can’t parse the output. Rather, you want to capture the return code from the command:
# killall -q wrong_process
# echo $?
1
The return code (shown by the shell variable $?) is “1”, instead of “0”, meaning failure. You can check the return code to find out whether the killall command was successful, i.e. the return code was “0”.
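In a script, the return code is usually tested directly in an if statement rather than read from $? afterwards. The wrapper below is a minimal sketch of that approach. The function name kill_by_name and its messages are made up for this example; only killall -q comes from the discussion above.

```shell
# kill_by_name is a hypothetical wrapper around killall -q; $1 = process name.
# It reports the outcome in plain words and passes success or failure on.
kill_by_name() {
    if killall -q "$1"; then
        echo "signalled all processes named $1"
    else
        echo "no process named $1 was killed"
        return 1
    fi
}

# Example: kill_by_name sqlplus || echo "nothing to clean up"
```

Because the function returns a non-zero status when nothing was killed, the caller can chain it with || or && just like any other command.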
One interesting thing about this command is that it does not kill itself. Of course, it kills other killall commands given elsewhere but not itself.
Usage for Oracle Users
Like the kill command, the killall command is also used to kill processes. The biggest advantages of killall are that it displays the process ID and that it can run interactively. Suppose you want to kill all perl, java, sqlplus, rman and dbca processes but do it interactively; you can issue:
# killall -i -g perl sqlplus java rman dbca
Kill sqlplus(pgid 7053) ? (y/n) n
Kill perl(pgid 31233) ? (y/n) n
... and so on ...
This allows you to view the process group ID before you kill the processes, which can be very useful.
In this installment you learned about these commands (shown in alphabetical order)
As I have mentioned earlier, it is not my intention to present every available command in Linux systems. You need to master only a handful of them to effectively manage a system and this series shows you those very important ones. Practice them in your environment to understand these commands – with their parameters and options – very well. In the next installment, the last one, you will learn how to manage a Linux environment – on a regular machine, in a virtual machine, and on the cloud.
Arup Nanda ( email@example.com) has been exclusively an Oracle DBA for more than 12 years with experiences spanning all areas of Oracle Database technology, and was named "DBA of the Year" by Oracle Magazine in 2003. Arup is a frequent speaker and writer in Oracle-related events and journals and an Oracle ACE Director. He co-authored four books, including RMAN Recipes for Oracle Database 11g: A Problem Solution Approach .