Performance Analysis in a Multitenant Cloud Environment

Using a Hadoop Cluster and Oracle Solaris 11

by Orgad Kimchi

Analyzing the performance of a virtualized multitenant cloud environment can be challenging because of the layers of abstraction. This article shows how to use Oracle Solaris 11 to overcome those limitations.


Published December 2013


Table of Contents
Preparation
Monitoring CPU Workload in a Virtualized Environment
Monitoring Disk I/O Activity in a Virtualized Environment
Monitoring Memory Utilization in a Virtualized Environment
Monitoring Network Utilization in a Virtualized Environment
Cleanup Tasks
Conclusion
See Also
About the Author

Note: The information in this article applies to Oracle Solaris 11 and Oracle Solaris 11.1.

Oracle Solaris 11 comes with a new set of commands that provide the ability to conduct performance analysis in a virtualized multitenant cloud environment. Performance analysis in a virtualized multitenant cloud environment with different users running various workloads can be a challenging task for the following reasons:

  • Each type of virtualization software adds an abstraction layer to enable better manageability. Although this makes it much simpler to manage the virtualized resources, it is very difficult to find which physical system resources are overloaded.
  • Each Oracle Solaris Zone can run a different workload; it can be disk I/O, network I/O, CPU, memory, or a combination of these. In addition, a single Oracle Solaris Zone can overload the entire system's resources.
  • It is very difficult to observe the environment; you need to be able to monitor the environment from the top level to see all the virtual instances (non-global zones) in real time with the ability to drill down to specific resources.

The following are benefits of using Oracle Solaris 11 for virtualized performance analysis:

  • Observability. The Oracle Solaris global zone is a fully functioning operating system, not a proprietary hypervisor or a minimized operating system that lacks the ability to observe the entire environment (including the host and the VMs) at the same time. The global zone can see all the non-global zones' performance metrics.
  • Integration. All the subsystems are built inside the same operating system. For example, the ZFS file system and the Oracle Solaris Zones virtualization technology are integrated together. This is preferable to mixing many vendors' technologies, which causes a lack of integration between the different operating system (OS) subsystems and makes it very difficult to analyze all the different OS subsystems at the same time.
  • Virtualization awareness. The built-in Oracle Solaris commands are virtualization-aware, and they can provide performance statistics for the entire system (the Oracle Solaris global zone). In addition to providing the ability to drill down into every resource (Oracle Solaris non-global zones), these commands provide accurate results during the performance analysis process.

In this article, we are going to explore four examples that show how we can monitor a virtualized environment with Oracle Solaris Zones using the built-in Oracle Solaris 11 tools. These tools provide the ability to drill down to specific resources, for example, CPU, memory, disk, and network. In addition, they provide the ability to print statistics per Oracle Solaris Zone and provide information on the running applications.

In the examples, we are going to use a Hadoop MapReduce benchmark covering CPU, disk, and network workloads. The Hadoop cluster setup is based on the setup described in "How to Set Up a Hadoop Cluster Using Oracle Solaris Zones." In the examples, all the Hadoop cluster building blocks will be installed using Oracle Solaris Zones, ZFS, and network virtualization technologies. Figure 1 shows the architecture:

Figure 1

Figure 1. Architecture

Preparation

First, let's check our environment using the zoneadm command:

root@global_zone:~# zoneadm list -civ

  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              solaris  shared
  19 sec-name-node    running    /zones/sec-name-node           solaris  excl
  23 data-node1       running    /zones/data-node1              solaris  excl
  24 data-node3       running    /zones/data-node3              solaris  excl
  25 data-node2       running    /zones/data-node2              solaris  excl
  26 name-node        running    /zones/name-node               solaris  excl

As we can see, we have five Oracle Solaris Zones in our environment:

  • name-node
  • sec-name-node
  • data-node1
  • data-node2
  • data-node3

In order to improve the system utilization by leveraging the multithread awareness in Oracle Solaris, let's enable 25 job slots per Hadoop zone, for a total of 75, by adding the following properties to mapred-site.xml:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>25</value>
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>25</value>
  </property>
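
After editing the file, the change must reach the TaskTracker zones and the MapReduce daemons must be restarted before the new slot counts take effect. Here is a minimal sketch, assuming the zone paths and Hadoop layout from the setup article referenced earlier:

root@global_zone:~# # paths below assume zonepath /zones/<name> and Hadoop in /usr/local/hadoop
root@global_zone:~# for z in data-node1 data-node2 data-node3; do cp /usr/local/hadoop/conf/mapred-site.xml /zones/$z/root/usr/local/hadoop/conf/; done
root@global_zone:~# zlogin -l hadoop name-node /usr/local/hadoop/bin/stop-mapred.sh
root@global_zone:~# zlogin -l hadoop name-node /usr/local/hadoop/bin/start-mapred.sh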

Before starting our benchmarks, let's verify our Hadoop cluster health:

root@global_zone:~# zlogin -l hadoop name-node hadoop dfsadmin -report

Oracle Corporation      SunOS 5.11      11.1    December 2012
Configured Capacity: 1445236931072 (1.31 TB)
Present Capacity: 1443007083849 (1.31 TB)
DFS Remaining: 1440410395136 (1.31 TB)
DFS Used: 2596688713 (2.42 GB)
DFS Used%: 0.18%
Under replicated blocks: 137
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

We should see that three DataNodes are available.

Note: DataNodes are nodes that store the data in the Hadoop Distributed File System (HDFS); they are also known as slaves and run the Task Tracker process shown in Figure 2.

Figure 2

Figure 2. DataNodes

Monitoring CPU Workload in a Virtualized Environment

Best practice for any performance analysis is to get a bird's eye view of the running environment in order to see which resource is the busiest, and then drill down to each resource. Table 1 shows a summary of the commands we will use in the CPU performance analysis.

Table 1. Command Summary
Command Description
psrinfo Displays processor information
zonestat Reports active-zone statistics
mpstat Reports per-processor or per-processor-set statistics
fsstat Reports file system statistics
vmstat Reports virtual memory statistics and CPU activity
ps Reports process status
pginfo Prints CPU topology
pgstat Reports utilization statistics per processor group
prstat Reports active-process statistics

First, let's get information about the physical resources, as shown in Listing 1. How many CPUs do we have?

root@global_zone:~# psrinfo -pv

The physical processor has 16 cores and 128 virtual processors (0-127)
  The core has 8 virtual processors (0-7)
  The core has 8 virtual processors (8-15)
  The core has 8 virtual processors (16-23)
  The core has 8 virtual processors (24-31)
  The core has 8 virtual processors (32-39)
  The core has 8 virtual processors (40-47)
  The core has 8 virtual processors (48-55)
  The core has 8 virtual processors (56-63)
  The core has 8 virtual processors (64-71)
  The core has 8 virtual processors (72-79)
  The core has 8 virtual processors (80-87)
  The core has 8 virtual processors (88-95)
  The core has 8 virtual processors (96-103)
  The core has 8 virtual processors (104-111)
  The core has 8 virtual processors (112-119)
  The core has 8 virtual processors (120-127)
    SPARC-T5 (chipid 0, clock 3600 MHz)

Listing 1

In Listing 1, we can see that the system has a single SPARC T5 processor from Oracle allocated to it. The CPU has 16 cores and 128 virtual processors (8 per core), enabling it to simultaneously run up to 128 software threads.

Note: You can use the psrinfo -p command to count how many physical CPUs there are.
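
For example, on this system it should report a single physical processor:

root@global_zone:~# psrinfo -p
1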

Before starting our performance analysis, we need to understand the workload characteristic of each Oracle Solaris Zone and determine whether it is CPU-bound, memory-bound, disk I/O–bound, or network I/O–bound. We will use the Oracle Solaris 11 zonestat, mpstat, and fsstat commands to do this. We will then use the output of the commands to analyze the workload that is running in the environment.

The first Hadoop benchmark that we are going to run to load our environment is Pi Estimator. Pi Estimator is a MapReduce program that employs a Monte Carlo method to estimate the value of pi.

In this example, we're going to use 128 maps and each of the maps will compute one billion samples (for a total of 128 billion samples).

Note: We are running one map per CPU.

Start the Pi Estimator program by running the following command from the global zone:

root@global_zone:~# zlogin -l hadoop name-node hadoop jar 
/usr/local/hadoop/hadoop-examples-1.2.0.jar pi 128 1000000000

where:

  • zlogin -l hadoop name-node specifies that the command be run as user hadoop on the name-node zone.
  • hadoop jar /usr/local/hadoop/hadoop-examples-1.2.0.jar pi specifies the Hadoop .jar file.
  • 128 specifies the number of maps.
  • 1000000000 specifies the number of samples.

Optional: We will run the command on the global zone using the zlogin command. However, you can run the command directly from the name-node zone, for example:

hadoop@name_node:$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.0.jar pi 128 1000000000

We can list the MapReduce jobs using the hadoop job -list command.

First, open another terminal window and run the command shown in Listing 2:

root@global_zone:~# zlogin -l hadoop name-node hadoop job -list
Oracle Corporation      SunOS 5.11      11.1    December 2012
1 jobs currently running
JobId   State   StartTime       UserName        Priority        SchedulingInfo
job_201309081525_0135   1       1378711957491   hadoop  NORMAL  NA

Listing 2

In Listing 2, we can see the job ID (JobId) and its start time (StartTime).

For a full job description (for example, map and reduce completion percentage and all job counters), run the following command, providing the JobId as a parameter to the hadoop job -status command.

root@global_zone:~# zlogin -l hadoop name-node hadoop job -status job_201309081525_0135   
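
The full status output is not reproduced here; an illustrative excerpt (the exact fields and counters vary with the Hadoop version and job) looks something like this:

Job: job_201309081525_0135
map() completion: 0.8515625
reduce() completion: 0.2266666
Counters: 23
...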

The first command that we are going to use in the performance analysis is the zonestat command, which allows us to monitor all the Oracle Solaris Zones running in our environment and provides real-time statistics for the CPU utilization, memory utilization, and network utilization.

Run the zonestat command at 10-second intervals, as shown in Listing 3:

root@global_zone:~# zonestat 10 10

Interval: 1, Duration: 0:00:10
SUMMARY                Cpus/Online: 128/12    PhysMem: 256G   VirtMem: 259G
                    ---CPU----  --PhysMem-- --VirtMem-- --PhysNet--
               ZONE  USED %PART  USED %USED  USED %USED PBYTE %PUSE
            [total] 118.10 92.2% 24.6G 9.62% 60.0G 23.0% 18.4E  100%
           [system]   0.00 0.00% 9684M 3.69% 40.5G 15.5%     -     -
         data-node3  42.13 32.9% 4897M 1.86% 6146M 2.30% 18.4E  100%
         data-node1  41.49 32.4% 4891M 1.86% 6173M 2.31% 18.4E  100%
         data-node2  33.97 26.5% 4851M 1.85% 6145M 2.30% 18.4E  100%
             global   0.34 0.27%  283M 0.10%  420M 0.15%  2192 0.00%
          name-node   0.15 0.11%  419M 0.15%  718M 0.26%   126 0.00%
      sec-name-node   0.00 0.00%  205M 0.07%  363M 0.13%     0 0.00%

Listing 3

As we can see from the zonestat output in Listing 3, the Pi Estimator program is a CPU-bound application (%PART 92.2%).

The zonestat command prints out the following information:

  • The ZONE column shows the zone name:

    • [total] shows the total quantity of resources used system-wide.
    • [system] shows the quantity of resources used by the kernel or in a manner not associated with any particular zone.

    Note: When zonestat is run from within a non-global zone, this value is the aggregate resources consumed by the system and all other zones.

  • CPU shows CPU information:

    • USED shows how many CPUs are used.
    • %PART shows the CPU utilization, as a percentage of the compute capacity of the processor set to which the zone is bound.

    Note: In order to see system-wide processor information, use the psrset -i command or the cputypes.d DTrace script located in /usr/dtrace/DTT/Bin.

  • PhysMem shows physical memory information:

    • USED shows how much memory has been used.
    • %USED shows the amount of resources used as a percentage of the total resources.
  • VirtMem shows virtual memory information:

    • USED shows how much virtual memory has been used.
    • %USED shows the amount of resources used as a percentage of the total virtual memory in the system.
  • PhysNet shows networking information:

    • PBYTE shows the number of received and transmitted bytes that consume the physical bandwidth.
    • %PUSE shows the sum of received and transmitted bytes as a percentage of the total available physical bandwidth.

The zonestat command can print a report about the total and high utilizations for a time period, which is useful if you want to see the peak usage. This information can be used to compare the current activity to previous activity and to help with capacity planning for future growth.

For example, the command shown in Listing 4 will monitor silently at a 10-second interval for three minutes, and then produce a report showing the total and high utilizations.

root@global_zone:~# zonestat -q -R total,high 10s 3m 3m

Report: Total Usage
    Start: Wednesday, November 13, 2013 10:37:04 AM IST
      End: Wednesday, November 13, 2013 10:40:04 AM IST
    Intervals: 18, Duration: 0:03:00
SUMMARY                Cpus/Online: 128/12    PhysMem: 256G   VirtMem: 259G
                    ---CPU----  --PhysMem-- --VirtMem-- --PhysNet--
               ZONE  USED %PART  USED %USED  USED %USED PBYTE %PUSE
            [total]  79.9 61.7% 29.5G 11.5% 61.7G 23.7%  7762 0.00%
           [system]  3.63 2.83% 9538M 3.63% 37.7G 14.5%     -     -
         data-node3  25.9 19.6%  557M 0.21% 5857M 2.20% 18.4E  100%
         data-node2 24.81 19.3%  435M 0.16% 5715M 2.14% 18.4E  100%
         data-node1 24.61 19.2%  552M 0.21% 5867M 2.20% 18.4E  100%
             global  0.87 0.68% 6014M 2.29% 6134M 2.30%  908K 0.00%
          name-node  0.06 0.04%  485M 0.18%  619M 0.23% 18.4E  100%
      sec-name-node  0.00 0.00%  260M 0.09%  291M 0.10%     0 0.00%

Report: High Usage
    Start: Wednesday, November 13, 2013 10:37:04 AM IST
      End: Wednesday, November 13, 2013 10:40:04 AM IST
    Intervals: 18, Duration: 0:03:00
SUMMARY                Cpus/Online: 128/12    PhysMem: 256G   VirtMem: 259G
                    ---CPU----  --PhysMem-- --VirtMem-- --PhysNet--
               ZONE  USED %PART  USED %USED  USED %USED PBYTE %PUSE
            [total] 111.17 86.8% 31.5G 12.3% 63.8G 24.5%  892K 0.00%
           [system] 23.65 18.4% 9643M 3.67% 37.8G 14.5%     -     -
         data-node3 25.85 20.2%  557M 0.21%  976M 0.36% 18.4E  100%
         data-node2 22.95 17.9%  435M 0.16%  534M 0.20% 18.4E  100%
         data-node1 22.22 17.3%  552M 0.21%  774M 0.29% 18.4E  100%
             global  2.87 2.24% 6014M 2.29% 6128M 2.30%  946K 0.00%
          name-node  0.08 0.06%  485M 0.18%  619M 0.23% 18.4E  100%
      sec-name-node  0.00 0.00%  260M 0.09%  291M 0.10%     0 0.00%

Listing 4

We can see the following from the output in Listing 4:

  • The average CPU usage was 61.7 percent (SUMMARY, %PART 61.7%).
  • The high usage was 86.8 percent (Report: High Usage, %PART 86.8%).

In addition, we can see whether the CPU utilization is balanced evenly across the available CPUs. In Listing 4, we can see that a total of 111 CPUs were used during the high-usage interval and that each DataNode zone used between 22 and 26 CPUs.

We can also use the zonestat command to collect system utilization information for a long period (days, weeks, or months). For example, the following command monitors silently at a 10-second interval for 24 hours, producing a total- and high-utilization report every hour:

root@global_zone:~# zonestat -q -R total,high 10s 24h 1h
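
Because such a run produces only one report per hour, it is convenient to timestamp each report and capture the output in a file for later analysis. Here is a sketch, assuming the -T d timestamp option described in zonestat(1) and an arbitrary log location:

root@global_zone:~# # /var/tmp/zonestat-24h.log is an arbitrary location
root@global_zone:~# zonestat -q -T d -R total,high 10s 24h 1h > /var/tmp/zonestat-24h.log &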

Another useful command for showing whether CPU utilization is balanced evenly across the available CPUs is the mpstat command.

Listing 5 is the output of the Oracle Solaris mpstat(1M) command. Each line represents one virtual CPU.

root@global_zone:~# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0   85   0 10183   683   59  931   40  269  464    2  1315   30  14   0  56
   1   80   0 34872   484    9 1096   39  317  498    2  1437   34  14   0  51
   2   72   0 15632   325    4  669   30  166  334    1  1321   37   9   0  54
   3   42   0 13422   253    3  553   32  144  277    2   818   31   7   0  62
   4   57   0 14009   351    5  736   43  204  352    2   936   28   8   0  64
   5   67   0 10445   258    4  562   28  162  297    2   732   27   9   0  64
   6   49   0 15770   322    5  660   36  187  304    2   332   32   7   0  60
   7   44   0 5872   351    8  802   42  222  396    2  1077   30   9   0  61
   8   34   0 12701   391    7  826   35  245  430    2   854   33  11   0  56
   9   63   0 11926   578    7 1311   52  372  613    4   958   35  13   0  53
  10   82   0 11602   423    7  930   43  222  432    3   991   32  10   0  58
  11   24   0 14940   253    4  525   26  139  281    2   692   33   7   0  60
  12   35   0 10450   285    3  571   17  141  307    2   713   30   8   0  62
  13   46   0 27600   298    7  625   35  172  310    2   580   32   8   0  60
  14   49   0 14039   371    5  770   30  212  377    2   726   28   9   0  63
  15   73   0 10714   289    4  643   27  163  334    3   883   33   8   0  59

Listing 5

As shown in Listing 5, the mpstat command reports the following information:

  • The CPU column shows the logical CPU ID.
  • The minf column shows the number of minor faults.
  • The mjf column shows the number of major faults.
  • The xcal column shows the number of inter-processor cross calls.
  • The intr column shows the number of interrupts.
  • The ithr column shows the number of interrupts serviced as threads (lower IPL).
  • The csw column shows the number of context switches (total).
  • The icsw column shows the number of involuntary context switches.
  • The migr column shows the number of thread migrations (to another processor).
  • The smtx column shows the number of spins on mutex locks.
  • The srw column shows the number of spins on reader/writer locks.
  • The syscl column shows the number of system calls.
  • The usr column shows the percentage of user time used by the CPU.
  • The sys column shows the percentage of system time (kernel) used by the CPU.
  • The wt column shows the wait I/O time (this is deprecated, so it is always zero).
  • The idl column shows the percentage of idle time.

We can see how busy each CPU is by looking at the idl field, which shows the percentage of idle time. In addition, we can visualize the output of the mpstat command using the dim_stat tool, as shown in Figure 3.

Figure 3

Figure 3. Output of the dim_stat tool

On a system with many CPUs, the mpstat output can be very long; however, we can monitor CPU utilization per core, as shown in Listing 6.

root@global_zone:~#  mpstat -A core 10
COR minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
3074  103   0 23654  1680  697 1264  644  277  502   10 11268  748  52   0   0   8
3078   95   0 32090   893  137 1228  635  281  439    8 10929  759  41   0   0   8
3082   94   0 31574   889  129 1245  629  308  560    9 12792  753  47   0   0   8
3086  111   0 20262   829  121 1200  615  277  512    7 12657  753  47   0   0   8
3090  155   0 16849   896  133 1276  646  281  567    9 12321  749  51   0   0   8
3094  123   0 24022   810  100 1210  617  283  512    8 12009  751  49   0   0   8
3098  101   0 25212   798   96 1186  594  286  577    6 14205  745  55   0   0   8
3102  111   0 25626   734   88 1091  555  230  489   10 11338  762  38   0   0   8
3106  126   0 31042   832  112 1206  614  281  513   11 11111  757  43   0   0   8
3110  126   0 33856   777   88 1167  596  245  596   10 12739  751  49   0   0   8
3114  136   0 21364   895  131 1280  646  273  586    7 13259  748  52   0   0   8
3118  128   0 28063  1021  111 1506  746  265  594    7 11178  752  48   0   0   8
3122  125   0 18047   918  124 1313  667  287  550   12 12720  749  51   0   0   8
3126   87   0 30336   898  130 1257  640  268  533   11 12930  747  53   0   0   8
3130  127   0 21213   944  138 1340  676  292  516    8 13842  748  52   0   0   8
3134  115   0 31696   767  103 1098  561  259  495    7 10162  755  45   0   0   8

Listing 6

In Listing 6, we can see that mpstat prints CPU performance statistics for each core, and it shows a total of 16 cores in the SPARC T5 CPU. Later, we will see how to monitor each core and observe its floating point and integer pipelines. In addition, mpstat can print performance statistics per socket or per processor set, and it can print the number of CPUs in each core (in the sze column). For more examples, see mpstat(1M).
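
For example, assuming socket is the aggregation keyword documented in mpstat(1M) for your release, a per-socket report at 10-second intervals would look like this sketch:

root@global_zone:~# # "socket" assumed; check mpstat(1M) for valid -A aggregations
root@global_zone:~# mpstat -A socket 10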

The next command that we are going to use in the performance analysis is the fsstat command, which is useful for disk I/O monitoring. It allows us to monitor disk I/O activity per disk or per Oracle Solaris Zone. We will use this command to check whether Pi Estimator is disk I/O–bound in addition to being CPU-bound.

For example, we can monitor the writes to all ZFS file systems at 10-second intervals using the command shown in Listing 7.

root@global_zone:~# fsstat -Z zfs 10 10

   new  name   name  attr  attr lookup rddir  read read  write write
   file remov  chng   get   set    ops   ops   ops bytes   ops bytes
 
    0     0     0    744     0  11.4K     0 6.01K 5.87M     0     0 zfs:global
    0     0     0    151     0  3.27K     0 1.41K 1.94M     7 1.42K zfs:data-node1
    0     0     0    359     0  8.72K     0 2.75K 3.95M    22 4.06K zfs:data-node2
    0     0     0    413     0  9.03K     0 2.98K 4.22M    21 4.34K zfs:data-node3
    0     0     0    14      0     51     0     0     0     0     0 zfs:name-node
    0     0     0    14      0     51     0     0     0     0     0 zfs:sec-name-node

Listing 7

The default report shows general file system activity. As shown in Listing 7, the output combines similar operations into the following general categories:

  • The new file column shows the number of creation operations for file system objects (for example, files, directories, symlinks, and so on).
  • The name remov column shows the number of name removal operations.
  • The name chng column shows the number of name change operations.
  • The attr get column shows the number of object attribute retrieval operations.
  • The attr set column shows the number of object attribute change operations.
  • The lookup ops column shows the number of object lookup operations.
  • The rddir ops column shows the number of read directory operations.
  • The read ops column shows the number of data read operations.
  • The read bytes column shows the number of bytes transferred by data read operations.
  • The write ops column shows the number of data write operations.
  • The write bytes column shows the number of bytes transferred by data write operations.

As we can see from the fsstat output in Listing 7, the number of read operations and write operations indicates that the disk utilization is very low.

Based on the zonestat, mpstat, and fsstat output, the conclusion is that the Pi Estimator program is a CPU-bound application. (We will demonstrate how to determine the disk I/O workload in the second example.)

So let's continue our CPU performance analysis. The next question that we are going to ask is whether there is idle CPU time.

Note: You might need to rerun the Pi Estimator program if it has finished. Once the program finishes, it prints the job summary information, for example:

Job Finished in 309.819 seconds
Estimated value of Pi is 3.14159266603125000000

As shown in Listing 8, we can use the vmstat command to see whether we have idle CPU time.

root@global_zone:~# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s3 s4 s5 s6   in   sy   cs us sy id
 
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s3 s4 s5 s6   in   sy   cs us sy id
 8 0 0 213772168 245340872 770 5954 0 0 0 0 0 0 0 0  0 17732 161637 39181 93 7 0
 12 0 0 213346168 244887200 134 2237 0 0 0 0 0 0 0 0 0 13689 140604 19640 96 4 0
 17 0 0 212974464 244353760 124 1939 0 0 0 0 0 0 0 0 0 12079 130895 17225 96 4 0
 29 0 0 212657512 243704448 118 2662 0 0 0 0 0 117 0 0 0 13804 131482 18530 95 5 0
 41 0 0 210748016 241728920 202 2962 0 0 0 0 0 0 0 78 71 15214 122457 21418 96 4 0
 32 0 0 209808688 240699416 80 2509 0 0 0 0 0 0 0 0  0 16524 146238 31628 97 3 0
 36 0 0 209122192 239714912 13 1991 0 0 0 0 0 0 0 0  0 16743 132784 35315 97 3 0
 22 0 0 207632424 238260184 0 2709 0 0 0 0 0 0 0  0  0 23357 146885 56380 96 4 0
 13 0 0 206528520 236636888 0 1346 0 0 0 0 0 100 0 0 0 16161 74431 37560 98 2 0
 1 0 0 206277936 236263016 0 1448 0 0 0 0 0 0  0 78 79 14499 59197 28766 96 1 3
 0 0 0 206069656 235992504 0 1801 0 0 0 0 0 0  0  0  0 20478 66313 49291 90 2 8
 18 0 0 205840984 235578720 0 957 0 0 0 0 0 0  0  0  0 15203 34428 28456 78 1 21
 0 0 0 205762976 235447600 0 653 0 0 0 0 0  0  0  0  0 7977 8926 7977 62  0 38
 0 0 0 205762976 235443400 910 1353 0 0 0 0 0 137 0 0 0 22505 14974 41402 59 3 39

Listing 8

In Listing 8, you can see the CPU idle time in the id column. A value of 0 in the id column means that the system's CPU is 100 percent busy!

The next question is whether there are threads waiting for an available CPU. This is known as run queue latency and is easily tracked in the r column, which prints the number of kernel threads in the run queue. You can also track run queue latency by using the prstat -Lm command and noting the value in the LAT column.
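
If you also want a periodic view of the run queue size, the Solaris sar command can sample it; for example, the following takes five samples at 5-second intervals (the runq-sz column shows the average run queue length):

root@global_zone:~# sar -q 5 5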

In addition, we can use the prstat command to see whether the CPU cycles are being consumed in user mode or in system (kernel) mode, as shown in Listing 9:

root@global_zone:~# prstat -ZmL

Total: 310 processes, 8269 lwps, load averages: 47.63, 48.79, 36.98
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 19338 hadoop   100 0.0 0.0 0.0 0.0 0.0 0.0 0.3   0  73   0   0 java/2
 19329 hadoop   100 0.0 0.0 0.0 0.0 0.0 0.0 0.4   0  86   0   0 java/2
 19519 hadoop    84  15 0.1 0.0 0.2 0.0 0.0 0.8  56 153 29K   0 java/2
 19503 hadoop    88  11 0.1 0.0 0.3 0.1 0.0 1.0  52 168 23K   3 java/2
 19536 hadoop    81  18 0.1 0.0 0.4 0.0 0.0 1.1  83 268 32K   0 java/2
 19495 hadoop    89 9.6 0.1 0.0 0.3 0.7 0.0 0.7  51 163 21K   2 java/2
 19523 hadoop    82  16 0.1 0.0 0.6 0.7 0.0 0.5  87 214 31K   0 java/2
 19259 hadoop    97 0.0 0.0 0.0 0.0 2.9 0.0 0.1   4  36   9   1 java/2
 19555 hadoop    73  24 0.2 0.0 1.9 0.0 0.1 0.9 207 207 35K   0 java/2
 19263 hadoop    97 0.0 0.0 0.0 0.0 2.8 0.0 0.3   6  66  13   2 java/2
 19258 hadoop    97 0.0 0.0 0.0 0.0 2.8 0.0 0.3   6 120  15   2 java/2
 19285 hadoop    97 0.0 0.0 0.0 0.0 2.8 0.0 0.4   7  89  13   2 java/2
 19331 hadoop    97 0.0 0.0 0.0 0.0 3.0 0.0 0.2   5  65  13   2 java/2
 19272 hadoop    97 0.0 0.0 0.0 0.0 2.9 0.0 0.3   6  65  13   2 java/2
 19313 hadoop    97 0.0 0.0 0.0 0.0 3.1 0.0 0.1   6  68  11   2 java/2
ZONEID     NLWP  SWAP   RSS MEMORY      TIME  CPU ZONE
    12     1722 5073M 3638M    28%   0:54:41  31% data-node1
    13     1846 6131M 4757M    37%   0:50:54  39% data-node2
    14     1426 4446M 3452M    28%   0:51:54  28% data-node3
     0     1726  973M  603M   0.7%   0:45:37 0.1% global
    15      446  883M  558M   5.6%   0:04:32 1.9% name-node
Total: 295 processes, 7398 lwps, load averages: 50.67, 49.39, 37.25

Listing 9

As shown in Listing 9, the following columns are displayed in the prstat output:

  • The USR column shows the percentage of time the process has spent in user mode.
  • The SYS column shows the percentage of time the process has spent in system mode.
  • The TRP column shows the percentage of time the process has spent in processing system traps.
  • The TFL column shows the percentage of time the process has spent processing text page faults.
  • The DFL column shows the percentage of time the process has spent processing data page faults.
  • The LCK column shows the percentage of time the process has spent waiting for user locks.
  • The SLP column shows the percentage of time the process has spent sleeping.
  • The LAT column shows the percentage of time the process has spent waiting for CPU.
  • The VCX column shows the number of voluntary context switches.
  • The ICX column shows the number of involuntary context switches.
  • The SCL column shows the number of system calls.
  • The SIG column shows the number of signals received.

The prstat output in Listing 9 shows that the system CPU cycles are being consumed in user mode (USR). For more prstat examples, see http://www.scalingbits.com/performance/prstat.

Another useful command that is virtualization-aware is ps. When you use the -Z option, it prints—under an additional ZONE column header—the name of the zone with which the process is associated.

Note: The command is aware when it's running within a non-global zone; thus, it can't see processes that belong to other zones when run from a non-global zone.

Use the ps -efZ command to see a full listing (f) about every process now running (e), along with the associated zone name (Z). For example, to print all the Hadoop processes that are running now, use the command shown in Listing 10:

root@global_zone:~# ps -efZ | grep hadoop
ZONE      UID   PID  PPID   C    STIME TTY         TIME CMD
data-nod   hadoop 14024 11795   0 07:38:19 ?           0:20 
/usr/jdk/instances/jdk1.6.0/jre/bin/java -Djava.library.path=/usr/local/hadoop-
data-nod   hadoop 14026 11798   0 07:38:19 ?           0:19 
/usr/jdk/instances/jdk1.6.0/jre/bin/java -Djava.library.path=/usr/local/hadoop-
name-nod   hadoop 11621     1   0 07:20:12 ?           0:59 /usr/java/bin/java 
-Dproc_jobtracker -Xmx1000m -Dcom.sun.management.jmxremote -
name-nod   hadoop 11263     1   0 07:20:07 ?           0:27 /usr/java/bin/java 
-Dproc_namenode -Xmx1000m -Dcom.sun.management.jmxremote -Dc
data-nod   hadoop 13912 11798   0 07:38:18 ?           0:23 
/usr/jdk/instances/jdk1.6.0/jre/bin/java -Djava.library.path=/usr/local/hadoop-
data-nod   hadoop 11730     1   1 07:20:14 ?           2:58 /usr/java/bin/java 
-Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/var/log/hadoo
data-nod   hadoop 11458     1   0 07:20:09 ?           0:18 /usr/java/bin/java 
-Dproc_datanode -Xmx1000m -server -Dcom.sun.management.jmxre
data-nod   hadoop 13957 11798   1 07:38:18 ?           0:23 
/usr/jdk/instances/jdk1.6.0/jre/bin/java -Djava.library.path=/usr/local/hadoop-
data-nod   hadoop 13953 11795   0 07:38:18 ?           0:24 
/usr/jdk/instances/jdk1.6.0/jre/bin/java -Djava.library.path=/usr/local/hadoop-
data-nod   hadoop 13815 11730   0 07:38:15 ?           0:22 
/usr/jdk/instances/jdk1.6.0/jre/bin/java -Djava.library.path=/usr/local/hadoop-
data-nod   hadoop 13965 11795   0 07:38:18 ?           0:22 
/usr/jdk/instances/jdk1.6.0/jre/bin/java -Djava.library.path=/usr/local/hadoop-

Listing 10

Note: The ZONE column width is limited to eight characters.
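
If the truncated names are ambiguous, one workaround is the -o option of ps, which lets you select output columns explicitly; here is a sketch (see ps(1) for the available column names):

root@global_zone:~# ps -eo zone,pid,pcpu,comm | grep java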

A SPARC T5 CPU has 16 cores and 128 threads, and each of the cores has two integer pipelines and one floating point pipeline. For more information about the SPARC T5 CPU architecture, see this white paper.

Note: You can use the pginfo -p -T command to see the CPU topology. For more information, see https://blogs.oracle.com/d/entry/pginfo_pgstat.

If we need to know how heavily we are loading the pipelines, pgstat is the tool that can provide this level of detail.

For example, use the command shown in Listing 11 to run a pgstat report for three minutes:

root@global_zone:~# pgstat -A 60 3

  SUMMARY: UTILIZATION OVER 180 SECONDS

                                  ------HARDWARE------ ------SOFTWARE------
PG   RELATIONSHIP                    MIN    AVG    MAX    MIN    AVG    MAX CPUS
  0  System                            -      -      -   1.8%  79.1% 100.0% 0-127
  3   Data_Pipe_to_memory              -      -      -   1.8%  79.1% 100.0% 0-127
  4    CPU_PM_Active_Power_Domain      -      -      -   1.5%  78.4% 100.0% 0-15
  2     Floating_Point_Unit         0.0%   8.8%  12.7%   1.8%  78.1% 100.0% 0-7
  1      Integer_Pipeline           0.7%  91.2%  98.7%   1.8%  78.1% 100.0% 0-7
  6     Floating_Point_Unit         0.0%   7.6%  12.7%   1.2%  78.8% 100.0% 8-15
  5      Integer_Pipeline           1.0%  90.5%  98.9%   1.2%  78.8% 100.0% 8-15
  9   CPU_PM_Active_Power_Domain      -      -      -   2.4%  78.3% 100.0% 16-31
  8    Floating_Point_Unit         0.0%   5.8%  12.7%   1.5%  82.0% 100.0% 16-23
  7     Integer_Pipeline           0.9%  86.5%  99.2%   1.5%  82.0% 100.0% 16-23
 11    Floating_Point_Unit         0.0%   5.5%  12.5%   3.2%  74.5% 100.0% 24-31
 10     Integer_Pipeline           1.4%  85.4%  98.2%   3.2%  74.5% 100.0% 24-31
 14   CPU_PM_Active_Power_Domain      -      -      -   1.6%  78.8% 100.0% 32-47
 13    Floating_Point_Unit         0.0%   7.9%  12.7%   2.5%  79.7% 100.0% 32-39
 12     Integer_Pipeline           5.1%  90.5%  99.2%   2.5%  79.7% 100.0% 32-39
 16    Floating_Point_Unit         0.0%   7.3%  12.6%   0.5%  77.7% 100.0% 40-47
 15     Integer_Pipeline           1.0%  87.3%  98.5%   0.5%  77.7% 100.0% 40-47
 19   CPU_PM_Active_Power_Domain      -      -      -   1.2%  81.0% 100.0% 48-63
 18    Floating_Point_Unit         0.0%   8.8%  12.8%   1.0%  84.7% 100.0% 48-55
 17     Integer_Pipeline           0.7%  95.2%  99.5%   1.0%  84.7% 100.0% 48-55
 21    Floating_Point_Unit         0.0%   7.7%  12.7%   1.4%  77.2% 100.0% 56-63
 20     Integer_Pipeline           1.2%  90.4%  98.9%   1.4%  77.2% 100.0% 56-63
 24   CPU_PM_Active_Power_Domain      -      -      -   2.6%  78.9% 100.0% 64-79
 23    Floating_Point_Unit         0.0%   7.1%  12.7%   4.7%  79.8% 100.0% 64-71
 22     Integer_Pipeline           9.2%  88.4%  98.9%   4.7%  79.8% 100.0% 64-71
 26    Floating_Point_Unit         0.0%   6.2%  12.6%   0.6%  77.9% 100.0% 72-79
 25     Integer_Pipeline           0.8%  84.7%  98.4%   0.6%  77.9% 100.0% 72-79
...

Listing 11

In Listing 11, we can see how busy each pipeline was (Integer_Pipeline, Floating_Point_Unit), and we can also see that we are pushing the system up to 90+ percent integer utilization. In addition, we can see the memory bandwidth (Data_Pipe_to_memory).

Next, we want to observe the application that is responsible for the load. For example, what code paths are making the CPUs busy? And which process in each zone is responsible for the system load?

In the next example, we are going to drill down into one of the Oracle Solaris Zones to understand which application or process is responsible for the load.

Let's log in to the data-node1 zone:

root@global_zone:~# zlogin data-node1

Note: You can verify that you have successfully logged in to the data-node1 zone by using the zonename command, which will print the current zone's name.

root@data-node1:~# zonename
data-node1

As shown in Listing 12, we can use the prstat command inside the zone to see which process is responsible for the system load.

root@data-node1:~# prstat -mLc

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 22866 root      24  74 1.6 0.0 0.0 0.0 0.0 0.0 122 122 85K   0 prstat/1
 22715 hadoop    80 3.3 0.1 0.0 0.0 4.0 0.1  12  45 201  4K   4 java/2
 22704 hadoop    80 3.3 0.2 0.0 0.0 6.2 0.4  10  61 277  4K  10 java/2
 22721 hadoop    79 3.4 0.2 0.0 0.0 3.5 0.3  14  52 290  4K   5 java/2
 22740 hadoop    78 3.5 0.2 0.0 0.0 3.2 0.1  15  67 400  4K   9 java/2
 22710 hadoop    78 3.2 0.2 0.0 0.0 5.9 0.1  13  53 313  4K   5 java/2
 22691 hadoop    78 3.0 0.2 0.0 0.0 4.6 0.2  14  49 349  3K   5 java/2
 22734 hadoop    77 3.5 0.2 0.0 0.0 3.5 0.2  15  55 373  4K   7 java/2
 22746 hadoop    77 3.6 0.2 0.0 0.0 4.1 0.2  15  71 356  4K   9 java/2
 22752 hadoop    76 3.6 0.2 0.0 0.0 5.6 0.2  14  76 323  4K  10 java/2
 22767 hadoop    76 3.9 0.1 0.0 0.0 3.4 0.4  16  61 374  4K   9 java/2
 22698 hadoop    76 3.3 0.2 0.0 0.0 4.2 0.0  16  65 324  4K  11 java/2
 22792 hadoop    75 4.3 0.1 0.0 0.0 4.3 0.3  16  63 271  4K   6 java/2
 22760 hadoop    76 3.7 0.1 0.0 0.0 6.0 0.9  14  67 280  4K  11 java/2
 22685 hadoop    76 2.9 0.1 0.0 0.0 4.5 0.0  16  43 259  3K   4 java/2
Total: 58 processes, 2011 lwps, load averages: 58.00, 55.80, 41.70

Listing 12

In Listing 12, we can see that the hadoop user's Java processes are responsible for the system load.

Note: In Listing 12, we saw that we can run the prstat command from inside a local zone to see per-zone performance statistics. We can also run it from the global zone to get a full system view. This is an example of a virtualization-aware command; the command is aware when it is running within a non-global zone and, thus, can't see processes in other zones.

Monitoring Disk I/O Activity in a Virtualized Environment

In our second example, we will use the Hadoop built-in benchmarks to observe and monitor disk I/O activity. Table 2 shows a summary of the commands we will use to monitor disk I/O activity.

Table 2. Command Summary
Command Description
fsstat Displays disk I/O workload per Oracle Solaris Zone
zpool iostat Displays I/O statistics for the given pool
iostat Reports disk I/O statistics
iotop Displays top disk I/O events by process per zone
rwtop Displays top read/write bytes by process
iopattern Displays details about the I/O access pattern for all the disks

The ZFS and disk layout is shown in Figure 4.

Figure 4

Figure 4. Disk Layout

Let's print how many hard disks we have in the system:

root@global_zone:~# format < /dev/null
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t5001517803D013B3d0 <ATA-INTEL SSDSA2BZ30-0362 cyl 35769 alt 2 hd 128 sec 128>  solaris
          /scsi_vhci/disk@g5001517803d013b3
          /dev/chassis/SPARC_T5-2.AK00104209//SYS/SASBP/HDD0/disk
       1. c0t5000CCA0160D3264d0 <HITACHI-H109060SESUN600G-A31A-558.91GB>
          /scsi_vhci/disk@g5000cca0160d3264
          /dev/chassis/SPARC_T5-2.AK00104209//SYS/SASBP/HDD1/disk
       2. c0t5000CCA01612A4F0d0 <HITACHI-H109060SESUN600G-A31A-558.91GB>
          /scsi_vhci/disk@g5000cca01612a4f0
          /dev/chassis/SPARC_T5-2.AK00104209//SYS/SASBP/HDD2/disk
       3. c0t5000CCA016295ABCd0 <HITACHI-H109060SESUN600G-A31A-558.91GB>
          /scsi_vhci/disk@g5000cca016295abc
          /dev/chassis/SPARC_T5-2.AK00104209//SYS/SASBP/HDD3/disk
       4. c0t5000CCA016359F94d0 <HITACHI-H109060SESUN600G-A31A cyl 64986 alt 2 hd 27 sec 668>
          /scsi_vhci/disk@g5000cca016359f94
          /dev/chassis/SPARC_T5-2.AK00104209//SYS/SASBP/HDD4/disk
       5. c0t5000CCA0162A6E0Cd0 <HITACHI-H109060SESUN600G-A31A cyl 64986 alt 2 hd 27 sec 668>  solaris
          /scsi_vhci/disk@g5000cca0162a6e0c
          /dev/chassis/SPARC_T5-2.AK00104209//SYS/SASBP/HDD5/disk
Specify disk (enter its number):

During our disk I/O performance analysis, we are going to ask the following questions:

  • Which devices are targeted for I/O?
  • What is the read/write breakdown? In addition, what is the I/O access pattern for the disks? For example, what percentage of events was of a random or sequential nature?
  • How fast is disk I/O being processed on a per-device (hard disk) basis?
  • What code paths (applications) are making the disks busy?

Hadoop comes with several benchmarks that exercise the entire MapReduce system, which is useful for providing a realistic workload.

We will perform three main tasks using the benchmark:

  • Generate some random data.
  • Perform a sort on the random data.
  • Validate the results (that is, that the data has been sorted).

In the first step, we will generate random data using RandomWriter. This tool will run a MapReduce job with 10 maps per node, and each map will generate (approximately) 10 GB of random binary data (a total of 100 GB) with key and values of various sizes. Table 3 shows the configuration variables.

Note: Because of the replication factor (3), the actual amount of data that is going to be written will be three times larger.

Table 3. Configuration Variables
Name Default Value Description
test.randomwriter.maps_per_host 10 Number of maps per host
test.randomwrite.bytes_per_map 1073741824 Number of bytes written per map
test.randomwrite.min_key 10 Minimum size of the key in bytes
test.randomwrite.max_key 1000 Maximum size of the key in bytes
test.randomwrite.min_value 0 Minimum size of the value in bytes
test.randomwrite.max_value 20000 Maximum size of the value in bytes

Note: You can change these values if you like by setting the properties in the Hadoop configuration file. See RandomWriter for details.
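
For example, assuming the example driver accepts the standard Hadoop generic options, halving the amount of data written per map might look like the following sketch (536870912 bytes is 512 MB):

root@global_zone:~# # -D override assumed to be accepted by the randomwriter driver
root@global_zone:~# zlogin -l hadoop name-node hadoop jar 
/usr/local/hadoop/hadoop-examples-1.2.0.jar randomwriter 
-Dtest.randomwrite.bytes_per_map=536870912 random-data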

First, let's set the HADOOP_INSTALL environment variable in order to provide better command readability.

root@global_zone:~# export HADOOP_INSTALL=/usr/local/hadoop

Optional: You can add the following environment variable inside the .profile file to set up persistent configuration:

export HADOOP_INSTALL=/usr/local/hadoop

Change the benchmark file permission so you can execute the benchmark:

root@global_zone:~# chmod +x /usr/local/hadoop/hadoop-examples-1.2.0.jar

Then, generate the random data using the command shown in Listing 13:

root@global_zone:~# zlogin -l hadoop name-node hadoop jar 
$HADOOP_INSTALL/hadoop-examples-1.2.0.jar randomwriter random-data

Listing 13

where:

  • zlogin -l hadoop name-node specifies that the command be run as user hadoop on the name-node zone.
  • hadoop jar /usr/local/hadoop/hadoop-examples-1.2.0.jar randomwriter specifies the Hadoop .jar file.
  • random-data specifies the output directory.

The random data is written under the /user/hadoop/random-data directory by default.

You can see the generated files using the following command:

root@global_zone:~# zlogin -l hadoop data-node1 hadoop dfs -ls /user/hadoop/random-data
Oracle Corporation      SunOS 5.11      11.1    December 2012
Found 32 items
-rw-r--r--   3 hadoop supergroup          0 2013-10-07 08:26 /user/hadoop/random-data/_SUCCESS
drwxr-xr-x   - hadoop supergroup          0 2013-10-07 08:22 /user/hadoop/random-data/_logs
-rw-r--r--   3 hadoop supergroup 1077300619 2013-10-07 08:22 /user/hadoop/random-data/part-00000
-rw-r--r--   3 hadoop supergroup 1077301332 2013-10-07 08:25 /user/hadoop/random-data/part-00001
-rw-r--r--   3 hadoop supergroup 1077276968 2013-10-07 08:23 /user/hadoop/random-data/part-00002
-rw-r--r--   3 hadoop supergroup 1077293150 2013-10-07 08:23 /user/hadoop/random-data/part-00003
-rw-r--r--   3 hadoop supergroup 1077280173 2013-10-07 08:25 /user/hadoop/random-data/part-00004
-rw-r--r--   3 hadoop supergroup 1077289790 2013-10-07 08:23 /user/hadoop/random-data/part-00005
-rw-r--r--   3 hadoop supergroup 1077302329 2013-10-07 08:25 /user/hadoop/random-data/part-00006
-rw-r--r--   3 hadoop supergroup 1077299723 2013-10-07 08:23 /user/hadoop/random-data/part-00007
-rw-r--r--   3 hadoop supergroup 1077288561 2013-10-07 08:23 /user/hadoop/random-data/part-00008
-rw-r--r--   3 hadoop supergroup 1077279776 2013-10-07 08:25 

The first command that we are going to use in the disk I/O performance observation is the fsstat command, which allows us to analyze the disk I/O workload per Oracle Solaris Zone and see file system statistics for each file system.

Listing 14 shows per-zone statistics for each zone on the system, as well as a system-wide aggregate for the tmpfs and zfs file systems.

root@global_zone:~# fsstat -A -Z tmpfs zfs 10 10

 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
  126     0   128 1.57K   512  15.9K     0     0     0   127 15.9K tmpfs
    0     0     0     0     0      0     0     0     0     0     0 tmpfs:global
   20     0    20   260    80  2.55K     0     0     0    20 2.50K tmpfs:data-node2
   52     0    52   612   208  6.36K     0     0     0    52 6.50K tmpfs:data-node3
    0     0     0    40     0     70     0     0     0     0     0 tmpfs:name-node
    0     0     0    40     0     70     0     0     0     0     0 tmpfs:sec-name-node
   54     0    56   656   224  6.83K     0     0     0    55 6.88K tmpfs:data-node1
  156     0   162 1.78K     0  22.9K     0    28 3.16K  175K 5.45G zfs
    0     0     0     0     0      3     0     2   599     0     0 zfs:global
   52     0    54   511     0  4.52K     0     0     0 58.3K 1.82G zfs:data-node2
   52     0    54   512     0  8.46K     0    12 1.28K 58.3K 1.82G zfs:data-node3
    0     0     0   140     0    514     0     1     4   106 19.2K zfs:name-node
    0     0     0   140     0    510     0     0     0     0     0 zfs:sec-name-node
   52     0    54   518     0  8.95K     0    13 1.29K 58.3K 1.81G zfs:data-node1

Listing 14

We can see from the command output in Listing 14 how much data has been written on the tmpfs and zfs file systems (read bytes and write bytes).

Next, we want to pinpoint our measurements for a specific Oracle Solaris Zone.

The following example shows per-zone statistics for zones data-node1, data-node2, and data-node3 as well as a system-wide aggregate for the tmpfs and zfs file systems.

root@global_zone:~# fsstat -A -Z -z data-node1 -z data-node2 -z data-node3
tmpfs zfs 10 10

new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
  140    13   116 3.16K   512  42.7K    16   242  926K   250  342K tmpfs
   20     0    20   266    80  2.56K     0     0     0    20 2.50K tmpfs:data-node2
   57     5    46 1.35K   204  19.2K     8   115  436K   113  170K tmpfs:data-node3
   63     8    50 1.47K   228  20.8K     8   127  491K   117  170K tmpfs:data-node1
  154     0    94 7.74K     0  85.6K    40 20.9K 29.8M  127K 3.96G zfs
   52     0    32   445     0  4.25K     0     0     0 43.0K 1.34G zfs:data-node2
   52     0    32 2.98K     0  31.0K    20 6.63K 10.9M 43.1K 1.34G zfs:data-node3
   50     0    30 3.04K     0  32.9K    20 7.21K 11.9M 41.0K 1.28G zfs:data-node1

The next step in the MapReduce benchmark is to run the sort program to sort the random data we generated in a previous step. Run the command shown in Listing 15:

root@global_zone:~# zlogin -l hadoop name-node hadoop jar 
$HADOOP_INSTALL/hadoop-examples-1.2.0.jar sort random-data sorted-data

Oracle Corporation      SunOS 5.11      11.1    December 2012
Running on 3 nodes to sort from hdfs://name-node/user/hadoop/random-data into 
hdfs://name-node/user/hadoop/sorted-data with 67 reduces.
Job started: Mon Oct 07 08:34:16 IST 2013
13/10/07 08:34:16 INFO mapred.FileInputFormat: Total input paths to process : 30
13/10/07 08:34:17 INFO mapred.JobClient: Running job: job_201310062308_0015
13/10/07 08:34:18 INFO mapred.JobClient:  map 0% reduce 0%

Listing 15

where:

  • zlogin -l hadoop name-node specifies that the command be run as user hadoop on the name-node zone.
  • hadoop jar /usr/local/hadoop/hadoop-examples-1.2.0.jar sort specifies the Hadoop .jar file.
  • random-data specifies the input directory.
  • sorted-data specifies the output directory.

Next, we are going to drill down to watch individual disk read and write operations.

First, let's get the ZFS pool names using the command shown in Listing 16:

root@global_zone:~# zpool list

NAME             SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
data-node1-pool  556G  56.7G  499G  10%  1.00x  ONLINE  -
data-node2-pool  556G  56.3G  500G  10%  1.00x  ONLINE  -
data-node3-pool  556G  56.4G  500G  10%  1.00x  ONLINE  -
rpool            278G  21.7G  256G   7%  1.00x  ONLINE  -

Listing 16

In Listing 16, we can see that we have the four ZFS zpools shown in Table 4.

Table 4. Zpool Summary
Pool Name Zone Name Mount Point
rpool name-node /zones/name-node
rpool sec-name-node /zones/sec-name-node
data-node1-pool data-node1 /zones/data-node1
data-node2-pool data-node2 /zones/data-node2
data-node3-pool data-node3 /zones/data-node3

Note: Hadoop best practice is to use a separate hard disk for each DataNode. Therefore, every DataNode zone will have its own hard disk in order to provide better I/O distribution, as shown in Figure 4.
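
For reference, each of those pools was created on its own disk during the cluster setup; the pool for the data-node1 zone would have been created with a command like this sketch (the disk name matches the format output shown earlier):

root@global_zone:~# # one zpool per physical disk, per the Hadoop best practice above
root@global_zone:~# zpool create data-node1-pool c0t5000CCA0160D3264d0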

We can monitor all the ZFS zpools at the same time using the following command:

root@global_zone:~# zpool iostat -v 10

                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
data-node1-pool            31.1G   525G      2      9   124K  6.49M
  c0t5000CCA0160D3264d0    31.1G   525G      2      9   124K  6.49M
-------------------------  -----  -----  -----  -----  -----  -----
data-node2-pool            31.0G   525G      2     10  91.0K  6.50M
  c0t5000CCA01612A4F0d0    31.0G   525G      2     10  91.0K  6.50M
-------------------------  -----  -----  -----  -----  -----  -----
data-node3-pool            31.0G   525G      1      9   103K  6.49M
  c0t5000CCA016295ABCd0    31.0G   525G      1      9   103K  6.49M
-------------------------  -----  -----  -----  -----  -----  -----
rpool                      22.0G   256G     10      7  95.0K  64.1K
  c0t5001517803D013B3d0s0  22.0G   256G     10      7  95.0K  64.1K
-------------------------  -----  -----  -----  -----  -----  -----

We can also drill down, so let's monitor the disk that is associated with the data-node1 zone.

Using the ZFS pool name as a parameter, we can watch individual disk read and write operations, as shown in Listing 17.

root@global_zone:~# zpool iostat -v data-node1-pool 10

                            capacity     operations    bandwidth
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
data-node1-pool          55.8G   500G      1      8   403K  2.57M
  c0t5000CCA0160D3264d0  55.8G   500G      1      8   403K  2.57M
-----------------------  -----  -----  -----  -----  -----  -----

                            capacity     operations    bandwidth
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
data-node1-pool          50.0G   506G    157     18  25.7M   181K
  c0t5000CCA0160D3264d0  50.0G   506G    157     18  25.7M   181K
-----------------------  -----  -----  -----  -----  -----  -----

                            capacity     operations    bandwidth
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
data-node1-pool          31.0G   525G      0     21    612  72.7K
  c0t5000CCA0160D3264d0  31.0G   525G      0     21    612  72.7K
-----------------------  -----  -----  -----  -----  -----  -----

                            capacity     operations    bandwidth
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
data-node1-pool          31.0G   525G     36     17  19.4M   135K
  c0t5000CCA0160D3264d0  31.0G   525G     36     17  19.4M   135K
-----------------------  -----  -----  -----  -----  -----  -----

Listing 17

In Listing 17, we can see how much data has been written to and read from the c0t5000CCA0160D3264d0 disk, which is associated with the data-node1-pool zpool, by looking at the values shown for bandwidth read and bandwidth write.

We can also use the iostat command to see how fast the disk I/O operations are being processed on a per-device basis, as shown in Listing 18.

root@global_zone:~# iostat -xnz 5 10
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.6   10.8   47.9 3765.1  0.0  0.2    0.1   16.4   0   2 c0t5001517803D013B3d0
    1.2    7.1  365.9 2238.4  0.0  0.2    0.1   19.6   0   2 c0t5000CCA0160D3264d0
    0.9    8.5  279.4 2237.7  0.0  0.2    0.1   16.7   0   2 c0t5000CCA01612A4F0d0
    1.1    8.8  335.9 2237.2  0.0  0.2    0.1   16.3   0   2 c0t5000CCA016295ABCd0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   16.6    0.0   50.1  0.0  0.0    0.0    0.3   0   0 c0t5001517803D013B3d0
   31.0   15.6 13346.7   44.4  0.0  0.8    0.0   17.1   0  12 c0t5000CCA0160D3264d0
    0.0   15.0    0.0   47.0  0.0  0.0    0.0    1.8   0   1 c0t5000CCA016295ABCd0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   28.6    0.0  249.4  0.0  0.0    0.0    0.4   0   0 c0t5001517803D013B3d0
    0.0   15.4    0.0   43.1  0.0  0.0    0.0    1.2   0   0 c0t5000CCA0160D3264d0
   88.4   27.6 40257.3  238.9  0.0  2.5    0.0   21.9   1  32 c0t5000CCA016295ABCd0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   15.4    0.0   38.3  0.0  0.0    0.0    0.2   0   0 c0t5001517803D013B3d0
    0.0   26.4    0.0  205.7  0.0  0.1    0.0    2.6   0   1 c0t5000CCA0160D3264d0
    0.0   14.2    0.0   38.1  0.0  0.0    0.0    1.4   0   0 c0t5000CCA016295ABCd0

Listing 18

In Listing 18, we can see several disks sustaining a moderate level of reads per second (r/s), throughput ranging from 13346 KB/sec to 40257 KB/sec (kr/s), and up to about 22 milliseconds of latency (asvc_t).

Optional: You can use the -M option to display data throughput in MB/sec instead of in KB/sec.
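
For example:

root@global_zone:~# iostat -xnzM 5 10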

Note: When executed inside a non-global zone, iostat prints the disk I/O statistics for all the disks in the system. It might be puzzling to see disk activity while your zone is idle; the iostat command isn't virtualization-aware in the sense that it doesn't know whether it's running within a non-global zone.

Another useful tool is the iotop DTrace script, which displays top disk I/O events by process per Oracle Solaris Zone, as shown in Listing 19.

root@global_zone:~# /usr/dtrace/DTT/iotop -Z 10 10
Tracing... Please wait.

2013 Oct  7 08:40:19,  load: 24.38,  disk_r:      0 KB,  disk_w:   1886 KB

 ZONE    PID   PPID CMD              DEVICE  MAJ MIN  D            BYTES
    0    717      0 zpool-data-node3 sd6      73  48  W           347648
    0      5      0 zpool-rpool      sd3      73  24  W           417280
    0    896      0 zpool-data-node1 sd4      73  32  W          1195520

Listing 19

In Listing 19, you can see the zone ID (ZONE), process ID (PID), type of operation (read or write, D), and total size of the operation (BYTES).

Note: In order to get the zone ID, you can use the zoneadm list -v command.

We can measure reads and writes at the application level using the rwtop DTrace script, as shown in Listing 20. This matches reads and writes to system calls.

root@global_zone:~# /usr/dtrace/DTT/rwtop -Z 10 10
Tracing... Please wait.

2013 Oct  7 08:41:41,  load: 21.88,  app_r: 219241 KB,  app_w: 306995 KB

 ZONE    PID   PPID CMD              D            BYTES
   
    3   3792  13851 java             R          2588146
    3   3725  13851 java             R          2637274
    6   4263  13929 java             R          2662831
    3   5382  13851 java             R          2684596
    6   3951  13929 java             R          2697760
    6   5508  13929 java             R          2725159
    6   4403  13929 java             R          2760624
    6   3482  13929 java             R          2804553
    6   5721  13929 java             R          2809765
    6   3494  13929 java             R          2940105
    3   3847  13851 java             R          2961552
    6   4509  13929 java             R          2994121
    6   4109  13929 java             R          3060994
    3   3660  13851 java             R          3099585
    6   3749  13929 java             R          3161943
    3   3435  13851 java             R          3531230
    6   3663  13929 java             R          3730160
    3   5101  13851 java             R          3968720
    6   4823  13929 java             R          4030110
    3   3905  13851 java             R          4037610
    3   4654  13851 java             R          4106273
    3   5230  13851 java             R          4608408
    3   4654  13851 java             W         22666584
    3   5101  13851 java             W         23468253
    6   5258  13929 java             W         28658892
    3  13851      1 java             R         50977972
    3  13851      1 java             W         50999762
    6  13929      1 java             W         59310428
    6  13929      1 java             R         59433776
    3   5382  13851 java             W         88006519
    3   5230  13851 java             W         88185430

Listing 20

In Listing 20, we can see that the rwtop command sorts processes by the volume of their read/write activity, with the busiest processes at the bottom. In addition, it prints the zone ID (ZONE), process name (CMD), whether the operation was a read or a write (D), and how many bytes were read or written (BYTES).

Next, we can analyze the disk I/O pattern to determine whether it is random or sequential by using the DTrace iopattern script, as shown in Listing 21.

root@global_zone:~#  /usr/dtrace/DTT/iopattern 
%RAN %SEQ  COUNT    MIN    MAX    AVG     KR     KW
  69   31    236   1024 1048576 448830 103441      0
  75   25    577    512 1048576 327938 184306    479
  92    8    598    512 1048576 198293 114275   1525
  74   26    379    512 1048576 330296 121954    294
  66   34    281   1024 1048576 500550 137358      0
  80   20    346   1024 1048576 332114 112218      0
  81   19    444    512 1048576 290734 124694   1366
  65   35    337    512 1048576 490375 161139    244
  75   25    704    512 1048576 353086 241105   1642
  75   25    444   1024 1048576 386634 167642      0
  77   23    666   1024 1048576 397105 258274      0
  77   23    853    512 1048576 385908 320740    725
  77   23    525    512 1048576 345048 175352   1553
  68   32    253    512 1048576 508290 125355    228
  64   36    237   1024 1048576 501317 116027      0

Listing 21

The output in Listing 21 shows the following items:

  • The %RAN column shows the percentage of events that are of a random nature.
  • The %SEQ column shows the percentage of events that are of a sequential nature.
  • The COUNT column shows the number of I/O events.
  • The MIN column shows the minimum I/O event size (in bytes).
  • The MAX column shows the maximum I/O event size (in bytes).
  • The AVG column shows the average I/O event size (in bytes).
  • The KR column shows the total kilobytes read during the sample.
  • The KW column shows the total kilobytes written during the sample.

You can see from the script output that the I/O workload is mainly random reads (%RAN and KR).

Note: If we see that most of the I/O workload is random, we can use flash devices to accelerate our I/O performance.
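For example, with ZFS, a flash device can be attached to an existing pool as a second-level ARC (L2ARC) read cache with a single command; the pool and device names below are purely illustrative:

root@global_zone:~# zpool add dpool cache c1t0d0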

Monitoring Memory Utilization in a Virtualized Environment

In this next example, we are going to monitor the memory subsystem. Table 5 shows a summary of the commands we will use to monitor the memory utilization.

Table 5. Command Summary
Command Description
vmstat Reports how much free memory is available
prstat Reports active-process statistics
zonestat Reports active-zone statistics
zvmstat Displays vmstat-style information per zone

First, let's print how much physical memory the system has:

root@global_zone:~# prtconf -v | grep Mem
Memory size: 262144 Megabytes

We can see that we have 256 GB of memory in the system (262,144 MB / 1,024 = 256 GB).

Second, let's get more information about how the system memory is being allocated, as shown in Listing 22.

root@global_zone:~# echo ::memstat | mdb -k

Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    1473974             11515    4%
ZFS File Data             4990336             38987   15%
Anon                      2223697             17372    7%
Exec and libs                3342                26    0%
Page cache                5244141             40969   16%
Free (cachelist)            27122               211    0%
Free (freelist)          19591820            153061   58%
Total                    33554432            262144

Listing 22

The categories shown in Listing 22 are as follows:

  • Kernel. The total memory used for nonpageable kernel allocations. This is how much memory the kernel is using.
  • ZFS File Data. The total memory used by the ZFS Adaptive Replacement Cache (ARC); a way to sample this value directly is shown just after this list.
  • Anon. The amount of anonymous memory. This includes the user-process heap, the stack, copy-on-write pages, and shared memory mappings.
  • Exec and libs. The sum of the memory used for user binaries and shared libraries.
  • Page cache. The amount of unmapped page cache, that is, page cache not on the cache list. Files in /tmp are also included in this category.
  • Free (cachelist). The amount of page cache on the free list. The free list contains unmapped file pages and is typically where the majority of the file system cache resides.
  • Free (freelist). The amount of memory that is actually free. This is memory that has no association with any file or process.
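
As noted above, the ZFS File Data category is the ARC, so its current size can also be sampled directly from the ARC kstats; the value shown below is illustrative (about 38 GB, in line with Listing 22):

root@global_zone:~# kstat -p zfs:0:arcstats:size
zfs:0:arcstats:size     40879892480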

To find how much free memory is currently available in the system, we can use the vmstat command, as shown in Listing 23, and look at the value in the free column (the unit is KB) for any line other than the first line. (The first line of vmstat represents a summary of information since boot time.)

root@global_zone:~# vmstat 10
 kthr      memory            page            disk          faults      cpu
 r b w   swap     free     re   mf  pi po fr de sr s3 s4 s5 s6    in    sy    cs us sy id
 1 0 0 202844144 233325872 315 1311 0  0  0  0  1  15 19 19 18 23352 32919 46222  3  4 93
 4 0 0 110774160 142093304 347 3681 0  0  0  0  0   0 27 15 18 72275 48754 148884 1 11 88
 5 0 0 110862440 142055728 347 3671 0  0  0  0  0  19 15 22 16 72286 48292 148838 1 11 88
 3 0 0 111113056 142043608 331 3525 0  0  0  0  0   0 20 29 20 70099 49362 143970 1 11 88

Listing 23

Listing 23 shows that the system has about 135 GB of free memory (the free column is in KB). This is memory that has no association with any file or process.

To determine whether the system is running low on physical memory, look at the sr column in the vmstat output shown in Listing 23, where sr means scan rate. Under low memory conditions, Oracle Solaris begins to scan for memory pages that have not been accessed recently and moves them to the free list. With Oracle Solaris, a nonzero value of sr means the system is running low on physical memory.
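A complementary way to track the scan rate over time is the sar -g command, whose pgscan/s column reports the number of pages scanned per second; a minimal sketch (ten-second samples, five reports):

root@global_zone:~# sar -g 10 5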

You can also use vmstat -p to observe page in, page out, and page free activities, as shown in Listing 24.

root@global_zone:~# vmstat -p 10
     memory           page          executable      anonymous      filesystem 
   swap    free      re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 201865936 232351696 315 1332 0  0   1   0    0    0    0    0    0    0    0    0
 111431576 141932424 264 2779 0  0   0   0    0    0    0    0    0    0    0    0
 111179136 141752728 243 2580 0  0   0   0    0    0    0    0    0    0    0    0
 110802088 141459648 247 2656 0  0   0   0    0    0    0    0    0    0    0    0

Listing 24

From the vmstat output in Listing 24, we can see paging activity broken out for three types of memory:

  • Executable (epi, epo, epf): memory pages that are used for program and library text.
  • Anonymous (api, apo, apf): memory pages that are not associated with files; for example, anonymous memory is used for process heaps and stacks. When process pages are being swapped in or out, you will see large values in the api and apo columns.
  • File system (fpi, fpo, fpf): memory pages that are used for file I/O.

For more information about Oracle Solaris application memory management, see "Solaris Application Memory Management."

The third command we can use to see process statistics for the system and the virtual machines (non-global zones) is the prstat command, as shown in Listing 25. The -Z option adds a per-zone summary, -m reports microstate accounting information, -L reports statistics per thread (LWP), and -c prints each new report below the previous one:

root@global_zone:~# prstat -ZmLc
   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP      
 20025 hadoop    293M  253M cpu60    59    0   0:00:49  12% java/68
 20739 hadoop    285M  241M sleep    59    0   0:00:49  10% java/68
 17206 hadoop    285M  237M sleep    59    0   0:01:07  10% java/68
 17782 hadoop    281M  229M sleep    59    0   0:00:57 7.4% java/67
 17356 hadoop    289M  241M sleep    59    0   0:01:04 7.0% java/68
 11621 hadoop    166M  126M sleep    59    0   0:02:32 5.9% java/90
 20924 hadoop    289M  237M sleep    59    0   0:00:49 5.3% java/68
 17134 hadoop    289M  237M sleep    59    0   0:01:04 5.1% java/67
 17498 hadoop    297M  257M sleep    59    0   0:00:57 4.8% java/68
 17298 hadoop    297M  253M sleep    59    0   0:01:05 4.6% java/68
 17940 hadoop    297M  249M sleep    59    0   0:00:52 4.3% java/68
 18474 hadoop    289M  237M sleep    59    0   0:00:49 3.9% java/67
 19600 hadoop    297M  253M sleep    59    0   0:00:49 3.8% java/68
 20617 hadoop    297M  249M sleep    59    0   0:00:49 3.7% java/67
 17432 hadoop    297M  249M sleep    59    0   0:01:03 3.6% java/68
ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE                        
     4       74 7246M 6133M   2.3%   2:31:34  43% data-node2                  
     3       53 7442M 6248M   2.4%   2:23:01  30% data-node1                  
     5       52 7108M 6001M   2.3%   2:27:40  22% data-node3                  
     2       32  675M  468M   0.1%   0:04:36 4.0% name-node                   
     0       82  870M  414M   0.1%   1:19:20 1.0% global                      
Total: 322 processes, 8024 lwps, load averages: 15.54, 18.25, 20.09

Listing 25

From the prstat output in Listing 25, we can see the following information for each Oracle Solaris Zone:

  • The SWAP column shows the total virtual memory size for each zone.
  • The RSS column shows the total zone-resident set size (main memory usage).
  • The MEMORY column shows the main memory consumed, as a percentage of system-wide resources.
  • The CPU column shows the CPU consumed, as a percentage of system-wide resources.
  • The ZONE column shows each zone's name.

To see detailed memory statistics per zone, we can use the zonestat command. For example, we can use the zonestat -r memory command to analyze physical, virtual, and locked memory utilization, as shown in Listing 26.

root@global_zone:~# zonestat -r memory 10
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
PHYSICAL-MEMORY              SYSTEM MEMORY
mem_default                           256G
                                ZONE  USED %USED   CAP  %CAP
                             [total] 87.0G 34.0%     -     -
                            [system] 68.9G 26.9%     -     -
                          data-node2 5956M 2.27%     -     -
                          data-node1 5942M 2.26%     -     -
                          data-node3 5702M 2.17%     -     -
                           name-node  444M 0.16%     -     -
                              global  285M 0.10%     -     -
                       sec-name-node  219M 0.08%     -     -

VIRTUAL-MEMORY               SYSTEM MEMORY
vm_default                            259G
                                ZONE  USED %USED   CAP  %CAP
                             [total]  122G 47.0%     -     -
                            [system] 42.3G 16.2%     -     -
                          data-node2 27.1G 10.4%     -     -
                          data-node3 25.8G 9.96%     -     -
                          data-node1 25.4G 9.78%     -     -
                           name-node  762M 0.28%     -     -
                       sec-name-node  416M 0.15%     -     -
                              global  388M 0.14%     -     -

LOCKED-MEMORY                SYSTEM MEMORY
mem_default                           256G
                                ZONE  USED %USED   CAP  %CAP
                             [total] 15.7G 6.16%     -     -
                            [system] 15.7G 6.16%     -     -
                          data-node1     0 0.00%     -     -
                          data-node2     0 0.00%     -     -
                          data-node3     0 0.00%     -     -
                              global     0 0.00%     -     -
                           name-node     0 0.00%     -     -
                       sec-name-node     0 0.00%     -     -

Listing 26

We can see from the output shown in Listing 26 how much memory is being used for physical memory (PHYSICAL-MEMORY), virtual memory (VIRTUAL-MEMORY), which is an abstraction of physical memory, and locked memory (LOCKED-MEMORY). For example, a privileged process can lock virtual memory, which means that this memory won't be paged out.

Another command we can use to monitor memory utilization in a virtualized environment is the zvmstat command, which prints vmstat output for each zone, as shown in Listing 27.

root@global_zone:~# /usr/dtrace/DTT/Bin/zvmstat 10

    ZONE            re    mf    fr    sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
    global         273   218     0     0    0    0    0    0    0    0    0    0    0
    sec-name-node    0     0     0     0    0    0    0    0    0    0    0    0    0
    name-node        0     0     0     0    0    0    0    0    0    0    0    0    0
    data-node1       0     0     0     0    0    0    0    0    0    0    0    0    0
    data-node2       0     0     0     0    0    0    0    0    0    0    0    0    0
    data-node3       0     0     0     0    0    0    0    0    0    0    0    0    0

Listing 27

The following items are shown in Listing 27:

  • The ZONE column shows the zone name.
  • The re column shows the number of page reclaims.
  • The mf column shows the number of minor faults.
  • The fr column shows the number of pages freed.
  • The sr column shows the scan rate.
  • The epi column shows the number of executable pages paged in.
  • The epo column shows the number of executable pages paged out.
  • The epf column shows the number of executable pages freed.
  • The api column shows the number of anonymous pages paged in.
  • The apo column shows the number of anonymous pages paged out.
  • The apf column shows the number of anonymous pages freed.
  • The fpi column shows the number of file system pages paged in.
  • The fpo column shows the number of file system pages paged out.
  • The fpf column shows the number of file system pages freed.

We can also pinpoint our memory measurements to a specific zone; for example, the following command reports per-thread statistics only for processes in the name-node zone:

root@global_zone:~# prstat -mLz name-node

Next, let's drill down into the data-node1 zone to see its memory statistics. How do you examine a specific Oracle Solaris Zone to see which processes are occupying physical memory?

First, log in to the zone:

root@global_zone:~# zlogin data-node1

We can then sort the processes according to their memory resident set size (RSS), which is the portion of a process' memory that is held in RAM, using the command shown in Listing 28. The largest memory consumers are listed at the top.

root@data-node1:~# prstat -s rss 

PID    USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP
 10236 hadoop    179M  151M cpu20    52    0   0:05:26 8.5% java/132
 11365 hadoop    193M  140M sleep    59    0   0:00:55 3.0% java/63
 11381 hadoop    189M  136M cpu29    20    0   0:00:55 5.5% java/63
 11346 hadoop    189M  120M cpu44    10    0   0:00:58 5.2% java/63
 10648 hadoop    204M  120M sleep    59    0   0:02:09 0.1% java/151
 11355 hadoop    165M  112M cpu8     50    0   0:00:56 6.1% java/63
 11319 hadoop    165M  112M sleep    59    0   0:00:56 1.3% java/63
 11337 hadoop    165M  112M cpu30    50    0   0:00:52 4.2% java/63
 11327 hadoop    165M  112M cpu40    20    0   0:00:57 3.0% java/63
 11332 hadoop    165M  112M cpu122   20    0   0:00:57 3.2% java/63
 11323 hadoop    161M  108M sleep    10    0   0:00:56 5.2% java/63
 11374 hadoop    149M   96M cpu88    10    0   0:00:55 3.2% java/63
  7109 root       18M   18M sleep    59    0   0:00:11 0.0% svc.configd/17
  7073 root       35M   17M sleep    59    0   0:00:05 0.0% svc.startd/14
  9392 root       22M   16M sleep    59    0   0:00:00 0.0% fmd/11
Total: 43 processes, 1047 lwps, load averages: 31.82, 8.25, 3.02 

Listing 28

In Listing 28, we can see that the java processes owned by the hadoop user are the largest consumers of resident memory (RSS).
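To drill down one level further, we can use the pmap(1) command to break down the address space of a single process, for example, the top Java process from Listing 28:

root@data-node1:~# pmap -x 10236

The -x option reports the resident, anonymous, and locked portions of each mapping, which helps distinguish heap growth from file mappings.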

Monitoring Network Utilization in a Virtualized Environment

In our final example, we are going to monitor network performance. Table 6 shows a summary of the commands we will use to do this.

Table 6. Command Summary
Command Description
dladm Administers data links
dlstat Reports data links statistics
zonestat Reports active-zone statistics
flowadm Administers bandwidth resource control
flowstat Reports flow statistics

In the Hadoop cluster, most of the network traffic is for HDFS data replication between the DataNodes.

The questions that we will be answering are as follows:

  • Which zones are seeing the highest and lowest network traffic?
  • Which is the busiest zone in terms of the number of network connections that it handles currently?
  • How can we monitor specific network resources, for example, Oracle Solaris Zones, physical network cards, or virtual network interface cards (VNICs)?

First, let's view our network setup by using the dladm command to show how many physical network cards we have:

root@global_zone:~# dladm show-phys

LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         1000   full      ixgbe0
net2              Ethernet             unknown    0      unknown   ixgbe2
net1              Ethernet             unknown    0      unknown   ixgbe1
net3              Ethernet             unknown    0      unknown   ixgbe3
net4              Ethernet             up         10     full      usbecm0

Now, let's print the VNIC information using the dladm command, as shown in Listing 29.

root@global_zone:~# dladm show-vnic 

LINK                OVER         SPEED  MACADDRESS        MACADDRTYPE       VID
name_node1          net0         1000   2:8:20:d4:31:b3   random            0
name-node/name_node1 net0        1000   2:8:20:d4:31:b3   random            0
secondary_name1     net0         1000   2:8:20:41:78:4b   random            0
sec-name-node/secondary_name1 net0 1000 2:8:20:41:78:4b   random            0
data_node1          net0         1000   2:8:20:f:3f:f7    random            0
data-node1/data_node1 net0       1000   2:8:20:f:3f:f7    random            0
data_node2          net0         1000   2:8:20:d0:38:ea   random            0
data-node2/data_node2 net0       1000   2:8:20:d0:38:ea   random            0
data_node3          net0         1000   2:8:20:54:da:7b   random            0
data-node3/data_node3 net0       1000   2:8:20:54:da:7b   random            0
sec-name-node/net0  net0         1000   2:8:20:da:2e:5b   random            0
name-node/net0      net0         1000   2:8:20:30:cc:45   random            0
data-node1/net0     net0         1000   2:8:20:8b:7b:f6   random            0
data-node2/net0     net0         1000   2:8:20:6a:3f:38   random            0
data-node3/net0     net0         1000   2:8:20:5d:7:8e    random            0

Listing 29

As shown in Listing 29, dladm prints the associated physical interface (OVER), the speed (SPEED), the MAC address (MACADDRESS), and the VLAN ID (VID).

As we can see, we have five VNICs, one for each zone, as shown in Figure 5.

Figure 5. VNICs
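For reference, a VNIC such as these is created from the global zone with the dladm create-vnic command; for example, the data_node1 VNIC shown in Listing 29 could have been created as follows (shown here only to illustrate the syntax):

root@global_zone:~# dladm create-vnic -l net0 data_node1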

We can use the zonestat command with the -r network and -x options to get extended networking information and pinpoint our measurements to specific Oracle Solaris Zones, for example, monitoring the network traffic on the three DataNode zones (data-node1, data-node2, and data-node3), as shown in Listing 30.

root@global_zone:~# zonestat -z data-node1 -z data-node2 -z data-node3 -r network -x 10
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
NETWORK-DEVICE                  SPEED        STATE        TYPE
net0                         1000mbps           up        phys
              ZONE            LINK TOBYTE  MAXBW %MAXBW PRBYTE %PRBYTE POBYTE %POBYTE
           [total]            net0   269M      -      -    198   0.00%  18.4E    100%
            global            net0   2642 13474770085G      -    198   0.00%    284   0.00%
        data-node1 data-node1/net0  93.6M      -      -      0   0.00%  18.4E    100%
        data-node3 data-node3/net0  91.3M      -      -      0   0.00%  18.4E    100%
        data-node2 data-node2/net0  84.4M      -      -      0   0.00%  18.4E    100%
         name-node  name-node/net0   304K      -      -      0   0.00%  18.4E    100%
        data-node3      data_node3   2340      -      -      0   0.00%      0   0.00%
     sec-name-node sec-name-node/net0   2340      -      -      0   0.00%      0   0.00%
        data-node2      data_node2   2280      -      -      0   0.00%      0   0.00%
         name-node      name_node1   2280      -      -      0   0.00%      0   0.00%
        data-node1      data_node1   2220      -      -      0   0.00%      0   0.00%
     sec-name-node secondary_name1   2220      -      -      0   0.00%      0   0.00%

Listing 30

The command output in Listing 30 shows the following:

  • The name of a data link (LINK)
  • The number of bytes transmitted and received by data links or virtual links (TOBYTE)
  • The maximum bandwidth configured on a data link (MAXBW)
  • The sum of all transmitted and received bytes as a percentage of the configured maximum bandwidth (%MAXBW)
  • The number of received bytes that consume the physical bandwidth (PRBYTE)
  • The percentage of available physical bandwidth used to receive PRBYTE (%PRBYTE)
  • The number of transmitted bytes that consume the physical bandwidth (POBYTE)
  • The percentage of available physical bandwidth used to transmit POBYTE (%POBYTE)

We can use the dlstat command to monitor the three VNICs that are associated with the three DataNodes, as shown in Listing 31.

root@global_zone:~# dlstat -z data-node1,data-node2,data-node3 -i 10

           LINK    IPKTS   RBYTES    OPKTS   OBYTES
data-node1/data_node1   25.89K    1.57M        0        0
data-node2/data_node2   25.89K    1.57M        0        0
data-node3/data_node3   25.89K    1.57M        0        0
data-node1/net0   49.32M   54.99G   15.24M   52.59G
data-node2/net0   49.88M   55.80G   15.66M   54.16G
data-node3/net0   49.29M   54.60G   14.98M   52.95G
data-node1/data_node1       27    1.62K        0        0
data-node2/data_node2       27    1.62K        0        0
data-node3/data_node3       27    1.62K        0        0
data-node1/net0   50.69K   55.93M   16.06K   56.24M
data-node2/net0   49.20K   56.13M   16.32K   50.86M
data-node3/net0   46.87K   51.50M   13.90K   50.45M

Listing 31

As you can see in Listing 31, the dlstat command displays the following information:

  • The number of inbound packets (IPKTS)
  • How many bytes have been received (RBYTES)
  • The number of outbound packets (OPKTS)
  • How many bytes have been transmitted (OBYTES)

We can drill down to a specific network resource; for example, we can monitor the physical network interface (net0):

root@global_zone:~# dlstat net0 -i 10

           LINK    IPKTS     RBYTES    OPKTS   OBYTES
           net0   39.41K    2.63M    8.16K    1.44M
           net0       45    2.74K        1      198
           net0       43    2.61K        1      150
           net0       41    2.47K        1      150
^C

Note: To stop the dlstat command, press Ctrl-c.

We can also monitor only the VNIC that is associated with the data-node1 zone:

root@global_zone:~# dlstat data_node1 -i 10

     LINK         IPKTS   RBYTES    OPKTS   OBYTES
     data_node1   26.30K    1.59M        0        0
     data_node1       42    2.70K        0        0
     data_node1       43    2.58K        0        0
     data_node1       31    1.86K        0        0
^C

Note: When run from a non-global zone, dlstat displays statistics only for the links in that zone. A non-global zone cannot see links in other zones.
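For example, running dlstat inside the data-node1 zone reports statistics only for that zone's links:

root@data-node1:~# dlstat -i 10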

Validating the Data Set

As described in the "Monitoring Disk I/O Activity in a Virtualized Environment" section, the MapReduce benchmark has three phases. The final step in the MapReduce benchmark is to validate whether the data is sorted.

First, change the file permissions in order to execute the .jar file:

root@global_zone:~# chmod +x /usr/local/hadoop/hadoop-test-1.2.0.jar

Then, to validate whether the sorted data is accurate, we will run the testmapredsort program, which performs a series of checks on the unsorted and sorted data.

Let's determine whether the sorted data is really sorted by running the following command:

root@global_zone:~# zlogin -l hadoop name-node hadoop jar \
/usr/local/hadoop/hadoop-test-1.2.0.jar testmapredsort -sortInput \
random-data -sortOutput sorted-data

where:

  • zlogin -l hadoop name-node specifies that the command be run as user hadoop on the name-node zone.
  • hadoop jar /usr/local/hadoop/hadoop-test-1.2.0.jar testmapredsort specifies the Hadoop .jar file.
  • -sortInput random-data specifies the input directory for the random data.
  • -sortOutput sorted-data specifies the output directory for the sorted data.

If the data is validated as being sorted, the following message appears:

SUCCESS! Validated the MapReduce framework's 'sort' successfully.

Monitoring Network Traffic Between Two Systems on a Specific TCP or UDP Network Port

We can also monitor our network traffic on a specific TCP or UDP port. This is useful if we want to monitor how the data replication between two Hadoop clusters is progressing, for example, data being replicated from Hadoop cluster A to Hadoop cluster B, which is located in a different data center, as shown in Figure 6.

Figure 6. Replicating data between Hadoop clusters

In order to replicate data between Hadoop clusters, we can use the distcp command, a built-in Hadoop command for copying data between clusters:

root@global_zone:~# zlogin -l hadoop name-node hadoop distcp \
hdfs://name-node:8020/benchmarks hdfs://name-node2:8020/backup

Note: name-node is the host name of the NameNode on the first cluster, and name-node2 is the host name of the NameNode on the second cluster.

If we want to monitor the network traffic between those clusters on the specific TCP port (8020) that is used by distcp, we can use a flow.

A flow is a sophisticated quality of service (QoS) mechanism built into the new Oracle Solaris 11 network virtualization architecture; it allows us to measure or limit the network bandwidth consumed by a specific network port on a specific network interface. In addition, flows can be created, modified, and removed in both global and non-global zones.
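For example, once a flow exists (we will create the distcp-flow flow below), we can cap its bandwidth with a single command; the 100 Mbps limit shown here is illustrative:

root@name-node:~# flowadm set-flowprop -p maxbw=100M distcp-flow

The cap can be removed later with the flowadm reset-flowprop -p maxbw distcp-flow command.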

In the following example, we will set up a flow that is associated with TCP network port 8020 on the name_node1 VNIC.

First, log in to the name-node zone:

root@global_zone:~# zlogin name-node

Now, create the flow on the name_node1 VNIC:

root@name-node:~# flowadm add-flow -l name_node1 -a transport=TCP,local_port=8020 distcp-flow

Note: You don't need to reboot the zone in order to enable or disable the flow. This is very useful when you need to debug network performance issues on a production system!

Verify the flow creation, as shown in Listing 32:

root@name-node:~# flowadm show-flow

FLOW        LINK                IPADDR                   PROTO  LPORT   RPORT   DSFLD
distcp-flow name_node1          --                       tcp    8020    --      -

Listing 32

In Listing 32, you can see the distcp-flow flow for the associated VNIC (name_node1) and the TCP port number (8020).

In the name-node zone, we can use the flowstat(1M) command, which reports runtime statistics about user-defined flows. The command can be restricted to receive-side or transmit-side statistics (using the -r or -t option, respectively), and it can display statistics for all flows on a specified link or for a single specified flow.
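For example, to watch only receive-side statistics for the flows on the name_node1 VNIC at one-second intervals, we could run the following (a sketch of the option syntax, not output captured from the benchmark system):

root@name-node:~# flowstat -r -l name_node1 -i 1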

To report the bandwidth on the distcp-flow flow, which monitors TCP port 8020, use the command shown in Listing 33:

root@name-node:~# flowstat -i 1

FLOW             IPKTS   RBYTES  IDROPS    OPKTS   OBYTES  ODROPS
distcp-flow     24.72M   37.17G       0    3.09M  204.08M       0
distcp-flow    749.28K    1.13G       0   93.73K    6.19M       0
distcp-flow    783.68K    1.18G       0   98.03K    6.47M       0
distcp-flow    668.83K    1.01G       0   83.66K    5.52M       0
distcp-flow    783.87K    1.18G       0   98.07K    6.47M       0
distcp-flow    775.34K    1.17G       0   96.98K    6.40M       0
distcp-flow    777.15K    1.17G       0   97.21K    6.42M       0
^C

Listing 33

Note: To stop the flowstat command, press Ctrl-c.

As you can see in Listing 33, the flowstat command displays the network statistics for TCP port 8020. In addition, enabling the flow on the name_node1 network interface did not degrade the network performance!

Optional: Once you finish your network measurements, you can remove the flow:

root@name-node:~# flowadm remove-flow distcp-flow

Then, verify the flow removal:

root@name-node:~# flowadm show-flow

Note: You don't need to reboot the zone in order to remove the flow; this is very useful in production environments.

For more examples of network performance monitoring, see "Advanced Network Monitoring Using Oracle Solaris 11 Tools."

Cleanup Tasks

(Optional) When you've finished benchmarking, you can delete all the generated files from HDFS using the following commands:

root@global_zone:~# zlogin -l hadoop name-node hadoop dfs -rmr hdfs://name-node/user/hadoop/random-data

root@global_zone:~# zlogin -l hadoop name-node hadoop dfs -rmr hdfs://name-node/user/hadoop/sorted-data

Conclusion

In this article, we saw how we can leverage the new Oracle Solaris 11 performance analysis tools to observe and monitor a virtualized environment that hosts a Hadoop cluster.


About the Author

Orgad Kimchi is a principal software engineer on the ISV Engineering team at Oracle (formerly Sun Microsystems). For 6 years he has specialized in virtualization, big data, and cloud computing technologies.

Revision 1.0, 12/16/2013
