How to Set Up a Hadoop Cluster Using Oracle Solaris

Hands-On Labs of the System Admin and Developer Community of OTN

by Orgad Kimchi

How to set up a Hadoop cluster using Oracle Solaris Zones, ZFS, and network virtualization technologies.


Published October 2013


Table of Contents
Lab Introduction
Prerequisites
System Requirements
Summary of Lab Exercises
The Case for Hadoop
Exercise 1: Install Hadoop
Exercise 2: Edit the Hadoop Configuration Files
Exercise 3: Configure the Network Time Protocol
Exercise 4: Create the Virtual Network Interfaces
Exercise 5: Create the NameNode and Secondary NameNode Zones
Exercise 6: Set Up the DataNode Zones
Exercise 7: Configure the NameNode
Exercise 8: Set Up SSH
Exercise 9: Format HDFS from the NameNode
Exercise 10: Start the Hadoop Cluster
Exercise 11: Run a MapReduce Job
Exercise 12: Use ZFS Encryption
Exercise 13: Use Oracle Solaris DTrace for Performance Monitoring
Summary
See Also
About the Author

Expected duration: 180 minutes

Lab Introduction

This hands-on lab presents exercises that demonstrate how to set up an Apache Hadoop cluster using Oracle Solaris 11 technologies such as Oracle Solaris Zones, ZFS, and network virtualization. Key topics include the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce programming model.

We will also cover the Hadoop installation process and the cluster building blocks: NameNode, a secondary NameNode, and DataNodes. In addition, you will see how you can combine the Oracle Solaris 11 technologies for better scalability and data security, and you will learn how to load data into the Hadoop cluster and run a MapReduce job.

Prerequisites

This hands-on lab is appropriate for system administrators who will be setting up or maintaining a Hadoop cluster in production or development environments. Basic Linux or Oracle Solaris system administration experience is a prerequisite. Prior knowledge of Hadoop is not required.

System Requirements

This hands-on lab is run on Oracle Solaris 11 in Oracle VM VirtualBox. The lab is self-contained: everything you need is provided in the Oracle VM VirtualBox instance.

For those attending the lab at Oracle OpenWorld, your laptops are already preloaded with the correct Oracle VM VirtualBox image.

If you want to try this lab outside of Oracle OpenWorld, you will need an Oracle Solaris 11 system. Do the following to set up your machine:

  1. If you do not have Oracle Solaris 11, download it here.
  2. Download the Oracle Solaris 11.1 VirtualBox Template (file size 1.7GB).
  3. Install the template as described here. (Note: On step 4 of Exercise 2 for installing the template, set the RAM size to 4 GB in order to get good performance.)

Notes for Oracle OpenWorld Attendees

  • Each attendee will have his or her own laptop for the lab.
  • The login name and password for this lab are provided in a "one pager."
  • Oracle Solaris 11 uses the GNOME desktop. If you have used the desktops on Linux or other UNIX operating systems, the interface should be familiar. Here are some quick basics in case the interface is new for you.

    • In order to open a terminal window in the GNOME desktop system, right-click the background of the desktop, and select Open Terminal in the pop-up menu.
    • The following source code editors are provided on the lab machines: vi (type vi in a terminal window) and emacs (type emacs in a terminal window).

Summary of Lab Exercises

This hands-on lab consists of 13 exercises covering various Oracle Solaris and Apache Hadoop technologies:

  1. Install Hadoop.
  2. Edit the Hadoop configuration files.
  3. Configure the Network Time Protocol.
  4. Create the virtual network interfaces (VNICs).
  5. Create the NameNode and the secondary NameNode zones.
  6. Set up the DataNode zones.
  7. Configure the NameNode.
  8. Set up SSH.
  9. Format HDFS from the NameNode.
  10. Start the Hadoop cluster.
  11. Run a MapReduce job.
  12. Secure data at rest using ZFS encryption.
  13. Use Oracle Solaris DTrace for performance monitoring.

The Case for Hadoop

The Apache Hadoop software is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

To store data, Hadoop uses the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data and is suitable for applications that have large data sets.

For more information about Hadoop and HDFS, see http://hadoop.apache.org/.

The Hadoop cluster building blocks are as follows:

  • NameNode: The centerpiece of HDFS, which stores file system metadata, directs the slave DataNode daemons to perform the low-level I/O tasks, and also runs the JobTracker process.
  • Secondary NameNode: Performs periodic checkpoints of the NameNode metadata by merging the NameNode transaction log into the file system image.
  • DataNodes: Nodes that store the data in HDFS, which are also known as slaves and run the TaskTracker process.

In the example presented in this lab, all the Hadoop cluster building blocks will be installed using the Oracle Solaris Zones, ZFS, and network virtualization technologies. Figure 1 shows the architecture:

Figure 1

Exercise 1: Install Hadoop

  1. In Oracle VM VirtualBox, enable a bidirectional "shared clipboard" between the host and the guest in order to enable copying and pasting text from this file.

    Figure 2

  2. Open a terminal window by right-clicking any point in the background of the desktop and selecting Open Terminal in the pop-up menu.

    Figure 3

  3. Next, switch to the root user using the following command.

    Note: For Oracle OpenWorld attendees, the root password has been provided in the one-pager associated with this lab. For those running this lab outside of Oracle OpenWorld, enter the root password you entered when you followed the steps in the "System Requirements" section.

    oracle@global_zone:~$ su -
    Password:
    Oracle Corporation      SunOS 5.11      11.1    September 2012
    
  4. Set up the virtual network interface card (VNIC) in order to enable network access to the global zone from the non-global zones.

    Note: Oracle OpenWorld attendees can skip this step (because the preloaded Oracle VM VirtualBox image already provides configured VNICs) and go directly to step 16, "Browse the lab supplement materials."

    root@global_zone:~# dladm create-vnic -l net0 vnic0
    root@global_zone:~# ipadm create-ip vnic0
    root@global_zone:~# ipadm create-addr -T static -a local=192.168.1.100/24 vnic0/addr
    
  5. Verify the VNIC creation:

    root@global_zone:~# ipadm show-addr vnic0
    ADDROBJ           TYPE     STATE        ADDR
    vnic0/addr        static   ok           192.168.1.100/24
    
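    If you make a mistake and need to redo this step, you can tear the configuration down and start over. A minimal sketch:

    root@global_zone:~# ipadm delete-addr vnic0/addr
    root@global_zone:~# ipadm delete-ip vnic0
    root@global_zone:~# dladm delete-vnic vnic0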
  6. Create the hadoophol directory; we will use it to store the lab supplement materials, such as scripts and input files.

    root@global_zone:~# mkdir -p /usr/local/hadoophol
    
  7. Create the Bin directory; we will put the Hadoop binary file there.

    root@global_zone:~# mkdir /usr/local/hadoophol/Bin
    
  8. In this lab, we will use the Apache Hadoop 1.2.1 release (released 23-Jul-2013). You can download the Hadoop binary file using a web browser: open the Firefox web browser from the desktop and download the file.

    Figure 4

  9. Copy the Hadoop tarball to /usr/local/hadoophol/Bin.

    root@global_zone:~# cp /export/home/oracle/Downloads/hadoop-1.2.1.tar.gz /usr/local/hadoophol/Bin/
    

    Note: By default, the file is downloaded to the user's Downloads directory.

  10. Next, we are going to create the lab scripts, so create a directory for them:

    root@global_zone:~# mkdir /usr/local/hadoophol/Scripts
    
  11. Create the createzone script using your favorite editor, as shown in Listing 1. We will use this script to set up the Oracle Solaris Zones.

    root@global_zone:~# vi /usr/local/hadoophol/Scripts/createzone
    

    Listing 1
    #!/bin/ksh
    
    # FILENAME:    createzone
    # Create a zone with a VNIC
    # Usage:
    # createzone <zone name> <VNIC>
    
    if [ $# != 2 ]
    then
        echo "Usage: createzone <zone name> <VNIC>"
        exit 1
    fi
    
    ZONENAME=$1
    VNICNAME=$2
    
    zonecfg -z $ZONENAME > /dev/null 2>&1 << EOF
    create
    set autoboot=true
    set limitpriv=default,dtrace_proc,dtrace_user,sys_time
    set zonepath=/zones/$ZONENAME
    add fs
    set dir=/usr/local
    set special=/usr/local
    set type=lofs
    set options=[ro,nodevices]
    end
    add net
    set physical=$VNICNAME
    end
    verify
    exit
    EOF
    if [ $? -eq 0 ] ; then
        echo "Successfully created the $ZONENAME zone"
    else
        echo "Error: unable to create the $ZONENAME zone"
        exit 1
    fi
    
  12. Create the verifycluster script using your favorite editor, as shown in Listing 2. We will use this script to verify the Hadoop cluster setup.

    root@global_zone:~# vi /usr/local/hadoophol/Scripts/verifycluster
    

    Listing 2
    #!/bin/ksh
    
    # FILENAME:   verifycluster
    # Verify the hadoop cluster configuration
    # Usage:
    # verifycluster
    
    RET=1

    NODES="name-node sec-name-node data-node1 data-node2 data-node3"

    for transaction in _; do

        # Verify that /usr/local is mounted inside every zone.
        for i in $NODES
        do
            zlogin $i ls /usr/local > /dev/null 2>&1 || break 2
        done

        # Verify that every zone can reach every other zone by name.
        for target in $NODES
        do
            for i in $NODES
            do
                zlogin $i ping $target > /dev/null 2>&1 || break 3
            done
        done

        RET=0
    done

    if [ $RET -eq 0 ] ; then
        echo "The cluster is verified"
    else
        echo "Error: unable to verify the cluster"
    fi
    exit $RET
    
  13. Create the Doc directory; we will put the Hadoop input files there.

    root@global_zone:~# mkdir /usr/local/hadoophol/Doc
    
  14. Download the following eBook from Project Gutenberg as a plain-text file with UTF-8 encoding: The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson.
  15. Copy the downloaded file (pg20417.txt) into the /usr/local/hadoophol/Doc directory.

    root@global_zone:~# cp ~oracle/Downloads/pg20417.txt /usr/local/hadoophol/Doc/
    
  16. Browse the lab supplement materials by typing the following on the command line:

    root@global_zone:~# cd /usr/local/hadoophol  
    
  17. On the command line, type ls -l to see the content of the directory:

    root@global_zone:~# ls -l
    total 9
    drwxr-xr-x   2 root     root           2 Jul  8 15:11 Bin
    drwxr-xr-x   2 root     root           2 Jul  8 15:11 Doc
    drwxr-xr-x   2 root     root           2 Jul  8 15:12 Scripts
    

    You can see the following directory structure:

    • Bin—The Hadoop binary location
    • Doc—The Hadoop input files
    • Scripts—The lab scripts
  18. Copy the Apache Hadoop 1.2.1 tarball into /usr/local:

    root@global_zone:~# cp /usr/local/hadoophol/Bin/hadoop-1.2.1.tar.gz /usr/local
    
  19. Unpack the tarball:

    root@global_zone:~# cd /usr/local
    root@global_zone:~# tar -xzvf /usr/local/hadoop-1.2.1.tar.gz
    
  20. Create the hadoop group:

    root@global_zone:~# groupadd hadoop
    
  21. Add the hadoop user:

    root@global_zone:~# useradd -m -g hadoop hadoop
    
  22. Set the hadoop user's password. You can use any password you want, but be sure to remember it.

    root@global_zone:~# passwd hadoop
    
  23. Create a symlink for the Hadoop binaries:

    root@global_zone:~# ln -s /usr/local/hadoop-1.2.1 /usr/local/hadoop
    
  24. Give ownership to the hadoop user:

    root@global_zone:~# chown -R hadoop:hadoop /usr/local/hadoop*
    
  25. Change the permissions:

    root@global_zone:~# chmod -R 755 /usr/local/hadoop*
    
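    As a quick sanity check before moving on, confirm that the symlink and the ownership are in place; both entries should be owned by hadoop:hadoop, and /usr/local/hadoop should point to /usr/local/hadoop-1.2.1:

    root@global_zone:~# ls -ld /usr/local/hadoop /usr/local/hadoop-1.2.1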

Exercise 2: Edit the Hadoop Configuration Files

In this exercise, we will edit the Hadoop configuration files, which are shown in Table 1:

Table 1. Hadoop Configuration Files
File Name Description
hadoop-env.sh Specifies environment variable settings used by Hadoop.
core-site.xml Specifies parameters relevant to all Hadoop daemons and clients.
hdfs-site.xml Specifies parameters used by the HDFS daemons (for example, the NameNode and the DataNodes).
mapred-site.xml Specifies parameters used by the MapReduce daemons and clients.
masters Contains a list of machines that run the Secondary NameNode.
slaves Contains a list of machine names that run the DataNode and TaskTracker pair of daemons.

To learn more about how the Hadoop framework is controlled by these configuration files, see http://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/conf/Configuration.html.

  1. Run the following command to change to the conf directory:

    root@global_zone:~# cd /usr/local/hadoop/conf
    
  2. Run the following commands to change the hadoop-env.sh script:

    Note: The cluster configuration shares the Hadoop directory structure (/usr/local/hadoop) across the zones as a read-only file system, so every Hadoop cluster node needs a writable log directory of its own. We will create the /var/log/hadoop directory inside each Oracle Solaris Zone for this purpose.

    root@global_zone:~# echo "export JAVA_HOME=/usr/java" >>  hadoop-env.sh
    root@global_zone:~# echo "export HADOOP_LOG_DIR=/var/log/hadoop" >> hadoop-env.sh
    
  3. Edit the masters file to replace the localhost entry with the line shown in Listing 3:

    root@global_zone:~# vi masters
    

    Listing 3
    sec-name-node
    
  4. Edit the slaves file to replace the localhost entry with the lines shown in Listing 4:

    root@global_zone:~# vi slaves
    

    Listing 4
    data-node1
    data-node2
    data-node3
    
  5. Edit the core-site.xml file so it looks like Listing 5:

    root@global_zone:~# vi core-site.xml
    

    Note: fs.default.name is the URI that describes the NameNode address (protocol specifier, hostname, and port) for the cluster. Each DataNode instance will register with this NameNode and make its data available through it. In addition, the DataNodes send heartbeats to the NameNode to confirm that each DataNode is operating and the block replicas it hosts are available.

    Listing 5
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://name-node</value>
         </property>
    </configuration>
    
  6. Edit the hdfs-site.xml file so it looks like Listing 6:

    root@global_zone:~# vi hdfs-site.xml
    

    Notes:

    • dfs.data.dir is the path on the local file system in which the DataNode instance should store its data.
    • dfs.name.dir is the path on the local file system of the NameNode instance where the NameNode metadata is stored. It is only used by the NameNode instance to find its information.
    • dfs.replication is the default replication factor for each block of data in the file system. (For a production cluster, this should usually be left at its default value of 3.)
    Listing 6
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
    <property>
           <name>dfs.data.dir</name>
           <value>/hdfs/data/</value>
    </property>
    <property>
          <name>dfs.name.dir</name>
          <value>/hdfs/name/</value>
    </property>
      <property>
          <name>dfs.replication</name>
          <value>3</value>
      </property>
    </configuration>
    
  7. Edit the mapred-site.xml file so it looks like Listing 7:

    root@global_zone:~# vi mapred-site.xml
    

    Note: mapred.job.tracker is a host:port string specifying the JobTracker's RPC address.

    Listing 7
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
             <name>mapred.job.tracker</name>
             <value>name-node:8021</value>
         </property>
    </configuration>
    
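    Now that all the configuration files are edited, it is worth checking that the XML files are well formed; a malformed file would otherwise surface only as a daemon startup failure later. A minimal sketch, assuming the xmllint utility (part of the libxml2 package) is available on your system:

    root@global_zone:~# xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml

    If the files are well formed, the command prints nothing.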

Exercise 3: Configure the Network Time Protocol

We should ensure that the system clocks of the Hadoop zones are synchronized, which we do by using the Network Time Protocol (NTP).

Note: It is best to select an NTP server that can be a dedicated time synchronization source so that other services are not negatively affected if the machine is brought down for planned maintenance.

In the following example, the global zone is configured as an NTP server.

  1. Configure an NTP server:

    root@global_zone:~# cd /etc/inet
    root@global_zone:~# cp ntp.server ntp.conf
    root@global_zone:~# chmod +w /etc/inet/ntp.conf
    root@global_zone:~# touch /var/ntp/ntp.drift
    
  2. Edit the NTP server configuration file, as shown in Listing 8:

    root@global_zone:~# vi /etc/inet/ntp.conf
    

    Listing 8
    server 127.127.1.0 prefer
    broadcast 224.0.1.1 ttl 4
    enable auth monitor
    driftfile /var/ntp/ntp.drift
    statsdir /var/ntp/ntpstats/
    filegen peerstats file peerstats type day enable
    filegen loopstats file loopstats type day enable
    filegen clockstats file clockstats type day enable
    keys /etc/inet/ntp.keys
    trustedkey 0
    requestkey 0
    controlkey 0
    
  3. Enable the NTP server service:

    root@global_zone:~# svcadm enable ntp
    
  4. Verify that the NTP server is online by using the following command:

    root@global_zone:~# svcs -a | grep ntp
    online         16:04:15 svc:/network/ntp:default
    

Exercise 4: Create the Virtual Network Interfaces

Concept Break: Oracle Solaris 11 Network Virtualization Technology

Oracle Solaris provides a reliable, secure, and scalable infrastructure to meet the growing needs of data center implementations. Its powerful network stack architecture, also known as Project Crossbow, provides the following:

  • Network virtualization with virtual NICs (VNICs) and virtual switching
  • Tight integration with Oracle Solaris Zones and Oracle Solaris 10 Zones
  • Network resource management, which provides an efficient and easy way to manage integrated QoS to enforce bandwidth limits on VNICs and traffic flows
  • An optimized network stack that reacts to network load levels
  • The ability to build a "data center in a box"

Oracle Solaris Zones on the same system can benefit from very high network I/O throughput (up to four times faster than, for example, a 1 Gb physical network connection) with very low latency. For a Hadoop cluster, this means that the DataNodes can replicate the HDFS blocks much faster.

For more information about network virtualization benchmarks, see "How to Control Your Application's Network Bandwidth."

  1. Create a series of virtual network interfaces (VNICs) for the different zones:

    root@global_zone:~# dladm create-vnic -l net0 name_node1
    root@global_zone:~# dladm create-vnic -l net0 secondary_name1
    root@global_zone:~# dladm create-vnic -l net0 data_node1
    root@global_zone:~# dladm create-vnic -l net0 data_node2
    root@global_zone:~# dladm create-vnic -l net0 data_node3
    
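    Equivalently, because the five commands differ only in the VNIC name, you can create all the VNICs with a small shell loop:

    root@global_zone:~# for v in name_node1 secondary_name1 data_node1 data_node2 data_node3; do
    > dladm create-vnic -l net0 $v
    > done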
  2. Verify the VNIC creation:

    root@global_zone:~# dladm show-vnic
    LINK                OVER         SPEED  MACADDRESS        MACADDRTYPE       VID
    name_node1          net0         1000   2:8:20:c6:3e:f1   random            0
    secondary_name1     net0         1000   2:8:20:b9:80:45   random            0
    data_node1          net0         1000   2:8:20:30:1c:3a   random            0
    data_node2          net0         1000   2:8:20:a8:b1:16   random            0
    data_node3          net0         1000   2:8:20:df:89:81   random            0
    

    We can see that we have five VNICs now. Figure 5 shows the architecture layout:

    Figure 5

Exercise 5: Create the NameNode and Secondary NameNode Zones

Concept Break: Oracle Solaris Zones

Oracle Solaris Zones let you isolate one application from others on the same OS, allowing you to create an isolated environment in which users can log in and do what they want from inside an Oracle Solaris Zone without affecting anything outside that zone. In addition, Oracle Solaris Zones are secure from external attacks and internal malicious programs. Each Oracle Solaris Zone contains a complete resource-controlled environment that allows you to allocate resources such as CPU, memory, networking, and storage.

If you are the administrator who owns the system, you can choose to closely manage all the Oracle Solaris Zones or you can assign rights to other administrators for specific Oracle Solaris Zones. This flexibility lets you tailor an entire computing environment to the needs of a particular application, all within the same OS.

For more information about Oracle Solaris Zones, see "How to Get Started Creating Oracle Solaris Zones in Oracle Solaris 11."

All the Hadoop nodes for this lab will be installed using Oracle Solaris Zones.

  1. If you don't already have a ZFS file system for the zones, create one by running the following command:

    root@global_zone:~# zfs create -o mountpoint=/zones rpool/zones
    
  2. Verify the ZFS file system creation:

    root@global_zone:~# zfs list rpool/zones
    NAME         USED  AVAIL  REFER  MOUNTPOINT
    rpool/zones   31K  51.4G    31K  /zones
    
  3. Create the name-node zone:

    root@global_zone:~# zonecfg -z name-node
    Use 'create' to begin configuring a new zone.
    zonecfg:name-node> create
    create: Using system default template 'SYSdefault'
    zonecfg:name-node> set autoboot=true
    zonecfg:name-node> set limitpriv=default,dtrace_proc,dtrace_user,sys_time
    zonecfg:name-node> set zonepath=/zones/name-node
    zonecfg:name-node> add fs
    zonecfg:name-node:fs> set dir=/usr/local
    zonecfg:name-node:fs> set special=/usr/local
    zonecfg:name-node:fs> set type=lofs
    zonecfg:name-node:fs> set options=[ro,nodevices]
    zonecfg:name-node:fs> end
    zonecfg:name-node> add net
    zonecfg:name-node:net> set physical=name_node1
    zonecfg:name-node:net> end
    zonecfg:name-node> verify
    zonecfg:name-node> exit 
    

    (Optional) You can create the name-node zone using the following script, which will create the zone configuration file. The script takes the zone name and the VNIC name as arguments: createzone <zone name> <VNIC name>.

    root@global_zone:~# /usr/local/hadoophol/Scripts/createzone name-node name_node1
    
  4. Create the sec-name-node zone:

    root@global_zone:~# zonecfg -z sec-name-node
    Use 'create' to begin configuring a new zone.
    zonecfg:sec-name-node> create
    create: Using system default template 'SYSdefault'
    zonecfg:sec-name-node> set autoboot=true
    zonecfg:sec-name-node> set limitpriv=default,dtrace_proc,dtrace_user,sys_time
    zonecfg:sec-name-node> set zonepath=/zones/sec-name-node
    zonecfg:sec-name-node> add fs
    zonecfg:sec-name-node:fs> set dir=/usr/local
    zonecfg:sec-name-node:fs> set special=/usr/local
    zonecfg:sec-name-node:fs> set type=lofs
    zonecfg:sec-name-node:fs> set options=[ro,nodevices]
    zonecfg:sec-name-node:fs> end
    zonecfg:sec-name-node> add net
    zonecfg:sec-name-node:net> set physical=secondary_name1
    zonecfg:sec-name-node:net> end
    zonecfg:sec-name-node> verify
    zonecfg:sec-name-node> exit
    

    (Optional) You can create the sec-name-node zone using the following script, which will create the zone configuration file. The script takes the zone name and the VNIC name as arguments: createzone <zone name> <VNIC name>.

    root@global_zone:~: /usr/local/hadoophol/Scripts/createzone sec-name-node secondary_name1
    

Exercise 6: Set Up the DataNode Zones

In this exercise, we will leverage the integration between Oracle Solaris Zones virtualization technology and the ZFS file system that is built into Oracle Solaris.

Table 2 shows a summary of the Hadoop zones configuration we will create:

Table 2. Zone Summary
Function Zone Name ZFS Mount Point VNIC Name IP Address
NameNode name-node /zones/name-node name_node1 192.168.1.1
Secondary NameNode sec-name-node /zones/sec-name-node secondary_name1 192.168.1.2
DataNode data-node1 /zones/data-node1 data_node1 192.168.1.3
DataNode data-node2 /zones/data-node2 data_node2 192.168.1.4
DataNode data-node3 /zones/data-node3 data_node3 192.168.1.5

  1. Create the data-node1 zone:

    root@global_zone:~# zonecfg -z data-node1
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node1> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node1> set autoboot=true
    zonecfg:data-node1> set limitpriv=default,dtrace_proc,dtrace_user,sys_time
    zonecfg:data-node1> set zonepath=/zones/data-node1 
    zonecfg:data-node1> add fs
    zonecfg:data-node1:fs> set dir=/usr/local
    zonecfg:data-node1:fs> set special=/usr/local
    zonecfg:data-node1:fs> set type=lofs
    zonecfg:data-node1:fs> set options=[ro,nodevices]
    zonecfg:data-node1:fs> end
    zonecfg:data-node1> add net
    zonecfg:data-node1:net> set physical=data_node1
    zonecfg:data-node1:net> end
    zonecfg:data-node1> verify
    zonecfg:data-node1> commit
    zonecfg:data-node1> exit
    

    (Optional) You can create the data-node1 zone using the following script:

    root@global_zone:~# /usr/local/hadoophol/Scripts/createzone data-node1 data_node1
    
  2. Create the data-node2 zone:

    root@global_zone:~# zonecfg -z data-node2
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node2> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node2> set autoboot=true
    zonecfg:data-node2> set limitpriv=default,dtrace_proc,dtrace_user,sys_time
    zonecfg:data-node2> set zonepath=/zones/data-node2  
    zonecfg:data-node2> add fs
    zonecfg:data-node2:fs> set dir=/usr/local
    zonecfg:data-node2:fs> set special=/usr/local
    zonecfg:data-node2:fs> set type=lofs
    zonecfg:data-node2:fs> set options=[ro,nodevices]
    zonecfg:data-node2:fs> end
    zonecfg:data-node2> add net
    zonecfg:data-node2:net> set physical=data_node2
    zonecfg:data-node2:net> end
    zonecfg:data-node2> verify
    zonecfg:data-node2> commit
    zonecfg:data-node2> exit
    

    (Optional) You can create the data-node2 zone using the following script:

    root@global_zone:~# /usr/local/hadoophol/Scripts/createzone data-node2 data_node2
    
  3. Create the data-node3 zone:

    root@global_zone:~# zonecfg -z data-node3
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node3> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node3> set autoboot=true
    zonecfg:data-node3> set limitpriv=default,dtrace_proc,dtrace_user,sys_time
    zonecfg:data-node3> set zonepath=/zones/data-node3
    zonecfg:data-node3> add fs
    zonecfg:data-node3:fs> set dir=/usr/local
    zonecfg:data-node3:fs> set special=/usr/local
    zonecfg:data-node3:fs> set type=lofs
    zonecfg:data-node3:fs> set options=[ro,nodevices]
    zonecfg:data-node3:fs> end
    zonecfg:data-node3> add net
    zonecfg:data-node3:net> set physical=data_node3
    zonecfg:data-node3:net> end
    zonecfg:data-node3> verify
    zonecfg:data-node3> commit
    zonecfg:data-node3> exit
    

    (Optional) You can create the data-node3 zone using the following script:

    root@global_zone:~# /usr/local/hadoophol/Scripts/createzone data-node3 data_node3
    

Exercise 7: Configure the NameNode

  1. Now, install the name-node zone; later we will clone it in order to accelerate zone creation time.

    root@global_zone:~# zoneadm -z name-node install
    The following ZFS file system(s) have been created:
        rpool/zones/name-node
    Progress being logged to /var/log/zones/zoneadm.20130106T134835Z.name-node.install
           Image: Preparing at /zones/name-node/root.
    
  2. Boot the name-node zone:

    root@global_zone:~# zoneadm -z name-node boot
    
  3. Check the status of the zones we've created:

    root@global_zone:~# zoneadm list -cv
           ID NAME      STATUS     PATH                          BRAND    IP
     0 global           running    /                             solaris shared
     1 name-node        running    /zones/name-node              solaris  excl
     - sec-name-node    configured /zones/sec-name-node          solaris  excl
     - data-node1       configured /zones/data-node1             solaris  excl
     - data-node2       configured /zones/data-node2             solaris  excl
     - data-node3       configured /zones/data-node3             solaris  excl
    
  4. Log in to the name-node zone:

    root@global_zone:~# zlogin -C name-node
    
  5. Provide the zone host information by using the following configuration for the name-node zone:

    1. For the host name, use name-node.
    2. Select manual network configuration.
    3. Ensure the network interface name_node1 has an IP address of 192.168.1.1 and a netmask of 255.255.255.0.
    4. Ensure the name service is based on your network configuration. In this lab, we will use /etc/hosts for name resolution, so we won't set up DNS for host name resolution. Select Do not configure DNS.
    5. For Alternate Name Service, select None.
    6. For Time Zone Region, select Americas.
    7. For Time Zone Location, select United States.
    8. For Time Zone, select Pacific Time.
    9. Enter your root password.
  6. After finishing the zone setup, you will get the login prompt. Log in to the zone as user root.

    name-node console login: root
    Password:
    
  7. Developing for Hadoop requires a Java programming environment. You can install Java Development Kit (JDK) 6 using the following command:

    root@name-node:~# pkg install jdk-6
    
  8. Verify the Java installation:

    root@name-node:~# which java
    /usr/bin/java
    
    root@name-node:~# java -version
    java version "1.6.0_35"
    Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
    Java HotSpot(TM) Client VM (build 20.10-b01, mixed mode)
    
  9. Create a Hadoop user inside the name-node zone:

    root@name-node:~# groupadd hadoop
    root@name-node:~# useradd -m -g hadoop hadoop
    root@name-node:~# passwd hadoop
    

    Note: Use the same password you entered in Step 22 of Exercise 1, when you set the hadoop user's password.

  10. Create a directory for the Hadoop log files:

    root@name-node:~# mkdir /var/log/hadoop
    root@name-node:~# chown hadoop:hadoop /var/log/hadoop
    
  11. Configure an NTP client, as shown in the following example:

    1. Install the NTP package:

      root@name-node:~# pkg install ntp
      
    2. Create the NTP client configuration files:

      root@name-node:~# cd /etc/inet
      root@name-node:~# cp ntp.client ntp.conf
      root@name-node:~# chmod +w /etc/inet/ntp.conf
      root@name-node:~# touch /var/ntp/ntp.drift
      
    3. Edit the NTP client configuration file, as shown in Listing 9:

      root@name-node:~# vi /etc/inet/ntp.conf
      

      Note: In this lab, we are using the global zone as a time server so we add its name (for example, global-zone) to /etc/inet/ntp.conf.

      Listing 9
      server global-zone prefer
      driftfile /var/ntp/ntp.drift
      statsdir /var/ntp/ntpstats/
      filegen peerstats file peerstats type day enable
      filegen loopstats file loopstats type day enable
      
  12. Add the Hadoop cluster members' host names and IP addresses to /etc/hosts, as shown in Listing 10:

    root@name-node:~# vi /etc/hosts
    

    Listing 10
    ::1             localhost
    127.0.0.1       localhost loghost
    192.168.1.1 name-node
    192.168.1.2 sec-name-node
    192.168.1.3 data-node1
    192.168.1.4 data-node2
    192.168.1.5 data-node3
    192.168.1.100 global-zone
    
  13. Enable the NTP client service:

    root@name-node:~# svcadm enable ntp
    
  14. Verify the NTP client status:

    root@name-node:~# svcs ntp
    STATE          STIME    FMRI
    online         11:15:59 svc:/network/ntp:default
    
  15. Check whether the NTP client can synchronize its clock with the NTP server:

    root@name-node:~# ntpq -p
    

Exercise 8: Set Up SSH

  1. Set up SSH key-based authentication for the hadoop user on the name-node zone in order to enable password-less login to the Secondary NameNode and the DataNodes. (Because the other zones will be cloned from name-node later in this exercise, the hadoop user's key pair and authorized_keys file will propagate to all of them.)

    root@name-node:~# su - hadoop
    hadoop@name-node $ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
    hadoop@name-node $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    
  2. Edit $HOME/.profile to append to the end of the file the lines shown in Listing 11:

    hadoop@name-node $ vi $HOME/.profile
    

    Listing 11
    # Set JAVA_HOME 
    export JAVA_HOME=/usr/java
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:/usr/local/hadoop/bin
    

    Then run the following command:

    hadoop@name-node $ source $HOME/.profile
    
  3. Check that Hadoop runs by typing the following command:

    hadoop@name-node:~$ hadoop version
    Hadoop 1.2.1
    Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152
    Compiled by mattf on Mon Jul 22 15:23:09 PDT 2013
    From source with checksum 6923c86528809c4e7e6f493b6b413a9a
    

    Note: Press ~. to exit from the name-node console and return to the global zone.

    You can verify that you are in the global zone using the zonename command:

    root@global_zone:~# zonename
    global
    
  4. From the global zone, run the following command to create the sec-name-node zone as a clone of name-node:

    root@global_zone:~# zoneadm -z name-node shutdown 
    root@global_zone:~# zoneadm -z sec-name-node clone name-node
    
  5. Boot the sec-name-node zone:

    root@global_zone:~# zoneadm -z sec-name-node boot
    root@global_zone:~# zlogin -C sec-name-node
    
  6. As before, the system configuration tool is launched (see Figure 6), so do the final configuration for the sec-name-node zone:

    Note: All the zones must have the same time zone configuration and the same root password.

    Figure 6

    1. For the host name, use sec-name-node.
    2. Select manual network configuration and for the network interface, use secondary_name1.
    3. Use an IP address of 192.168.1.2 and a netmask of 255.255.255.0.
    4. Select Do not configure DNS in the DNS name service window.
    5. Ensure Alternate Name Service is set to None.
    6. For Time Zone Region, select Americas.
    7. For Time Zone Location, select United States.
    8. For Time Zone, select Pacific Time.
    9. Enter your root password.

      Note: Press ~. to exit from the sec-name-node console and return to the global zone.

  7. Perform similar steps for data-node1, data-node2, and data-node3:

    1. Do the following for data-node1:

      root@global_zone:~# zoneadm -z data-node1 clone name-node
      root@global_zone:~# zoneadm -z data-node1 boot
      root@global_zone:~# zlogin -C data-node1 
      

      • For the host name, use data-node1.
      • Select manual network configuration and for the network interface, use data_node1.
      • Use an IP address of 192.168.1.3 and a netmask of 255.255.255.0.
      • Select Do not configure DNS in the DNS name service window.
      • Ensure Alternate Name Service is set to None.
      • For Time Zone Region, select Americas.
      • For Time Zone Location, select United States.
      • For Time Zone, select Pacific Time.
      • Enter your root password.
    2. Do the following for data-node2:

      root@global_zone:~# zoneadm -z data-node2 clone name-node
      root@global_zone:~# zoneadm -z data-node2 boot
      root@global_zone:~# zlogin -C data-node2
      

      • For the host name, use data-node2.
      • For the network interface, use data_node2.
      • Use an IP address of 192.168.1.4 and a netmask of 255.255.255.0.
      • Select Do not configure DNS in the DNS name service window.
      • Ensure Alternate Name Service is set to None.
      • For Time Zone Region, select Americas.
      • For Time Zone Location, select United States.
      • For Time Zone, select Pacific Time.
      • Enter your root password.
    3. Do the following for data-node3:

      root@global_zone:~# zoneadm -z data-node3 clone name-node
      root@global_zone:~# zoneadm -z data-node3 boot
      root@global_zone:~# zlogin -C data-node3
      

      • For the host name, use data-node3.
      • For the network interface, use data_node3.
      • Use an IP address of 192.168.1.5 and a netmask of 255.255.255.0.
      • Select Do not configure DNS in the DNS name service window.
      • Ensure Alternate Name Service is set to None.
      • For Time Zone Region, select Americas.
      • For Time Zone Location, select United States.
      • For Time Zone, select Pacific Time.
      • Enter your root password.
  8. Boot the name-node zone:

    root@global_zone:~# zoneadm -z name-node boot
    
  9. Verify that all the zones are up and running:

    root@global_zone:~# zoneadm list -cv
      ID NAME             STATUS     PATH                          BRAND    IP
       0 global           running    /                             solaris shared
      10 sec-name-node    running    /zones/sec-name-node          solaris  excl
      12 data-node1       running    /zones/data-node1             solaris  excl
      14 data-node2       running    /zones/data-node2             solaris  excl
      16 data-node3       running    /zones/data-node3             solaris  excl
      17 name-node        running    /zones/name-node              solaris  excl
    
  10. To verify your SSH access without using a password for the Hadoop user, do the following.

    1. From name-node, use SSH to log in to name-node (that is, to itself):

      root@global_zone:~# zlogin name-node
      root@name-node:~# su - hadoop
      hadoop@name-node $ ssh name-node 
      
      The authenticity of host 'name-node (192.168.1.1)' can't be established.
      RSA key fingerprint is 04:93:a9:e0:b7:8c:d7:8b:51:b8:42:d7:9f:e1:80:ca.
      Are you sure you want to continue connecting (yes/no)? yes
      Warning: Permanently added 'name-node,192.168.1.1' (RSA) to the list of known hosts.
      
    2. Now, try to log in to sec-name-node and the DataNodes (data-node1, data-node2, and data-node3).
    3. Try logging in to the hosts again using SSH. You shouldn't get a prompt to add the host to the known hosts list.
  11. Edit the /etc/hosts files inside sec-name-node and the DataNodes in order to add the name-node entry:

    root@global_zone:~# zlogin sec-name-node 'echo "192.168.1.1 name-node" >> /etc/hosts' 
    root@global_zone:~# zlogin data-node1 'echo "192.168.1.1 name-node" >> /etc/hosts'
    root@global_zone:~# zlogin data-node2 'echo "192.168.1.1 name-node" >> /etc/hosts'
    root@global_zone:~# zlogin data-node3 'echo "192.168.1.1 name-node" >> /etc/hosts'
    
  12. Verify name resolution by ensuring that the global zone and all the Hadoop zones have the host entries shown in Listing 12 in /etc/hosts:

    # cat /etc/hosts
    

    Listing 12
    ::1             localhost
    127.0.0.1       localhost loghost
    192.168.1.1 name-node
    192.168.1.2 sec-name-node
    192.168.1.3 data-node1
    192.168.1.4 data-node2
    192.168.1.5 data-node3
    192.168.1.100 global-zone
    

    Note: If you are using the global zone as an NTP server, you must also add its host name and IP address to /etc/hosts.

  13. Verify the cluster using the verifycluster script:

    root@global_zone:~# /usr/local/hadoophol/Scripts/verifycluster 
    

    If the cluster setup is fine, you will get the message "The cluster is verified."

    Note: If the verifycluster script fails with an error message, check that the /etc/hosts file in every zone includes all the zone names as described in Step 12, and then rerun the verifycluster script.

Exercise 9: Format HDFS from the NameNode

Concept Break: Hadoop Distributed File System (HDFS)

HDFS is a distributed, scalable file system. HDFS stores metadata on the NameNode. Application data is stored on the DataNodes, and each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication. Clients use Remote Procedure Call (RPC) to communicate with the NameNode and the DataNodes.

The DataNodes do not rely on data protection mechanisms, such as RAID, to make the data durable. Instead, the file content is replicated on multiple DataNodes for reliability.

With the default replication value (3), which is set up in the hdfs-site.xml file, data is stored on three nodes. DataNodes can talk to each other in order to rebalance data, to move copies around, and to keep the replication of data high. In Figure 7, we can see that every data block is replicated across three data nodes based on the replication value.

An advantage of using HDFS is data awareness between the JobTracker and the TaskTrackers. The JobTracker schedules map or reduce jobs to TaskTrackers with an awareness of the data location. For example, if node A contains data (x,y,z) and node B contains data (a,b,c), the JobTracker will schedule node B to perform map or reduce tasks on (a,b,c) and node A to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. This data awareness can have a significant impact on job-completion times, which has been demonstrated when running data-intensive jobs.

For more information about HDFS, see https://en.wikipedia.org/wiki/Hadoop.

Figure 7
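Once data has been loaded into the cluster (in Exercise 11), you can observe this block placement yourself by running the Hadoop fsck tool from the NameNode. A sketch of such a check:

    hadoop@name-node:$ hadoop fsck /input-data/pg20417.txt -files -blocks -locations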

  1. To format HDFS, run the following commands and answer Y at the prompt:

    root@global_zone:~# zlogin name-node
    root@name-node:~# mkdir -p /hdfs/name
    root@name-node:~# chown -R hadoop:hadoop /hdfs
    root@name-node:~# su - hadoop
    hadoop@name-node:$ /usr/local/hadoop/bin/hadoop namenode -format
    13/10/13 09:10:52 INFO namenode.NameNode: STARTUP_MSG: 
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = name-node/192.168.1.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 1.2.1
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
    STARTUP_MSG:   java = 1.6.0_35
    ************************************************************/ 
    
    
    Re-format filesystem in /hdfs/name ? (Y or N) Y
    
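    If the format succeeds, the NameNode metadata directory is populated. In Hadoop 1.x, the on-disk layout looks roughly like the following sketch:

    hadoop@name-node:$ ls /hdfs/name/current
    VERSION     edits       fsimage     fstime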
  2. On every DataNode (data-node1, data-node2, and data-node3), create a Hadoop data directory to store the HDFS blocks:

    root@global_zone:~# zlogin data-node1
    root@data-node1:~# mkdir -p /hdfs/data
    root@data-node1:~# chown -R hadoop:hadoop /hdfs
    
    
    root@global_zone:~# zlogin data-node2
    root@data-node2:~# mkdir -p /hdfs/data
    root@data-node2:~# chown -R hadoop:hadoop /hdfs
    
    
    root@global_zone:~# zlogin data-node3
    root@data-node3:~# mkdir -p /hdfs/data
    root@data-node3:~# chown -R hadoop:hadoop /hdfs
    
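    Because the three zlogin sessions above are identical, you can equivalently run them as one loop from the global zone:

    root@global_zone:~# for z in data-node1 data-node2 data-node3; do
    > zlogin $z 'mkdir -p /hdfs/data; chown -R hadoop:hadoop /hdfs'
    > done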

Exercise 10: Start the Hadoop Cluster

Table 3 describes the startup scripts.

Table 3. Startup Scripts
File Name Description
start-dfs.sh Starts the HDFS daemons: the NameNode and the DataNodes. Use this before start-mapred.sh.
stop-dfs.sh Stops the Hadoop DFS daemons.
start-mapred.sh Starts the Hadoop MapReduce daemons: the JobTracker and the TaskTrackers.
stop-mapred.sh Stops the Hadoop MapReduce daemons.

  1. From the name-node zone, start the HDFS daemons (the NameNode and the DataNodes) using the following commands:

    root@global_zone:~# zlogin name-node
    root@name-node:~# su - hadoop
    hadoop@name-node:$ start-dfs.sh
    starting namenode, logging to /var/log/hadoop/hadoop--namenode-name-node.out
    data-node2: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node2.out
    data-node1: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node1.out
    data-node3: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node3.out
    sec-name-node: starting secondarynamenode, logging to 
    /var/log/hadoop/hadoop-hadoop-secondarynamenode-sec-name-node.out
    
  2. Start the Hadoop MapReduce daemons (the JobTracker and the TaskTrackers) using the following command:

    hadoop@name-node:$ start-mapred.sh
    starting jobtracker, logging to /var/log/hadoop/hadoop--jobtracker-name-node.out
    data-node1: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node1.out
    data-node3: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node3.out
    data-node2: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node2.out
    
  3. To view a comprehensive status report, execute the following command. The command will output basic statistics about the cluster health, such as NameNode details, the status of each DataNode, and disk capacity amounts.

    hadoop@name-node:$ hadoop dfsadmin -report
    Configured Capacity: 171455269888 (159.68 GB)
    Present Capacity: 169711053357 (158.06 GB)
    DFS Remaining: 169711028736 (158.06 GB)
    DFS Used: 24621 (24.04 KB)
    DFS Used%: 0%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    -------------------------------------------------
    Datanodes available: 3 (3 total, 0 dead)
    ...
    

    You should see that three DataNodes are available.

    Note: You can find the same information on the NameNode web status page (shown in Figure 8) at http://<NameNode IP address>:50070/dfshealth.jsp. In this lab, the NameNode IP address is 192.168.1.1.

    Figure 8
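    Another quick way to confirm that the daemons are running is the JDK's jps utility, which lists Java processes. A sketch from the name-node zone (the process IDs will differ on your system):

    root@name-node:~# su - hadoop -c "/usr/java/bin/jps"
    2478 NameNode
    2652 JobTracker
    2841 Jps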

Exercise 11: Run a MapReduce Job

Concept Break: MapReduce

MapReduce is a framework for processing parallelizable problems across huge data sets using a cluster of computers.

The essential idea of MapReduce is to process the data in two phases: a Map() function is applied to every member of the input data set and emits an intermediate result set, which a Reduce() function then collates and resolves into the final output.

Map() and Reduce() can be run in parallel and across multiple systems.

For more information about MapReduce, see http://en.wikipedia.org/wiki/MapReduce.

We will use the WordCount example, which reads text files and counts how often words occur. The input consists of text files, and the output is a set of text files, each line of which contains a word and the number of times the word occurred, separated by a tab. For more information about WordCount, see http://wiki.apache.org/hadoop/WordCount.
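Conceptually, WordCount does on the cluster what the following single-machine shell pipeline does: the map phase corresponds to splitting the input into words, and the reduce phase corresponds to counting each distinct word. (This pipeline is an illustration only, not part of the lab.)

    $ tr -cs '[:alpha:]' '[\n*]' < /usr/local/hadoophol/Doc/pg20417.txt | sort | uniq -c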

  1. Create the input data directory; we will put the input files there.

    hadoop@name-node:$ hadoop fs -mkdir /input-data
    
  2. Verify the directory creation:

    hadoop@name-node:$ hadoop dfs -ls /
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2013-10-13 23:45 /input-data
    
  3. Copy the pg20417.txt file you downloaded earlier to HDFS using the following command:

    Note: Oracle OpenWorld attendees can find the pg20417.txt file in the /usr/local/hadoophol/Doc directory.

    hadoop@name-node:$ hadoop dfs -copyFromLocal /usr/local/hadoophol/Doc/pg20417.txt /input-data
    
  4. Verify that the file is located on HDFS:

    hadoop@name-node:$ hadoop dfs -ls /input-data
    Found 1 items
    -rw-r--r--   3 hadoop supergroup     674570 2013-10-13 10:20 /input-data/pg20417.txt
    
  5. Create the output directory; the MapReduce job will put its outputs in this directory:

    hadoop@name-node:$ hadoop fs -mkdir /output-data
    
  6. Start the MapReduce job using the following command:

    hadoop@name-node:$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.1.jar \
    wordcount /input-data/pg20417.txt /output-data/output1
    
    13/10/13 10:23:08 INFO input.FileInputFormat: Total input paths to process : 1
    13/10/13 10:23:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library 
    for your platform... using builtin-java classes where applicable
    13/10/13 10:23:08 WARN snappy.LoadSnappy: Snappy native library not loaded
    13/10/13 10:23:09 INFO mapred.JobClient: Running job: job_201310130918_0010
    13/10/13 10:23:10 INFO mapred.JobClient:  map 0% reduce 0%
    13/10/13 10:23:19 INFO mapred.JobClient:  map 100% reduce 0%
    13/10/13 10:23:29 INFO mapred.JobClient:  map 100% reduce 33%
    13/10/13 10:23:31 INFO mapred.JobClient:  map 100% reduce 100%
    13/10/13 10:23:34 INFO mapred.JobClient: Job complete: job_201310130918_0010
    13/10/13 10:23:34 INFO mapred.JobClient: Counters: 26
    

    The program takes about 60 seconds to execute on the cluster.

    All of the files in the input directory (input-data in the command line shown above) are read and the counts for the words in the input are written to the output directory (called output-data/output1).

  7. Verify the output data:

    hadoop@name-node:$ hadoop dfs -ls /output-data/output1
    Found 3 items
    -rw-r--r--   3 hadoop supergroup          0 2013-10-13 10:30 /output-data/output1/_SUCCESS
    drwxr-xr-x   - hadoop supergroup          0 2013-10-13 10:30 /output-data/output1/_logs
    -rw-r--r--   3 hadoop supergroup     196192 2013-10-13 10:30 /output-data/output1/part-r-00000
    
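    You can also inspect the beginning of the result directly from HDFS, without copying it out first:

    hadoop@name-node:$ hadoop dfs -cat /output-data/output1/part-r-00000 | head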

Exercise 12: Use ZFS Encryption

Concept Break: ZFS Encryption

Oracle Solaris 11 adds transparent data encryption functionality to ZFS. All data and file system metadata (such as ownership, access control lists, quota information, and so on) is encrypted when stored persistently in the ZFS pool.

A ZFS pool can support a mix of encrypted and unencrypted ZFS data sets (file systems and ZVOLs). Data encryption is completely transparent to applications and other Oracle Solaris file services, such as NFS or CIFS. Since encryption is a first-class feature of ZFS, we are able to support compression, encryption, and deduplication together. Encryption key management for encrypted data sets can be delegated to users, Oracle Solaris Zones, or both. Oracle Solaris with ZFS encryption provides a very flexible system for securing data at rest, and it doesn't require any application changes or qualification.

For more information about ZFS encryption, see "How to Manage ZFS Data Encryption."

The output data can contain sensitive information, so use ZFS encryption to protect the output data.

  1. Create the encrypted ZFS data set:

    Note: You need to provide the passphrase; it must be at least eight characters.

    root@name-node:~# zfs create -o encryption=on rpool/export/output
    
    Enter passphrase for 'rpool/export/output':
    Enter again:
    
  2. Verify that the ZFS data set is encrypted:

    root@name-node:~# zfs get all rpool/export/output | grep encry
    rpool/export/output  encryption          on               local
    
  3. Change the ownership:

    root@name-node:~# chown hadoop:hadoop /export/output
    
  4. Copy the output file from HDFS into ZFS:

    root@name-node:~# su - hadoop
    Oracle Corporation      SunOS 5.11      11.1    September 2012
    
    hadoop@name-node:$ hadoop dfs -getmerge /output-data/output1 /export/output 
    
  5. Analyze the output text file. Each line contains a word and the number of times the word occurred, separated by a tab.

    hadoop@name-node:$ head /export/output/output1
    "A      2
    "Alpha  1
    "Alpha,"        1
    "An     2
    "And    1
    "BOILING"       2
    "Batesian"      1
    "Beta   2
    
  6. Protect the output text file by unloading the wrapping key for the encrypted data set, using the following command; this also unmounts the data set:

    root@name-node:~# zfs key -u rpool/export/output
    

    If the command is successful, the data set is not accessible and it is unmounted.

  7. If you want to mount this ZFS file system, you need to provide the passphrase:

    root@name-node:~# zfs mount rpool/export/output
    Enter passphrase for 'rpool/export/output':
    

    By using a passphrase, you ensure that only those who know the passphrase can observe the output file.
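    At any time, you can check whether the wrapping key is loaded by querying the data set's keystatus property; a sketch of the unloaded state:

    root@name-node:~# zfs get keystatus rpool/export/output
    NAME                 PROPERTY   VALUE        SOURCE
    rpool/export/output  keystatus  unavailable  -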

Exercise 13: Use Oracle Solaris DTrace for Performance Monitoring

Concept Break: Oracle Solaris DTrace

Oracle Solaris DTrace is a comprehensive, advanced tracing tool for troubleshooting systemic problems in real time. Administrators, integrators, and developers can use DTrace to dynamically and safely observe live production systems, including both applications and the operating system itself, for performance issues.

DTrace allows you to explore a system to understand how it works, track down problems across many layers of software, and locate the cause of any aberrant behavior. Whether at a high-level global overview, such as memory consumption or CPU time, or at a much finer-grained level, such as what specific function calls are being made, DTrace can provide operational insights that have been missing in the data center by enabling you to do the following:

  • Insert 80,000+ probe points across all facets of the operating system.
  • Instrument user and system level software.
  • Use a powerful and easy-to-use scripting language and command line interfaces.

For more information about DTrace, see http://www.oracle.com/technetwork/server-storage/solaris11/technologies/dtrace-1930301.html.
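All of the DTrace one-liners in this exercise have the same shape: a probe description, a predicate (between the / characters) that restricts matches to a particular zone, and an action. As a warm-up, the following minimal example simply counts system calls per zone; press Ctrl-C to print the aggregation:

    root@global_zone:~# dtrace -n 'syscall:::entry { @[zonename] = count(); }'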

  1. Open another terminal window and log in to name-node as user hadoop.
  2. Run the following MapReduce job:

    hadoop@name-node:$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.1.jar \
    wordcount /input-data/pg20417.txt /output-data/output2
    
  3. When the Hadoop job is run, determine what processes are executed on the NameNode.

    In the terminal window, run the following DTrace command:

    root@global-zone:~# dtrace -n 'proc:::exec-success/strstr(zonename,"name-node")>0/ { trace(curpsinfo->pr_psargs); }'
    
    dtrace: description 'proc:::exec-success' matched 1 probe
    
     CPU   ID             FUNCTION:NAME
       0 4473  exec_common:exec-success   /usr/bin/env bash /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-exa
       0 4473  exec_common:exec-success   bash /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-examples-1.1.2.j
       0 4473  exec_common:exec-success   dirname /usr/local/hadoop-1.1.2/libexec/--
       0 4473  exec_common:exec-success   dirname /usr/local/hadoop-1.1.2/libexec/--
       0 4473  exec_common:exec-success   sed -e s/ /_/g                   
       1 4473  exec_common:exec-success   dirname /usr/local/hadoop/bin/hadoop
       1 4473  exec_common:exec-success   dirname -- /usr/local/hadoop/bin/../libexec/hadoop-config.sh
       1 4473  exec_common:exec-success   basename -- /usr/local/hadoop/bin/../libexec/hadoop-config.sh
       1 4473  exec_common:exec-success   basename /usr/local/hadoop-1.1.2/libexec/--
       1 4473  exec_common:exec-success   uname                            
       1 4473  exec_common:exec-success   /usr/java/bin/java -Xmx32m org.apache.hadoop.util.PlatformName
       1 4473  exec_common:exec-success   /usr/java/bin/java -Xmx32m org.apache.hadoop.util.PlatformName
       0 4473  exec_common:exec-success   /usr/java/bin/java -Dproc_jar -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop -Dhado
       0 4473  exec_common:exec-success   /usr/java/bin/java -Dproc_jar -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop -Dhado
    ^C
    

    Note: Press Ctrl-c in order to see the DTrace output.

  4. When the Hadoop job is run, determine what files are written to the NameNode.

    Note: If the MapReduce job is finished, you can run another job with a different output directory (for example, /output-data/output3).

    For example:

    hadoop@name-node:$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.1.jar \
    wordcount /input-data/pg20417.txt /output-data/output3
    

    root@global-zone:~# dtrace -n 'syscall::write:entry/strstr(zonename,"name-node")>0/ {@write[fds[arg0].fi_pathname]=count();}'
    
    dtrace: description 'syscall::write:entry' matched 1 probe
    ^C
    
      /zones/name-node/root/tmp/hadoop-hadoop/mapred/local/jobTracker/.job_201307181457_0007.xml.crc  1
      /zones/name-node/root/var/log/hadoop/history/.job_201307181457_0007_conf.xml.crc  1
      /zones/name-node/root/dev/pts/3                      5
      /zones/name-node/root/var/log/hadoop/job_201307181457_0007_conf.xml  6
      /zones/name-node/root/tmp/hadoop-hadoop/mapred/local/jobTracker/job_201307181457_0007.xml  8
      /zones/name-node/root/var/log/hadoop/history/job_201307181457_0007_conf.xml  11
      /zones/name-node/root/var/log/hadoop/hadoop--jobtracker-name-node.log  13
      /zones/name-node/root/hdfs/name/current/edits.new  25
      /zones/name-node/root/var/log/hadoop/hadoop--namenode-name-node.log  45
      /zones/name-node/root/dev/poll                    207
      <unknown>                                         3131655
    

    Note: Press Ctrl-c in order to see the DTrace output.

  5. When the Hadoop job is run, determine what processes are executed on the DataNode:

    root@global-zone:~# dtrace -n 'proc:::exec-success/strstr(zonename,"data-node1")>0/ { trace(curpsinfo->pr_psargs); }'
    
    dtrace: description 'proc:::exec-success' matched 1 probe
    
     CPU   ID             FUNCTION:NAME
       0 8833  exec_common:exec-success   dirname /usr/local/hadoop/bin/hadoop
       0 8833  exec_common:exec-success   dirname /usr/local/hadoop/libexec/--
       0 8833  exec_common:exec-success   sed -e s/ /_/g                  
       1 8833  exec_common:exec-success   dirname -- /usr/local/hadoop/bin/../libexec/hadoop-config.sh
       2 8833  exec_common:exec-success   basename /usr/local/hadoop/libexec/--
       2 8833  exec_common:exec-success   /usr/java/bin/java -Xmx32m org.apache.hadoop.util.PlatformName
       2 8833  exec_common:exec-success   /usr/java/bin/java -Xmx32m org.apache.hadoop.util.PlatformName
       3 8833  exec_common:exec-success   /usr/bin/env bash /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-exa
       3 8833  exec_common:exec-success   bash /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.j
       3 8833  exec_common:exec-success   basename -- /usr/local/hadoop/bin/../libexec/hadoop-config.sh
       3 8833  exec_common:exec-success   dirname /usr/local/hadoop/libexec/--
       3 8833  exec_common:exec-success   uname                           
       3 8833  exec_common:exec-success   /usr/java/bin/java -Dproc_jar -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop -Dhado
       3 8833  exec_common:exec-success   /usr/java/bin/java -Dproc_jar -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop -Dhado
    ^C
    
  6. When the Hadoop job is run, determine what files are written on the DataNode:

    (There were 222 lines of output, which were reduced for readability.)

    root@global-zone:~# dtrace -n 'syscall::write:entry/strstr(zonename,"data-node1")>0/ {@write[fds[arg0].fi_pathname]=count();}'
    
    dtrace: description 'syscall::write:entry' matched 1 probe
    
    ^C
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_-5404946161781239203    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_-5404946161781239203_1103.meta    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_-6136035696057459536    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_-6136035696057459536_1102.meta    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_-8420966433041064066    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_-8420966433041064066_1105.meta    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_1792925233420187481   1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_1792925233420187481_1101.meta    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_4108435250688953064    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_4108435250688953064_1106.meta    1
      /zones/data-node1/root/hdfs/data/blocksBeingWritten/blk_8503732348705847964          
    
  7. Determine the total amount of HDFS data written for the DataNodes:

      root@global-zone:~# dtrace -n 'syscall::write:entry / 
      (   strstr(zonename,"data-node1")!=0 || strstr(zonename,"data-node2")!=0 || 
    strstr(zonename,"data-node3")!=0 ) && strstr(fds[arg0].fi_pathname,"hdfs")!=0  
    && strstr(fds[arg0].fi_pathname,"blocksBeingWritten")>0/ 
    { @write[fds[arg0].fi_pathname]=sum(arg2); }'
    ^C
    

Summary

In this lab, we learned how to set up a Hadoop cluster using Oracle Solaris 11 technologies such as Oracle Solaris Zones, ZFS, network virtualization, and DTrace.

See Also

About the Author

Orgad Kimchi is a principal software engineer on the ISV Engineering team at Oracle (formerly Sun Microsystems). For 6 years he has specialized in virtualization, big data, and cloud computing technologies.

Revision 1.0, 10/21/2013
