DBA: Scripting
Filtering and Grouping: A Comparison of SQL, Linux Scripting, and Ruby
by Casimir Saternos
Approaching problems involving sets of data with Oracle SQL, Linux and Ruby
Published January 2008
"The way to create something beautiful is often to make subtle tweaks to something that already exists, or to combine existing ideas in a slightly new way."
- Paul Graham, Hackers and Painters
It is common to consider software development as a creative, almost artistic endeavor today, with the goal being software that is not only functional but in some measure elegant, useful, and well designed. However, IT professionals are also called upon to at times to act in a "performer" role more akin to a jazz musician. A good jazz musician can choose just the right notes that fit in a given context—and make it look easy. With a thorough knowledge of scripting technologies, you can likewise quickly create a simple command, statement, or program that effectively addresses a business or technical problem. They provide an invaluable service, and make it look easy.
A version of the Oracle Database for Linux was released in 1998. Since then, other Oracle applications and middleware have become available on Linux. With the Oracle Unbreakable Linux support program in place, many organizations are migrating their Oracle installations to the Linux platform. These changes result in more and more DBAs finding themselves in the position of investigating resources, such as scripting, for managing servers running on Linux.
In a previous article, I noted that there is a traditional division of responsibilities between sysadmins and DBAs that is not always clear in practice. Companies employ individuals who address concerns at the database as well as the operating system levels. At minimum, the responsibility of installing Oracle database software requires the DBA to interact with the underlying operating system.
Even with the plethora of new scripting languages, administrators often still need to resort to shell scripting as a technology that is available on all systems. Existing scripts have been developed and have been running for years on many servers—and these scripts require maintenance and upgrades. Shell scripting is an invaluable tool for IT professionals of all types.
In this article you will learn some of the transferable concepts common to Oracle SQL, Linux operating system commands, and scripting. In particular, it will show how to use a "set oriented" way of thinking about problems that involve sets of data that need to be grouped, sorted, and filtered. These operations are familiar to those with a knowledge of SQL but are also often required when using Linux commands and or the Ruby programming language.
I will not an attempt to discuss mathematical set theory in any sort of rigorous disciplined fashion. Instead, all the fuzzy boundaries that exist in human language (and typical business problems) can be found throughout. Oracle SQL provides a much purer way of manipulating sets of data than scripting: each column contains fields of a defined type, and the logical data is handled independently of the underlying physical representation. Data returned from a query is not the same in any formal sense as the output of a shell command. The intent of what follows is pedagogical, to provide some analogies that are familiar to those who know SQL to leverage their problem solving abilities when required to work outside the database. It will provide a different perspective for approaching problems and encourage you to delve into new and unfamiliar areas.
The Data Set
SQL is not a procedural language. Rather, SQL is used to view, summarize, or manipulate sets of data. However, executing certain shell commands result in output being written. These results can also be interpreted as a set of data. It may not be a particularly structured set of data, but in many cases it is. Many different shell commands result in a set of data as output but the following examples will use the ls command which should be familiar to most users. A listing of files within a given directory will be modified by subsequent commands. Start with a simple long listing command:
ls -l
The resulting output will depend upon your directory. Consider the following result:
total 60K
-rw-r--r-- 1 root dba 1.7K May 13 09:02 xe_s000_2072.trc
-rw-r--r-- 1 cas dba 929 Apr 28 22:21 xe_smon_3664.trc
-rw-r--r-- 1 cas dba 794 Apr 26 17:19 xe_smon_3676.trc
-rw-r--r-- 1 oracle dba 792 Apr 23 21:52 xe_smon_2120.trc
-rw-r--r-- 1 oracle dba 794 Apr 19 17:37 xe_smon_3364.trc
-rw-r--r-- 1 root dba 5.5K Jun 20 2006 xe_s000_2412.trc
-rw-r--r-- 1 cas dba 790 May 25 2006 xe_smon_540.trc
The first column contains a set of directory permissions. The second column displays the number of links. The third column indicates the owner. The fourth column represents the associated group. The fifth column indicates the size of the file in bytes. The sixth column is the file creation date (and poses some problems; see below). The seventh column indicates the name of the file itself.
A table could be visualized that would hold this data:
| FILE_PERMISSIONS |
NUM_LINKS |
FILE_OWNER |
FILE_GROUP |
SIZE |
CREATED_AT |
FILE_NAME |
| -rw-r--r-- |
1 |
root |
dba |
1675 |
May 13 09:02 |
xe_s000_2072.trc |
| -rw-r--r-- |
1 |
root |
dba |
5532 |
Jun 20 2006 |
xe_s000_2412.trc |
| -rw-r--r-- |
1 |
oracle |
dba |
792 |
Apr 23 21:52 |
xe_smon_2120.trc |
| -rw-r--r-- |
1 |
oracle |
dba |
794 |
Apr 19 17:37 |
xe_smon_3364.trc |
| -rw-r--r-- |
1 |
cas |
Dba |
929 |
Apr 28 22:21 |
xe_smon_3664.trc |
| -rw-r--r-- |
1 |
cas |
dba |
794 |
Apr 26 17:19 |
xe_smon_3676.trc |
| -rw-r--r-- |
1 |
cas |
dba |
790 |
May 25 2006 |
xe_smon_540.trc |
This "imaginary table" of results (the output of the ls -l command listed above) will be built upon in subsequent examples to show how the data can be limited, sorted, and aggregated.
Some Caveats
The structure of this set of data is evident at least to a human viewing the data. However notice the first line that does not fit the structure: total 60K. This shows the limitation of the approach at the onset; some of the output is not part of the structured data.)
To clean up this set of data, you can redirect the output to a file that you can then manipulate with subsequent commands.
ls -l > test.txt
You can then edit this file with your favorite text editor and delete the first line. The data could be tweaked until it is structured enough for your purposes. The degree of structure will depend upon your intended target audience. A human readable report will be different from a file to be loaded into an Oracle table.
If this manual intermediate step is not an option (or offends your automating sensibilities) then consider piping the output using the grep -v option described later in the article. This command is used to exclude lines that match a given pattern.
You might also notice that the date format is not consistent. Besides being displayed in several different formats, it can not be ordered in a temporal manner and it contains spaces. Again, this shows the usefulness of a database for storing typed data. Oracle provides numerous functions for maniplating and ordering dates. Dates are often easier to manipulate within the database as structured data.
When forced to order by date at the command line, you might also look for a -t or similar option to provide this ordering prior to piping to subsequent commands.
|
Limiting and Ordering Data
Many Linux commands contain options that allow you to order the data in a useful manner. For example, sticking with our humble ls command, you can specify the -t option to order by time or -X to order by extension. However, lets say you wanted to only select the owner and the name of the file and want to order the result by the owner. There is no option included with ls to allow such an output. If you had the set of data in a table you could do something like:
SELECT file_owner, file_name FROM imaginary_table ORDER BY owner;
However, you can achieve the desired result by piping (or redirecting) the output of a command to another command using the pipe operator: |. Start by limiting the data to the two columns in question:
[root@linux-server test]# ls -l | awk '{print $3"\t"$9}'
root xe_s000_2072.trc
root xe_s000_2412.trc
oracle xe_smon_2120.trc
oracle xe_smon_3364.trc
cas xe_smon_3664.trc
cas xe_smon_3676.trc
cas xe_smon_540.trc
The awk utility is used in this example to filter the data so that only the owner and file name are printed. You might have noticed that the owner is the third column of the imaginary table, and so the $3 indicates this column. The \t indicates a tab will be used to separate the data from the next column. However, where did the $9 come from?
A closer look at the data reveals the fact that the date data is separated into three columns. Because of the whitespace that is separating these parts, the data is really "seen" as three separate columns.
Now these results need to be piped to another command to be ordered appropriately. The sort utility prints lines in alphabetical order or in another order specified. So you can get the desired result as follows:
[root@linux-server test]# ls -l | awk '{print $3"\t"$9}' | sort
cas xe_smon_3664.trc
cas xe_smon_3676.trc
cas xe_smon_540.trc
oracle xe_smon_2120.trc
oracle xe_smon_3364.trc
root xe_s000_2072.trc
root xe_s000_2412.trc
Now lets say you only want to see files that are owned by oracle. A SQL statement would use a WHERE clause something like WHERE owner='oracle'. The grep family of utilities can be used to limit data in this manner:
[root@linux-server test]# ls -l | awk '{print $3"\t"$9}' | sort | grep oracle
oracle xe_smon_2120.trc
oracle xe_smon_3364.trc
In a similar manner, the -v flag can be used to limit the data to rows that are not owned by oracle (or the equivalent of WHERE owner <> 'oracle' ):
[root@linux-server test]# ls -l | grep -v oracle
cas xe_smon_3664.trc
cas xe_smon_3676.trc
cas xe_smon_540.trc
root xe_s000_2072.trc
root xe_s000_2412.trc
Unique Data and Grouping Data
Starting with only the owner data (or the equivalent of SELECT owner FROM imaginary_table ORDER BY 1):
[root@linux-server test]# ls -l | awk '{print $3}' | sort
cas
cas
cas
oracle
oracle
root
root
You might be interested in seeing a simple list of who owns the files in the directory. In SQL this would be expressed as
SELECT DISTINCT owner FROM imaginary_table ORDER BY 1
You can use the uniq command to achieve a similar result:
[root@linux-server test]# ls -l | awk '{print $3}' | sort | uniq
cas
oracle
root
The uniq operator has a count function that allows you to get a count of each file associated with a given owner. In SQL you would write a query like
SELECT count(*), owner FROM imaginary_table GROUP BY owner
By adding the -c option to uniq you achieve this result:
[root@linux-server test]# ls -l | awk '{print $3}' | sort | uniq -c
1
3 cas
2 oracle
2 root
What is that 1 at the beginning of the output? If you recall from the initial output of the ls -l command, there was a row that read total 60K. Again this output was interpreted by the command as a column 3 containing no data. If this is unacceptable, exclude it using grep as demonstrated earlier:
[root@linux-server test]# ls -l |grep -v total |awk '{print $3}' | sort | uniq -c
3 cas
2 oracle
2 root
All of the examples above use the ls command to provide output. However, any command that produces output can be used in a similar manner. Another way to think about it is to consider if you can redirect the output of the command to a file. If you can, the resulting file will contain the set of data that will subsequently be manipulated.
From Shell Commands to Ruby
Lest you think that becoming familiar with this type of syntax would deprive you of time spent learning newer, hot technologies, consider the way these idioms appear in the Ruby programming language. Ruby has gained wide acceptance as part of the Rails Web application development framework. However, it is a very useful general purpose language that can be leveraged to address a wide variety of concerns of interest to systems administrators and DBAs.
Oracle and SQL share a common characteristic of being relatively "dense" or "terse" languages. Such languages provide the ability to express extremely powerful statements in a succinct manner. This allows for the creation of programs that are clearer and easier to maintain.
The following experiment can be performed within the Interactive Ruby Shell (irb) which is included in a typical Ruby distribution. These examples do not require any installation prerequisites other than the Ruby programming language itself. The previous usage of sort, uniq and grep is reflected in the method names of the ruby Array class.
Type "irb" at the command line to get to an irb prompt:
[root@linux-server test]# irb
irb(main):001:0>
Create a list of values contained in an array object:
myList = ['ruby','sql','ruby','bash','python','perl','java','sql']
=> ["ruby", "sql", "ruby", "bash", "python", "perl", "java", "sql"]
When running irb the second line displays the value of the expression as it is evaluated. This list of values is analogous to the results of the ls command used in previous examples. This will serve as our initial list that will be sorted and limited.
The list can now be sorted:
irb(main):002:0> myList.sort
=> ["bash", "java", "perl", "python", "ruby", "ruby", "sql", "sql"]
A list of unique results can be returned:
irb(main):003:0> myList.uniq
=> ["ruby", "sql", "bash", "python", "perl", "java"]
There is even a grep method complete with regular expression matching capabilities:
irb(main):004:0> myList.grep(/r/)
=> ["ruby", "ruby", "perl"]
You can also simulate the uniq -c option by calling the size method on the array returned by the uniq method:
irb(main):005:0> myList.uniq.size
=> 6
Ruby Glue
SQL is useful for manipulating sets of data within Oracle Database. Shell scripts (comprising Linux commands) work well at the file system level. What can be done in a situation that requires manipulation of sets of data both inside and outside the database?
Ruby can serve as a "glue" language that facilitates communication between independent technologies or application layers. One example of such a function is the use of Ruby to provide interaction between the database and operating system. Consider the rather common scenario in web applications where a record in the database contains a reference to a file on the file system. An example might be a user who uploaded their picture which is subsequently saved to the file system using the user's id. If you need to determine which users (by name) have uploaded files, you are required to use a solution that bridges the these two worlds.
These examples do require some additional installations besides Ruby itself. The Oracle Client (along with a configured connection to an Oracle database that contains the HR schema, such as Oracle Database XE) and the OCI8 gem (ruby-oci9) should be installed before running the following examples (see "Downloads" box at top of page)..
Create the Sample Data
Start by navigating to a directory where you will create files to simulate this situation. You can either enter the following program line by line in irb or save the entire listing in a file and execute it at the command line (ruby <name of file>) in the directory containing the files created earlier. The listing includes comments that are lines preceded by a pound sign:
# First, enter some preliminary commands to create files that will
# simulate data that was uploaded by users. Load the OCI8 package
# and make a connection to your database:
require 'OCI8'
conn =OCI8.new('hr', 'hr', 'xe')
# Next create a SQL query that will return all of the users
# that have an "a" in their last name.
# This simulates arbitrary user uploaded files:
sql="select employee_id||'_'||upper(last_name)|| '.gif' "
sql+="from employees where upper(last_name) like '%A%'"
# Now we will create the actual (empty) files.
# They will be named using the employee id and last name.
# as specified in the first (and only) field in the query
conn.exec(sql) { |r| File.new(r[0],'w') }
If you ran these examples within irb, type "exit" at the prompt and view the files that were created.
Create the File System/Database Reconciliation Report
Next create a report using Ruby that lists each user and includes and X by the names of users that have files associated with them.
# Load OCI8
require 'OCI8'
# Make a connection to the Oracle Database containing the HR sample schema
conn =OCI8.new('hr', 'hr', 'xe')
# Create a SQL query
sql="select employee_id||'_'||upper(last_name)|| '.gif',"
sql+=" last_name, first_name from employees order by 2,3"
# Execute the query. Each result is an array of strings representing
# each field in the query. If a file exists on the file system that matches
# the pattern we specified earlier, display an X
conn.exec(sql) do |r|
if File.exists?(r[0])
print 'X '
else
print ' '
end
# Display the last name followed by the first name
puts "#{r[1]}, #{r[2]}"
end
The results of this report are written to standard out and should look something like the following:
X Abel, Ellen
X Ande, Sundar
X Atkinson, Mozhe
X Austin, David
X Baer, Hermann
X Baida, Shelli
X Banda, Amit
X Bates, Elizabeth
Bell, Sarah
Bernstein, David
Bissot, Laura
Bloom, Harrison
Bull, Alexis
X Cabrio, Anthony
...
Weiss, Matthew
X Whalen, Jennifer
Zlotkey, Eleni
Note that (as expected) only users with an "a" in their last name have an "X" by their name to indicate the existence of a related file.
Oracle provides a number of ways to handle this situation completely within the database. For instance, BLOBs can be used to store the images as binary data within the database itself. However, it is helpful to know that there are alternatives when such a design is not in place.
Summary and Comparison of Operations
In the table below the source of data is different in each case, but the methods of processing it are similar. The SQL data set is the result of a query against database tables. The Linux data set is the result of some command that is being piped into the listed command. The dataset in question for the ruby example is a Ruby array of strings. The expressions are representative; the particular values will vary depending upon the data set in question.
| SQL |
Linux Command |
Ruby |
Description |
| WHERE col1 = 'value' |
grep value |
myArr.grep('value') |
Limit results, return matches |
| WHERE col1 != 'value' |
grep -v value |
MyArr. – myArr.grep('value') |
Limit results, return those that do not match |
| ORDER BY 1 |
sort |
myArr.sort |
Put the results in order |
| SELECT DISTINCT col1 |
uniq field1 |
myArr.uniq |
Remove duplicates from results |
| SELECT COUNT(*) , col1 FROM y GROUP BY x |
sort | uniq -c |
myArr.uniq.size |
Find out how many of a given results exist within a set of results |
| SELECT col2, col1 |
awk '{print $2 "\t" $1}' |
myArr.each{|rec| puts "#{rec[1,2]} #{rec[0,1]}"} |
Limit the results to only specific parts of a given result line of data |
At a certain point, it is easier to take advantage of the power of Oracle SQL to do complex aggregation of data. It is also easier to interpret a well crafted SQL statement than a series of piped Linux commands. Oracle provides a number of tools that can be used to import such data once it has been structured in some way. Oracle External Tables or SqlLoader(sqlldr) can be used to quickly import such data into tables.
Scripting can be used to quickly prototype a bit of needed functionality. A solution that meets a given business requirement could become more widely used and need to be enhanced. When dealing with sets of data that are requiring complex filtering and analysis, you will find opportunities to convert these scripts into Oracle Database-backed programs that allow for reliable maintenance and storage of data. The greatest immediate benefit may well be the use of Oracle SQL—which was expressly designed for the sort of manipulations required.
Conclusion
In general, organizations should strive for repeatable processes and well designed systems. The "artists" (who are generally programmers) can create software that is ideally well suited for the requirements. Inevitably, the organization or the business requirements change, and those in "performer" roles step in to quickly adapt to changing circumstances. The performer (frequently a system administrator or DBA) has less time and fewer resources to develop a solution. They need to quickly identify a qualifying technology to address the problem, and have a repertoire of basic algorithms that can be used to meet the technical and business needs. They often do not have the luxury of an extensive testing process (which programmers often have available), but enjoy the opportunity to provide unique and creative solutions that are indispensable to ongoing operations.
If you know SQL, take a look at scripting and command construction using the concepts introduced in this article. You already have a wealth of problem solving strategies under your belt that are applicable in other settings as well. If you know something of scripting and are delving into SQL, you will find that the procedural approach has provided you with some basic resources that you can use as you enter into a more consistent set-oriented way of thinking. In either case, you will find that much of your existing knowledge will serve you well as you obtain new skills to address the pressing daily business and technical demands that meet you.
Casimir Saternos is an Oracle Certified DBA, IBM Certified Enterprise Developer, and Sun Certified Java Programmer based in Allentown, Pa.
Send us your comments
|