
An Introduction to Linux Shell Scripting

Loops

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • How can I perform the same actions on many different files?

Objectives
  • Write a loop that applies one or more commands separately to each file in a set of files.

  • Trace the values taken on by a loop variable during execution of the loop.

  • Explain the difference between a variable’s name and its value.

  • Explain why spaces and some punctuation characters shouldn’t be used in file names.

  • Demonstrate how to see what commands have recently been executed.

  • Re-run recently executed commands without retyping them.

In this lesson we will build on what we learned in the previous lesson, An Introduction to Linux with Command Line. We will continue working with Nelle’s pipeline and expand it to make use of Bash scripts to automate the production of results using many of the Bash commands already learned.

Loops are a programming construct which allow us to repeat a command or set of commands for each item in a list. As such they are key to productivity improvements through automation. Similar to wildcards and tab completion, using loops also reduces the amount of typing required (and hence reduces the number of typing mistakes).

Suppose we have several hundred genome data files named basilisk.dat, minotaur.dat, and unicorn.dat. For this example, we’ll use the creatures directory which only has three example files, but the principles can be applied to many, many more files at once.

The structure of these files is the same: the common name, classification, and updated date are presented on the first three lines, with DNA sequences on the following lines. Let’s look at the files:

$ head -n 5 basilisk.dat minotaur.dat unicorn.dat

We would like to print out the classification for each species, which is given on the second line of each file. For each file, we would need to execute the command head -n 2 and pipe this to tail -n 1. We’ll use a loop to solve this problem, but first let’s look at the general form of a loop:

for thing in list_of_things
do
    operation_using $thing    # Indentation within the loop is not required, but aids legibility
done

and we can apply this to our example like this:

$ for filename in basilisk.dat minotaur.dat unicorn.dat
> do
>    head -n 2 $filename | tail -n 1
> done
CLASSIFICATION: basiliscus vulgaris
CLASSIFICATION: bos hominus
CLASSIFICATION: equus monoceros

Follow the Prompt

The shell prompt changed from $ to > and back again as we were typing in our loop. The second prompt, >, is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;, can be used to separate two commands written on a single line, for example:
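
$ echo one; echo two
one
two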

When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the variable’s value by putting $ in front of it. The $ tells the shell interpreter to treat the variable as a variable name and substitute its value in its place, rather than treat it as text or an external command.

In this example, the list is three filenames: basilisk.dat, minotaur.dat, and unicorn.dat. Each time the loop iterates, it will assign a file name to the variable filename and run the head command. The first time through the loop, $filename is basilisk.dat. The interpreter runs the command head on basilisk.dat and pipes the first two lines to the tail command, which then prints the second line of basilisk.dat. For the second iteration, $filename becomes minotaur.dat. This time, the shell runs head on minotaur.dat and pipes the first two lines to the tail command, which then prints the second line of minotaur.dat. For the third iteration, $filename becomes unicorn.dat, so the shell runs the head command on that file, and tail on the output of that. Since the list was only three items, the shell exits the for loop.

Same Symbols, Different Meanings

Here we see > being used as a shell prompt, whereas > is also used to redirect output. Similarly, $ is used as a shell prompt, but, as we saw earlier, it is also used to ask the shell to get the value of a variable.

If the shell prints > or $ then it expects you to type something, and the symbol is a prompt.

If you type > or $ yourself, it is an instruction from you that the shell should redirect output or get the value of a variable.

When using variables it is also possible to put the names into curly braces to clearly delimit the variable name: $filename is equivalent to ${filename}, but is different from ${file}name. You may find this notation in other people’s programs.
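
For example, a quick sketch of the difference, using two illustrative variables named file and filename:

$ file=data
$ filename=basilisk.dat
$ echo ${filename}
basilisk.dat
$ echo ${file}name
dataname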

We have called the variable in this loop filename in order to make its purpose clearer to human readers. The shell itself doesn’t care what the variable is called; if we wrote this loop as:

$ for x in basilisk.dat minotaur.dat unicorn.dat
> do
>    head -n 2 $x | tail -n 1
> done

or:

$ for temperature in basilisk.dat minotaur.dat unicorn.dat
> do
>    head -n 2 $temperature | tail -n 1
> done

it would work exactly the same way. Don’t do this. Programs are only useful if people can understand them, so meaningless names (like x) or misleading names (like temperature) increase the odds that the program won’t do what its readers think it does.

Variables in Loops

This exercise refers to the data-shell-scripting/molecules directory. ls gives the following output:

cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb

What is the output of the following code?

$ for datafile in *.pdb
> do
>    ls *.pdb
> done

Now, what is the output of the following code?

$ for datafile in *.pdb
> do
>	ls $datafile
> done

Why do these two loops give different outputs?

Solution

The first code block gives the same output on each iteration through the loop. Bash expands the wildcard *.pdb within the loop body (as well as before the loop starts) to match all files ending in .pdb and then lists them using ls. The expanded loop would look like this:

$ for datafile in cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
> do
>	ls cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
> done
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb

The second code block lists a different file on each loop iteration. The value of the datafile variable is evaluated using $datafile, and then listed using ls.

cubane.pdb
ethane.pdb
methane.pdb
octane.pdb
pentane.pdb
propane.pdb

Limiting Sets of Files

What would be the output of running the following loop in the data-shell-scripting/molecules directory?

$ for filename in c*
> do
>    ls $filename
> done
  1. No files are listed.
  2. All files are listed.
  3. Only cubane.pdb, octane.pdb and pentane.pdb are listed.
  4. Only cubane.pdb is listed.

Solution

4 is the correct answer. * matches zero or more characters, so any file name starting with the letter c, followed by zero or more other characters will be matched.

How would the output differ from using this command instead?

$ for filename in *c*
> do
>    ls $filename
> done
  1. The same files would be listed.
  2. All the files are listed this time.
  3. No files are listed this time.
  4. The files cubane.pdb and octane.pdb will be listed.
  5. Only the file octane.pdb will be listed.

Solution

4 is the correct answer. * matches zero or more characters, so a file name with zero or more characters before a letter c and zero or more characters after the letter c will be matched.

Saving to a File in a Loop - Part One

In the data-shell-scripting/molecules directory, what is the effect of this loop?

for alkanes in *.pdb
do
    echo $alkanes
    cat $alkanes > alkanes.pdb
done
  1. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
  2. Prints cubane.pdb, ethane.pdb, and methane.pdb, and the text from all three files would be concatenated and saved to a file called alkanes.pdb.
  3. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
  4. None of the above.

Solution

  1. The text from each file in turn gets written to the alkanes.pdb file. However, the file gets overwritten on each loop iteration, so the final content of alkanes.pdb is the text from the propane.pdb file.

Saving to a File in a Loop - Part Two

Also in the data-shell-scripting/molecules directory, what would be the output of the following loop?

for datafile in *.pdb
do
    cat $datafile >> all.pdb
done
  1. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb would be concatenated and saved to a file called all.pdb.
  2. The text from ethane.pdb will be saved to a file called all.pdb.
  3. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be concatenated and saved to a file called all.pdb.
  4. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be printed to the screen and saved to a file called all.pdb.

Solution

3 is the correct answer. >> appends to a file, rather than overwriting it with the redirected output from a command. Given the output from the cat command has been redirected, nothing is printed to the screen.

Let’s continue with our example in the data-shell-scripting/creatures directory. Here’s a slightly more complicated loop:

$ for filename in *.dat
> do
>     echo $filename
>     head -n 100 $filename | tail -n 20
> done

The shell starts by expanding *.dat to create the list of files it will process. The loop body then executes two commands for each of those files. The first command, echo, prints its command-line arguments to standard output. For example:

$ echo hello there

prints:

hello there

In this case, since the shell expands $filename to be the name of a file, echo $filename prints the name of the file. Note that we can’t write this as:

$ for filename in *.dat
> do
>     $filename
>     head -n 100 $filename | tail -n 20
> done

because then the first time through the loop, when $filename expanded to basilisk.dat, the shell would try to run basilisk.dat as a program. Finally, the head and tail combination selects lines 81-100 from whatever file is being processed (assuming the file has at least 100 lines).

Spaces in Names

Spaces are used to separate the elements of the list that we are going to loop over. If one of those elements contains a space character, we need to surround it with quotes, and do the same thing to our loop variable. Suppose our data files are named:

red dragon.dat
purple unicorn.dat

To loop over these files, we would need to add double quotes like so:

$ for filename in "red dragon.dat" "purple unicorn.dat"
> do
>     head -n 100 "$filename" | tail -n 20
> done

It is simpler to avoid using spaces (or other special characters) in filenames.

The files above don’t exist, so if we run the above code, the head command will be unable to find them; however, the error messages returned will show the names of the files it is expecting:

head: cannot open ‘red dragon.dat’ for reading: No such file or directory
head: cannot open ‘purple unicorn.dat’ for reading: No such file or directory

Try removing the quotes around $filename in the loop above to see the effect of the quote marks on spaces. Note that we get a result from the loop command for unicorn.dat when we run this code in the creatures directory:

head: cannot open ‘red’ for reading: No such file or directory
head: cannot open ‘dragon.dat’ for reading: No such file or directory
head: cannot open ‘purple’ for reading: No such file or directory
CGGTACCGAA
AAGGGTCGCG
CAAGTGTTCC

We would like to modify each of the files in data-shell-scripting/creatures, but also save a version of the original files, naming the copies original-basilisk.dat and original-unicorn.dat. We can’t use:

$ cp *.dat original-*.dat

because that would expand to:

$ cp basilisk.dat minotaur.dat unicorn.dat original-*.dat

This wouldn’t back up our files; instead we get an error:

cp: target `original-*.dat' is not a directory

This problem arises when cp receives more than two inputs. When this happens, it expects the last input to be a directory where it can copy all the files it was passed. Since there is no directory named original-*.dat in the creatures directory we get an error.
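
As an aside, copying many files into an existing directory is exactly what this form of cp is for. For example (assuming we first create a directory called backup):

$ mkdir backup
$ cp *.dat backup/

But that is not what we want here: we want renamed copies alongside the originals.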

Instead, we can use a loop:

$ for filename in *.dat
> do
>     cp $filename original-$filename
> done

This loop runs the cp command once for each filename. The first time, when $filename expands to basilisk.dat, the shell executes:

cp basilisk.dat original-basilisk.dat

The second time, the command is:

cp minotaur.dat original-minotaur.dat

The third and last time, the command is:

cp unicorn.dat original-unicorn.dat

Since the cp command does not normally produce any output, it’s hard to check that the loop is doing the correct thing. However, we learned earlier how to print strings using echo, and we can modify the loop to use echo to print our commands without actually executing them. This lets us check what commands would be run by the unmodified loop:
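
$ for filename in *.dat
> do
>     echo cp $filename original-$filename
> done
cp basilisk.dat original-basilisk.dat
cp minotaur.dat original-minotaur.dat
cp unicorn.dat original-unicorn.dat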

The following diagram shows what happens when the modified loop is executed, and demonstrates how the judicious use of echo is a good debugging technique.

[Diagram: For Loop in Action]

Nelle’s Pipeline: Processing Files

Nelle is now ready to process her data files using goostats — a shell script written by her supervisor. This calculates some statistics from a protein sample file, and takes two arguments:

  1. an input file (containing the raw data)
  2. an output file (to store the calculated statistics)

Since she’s still learning how to use the shell, she decides to build up the required commands in stages. Her first step is to make sure that she can select the right input files — remember, these are ones whose names end in ‘A’ or ‘B’, rather than ‘Z’. Starting from her home directory, Nelle types:

$ cd north-pacific-gyre/2012-07-03
$ for datafile in NENE*[AB].txt
> do
>     echo $datafile
> done
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...
NENE02043A.txt
NENE02043B.txt

Her next step is to decide what to call the files that the goostats analysis program will create. Prefixing each input file’s name with ‘stats’ seems simple, so she modifies her loop to do that:

$ for datafile in NENE*[AB].txt
> do
>     echo $datafile stats-$datafile
> done
NENE01729A.txt stats-NENE01729A.txt
NENE01729B.txt stats-NENE01729B.txt
NENE01736A.txt stats-NENE01736A.txt
...
NENE02043A.txt stats-NENE02043A.txt
NENE02043B.txt stats-NENE02043B.txt

She hasn’t actually run goostats yet, but now she’s sure she can select the right files and generate the right output filenames.

Typing in commands over and over again is becoming tedious, though, and Nelle is worried about making mistakes, so instead of re-entering her loop, she presses the up arrow key. In response, the shell redisplays the whole loop on one line (using semi-colons to separate the pieces):

$ for datafile in NENE*[AB].txt; do echo $datafile stats-$datafile; done

Using the left arrow key, Nelle backs up and changes the command echo to bash goostats:

$ for datafile in NENE*[AB].txt; do bash goostats $datafile stats-$datafile; done

When she presses Enter, the shell runs the modified command. However, nothing appears to happen; there is no output. After a moment, Nelle realizes that since her script doesn’t print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the running command by typing Ctrl+C, presses the up arrow to repeat the command, and edits it to read:

$ for datafile in NENE*[AB].txt; do echo $datafile; bash goostats $datafile stats-$datafile; done

Beginning and End

We can move to the beginning of a line in the shell by typing Ctrl+A and to the end using Ctrl+E.

When she runs her program now, it produces one line of output every five seconds or so:

NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...

1518 files times 5 seconds each, divided by 60, tells her that her script will take about 126 minutes, or roughly two hours, to run. As a final check, she opens another terminal window, goes into north-pacific-gyre/2012-07-03, and uses cat stats-NENE01729B.txt to examine one of the output files. It looks good, so she decides to get some coffee and catch up on her reading.

Those Who Know History Can Choose to Repeat It

Another way to repeat previous work is to use the history command to get a list of the last few hundred commands that have been executed, and then to use !123 (where ‘123’ is replaced by the command number) to repeat one of those commands. For example, if Nelle types this:

$ history | tail -n 5
  456  ls -l NENE0*.txt
  457  rm stats-NENE01729B.txt.txt
  458  bash goostats NENE01729B.txt stats-NENE01729B.txt
  459  ls -l NENE0*.txt
  460  history

then she can re-run goostats on NENE01729B.txt simply by typing !458.

Other History Commands

There are a number of other shortcut commands for getting at the history.

  • Ctrl+R enters a history search mode, ‘reverse-i-search’, and finds the most recent command in your history that matches the text you enter next. Press Ctrl+R one or more additional times to search for earlier matches. You can then use the left and right arrow keys to choose that line and edit it, then hit Return to run the command.
  • !! retrieves the immediately preceding command (you may or may not find this more convenient than using the up arrow)
  • !$ retrieves the last word of the last command. That’s useful more often than you might expect: after bash goostats NENE01729B.txt stats-NENE01729B.txt, you can type less !$ to look at the file stats-NENE01729B.txt, which is quicker than pressing the up arrow and editing the command line (see the sketch below).
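
For example, a minimal sketch reusing the command above (note that the shell prints the expanded command before running it):

$ bash goostats NENE01729B.txt stats-NENE01729B.txt
$ less !$
less stats-NENE01729B.txt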

Doing a Dry Run

A loop is a way to do many things at once — or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to echo the commands it would run instead of actually running them.

Suppose we want to preview the commands the following loop will execute without actually running those commands:

$ for datafile in *.pdb
> do
>   cat $datafile >> all.pdb
> done

What is the difference between the two loops below, and which one would we want to run?

# Version 1
$ for datafile in *.pdb
> do
>   echo cat $datafile >> all.pdb
> done
# Version 2
$ for datafile in *.pdb
> do
>   echo "cat $datafile >> all.pdb"
> done

Solution

The second version is the one we want to run. This prints to screen everything enclosed in the quote marks, expanding the loop variable name because we have prefixed it with a dollar sign.

The first version appends the output from the command echo cat $datafile to the file all.pdb. This file will just contain the list: cat cubane.pdb, cat ethane.pdb, cat methane.pdb, and so on.

Try both versions for yourself to see the output! Be sure to open the all.pdb file to view its contents.

Nested Loops

Suppose we want to set up a directory structure to organize some experiments measuring reaction rate constants with different compounds and different temperatures. What would be the result of the following code:

$ for species in cubane ethane methane
> do
>     for temperature in 25 30 37 40
>     do
>         mkdir $species-$temperature
>     done
> done

Solution

We have a nested loop, i.e. a loop contained within another loop, so for each species in the outer loop, the inner loop (the nested loop) iterates over the list of temperatures, and creates a new directory for each combination.

Try running the code for yourself to see which directories are created!

Key Points

  • A for loop repeats commands once for every thing in a list.

  • Every for loop needs a variable to refer to the thing it is currently operating on.

  • Use $name to expand a variable (i.e., get its value). ${name} can also be used.

  • Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as it complicates variable expansion.

  • Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.

  • Use the up-arrow key to scroll up through previous commands to edit and repeat them.

  • Use Ctrl+R to search through the previously entered commands.

  • Use history to display recent commands, and !number to repeat a command by number.


Shell Scripts

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I save and re-use commands?

Objectives
  • Write a shell script that runs a command or series of commands for a fixed set of files.

  • Run a shell script from the command line.

  • Write a shell script that operates on a set of files defined by the user on the command line.

  • Create pipelines that include shell scripts you, and others, have written.

We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.

Let’s start by going back to molecules/ and creating a new file, middle.sh which will become our shell script:

$ cd molecules
$ nano middle.sh

The command nano middle.sh opens the file middle.sh within the text editor ‘nano’ (which runs within the shell). If the file does not exist, it will be created. We can use the text editor to directly edit the file – we’ll simply insert the following line:

head -n 15 octane.pdb | tail -n 5

This is a variation on the pipe we constructed earlier: it selects lines 11-15 of the file octane.pdb. Remember, we are not running it as a command just yet: we are putting the commands in a file.

Then we save the file (Ctrl-O in nano), and exit the text editor (Ctrl-X in nano). Check that the directory molecules now contains a file called middle.sh.

Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash, so we run the following command:

$ bash middle.sh
ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00

Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.

Text vs. Whatever

We usually call programs like Microsoft Word or LibreOffice Writer “text editors”, but we need to be a bit more careful when it comes to programming. By default, Microsoft Word uses .docx files to store not only text, but also formatting information about fonts, headings, and so on. This extra information isn’t stored as characters, and doesn’t mean anything to tools like head: they expect input files to contain nothing but the letters, digits, and punctuation on a standard computer keyboard. When editing programs, therefore, you must either use a plain text editor, or be careful to save files as plain text.

What if we want to select lines from an arbitrary file? We could edit middle.sh each time to change the filename, but that would probably take longer than typing the command out again in the shell and executing it with a new file name. Instead, let’s edit middle.sh and make it more versatile:

$ nano middle.sh

Now, within “nano”, replace the text octane.pdb with the special variable called $1:

head -n 15 "$1" | tail -n 5

Inside a shell script, $1 means ‘the first filename (or other argument) on the command line’. We can now run our script like this:

$ bash middle.sh octane.pdb
ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00

or on a different file like this:

$ bash middle.sh pentane.pdb
ATOM      9  H           1       1.324   0.350  -1.332  1.00  0.00
ATOM     10  H           1       1.271   1.378   0.122  1.00  0.00
ATOM     11  H           1      -0.074  -0.384   1.288  1.00  0.00
ATOM     12  H           1      -0.048  -1.362  -0.205  1.00  0.00
ATOM     13  H           1      -1.183   0.500  -1.412  1.00  0.00

Double-Quotes Around Arguments

For the same reason that we put the loop variable inside double-quotes, in case the filename happens to contain any spaces, we surround $1 with double-quotes.

Currently, we need to edit middle.sh each time we want to adjust the range of lines that is returned. Let’s fix that by configuring our script to instead use three command-line arguments. These are accessible via the special variables $1, $2, and $3, which refer to the first, second, and third command-line arguments, respectively.

Knowing this, we can use additional arguments to define the range of lines to be passed to head and tail respectively:

$ nano middle.sh
head -n "$2" "$1" | tail -n "$3"

We can now run:

$ bash middle.sh pentane.pdb 15 5
ATOM      9  H           1       1.324   0.350  -1.332  1.00  0.00
ATOM     10  H           1       1.271   1.378   0.122  1.00  0.00
ATOM     11  H           1      -0.074  -0.384   1.288  1.00  0.00
ATOM     12  H           1      -0.048  -1.362  -0.205  1.00  0.00
ATOM     13  H           1      -1.183   0.500  -1.412  1.00  0.00

By changing the arguments to our command we can change our script’s behaviour:

$ bash middle.sh pentane.pdb 20 5
ATOM     14  H           1      -1.259   1.420   0.112  1.00  0.00
ATOM     15  H           1      -2.608  -0.407   1.130  1.00  0.00
ATOM     16  H           1      -2.540  -1.303  -0.404  1.00  0.00
ATOM     17  H           1      -3.393   0.254  -0.321  1.00  0.00
TER      18              1

This works, but it may take the next person who reads middle.sh a moment to figure out what it does. We can improve our script by adding some comments at the top:

$ nano middle.sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"

A comment starts with a # character and runs to the end of the line. The computer ignores comments, but they’re invaluable for helping people (including your future self) understand and use scripts. The only caveat is that each time you modify the script, you should check that the comment is still accurate: an explanation that sends the reader in the wrong direction is worse than none at all.

What if we want to process many files in a single pipeline? For example, if we want to sort our .pdb files by length, we would type:

$ wc -l *.pdb | sort -n

because wc -l lists the number of lines in the files (recall that wc stands for ‘word count’, and adding the -l option means ‘count lines’ instead) and sort -n sorts things numerically. We could put this in a file, but then it would only ever sort a list of .pdb files in the current directory. If we want to be able to get a sorted list of other kinds of files, we need a way to get all those names into the script. We can’t use $1, $2, and so on because we don’t know how many files there are. Instead, we use the special variable $@, which means ‘all of the command-line arguments to the shell script’. We should also put $@ inside double-quotes to handle the case of arguments containing spaces ("$@" is special syntax and is equivalent to "$1" "$2" …).

Here’s an example:

$ nano sorted.sh
# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
$ bash sorted.sh *.pdb ../creatures/*.dat
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
163 ../creatures/basilisk.dat
163 ../creatures/minotaur.dat
163 ../creatures/unicorn.dat
596 total

List Unique Species

Leah has several hundred data files, each of which is formatted like this:

2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1

An example of this type of file is given in data-shell/data/animal-counts/animals.txt.

We can use the command cut -d , -f 2 animals.txt | sort | uniq to produce the unique species in animals.txt. In order to avoid having to type out this series of commands every time, a scientist may choose to write a shell script instead.

Write a shell script called species.sh that takes any number of filenames as command-line arguments, and uses a variation of the above command to print a list of the unique species appearing in each of those files separately.

Solution

# Script to find unique species in csv files where species is the second data field
# This script accepts any number of file names as command line arguments

# Loop over all files
for file in "$@"
do
    echo "Unique species in $file:"
    # Extract species names
    cut -d , -f 2 "$file" | sort | uniq
done

Suppose we have just run a series of commands that did something useful — for example, that created a graph we’d like to use in a paper. We’d like to be able to re-create the graph later if we need to, so we want to save the commands in a file. Instead of typing them in again (and potentially getting them wrong) we can do this:

$ history | tail -n 5 > redo-figure-3.sh

The file redo-figure-3.sh now contains:

297 bash goostats NENE01729B.txt stats-NENE01729B.txt
298 bash goodiff stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt
299 cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
301 history | tail -n 5 > redo-figure-3.sh

After a moment’s work in an editor to remove the serial numbers on the commands, and to remove the final line where we called the history command, we have a completely accurate record of how we created that figure.

Why Record Commands in the History Before Running Them?

If you run the command:

$ history | tail -n 5 > recent.sh

the last command in the file is the history command itself, i.e., the shell has added history to the command log before actually running it. In fact, the shell always adds commands to the log before running them. Why do you think it does this?

Solution

If a command causes something to crash or hang, it might be useful to know what that command was, in order to investigate the problem. Were the command only recorded after running it, we would not have a record of the last command run in the event of a crash.

In practice, most people develop shell scripts by running commands at the shell prompt a few times to make sure they’re doing the right thing, then saving them in a file for re-use. This style of work allows people to recycle what they discover about their data and their workflow with one call to history and a bit of editing to clean up the output and save it as a shell script.

Nelle’s Pipeline: Creating a Script

Nelle’s supervisor insisted that all her analyses must be reproducible. The easiest way to capture all the steps is in a script.

First we return to Nelle’s data directory:

$ cd ../north-pacific-gyre/2012-07-03/

She runs the editor and writes the following:

# Calculate stats for data files.
for datafile in "$@"
do
    echo $datafile
    bash goostats $datafile stats-$datafile
done

She saves this in a file called do-stats.sh so that she can now re-do the first stage of her analysis by typing:

$ bash do-stats.sh NENE*[AB].txt

She can also do this:

$ bash do-stats.sh NENE*[AB].txt | wc -l

so that the output is just the number of files processed rather than the names of the files that were processed.

One thing to note about Nelle’s script is that it lets the person running it decide what files to process. She could have written it as:

# Calculate stats for Site A and Site B data files.
for datafile in NENE*[AB].txt
do
    echo $datafile
    bash goostats $datafile stats-$datafile
done

The advantage is that this always selects the right files: she doesn’t have to remember to exclude the ‘Z’ files. The disadvantage is that it always selects just those files — she can’t run it on all files (including the ‘Z’ files), or on the ‘G’ or ‘H’ files her colleagues in Antarctica are producing, without editing the script. If she wanted to be more adventurous, she could modify her script to check for command-line arguments, and use NENE*[AB].txt if none were provided. Of course, this introduces another tradeoff between flexibility and complexity.
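
One possible sketch of that last idea, using the $# argument counter and the if construct covered in the Flow Control episode below (set -- replaces the script’s positional arguments, so the rest of the script is unchanged):

# Calculate stats for data files.
# If no files are given on the command line, default to the Site A and B files.
if [[ $# -eq 0 ]]
then
    set -- NENE*[AB].txt
fi

for datafile in "$@"
do
    echo $datafile
    bash goostats $datafile stats-$datafile
done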

Variables in Shell Scripts

In the molecules directory, imagine you have a shell script called script.sh containing the following commands:

head -n $2 $1
tail -n $3 $1

While you are in the molecules directory, you type the following command:

bash script.sh '*.pdb' 1 1

Which of the following outputs would you expect to see?

  1. All of the lines between the first and the last lines of each file ending in .pdb in the molecules directory
  2. The first and the last line of each file ending in .pdb in the molecules directory
  3. The first and the last line of each file in the molecules directory
  4. An error because of the quotes around *.pdb

Solution

The correct answer is 2.

The special variables $1, $2 and $3 represent the command line arguments given to the script, such that the commands run are:

$ head -n 1 cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
$ tail -n 1 cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

The shell does not expand '*.pdb' because it is enclosed by quote marks. As such, the first argument to the script is '*.pdb', which is expanded within the script when the shell runs head and tail.

Find the Longest File With a Given Extension

Write a shell script called longest.sh that takes the name of a directory and a filename extension as its arguments, and prints out the name of the file with the most lines in that directory with that extension. For example:

$ bash longest.sh /tmp/data pdb

would print the name of the .pdb file in /tmp/data that has the most lines.

Solution

# Shell script which takes two arguments:
#    1. a directory name
#    2. a file extension
# and prints the name of the file in that directory
# with the most lines which matches the file extension.

wc -l $1/*.$2 | sort -n | tail -n 2 | head -n 1

The first part of the pipeline, wc -l $1/*.$2 | sort -n, counts the lines in each file and sorts them numerically (largest last). When there’s more than one file, wc also outputs a final summary line, giving the total number of lines across all files. We use tail -n 2 | head -n 1 to throw away this last line.

With wc -l $1/*.$2 | sort -n | tail -n 1 we would see the final summary line instead: building our pipeline up in pieces like this helps us be sure we understand the output.

Script Reading Comprehension

For this question, consider the data-shell/molecules directory once again. This contains a number of .pdb files in addition to any other files you may have created. Explain what each of the following three scripts would do when run as bash script1.sh *.pdb, bash script2.sh *.pdb, and bash script3.sh *.pdb respectively.

# Script 1
echo *.*
# Script 2
for filename in $1 $2 $3
do
    cat $filename
done
# Script 3
echo $@.pdb

Solutions

In each case, the shell expands the wildcard in *.pdb before passing the resulting list of file names as arguments to the script.

Script 1 would print out a list of all files containing a dot in their name. The arguments passed to the script are not actually used anywhere in the script.

Script 2 would print the contents of the first 3 files with a .pdb file extension. $1, $2, and $3 refer to the first, second, and third argument respectively.

Script 3 would print all the arguments to the script (i.e. all the .pdb files), followed by .pdb. $@ refers to all the arguments given to a shell script.

cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb.pdb

Debugging Scripts

Suppose you have saved the following script in a file called do-errors.sh in Nelle’s north-pacific-gyre/2012-07-03 directory:

# Calculate stats for data files.
for datafile in "$@"
do
    echo $datfile
    bash goostats $datafile stats-$datafile
done

When you run it:

$ bash do-errors.sh NENE*[AB].txt

the output is blank. To figure out why, re-run the script using the -x option:

bash -x do-errors.sh NENE*[AB].txt

What is the output showing you? Which line is responsible for the error?

Solution

The -x option causes bash to run in debug mode. This prints out each command as it is run, which will help you to locate errors. In this example, we can see that echo isn’t printing anything. We have made a typo in the loop variable name, and the variable datfile doesn’t exist, hence returning an empty string.

Key Points

  • Save commands in files (usually called shell scripts) for re-use.

  • bash filename runs the commands saved in a file.

  • $@ refers to all of a shell script’s command-line arguments.

  • $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.

  • Place variables in quotes if the values might have spaces in them.

  • Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.


Flow Control

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How can I control the flow of a Bash program?

Objectives
  • Understand the basic Bash conditional structure.

  • Become familiar with common operators to compare strings and integers.

  • Become familiar with common operators to test the existence of files and directories.

The if statement

A programming language, even a simple one, requires the ability to change its functionality depending upon certain conditions. For example, if a particular file exists or if the number represented by a variable is greater than some value then perform some action, otherwise perform a different action. In this section we shall look at ways of determining the flow of a script.

We have already seen how variables can be assigned and printed. But we may want to test the value of a variable in order to determine how to proceed. In this case we can use the if statement to test the validity of an expression. A typical shell comparison to test the validity of CONDITION has the form:

if [[ CONDITION ]]; then
    echo "Condition is true"
fi

Where CONDITION is typically a construct that uses two arguments which are compared using a comparison operator. Some of the more common comparison operators used to form conditions are summarized in the following table:

Arithmetic                        String
-eq  equals                       ==  equals
-ne  not equal to                 !=  not equal to
-lt  less than                    <   less than
-gt  greater than                 >   greater than
-le  less than or equal to
-ge  greater than or equal to

It is also possible to add an additional default action to be executed in case our test is not satisfied, for this we use the else statement:

if [[ CONDITION ]]; then
    echo "Condition is true"
else
    echo "Condition is false"
fi

For example, we can perform a comparison on two integers. Copy the following commands into a Bash script called comparing-integers.sh (you can also try executing them directly in the terminal, as we did with for loops):

if [[ $1 -eq $2 ]]
then
    echo ${1} is equal to ${2}
elif [[ $1 -gt $2 ]]
then
    echo ${1} is greater than ${2}
elif [[ $1 -lt $2 ]]
then
    echo ${1} is less than ${2}
fi

And execute it like this (try also with other numbers):

$ bash comparing-integers.sh 10 3
10 is greater than 3

We can also compare strings. Copy the following commands to a script called comparing-strings.sh:

if [[ $1 == $2 ]]
then
    echo strings are equal
else
    echo strings are different
fi

And execute it like this:

$ bash comparing-strings.sh dog cat
strings are different

More on Integer and String comparisons

Consider the following code snippet; it demonstrates some basic string and integer comparisons, with branching code depending upon the outcome. We first define two variables X and Y (you could also use $1 and $2 to access arguments passed to the script) and assign them integer values (remember that to Bash they are still strings).

The next step is to build an if…then…else construct to test our variables using an arithmetic comparison operator. Specifically, using -eq lets Bash know that the values stored in the variables are to be treated as numbers. If X is equal to Y the script performs one action, if not, then it performs another.

#!/bin/bash
#Declare two integers for testing
X=3
Y=10

#Perform a comparison on the integers
if [[ $X -eq $Y ]]; then
    echo "${X} equals ${Y}"
else
    echo "${X} does not equal ${Y}"
fi

Now try the following:

  • Use some of the other comparison operators and see if your results meet your expectations. What outcome would you expect when using the string comparison operators < or >?
  • Test different numbers, for example, compare 10 and 3; "10" and "3" (including the quotes); "10" and " 3" (notice the space in front of the 3); "40" and "3". What kind of results do you obtain?

Solution

Notice that in the first case the results are not necessarily what we expect, since the characters are compared in alphabetical (ASCII) order when we use a string comparison operator.

In the second case we can confirm this by noticing that Bash (when instructed to compare the numbers as strings) compares the first character of each string: if the one on the left has a lower value the comparison is true, and if it has a greater value it is false; if the characters are the same, Bash compares the second character, and so on.

A common practical application of if statements in scripts is to add a help flag that prints some useful information about the script. For example:

if [[ "$1" == "--help" ]]
    then
    echo "Returns the file with the most number"
    echo "of lines in the list provided"
    echo "To execute:"
    echo "print-largest.sh <list-of-files>"
fi

wc -l $@ | sort -n | tail -1

The above script should print the help message when executed like this:

$ bash print-largest.sh --help

But you might see an additional error message reminding you about how wc works:

Returns the file with the largest number
of lines in the list provided
To execute:
print-largest.sh <list-of-files>
wc: illegal option -- -
usage: wc [-clmw] [file ...]

What is happening in this case is that Bash evaluates the conditional test in our if statement and executes the commands if the condition is true; however, our script doesn’t end there, and Bash continues executing any commands that follow the if statement. If we would like to exit our script at some point (e.g. in case some condition is not satisfied), we can use the exit command to instruct Bash to quit the script at that point. exit takes a numeric argument that is used to specify what caused the program to exit (the convention is to use 0 if everything is OK; this is also the value your script returns if it finishes without errors. Try echoing $? after executing your script to take a look at this value). We can tell Bash to exit our script after printing the help message in this way:

if [[ "$1" == "--help" ]]
    then
    echo "Returns the file with the most number"
    echo "of lines in the list provided"
    echo "To execute:"
    echo "print-largest.sh <list-of-files>"
    exit 0
fi

wc -l $@ | sort -n | tail -1

$ bash print-largest.sh --help
Returns the file with the largest number
of lines in the list provided
To execute:
print-largest.sh <list-of-files>

This time the script exits immediately after printing the help message, without trying to execute the wc command.
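
We can check the exit status the script returned by inspecting the special variable $?:

$ echo $?
0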

Counting arguments with $#

One further Bash feature that is useful when working with scripts that take external arguments is the $# variable, which holds the number of arguments passed to the script. For example, try the following script (save it to counting-arguments.sh):

echo the number of arguments received is $#

And run it like this (try with a different number of arguments):

$ bash counting-arguments.sh argument1 argument2 argument3
the number of arguments received is 3

We can use $# to make our help-printing if statement even more flexible. For example, imagine a new researcher has just received our script and has no idea what to do with it; they might be tempted to run the script like this and see what happens:

$ bash print-largest.sh

But this would lead to the script hanging! (You can cancel it with Ctrl+C.) A more useful default behaviour would be that, if the script requires arguments to work, running it with no arguments causes the help message to be printed. We can do this with $# like this:

if [[ "$1" == "--help" ]] || [[ $# -eq 0 ]]
    then
    echo "Returns the file with the most number"
    echo "of lines in the list provided"
    echo "To execute:"
    echo "print-largest.sh <list-of-files>"
    exit 0
fi

wc -l $@ | sort -n | tail -1

$ bash print-largest.sh
Returns the file with the largest number
of lines in the list provided
To execute:
print-largest.sh <list-of-files>

That’s a more useful default behaviour! Notice the if structure we have used, where we have included a new operator || (two vertical lines; look for the | key on your keyboard) that works as a logical OR (there is an equivalent && operator that works as a logical AND) and lets us combine two or more comparison tests.
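
As a quick sketch of the && operator, the following (hypothetical) test succeeds only when both conditions hold, e.g. checking that the first argument lies between 0 and 100:

if [[ $1 -ge 0 ]] && [[ $1 -le 100 ]]
then
    echo "$1 is between 0 and 100"
fi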

Other Conditional Structures

So far we have seen how to use the basic Bash if…then…else…fi construct and the elif ladder. However, there are other Bash constructs that could be useful depending on the case under consideration:

  • Case statements. A slightly more complex conditional structure, useful when we have several possible options. The general structure is shown below, followed by a short example:
case EXPRESSION in
  CASE1)
    COMMAND-LIST;;
  CASE2)
    COMMAND-LIST;;
  CASEN) 
    COMMAND-LIST;;
  * )
    COMMAND-LIST;;
esac
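
For instance, a minimal sketch (saved to a hypothetical script called check-file.sh) that reports a file’s type based on its extension:

case "$1" in
  *.pdb)
    echo "$1 looks like a protein data bank file";;
  *.dat)
    echo "$1 looks like a data file";;
  *)
    echo "$1 is of an unknown type";;
esac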

File Test Operators

In addition to variable comparisons, there are other comparison operators that can be used to query the existence and attributes of files. This would allow the script author to, for example, test whether a file exists before trying to read it and potentially producing an error. The table below summarizes some of these operators:

Operator  ARGUMENT   Purpose
-d        DIRECTORY  Test for existence of a directory
-e        FILENAME   Test for existence of a file or directory
-f        FILENAME   Test for existence of a file
-r        FILENAME   Test if file is readable
-w        FILENAME   Test if file is writable
-x        FILENAME   Test if file is executable

For example, to test if the directory molecules exists:

$ if [[ -e molecules ]]
> then
>     echo molecules exists
> fi

Another common task is to identify directories in a certain location. Try typing the following for loop in a script called search-directories.sh and run it in our data-shell-scripting directory:

for filename in $@
do
    if [[ -d $filename ]]
    then
        echo $filename is a directory
    fi
done

It is also possible to check for the negative outcome of a test by preceding the statement with a ! symbol. For example:

if [[ ! -f myfile.txt ]]; then
  echo "File does not exist"
fi

Logging a directory’s content

Try modifying the above script to create a log file within an identified directory. The log file should contain the names of the files inside the directory. Avoid overwriting the log file if it already exists.

Solution

for filename in $@
do
    if [[ -d $filename ]]
    then
       echo $filename is a directory
       if [[ ! -f $filename/${filename}.log ]]
       then
           echo creating ${filename}.log
           cd $filename
           ls > ${filename}.log
           cd ..
       else
           echo "Warning: ${filename}.log is already present!"
       fi
    fi
done

The above script shows an example of using a few file test operators to check for the existence of directories and files before trying to perform an action (creating a log file). If the file test were not performed and the file already existed, we would overwrite potentially valuable data. Checking for its existence first allows us to throw a warning in this case and perhaps perform another action (e.g. backing up the log file already present).

Key Points

  • The basic conditional structure in Bash is built as: if…then…else…fi.

  • Bash has operators specific for string and integer comparisons.

  • Bash also has comparison operators useful to test the existence of files and directories.


Arithmetic and Arrays

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How can I perform basic arithmetic operations in Bash?

  • How can I define arrays?

Objectives
  • Understand Bash variable types and how to construct arithmetic operations.

  • Understand how to construct Bash arrays and access their contents.

Bash Variables

Before going into detail about how to perform arithmetic operations and define arrays in Bash, it is useful to quickly recap how variables are used in Bash. As in other languages, Bash variables allow the script’s author to refer to data by a label. In Bash, this assignment is performed using the = symbol (with no spaces around it). We can demonstrate this on the command line, using the export command to set the variable:

$ export MYTEXT='Hello'

The contents of a variable can then be obtained by a process called dereferencing or variable substitution. To return the value assigned to a variable, prefix the label with the $ symbol, i.e.:

$ echo $MYTEXT
Hello

More on Bash variables

If you are familiar with a language like C++ or Fortran, where variables have explicit types such as integers or characters, you may be surprised that the contents of a Bash variable may be assigned with no declaration or preamble. This is because Bash is what is known as an untyped language. Essentially, all variables are stored as strings and their contents are treated differently depending on context. So, for example, if we try to print the contents of a variable as above using the echo command, Bash treats it like a text string; if we try to multiply it by 2, Bash will treat it as a number.

Needless to say, whilst this can simplify the process of creating and assigning variables, it requires the author to be more careful in how variables are treated, as multiplying a string by 2 will not produce an error as it would in a strongly typed language.
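
For example, using the arithmetic expansion construct introduced in the next section, a string that cannot be interpreted as a number silently evaluates to zero rather than producing an error:

$ MYTEXT='Hello'
$ echo $((MYTEXT * 2))
0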

Arithmetic Expansion

Now that we know Bash treats variables according to context, let’s see how we can perform arithmetic operations whereby Bash regards the contents as a number. The format for Bash arithmetic expansion is:

$(( arithmetic expression ))

For example:

$ echo $((1 + 3))
4

Arithmetic operations are surrounded by the $((…)) construct. Notice that the result of this expression can again be stored in a variable:

$ MYVAR=$((1 + 3))
$ echo $MYVAR
4

Without this construct Bash would treat 1 + 3 as a string and print it accordingly:

$ echo 1 + 3
1 + 3

Consider the following script. Here we define two variables X and Y with integer values and proceed to use the construct $((…)) so that Bash understands that we want to treat the variables as numbers and wish to perform some simple mathematical operations. We store the result of these operations in a variable RESULT and use it in further arithmetic operations.

#!/bin/bash
# Assign two variables with integer values.
X=7
Y=12

# Add these two variables together, store the result in a third variable,
# and print out the result
RESULT=$((X+Y))
echo "X + Y = ${RESULT}!"

# Increment the result by 5.
RESULT=$((RESULT+5))
echo "RESULT is now ${RESULT}"

# Divide the result by 5. Remember bash only deals with integers.
DIVISION=$((RESULT/5))
echo "${RESULT} divided by 5 is ${DIVISION}"

Running this script should produce the following result:

X + Y = 19!
RESULT is now 24
24 divided by 5 is 4

Exercise

Now try modifying the previous script to perform some other calculations. How would you expect to be able to multiply two numbers, or take one number away from another?

Solution

#!/bin/bash
#Add these two variables together, store the result in a third variable,
#and print out the result
RESULT=$(($1+$2))
echo "$1 + $2 = ${RESULT}!"

#Increment the result by 5.
RESULT=$((RESULT+5))
echo "RESULT is now ${RESULT}"

#Divide the result by 5. Remember bash only deals with integers.
DIVISION=$((RESULT/5))
echo "${RESULT} divided by 5 is ${DIVISION}"

#Solution
MULTIPLY=$(($1 * $2))
echo ${MULTIPLY}

SUBTRACT=$(($1-$2))
echo ${SUBTRACT}

Counters

A useful Bash arithmetic feature is its ability to post- and pre-increment or decrement variables, similar to other languages such as C++. For example, try the following:

$ MYCOUNTER=0
$ echo $((++MYCOUNTER))
1
$ echo $MYCOUNTER
1
$ echo $((++MYCOUNTER))
2
$ echo $MYCOUNTER
2

Now try:

$ MYCOUNTER=0
$ echo $((MYCOUNTER++))
0
$ echo $MYCOUNTER
1
$ echo $((MYCOUNTER++))
1
$ echo $MYCOUNTER
2

In the first instance we used a pre-increment operator, where the variable is incremented by 1 before a command is executed (in this case echo), while in the second example we used a post-increment operator, where the operation/command is executed first and then the variable is incremented. These operators can be particularly useful when working with Bash C-style for loops, for example:

$ for ((i = 0 ; i < 5 ; i++)); do
>   echo "my counter has a value of $i"
> done
my counter has a value of 0
my counter has a value of 1
my counter has a value of 2
my counter has a value of 3
my counter has a value of 4

You can find a more comprehensive list of Bash arithmetic operators here.

Sequences

The seq command allows you to print sequences of numbers. Try the following:

$ seq 1 5
1
2
3
4
5

You can also define an increment:

$ seq 1 2 20
1
3
5
7
9
11
13
15
17
19

This can be useful for example if you need to run a command a defined number of times in a for loop:

$ for i in $(seq 1 10)
> do
>    echo running iteration $i
> done
running iteration 1
running iteration 2
running iteration 3
running iteration 4
running iteration 5
running iteration 6
running iteration 7
running iteration 8
running iteration 9
running iteration 10

Arrays

Arrays allow the script author to associate a number of values with a single label. This means you only need to remember one variable name instead of many, which can come in particularly useful if you have potentially hundreds of values you want to store, such as the words in a text file you have just read in.

In Bash, arrays are one-dimensional, zero-indexed, and may be sparse, and much like variables they don’t require formal declaration in order to use them.

Array feature   Description
Dimensionality  The number of indices used to locate an array element. For example, a two-dimensional array could describe a set of rows and columns and have indexes i, j.
Indexing        The number of the first element in an array. For example, an array of ten elements can be referred to as elements 0 to 9 or 1 to 10. The former is zero-based indexing and the latter is one-based indexing.
Layout          An array in which every possible element is defined is fully populated. One in which only certain elements, i.e. three out of ten, have values is sparse.

For the most part you won’t need to worry about any of this, but if you’re familiar with a language in which arrays are more tightly defined or have additional features, Bash arrays may seem quite simple and loosely defined.

Creating an array in Bash is as simple as enclosing the elements within a pair of brackets, ( ). We can demonstrate this on the command line. For example:

$ MY_ARRAY=(1 2 3 'Hawk')

This example demonstrates another potentially unexpected feature of Bash arrays: the elements can appear to be of different types. But remember, we learned that Bash is an untyped language, so this is simply a consequence of that, and a potentially useful one.

Accessing the contents of an array element can be achieved using the following construct:

$ echo ${MY_ARRAY[3]}
Hawk

Finding the size of an array is very useful when using them as part of loops. This can be done with the construct ${#ArrayName[@]}. For example:

$ echo ${#MY_ARRAY[@]}
4
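
Putting the pieces together, here is a small sketch that uses the array length as the limit of the C-style loop we met earlier to visit each element in turn:

$ for ((i = 0; i < ${#MY_ARRAY[@]}; i++)); do
>   echo "Element $i is ${MY_ARRAY[$i]}"
> done
Element 0 is 1
Element 1 is 2
Element 2 is 3
Element 3 is Hawk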

Here’s a script that shows some additional ways of accessing array contents.

#!/bin/bash
# Define a one dimensional array
MY_ARRAY=(1 2 3 'raven')

# Print out the contents of the third element
echo "The third element is ${MY_ARRAY[2]}."

# Return the number of elements in an array
echo "The array contains ${#MY_ARRAY[@]} elements"

# Print the contents of the entire array.
echo "The array consists of: ${MY_ARRAY[@]}"

# Define a sparse array and print out the contents
SPARSE_AR[0]=50
SPARSE_AR[3]='some words'
echo "Index 0 of spare array is ${SPARSE_AR[0]}."
echo "Index 1 of spare array is ${SPARSE_AR[1]}."
echo "Index 2 of spare array is ${SPARSE_AR[2]}."
echo "Index 3 of spare array is ${SPARSE_AR[3]}."

That produces the following output:

The third element is 3.
The array contains 4 elements
The array consists of: 1 2 3 raven
Index 0 of sparse array is 50.
Index 1 of sparse array is .
Index 2 of sparse array is .
Index 3 of sparse array is some words.

Notice how, even though the fourth element of the sparse array consists of two words, it is still considered a single element because we enclosed the words in a single pair of quotes. We'll see in the section on Flow Control how we can use another form of array, called a list, to loop over blocks of commands and repeat operations without having to type out those commands multiple times.
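
As a quick sketch of that point (assuming the SPARSE_AR assignments from the script above have been entered at the prompt), looping over the quoted expansion "${SPARSE_AR[@]}" visits only the defined elements and keeps 'some words' together as a single item:

$ for element in "${SPARSE_AR[@]}"; do
>   echo "Element: ${element}"
> done
Element: 50
Element: some words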

Exercise

Consider the previous script. Try defining your own arrays and manipulating them. Can you assign an array element to another variable or include the contents of a variable as an array element?

Solution

#!/bin/bash

# Define a sparse array and print out the contents
SPARSE_AR[0]=50
SPARSE_AR[3]='some words'

VAR1="${SPARSE_AR[0]}"
echo "${VAR1}"

VAR2=77

SPARSE_AR[4]="${VAR2}"
echo "${SPARSE_AR[4]}"

Key Points

  • Bash is an untyped language. This means that all variables are stored as strings.

  • The $(( )) construct is used to perform arithmetic operations.

  • Bash arrays are created by enclosing the elements within a pair of brackets, ( ).

  • We can find the size of a Bash array with the construct ${#ArrayName[@]}.


Functions and External Tools

Overview

Teaching: 25 min
Exercises: 10 min
Questions
  • How can I create functions for easy, repeated access to common tasks?

Objectives
  • Understand the syntax of Bash functions and how we can use them in our scripts.

  • Understand how we can assign the output of Bash commands to variables and use them in our scripts.

Functions

Functions allow tasks, corresponding to a number of individual operations, to be represented by a single label.

Generally, a function takes one or more variables as input and in most programming languages can potentially return a value. They can be thought of as a way of creating user-defined commands which can be repeatedly used to process different sets of input. A typical bash function has the following structure:

function function_name {
    <commands>
}

A couple of key points regarding Bash functions: a function must be defined before it can be called, and arguments are passed positionally, being accessed inside the function as $1, $2, and so on, just like command line arguments to a script.

The following code shows an example of a simple user-defined function, and how it is used by a script.

#!/bin/bash

# Define a simple function that simply prefixes any string it receives with the current date
function datestamp {
  # The first variable passed to the function is stored as $1, as
  # in the case of command line arguments
  STR_RECV="$1"

  # Store the current date in a suitable format
  DATE_FMT=$(date +'%d/%m/%y')

  # Print out the combined date stamp plus string
  echo "${DATE_FMT}; ${STR_RECV}"
}

# Call the function with an example string
datestamp "Here is some text"

Exercise

Create a script that allows you to pass a series of strings to it and prefix each one with the current date and time. This sort of operation can be useful when logging messages from your code so you can tell exactly when a potential problem occurred. Can you read in a file and prefix each line of the file with the date?

Solution

#!/bin/bash

# Define a simple function that simply prefixes any string it receives with the current date
function datestamp {
  # The first variable passed to the function is stored as $1, as
  # in the case of command line arguments
  STR_RECV="$1"

  # Store the current date in a suitable format
  DATE_FMT=$(date +'%d/%m/%y-%H:%M:%S')

  # Print out the combined date stamp plus string
  echo "${DATE_FMT}; ${STR_RECV}"
}

## Solution 1
# Date stamp each of the strings passed on the command line
for arg in "$@"; do
  datestamp "${arg}"
done

This script should produce an output similar to:

$ bash functions_01_solution_01.sh command1 command2 command3
14/10/20-11:14:34; command1
14/10/20-11:14:34; command2
14/10/20-11:14:34; command3

An alternative solution could be to read the strings from a text file (e.g. date_example_input.txt):

#!/bin/bash

# Define a simple function that simply prefixes any string it receives with the current date
function datestamp {
  # The first variable passed to the function is stored as $1, as
  # in the case of command line arguments
  STR_RECV="$1"

  # Store the current date in a suitable format
  DATE_FMT=$(date +'%d/%m/%y-%H:%M:%S')

  # Print out the combined date stamp plus string
  echo "${DATE_FMT}; ${STR_RECV}"
}

## Solution 2
cat date_example_input.txt | while read -r line; do
    datestamp "${line}"
done

This script should produce an output similar to:

$ bash functions_01_solution_02.sh
14/10/20-11:14:34; 1
14/10/20-11:14:34; 2
14/10/20-11:14:34; 3

Integrating external tools

By now we have covered the most commonly used features of the Bash syntax. But one of the most useful aspects of shell scripting is the ability to integrate shell commands and either pass arguments to them or take output from them to use in the script.

In Bash, the output of an external command can be captured by enclosing the command within a $( ) construct, known as command substitution. For example:

#!/bin/bash

# Invoke the external command 'ls' to list the contents of
# a directory and store the output as a shell string.
DIRCONT=$(ls -1)

# Loop over the contents of the directory and print each file found
# to the screen.
for filename in ${DIRCONT}; do
    echo "File Found: ${filename}"
done

In this example, instead of explicitly specifying the files we want to look at, we have dynamically obtained a list of the contents of the current directory.

This means we don't have to check beforehand whether each file exists, and we can run the same code on different groups of files without modification.

This provides a useful template for other commands to be inserted into the for loop. Instead of just printing out the name of the file we could read in the contents using the cat command, copy it to another location, or append some rows of figures.
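
For example, here is a minimal sketch that copies each file into a hypothetical backup/ directory (assumed to already exist); the [ -f ] test skips anything that is not a regular file, such as the backup directory itself:

#!/bin/bash

# Obtain a list of the contents of the current directory
DIRCONT=$(ls -1)

# Copy each regular file to the backup/ directory
for filename in ${DIRCONT}; do
    if [ -f "${filename}" ]; then
        echo "Backing up: ${filename}"
        cp "${filename}" backup/
    fi
done

Another external tool that is often integrated in the same way is find, as in the following script: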

#!/bin/bash

# Find all the files in a directory hierarchy that are marked as executable
for file in $(find . -executable -type f); do
  echo "File found: ${file}"

  # Change file permissions so that all users can execute the file
  # chmod go+x "${file}"
done

The find tool is another shell command; it descends a directory hierarchy and returns a list of files selected by criteria such as name, location, type, and many more.
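
To give a flavour of what find can do, here are a few common invocations (the file name notes.txt is just a hypothetical example):

$ find . -name '*.dat'       # files whose names end in .dat
$ find . -type d             # directories only
$ find . -newer notes.txt    # files modified more recently than notes.txt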

In this case we have a hypothetical situation where only the current user has permission to execute files and we want to give everybody permission. Instead of changing the permissions of every file by hand, this simple script automates the task.

Note the chmod command has been commented out to prevent accidental permission changes. Only uncomment it if you’re certain you understand how it works.

Exercise

Try creating a set of subdirectories containing named files, each one with a list of words. Use the find command to retrieve only those matching a certain pattern and then read the words into a variable and print them out.

Solution

#!/bin/bash

mkdir script_test1
mkdir script_test2

echo "text1" > ./script_test1/text1
echo "text2" > ./script_test2/text2

COUNTER=0

# Find all the files in the directory hierarchy whose names match the pattern
for line in $(find . -iname 'text*'); do
    LINECAT=$(cat "${line}")
    MY_ARRAY[$COUNTER]="${LINECAT}"
    COUNTER=$((COUNTER + 1))
done
echo "Array contents ${MY_ARRAY[@]}"

More tools: sed and awk

awk simple examples

We previously showed you examples of using grep, a very powerful tool specialized in searching one or more input files for lines containing a match to a specified pattern. There are two additional tools worth mentioning: sed and awk. Both tools specialize in text parsing and general text processing. Although an in-depth explanation of how to use them is beyond the scope of this course, we would like to provide a couple of examples that demonstrate their functionality and tease you into finding out more.

Let us start with awk. Consider the following document in the data directory:

$ cd data
$ cat amino-acids.txt 
Alanine         Ala
Arginine        Arg
Asparagine      Asn
Aspartic acid   Asp
Cysteine        Cys
Glutamic acid   Glu
Glutamine       Gln
Glycine         Gly
Histidine       His
Isoleucine      Ile
Leucine         Leu
Lysine          Lys
Methionine      Met
Phenylalanine   Phe
Proline         Pro
Serine          Ser
Threonine       Thr
Tryptophan      Trp
Tyrosine        Tyr
Valine          Val

We can see it is a file composed of two columns, containing the long and short names of amino acids. We can send the file's content to awk and manipulate it. For example, say we are interested only in the first column:

$ cat amino-acids.txt | awk '{print $1}'
Alanine
Arginine
Asparagine
Aspartic
Cysteine
Glutamic
Glutamine
Glycine
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Valine

awk is a programming language with many powerful tools. In the simple example above, the section within '{}' is the program defining the actions to be performed on the input, in this case printing column $1. Don't confuse this with Bash's interpretation of $1 (the first argument passed to a script); for awk it represents a column number. Try extracting column 2 instead, or even swapping the positions of columns 1 and 2.
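
For instance, a minimal sketch of that column swap, showing just the first few lines with head:

$ cat amino-acids.txt | awk '{print $2, $1}' | head -n 4
Ala Alanine
Arg Arginine
Asn Asparagine
acid Aspartic

Note how the two-word name Aspartic acid trips this up: awk splits columns on whitespace, so for that line $2 is acid rather than the short name.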

Now for a slightly more complex, and maybe more useful, example. Consider the contents of animal-counts in the data directory. We can see that it is a CSV file (comma-separated values).

$ cd data/animal-counts
$ cat animals.txt 
2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
2012-11-06,deer,2
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-07,bear,1

If we try to extract the first column as before, we quickly find out that it doesn’t work:

$ cat data/animal-counts/animals.txt | awk '{print $1}'
2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
2012-11-06,deer,2
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-07,bear,1

The problem is that by default awk expects spaces as the field separator, but in this case we have a ,. Fortunately awk is flexible and lets us define a different symbol to use as the field separator. This needs to be done in a special section of the awk script, the BEGIN section:

$ cat data/animal-counts/animals.txt | awk 'BEGIN{FS=","} {print $1}'
2012-11-05
2012-11-05
2012-11-05
2012-11-06
2012-11-06
2012-11-06
2012-11-07
2012-11-07

Say that we want to know the total number of animals recorded in our file. For this we need to modify our script to perform a summation of every element in column 3 and report the final result.

$ cat data/animal-counts/animals.txt | awk 'BEGIN{FS=","} {sum+=$3} END{print sum}'
76

In the above example we can see the use of the three main sections of an awk program: the BEGIN section, the main script, and the END section. There are many more things that can be accomplished with awk, but hopefully these simple examples have given you a taste of what is possible.
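
As one more sketch extending the same pattern, we can count the records as we go and report the average number of animals per sighting in the END section:

$ cat data/animal-counts/animals.txt | awk 'BEGIN{FS=","} {sum+=$3; n++} END{print sum/n}'
9.5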

sed simple examples

sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).

Usual syntax

$ sed SCRIPT INPUTFILE...

Consider the animals.txt file in the data directory:

$ cd data
$ cat animals.txt 
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear

Imagine that we made a mistake and deer wasn’t the correct category, maybe we should use elk instead. Amending potentially hundreds of files could be very time consuming but with sed this task can be performed in a few seconds:

$ sed 's/deer/elk/' animals.txt 
2012-11-05,elk
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,elk
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear

By default sed doesn't modify the input file. You can check this by printing the contents of animals.txt again. If we want to save the output of sed we have two options. The first is to redirect the output to a new file:

$ sed 's/deer/elk/' animals.txt > animals-fixed.txt

The second is to modify the file in place (check sed --help), but be careful, as this could destroy the original file if you make a mistake:

$ sed -i 's/deer/elk/' animals.txt
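
If your sed is GNU sed, a safer middle ground (a sketch worth verifying on your system, as other implementations differ) is to supply a backup suffix with -i, so the original is preserved as animals.txt.bak:

$ sed -i.bak 's/deer/elk/' animals.txt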

But sed has many more useful functions beyond simple substitution (check the sed manual).
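
For example, here is a quick sketch (assuming the original, unmodified animals.txt) showing how sed can print only selected lines or delete lines matching a pattern:

$ sed -n '1,3p' animals.txt
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
$ sed '/rabbit/d' animals.txt
2012-11-05,deer
2012-11-05,raccoon
2012-11-06,deer
2012-11-06,fox
2012-11-07,bear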

Things can get much more complex, but hopefully the above examples give you an indication of what is available on Linux and the command line.

Key Points

  • Functions help us pack a set of operations under a single label.

  • Generally, Bash functions do not return values.

  • The output of Bash commands like cat, ls or find can be assigned to a Bash variable using the construct VAR=$(command).