Loops
Overview
Teaching: 45 min
Exercises: 15 min
Questions
How can I perform the same actions on many different files?
Objectives
Write a loop that applies one or more commands separately to each file in a set of files.
Trace the values taken on by a loop variable during execution of the loop.
Explain the difference between a variable’s name and its value.
Explain why spaces and some punctuation characters shouldn’t be used in file names.
Demonstrate how to see what commands have recently been executed.
Re-run recently executed commands without retyping them.
In this lesson we will build on what we learned in the previous lesson, An Introduction to Linux with Command Line. We will continue working with Nelle’s pipeline, expanding it with Bash scripts to automate the production of results using many of the Bash commands already learned.
Loops are a programming construct which allow us to repeat a command or set of commands for each item in a list. As such they are key to productivity improvements through automation. Similar to wildcards and tab completion, using loops also reduces the amount of typing required (and hence reduces the number of typing mistakes).
Suppose we have several hundred genome data files named basilisk.dat, minotaur.dat, and unicorn.dat. For this example, we’ll use the creatures directory, which only has three example files, but the principles can be applied to many, many more files at once. The structure of these files is the same: the common name, classification, and updated date are presented on the first three lines, with DNA sequences on the following lines. Let’s look at the files:
$ head -n 5 basilisk.dat minotaur.dat unicorn.dat
We would like to print out the classification for each species, which is given on the second line of each file. For each file, we would need to execute the command head -n 2 and pipe this to tail -n 1.
We’ll use a loop to solve this problem, but first let’s look at the general form of a loop:
for thing in list_of_things
do
operation_using $thing # Indentation within the loop is not required, but aids legibility
done
and we can apply this to our example like this:
$ for filename in basilisk.dat minotaur.dat unicorn.dat
> do
> head -n 2 $filename | tail -n 1
> done
CLASSIFICATION: basiliscus vulgaris
CLASSIFICATION: bos hominus
CLASSIFICATION: equus monoceros
Follow the Prompt
The shell prompt changes from $ to > and back again as we were typing in our loop. The second prompt, >, is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;, can be used to separate two commands written on a single line.
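For example, the entire loop can be written on a single line using semicolons to separate its pieces (a sketch using the same three hypothetical file names; echo works whether or not the files exist):

```shell
# The whole loop on one line; semicolons separate the pieces
for filename in basilisk.dat minotaur.dat unicorn.dat; do echo $filename; done
```

This is exactly the form the shell itself uses when you recall a multi-line loop from your command history.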
When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the variable’s value by putting $ in front of it. The $ tells the shell interpreter to treat the variable as a variable name and substitute its value in its place, rather than treat it as text or an external command.
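A minimal illustration of the difference between a variable’s name and its value (the variable name filename here is just for illustration):

```shell
filename=basilisk.dat   # assign a value to the variable
echo filename           # without $: prints the literal text "filename"
echo $filename          # with $: the shell substitutes the value, "basilisk.dat"
```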
In this example, the list is three filenames: basilisk.dat, minotaur.dat, and unicorn.dat. Each time the loop iterates, it will assign a file name to the variable filename and run the head command. The first time through the loop, $filename is basilisk.dat. The interpreter runs the command head on basilisk.dat and pipes the first two lines to the tail command, which then prints the second line of basilisk.dat. For the second iteration, $filename becomes minotaur.dat. This time, the shell runs head on minotaur.dat and pipes the first two lines to the tail command, which then prints the second line of minotaur.dat. For the third iteration, $filename becomes unicorn.dat, so the shell runs the head command on that file, and tail on the output of that. Since the list was only three items, the shell exits the for loop.
Same Symbols, Different Meanings
Here we see > being used as a shell prompt, whereas > is also used to redirect output. Similarly, $ is used as a shell prompt, but, as we saw earlier, it is also used to ask the shell to get the value of a variable.

If the shell prints > or $ then it expects you to type something, and the symbol is a prompt.

If you type > or $ yourself, it is an instruction from you that the shell should redirect output or get the value of a variable.
When using variables it is also possible to put the names into curly braces to clearly delimit the variable name: $filename is equivalent to ${filename}, but is different from ${file}name. You may find this notation in other people’s programs.
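A short sketch of the difference (the variable names are illustrative):

```shell
filename=basilisk.dat
file=basilisk
echo ${filename}    # same as $filename: prints "basilisk.dat"
echo ${file}name    # value of $file, then the literal text "name": prints "basiliskname"
```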
We have called the variable in this loop filename in order to make its purpose clearer to human readers. The shell itself doesn’t care what the variable is called; if we wrote this loop as:
$ for x in basilisk.dat minotaur.dat unicorn.dat
> do
> head -n 2 $x | tail -n 1
> done
or:
$ for temperature in basilisk.dat minotaur.dat unicorn.dat
> do
> head -n 2 $temperature | tail -n 1
> done
it would work exactly the same way. Don’t do this. Programs are only useful if people can understand them, so meaningless names (like x) or misleading names (like temperature) increase the odds that the program won’t do what its readers think it does.
Variables in Loops
This exercise refers to the data-shell-scripting/molecules directory. ls gives the following output:

cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

What is the output of the following code?
$ for datafile in *.pdb
> do
> ls *.pdb
> done
Now, what is the output of the following code?
$ for datafile in *.pdb
> do
> ls $datafile
> done
Why do these two loops give different outputs?
Solution
The first code block gives the same output on each iteration through the loop. Bash expands the wildcard *.pdb within the loop body (as well as before the loop starts) to match all files ending in .pdb and then lists them using ls. The expanded loop would look like this:

$ for datafile in cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
> do
> ls cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
> done

cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
The second code block lists a different file on each loop iteration. The value of the datafile variable is evaluated using $datafile, and then listed using ls.

cubane.pdb
ethane.pdb
methane.pdb
octane.pdb
pentane.pdb
propane.pdb
Limiting Sets of Files
What would be the output of running the following loop in the data-shell-scripting/molecules directory?

$ for filename in c*
> do
> ls $filename
> done
1. No files are listed.
2. All files are listed.
3. Only cubane.pdb, octane.pdb and pentane.pdb are listed.
4. Only cubane.pdb is listed.

Solution
4 is the correct answer. * matches zero or more characters, so any file name starting with the letter c, followed by zero or more other characters, will be matched.

How would the output differ from using this command instead?

$ for filename in *c*
> do
> ls $filename
> done
1. The same files would be listed.
2. All the files are listed this time.
3. No files are listed this time.
4. The files cubane.pdb and octane.pdb will be listed.
5. Only the file octane.pdb will be listed.

Solution
4 is the correct answer. * matches zero or more characters, so a file name with zero or more characters before a letter c and zero or more characters after the letter c will be matched.
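We can check both patterns directly with ls. This sketch recreates the six molecule files in a scratch directory (the mktemp setup is just an assumption so the example is self-contained; any directory with these files works):

```shell
cd "$(mktemp -d)"    # scratch directory so we don't disturb real files
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
ls c*                # names starting with c: cubane.pdb
ls *c*               # names containing a c anywhere: cubane.pdb and octane.pdb
```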
Saving to a File in a Loop - Part One
In the data-shell-scripting/molecules directory, what is the effect of this loop?

for alkanes in *.pdb
do
    echo $alkanes
    cat $alkanes > alkanes.pdb
done
1. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
2. Prints cubane.pdb, ethane.pdb, and methane.pdb, and the text from all three files would be concatenated and saved to a file called alkanes.pdb.
3. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
4. None of the above.
Solution

1 is the correct answer. The text from each file in turn gets written to the alkanes.pdb file. However, the file gets overwritten on each loop iteration, so the final content of alkanes.pdb is the text from the propane.pdb file.
Saving to a File in a Loop - Part Two
Also in the data-shell-scripting/molecules directory, what would be the output of the following loop?

for datafile in *.pdb
do
    cat $datafile >> all.pdb
done
1. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb would be concatenated and saved to a file called all.pdb.
2. The text from ethane.pdb will be saved to a file called all.pdb.
3. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be concatenated and saved to a file called all.pdb.
4. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be printed to the screen and saved to a file called all.pdb.

Solution
3 is the correct answer. >> appends to a file, rather than overwriting it with the redirected output from a command. Given the output from the cat command has been redirected, nothing is printed to the screen.
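The difference between > and >> can be seen with a quick experiment in a scratch directory (the file name demo.txt is illustrative):

```shell
cd "$(mktemp -d)"        # work somewhere disposable
echo first  > demo.txt   # > overwrites: demo.txt contains "first"
echo second > demo.txt   # overwritten again: only "second" remains
echo third >> demo.txt   # >> appends: demo.txt now has "second" then "third"
cat demo.txt
```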
Let’s continue with our example in the data-shell-scripting/creatures directory. Here’s a slightly more complicated loop:
$ for filename in *.dat
> do
> echo $filename
> head -n 100 $filename | tail -n 20
> done
The shell starts by expanding *.dat to create the list of files it will process. The loop body then executes two commands for each of those files. The first command, echo, prints its command-line arguments to standard output.
For example:
$ echo hello there
prints:
hello there
In this case, since the shell expands $filename to be the name of a file, echo $filename prints the name of the file.
Note that we can’t write this as:
$ for filename in *.dat
> do
> $filename
> head -n 100 $filename | tail -n 20
> done
because then the first time through the loop, when $filename expanded to basilisk.dat, the shell would try to run basilisk.dat as a program. Finally, the head and tail combination selects lines 81-100 from whatever file is being processed (assuming the file has at least 100 lines).
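In general, head -n M file | tail -n K prints lines M-K+1 through M. A quick way to convince yourself, using a hypothetical file of numbered lines generated with seq:

```shell
cd "$(mktemp -d)"
seq 1 200 > lines.txt                # a test file whose Nth line is the number N
head -n 100 lines.txt | tail -n 20   # prints lines 81 through 100
```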
Spaces in Names
Spaces are used to separate the elements of the list that we are going to loop over. If one of those elements contains a space character, we need to surround it with quotes, and do the same thing to our loop variable. Suppose our data files are named:
red dragon.dat purple unicorn.dat
To loop over these files, we would need to add double quotes like so:
$ for filename in "red dragon.dat" "purple unicorn.dat"
> do
> head -n 100 "$filename" | tail -n 20
> done
It is simpler to avoid using spaces (or other special characters) in filenames.
The files above don’t exist, so if we run the above code, the head command will be unable to find them; however, the error messages returned will show the names of the files it is expecting:

head: cannot open ‘red dragon.dat’ for reading: No such file or directory
head: cannot open ‘purple unicorn.dat’ for reading: No such file or directory
Try removing the quotes around $filename in the loop above to see the effect of the quote marks on spaces. Note that we get a result from the loop command for unicorn.dat when we run this code in the creatures directory:

head: cannot open ‘red’ for reading: No such file or directory
head: cannot open ‘dragon.dat’ for reading: No such file or directory
head: cannot open ‘purple’ for reading: No such file or directory
CGGTACCGAA
AAGGGTCGCG
CAAGTGTTCC
We would like to modify each of the files in data-shell-scripting/creatures, but also save a version of the original files, naming the copies original-basilisk.dat and original-unicorn.dat.
We can’t use:
$ cp *.dat original-*.dat
because that would expand to:
$ cp basilisk.dat minotaur.dat unicorn.dat original-*.dat
This wouldn’t back up our files; instead we get an error:

cp: target `original-*.dat' is not a directory

This problem arises when cp receives more than two inputs. When this happens, it expects the last input to be a directory where it can copy all the files it was passed. Since there is no directory named original-*.dat in the creatures directory, we get an error.
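Conversely, when the last argument really is a directory, cp with many inputs works fine. If plain copies (rather than original-*.dat names) are enough, one option is to back everything up into a directory. A sketch, with a scratch directory standing in for creatures:

```shell
cd "$(mktemp -d)"                   # stand-in for the creatures directory
touch basilisk.dat minotaur.dat unicorn.dat
mkdir backup
cp *.dat backup/    # many inputs plus one directory target: cp is happy with this
ls backup
```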
Instead, we can use a loop:
$ for filename in *.dat
> do
> cp $filename original-$filename
> done
This loop runs the cp command once for each filename. The first time, when $filename expands to basilisk.dat, the shell executes:
cp basilisk.dat original-basilisk.dat
The second time, the command is:
cp minotaur.dat original-minotaur.dat
The third and last time, the command is:
cp unicorn.dat original-unicorn.dat
Since the cp command does not normally produce any output, it’s hard to check that the loop is doing the correct thing. However, we learned earlier how to print strings using echo, and we can modify the loop to use echo to print our commands without actually executing them. As such we can check what commands would be run in the unmodified loop. The following diagram shows what happens when the modified loop is executed, and demonstrates how the judicious use of echo is a good debugging technique.
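The modified loop simply puts echo in front of cp, so each iteration prints the command it would have run instead of running it (the scratch-directory setup here is just so the sketch is self-contained):

```shell
cd "$(mktemp -d)"
touch basilisk.dat minotaur.dat unicorn.dat
for filename in *.dat
do
    echo cp $filename original-$filename   # print the command instead of running it
done
```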
Nelle’s Pipeline: Processing Files
Nelle is now ready to process her data files using goostats — a shell script written by her supervisor. This calculates some statistics from a protein sample file, and takes two arguments:
- an input file (containing the raw data)
- an output file (to store the calculated statistics)
Since she’s still learning how to use the shell, she decides to build up the required commands in stages. Her first step is to make sure that she can select the right input files — remember, these are ones whose names end in ‘A’ or ‘B’, rather than ‘Z’. Starting from her home directory, Nelle types:
$ cd north-pacific-gyre/2012-07-03
$ for datafile in NENE*[AB].txt
> do
> echo $datafile
> done
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...
NENE02043A.txt
NENE02043B.txt
Her next step is to decide what to call the files that the goostats analysis program will create. Prefixing each input file’s name with ‘stats’ seems simple, so she modifies her loop to do that:
$ for datafile in NENE*[AB].txt
> do
> echo $datafile stats-$datafile
> done
NENE01729A.txt stats-NENE01729A.txt
NENE01729B.txt stats-NENE01729B.txt
NENE01736A.txt stats-NENE01736A.txt
...
NENE02043A.txt stats-NENE02043A.txt
NENE02043B.txt stats-NENE02043B.txt
She hasn’t actually run goostats yet, but now she’s sure she can select the right files and generate the right output filenames.
Typing in commands over and over again is becoming tedious, though, and Nelle is worried about making mistakes, so instead of re-entering her loop, she presses ↑. In response, the shell redisplays the whole loop on one line (using semi-colons to separate the pieces):
$ for datafile in NENE*[AB].txt; do echo $datafile stats-$datafile; done
Using the left arrow key, Nelle backs up and changes the command echo to bash goostats:
$ for datafile in NENE*[AB].txt; do bash goostats $datafile stats-$datafile; done
When she presses Enter, the shell runs the modified command. However, nothing appears to happen — there is no output. After a moment, Nelle realizes that since her script doesn’t print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the running command by typing Ctrl+C, uses ↑ to repeat the command, and edits it to read:
$ for datafile in NENE*[AB].txt; do echo $datafile; bash goostats $datafile stats-$datafile; done
Beginning and End
We can move to the beginning of a line in the shell by typing Ctrl+A and to the end using Ctrl+E.
When she runs her program now, it produces one line of output every five seconds or so:
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...
1518 times 5 seconds, divided by 60, tells her that her script will take about two hours to run. As a final check, she opens another terminal window, goes into north-pacific-gyre/2012-07-03, and uses cat stats-NENE01729B.txt to examine one of the output files. It looks good, so she decides to get some coffee and catch up on her reading.
Those Who Know History Can Choose to Repeat It
Another way to repeat previous work is to use the history command to get a list of the last few hundred commands that have been executed, and then to use !123 (where ‘123’ is replaced by the command number) to repeat one of those commands. For example, if Nelle types this:

$ history | tail -n 5

456 ls -l NENE0*.txt
457 rm stats-NENE01729B.txt.txt
458 bash goostats NENE01729B.txt stats-NENE01729B.txt
459 ls -l NENE0*.txt
460 history
then she can re-run goostats on NENE01729B.txt simply by typing !458.
Other History Commands
There are a number of other shortcut commands for getting at the history.
- Ctrl+R enters a history search mode, ‘reverse-i-search’, and finds the most recent command in your history that matches the text you enter next. Press Ctrl+R one or more additional times to search for earlier matches. You can then use the left and right arrow keys to choose that line and edit it, then hit Return to run the command.
- !! retrieves the immediately preceding command (you may or may not find this more convenient than using ↑).
- !$ retrieves the last word of the last command. That’s useful more often than you might expect: after bash goostats NENE01729B.txt stats-NENE01729B.txt, you can type less !$ to look at the file stats-NENE01729B.txt, which is quicker than doing ↑ and editing the command-line.
Doing a Dry Run
A loop is a way to do many things at once — or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to echo the commands it would run instead of actually running them.

Suppose we want to preview the commands the following loop will execute without actually running those commands:
$ for datafile in *.pdb
> do
> cat $datafile >> all.pdb
> done
What is the difference between the two loops below, and which one would we want to run?
# Version 1
$ for datafile in *.pdb
> do
> echo cat $datafile >> all.pdb
> done
# Version 2
$ for datafile in *.pdb
> do
> echo "cat $datafile >> all.pdb"
> done
Solution
The second version is the one we want to run. This prints to screen everything enclosed in the quote marks, expanding the loop variable name because we have prefixed it with a dollar sign.

The first version appends the output from the command echo cat $datafile to the file all.pdb. This file will just contain the list: cat cubane.pdb, cat ethane.pdb, cat methane.pdb, etc.

Try both versions for yourself to see the output! Be sure to open the all.pdb file to view its contents.
Nested Loops
Suppose we want to set up a directory structure to organize some experiments measuring reaction rate constants with different compounds and different temperatures. What would be the result of the following code:
$ for species in cubane ethane methane
> do
>     for temperature in 25 30 37 40
>     do
>         mkdir $species-$temperature
>     done
> done
Solution
We have a nested loop, i.e. one loop contained within another loop, so for each species in the outer loop, the inner loop (the nested loop) iterates over the list of temperatures, and creates a new directory for each combination.
Try running the code for yourself to see which directories are created!
Key Points
A for loop repeats commands once for every thing in a list.
Every for loop needs a variable to refer to the thing it is currently operating on.
Use $name to expand a variable (i.e., get its value). ${name} can also be used.
Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as it complicates variable expansion.
Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
Use the up-arrow key to scroll up through previous commands to edit and repeat them.
Use Ctrl+R to search through the previously entered commands.
Use history to display recent commands, and !number to repeat a command by number.
Shell Scripts
Overview
Teaching: 40 min
Exercises: 20 min
Questions
How can I save and re-use commands?
Objectives
Write a shell script that runs a command or series of commands for a fixed set of files.
Run a shell script from the command line.
Write a shell script that operates on a set of files defined by the user on the command line.
Create pipelines that include shell scripts you, and others, have written.
We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.
Let’s start by going back to molecules/ and creating a new file, middle.sh, which will become our shell script:
$ cd molecules
$ nano middle.sh
The command nano middle.sh opens the file middle.sh within the text editor ‘nano’ (which runs within the shell). If the file does not exist, it will be created.
We can use the text editor to directly edit the file – we’ll simply insert the following line:
head -n 15 octane.pdb | tail -n 5
This is a variation on the pipe we constructed earlier: it selects lines 11-15 of the file octane.pdb. Remember, we are not running it as a command just yet: we are putting the commands in a file.
Then we save the file (Ctrl-O in nano), and exit the text editor (Ctrl-X in nano). Check that the directory molecules now contains a file called middle.sh.
Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash, so we run the following command:
$ bash middle.sh
ATOM 9 H 1 -4.502 0.681 0.785 1.00 0.00
ATOM 10 H 1 -5.254 -0.243 -0.537 1.00 0.00
ATOM 11 H 1 -4.357 1.252 -0.895 1.00 0.00
ATOM 12 H 1 -3.009 -0.741 -1.467 1.00 0.00
ATOM 13 H 1 -3.172 -1.337 0.206 1.00 0.00
Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.
Text vs. Whatever
We usually call programs like Microsoft Word or LibreOffice Writer “text editors”, but we need to be a bit more careful when it comes to programming. By default, Microsoft Word uses
.docx
files to store not only text, but also formatting information about fonts, headings, and so on. This extra information isn’t stored as characters, and doesn’t mean anything to tools likehead
: they expect input files to contain nothing but the letters, digits, and punctuation on a standard computer keyboard. When editing programs, therefore, you must either use a plain text editor, or be careful to save files as plain text.
What if we want to select lines from an arbitrary file? We could edit middle.sh each time to change the filename, but that would probably take longer than typing the command out again in the shell and executing it with a new file name. Instead, let’s edit middle.sh and make it more versatile:
$ nano middle.sh
Now, within “nano”, replace the text octane.pdb with the special variable called $1:
head -n 15 "$1" | tail -n 5
Inside a shell script, $1 means ‘the first filename (or other argument) on the command line’. We can now run our script like this:
$ bash middle.sh octane.pdb
ATOM 9 H 1 -4.502 0.681 0.785 1.00 0.00
ATOM 10 H 1 -5.254 -0.243 -0.537 1.00 0.00
ATOM 11 H 1 -4.357 1.252 -0.895 1.00 0.00
ATOM 12 H 1 -3.009 -0.741 -1.467 1.00 0.00
ATOM 13 H 1 -3.172 -1.337 0.206 1.00 0.00
or on a different file like this:
$ bash middle.sh pentane.pdb
ATOM 9 H 1 1.324 0.350 -1.332 1.00 0.00
ATOM 10 H 1 1.271 1.378 0.122 1.00 0.00
ATOM 11 H 1 -0.074 -0.384 1.288 1.00 0.00
ATOM 12 H 1 -0.048 -1.362 -0.205 1.00 0.00
ATOM 13 H 1 -1.183 0.500 -1.412 1.00 0.00
Double-Quotes Around Arguments
For the same reason that we put the loop variable inside double-quotes, in case the filename happens to contain any spaces, we surround $1 with double-quotes.
Currently, we need to edit middle.sh each time we want to adjust the range of lines that is returned. Let’s fix that by configuring our script to instead use three command-line arguments. Each argument that we provide will be accessible via the special variables $1, $2, $3, which refer to the first, second, and third command-line arguments, respectively. Knowing this, we can use the additional arguments to define the range of lines to be passed to head and tail respectively:
$ nano middle.sh
head -n "$2" "$1" | tail -n "$3"
We can now run:
$ bash middle.sh pentane.pdb 15 5
ATOM 9 H 1 1.324 0.350 -1.332 1.00 0.00
ATOM 10 H 1 1.271 1.378 0.122 1.00 0.00
ATOM 11 H 1 -0.074 -0.384 1.288 1.00 0.00
ATOM 12 H 1 -0.048 -1.362 -0.205 1.00 0.00
ATOM 13 H 1 -1.183 0.500 -1.412 1.00 0.00
By changing the arguments to our command we can change our script’s behaviour:
$ bash middle.sh pentane.pdb 20 5
ATOM 14 H 1 -1.259 1.420 0.112 1.00 0.00
ATOM 15 H 1 -2.608 -0.407 1.130 1.00 0.00
ATOM 16 H 1 -2.540 -1.303 -0.404 1.00 0.00
ATOM 17 H 1 -3.393 0.254 -0.321 1.00 0.00
TER 18 1
This works, but it may take the next person who reads middle.sh a moment to figure out what it does. We can improve our script by adding some comments at the top:
$ nano middle.sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
A comment starts with a # character and runs to the end of the line. The computer ignores comments, but they’re invaluable for helping people (including your future self) understand and use scripts. The only caveat is that each time you modify the script, you should check that the comment is still accurate: an explanation that sends the reader in the wrong direction is worse than none at all.
What if we want to process many files in a single pipeline? For example, if we want to sort our .pdb files by length, we would type:
$ wc -l *.pdb | sort -n
because wc -l lists the number of lines in the files (recall that wc stands for ‘word count’, and adding the -l option means ‘count lines’ instead) and sort -n sorts things numerically. We could put this in a file, but then it would only ever sort a list of .pdb files in the current directory. If we want to be able to get a sorted list of other kinds of files, we need a way to get all those names into the script. We can’t use $1, $2, and so on because we don’t know how many files there are.
Instead, we use the special variable $@, which means ‘all of the command-line arguments to the shell script’. We also should put $@ inside double-quotes to handle the case of arguments containing spaces ("$@" is special syntax and is equivalent to "$1" "$2" …). Here’s an example:
$ nano sorted.sh
# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
$ bash sorted.sh *.pdb ../creatures/*.dat
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
163 ../creatures/basilisk.dat
163 ../creatures/minotaur.dat
163 ../creatures/unicorn.dat
596 total
List Unique Species
Leah has several hundred data files, each of which is formatted like this:
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
An example of this type of file is given in data-shell/data/animal-counts/animals.txt.

We can use the command cut -d , -f 2 animals.txt | sort | uniq to produce the unique species in animals.txt. In order to avoid having to type out this series of commands every time, a scientist may choose to write a shell script instead.

Write a shell script called species.sh that takes any number of filenames as command-line arguments, and uses a variation of the above command to print a list of the unique species appearing in each of those files separately.

Solution
# Script to find unique species in csv files where species is the second data field
# This script accepts any number of file names as command line arguments

# Loop over all files
for file in "$@"
do
    echo "Unique species in $file:"
    # Extract species names
    cut -d , -f 2 "$file" | sort | uniq
done
Suppose we have just run a series of commands that did something useful — for example, that created a graph we’d like to use in a paper. We’d like to be able to re-create the graph later if we need to, so we want to save the commands in a file. Instead of typing them in again (and potentially getting them wrong) we can do this:
$ history | tail -n 5 > redo-figure-3.sh
The file redo-figure-3.sh
now contains:
297 bash goostats NENE01729B.txt stats-NENE01729B.txt
298 bash goodiff stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt
299 cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
301 history | tail -n 5 > redo-figure-3.sh
After a moment’s work in an editor to remove the serial numbers on the commands, and to remove the final line where we called the history command, we have a completely accurate record of how we created that figure.
Why Record Commands in the History Before Running Them?
If you run the command:
$ history | tail -n 5 > recent.sh
the last command in the file is the history command itself, i.e., the shell has added history to the command log before actually running it. In fact, the shell always adds commands to the log before running them. Why do you think it does this?

Solution
If a command causes something to crash or hang, it might be useful to know what that command was, in order to investigate the problem. Were the command only recorded after running it, we would not have a record of the last command run in the event of a crash.
In practice, most people develop shell scripts by running commands at the shell prompt a few times to make sure they’re doing the right thing, then saving them in a file for re-use. This style of work allows people to recycle what they discover about their data and their workflow with one call to history and a bit of editing to clean up the output and save it as a shell script.
Nelle’s Pipeline: Creating a Script
Nelle’s supervisor insisted that all her analytics must be reproducible. The easiest way to capture all the steps is in a script.
First we return to Nelle’s data directory:
$ cd ../north-pacific-gyre/2012-07-03/
She runs the editor and writes the following:
# Calculate stats for data files.
for datafile in "$@"
do
echo $datafile
bash goostats $datafile stats-$datafile
done
She saves this in a file called do-stats.sh so that she can now re-do the first stage of her analysis by typing:
$ bash do-stats.sh NENE*[AB].txt
She can also do this:
$ bash do-stats.sh NENE*[AB].txt | wc -l
so that the output is just the number of files processed rather than the names of the files that were processed.
One thing to note about Nelle’s script is that it lets the person running it decide what files to process. She could have written it as:
# Calculate stats for Site A and Site B data files.
for datafile in NENE*[AB].txt
do
echo $datafile
bash goostats $datafile stats-$datafile
done
The advantage is that this always selects the right files:
she doesn’t have to remember to exclude the ‘Z’ files.
The disadvantage is that it always selects just those files — she can’t run it on all files
(including the ‘Z’ files),
or on the ‘G’ or ‘H’ files her colleagues in Antarctica are producing,
without editing the script.
If she wanted to be more adventurous,
she could modify her script to check for command-line arguments,
and use NENE*[AB].txt
if none were provided.
Of course, this introduces another tradeoff between flexibility and complexity.
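One way to sketch such an argument check uses the if statement and the $# operator covered later in this lesson. This is an illustrative variant, not Nelle's actual script, and the goostats call is commented out so the sketch can be run anywhere:

```shell
# do-stats-flexible.sh: an illustrative variant of Nelle's script.
# If no arguments are given, fall back to the Site A and Site B files.
if [ $# -eq 0 ]
then
    set -- NENE*[AB].txt    # replace the empty argument list with the default pattern
fi

for datafile in "$@"
do
    echo "$datafile"
    # bash goostats "$datafile" stats-"$datafile"   # commented out: goostats is not available here
done
```

Run with no arguments, the loop processes the default Site A and Site B files; run with arguments, it behaves exactly as before.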
Variables in Shell Scripts
In the
molecules
directory, imagine you have a shell script called script.sh
containing the following commands:
head -n $2 $1
tail -n $3 $1
While you are in the
molecules
directory, you type the following command:
bash script.sh '*.pdb' 1 1
Which of the following outputs would you expect to see?
1. All of the lines between the first and the last lines of each file ending in .pdb in the molecules directory
2. The first and the last line of each file ending in .pdb in the molecules directory
3. The first and the last line of each file in the molecules directory
4. An error because of the quotes around *.pdb
Solution
The correct answer is 2.
The special variables $1, $2 and $3 represent the command line arguments given to the script, such that the commands run are:
$ head -n 1 cubane.pdb ethane.pdb octane.pdb pentane.pdb propane.pdb
$ tail -n 1 cubane.pdb ethane.pdb octane.pdb pentane.pdb propane.pdb
The shell does not expand '*.pdb' because it is enclosed by quote marks. As such, the first argument to the script is '*.pdb', which gets expanded within the script by head and tail.
Find the Longest File With a Given Extension
Write a shell script called
longest.sh
that takes the name of a directory and a filename extension as its arguments, and prints out the name of the file with the most lines in that directory with that extension. For example:
$ bash longest.sh /tmp/data pdb
would print the name of the
.pdb
file in /tmp/data
that has the most lines.
Solution
# Shell script which takes two arguments:
# 1. a directory name
# 2. a file extension
# and prints the name of the file in that directory
# with the most lines which matches the file extension.
wc -l $1/*.$2 | sort -n | tail -n 2 | head -n 1
The first part of the pipeline, wc -l $1/*.$2 | sort -n, counts the lines in each file and sorts them numerically (largest last). When there's more than one file, wc also outputs a final summary line, giving the total number of lines across all files. We use tail -n 2 | head -n 1 to throw away this last line.
With wc -l $1/*.$2 | sort -n | tail -n 1 we would see only the final summary line: we can build our pipeline up in pieces to be sure we understand the output.
Script Reading Comprehension
For this question, consider the data-shell/molecules directory once again. This contains a number of .pdb files in addition to any other files you may have created. Explain what each of the following three scripts would do when run as bash script1.sh *.pdb, bash script2.sh *.pdb, and bash script3.sh *.pdb respectively.
# Script 1
echo *.*

# Script 2
for filename in $1 $2 $3
do
    cat $filename
done

# Script 3
echo $@.pdb
Solutions
In each case, the shell expands the wildcard in *.pdb before passing the resulting list of file names as arguments to the script.
Script 1 would print out a list of all files containing a dot in their name. The arguments passed to the script are not actually used anywhere in the script.
Script 2 would print the contents of the first 3 files with a .pdb file extension. $1, $2, and $3 refer to the first, second, and third argument respectively.
Script 3 would print all the arguments to the script (i.e. all the .pdb files), followed by .pdb. $@ refers to all the arguments given to a shell script.
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb.pdb
Debugging Scripts
Suppose you have saved the following script in a file called
do-errors.sh
in Nelle’s north-pacific-gyre/2012-07-03
directory:
# Calculate stats for data files.
for datafile in "$@"
do
    echo $datfile
    bash goostats $datafile stats-$datafile
done
When you run it:
$ bash do-errors.sh NENE*[AB].txt
the output is blank. To figure out why, re-run the script using the
-x
option:
bash -x do-errors.sh NENE*[AB].txt
What is the output showing you? Which line is responsible for the error?
Solution
The -x option causes bash to run in debug mode. This prints out each command as it is run, which will help you to locate errors. In this example, we can see that echo isn't printing anything. We have made a typo in the loop variable name, and the variable datfile doesn't exist, hence returning an empty string.
Key Points
Save commands in files (usually called shell scripts) for re-use.
bash filename runs the commands saved in a file.
$@ refers to all of a shell script's command-line arguments.
$1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.
Place variables in quotes if the values might have spaces in them.
Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.
Flow Control
Overview
Teaching: 30 min
Exercises: 15 min
Questions
How to control the flow of a Bash program?
Objectives
Understand Bash basic conditional structure.
Become familiar with common operators to compare strings and integers.
Become familiar with common operators to test the existence of files and directories.
The if
statement
A programming language, even a simple one, requires the ability to change its functionality depending upon certain conditions. For example, if a particular file exists or if the number represented by a variable is greater than some value then perform some action, otherwise perform a different action. In this section we shall look at ways of determining the flow of a script.
We have already seen how variables can be assigned and printed. But we may want to test the value of a variable in order to determine how to proceed. In this case we can use the if statement to test the validity of an expression. A typical shell comparison to test the validity of CONDITION has the form:
if [[ CONDITION ]]; then
echo "Condition is true"
fi
Where CONDITION is typically a construct that uses two arguments which are compared using a comparison operator. Some of the more common comparison operators used to form conditions are summarized in the following table:
Arithmetic | Meaning | String | Meaning |
---|---|---|---|
-eq | equals | = | equals |
-ne | not equal to | != | not equal to |
-lt | less than | < | less than |
-gt | greater than | > | greater than |
-le | less than or equal to | | |
-ge | greater than or equal to | | |
It is also possible to add an additional default action to be executed in case our
test is not satisfied, for this we use the else
statement:
if [[ CONDITION ]]; then
echo "Condition is true"
else
echo "Condition is false"
fi
For example, we can perform a comparison on two integers. Copy the following command
to a bash script (you can also try executing it directly in the terminal, similarly to what we
did with for
loops) called comparing-integers.sh:
if [[ $1 -eq $2 ]]
then
echo ${1} is equal to ${2}
elif [[ $1 -gt $2 ]]
then
echo ${1} is greater than ${2}
elif [[ $1 -lt $2 ]]
then
echo ${1} is less than ${2}
fi
And execute it like this (try also with other numbers):
$ bash comparing-integers.sh 10 3
10 is greater than 3
We can also compare strings. Copy the following commands to a script called comparing-strings.sh:
if [[ $1 == $2 ]]
then
echo strings are equal
else
echo strings are different
fi
And execute it like this:
$ bash comparing-strings.sh dog cat
strings are different
More on Integer and String comparisons
Consider the following code snippet. It demonstrates some basic string and integer comparisons, with branching code depending upon the outcome. We first define two variables X and Y (you can also use $1 and $2 to access arguments passed to the script) and assign them integer values (remember that to Bash they are still strings).
The next step is to build an if…then…else construct to test our variables using an arithmetic comparison operator. Specifically, using -eq lets Bash know that the values stored in the variables are to be treated as numbers. If X is equal to Y the script performs one action; if not, it performs another.
#!/bin/bash
#Declare two integers for testing
X=3
Y=10
#Perform a comparison on the integers
if [[ $X -eq $Y ]]; then
    echo "${X} equals ${Y}"
else
    echo "${X} does not equal ${Y}"
fi
Now try the following:
- Use some of the other comparison operators and see if your results meet your expectations. What outcome would you expect when using the string comparison operators < or >?
- Test different numbers, for example, compare 10 and 3; “10” and “3” (including the quote marks "); “10” and “ 3” (notice the space in front of the number 3); “40” and “3”. What kind of results do you obtain?
Solution
Notice that in the first case the results are not necessarily as we expect since the characters are compared in alphabetical (ASCII) order since we are using a string comparison operator.
In the second case we can confirm this by noticing that bash (when instructed to compare the numbers as strings) compares the first character of each string, and if the one on the left has a lower value then it’s true, if greater then it is false; if they’re the same, then it compares the second character, etc.
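We can demonstrate the difference directly: inside [[ ]], the < operator performs a string comparison character by character, while -lt performs an arithmetic one. This is a minimal demonstration with illustrative variable names:

```shell
# Inside [[ ]], < compares strings character by character (ASCII order),
# while -lt compares the operands as integers.
if [[ 10 < 3 ]]; then
    STRING_RESULT="10 sorts before 3 as a string"    # '1' comes before '3' in ASCII
fi
if [[ 10 -lt 3 ]]; then
    ARITH_RESULT="less"
else
    ARITH_RESULT="not less"
fi
echo "$STRING_RESULT"
echo "$ARITH_RESULT"
```

The string test succeeds (because '1' sorts before '3') while the arithmetic test fails, since 10 is not numerically less than 3.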
A common practical application of if
statements in programming scripts is to add
a help
flag to print some useful information about the script. For example:
if [[ "$1" == "--help" ]]
then
echo "Returns the file with the largest number"
echo "of lines in the list provided"
echo "To execute:"
echo "print-largest.sh <list-of-files>"
fi
wc -l $@ | sort -n | tail -1
The above script should print the help message when executed like this:
bash print-largest.sh --help
But you might see an additional error message reminding you about how wc
works:
Returns the file with the largest number
of lines in the list provided
To execute:
print-largest.sh <list-of-files>
wc: illegal option -- -
usage: wc [-clmw] [file ...]
What is happening in this case is that bash is evaluating the conditional test in
our if statement and executing the commands if the condition is true; however, our
script doesn't end there, and bash continues executing any commands after the if
statement. If for some reason we would like to exit our script at some point (e.g. in
case some condition is not satisfied) we can use the command exit to instruct
bash to quit the script at that point. exit takes a numeric argument that is used
to specify what caused the program to exit (the convention is to use 0 if everything
is OK; this is the value your script returns if it finishes without errors. Try
inspecting $? after executing your script to take a look at this value). We can tell
bash to exit our script after printing our help message in this way:
if [[ "$1" == "--help" ]]
then
echo "Returns the file with the largest number"
echo "of lines in the list provided"
echo "To execute:"
echo "print-largest.sh <list-of-files>"
exit 0
fi
wc -l $@ | sort -n | tail -1
bash print-largest.sh --help
Returns the file with the largest number
of lines in the list provided
To execute:
print-largest.sh <list-of-files>
This time the script exits immediately after printing the help message without trying
to execute the wc
command.
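The exit-status convention is easy to verify at the prompt with the built-in commands true and false; a minimal demonstration, with illustrative variable names:

```shell
# true always exits with status 0; false exits with a non-zero status.
true
STATUS_OK=$?              # 0: success by convention
false || STATUS_FAIL=$?   # 1: non-zero signals an error ('||' keeps the script going)
echo "true exited with $STATUS_OK; false exited with $STATUS_FAIL"
```

The same mechanism is what makes echo $? after running your own script show the value you passed to exit.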
Counting arguments with $#
One further bash
operator useful when working with scripts that take external
arguments is $#, which holds the number of arguments received.
For example, try the following script (save it to
counting-arguments.sh):
echo the number of arguments received is $#
And run like this (try with a different number of arguments).
bash counting-arguments.sh argument1 argument2 argument3
the number of arguments received is 3
We can use $#
to make our help printing if
statement even more flexible. For
example, imagine a new researcher just received our script and has no idea what to do
with it. He or she might be tempted to run the script like this and see what happens:
bash print-largest.sh
But this would lead to the script hanging! (you can cancel it with
Ctrl+C). A more useful default behaviour would be that, if your
script requires arguments to work, trying to run it with no arguments causes the help
message to be printed. We can do this with $#
like this:
if [[ "$1" == "--help" ]] || [[ $# -eq 0 ]]
then
echo "Returns the file with the largest number"
echo "of lines in the list provided"
echo "To execute:"
echo "print-largest.sh <list-of-files>"
exit 0
fi
wc -l $@ | sort -n | tail -1
bash print-largest.sh
Returns the file with the largest number
of lines in the list provided
To execute:
print-largest.sh <list-of-files>
That's a more useful default behaviour! Notice the if structure we have used, where
we have included a new operator || (that is, two vertical lines; look for the key
| on your keyboard) that works as a logical OR (there is an equivalent
&& operator that works as a logical AND) and lets us combine two or more comparison tests.
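As a sketch of the equivalent && operator, which requires both tests to succeed before the branch is taken (the variable names here are illustrative):

```shell
# Both tests must succeed for the combined condition to be true.
X=7
if [[ $X -gt 0 ]] && [[ $X -lt 10 ]]
then
    RANGE="in range"
else
    RANGE="out of range"
fi
echo "$X is $RANGE"
```

If either comparison fails, the else branch runs instead.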
Other conditional structures.
Here you have seen how to use
bash
basic if…then…else…fi and the Else If ladder structure. However, there are otherbash
constructs that could be useful depending on the case under consideration:
- Case statements. This is a slightly more complex conditional structure, useful when we have several possible options. The general structure is:
case EXPRESSION in
  CASE1) COMMAND-LIST;;
  CASE2) COMMAND-LIST;;
  CASEN) COMMAND-LIST;;
  * ) COMMAND-LIST;;
esac
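A minimal, hypothetical example of a case statement that branches on a filename's extension (the filename and labels are illustrative):

```shell
# Branch on a (hypothetical) filename's extension with a case statement.
FILENAME="results.csv"
case $FILENAME in
    *.txt) FILETYPE="plain text";;
    *.csv) FILETYPE="comma separated values";;
    *.dat) FILETYPE="data file";;
    *)     FILETYPE="unknown";;
esac
echo "$FILENAME is a $FILETYPE file"
```

The patterns are tried in order, and the catch-all * pattern plays the role of a final else.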
File Test Operators
In addition to variable comparisons, there are other comparison operators that can be used to query the existence and attributes of files. This would allow the script author to, for example, test whether a file exists before trying to read it and potentially producing an error. The table below summarizes some of these operators:
Operator ARGUMENT | Purpose |
---|---|
-d DIRECTORY | Test for existence of a directory |
-f FILENAME | Test for existence of file |
-r FILENAME | Test if file is readable |
-w FILENAME | Test if file is writable |
-x FILENAME | Test if file is executable |
For example, to test if the directory molecules exists:
$ if [ -d molecules ]
> then
> echo molecules exists
> fi
Another common task is to identify directories in a certain location. Try typing the following for loop in a script called search-directories.sh and run it in our data-shell-scripting directory:
for filename in $@
do
if [[ -d $filename ]]
then
echo $filename is a directory
fi
done
It is also possible to check for the negative outcome of a test by preceding the statement with a ! symbol. For example:
if [[ ! -f myfile.txt ]]; then
echo "File does not exist"
fi
Logging a directory’s content
Try modifying the above script to create a log file within an identified directory. The log file should contain the names of the files inside the directory. Avoid rewriting the log file if it already exists.
Solution
for filename in $@
do
    if [[ -d $filename ]]
    then
        echo $filename is a directory
        # the log file is named after the directory (an illustrative choice)
        if [[ ! -f $filename/${filename}.log ]]
        then
            echo creating ${filename}.log
            cd $filename
            ls > ${filename}.log
            cd ..
        else
            echo "Warning: ${filename}.log is already present!"
        fi
    fi
done
The above script shows an example of using a few file test operators to check for the existence of directories and files before trying to perform an action (creating a log file). If the file test were not performed and the file already existed we would overwrite potentially valuable data. Checking for its existence first allows us to throw a warning in this case and perhaps perform another action (e.g. backing up the log file already present).
Key Points
The basic conditional structure in Bash is built as: if…then…else…fi .
Bash has operators specific for string and integer comparisons.
Bash also has comparison operators useful to test the existence of files and directories.
Arithmetic and Arrays
Overview
Teaching: 30 min
Exercises: 10 min
Questions
How to perform basic arithmetic operations in Bash?
How to define arrays?
Objectives
Understand Bash variables types and how to construct arithmetic operations.
Understand how to construct Bash arrays and access their contents.
Bash Variables
Before going into detail about how to perform arithmetic operations and define arrays
in Bash, it would be useful to do a quick recap of how variables are used in
Bash. As in other languages, Bash variables allow the script’s author to
refer to data by a label. In Bash this assignment is performed using the =
symbol. We can demonstrate this at the command line, using the export command to set
the variable:
$ export MYTEXT='Hello'
The contents of a variable can then be obtained by a process called dereferencing or variable substitution. To return the value assigned to a variable, prefix the label using the $ symbol, i.e:
$ echo $MYTEXT
Hello
More on Bash variables
If you are familiar with a language like C++ or Fortran, where variables have explicit types such as integers or characters, you may be surprised that the contents of a Bash variable may be assigned with no declaration or preamble. This is because Bash is what is known as an untyped language. Essentially, all variables are stored as strings and their contents are treated differently depending on context. So, for example, if I try and print the contents of a variable as above using the
echo
command, Bash treats it like a text string. If I try and multiply it by 2 then Bash will treat it as a number.Needless to say, whilst this can simplify the process of creating and assigning variables, it requires the author to be more careful in how variables are treated, as multiplying a string by 2 will not produce an error as it would when using a strongly typed language.
Arithmetic Expansion
Now we know that Bash treats variables according to context, let’s see how we can perform arithmetic operations whereby Bash regards the contents as a number. The format for the Bash arithmetic expansion is:
$(( arithmetic expression ))
For example:
$ echo $((1 + 3))
4
Arithmetic operations are surrounded by the $((…)) construct. Notice that the result of this expression can be again stored in a variable.
$ MYVAR=$((1 + 3))
$ echo $MYVAR
4
Without this construct Bash would treat 1 + 3
as a string and print it
accordingly:
$ echo 1 + 3
1 + 3
Consider the following script. Here we define two variables X and Y with integer values and proceed to use the construct $((…)) so that Bash understands that we want to treat the variables as numbers and wish to perform some simple mathematical operations. We store the result of these operations in a variable RESULT and use it in further arithmetic operations.
#!/bin/bash
# Assign two variables with integer values.
X=7
Y=12
# Add these two variables together, store the result in a third variable,
# and print out the result
RESULT=$((X+Y))
echo "X + Y = ${RESULT}!"
# Increment the result by 5.
RESULT=$((RESULT+5))
echo "RESULT is now ${RESULT}"
# Divide the result by 5. Remember bash only deals with integers.
DIVISION=$((RESULT/5))
echo "${RESULT} divided by 5 is ${DIVISION}"
Running this script should produce the following result:
X + Y = 19!
RESULT is now 24
24 divided by 5 is 4
Exercise
Now try modifying the previous script to perform some other calculations. How would you expect to be able to multiply two numbers, or take one number away from another?
Solution
#!/bin/bash
#Add these two variables together, store the result in a third variable,
#and print out the result
RESULT=$(($1+$2))
echo "$1 + $2 = ${RESULT}!"
#Increment the result by 5.
RESULT=$((RESULT+5))
echo "RESULT is now ${RESULT}"
#Divide the result by 5. Remember bash only deals with integers.
DIVISION=$((RESULT/5))
echo "${RESULT} divided by 5 is ${DIVISION}"
#Solution
MULTIPLY=$(($1 * $2))
echo ${MULTIPLY}
SUBTRACT=$(($1-$2))
echo ${SUBTRACT}
Counters
A useful Bash arithmetic feature is its capability to post/pre-increment/decrement variables, similar to other languages such as C++. For example, try the following:
$ MYCOUNTER=0
$ echo $((++MYCOUNTER))
1
$ echo $MYCOUNTER
1
$ echo $((++MYCOUNTER))
2
$ echo $MYCOUNTER
2
Now try:
$ MYCOUNTER=0
$ echo $((MYCOUNTER++))
0
$ echo $MYCOUNTER
1
$ echo $((MYCOUNTER++))
1
$ echo $MYCOUNTER
2
In the first instance we use a pre-increment operator, where the variable is
incremented by 1 before a command is executed (in this case echo). In the
second example we used a post-increment operator, where the operation/command is
executed first and then the variable is incremented. These operators can be
particularly useful when working with Bash C-styled for
loops, for example:
$ for ((i = 0 ; i < 5 ; i++)); do
> echo "my counter has a value of $i"
> done
my counter has a value of 0
my counter has a value of 1
my counter has a value of 2
my counter has a value of 3
my counter has a value of 4
You can find a more comprehensive list of Bash arithmetic operators here.
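A few of those operators in action; this is a minimal sketch, and the variable names are illustrative:

```shell
# Modulus, exponentiation and multiplication inside $(( )).
REMAINDER=$((24 % 5))   # remainder of integer division
POWER=$((2 ** 5))       # exponentiation
PRODUCT=$((6 * 7))      # multiplication
echo "24 % 5 = $REMAINDER, 2 ** 5 = $POWER, 6 * 7 = $PRODUCT"
```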
Sequences
Bash allows you to print sequences of numbers. Try the following:
$ seq 1 5
1
2
3
4
5
You can also define an increment:
$ seq 1 2 20
1
3
5
7
9
11
13
15
17
19
This can be useful for example if you need to run a command a defined number of times in a for loop:
$ for i in $(seq 1 10)
> do
> echo running iteration $i
> done
running iteration 1
running iteration 2
running iteration 3
running iteration 4
running iteration 5
running iteration 6
running iteration 7
running iteration 8
running iteration 9
running iteration 10
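As an aside, Bash's built-in brace expansion gives a similar shorthand for simple integer ranges without calling the external seq command (a minimal sketch; the counter variable is illustrative):

```shell
# {1..5} expands to the list 1 2 3 4 5 before the loop runs.
COUNT=0
for i in {1..5}
do
    echo "iteration $i"
    COUNT=$((COUNT + 1))
done
echo "ran $COUNT iterations"
```

Unlike seq, brace expansion happens entirely inside the shell, so it also works in environments where seq is not installed.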
Arrays
Arrays allow the script author to associate a number of values with a single label. This means you only need to remember one variable name instead of many, which can come in particularly useful if you have potentially hundreds of values you want to store, such as the words in a text file you have just read in.
In Bash, arrays are one dimensional, zero indexed, and sparse, and much like variables they don’t require formal declaration in order to use them.
Array feature | Description |
---|---|
Dimensionality | The number of indices used to locate an array element. For example, a two dimensional array could describe a set of rows and columns and have indexes i, j. |
Indexing | The number of the first element in an array. For example an array of ten elements can be referred to as elements 0 to 9 or 1 to 10. The former is zero-based indexing and the latter is one-based indexing. |
Layout | An array in which every possible element is defined is fully populated. One in which only certain elements, i.e. three out of ten, have values is sparse. |
For the most part you won’t need to worry about any of this, but if you’re familiar with language in which arrays are more tightly defined or have additional features, BASH arrays may seem quite simple and loosely defined.
Creating an array in Bash is as simple as enclosing the elements within a pair of brackets, ( ). We can demonstrate this at the command line. For example:
$ MY_ARRAY=(1 2 3 'Hawk')
This example demonstrates another potentially unexpected feature of Bash arrays: each of the elements can appear to be a different type. But remember, we learned that Bash is an untyped language, so this is a potentially useful consequence.
Accessing the contents of an array element can be achieved using the following construct:
$ echo ${MY_ARRAY[3]}
Hawk
Finding the size of an array is very useful when using them as part of loops. This can be done with the construct ${#ArrayName[@]}. For example:
$ echo ${#MY_ARRAY[@]}
4
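Putting the two constructs together, we can loop over every element of an array using its size as the loop bound (a minimal sketch; the counter variable is illustrative):

```shell
# Use the array size as the bound of a C-style loop over the indices.
MY_ARRAY=(1 2 3 'Hawk')
VISITED=0
for ((i = 0; i < ${#MY_ARRAY[@]}; i++))
do
    echo "element $i is ${MY_ARRAY[i]}"
    VISITED=$((VISITED + 1))
done
```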
Here’s a script that shows some additional ways of accessing array contents.
#!/bin/bash
# Define a one dimensional array
MY_ARRAY=(1 2 3 'raven')
# Print out the contents of the third element
echo "The third element is ${MY_ARRAY[2]}."
# Return the number of elements in an array
echo "The array contains ${#MY_ARRAY[@]} elements"
# Print the contents of the entire array.
echo "The array consists of: ${MY_ARRAY[@]}"
# Define a sparse array and print out the contents
SPARSE_AR[0]=50
SPARSE_AR[3]='some words'
echo "Index 0 of sparse array is ${SPARSE_AR[0]}."
echo "Index 1 of sparse array is ${SPARSE_AR[1]}."
echo "Index 2 of sparse array is ${SPARSE_AR[2]}."
echo "Index 3 of sparse array is ${SPARSE_AR[3]}."
That produces the following output:
The third element is 3.
The array contains 4 elements
The array consists of: 1 2 3 raven
Index 0 of sparse array is 50.
Index 1 of sparse array is .
Index 2 of sparse array is .
Index 3 of sparse array is some words.
Notice how, even though the fourth element of the sparse array consists of two words, it is still considered a single element because we've enclosed them in a single pair of quotes. We'll see in the section on Flow Control how we can use another form of array called a list to loop over blocks of commands to repeat operations without having to type out those commands multiple times.
Exercise
Consider the previous script. Try defining your own arrays and manipulating them. Can you assign an array element to another variable or include the contents of a variable as an array element?
Solution
#!/bin/bash
# Define a sparse array and print out the contents
SPARSE_AR[0]=50
SPARSE_AR[3]='some words'
VAR1=(${SPARSE_AR[0]})
echo "${VAR1}"
VAR2=77
SPARSE_AR[4]="${VAR2}"
echo ${SPARSE_AR[4]}
Key Points
Bash is an untyped language. This means that all variables are stored as strings.
The $(( )) construct is used to create arithmetic operations
Bash arrays are created by enclosing the elements within a pair of brackets, ( )
We can find the size of a Bash array with the construct ${#ArrayName[@]}
Functions and External Tools
Overview
Teaching: 25 min
Exercises: 10 min
Questions
How to create functions for easy repeated access to common tasks?
Objectives
Understand the syntax of Bash functions and how we can use them in our scripts.
Understand how we can assign the output of Bash commands to variables and use them in our scripts.
Functions
Functions allow tasks, corresponding to a number of individual operations, to be represented by a single label.
Generally, a function takes one or more variables as input and in most programming languages can potentially return a value. They can be thought of as a way of creating user-defined commands which can be repeatedly used to process different sets of input. A typical bash function has the following structure:
function function_name {
<commands>
}
A couple of key points regarding bash functions:
- Unlike most programming languages, there's no easy way of returning a value from a function in Bash (there are various techniques for achieving this but they are beyond the scope of this tutorial). In other programming languages it is common to have arguments passed to the function listed inside the brackets (). In Bash they are there only for decoration and you never put anything inside them.
- The function definition (the actual function itself) must appear in the script before any calls to the function.
The following code shows an example of a simple user-defined function, and how it is used by a script.
#!/bin/bash
# Define a simple function that simply prefixes any string it receives with the current date
function datestamp {
# The first variable passed to the function is stored as $1, as
# in the case of command line arguments
STR_RECV=$1
#Store the current date in a suitable format
DATE_FMT=$(date +'%d/%m/%y')
#Print out the combined date stamp plus string
echo "${DATE_FMT}; ${STR_RECV}"
}
# Call the function with an example string
datestamp "Here is some text"
Exercise
Create a script that allows you to pass a series of strings to it and prefix each one with the current date and time. This sort of operation can be useful when logging messages from your code so you can tell exactly when a potential problem occurred. Can you read in a file and prefix each line of the file with the date?
Solution
#!/bin/bash
# Define a simple function that simply prefixes any string it receives with the current date
function datestamp {
    # The first variable passed to the function is stored as $1, as
    # in the case of command line arguments
    STR_RECV=$1
    #Store the current date in a suitable format
    DATE_FMT=$(date +'%d/%m/%y-%H:%M:%S')
    #Print out the combined date stamp plus string
    echo "${DATE_FMT}; ${STR_RECV}"
}
## Solution 2
numargs=$#
COUNTER=1
for i in $(seq 1 $numargs); do
    datestamp ${COUNTER}
    COUNTER=$(( $COUNTER + 1 ))
done
This script should produce an output similar to:
$ bash functions_01_solution_01.sh command1 command2 command3
14/10/20-11:14:34; 1
14/10/20-11:14:34; 2
14/10/20-11:14:34; 3
An alternative solution could be to read the commands from a text file (e.g. date_example_input.txt):
#!/bin/bash
# Define a simple function that simply prefixes any string it receives with the current date
function datestamp {
    # The first variable passed to the function is stored as $1, as
    # in the case of command line arguments
    STR_RECV=$1
    #Store the current date in a suitable format
    DATE_FMT=$(date +'%d/%m/%y-%H:%M:%S')
    #Print out the combined date stamp plus string
    echo "${DATE_FMT}; ${STR_RECV}"
}
## Solution 2
cat date_example_input.txt | while read line; do
    datestamp $line
done
This script should produce an output similar to:
$ bash functions_01_solution_02.sh
14/10/20-11:14:34; 1
14/10/20-11:14:34; 2
14/10/20-11:14:34; 3
Integrating external tools
By now we have covered the most commonly used features of the Bash syntax. But one of the most useful aspects of shell scripting is the ability to integrate shell commands and either pass arguments to them or take output from them to use in the script.
In Bash an external command is run by enclosing it within a $( ) construct.
#!/bin/bash
# Invoke the external command 'ls' to display the contents of
# a directory and store that as a shell string.
DIRCONT=$(ls -1)
# Loop over the contents of the directory and print each file found
# to the screen.
for filename in ${DIRCONT}; do
echo "File Found: ${filename}"
done
In this example instead of explicitly specifying the files we want to look at we have dynamically obtained a list of the contents of the current directory.
This means we don’t have to then check to see whether a file exists and we can run the same code on different groups of files without modification.
This provides a useful template for other commands to be inserted into the for loop. Instead of just printing out the name of the file we could read in the contents using the cat command, copy it to another location, or append some rows of figures.
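For instance, a sketch of copying each file found into a backup directory. The file and directory names here are illustrative, and the example builds its own scratch directory so it can be run safely anywhere:

```shell
# Build a scratch directory so the example is self-contained.
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
echo "sample contents" > sample_file.txt
mkdir backup_dir

# Loop over the directory listing and copy each file found.
DIRCONT=$(ls -1 *.txt)
for filename in ${DIRCONT}; do
    cp "${filename}" backup_dir/
    echo "Backed up: ${filename}"
done
```

The loop body is the only part that changes between tasks; the surrounding pattern of capturing a listing and iterating over it stays the same.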
#!/bin/bash
#Find all the files in a directory hierarchy that are marked as executable
for file in $(find -executable); do
echo "File found: ${file}"
# Change file permissions so that all users can execute the file
# chmod go+r ${file}
done
The find tool is another shell command that descends a directory hierarchy and returns a list of files depending on such categories as name, location, type, and many other options.
In this case we have a hypothetical situation where only the current user has permission to execute files and we want to give everybody permission. Instead of changing the permissions of every file by hand, this simple script automates the task.
Note the chmod command has been commented out to prevent accidental permission changes. Only uncomment it if you’re certain you understand how it works.
Exercise
Try creating a set of subdirectories containing named files, each one with a list of words. Use the find command to retrieve only those matching a certain pattern and then read the words into a variable and print them out.
Solution
#!/bin/bash
mkdir script_test1
mkdir script_test2
echo "text1" > ./script_test1/text1
echo "text2" > ./script_test2/text2
COUNTER=0
# Find all the files in the directory hierarchy whose names match the pattern
for line in $(find . -iname 'text*'); do
    LINECAT=$(cat $line)
    MY_ARRAY[$COUNTER]="${LINECAT}"
    COUNTER=$((COUNTER + 1))
done
echo "Array contents ${MY_ARRAY[@]}"
More tools: sed and awk
awk simple examples
We previously showed you examples of using grep, a very powerful
tool specialized in searching one or more input files for lines containing a match
to a specified pattern. There are two additional tools worth mentioning:
sed and awk. Both tools specialize in text
parsing and general text processing. Although an in-depth explanation of how to
use them is beyond the scope of this course, we would like to provide you with a
couple of examples we find useful to demonstrate their functionality and tease you
into finding out more.
Let us start with awk
. Consider the following document in the data
directory:
$ cd data
$ cat amino-acids.txt
Alanine Ala
Arginine Arg
Asparagine Asn
Aspartic acid Asp
Cysteine Cys
Glutamic acid Glu
Glutamine Gln
Glycine Gly
Histidine His
Isoleucine Ile
Leucine Leu
Lysine Lys
Methionine Met
Phenylalanine Phe
Proline Pro
Serine Ser
Threonine Thr
Tryptophan Trp
Tyrosine Tyr
Valine Val
We can see it is a file composed of two columns, containing the long and short names of the amino acids. We can send the file's contents to awk and manipulate them. For example, say we are interested only in the first column:
$ cat amino-acids.txt | awk '{print $1}'
Alanine
Arginine
Asparagine
Aspartic
Cysteine
Glutamic
Glutamine
Glycine
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Valine
awk
is a programming language with many powerful tools. In the simple example
above the section within '{}'
is the program defining the actions to be performed on the input, in this case, to print
column $1
. Don’t get confused with bash
’s interpretation of $1
(the first argument passed to a script); for awk
it represents
a column number. Try to extract column 2 instead, and even swap the positions of
columns 1 and 2.
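For instance, swapping the two columns only requires naming them in the desired order inside print. The sketch below uses a small inline sample instead of the amino-acids.txt file so it is self-contained:

```shell
# Print column 2 first, then column 1
printf 'Alanine Ala\nArginine Arg\n' | awk '{print $2, $1}'
# Prints:
# Ala Alanine
# Arg Arginine
```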
Now for a slightly more complex, and perhaps more useful, example. Consider the contents of
animal-counts
in the data
directory. We can see that it is a CSV file
(comma-separated values).
$ cd data/animal-counts
$ cat animals.txt
2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
2012-11-06,deer,2
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-07,bear,1
If we try to extract the first column as before, we quickly find out that it doesn’t work:
$ cat data/animal-counts/animals.txt | awk '{print $1}'
2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
2012-11-06,deer,2
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-07,bear,1
The problem is that by default awk
expects a space as the field separator, but in this
case we have a ,
. Fortunately awk
is flexible and lets us define a different symbol
to use as the field separator. This needs to be defined in a special section of the awk
script, the BEGIN
section:
$ cat data/animal-counts/animals.txt | awk 'BEGIN{FS=","} {print $1}'
2012-11-05
2012-11-05
2012-11-05
2012-11-06
2012-11-06
2012-11-06
2012-11-07
2012-11-07
Say that we want to know the total number of animals recorded in our file. For this we need to modify our script to perform a summation of every element in column 3 and report the final result.
$ cat data/animal-counts/animals.txt | awk 'BEGIN{FS=","} {sum+=$3} END{print sum}'
76
In the above example we can see the use of the three main sections of an awk
program: the BEGIN
and END
sections and the main script. There is much
more that can be accomplished with awk
, but hopefully these simple examples
gave you a taste of what is possible.
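As one final taste, awk also has associative arrays, which combine naturally with the three-section structure. The sketch below (again using an inline sample in place of animals.txt, so it is self-contained) totals the counts per animal; note that the order in which awk reports the array keys is not guaranteed:

```shell
# Sum column 3 separately for each animal named in column 2
printf '2012-11-05,deer,5\n2012-11-06,deer,2\n2012-11-05,rabbit,22\n' |
awk 'BEGIN{FS=","} {sum[$2]+=$3} END{for (a in sum) print a, sum[a]}'
```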
sed simple examples
sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).
Usual syntax
$ sed SCRIPT INPUTFILE...
Consider the animals.txt
file in the data
directory:
$ cd data
$ cat animals.txt
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
Imagine that we made a mistake and deer wasn’t the correct category, maybe we
should use elk instead. Amending potentially hundreds of files could be very
time consuming but with sed
this task can be performed in a few seconds:
$ sed 's/deer/elk/' animals.txt
2012-11-05,elk
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,elk
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
By default sed
doesn’t modify the input file. You can check this by printing the
contents of animals.txt
again. If we wanted to save the output of sed
, we would have two
options:
$ sed 's/deer/elk/' animals.txt > animals-fixed.txt
or modify the file in place (check sed --help
), but be careful, as this could
destroy the original file if you make a mistake:
$ sed -i 's/deer/elk/' animals.txt
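With GNU sed, a somewhat safer sketch supplies a backup suffix to -i, so the original is preserved before editing in place (the file name demo.txt and suffix .bak are our choices for this demo):

```shell
# Create a throwaway file so we don't risk real data
printf '2012-11-05,deer\n2012-11-06,fox\n' > demo.txt

# Edit in place, saving the original as demo.txt.bak
sed -i.bak 's/deer/elk/' demo.txt

cat demo.txt       # the edited copy
cat demo.txt.bak   # the untouched backup

rm demo.txt demo.txt.bak   # clean up
```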
But sed
has many more useful functions (check the [manual][sed-manual]),
for example:
-
Print a specific line:
$ sed -n '45p' file.txt
-
Delete a specific line:
$ sed '10d' file.txt
-
Append text after a line:
$ sed '2a hello there' file.txt
-
Insert text before a line:
$ sed '2i hello there' file.txt
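These editing commands can also be chained in a single pass with the -e option. The sketch below applies two substitutions at once to an inline sample:

```shell
# Apply two substitutions in one invocation of sed
printf '2012-11-05,deer\n2012-11-07,bear\n' |
sed -e 's/deer/elk/' -e 's/bear/wolf/'
# Prints:
# 2012-11-05,elk
# 2012-11-07,wolf
```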
Things can get more complex. But hopefully the above examples can give you an indication of what is available on Linux and the command line.
Key Points
Functions help us pack a set of operations with a single label.
Generally, Bash functions do not return values.
The output of Bash commands like cat, ls or find can be assigned to a Bash variable using the construct VAR=$(command).
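The last key point can be sketched in one line; wc -l here stands in for any command whose output we want to capture:

```shell
# Assign a command's output to a variable with VAR=$(command)
COUNT=$(printf 'alpha\nbeta\n' | wc -l)
echo "Line count: ${COUNT}"
```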