Abstract:

The bourne shell (/bin/sh) is both the command interpreter and the basic programming language on UNIX systems. Any serious UNIX user should be able to write little programs - called shellscripts - in sh. It is very easy to gain this ability from your interactive use of sh, since the syntax and the commands are the same. You may also use the knowledge you gained to modify existing shellscripts, such as the ones that start up your system at boot time.

However, writing somewhat advanced programs in sh quickly gets hairy, as it is a language with low expressive power, but with some field-grown, "interesting" constructs on top. At that point, people can either switch to a different language or improve their /bin/sh programming knowledge.

This Web page is for the latter kind of people, and also for people forced to use /bin/sh for various reasons (e.g. maintaining legacy scripts or being responsible for startup scripts of some kind).

Intended audience: People writing UNIX shell scripts who have problems with
  • keeping scripts correct when they get complicated
  • interaction with other programs, runaway scripts
  • portability or reliability (i.e. ensuring actions are taken)
Required knowledge: Basic shell script knowledge. It also can't hurt if you know what signals are, i.e. that pressing Control-C on your keyboard (usually) sends the signal SIGINT to all processes running in the foreground.


Portability of these constructs

This Web page presents constructs that are part of the POSIX 1003.2 specification. They are usable on all conforming shells, and they don't push POSIX to the limit: the constructs here are available in any halfway modern bourne shell variant, including ksh.

All example constructs have been tested to work on FreeBSD's /bin/sh (derived from 4.4BSD ash, but seriously changed since then), bash2, SunOS 5.5 /bin/sh and /bin/ksh and pdksh. FIXME original ksh?

However, if you intend to write scripts that have to run on very old systems, you might have to install a more modern shell on those systems. In that case, installing a completely independent scripting system beside the one used for legacy scripts might be useful.

Use traps

There are quite a few different approaches to signal handling in shells, as outlined in my web page about SIGINT and shell script termination. The different behaviour of bourne shell implementations is obviously a problem for the portability of your shell scripts.

Thankfully, there's a way around that: Always install trap handlers for SIGINT and SIGQUIT, even if your desired action is the default action in most shells anyway. Just remember to kill yourself (the script, that is) if you use SIGINT/SIGQUIT as an abort signal; otherwise you face even more problems with runaway, unbreakable scripts (as outlined in sigint.html).

For a normal script that calls no interactive programs which might use SIGINT or SIGQUIT themselves, so that these signals should always terminate the whole script, it looks like this. The script also illustrates the use of traps to remove temporary files even when the shellscript is killed by a signal.

#! /bin/sh

tmp=/tmp/s.$$
onint ()
{
	# maybe do some cleanup here
	rm -f $tmp
	trap - SIGINT      # reset SIGINT to its default action, then
	kill -SIGINT $$    # kill ourselves with the signal we received
}
trap onint SIGINT
trap onint SIGQUIT # ignoring that this will exit with SIGINT when
                   # sent SIGQUIT. In POSIX there is no way
                   # for a shell script trap handler to learn the
                   # number of the signal it was invoked for.
trap onint SIGTERM # In case someone kills this script with kill(1),
                   # the cleanup procedure should also run.
rm -f $tmp         # in case a stale temp file is lying around
# do a long loop that *must* be interruptable
for file in *.dat ; do
	dat2ascii $file >> $tmp
done
# whatever your temp file was good for
wc $tmp
rm $tmp

The script above ensures that the shellscript ends when you break one of the dat2ascii processes with SIGINT, no matter whether that process exits with a signal-indicating exit status or not.

As a side note, the specification of signal names and numbers in POSIX is a little annoying, imho. You are guaranteed to be able to use symbolic names (POSIX spells them without the SIG prefix, e.g. INT, although most shells also accept the SIGINT form). But you are only guaranteed to be able to use the numbers (written as digits) for the few signals whose numbers have always been the same everywhere. This may make sense to keep scripts portable, but it's a fact that most old shells only allowed numbers, not names, so scripts using names will most definitely break on old shells.
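For illustration, here is the same trap set both ways (a sketch; onint is the handler from the script above):

trap onint 2 3      # numbers: 2 is SIGINT, 3 is SIGQUIT on all
                    # historic systems; old shells need this form
trap onint INT QUIT # POSIX names, which old shells reject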

Back to our example. So we managed to ensure that the script is properly terminated on the first SIGINT. But if your shellscript runs interactive programs that use SIGINT or SIGQUIT as normal keys that shouldn't terminate the whole thing, you have to ignore these signals in your shell while such a program runs. Shells that are implemented with such programs in mind will behave right anyway, but by using the trap command explicitly you ensure that your script doesn't break on the others.

#! /bin/sh

# while emacs is running, ignore the signals
trap '' SIGINT
trap '' SIGQUIT
# Understand the difference between ignoring and defaulting. The empty
# string '' as a trap action ignores the signal. A '-' as the action
# (used below) resets it to the default.
emacs -nw /tmp/bla

# Now that emacs is done, give SIGINT and SIGQUIT their default
# actions back. Even better, use the trap handler construction from
# the non-interactive script above.
trap - SIGINT
trap - SIGQUIT
cat /tmp/bla >> $HOME/sent-mail
mail joey < /tmp/bla
rm /tmp/bla
If you don't ignore the signals while emacs is running, shells of Type 3 from my sigint.html would execute cat, mail and rm when your emacs session ended without using C-g, but would not execute them when you used C-g in emacs. Since C-g is part of normal editing in emacs, it should not have any effect on later parts of the shell script. As noted on that page, I think this behaviour isn't useful for exactly this reason, but using this construction you can protect yourself from these effects.

And don't forget that system(3) calls from C programs are shellscripts as well: an instance of /bin/sh sits between you and your called program. You might want to use the same construction to protect yourself from undesired effects similar to those in the former shellscript example.

system("emacs /tmp/foobar.$$;mail foo@bar.com < /tmp/foobar.$$;rm /tmp/foobar.$$");
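A sketch of the protected variant of the same call, using the trap commands inside the command string (the trap - reset needs a reasonably modern shell):

system("trap '' SIGINT SIGQUIT; emacs /tmp/foobar.$$; trap - SIGINT SIGQUIT; mail foo@bar.com < /tmp/foobar.$$; rm /tmp/foobar.$$");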

Always setting traps also saves you from strange effects if your shellscript is called from another shellscript that already ignores signals. Unless you re-default the signals, your script would ignore them, too. (Beware that POSIX allows a non-interactive shell to refuse to change the disposition of signals that were ignored on entry, so test this on the shells you care about.) On the other hand, it might be useful to inherit the signal settings, so take the required time to reason about the right thing to do.

Asynchronous traps and blocking programs

Now that you've seen the wonderful world of traps in shells, there is something about them that I find unpleasant.

When a shellscript is running and a signal that is trapped (to a shellscript routine of your choice) arrives while a foreground child is running, the trap function in the shell script is called only after the child exits.

In straightforward use this is a problem, since no trap handler you install can do anything about a program that blocks all signals and refuses to give control back.

If you call such a blocking program directly, program flow will never come back to the shell; in interactive use this means you'll never get your prompt back.

In interactive use, you can usually suspend the script (Control-Z, which sends SIGTSTP) to get commandline control back, but job control is not available in every shell, and it may stop more than you intended (i.e. if you stack shell scripts).

Sending SIGKILL isn't perfect either, since it requires you to get an additional command prompt over the network or on a different (virtual) terminal, and there may not be another login possibility. Also, SIGKILL will not kill processes that hang in system calls, for example processes hanging on dead NFS filesystems.

Thankfully, POSIX 1003.2 requires that the wait builtin is interruptible, so you can solve the simple problem of getting your prompt (or control in the calling script) back with SIGINT this way:

#! /bin/sh

./some-blocking-program &
wait $!
The blocking program will continue to run in the background. This may be useful since it has a chance to complete whatever it attempted to do. If you don't want this, write the script like this:
#! /bin/sh

pid=
onint()
{
    kill $pid
}

trap onint SIGINT
./hardguy &
pid=$!
wait $pid

If you want more flexible handling of possibly blocking programs, you can extend the mechanism to call wait in a loop, doing the bookkeeping and decisions in the trap handler.
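Here is a sketch of that, with a hypothetical maybe-blocking-program standing in for the real work:

#! /bin/sh

interrupted=
pid=
onint()
{
	interrupted=1          # bookkeeping: remember we were hit
	[ -n "$pid" ] && kill $pid 2>/dev/null
}
trap onint SIGINT

for file in *.dat ; do
	./maybe-blocking-program $file &
	pid=$!
	wait $pid              # interruptible, unlike a foreground child
	if [ -n "$interrupted" ] ; then
		# the decision: clean up, skip the file or stop entirely
		exit 1
	fi
done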

However, this way of doing things quickly becomes a mess if you have several different commands in a shell loop that may or may not block.

Many such constructs would become simpler if traps were called immediately, while the foreground child is still running. You would just install a trap handler that does "something" about the problem, and it would be called every time you hit SIGINT (or SIGQUIT), just as signal handlers in C programs are called immediately. Maybe I am too much of a C programmer, but I find the delayed sh behaviour very unintuitive.

#! /bin/sh

onsig()
{
	trap - SIGINT      # reset to the default action, then
	kill -SIGINT $$
}
set -T            # set async execution of traps on FreeBSD
trap onsig SIGINT
./some-blocking-program
set +T            # set trap execution behaviour back to normal

This makes the trap handler a bit more complicated, but it allows you to write the main part of your shell script as usual, without keeping in mind that any program might block and taking special action for it.

If you had a more complex script instead of just one call to some-blocking-program, you would face serious complications: you would have to run every command as a background child, remember its pid and place the wait commands in the right places.

To enable constructs like this, I introduced the -T switch in FreeBSD's /bin/sh. With this switch, your shell will execute traps immediately.

Remark: shell switches like this may be given from inside the script as shown here, from the commandline as in sh -T script.sh, or even on the first line of the script, #! /bin/sh -T, as long as you don't need more than one parameter string (the string may contain more than one option letter).

But keep in mind that the former construction involving backgrounding and wait is the only portable solution.

Backquote and alternatives

To execute a command and use its output in expressions (i.e. to assign it to a variable), you usually use the backquote, like this:

foo=`echo bar | grep bar`  

However, POSIX specifies a second mechanism:

foo=$(echo bar | grep bar)

The latter has several advantages:
  • $(...) can be nested without fragile backslash escaping
  • quoting inside $(...) works like quoting anywhere else
  • the opening and closing delimiters are distinct, so your editor can match them
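For example, nesting (here: the name of the current directory) works without any escaping:

dir=$(basename $(pwd))
dir=`basename \`pwd\``    # the backquote version of the same thing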

I like the construct and would like to see it in more shell scripts. The readability and the parenthesis matching are a big help, especially when reading other people's scripts.

emacs shellscript-mode

emacs has a mode for editing shellscripts. I don't use it often, but when I do, I use some settings to avoid the worst of the braindeadness; maybe you will like them if you use emacs.

This is part of my emacs setup.

(add-hook 'shell-mode-hook '(lambda () (modify-syntax-entry ?. "w")))
(setq sh-shell-file "/bin/sh")
(add-hook 'sh-mode-hook '(lambda () (setq tab-width 8)))
(add-hook 'sh-mode-hook '(lambda () (setq indent-tabs-mode t)))
(add-hook 'sh-mode-hook '(lambda () 
                           (substitute-key-definition 
                            'backward-delete-char-untabify
                            'delete-backward-char
                            sh-mode-map)))
This makes it much more usable. Any tab width other than 8 is evil, since everyone else (probably including your printer) uses 8, and your scripts will look messed up otherwise. The sh-shell-file setting overrides the default of using your login shell as the shell for new scripts and uses /bin/sh instead; that default is probably a good candidate for the "bad idea of the year" (TM).

Avoiding common wastages

You often see scripts like this (for example, on my old systems :-):
#! /bin/sh

for file in *.html ; do
	target=`echo $file | sed 's/\.html$/.dat/'`  
	grep bla $file > $target
done
This isn't necessary; POSIX 1003.2 defines the basic editing capabilities needed for this, and the shells I have access to all implement them without problems.

A modern variant looks like this:

#! /bin/sh

for file in *.html ; do
	target=${file%.html}.dat
	grep bla $file > $target
done
Look up the manpage of your shell; such constructs are available for removing text at the beginning (# and ##) and at the end (% and %%) of a variable. While you're at it, make sure you are familiar with the default parameter replacement constructs (the other ${varname...} constructs).
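A few examples, using a made-up path; the comments show what a POSIX shell prints:

file=/usr/local/lib/libfoo.so.1

echo ${file#*/}      # usr/local/lib/libfoo.so.1 (shortest prefix match)
echo ${file##*/}     # libfoo.so.1               (longest prefix match)
echo ${file%.*}      # /usr/local/lib/libfoo.so  (shortest suffix match)
echo ${file%%.*}     # /usr/local/lib/libfoo     (longest suffix match)
echo ${bar:-default} # default, if bar is unset or empty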

Avoid common laziness

#! /bin/sh

if [ $# != 2 ] ; then
	echo 'You fool!'
	exit 1
fi
In C, people usually remember to send error messages to stderr, but in shellscripts this is - sadly - less common. Do it like this:
	echo 'You fool!' 1>&2

Use the right variable when you mean "everything"

If you want to pass all the commandline parameters unchanged to a program called from a shellscript, do not use $*; use "$@", including the double quotes. Otherwise, parameters that have whitespace in them will be messed up. Example script:
#! /bin/sh

someprog $*
someprog "$*"
someprog "$@"

Example call of this script:

./script 1 '2 3'
...ends up calling 'someprog' as if you had directly typed:
someprog 1 2 3
someprog '1 2 3'
someprog 1 '2 3'
As you can see, only the last one is right.

The nice thing is that you can still use shift to get rid of the first parameter(s) and pass only the rest to some other program, preserving whitespace-containing parameters both in shifting and in calling.
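A sketch of such a wrapper, reusing someprog from above as a placeholder: the first argument selects a mode, everything else goes through intact:

#! /bin/sh

mode=$1
shift
case $mode in
	fast) someprog -q "$@" ;;
	*)    someprog "$@" ;;
esac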

What exactly is the difference between single and double quotes?

Some special characters retain their meaning in double quotes, but not in single quotes: the dollar sign, the backquote and the backslash. Essentially this means that variable expansion (including ${varname%.txt} and similar constructs) and command substitution (inserting a command's output into the text in double quotes) still work inside double quotes. Be aware that "$@" is a special thing in itself.
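A quick demonstration:

name='joey user'
echo "$name"          # joey user  - the variable expands
echo '$name'          # $name      - taken literally
echo "${name% user}"  # joey       - expansion constructs work, too
echo "on `date`"      # command substitution works in double quotes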

Use getopt(1) with care

The shell variant of getopt - the one whose result is passed to set - cannot work with switch parameters that have whitespace in them. There is no way of fixing this without breaking behaviour when switch parameters contain shell metacharacters; at least those were my findings when I last tried to fix it in FreeBSD. If you're in doubt, see the FreeBSD history of this utility in /usr/src/usr.bin/getopt.

Huh?

Sorry, in clear text that means: if you have a shellscript that uses switches, and one of these switches accepts a parameter, the whole thing will not work when that parameter has whitespace in it, although this works with no problems in C programs that use getopt(3) and in shellscripts that don't use getopt(1).

./shellscript -i -f 'bla fasel' -q
In this call, 'bla' and 'fasel' will end up as two separate arguments after processing in the shellscript, breaking the whole commandline parsing.
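To see where it breaks, here is a sketch of the typical getopt(1) parsing loop, with option letters matching the call above:

#! /bin/sh

args=`getopt if:q "$@"` || exit 1
set -- $args        # this word split is what loses the quoting
while [ "$1" != -- ] ; do
	case "$1" in
	-i) iflag=1 ;;
	-q) qflag=1 ;;
	-f) file=$2 ; shift ;;   # gets 'bla'; 'fasel' is left dangling
	esac
	shift
done
shift               # remove the --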

I'm working on a solution, although it may not be as non-intrusive as getopt(1) tries to be. Watch my software page if you're interested (shameless plug :-).

Arithmetic

POSIX 1003.2 specifies that shells implement basic calculations like this:
bla=$(( (3 + 5) * 4))
which is equivalent to:
bla=`expr \( 3 + 5 \) \* 4 `
Notes:

If you want to use boolean expressions in control constructs, you cannot use the exit status of $((...)), since none is returned. You can, however, compare the result to "0". I think this approach is much cleaner anyway: expr(1) returns its result in two ways, as a string and as an exit status, and an exit status != 0 should be reserved for "real" errors, such as syntactically wrong expressions.

if [ $(( 3 > 4 )) != 0 ] ; then echo yes ; else echo no ; fi
==> no
if [ $(( 3 > 2 )) != 0 ] ; then echo yes ; else echo no ; fi
==> yes
If the shell has a builtin test(1) (the [...] construct is just another syntax for calling test), no programs are spawned at all. Even in shells that execute an external test(1), it still saves the call to expr(1), roughly doubling the speed.
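A typical place where this pays off is a counted loop, which needs no expr(1) fork per iteration (do_something_with is a hypothetical command):

i=0
while [ $i -lt 1000 ] ; do
	do_something_with $i
	i=$(( i + 1 ))
done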

Fun with 'test -n' and switchless 'test'

People can't agree on the proper thing to do when test(1)'s -n option is used without the required additional parameter. As you read this, remember that the [...] construct is just syntactic sugar for test(1).

test -n something should return true if 'something' is a string with a length greater than zero. The problem is that test(1) should also return true when just a string, with no switches, is passed.

test -n bla # case is clear: -n is used and there is a string and its
            # length is more than 0 ==> true
                    
test -n ''  # case is clear: -n is used and there is a string which is
            # of length zero ==> false

test        # case is clear: -n is not used and there is no string
            # passed ==> false, but could also count as syntax error.

test ''     # case is clear: -n is not used and the string - although
            # one is given - has a length of zero ==> false

test -n     # Now what is this? Is it the switch -n and its parameter
            # forgotten which would lead to a syntax error?
            # Or is it just the string '-n', not a switch and should 
            # therefore return true?

Of the possible interpretations of the last case, I've seen everything implemented in one shell/test combination or another.

I don't offer an opinion on what is right here, except that the switchless behaviour should never have been included in test(1), as it just duplicates the -n option. Besides, this is a typical example of misusing constructs. The return code of UNIX commands is usually used to signal serious problems, failed program runs. By using the return code as the normal way to communicate a result, you lose the ability to make it clear when something serious went wrong, such as a call with entirely broken syntax. Oops, I guess that counts as an opinion.

POSIX 1003.2 is clear about the issue: one parameter always makes test(1) return true if it isn't the empty string; a call to test(1) with just one parameter can never be a switch missing its parameter. However, this doesn't really improve the situation, since many current systems don't follow this rule.

What makes the situation really bad (and eliminating the switchless syntax necessary) is the fact that it is easy to lose an empty string somewhere. While test -n '' is clear and defined, many shells and shellscripts aren't careful enough not to throw empty strings away when handling variable assignments and usage. Thus the call would in fact lead to the fatal test -n without an additional parameter.

To understand the following example, recognize that a variable that is assigned the empty string evaluates to nothing in a shellscript context - not to the empty string (bad enough). Thus, to test whether it is the empty string while making sure that some string is passed to test(1) at all, it is followed or preceded by double quotes with nothing inside.

#!/bin/sh

foobar=""
if test -n ""$foobar; then
	echo "Help, I am broken"
fi 

There are shells where this still goes wrong, ignoring the fact that we already uglified our code beyond recognition for the shell's sake. What happens here is not that ""$foobar counts as a nonempty string. What happens is that the complete ""$foobar is thrown away - although the double quotes are as explicit in the code as they can be - so that test(1) still doesn't get any parameter, not even an empty one. FreeBSD has just recently been fixed here (thanks to Tor Egge), while NetBSD still uses the old version that removes empty-but-existing strings without mercy. The result in this case is that 'test -n' is called without an additional parameter, and 'test -n' in 4.4BSD counts as 'no -n switch has been used, this is just the string "-n"' ==> BOOM! OpenBSD uses pdksh as the default shell, BTW.

And if that wasn't enough, many shells have a builtin test(1) (bash, for example, but not FreeBSD's /bin/sh). Thus, your script may call different implementations of test(1) even on the same machine.
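The everyday defensive idiom, which the functions below build on as well, is to glue a harmless prefix onto both sides of the comparison, so that test(1) never sees a bare -n or a vanished argument:

if [ x"$foo" != x ] ; then
	echo 'foo is not empty'
fi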

Getting it almost right:

#! /bin/sh

# reusable function; returns 0 (true) when the argument string is
# non-empty, so it can replace 'test -n'
isempty1 ()
{
    # a string starting with - would confuse test(1), but it is
    # certainly non-empty, so detect that case via ${*#-} first
    [ dreck"${*#-}" != dreck"${*}" ] && return 0
    # at this point, it does not begin with -, so we can pass it to
    # test(1) without a switch
    [ "$*" ] && return 0
    return 1
}

# test cases
unset foo ; isempty1 $foo && echo ja: $foo
foo=""    ; isempty1 $foo && echo ja: $foo
foo="n"   ; isempty1 $foo && echo ja: $foo
foo="-n"  ; isempty1 $foo && echo ja: $foo
foo="-nn" ; isempty1 $foo && echo ja: $foo
foo="-n n"; isempty1 $foo && echo ja: $foo
foo="n -n"; isempty1 $foo && echo ja: $foo

This solution works by never using the '-n' switch to test(1), neither on purpose nor by accidentally having it passed in as part of the tested string.

The drawback is that you need a shell where the ${varname#xyz} construct is implemented, which isn't the case on older shells.

For modern shells this is an improvement, since all of them have the ${varname#xyz} construct, but even some current shells don't treat the -n switch right.

The following is a less efficient version that should work on older shells:

#! /bin/sh

# reusable function, same semantics as isempty1 above
isempty2 ()
{
    # detect a leading - with grep instead of ${*#-}
    echo dreck"${*}" | grep dreck- > /dev/null && return 0
    # at this point, it does not begin with -, so we can pass it to
    # test(1) without a switch
    [ "$*" ] && return 0
    return 1
}

# test cases
unset foo ; isempty2 $foo && echo ja: $foo
foo=""    ; isempty2 $foo && echo ja: $foo
foo="n"   ; isempty2 $foo && echo ja: $foo
foo="-n"  ; isempty2 $foo && echo ja: $foo
foo="-nn" ; isempty2 $foo && echo ja: $foo
foo="-n n"; isempty2 $foo && echo ja: $foo
foo="n -n"; isempty2 $foo && echo ja: $foo

The efficiency of this version is of course horrible; a fork and exec of grep(1) for every test isn't exactly what you want.

Well, next shot: make the function select the right implementation at runtime, depending on the shell's ability to process the ${varname#xyz} construct.

#! /bin/sh

if [ n`sh -c 'foo=txt.bla ; echo ${foo#txt.}' 2>/dev/null` = nbla ] ; then
	eval 'isempty ( ) { isempty1 "$@"; }'
else
	eval 'isempty ( ) { isempty2 "$@"; }'
fi

# test cases
unset foo ; isempty $foo && echo ja: $foo
foo=""    ; isempty $foo && echo ja: $foo
foo="n"   ; isempty $foo && echo ja: $foo
foo="-n"  ; isempty $foo && echo ja: $foo
foo="-nn" ; isempty $foo && echo ja: $foo
foo="-n n"; isempty $foo && echo ja: $foo
foo="n -n"; isempty $foo && echo ja: $foo


So you need a better language

If you feel you can't reach your goals with a bourne shell script alone anymore, which language would I recommend as a next step?

TCL is a small language whose syntax is probably as close to the bourne shell as real languages get. If you program both bourne shell and tcl, you will find many places where their similarities help you. TCL is also very strong for writing little applications with graphical user interfaces. I'd say TCL is right if you generally like the bourne shell, if you can't invest as much time into programming as would be required to master two really different languages, and/or if you want to program GUI applications.

perl. I don't like perl. Its power is enormous, but in my opinion big programs become unmaintainable too fast, and that makes its power much less useful. Also, like other "little" languages, its only implementation is a bytecode interpreter, and that may cause enormous speed differences. Most common tools in perl are really fast, but once you implement something on your own, performance drops horribly. For example, I once wrote a fault-tolerant string matching function that couldn't be expressed as a regular expression. I had to walk the string in perl code, and the speed was nearly unusable. Of course, the bourne shell and tcl are much slower than perl, but for its enormous power perl's speed just doesn't match. perl is unbeatable if you face input data that is more or less chaotic and doesn't follow a real regular syntax. Also, perl offers access to more features of UNIX than most other scripting languages.

awk is great for processing ASCII data files (or program output) of all kinds: making statistics, finding specific things, even using a multiple-table database structure. I find awk much more elegant for these tasks than perl; its heavy orientation towards exactly these tasks makes the programs much smaller. perl is better when the input data is chaotic, but if the input data is under your control, awk still rules, IMHO. Of course, you might also use perl to convert chaotic data into regular, awk-friendly data and write your program logic in awk. Another big advantage of awk is that it is available on all UNIX systems without requiring the administrator to install your pet language.

If you want a scripting language that can also use lower-level features of the UNIX system (networking, signals) and packs them into a nice regular syntax, you might want to check out scsh, the Scheme shell. Perl is also good at accessing lower-level UNIX features.

C, as the native language of UNIX, is of course the language that makes using UNIX features most convenient, and its speed is also great. But it is quite hard to write C programs that are as safe (i.e. in case of errors they give a clean error message and exit instead of producing wrong results) and as free of hard limits (artificial limits on data structure sizes are almost unavoidable for a new C programmer) as your usual "little language" program. Also, the way of thinking doesn't really match the one from shellscripting, so I would recommend C as an addition to shell scripting only if you have the time to maintain knowledge and gain experience in two languages.

Among the small languages, Python is probably the one with the strongest program organization features. If you want to write bigger systems and make heavy use of code reuse, you might want to look into Python. Programming Python is also a great way to gain some of the discipline that "real" programming languages require. On the downside, I don't think Python offers great scripting features (for example, special syntax to call UNIX commandline utilities), its speed is as horrible as that of other scripting languages, and it is quite common to face detectable errors at runtime instead of compile time.