Foundations for BASh Part 1

Laying a Foundation for Understanding

In my post about how to set up bash for data analysis, I introduced the concept of a system shell, the BASh command line, and how to get BASh ready for data analysis tasks. In that post, I promised a full series covering all aspects of BASh command line operations and how to write shell scripts for data analysis. This is the first of two posts kicking off that series.

Over this and the next post, we are going to cover a few key BASh-related concepts outside of command line operations and shell scripting that anyone learning to undertake those tasks would first benefit from understanding. These are concepts you will frequently encounter when using the command line and writing shell scripts but are very rarely explained in detail.

Anatomy of a BASh Command

BASh commands are the little words you type on the command line of a terminal application to tell a Unix-like operating system what to do. Learning to use even basic commands can open a world of untapped potential in your system and make computing tasks much easier and more efficient. Commands and chains of commands, known as pipelines, saved in text files form the basis of the shell scripts we will be learning to write to both automate and document data analysis-related tasks.

A command typically consists of a program name followed by options (also called flags) and arguments.

wc -l example_file.txt

In this example, the program name wc, short for “word count”, refers to a program that exists somewhere on the system, which the shell will locate and run.

Options usually begin with a dash and alter the behaviour of the program. Run without options, wc reports the number of lines, words, and bytes in a file; the -l option tells it to report only the number of lines. In this case, example_file.txt is the argument, which determines the file that wc will act on.

Commands can have multiple options and arguments. Options may be given individually or combined after a single dash. Multiple arguments are also often acceptable.

# individual options
wc -l -w example_file.txt

# combined options - works same as above
wc -lw example_file.txt

# multiple arguments - counts words in two files
wc -w example_file_1.txt example_file_2.txt

# combined options with multiple arguments
wc -lw example_file_1.txt example_file_2.txt

This is where we encounter one of the first “quirks” of shell commands: options and arguments are not standardised. This is not unique to BASh either, but a feature of many shell scripting languages.

Here are a few things to be mindful of:

Options might be a single dash and one character (e.g., -l), two dashes and a word (e.g., --lines), or one of a few other formats you might encounter in your career as a shell user.
The same option might have different meanings to different programs: in the command wc -l, the option -l means “lines of text,” but when used with the ls command, the -l option means “long output”.
Programs might use different options to mean the same thing, such as -q for “run quietly” and -s for “run silently”.
Some options require a value, such as -n 5, and the white space between the option and its value may or may not be mandatory. For example, git log -n 5 and git log -n5 work identically.
Arguments usually represent filenames for input or output, but they can be other things, like directory names or numbers dictating how many results are returned, as in the git log -n 5 example above.
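The quirks listed above are easiest to see side by side. What follows is a minimal sketch rather than a definitive reference: the file name demo.txt is made up for illustration, and head (a standard program not otherwise covered in this post) is used only because it takes an option with a value.

```shell
# Create a small throwaway file with three lines (demo.txt is a made-up name).
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/demo.txt"

# Same option letter, different meaning to different programs.
wc -l "$workdir/demo.txt"    # here -l means "count lines"
ls -l "$workdir/demo.txt"    # here -l means "long output"

# An option that takes a value, written with and without white space.
with_space=$(head -n 2 "$workdir/demo.txt")
without_space=$(head -n2 "$workdir/demo.txt")

# Both forms print the same first two lines of the file.
echo "$with_space"
echo "$without_space"

# Tidy up the throwaway directory.
rm -r "$workdir"
```

If a program rejects one of these forms, its manpage is the place to check which spelling it expects.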

These issues can certainly be a pain and will invariably throw off a few learners. Now that you are aware they exist, you can at least avoid the frustration they cause the uninitiated. Some of the methods for getting help that I outline later will be particularly useful here, because they let you see how a command expects to be used.

Input and Output

Most BASh commands accept input and produce output, and they are generally flexible about both. Input can come from the standard input stream or stdin (usually your keyboard), from files, or from the output of other commands. Output is usually written to the standard output stream or stdout (usually your shell window or screen), written to files, or passed on as the input to other commands.

When we say a command “reads” something we are referring to stdin unless we say otherwise. Similarly, when we say a command “writes” or “prints,” we mean to stdout.

Error messages have their own output stream, the standard error stream or stderr. By default, stderr is usually displayed in the same place as stdout (your shell window), but the two streams are kept separate and can be redirected independently.

A frequently important aspect of BASh scripting, certainly in my job as a bioinformatician running large-scale BASh pipelines, is to correctly capture and direct stdin, stdout, and stderr to and from files or pipes. A basic understanding of these data streams is therefore important when working with BASh.
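As a rough preview of what capturing and directing these streams looks like, here is a small sketch. The file names (out.txt, listing.txt, errors.txt) are hypothetical, and the redirection operators shown (|, >, <, and 2>) are standard shell syntax that has not yet been introduced in this series.

```shell
# A sketch of the three standard streams; all file names here are made up.
workdir=$(mktemp -d)

# stdin from a pipe: printf writes to stdout, and wc reads that from stdin.
line_count=$(printf 'a\nb\nc\n' | wc -l)

# stdout redirected to a file with ">".
printf 'hello\n' > "$workdir/out.txt"

# stdin redirected from a file with "<".
word_count=$(wc -w < "$workdir/out.txt")

# stdout and stderr sent to separate files with ">" and "2>".
# Listing a path that does not exist makes ls write an error message to stderr;
# "|| true" just acknowledges that ls is expected to fail here.
ls "$workdir/no_such_file" > "$workdir/listing.txt" 2> "$workdir/errors.txt" || true

echo "lines:" $line_count "words:" $word_count
echo "captured error: $(cat "$workdir/errors.txt")"
```

Because the error message went to stderr, listing.txt ends up empty while errors.txt holds the complaint from ls.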

The Filesystem

When using a graphical user interface (GUI), the location and organisation of files and directories (folders) is obvious as they are displayed and explored on-screen. When using a command-line interface (CLI) this isn’t always the case, particularly when there is no GUI available with which to view files and directories, such as when using cloud-based or high-performance computing infrastructure.

A basic understanding of filesystem structure on Unix-like operating systems (e.g., Linux or MacOS) is important so that you can keep track of where you are when moving around and using the command line interactively. It is also important when scripting so that you can correctly instruct your programs to read input and direct output.

Examples of filesystems (partial) on Linux and MacOS

Although their filesystems differ somewhat, files on Linux or MacOS are both collected into directories which form a hierarchy or tree: one directory may contain other directories, called subdirectories, which can contain other files and subdirectories, which can contain other files and subdirectories…ad infinitum.

On Unix-like operating systems, the topmost directory is called the root directory, which is denoted by a slash. All files and directories descend from the root, unlike DOS or Windows, on which devices are accessed by drive letters.

The syntax for referring to files and directories is based on a series of names and slashes called a path. Consider the following example:

/Users/Lewis/Documents/example_file.txt

Looking at this, you should now hopefully be able to see that the path first refers to the root directory “/”, then the “Users” directory, which contains a directory called “Lewis”, which in turn contains a directory called “Documents”, and finally inside this is the file “example_file.txt”. Any such path that begins with a slash, descending all the way from the root directory to the destination, is called an absolute path.

Paths don’t have to be absolute; they can be relative to a directory other than the root. Any time you refer to a path that doesn’t begin with a slash it’s called a relative path. To make sense of a relative path, you need to know where you are in the filesystem, known as your current working directory. Unless told otherwise, any shell commands you run will operate relative to your current working directory.

There are two special relative paths, denoted “.” (a single period) and “..” (two consecutive periods). The former means your current working directory, and the latter means its parent directory, one level above.

Using an example from the Linux section of the filesystem diagram above: if my current working directory was “/home/Sean/Documents”, then “.” would refer to that directory, while “..” would refer to “/home/Sean”. If I wanted to refer to the “bin” directory within “usr” and my current working directory remained “/home/Sean/Documents”, then I would refer to it with “../../../usr/bin”: three levels up to reach the root directory, then down into “usr” and “bin”.
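One way to check a relative path like this is to build a throwaway copy of the tree and let the shell resolve it. The sketch below assumes a temporary directory standing in for the real root, and it borrows cd and pwd (commands covered later in this series) purely to show where each path lands.

```shell
# Build a throwaway copy of part of the Linux tree under a temporary "root".
root=$(mktemp -d)
mkdir -p "$root/home/Sean/Documents" "$root/usr/bin"

# Make Documents the current working directory.
cd "$root/home/Sean/Documents"

# "." is the current working directory; ".." is its parent.
here=$(cd . && pwd)        # ends in /home/Sean/Documents
parent=$(cd .. && pwd)     # ends in /home/Sean

# Three levels up: Documents -> Sean -> home -> (temporary) root,
# then down into usr and bin.
usr_bin=$(cd ../../../usr/bin && pwd)

echo "$here"
echo "$parent"
echo "$usr_bin"
```

Counting the directories between your current working directory and the root tells you how many “..” steps a path like this needs.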

We will cover the commands required to track your location, move around the filesystem, and explore directory contents later in this series.

Directory and File Names

As mentioned, the slash “/” character is used to refer to the root directory and to separate objects in a path. Therefore, slashes absolutely cannot be used in directory or filenames for (hopefully) obvious reasons.

There are a few characters other than the slash that have special meaning to the shell, including spaces, asterisks, dollar signs, parentheses, and a couple of others. Unlike the slash, these characters can theoretically be used in file names but should be avoided. For practical purposes, directory and file names should contain only capital or lowercase letters, numbers, periods, dashes, or underscores.
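To illustrate why special characters in names are best avoided, here is a minimal sketch using a space. The file names are invented for the example, and touch (which creates an empty file) and the if construct are standard tools that have not yet been introduced in this series.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A name containing a space must be quoted, otherwise the shell splits it
# into two separate arguments: "my" and "file.txt".
touch "my file.txt"

# Unquoted: ls is asked for two files that do not exist, so it fails.
if ls my file.txt > /dev/null 2>&1; then unquoted="found"; else unquoted="not found"; fi

# Quoted: ls receives a single argument and finds the file.
if ls "my file.txt" > /dev/null 2>&1; then quoted="found"; else quoted="not found"; fi

echo "unquoted: $unquoted, quoted: $quoted"

# A name built only from letters, numbers, periods, dashes, and underscores
# needs no quoting at all.
touch my_file.txt
ls my_file.txt
```

The quoted form works, but every command that touches the file has to remember the quotes, which is why the simpler naming scheme is the safer habit.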

Users and Home Directories

Unix-like operating systems such as Linux and MacOS are multi-user operating systems i.e., multiple people share the use of a single computer. On any given machine, each individual user is identified by a username and owns a (reasonably) private part of the system. There is also a special user named root or the superuser. While regular users can run most programs and modify the files that they own, the root user has the ability to do anything; they can create, modify, or delete any file and run any program.

The files belonging to ordinary users are usually located in the “/home” directory on Linux or the “/Users” directory on MacOS. Each user’s home directory is typically a subdirectory within one of these e.g., “/home/Sean” or “/Users/Timothy”.

There are two main ways to refer to your home directory:

1. The $HOME variable

The $HOME variable is an example of an environment variable, a topic we will cover in the next post. The variable $HOME contains the name of your home directory. Try running echo $HOME in your terminal. You should see the absolute path to your home directory printed in the console.

2. The lone tilde “~” symbol

When used in place of a directory, a lone tilde is expanded by the shell to the name of your home directory. Try running echo ~ in your terminal. You should again see the absolute path to your home directory printed in the console.

Both ways of referring to your home directory can be used to build paths e.g., “$HOME/Documents” or “~/Documents”. Here, both examples refer to the same “Documents” subdirectory within the home directory of the user, although the second method is more common.
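The two forms can be compared directly. This short sketch uses variable names invented for illustration, and it also shows one difference worth knowing: a tilde is only expanded when it is left unquoted.

```shell
# Both forms expand to the same absolute path.
from_variable="$HOME/Documents"
from_tilde=~/Documents

echo "$from_variable"
echo "$from_tilde"

# A tilde inside quotes is NOT expanded; it stays a literal "~" character.
quoted_tilde="~/Documents"
echo "$quoted_tilde"
```

The first two lines printed should match, while the third shows the unexpanded “~/Documents”. By contrast, $HOME is still expanded inside double quotes.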

We will touch on home directories again when we learn how to navigate and track our location in the filesystem.

Permissions

A discussion of multiuser systems wouldn’t be complete without touching on permissions. This topic is also important to shell scripting (e.g., when we want a program script to be executable so it can be run) and we will revisit it later in this series for exactly this reason.

Access control for directories and files is embodied by two questions:

1. Who has permission?

Files and directories all have an owner that has permission to do anything they wish with them; typically, this will be the user who created them, but ownership can be changed. Additionally, a predefined group of users may have permission to access a file or directory.

2. What kind of permission is granted?

The owner, a defined group, or everyone can have permission to read, write (modify), and execute (run) a file. Permissions also extend to directories, for which a user might be granted read access (list the files within the directory), write access (create and delete files within the directory), or execute access (enter the directory, for example by making it their current working directory).

Permission Denied

Ownership and permissions can be viewed using the ls -l command followed by the name or path to a file or directory. We will come back to ls in the next post. For now, running this command on a file in my current working directory called “example_file.txt” would result in the following output:

-rw-r--r-- 1 Lewis staff 932 Aug 2 12:00 example_file.txt

In any such output, the file permissions are the leftmost 10 characters: a string of dashes or letters, such as r (read), w (write), x (execute).

Position 1 indicates the type (e.g., “-” = file, “d” = directory), positions 2 to 4 indicate read, write, and execute permissions for the file’s owner, positions 5 to 7 indicate read, write, and execute permissions for the file’s group, and positions 8 to 10 indicate read, write, and execute permissions for other users.

With this information in mind, you should now hopefully be able to see that, in the above example, -rw-r--r-- means a file that can be read and written by the owner, read by the group, and read by any other user.
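The position-by-position reading can be reproduced on a throwaway file. This is a sketch under a few assumptions: the file is created in a temporary directory, chmod 644 (one of the permission-changing commands covered later in this series) is used only to force the same permissions as the example, and cut is a standard program used here to slice the string by character position.

```shell
workdir=$(mktemp -d)
touch "$workdir/example_file.txt"

# Force the permissions from the example: owner rw-, group r--, others r--.
chmod 644 "$workdir/example_file.txt"

# Keep only the leftmost 10 characters of the long listing.
perms=$(ls -l "$workdir/example_file.txt" | cut -c1-10)

# Slice the string into the four fields described above.
file_type=$(printf '%s\n' "$perms" | cut -c1)    # position 1
owner=$(printf '%s\n' "$perms" | cut -c2-4)      # positions 2 to 4
group=$(printf '%s\n' "$perms" | cut -c5-7)      # positions 5 to 7
others=$(printf '%s\n' "$perms" | cut -c8-10)    # positions 8 to 10

echo "type=$file_type owner=$owner group=$group others=$others"
# prints: type=- owner=rw- group=r-- others=r--
```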

There is a series of commands that modify the owner, group ownership, or permissions of a file that we will encounter later in our BASh voyage.

Getting Help

So far we have run into a few BASh commands and I have indicated a few issues that commands can throw up. To save pulling your hair out, it is always good to know where to look when your current knowledge of a command or the output it returns runs out. Below are the most useful resources I tend to use for getting help.

1. Run the man Command

The man command displays a manual page, or manpage, for a given program. You can access the manpage for a specific command by typing man followed by the program of interest.

# view the manpage for the "wc" program
man wc

To search the available manpages for a particular topic, use the -k option followed by a keyword argument e.g., “database”. In the command shown here I have piped the output into the less command to display the results one screen at a time (when using less you can press “space” to continue and “q” to quit).

# search all manpages for the word "database"
man -k database | less

2. Use the --help option

Many Linux commands respond to the option --help by printing a short help message, although some programs do not support it. If the output is longer than the screen, pipe it into less, as I showed previously, to display it one page at a time.

# view help for the "wc" program
wc --help

# paginate the help for the "wc" program
wc --help | less

3. Useful Websites

There are many websites that answer questions related to BASh and Linux-related issues, including:

This list is obviously not exhaustive and if you can’t find an answer on these sites, or just can’t be bothered to search around specific websites in the first place, try the next suggestion.

4. Search the Web

It is almost guaranteed that whatever issue you’re having has been encountered by someone before, so one of the fastest ways to get help is to do a quick web search. Similarly, a good way to decipher a specific error message is often to copy and paste it verbatim into a web search engine such as Google. This approach works well with any programming language, not just BASh, and can save a lot of time trawling through specific sites for the information you need.

Summary

That’s it for this post. We have covered several mostly theoretical concepts that, although admittedly unexciting, will form the basis for becoming a successful BASh user when we start writing BASh commands and scripts.

In the next post we will cover some powerful shell features and constructs including wildcards, brace expansion, shell variables, the search path, and aliases. We will also look at how to kill shell processes.

See you then.

. . . . .

Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.

Happy Data Analysis!

