EECS 201 - Lecture 5: Unix++

class: center, middle
# Week 5
---

# Announcements
* HW3, ADV3 due by 11:59 PM on Feb 10
* HW4, ADV4 due by 11:59 PM on Feb 17
---

# Random extras
* I'll be using a tool called `tmux` ("terminal multiplexer") that allows me to create multiple terminal windows inside of a terminal
  * An alternative terminal multiplexer is GNU Screen

---
class: center, middle
# Lecture 5: Unix++
### `:(){ :|:& };:`
#### Do __NOT__ run this
---

# Overview
* \*nix file descriptors
* Diving into Bash
* Regular expressions
---

# Some more \*nix
### Remember that in \*nix:
--

* Everything is a __file__
--

* A __file__ is a _stream of bytes_
--

* Each utility is narrow in scope and does its job well
--

* Utilities can be strung together to perform more complex tasks by tying their outputs and inputs together with a pipe (`|`)
--

* `$ command1 | command2`
  * command1's output will go directly to command2's input
---

## What is a file?
* To \*nix processes, __files__ are visible as __file descriptors__
--

* Each \*nix process has a __file descriptor table__ containing handles to various resources
  * Such resources could be: an "actual" data file living on disk, a virtual file representing OS info, a network socket, a terminal input, a terminal output, another program's input etc.
--

* __File descriptors__ are integers that index into this __file descriptor table__
--

* When a (terminal) shell creates a process, it sets the terminal's input and output as the process's input and output
--

* Reminder:
  * fd 0: `stdin`, `cin`
  * fd 1: `stdout`, `cout`
  * fd 2: `stderr`, `cerr`
--

* Some related POSIX C API functions (__not to be confused with C standard library functions!__):
  * `open()`, analogous to `fopen()`
  * `close()`, analogous to `fclose()`
  * `read()`, analogous to `fread()`
  * `write()`, analogous to `fwrite()`
  * `dup2()`
  * `pipe()`
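
The shell can edit its own file descriptor table with `exec`. A minimal sketch, assuming a scratch file from `mktemp` (fd 3 and the variable names are arbitrary choices for illustration):

```bash
# Scratch file; mktemp picks the name
scratch=$(mktemp)

exec 3> "$scratch"      # fd 3 now refers to the scratch file (roughly an open())
echo "first line" >&3   # echo's output is written to fd 3 instead of fd 1
echo "second line" >&3
exec 3>&-               # fd 3 is closed and removed from the table

fd_contents=$(cat "$scratch")
echo "$fd_contents"
rm -f "$scratch"
```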
---

## File redirection
#### Now that we have discussed file descriptors, we can look more in-depth into how we can manipulate these _streams of data_
#### Recall from the first \*nix lecture:
* `<`: set file as standard input (fd 0)
* `>`: set file as standard output, overwrite (fd 1)
* `>>`: set file as standard output, append (fd 1)
* `|`: connect output of one process to input of another (command 1's fd 1 -> command 2's fd 0)
--

#### Let's look at them in a more general form (brackets mean optional):
* `[n]<`: set file as an input for fd _n_ (fd 0 if unspecified)
  * "input" means that the process can `read()` from this fd
* `[n]>`: set file as an output for fd _n_ (fd 1 if unspecified)
  * "output" means that the process can `write()` to this fd
  * `2>`: capture `stderr` to a file
* `[n]>>`: set file as an output for fd _n_, append mode (fd 1 if unspecified)
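
The general forms above can be tried out as follows. This is only a sketch: `emit` is a hypothetical helper, and the files live in a scratch directory from `mktemp -d`.

```bash
workdir=$(mktemp -d)

# Hypothetical helper that writes one line to each output stream
emit() { echo "to stdout"; echo "to stderr" >&2; }

emit > "$workdir/out.txt" 2> "$workdir/err.txt"   # fd 1 and fd 2 go to separate files
emit >> "$workdir/out.txt" 2> /dev/null           # append stdout, discard stderr

out_lines=$(grep -c "" < "$workdir/out.txt")      # `<` makes the file grep's fd 0
err_text=$(cat "$workdir/err.txt")
echo "out.txt has $out_lines lines; err.txt says: $err_text"
rm -r "$workdir"
```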
---

### Advanced Bash file redirection
* `&>`: set file as fd 1 and fd 2, overwrite (`stdout` and `stderr` go to same file)
* `&>>`: set file as fd 1 and fd 2, append (`stdout` and `stderr` go to same file)
* `[n]<>`: set file as input and output on fd _n_ (fd 0 if unspecified)
* `[n]<&digit[-]`: copies fd _digit_ to fd _n_ (0 if unspecified) for input; `-` closes _digit_
* `[n]>&digit[-]`: copies fd _digit_ to fd _n_ (1 if unspecified) for output; `-` closes _digit_
* (there are a few more, such as Here Documents; refer to the manual)
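
A short sketch of `&>` and fd copying, again with a hypothetical `emit` helper. The last command illustrates that redirections are processed left to right, so their order matters.

```bash
# Hypothetical helper that writes one line to each output stream
emit() { echo "to stdout"; echo "to stderr" >&2; }

log=$(mktemp)

emit &> "$log"                      # Bash-only: fd 1 and fd 2 both go to the file
both_count=$(grep -c "" "$log")     # number of lines that landed in the file

emit > "$log" 2>&1                  # portable spelling of the same thing
portable_count=$(grep -c "" "$log")

# fd 2 is copied from fd 1 first (fd 1 is still the capture pipe here),
# and only afterwards is fd 1 pointed at /dev/null
captured=$(emit 2>&1 > /dev/null)

echo "$both_count $portable_count $captured"
rm -f "$log"
```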
---

### By piping and redirecting, we can put together larger and larger commands
```bash
# note: '\U' and friends are a GNU sed extension;
#       BSD sed might not have it
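wget https://eecs.umich.edu/courses/eecs201/files/text/lorem-ipsum.txt
# a trailing '|' continues the command onto the next line;
# a trailing '>' does not, so that line ends with '\'
cat lorem-ipsum.txt | sed -e 's/./\U&/g' |
  sed -e 's/[.,]//g' | sed -e 's/U/V/g' -e 's/J/I/g' \
  > LOREM-IPSVM.txt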
```
* Useful utilities
  * `cat`
  * `head`
  * `tail`
  * `cut`
  * `sed`
  * `awk`
---

# Diving into Bash
* Side note: `bash` != `sh`
* `bash`'s features are a superset of `sh`'s (kinda like a `vim`/`vi` relationship)
  * Again, this is confounded by some systems linking/aliasing `sh` to `bash`
* The horse's mouth:  [GNU Bash manual](https://www.gnu.org/software/bash/manual/)
  * If you like the nitty gritty details it's a great read
  * These slides summarize major features of Bash
* You may have stumbled upon these while working on HW2
--

## What Bash does
* Receives a command from a file or terminal input
* Splits it into tokens separated by __white-space__
  * Takes into account _"quoting"_ rules
* Expands/substitutes special tokens
* Performs file redirections (making sure they don't end up as command args)
* Executes the command
---

## Command grouping
* We discussed before that we can string commands together with `;`, `&&`, `||`
* We can also group commands together as a unit, with redirects staying local to them:
* `(commands)`: performs _commands_ in a "subshell" (another shell instance: this means that variable assignments won't be visible to the parent shell)
* `{ commands; }`: performs _commands_ in the calling shell instance
  * __Note__: There have to be spaces around the curly brackets and a semicolon (or newline or `&`) terminating the _commands_
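
The difference between the two grouping forms can be sketched like this (the variable names are arbitrary):

```bash
count=0
( count=1 )            # subshell: the assignment disappears with it
after_subshell=$count

{ count=2; }           # current shell: the assignment sticks
after_group=$count

# One redirect (here a pipe) applies to the group as a whole
grouped=$( { echo one; echo two; } | tr a-z A-Z )

echo "$after_subshell $after_group"
echo "$grouped"
```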
---

## Expansion and substitution
#### Bash has special characters that will indicate that it should _expand_ or _substitute_ to something in a command
--

### Variable expansion
* `$varname` will expand to the value of `varname`
* `${varname}`: you can use curly brackets to explicitly draw the boundaries on the variable name
  * `$ echo ${varname}somestring` vs `$ echo $varnamesomestring`
* __Note__: expansions/substitutions will be further split into individual tokens by their white-space
--

### Command substitution (via subshell)
* `$(command)` will be replaced by the output of the _command_ inside the parentheses
  * `$(echo hello | rev)` will be substituted with "olleh"
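
A small sketch of both forms. The variable names are made up for illustration, and `tr` stands in for `rev` so that the example only needs core utilities.

```bash
fruit=apple
plural="${fruit}s"       # braces end the name: $fruit followed by "s"
wrong="$fruits"          # a different (unset) variable named "fruits"

# The output of the pipeline replaces $(...)
shout=$(echo hello | tr a-z A-Z)

echo "plural=$plural wrong=$wrong shout=$shout"
```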
---

### Process substitution (ironically helpful on HW2)
* `<(command)` will substitute the _command_ output as a filepath, with the output of _command_ being __readable__
* `>(command)` will substitute the _command_ input as a filepath, with the input of _command_ being __writeable__
* `$ diff <(echo hello) <(echo olleh | rev)`
  * `diff` takes in two file names, but we're replacing them with command outputs
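
Another sketch along the same lines, assuming a system that provides `/dev/fd` (as Linux does). `comm` is a standard utility that compares two sorted files; the sample data is made up.

```bash
# Each <(...) is replaced by a path that the utility can open and read
same=no
diff <(printf 'a\nb\n') <(printf 'a\nb\n') > /dev/null && same=yes

# Lines only in the first list, with no temporary files needed
only_in_first=$(comm -23 <(printf 'b\na\nc\n' | sort) <(printf 'c\na\n' | sort))

echo "$same $only_in_first"
```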
--

### Arithmetic expansion
* `$((expr))` will expand to an evaluated arithmetic expression _expr_
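
For instance (note that the arithmetic is integer-only; the variable names are arbitrary):

```bash
a=7
b=2
sum=$((a + b))
quotient=$((a / b))      # integer division: the fractional part is dropped
remainder=$((a % b))
echo "$sum $quotient $remainder"
```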
--

#### But wait, what if I actually wanted to not expand a variable?
#### What if I didn't want a variable to be split by white-space?
#### What if I'm lazy and don't want to escape spaces?
---

### Quoting
* Allows you to keep certain characters literal (Bash won't expand them) and keep the quoted text together as one string
  * Common use case is to preserve spaces e.g. for filepaths that have spaces in them (spaces delimit tokens in a command)
--

* Single quotes (`'`) preserve all of the characters between them
  * `$ echo '$HOME'` will output `$HOME`
--

* Double quotes (`"`) preserve all characters except: `$`, `\`, and backtick
  * `$ ls "$HOME/Evil Directory With Spaces"` will list the contents of a directory `/home/jdoe/Evil Directory With Spaces`
  * __Variables expanded inside of double quotes retain their white-space__
--

* Note that when quoting, the quotes don't appear in the program's argument
  * `$ someutil 'imastring'`: `someutil`'s argv[1] will be `imastring`
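
The effect of each quoting style on tokenization can be sketched with a hypothetical helper, `count_args`, that reports how many arguments it received:

```bash
# Hypothetical helper: prints the number of arguments it was given
count_args() { echo $#; }

path="Evil Directory With Spaces"
unquoted=$(count_args $path)     # split on white-space into 4 tokens
double=$(count_args "$path")     # 1 token, variable expanded
single=$(count_args '$path')     # 1 token, no expansion at all
literal=$(echo '$path')          # prints the text $path itself

echo "$unquoted $double $single $literal"
```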
---

## Control flow
### `if-elif-else`
```bash
# brackets indicate optional parts
if test-commands; then
  commands
[elif more-test-commands; then
  more-commands]
[else
  alt-commands]
fi
```
* _test-commands_ is executed and its __return code__ is used as the condition
  * ___0___ = success = "true", everything else is "false"
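
As an illustration that any command can serve as the condition, here is a sketch that branches on `grep -q` (the sample string is made up):

```bash
haystack="alpha beta gamma"

# grep -q prints nothing; only its return code (0 if found) matters here
if echo "$haystack" | grep -q "beta"; then
  verdict="found beta"
elif echo "$haystack" | grep -q "delta"; then
  verdict="found delta"
else
  verdict="found neither"
fi
echo "$verdict"
```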
---

### Commands for conditionals
#### You can use any commands for conditions, but these constructs should be familiar:
* `test expr`: `test` command
  * Shorthand: `[ expr ]` (remember your spaces! `[` is technically a utility name)
  * `test $a -eq $b`
  * `[ $a -eq $b ]`
--

* `[[ expr ]]`: Bash conditional
  * Richer set of operators: `==`, `=`, `!=`, `<`, `>`, among others
  * __Note__: The symbol operators above operate on strings,
    thus the `<` and `>` operators do lexicographic (i.e. dictionary) comparison;
    "100" is lexicographically less than "2" since for the first characters "1" comes before "2"
  * Use specific arithmetic binary operators (_a la_ `test`) if you intend on comparing numeric values
  * `[[ $a == $b ]]`
  * `[[ $a < $b ]]`: this would evaluate to "true" if a=100, b=2
--

* `(( expr ))`: Bash arithmetic conditional
  * Evaluates as an arithmetic expression
  * `(( $a < $b ))`: this would evaluate to "false" if a=100, b=2
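
The three constructs side by side, as a sketch using the same values as above:

```bash
a=100
b=2

# test / [ ]: -lt compares numerically
if [ "$a" -lt "$b" ]; then test_result=true; else test_result=false; fi
# [[ ]]: < compares strings lexicographically
if [[ $a < $b ]]; then string_result=true; else string_result=false; fi
# (( )): < compares numerically
if (( a < b )); then arith_result=true; else arith_result=false; fi

echo "[ -lt ]: $test_result, [[ < ]]: $string_result, (( < )): $arith_result"
```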
---

### `while`
```bash
while test-commands; do
  commands
done
```
* Similarly to `if`, the return code of _test-commands_ is used as the conditional
* Repeats _commands_ until the condition __fails__
--

### `until`
```bash
until test-commands; do
  commands
done
```
* Repeats _commands_ until the condition __succeeds__
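
A minimal sketch of both loops counting to three (the counter names are arbitrary):

```bash
i=0
while [ "$i" -lt 3 ]; do    # keeps going while the test succeeds
  i=$((i + 1))
done

j=0
until [ "$j" -ge 3 ]; do    # keeps going while the test fails
  j=$((j + 1))
done
echo "$i $j"
```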
---

### `for`
```bash
for var in list; do
  commands
done
```
* _list_ will be __expanded__ and on each iteration _var_ will be set to the next member of the list
* __Note__: if there is no `in list`, it will implicitly iterate over the argument list (i.e. `$@`)
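
Both forms, sketched with a hypothetical `join_args` function to show the implicit iteration over arguments:

```bash
total=0
for n in 1 2 3 4; do
  total=$((total + n))
done

# Hypothetical function: with no "in list", for walks the arguments
join_args() {
  joined=""
  for arg; do
    joined="$joined[$arg]"
  done
  echo "$joined"
}
joined_out=$(join_args a "b c" d)

echo "$total $joined_out"
```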
---

## Functions
```bash
func-name () compound-command
# or
function func-name [()] compound-command # [] for optional parens
```
* A __compound command__ is a __command group__ (`()`, `{}`) or a control flow element (`if-elif-else`, `for`)
* Called by invoking them like any other utility, including passing arguments
  * Arguments can be accessed via `$n`, where _n_ is the argument number
  * `$@`: list of arguments
  * `$#`: number of arguments
---

### Examples
```bash
hello-world ()
{
  if echo "Hello world!"; then
    echo "This should print"
  fi
}
# calling
hello-world
```

```bash
function touch-dir for x in $(ls); do touch $x; done
# calling
touch-dir
```
---

```bash
echo-args ()
{
  for x in $@; do
    echo $x
  done
}
# calling
echo-args a b c d e f g
```

```bash
divide ()
{
  if (( $2 == 0 )); then
    echo "Error: divide by zero" 1>&2
    # the redirection makes fd 1 (stdout) a copy of
    # fd 2 (stderr), so when echo writes to its
    # stdout, the text really goes to stderr
  else
    echo $(($1 / $2))
  fi
}
# calling
divide 10 2
divide 10 0
```
---

## Scripts
* As was mentioned a few weeks ago, it's annoying to have to type things/go to the history to repeatedly run some commands
* Scripts are just plain-text files with commands in them
* __There's no special syntax for scripts: if you can enter the commands line by line at the terminal, they will work in a script__
* You can treat it as a simple programming language
--

* First line specifies the interpreter ("shebang")
  * `#!/bin/bash`
  * `#!/usr/bin/env bash`
--

* Arguments work like that of functions:
  * `$n` __Note__: $0 will refer to the script's name, as per \*nix program argument convention
  * `$@`
  * `$#`
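
A sketch that writes a tiny script to a scratch file and runs it with arguments. The file comes from `mktemp`, and the script body is made up for illustration.

```bash
script=$(mktemp)

# Quoting 'EOF' keeps $# and $1 from being expanded while writing the file
cat > "$script" <<'EOF'
#!/usr/bin/env bash
echo "got $# arguments; the first is $1"
EOF

script_out=$(bash "$script" one two three)
echo "$script_out"
rm -f "$script"
```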
---

### Reiterating _running_ vs _sourcing_
* _Running_ (executing) a script puts it into its own shell instance; variables set _won't_ be visible to the parent shell
  * `./script.sh`
  * `bash script.sh`
* _Sourcing_ a script makes your current shell instance run each command in it; variables set _will_ be visible
  * `source script.sh`
  * `. script.sh`
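
The difference can be observed with a one-line script, sketched here with a scratch file and a made-up variable name (`greeting`):

```bash
script=$(mktemp)
echo 'greeting="set by script"' > "$script"

unset greeting
bash "$script"                    # runs in its own shell instance
after_run=${greeting:-unset}      # still unset in this shell

. "$script"                       # sourced into this shell
after_source=${greeting:-unset}

echo "after running: $after_run; after sourcing: $after_source"
rm -f "$script"
```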

---

# Regular expressions (regexes)
* A pattern that matches a set of strings
* Provide a (relatively) standardized way to perform matches on text
--

* Important to know as ___many___ tools and utilities make use of them
  * `grep`, `sed`, `find` to name a scant few
--

* Lots of different flavors, but they all encapsulate similar ideas
--

* You provide a __pattern__ that is matched on the text
* The __pattern__ can be a simple unassuming string or contain special characters that perform more powerful matching
--

* For this lecture, we'll be looking at POSIX BRE (basic regex) and ERE (extended regex)
  * `grep` is a utility that searches for patterns in a file via regexes
  * Defaults to BRE; `-E` flag (or `egrep`) for ERE
  * `ls /dev | grep tty`: list `/dev` directory, filtering by things that contain "tty"
--

### Resources
* Online regex tester: https://regex101.com/ (one among many)
* [GNU `grep`'s manual on regular expressions](https://www.gnu.org/software/grep/manual/grep.html#Regular-Expressions)
* Highly detailed website: https://www.regular-expressions.info/
---

## Regex basics
* Patterns are composed of smaller regexes that are concatenated
* The atomic regexes are those that match single characters
* The alphanumeric characters (A-Z, a-z, 0-9) act like normal characters
  * `hello` is a simple pattern that matches strings that contain "hello"
--

* There are also special functions denoted by special characters
  * `.` for any single character
  * `|` for an OR
  * `\` for special expressions/escapes
  * Quantifiers: how many to match
  * Brackets: a set of characters to match
  * Anchors: for _positional_ matching
  * Backreferences: for matching a previous match
  * `^tty[0-9]+$` is a less simple pattern that matches lines that consist only of "tty" followed by one or more numeric digits
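
To see the two patterns from this slide side by side, here is a sketch that runs `grep` over a made-up list of device names (`-c` counts matching lines):

```bash
devices=$(printf 'tty\ntty0\ntty12\nttyS0\nptmx\n')

# Plain string: any line that contains "tty"
contains=$(echo "$devices" | grep -c 'tty')

# Anchored ERE: only "tty" followed by digits and nothing else
exact=$(echo "$devices" | grep -E -c '^tty[0-9]+$')

echo "$contains $exact"   # 4 2
```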
---

### Misc special characters
* `.` matches _any_ single character
  * `...` matches strings containing three characters
--

* `|` for an OR between regexes
  * `hello|world` matches a string containing "hello" or "world"
--

* `\` for special expressions/escapes
  * `\b` matches empty string at the edge of a word
  * There's more: check the GNU `grep` manual for the rest
--

* `(`, `)` enclose a whole expression as a _subexpression_
  * `(Hello|Goodbye) (Brandon|Jiwon)` matches:
  * "Hello Brandon"
  * "Hello Jiwon"
  * "Goodbye Brandon"
  * "Goodbye Jiwon"
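
A quick check of alternation, grouping, and `\b` with `grep`. This is a sketch over made-up input, and `\b` is assumed to be available as in GNU `grep`.

```bash
greetings=$(printf 'Hello Brandon\nGoodbye Jiwon\nHello world\nHi Brandon\n')

# ERE: ( ) groups and | alternates within each group
matched=$(echo "$greetings" | grep -E '(Hello|Goodbye) (Brandon|Jiwon)')
echo "$matched"

# \b keeps "cat" from matching inside "concatenate"
whole_word=$(printf 'cat\nconcatenate\n' | grep -c '\bcat\b')
echo "$whole_word"
```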
---

### Quantifiers
* Specify how many of a preceding regex to match
* `?`: ≤1 time
* `*`: ≥0 times
* `+`: ≥1 times
* `{n}`: _n_ times
* `{n,}`: ≥_n_ times
* `{,m}`: ≤_m_ times
* `{n,m}`: ≥_n_ and ≤_m_ times
--

#### Examples
* `a{4}`: matches "aaaa"
* `ba+`: matches "ba", "baa", "baaa"...
* `(hello){3}`: matches "hellohellohello"
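
The quantifiers can be compared on a small made-up sample. The patterns are anchored with `^` and `$` so that each must match a whole line, and `grep -E -c` counts the matching lines.

```bash
samples=$(printf 'b\nba\nbaa\nbaaaa\n')

plus=$(echo "$samples" | grep -E -c '^ba+$')          # ba, baa, baaaa
star=$(echo "$samples" | grep -E -c '^ba*$')          # all four lines
optional=$(echo "$samples" | grep -E -c '^ba?$')      # b, ba
bounded=$(echo "$samples" | grep -E -c '^ba{2,4}$')   # baa, baaaa

echo "$plus $star $optional $bounded"   # 3 4 2 2
```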
---

### Brackets
* `[`, `]` enclose a set to match for one character
  * `[abc]` matches 'a', 'b', or 'c'

#### Special things you can put inside them:
* `-`: range
  * `[A-Za-z0-9]`: uppercase and lowercase letters and digits
* `^`: not in set
  * `[^ab]`: everything not 'a' or 'b'
* Named classes
  * `[:alnum:]`: alphanumeric characters
  * `[:alpha:]`: alphabetic characters
  * `[:digit:]`: digit characters
  * `[:blank:]`: space and tab characters
  * ...and others (see the GNU `grep` manual)
  * Brackets are part of the class name: e.g. `[[:alnum:]]` to match alphanumerics
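
A sketch of ranges, negation, and named classes against a made-up sample:

```bash
samples=$(printf 'abc\n123\nab1\nx-y\n')

letters_only=$(echo "$samples" | grep -E -c '^[a-z]+$')        # abc
no_digits=$(echo "$samples" | grep -E -c '^[^0-9]+$')          # abc, x-y
alnum_only=$(echo "$samples" | grep -E -c '^[[:alnum:]]+$')    # abc, 123, ab1
has_digit=$(echo "$samples" | grep -E -c '[[:digit:]]')        # 123, ab1

echo "$letters_only $no_digits $alnum_only $has_digit"   # 1 2 3 2
```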
---

### Anchors
* Perform _positional_ matching
* `^`: match empty string at the beginning of a line
  * i.e. following regex must be at the beginning
  * `^hello`: "hello" must be at the beginning
--

* `$`: match empty string at the end of a line
  * i.e. preceding regex must be at the end
  * `world$`: "world" must be at the end
--

* `^hello world$`: entire string must be "hello world"
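
The three anchored patterns, checked against a few made-up lines:

```bash
lines=$(printf 'hello world\nhello world!\nsay hello world\nhello\n')

starts=$(echo "$lines" | grep -c '^hello')         # lines 1, 2, and 4
ends=$(echo "$lines" | grep -c 'world$')           # lines 1 and 3
whole=$(echo "$lines" | grep -c '^hello world$')   # line 1 only

echo "$starts $ends $whole"   # 3 2 1
```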
---

### Backreferences
* Match previous parenthesized `()` subexpression
* `\n`: match _n_ th parenthesized subexpression
  * `(123)testing\1` matches "123testing123"
--

#### Q: `<([[:alpha:]][[:alnum:]]*)>.*</\1>`
--

* Match (simple) HTML/XML tags
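
Backreferences can be tried with `grep -E`. This sketch assumes GNU `grep`, which accepts backreferences in ERE mode. The tag pattern is a simplified variant chosen for illustration: it matches only tags without attributes.

```bash
# \1 must repeat exactly what the first ( ) group matched
repeat=$(printf '123testing123\n123testing456\n' | grep -E -c '(123)testing\1')

# The closing tag has to carry the same name as the opening tag
tags=$(printf '<b>bold</b>\n<b>bold</i>\n<h1>Title</h1>\n' |
  grep -E -c '<([[:alpha:]][[:alnum:]]*)>.*</\1>')

echo "$repeat $tags"   # 1 2
```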
---

## Caveats
* GNU `grep` defaults to BRE flavor
  * Use `-E` flag or use `egrep` for ERE flavor
  * In ERE mode, use `[{]` to match a literal '{' for portability
* Other flavors may require escaping certain characters

### BRE vs ERE
* In BRE, `?`, `+`, `{`, `|`, `(`, and `)` must be escaped with `\` to get their special meaning; unescaped, they match literally
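
The flip in meaning can be observed with `+`. This is a sketch over made-up input that assumes GNU `grep`, where `\+` is accepted in BRE mode.

```bash
samples=$(printf 'ba\nbaa\nba+\n')

bre_plain=$(echo "$samples" | grep -c 'ba+')        # BRE: + is a literal plus sign
bre_escaped=$(echo "$samples" | grep -c 'ba\+')     # BRE: \+ is the quantifier
ere_plain=$(echo "$samples" | grep -E -c 'ba+')     # ERE: + is the quantifier
ere_escaped=$(echo "$samples" | grep -E -c 'ba\+')  # ERE: \+ is a literal plus sign

echo "$bre_plain $bre_escaped $ere_plain $ere_escaped"   # 1 3 3 1
```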
---

class: center, middle
# Questions?