Text Processing and File Manipulation

Viewing and Editing Text Files: nano, vim Basics
Linux stores almost everything as plain text
use CLI tools to open and edit files directly
two main tools for terminal text work are nano and vim

The Beginner's Friend: nano
simple, intuitive, safe
behaves like a standard notepad
to open or create a file
nano <filename>
opens a terminal-based UI
uses keyboard exclusively for navigation and entry
menu at the bottom of screen
the caret (^) represents the Ctrl key
  • ^O (Ctrl+O) - save the file
  • ^X (Ctrl+X) - exit the app
  • ^W (Ctrl+W) - search ("Where Is")
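a quick end-to-end sketch (notes.txt is just an example name)
nano notes.txt
type your text, press ^O then Enter to save, then ^X to exit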

The Professional's Tool: vim
vim stands for Vi IMproved
difficult to learn but is incredibly powerful
has two main modes
  • Normal Mode - every keystroke is a command
  • Insert Mode - keystrokes type text

The vim Survival Guide
  1. open a file
    vim file.txt
  2. enter Insert mode - press the i key
  3. return to Normal mode - press Esc key
  4. save and quit - in Normal mode type :wq and press Enter
    • : - starts a command
    • w - write (save)
    • q - quit
  5. quit without saving - in Normal mode type :q! and press Enter
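a complete edit in one pass (config.txt is an illustrative name)
vim config.txt
press i, make the changes, press Esc, type :wq, press Enter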

Searching Within Files: grep and Regular Expressions
grep - Global Regular Expression Print
searches for a pattern and prints every line containing a match
syntax
grep "<search term>" <file>
look for the user 'root' in the password file
grep "root" /etc/passwd
grep is case-sensitive
to match both 'error' and 'Error' use the case-insensitive (-i) flag
grep -i "error" /var/log/syslog
for a recursive search use the -r flag
grep -r "<search term>" /<directory path>
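for example, combining -r with -i (the directory path is only illustrative)
grep -ri "error" /var/log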
Introduction to Regular Expressions (Regex)
special characters defining a basic search pattern
  • ^ (The Anchor) - matches the start of line
    to find lines which start with "root"
    grep "^root" /etc/passwd
  • $ (The Tail) - matches the end of the line
    to find lines which end with "false"
    grep "false$" /etc/passwd
  • . (The Wildcard) - matches any single character
    grep "b.t" file.txt
    matches "bat", "bet", "bit", "bot", etc.
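anchors can be combined; for example, ^ immediately followed by $ finds empty lines
grep "^$" file.txt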

Text Stream Processing: cut, sort, uniq, wc
Word Count: wc
wc counts lines, words, and bytes
wc /etc/passwd
output
45 130 2500 /etc/passwd
45 lines, 130 words, and 2500 bytes
to just get a line count use the -l flag
wc -l /etc/passwd
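wc also counts piped input; for example, to count the entries in the current directory
ls | wc -l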
Sorting: sort
the sort command rearranges lines of a file alphabetically or numerically
sort names.txt
to sort a list of numbers use the numeric sort -n flag
to reverse the sort order use the -r flag
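for example (numbers.txt is an illustrative file with one number per line)
sort -n numbers.txt
sort -nr numbers.txt
the second form lists the largest numbers first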

Removing Duplicates with uniq
uniq can filter duplicates only if they are adjacent
almost always run sort before uniq
sort names.txt | uniq
to know how many times each line appeared use -c flag
sort names.txt | uniq -c
Slicing with cut
cut can extract specific columns from a file
example line from file where data is separated by colons (:)
root:x:0:0:root:/root:/bin/bash
syntax
cut -d <delimiter> -f <field number(s)> <filename>
to get a list of users from the user field
cut -d : -f 1 /etc/passwd
Note: field numbers start at 1, not 0
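-f also accepts comma-separated lists; for example, to print each user and their shell (fields 1 and 7)
cut -d : -f 1,7 /etc/passwd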

Comparing Files: diff and cmp
diff command
diff compares two files line by line and outputs the changes needed to make file A look like file B
diff file1.txt file2.txt
output uses < and >
  • < - lines present in file 1 but not file 2
  • > - lines present in file 2 but not file 1
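an illustrative run, assuming the two files differ only on line 2
diff file1.txt file2.txt
2c2
< banana
---
> cherry
2c2 means line 2 changed; "banana" is the file 1 version, "cherry" the file 2 version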
cmp command
compares files byte by byte
mostly used to check binary files
if the files are identical the command exits silently
if they differ, cmp reports the byte (and line) number of the first difference
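example (filenames illustrative; the exact wording varies by version)
cmp photo1.jpg photo2.jpg
photo1.jpg photo2.jpg differ: byte 1032, line 5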

Transforming Text: tr, sed Basics, awk Introduction
Translating Characters with tr
tr is used to swap or delete individual characters
is strictly a stream tool (reads standard input only; it cannot take a filename argument)
converting lowercase to uppercase
echo "hello world" | tr "a-z" "A-Z"
output
HELLO WORLD
deleting specific characters
echo "Hello 123" | tr -d "0-9"
output
Hello
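tr can also squeeze runs of repeated characters with the -s flag
echo "too    many    spaces" | tr -s " "
output
too many spaces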
The Stream Editor: sed
sed is a programmable text editor
it modifies text as the data flows through it, without opening an editor
most common use is substitution (Find and Replace)
syntax
s/old_word/new_word/g 
command
echo "I love Windows" | sed 's/Windows/Linux/g'
parts of the sed expression
  • s - substitute command
  • Windows - search term
  • Linux - replacement
  • g - replace all occurrences in the line

sed can be used to edit a file in place using -i flag
sed -i 's/false/true/g' config.txt
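sed can also delete matching lines with the d command; a small sketch using the same illustrative config.txt
sed '/^#/d' config.txt
prints the file with comment lines (those starting with #) removed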
The Powerhouse: awk
awk is actually a full programming language disguised as a command
is incredibly powerful for processing data organized in columns
unlike cut, which requires a single-character delimiter
awk splits on any run of whitespace by default
example file
John Manager 50000
Sarah Engineer 60000
to print the name and salary columns (columns are numbered starting at 1)
awk '{print $1, $3}' employees.txt
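awk can also take an explicit delimiter with the -F flag, covering the same ground as cut
awk -F: '{print $1}' /etc/passwd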

Redirecting Output: >, >>, <
redirection sends command output, which normally goes to the screen (stdout), to a file instead

Overwrite (>)
sends output to a file
if the file already exists, its contents are erased and replaced
ls > filelist.txt
Append (>>)
appends output to the end of an existing file (or creates it if it does not exist)
date >> log.txt
Input Redirection (<)
feeds a file to a command
sort < names.txt
the sorted output still goes to stdout (the screen)

Standard Error
there is a third stream, Standard Error (stderr)
if an error occurs the error message goes to stderr and not stdout
to redirect errors
ls 2> errors.txt
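to capture stdout and stderr together in one file
ls > all.txt 2>&1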

Piping Commands: The Power of |
a pipe takes the output of one command and uses it as input to another command
like an assembly line
Raw Data -> [Machine 1] -> Semi-finished Data -> [Machine 2] -> Finished Product
no need to save intermediate files
to page through a long list
ls -l | less
logic example
grep "error" /var/log/syslog | wc -l
  1. grep extracts all lines containing 'error'
  2. pipe passes the lines to wc -l
  3. wc -l counts the number of lines

Combining Commands for Powerful Text Processing Workflows
example web log format
192.168.1.50 - - [Date] "GET /page.html" 200 ... 
10.0.0.1 - - [Date] "GET /index.html" 200 ... 
192.168.1.50 - - [Date] "GET /image.jpg" 200 ...
want to find the top three IP addresses (the most frequent visitors)
  1. get a list of the IP addresses
    awk '{print $1}' access.log
  2. sort the list so it can be counted
    awk '{print $1}' access.log | sort
  3. count duplicates
    awk '{print $1}' access.log | sort | uniq -c
  4. the list is now sorted by IP address, not by count
    need to re-sort by count, largest to smallest
    awk '{print $1}' access.log | sort | uniq -c | sort -nr
  5. need only the top three
    use head
    awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -n 3
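with the three sample lines above, the result would look like
      2 192.168.1.50
      1 10.0.0.1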
Summary
covered
  • Editors - nano for quick edits, vim for power users
  • Search - grep
  • Streams - head, tail, cut, sort, uniq, wc
  • Transformation - tr and sed to replace and modify text
  • Piping - chaining commands together to create complex workflows

key points
  • nano: use Ctrl+O to save and Ctrl+X to exit
  • grep: searches for text. Use grep -r for recursive folder search
  • | (Pipe): sends the output of one command to the input of another
  • > vs >>: > overwrites a file, >> appends to it
  • sort | uniq: standard way to remove duplicates or count occurrences
  • sed: Use sed 's/old/new/g' to replace text
  • awk: Use awk '{print $1}' to extract the first column of data
