Text Processing and File Manipulation

Viewing and Editing Text Files: nano, vim Basics
Linux stores almost everything as plain text
use CLI tools to open and edit files directly
two main tools for terminal text work are nano and vim

The Beginner's Friend: nano
simple, intuitive, safe
behaves like a standard notepad
to open or create a file
nano <filename>
opens a terminal-based UI
uses keyboard exclusively for navigation and entry
menu at the bottom of screen
the caret (^) represents the Ctrl key
  • ^O (Ctrl+O) - save the file
  • ^X (Ctrl+X) - exit the app
  • ^W (Ctrl+W) - search ("Where Is")
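a quick end-to-end sketch (notes.txt is just an example name)
nano notes.txt
type your text, press ^O then Enter to save, then ^X to exit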

The Professional's Tool: vim
vim stands for Vi IMproved
difficult to learn but is incredibly powerful
has two main modes
  • Normal Mode - every keystroke is a command
  • Insert Mode - keystrokes type text

The vim Survival Guide
  1. open a file
    vim file.txt
  2. enter Insert mode - press the i key
  3. return to Normal mode - press Esc key
  4. save and quit - in Normal mode type :wq and press Enter
    • : - starts a command
    • w - write (save)
    • q - quit
  5. quit without saving - in Normal mode type :q! and press Enter
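a complete edit in one pass (config.txt is an illustrative name)
vim config.txt
press i, make the changes, press Esc, type :wq, press Enter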

Searching Within Files: grep and Regular Expressions
grep - Global Regular Expression Print
searches for a pattern and prints every line containing a match
syntax
grep "<search term>" <file>
look for the user 'root' in the password file
grep "root" /etc/passwd
grep is case-sensitive
to match both 'error' and 'Error' use the case-insensitive (-i) flag
grep -i "error" /var/log/syslog
for a recursive search use the -r flag
grep -r "<search term>" /<directory path>
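for example, combining -r with -i (the directory path is only illustrative)
grep -ri "error" /var/log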
Introduction to Regular Expressions (Regex)
special characters defining a basic search pattern
  • ^ (The Anchor) - matches the start of line
    to find lines which start with "root"
    grep "^root" /etc/passwd
  • $ (The Tail) - matches the end of the line
    to find lines which end with "false"
    grep "false$" /etc/passwd
  • . (The Wildcard) - matches any single character
    grep "b.t" file.txt
    matches "bat", "bet", "bit", "bot", etc.
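anchors can be combined; for example, ^ immediately followed by $ finds empty lines
grep "^$" file.txt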

Text Stream Processing: cut, sort, uniq, wc
Word Count: wc
wc counts lines, words, and bytes
wc /etc/passwd
output
45 130 2500 /etc/passwd
45 lines, 130 words, and 2500 bytes
to just get a line count use the -l flag
wc -l /etc/passwd
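wc also counts piped input; for example, to count the entries in the current directory
ls | wc -l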
Sorting: sort
the sort command rearranges lines of a file alphabetically or numerically
sort names.txt
to sort a list of numbers use the numeric sort -n flag
to reverse the sort order use the -r flag
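for example (numbers.txt is an illustrative file with one number per line)
sort -n numbers.txt
sort -nr numbers.txt
the second form lists the largest numbers first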

Removing Duplicates with uniq
uniq can filter duplicates only if they are adjacent
almost always run sort before uniq
sort names.txt | uniq
to know how many times each line appeared use -c flag
sort names.txt | uniq -c
Slicing with cut
cut can extract specific columns from a file
example line from file where data is separated by colons (:)
root:x:0:0:root:/root:/bin/bash
syntax
cut -d <delimiter> -f <field number(s)> <filename>
to get a list of users from the user field
cut -d : -f 1 /etc/passwd
Note: field numbers start at 1, not 0
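-f also accepts comma-separated lists; for example, to print each user and their shell (fields 1 and 7)
cut -d : -f 1,7 /etc/passwd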

Comparing Files: diff and cmp
diff command
diff compares two files line by line and outputs the changes needed to make file A look like file B
diff file1.txt file2.txt
output uses < and >
  • < - lines present in file 1 but not file 2
  • > - lines present in file 2 but not file 1
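an illustrative run, assuming the two files differ only on line 2
diff file1.txt file2.txt
2c2
< banana
---
> cherry
2c2 means line 2 changed; "banana" is the file 1 version, "cherry" the file 2 version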
cmp command
compares files byte by byte
mostly used to check binary files
if the files are identical the command exits silently
if they differ, cmp reports the byte (and line) number of the first difference
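example (filenames illustrative; the exact wording varies by version)
cmp photo1.jpg photo2.jpg
photo1.jpg photo2.jpg differ: byte 1032, line 5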

Transforming Text: tr, sed Basics, awk Introduction
Translating Characters with tr
tr is used to swap or delete individual characters
is strictly a stream tool (reads standard input only; it cannot take a filename argument)
converting lowercase to uppercase
echo "hello world" | tr "a-z" "A-Z"
output
HELLO WORLD
deleting specific characters
echo "Hello 123" | tr -d "0-9"
output
Hello
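tr can also squeeze runs of repeated characters with the -s flag
echo "too    many    spaces" | tr -s " "
output
too many spaces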
The Stream Editor: sed
sed is a programmable text editor
it modifies text as the data flows through it, without opening an editor
most common use is substitution (Find and Replace)
syntax
s/old_word/new_word/g 
command
echo "I love Windows" | sed 's/Windows/Linux/g'
parts of the sed expression
  • s - substitute command
  • Windows - search term
  • Linux - replacement
  • g - replace all occurrences in the line

sed can be used to edit a file in place using -i flag
sed -i 's/false/true/g' config.txt
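sed can also delete matching lines with the d command; a small sketch using the same illustrative config.txt
sed '/^#/d' config.txt
prints the file with comment lines (those starting with #) removed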
The Powerhouse: awk
awk is actually a full programming language disguised as a command
is incredibly powerful for processing data organized in columns
unlike cut, which requires a single-character delimiter
awk splits on any run of whitespace by default
example file
John Manager 50000
Sarah Engineer 60000
to print the name and salary columns (columns are numbered starting at 1)
awk '{print $1, $3}' employees.txt
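awk can also take an explicit delimiter with the -F flag, covering the same ground as cut
awk -F: '{print $1}' /etc/passwd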

Redirecting Output: >, >>, <
redirection sends command output, which normally goes to the screen (stdout), to a file instead

Overwrite (>)
sends output to a file
if the file already exists, its contents are erased and replaced
ls > filelist.txt
Append (>>)
appends output to the end of an existing file (or creates it if it does not exist)
date >> log.txt
Input Redirection (<)
feeds a file to a command
sort < names.txt
the sorted output still goes to stdout (the screen)

Standard Error
there is a third stream, Standard Error (stderr)
if an error occurs the error message goes to stderr and not stdout
to redirect errors
ls 2> errors.txt
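to capture stdout and stderr together in one file
ls > all.txt 2>&1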

Piping Commands: The Power of |
a pipe takes the output of one command and uses it as input to another command
like an assembly line
Raw Data -> [Machine 1] -> Semi-finished Data -> [Machine 2] -> Finished Product
no need to save intermediate files
to page through a long list
ls -l | less
logic example
grep "error" /var/log/syslog | wc -l
  1. grep extracts all lines containing 'error'
  2. pipe passes the lines to wc -l
  3. wc -l counts the number of lines

Combining Commands for Powerful Text Processing Workflows
example web log format
192.168.1.50 - - [Date] "GET /page.html" 200 ... 
10.0.0.1 - - [Date] "GET /index.html" 200 ... 
192.168.1.50 - - [Date] "GET /image.jpg" 200 ...
want to find the top three IP addresses (the most frequent visitors)
  1. get a list of the IP addresses
    awk '{print $1}' access.log
  2. sort the list so it can be counted
    awk '{print $1}' access.log | sort
  3. count duplicates
    awk '{print $1}' access.log | sort | uniq -c
  4. the list is now sorted by IP address, not by count
    need to re-sort by count, largest to smallest
    awk '{print $1}' access.log | sort | uniq -c | sort -nr
  5. need only the top three
    use head
    awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -n 3
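with the three sample lines above, the result would look like
      2 192.168.1.50
      1 10.0.0.1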
Summary
covered
  • Editors - nano for quick edits, vim for power users
  • Search - grep
  • Streams - head, tail, cut, sort, uniq, wc
  • Transformation - tr and sed to replace and modify text
  • Piping - chaining commands together to create complex workflows

key points
  • nano: use Ctrl+O to save and Ctrl+X to exit
  • grep: searches for text. Use grep -r for recursive folder search
  • | (Pipe): sends the output of one command to the input of another
  • > vs >>: > overwrites a file, >> appends to it
  • sort | uniq: standard way to remove duplicates or count occurrences
  • sed: Use sed 's/old/new/g' to replace text
  • awk: Use awk '{print $1}' to extract the first column of data
