Troubleshooting Common Issues

Developing a Troubleshooting Mindset
four steps
  1. Identify the Symptom - narrow down the suspect
  2. Isolate the Variable - change one thing at a time
  3. Check the Evidence - read the logs
  4. Test the Fix - once you believe the problem is resolved, reboot; does it stay fixed?

Boot Problems: GRUB and Recovery Mode
Scenario 1: The GRUB Rescue Prompt
grub rescue>
the bootloader is corrupted or cannot find the partition containing the Linux kernel
  • Cause - Windows updates can overwrite GRUB on a dual-boot system
  • Fix - boot from a Linux installer (live) USB stick
    open a terminal
    install a tool called boot-repair
    run boot-repair
    the app will scan the drive, find the Linux partition, and reinstall GRUB automatically
Scenario 2: Emergency Mode
see "Welcome to emergency mode!" and are asked for the root password
  • Cause - usually a filesystem error or bad entry in /etc/fstab
    if a new drive is added to fstab then unplugged, Linux will refuse to boot because a required disk is missing
  • Fix
    1. enter the root password
    2. remount the root filesystem read/write
      mount -o remount,rw /
    3. edit fstab
      nano /etc/fstab
    4. comment out the bad line (prefix it with #)
    5. save and reboot
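
The edit in steps 3 and 4 can also be scripted. The following is a minimal sketch, not a definitive tool: comment_fstab_entry is a hypothetical helper name, and it assumes the bad entry can be identified by its device (first field) or mount point (second field). It keeps a backup copy next to the file before changing anything.

```shell
# Comment out every active fstab entry whose device or mount point matches.
# Usage: comment_fstab_entry <fstab-file> <device-or-mountpoint>
# (hypothetical helper; a sketch for illustration)
comment_fstab_entry() {
    local file=$1 needle=$2
    # Keep a backup so the change can be undone.
    cp -- "$file" "$file.bak" || return 1
    awk -v needle="$needle" '
        # Skip lines that are already comments; disable matching entries.
        $0 !~ /^[[:space:]]*#/ && ($1 == needle || $2 == needle) { print "#" $0; next }
        { print }
    ' "$file.bak" > "$file"
}
```

From the emergency shell this would be called as comment_fstab_entry /etc/fstab /mnt/data, where /mnt/data is only an example mount point. For removable or non-essential drives, adding the nofail mount option to the entry is another way to keep a missing disk from blocking boot. On systems with a recent util-linux, findmnt --verify can check the edited file before rebooting.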

Scenario 3: Resetting a Lost Root Password
forgot the password and can't log in
  1. reboot
  2. at the GRUB menu press e to edit the boot options
  3. find line starting with linux
  4. at the end of line add
     init=/bin/bash
  5. press Ctrl+X or F10 to boot with the edited entry
  6. will drop into a root shell without a password
  7. remount the root filesystem read/write
    mount -o remount,rw /
  8. change password
    passwd <username>
  9. continue booting normally
    exec /sbin/init

System Performance Issues: Identifying Bottlenecks
system is running slow
  1. Check Load - run uptime
    compare load average to number of cores
  2. Check CPU - run top
    is a process hogging the CPU?
    yes - kill the runaway process
    no - CPU is idle but the system is still slow
    check I/O
  3. Check I/O - run top or iostat
    check the wa (I/O wait) percentage
    if wa is high, the CPU is waiting on the disk
    run iotop to find the culprit
  4. Check RAM - run free -h
    if swap usage is high and free RAM is near zero, the system is thrashing
    close some apps or buy more RAM
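
Step 1 (compare load average to number of cores) can be sketched as a small check. This is an illustration under assumptions: load_status is a hypothetical helper name, and "overloaded" simply means the 1-minute load average exceeds the core count, which is a rule of thumb rather than a hard limit.

```shell
# Compare a load average with the number of cores.
# Usage: load_status <load> <cores>   prints "ok" or "overloaded"
# (hypothetical helper; the threshold is a rule of thumb)
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        # Force numeric comparison of the two values.
        s = (load + 0 > cores + 0) ? "overloaded" : "ok"
        print s
    }'
}

# On Linux, the first field of /proc/loadavg is the 1-minute load average
# and nproc reports the number of available cores.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    cores=$(nproc)
    echo "load $load1 on $cores cores: $(load_status "$load1" "$cores")"
fi
```

A load of 6.2 on a 4-core machine would be reported as overloaded; the same load on 8 cores would not.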

Network Connectivity Problems
can't connect to the Internet
  1. Layer 1 Check - check by running
    ip link
    status will be UP or DOWN
  2. IP Check - check IP address
    ip addr
    no IP address usually means the DHCP client failed
    request a new one
    sudo dhclient -v
  3. Gateway Check - check router
    ip route
    ping the default gateway address returned
    failure means LAN failure
  4. Internet Check - ping Google's public DNS server by IP address
    ping 8.8.8.8
    failure means the router is not connected to the ISP
  5. DNS Check - ping Google by name
    ping google.com
    failure may mean DNS settings are wrong
    DNS services can also go down briefly, in which case the problem is not local
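
The five checks above can be run in order, stopping at the first one that fails. The following is a rough sketch, not a complete diagnostic: the check_* functions and first_failure are hypothetical names, the gateway lookup assumes a route of the form "default via <address> ...", and the ping options assume the Linux iputils version of ping.

```shell
# Each check returns 0 on success. Adjust them to suit the system.
check_link()     { ip -o link show up | grep -qv ' lo:'; }               # any non-loopback link UP?
check_address()  { ip -4 -o addr show scope global | grep -q .; }        # any global IPv4 address?
check_gateway()  {
    local gw
    # Assumes "default via <address> dev <iface>" in the routing table.
    gw=$(ip route show default | awk '/^default via/ { print $3; exit }')
    [ -n "$gw" ] && ping -c 1 -W 2 "$gw" >/dev/null 2>&1
}
check_internet() { ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; }             # reach an outside IP
check_dns()      { ping -c 1 -W 2 google.com >/dev/null 2>&1; }          # resolve and reach a name

# Print the name of the first failing layer, or "none" if all pass.
first_failure() {
    local step
    for step in link address gateway internet dns; do
        if ! "check_$step"; then
            echo "$step"
            return 1
        fi
    done
    echo "none"
}
```

Calling first_failure prints, for example, gateway when the link is up and an address is assigned but the router does not answer. Working from the bottom layer up keeps a DNS symptom from being blamed on DNS when the cable is unplugged.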

Package and Dependency Conflicts
"E: Unable to locate package" or "E: Unable to correct problems, you have held broken packages."
  1. update the cache
    sudo apt update
  2. fix broken installation
    first run
    sudo dpkg --configure -a
    then
    sudo apt install -f
  3. fix locked files
    "Could not get lock /var/lib/dpkg/lock" (or /var/lib/dpkg/lock-frontend)
    means another package manager is running
    wait 5 minutes for it to finish
    if problem persists run
    ps aux | grep apt
    kill the stuck process
    delete the lock file only once no apt or dpkg process is left
    sudo rm /var/lib/dpkg/lock-frontend
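
Before deleting a lock file it helps to confirm that no package manager is still running. The following is a minimal sketch: find_pkg_procs is a hypothetical helper name, and the list of process-name prefixes (apt, dpkg, unattended-upgr) is an assumption that may need extending on a given system.

```shell
# Filter "PID COMMAND" lines down to package-manager processes.
# Input is expected in the format printed by: ps -eo pid=,comm=
# (hypothetical helper; the name prefixes are an assumption)
find_pkg_procs() {
    awk '$2 ~ /^(apt|dpkg|unattended-upgr)/ { print $1, $2 }'
}
```

Typical use is ps -eo pid=,comm= | find_pkg_procs. Empty output suggests the lock is stale; any PID listed should be allowed to finish, or be stopped, before the lock file is removed.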

Permission Denied Errors
common error
  1. check ownership
    ls -l
    does the user running the app own the file?
    use chown to change ownership
  2. check permissions
    are the permissions rw- (no execute)?
    use chmod to set the x bit
  3. check directories
    entering a folder requires execute (x) permission on it and on every folder above it
  4. check AppArmor/SELinux
    if permissions are OK (even rwxrwxrwx) but access is still denied, suspect a MAC (Mandatory Access Control) issue
    check
    dmesg or /var/log/audit/audit.log
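
Steps 1 to 3 can be checked in one pass by listing the mode and owner of a file and of every folder above it. This is a sketch under assumptions: show_path_perms is a hypothetical helper name, and it relies on the GNU coreutils form of stat (the -c option). The util-linux command namei -l gives a similar listing.

```shell
# Print "mode owner:group path" for a path and each folder above it.
# (hypothetical helper; assumes GNU stat)
show_path_perms() {
    local p=$1
    while :; do
        # A folder without x on the way down makes everything below it unreachable.
        stat -c '%A %U:%G %n' -- "$p" 2>/dev/null || echo "(cannot stat) $p"
        case $p in
            /|.) break ;;
        esac
        p=$(dirname -- "$p")
    done
}
```

Running show_path_perms /var/www/html/index.html (an example path) prints one line per level. A line such as drw------- on a parent folder explains a denial even when the file itself is readable.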

Disk Space Issues: Finding and Removing Large Files
"No space left on device"
  1. verify
    run
    df -h
    confirm which partition is full
  2. locate
    go to root of partition and run
    sudo du -sh * | sort -h
    lists directories by size, largest last
    cd to largest-sized folder
    cd /<folder>
    again run
    sudo du -sh * | sort -h
    cd to largest-sized subfolder
    cd <subfolder>
    examine folder contents
    ls -lh
  3. fix
    the large file is often a log file
    delete the file
    sudo rm /var/<folder>/<filename>
    if the file is in use, the space will not be freed until the app that has it open restarts
    restart the log's service
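
The repeated du and sort step from the locate stage can be wrapped in a helper. This is a minimal sketch: largest_entries is a hypothetical name, sizes are reported in KiB so they sort numerically, and the * glob skips hidden (dot) entries.

```shell
# List the N largest entries directly under a folder, biggest last.
# Usage: largest_entries [folder] [count]
# (hypothetical helper; hidden entries are not included by the * glob)
largest_entries() {
    local dir=${1:-.} count=${2:-10}
    # -x stays on one filesystem, -s gives one total per entry, -k reports KiB.
    du -xsk -- "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}
```

For example, sudo bash -c with this function defined, or simply largest_entries /var 5 as root, shows the five biggest entries under /var. When a log is still held open by a running service, emptying it with truncate -s 0 on the file frees the space at once, whereas rm does not until the service restarts.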

Service Failures and Log Analysis
"Service failed to start"
  1. check status
    systemctl status apache2
    read error lines
  2. check journal
    journalctl -u apache2 -e
    the specific error is usually at the end, for example
    Syntax error on line 54 of /etc/apache2/apache2.conf
  3. fix config - correct the error
  4. restart
    systemctl restart apache2
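
Once the journal names a file and line, the offending line can be pulled up directly. The following is a sketch under assumptions: parse_syntax_error and show_line_context are hypothetical helper names, and the message is assumed to have the form "Syntax error on line N of FILE" as in the example above.

```shell
# Read log text on stdin and print "<line> <file>" for the first
# "Syntax error on line N of FILE" message found.
# (hypothetical helper; assumes that message format)
parse_syntax_error() {
    sed -nE 's/.*[Ss]yntax error on line ([0-9]+) of ([^ :]+).*/\1 \2/p' | head -n 1
}

# Print the given line with two lines of context either side, marked with ">".
# Usage: show_line_context <line> <file>
show_line_context() {
    local line=$1 file=$2
    awk -v n="$line" 'NR >= n - 2 && NR <= n + 2 {
        printf "%s %d: %s\n", (NR == n ? ">" : " "), NR, $0
    }' "$file"
}
```

A possible use is journalctl -u apache2 -e --no-pager | parse_syntax_error, then passing the two values it prints to show_line_context. For Apache specifically, apachectl configtest checks the configuration before the restart in step 4.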

Summary
covered
  • boot - Rescue Mode and chroot to fix broken systems
  • performance - identifying bottlenecks
  • networks - tracking packets
  • permissions - ownership, bits and Mandatory Access Control
  • logs - use journalctl to find problems

key points
  • don't panic - follow logical path
  • journalctl -xe - first command to run when a service fails
  • df -h / du -sh - tools for disk-space issues
  • ping / ip route / dig - network tools
  • top / iotop - identifying bottlenecks
  • Single User Mode - using GRUB to reset root passwords
  • logs - the answer is almost always in the logs