Ubuntu_Machine_for_Deep_Learning

Things to try: the Nix operating system (NixOS) and a Framework workstation.

Hardware

Mining Gear

Mining:

Type | Item | Price
CPU | Intel Core i3-10105 3.7 GHz Quad-Core Processor | $117.99 @ Amazon
CPU Cooler | Deepcool GAMMAXX 400 Blue 74.34 CFM CPU Cooler | $17.98 @ Amazon
Motherboard | MSI MAG B560M MORTAR WIFI Micro ATX LGA1200 Motherboard | $159.99 @ Amazon
Memory | Crucial 8 GB (1 x 8 GB) DDR4-2400 CL17 Memory | $34.97 @ Amazon
Storage | Western Digital Blue Mobile 2 TB 2.5" 5400RPM Internal Hard Drive | $63.75 @ Amazon
Video Card | Gigabyte GeForce RTX 3080 Ti 12 GB GAMING OC Video Card | $2799.99 @ Amazon
Power Supply | Corsair RMx (2018) 850 W 80+ Gold Certified Fully Modular ATX Power Supply | $114.99 @ Amazon

Prices include shipping, taxes, rebates, and discounts
Total $3309.66 -> $2,023.54 + $480.07 = $2,503.61 = 15,647.5625 yuan
Generated by PCPartPicker 2022-01-11 16:13 EST-0500
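
The arithmetic in the total line can be re-checked with a short script. This is only a sketch: the prices are copied from the list above, and the 6.25 yuan-per-dollar rate is an assumption inferred from 15647.5625 / 2503.61, not a quoted exchange rate.

```shell
#!/bin/sh
# List prices copied from the parts list above (USD).
prices="117.99 17.98 159.99 34.97 63.75 2799.99 114.99"

# Sum the list prices (one price per line fed to awk).
list_total=$(printf '%s\n' $prices | awk '{ s += $1 } END { printf "%.2f", s }')

# Adjusted total, using the two amounts given in the total line.
adjusted=$(awk 'BEGIN { printf "%.2f", 2023.54 + 480.07 }')

# Assumed exchange rate (inferred from the note, not an official quote).
rate=6.25
yuan=$(awk -v a="$adjusted" -v r="$rate" 'BEGIN { printf "%.4f", a * r }')

echo "list total:     \$$list_total"
echo "adjusted total: \$$adjusted"
echo "in yuan:        $yuan"
```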

Single Rank Memory vs Dual Rank Memory

Installing Ubuntu

First, download the image Here. Note that some versions (such as ubuntu-22.04.3-desktop-amd64.iso) might boot to a black screen, while ubuntu-22.04.2-desktop-amd64.iso works as intended.

Then write the image to a USB drive using Rufus on Windows, Etcher on macOS, or Startup Disk Creator on Ubuntu.

Set Up Remote Server

Set Up Remote Server for Windows Server

Steps:

  1. Follow How to Login Windows 11 Remotely from Ubuntu 20.04 / 22.04 / Debian 11
  2. Download and install RDPWrapper

If you have trouble connecting:

  1. Make sure you have the right color display mode
  2. Make sure your username is not "Administrator" but the email of your Microsoft Account
  3. Make sure you are not already logged in. Each RDP host can only have one session logged in at a time.

SSH Tutorial: Here

Set Up Remote Server for Ubuntu Server

Ctrl + Shift + NumLock: use the keyboard as a mouse. See https://askubuntu.com/questions/1033436/how-to-use-ubuntu-18-04-on-vnc-without-display-attached

Install Ubuntu Server:

  1. In GRUB, hit e to edit the boot command
  2. Follow the guide Here and add the nomodeset keyword to prevent a black screen

Partition: with an MBR partition table, you can have at most 4 primary partitions

SSH

VNC Servers

X11VNC: See This Video or This Post

Make sure to replace password in -passwd password with your actual VNC password (not your user password).

When you run x11vnc, it looks for your ~/.Xauthority, which is only created when you first log in through a display.

[Unit]
Description=x11vnc service
After=display-manager.service network.target syslog.target

[Service]
Type=simple
ExecStart=/usr/bin/x11vnc -forever -display :0 -auth guess -passwd password
ExecStop=/usr/bin/killall x11vnc
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then reload systemd and enable the service.
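
A minimal sketch of installing the unit above with the password filled in. The helper name render_x11vnc_unit and the unit file name x11vnc.service are assumptions made for this example. It also assumes the VNC password contains no spaces, since it is pasted straight into ExecStart.

```shell
#!/bin/sh
# Print the x11vnc unit shown above with the VNC password substituted.
# render_x11vnc_unit is a hypothetical helper name.
render_x11vnc_unit() {
    vnc_password=$1
    cat <<EOF
[Unit]
Description=x11vnc service
After=display-manager.service network.target syslog.target

[Service]
Type=simple
ExecStart=/usr/bin/x11vnc -forever -display :0 -auth guess -passwd ${vnc_password}
ExecStop=/usr/bin/killall x11vnc
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (needs root; the file name is an assumption):
#   render_x11vnc_unit 'my-vnc-password' | sudo tee /etc/systemd/system/x11vnc.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now x11vnc.service
#   systemctl status x11vnc.service
```

Anyone who can read the unit file can read the password, so treat it as a low-value secret.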

If you don't have a display, you need a virtual display:

#/etc/X11/xorg.conf
Section "Device"
   Identifier "Configured Video Device"
   Driver     "dummy"
   VideoRam   40000
EndSection
Section "Monitor"
   Identifier "Configured Monitor"
   HorizSync 22-83
   VertRefresh 50-70
   Modeline   "1920x1080_60.05" 173.00 1920 2048 2248 2576 1080 1083 1088 1120 -hsync +vsync
EndSection
Section "Screen"
   Identifier "DefaultScreen"
   Monitor    "Configured Monitor"
   Device     "Configured Video Device"
   DefaultDepth 24
   SubSection "Display"
       Depth 24
       Modes "1920x1200"
   EndSubSection
EndSection

Vino

Section "Device"
    Identifier "DummyDevice"
    Driver "dummy"
    VideoRam 256000
EndSection

Section "Screen"
    Identifier "DummyScreen"
    Device "DummyDevice"
    Monitor "DummyMonitor"
    DefaultDepth 24
    SubSection "Display"
        Depth 24
        Modes "1920x1080_60.0"
    EndSubSection
EndSection

Section "Monitor"
    Identifier "DummyMonitor"
    HorizSync 30-70
    VertRefresh 50-75
    ModeLine "1920x1080" 148.50 1920 2448 2492 2640 1080 1084 1089 1125 +Hsync +Vsync
EndSection

FreeNX

TightVNC (Here or Here or Here)

TigerVNC: Good Tutorial

xRDP: Good Tutorial

unset DBUS_SESSION_BUS_ADDRESS
unset XDG_RUNTIME_DIR

Nvidia: sudo ubuntu-drivers autoinstall

[Allow Wifi Scan]
Identity=unix-user:*
Action=org.freedesktop.NetworkManager.wifi.scan;org.freedesktop.NetworkManager.enable-disable-wifi;org.freedesktop.NetworkManager.settings.modify.own;org.freedesktop.NetworkManager.settings.modify.system;org.freedesktop.NetworkManager.network-control
ResultAny=yes
ResultInactive=yes
ResultActive=yes

Other stuff:

Basic Configurations

Install ssh stuff: just edit ~/.ssh/config

ssh-keygen -t ed25519 -C "your_email@example.com"
# Run this without sudo: with sudo, the key pair is written to /root/.ssh/
# instead of your own ~/.ssh/ directory
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
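
For the ~/.ssh/config part, a sketch along these lines appends one Host entry that uses the key generated above. The helper name add_ssh_host, the alias, the address, and the user name are all placeholders for illustration.

```shell
#!/bin/sh
# Append a Host block to an ssh config file (default: ~/.ssh/config).
# add_ssh_host and its arguments are hypothetical names.
add_ssh_host() {
    alias_name=$1
    host_name=$2
    user_name=$3
    config_file=${4:-$HOME/.ssh/config}
    mkdir -p "$(dirname "$config_file")"
    cat >>"$config_file" <<EOF

Host ${alias_name}
    HostName ${host_name}
    User ${user_name}
    IdentityFile ~/.ssh/id_ed25519
    AddKeysToAgent yes
EOF
    # ssh refuses config files that other users can write to.
    chmod 600 "$config_file"
}

# Usage (placeholder values):
#   add_ssh_host gpubox 192.0.2.10 alice
#   ssh gpubox
```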

Install shell stuff (systemwide)

sudo apt install zsh tmux neofetch curl git
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
# and add neofetch to .zshrc

Customize zsh:

To enable powerlevel10k and zsh-autosuggestions, clone them with the commands below, then set ZSH_THEME="powerlevel10k/powerlevel10k" and add zsh-autosuggestions to plugins=(...) in .zshrc.

git clone --depth=1 https://github.com/romkatv/powerlevel10k.git ${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/themes/powerlevel10k
git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions
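
The two .zshrc changes can also be scripted. This is a sketch with a hypothetical helper name, configure_zshrc, and it assumes the single-line ZSH_THEME=... and plugins=(...) entries that the oh-my-zsh template ships with; a plugins list spread over several lines is not handled.

```shell
#!/bin/sh
# Set the powerlevel10k theme and add zsh-autosuggestions to the plugin list.
# configure_zshrc is a hypothetical helper name.
configure_zshrc() {
    zshrc=${1:-$HOME/.zshrc}
    tmp=$(mktemp)
    # First expression: replace the theme line.
    # Second expression: append the plugin unless it is already listed.
    sed -e 's|^ZSH_THEME=.*|ZSH_THEME="powerlevel10k/powerlevel10k"|' \
        -e '/^plugins=(.*zsh-autosuggestions/!s|^plugins=(\(.*\))|plugins=(\1 zsh-autosuggestions)|' \
        "$zshrc" >"$tmp" && cat "$tmp" >"$zshrc"
    rm -f "$tmp"
}

# Usage:
#   configure_zshrc            # edits ~/.zshrc
#   configure_zshrc ./zshrc    # edits another file
```

Running it twice is harmless, because the second expression skips a plugins line that already contains zsh-autosuggestions.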

For settings, choose:

Install python stuff (user)

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Random Stuff

Krita

To install Krita, download the .AppImage from their website and put it into /home/koke_cacao/bin/. Then run cd ~/bin && ln -s krita-5.1.0-x86_64.appimage krita.

Cloud Servers

Oracle Setup

# Open up the firewall on Oracle Cloud (note: to sign in, choose to sign in with a username and don't select a region)
# Remember to go to the subnet and set Ingress Rules
# https://stackoverflow.com/questions/54794217/opening-port-80-on-oracle-cloud-infrastructure-compute-node
# https://medium.com/@harjulthakkar/part5-firewall-configuration-on-oracle-public-cloud-3b71b487666c
echo "Configuring iptables..."
sudo iptables -L && \
sudo iptables-save > ~/iptables-rules && \
sudo iptables -P INPUT ACCEPT && \
sudo iptables -P OUTPUT ACCEPT && \
sudo iptables -P FORWARD ACCEPT && \
sudo iptables -F && \
sudo iptables-save | sudo tee /etc/iptables.conf && \
echo "My service will automatically do sudo iptables-restore < /etc/iptables.conf to load saved iptables.conf on server start."

Nvidia Stuff

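Understand NVCC, CUDA Driver, CUDA Toolkit, cuDNN

Terminology:

Compute Unified Device Architecture (CUDA): a programming language, an API, and a programming model.

cuDNN: a GPU-accelerated software library for deep learning computation.

CUDA Toolkit: the development kit. It bundles nvcc and many GPU-accelerated libraries, such as cuFFT; cuDNN is distributed separately.

NVCC: the CUDA compiler. How it treats an input file depends on the file extension: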

File extension | Meaning
.cu | CUDA source file containing host and device code
.cup | Preprocessed CUDA source file (compile option --preprocess/-E)
.c | C source file
.cc/.cxx/.cpp | C++ source file
.gpu | GPU intermediate file (compile option --gpu)
.ptx | PTX file, similar to assembly code (compile option --ptx)
.o/.obj | Object file (compile option --compile/-c)
.a/.lib | Library file (compile option --lib/-lib)
.res | Resource file
.so | Shared object file (compile option --shared/-shared)
.cubin | CUDA binary file (compile option -cubin)

nvidia-smi: a command-line tool built on the NVIDIA Management Library (NVML) for monitoring and managing GPU performance and state

Sometimes the CUDA version shown by nvcc --version differs from the one shown by nvidia-smi. This is because there is a runtime API and a driver API: nvcc --version reports the version of the installed toolkit (runtime), while nvidia-smi reports the highest CUDA version that the installed driver supports.

During development, you normally choose either the runtime API or the driver API. The runtime API is a higher-level wrapper that is easier to use, while the driver API is the lower-level layer. The difference is documented here
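
To compare the two numbers on your own machine, a small sketch like the following pulls the version out of each tool's output. The helper names toolkit_cuda_version and driver_cuda_version are made up for this example. The patterns assume the usual "release X.Y" line printed by nvcc --version and the "CUDA Version: X.Y" field in the nvidia-smi header.

```shell
#!/bin/sh
# Read `nvcc --version` output on stdin and print the toolkit release,
# e.g. from: "Cuda compilation tools, release 11.8, V11.8.89"
toolkit_cuda_version() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read `nvidia-smi` output on stdin and print the CUDA version that the
# driver reports in its header line ("... CUDA Version: 11.8 ...").
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage on a machine with both tools installed:
#   nvcc --version | toolkit_cuda_version
#   nvidia-smi     | driver_cuda_version
```

A toolkit version lower than the driver's number is the normal case; the reverse usually means the driver is too old for the toolkit.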

Installation

TLDR

To install GPU support

For both the driver and the CUDA Toolkit, we recommend the .run file, as it is the most reliable method on Ubuntu 22.04.

Other Complicated Ways

Install cuda stuff

Pre-Installation Checks:

You need to know which CUDA version you need and which driver versions are compatible with it:

CUDA Forward Compatible Upgrade 418.40.04+ (CUDA 10.1) 450.36.06+ (CUDA 11.0) 470.57.02+ (CUDA 11.4) 495.29.05+ (CUDA 11.5) 510.39.01+ (CUDA 11.6) 515.43.04+ (CUDA 11.7)
11-7 X C C X* C Not Required
11-6 C C C Not required X
11-5 C C C X X
11-4 C C Not required X X
11-3 C C X X X
11-2 C C X X X
11-1 C C X X X
11-0 C Not required X X X
10-2 C X X X X
10-1 Not required X X X X
10-0 X X X X X

For updated version of table and more information, read here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package
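
The header row of the table can be turned into a quick check of whether an installed driver already covers a CUDA release without a compatibility package. This is a sketch: the version pairs are copied from the header row above, the function names are hypothetical, and sort -V is a GNU coreutils option.

```shell
#!/bin/sh
# Minimum driver version that shipped with each CUDA release, taken from
# the header row of the table above. Prints nothing for unlisted releases.
native_driver_for_cuda() {
    case $1 in
        10.1) echo 418.40.04 ;;
        11.0) echo 450.36.06 ;;
        11.4) echo 470.57.02 ;;
        11.5) echo 495.29.05 ;;
        11.6) echo 510.39.01 ;;
        11.7) echo 515.43.04 ;;
    esac
}

# Succeeds when version $1 >= version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds when driver $1 supports CUDA release $2 natively.
driver_supports_cuda() {
    driver=$1
    needed=$(native_driver_for_cuda "$2")
    [ -n "$needed" ] && version_ge "$driver" "$needed"
}

# Usage:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_supports_cuda "$driver" 11.7 && echo "no compat package needed"
```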

To perform the recommended default installation of the Nvidia driver and CUDA:

  1. To install the driver: sudo ubuntu-drivers autoinstall
  2. To install CUDA: add the repository (package or package), then run sudo apt install cuda (if the package is not found, you might need to try sudo apt install nvidia-cuda-toolkit)

You don't always want to install the latest version. For example, PyTorch is built against a specific CUDA version (11.8 here), so install the matching toolkit; the 11.8 installer comes with driver 520. Newer drivers stay backward compatible with older toolkits, but there might be compatibility issues with third-party plugins.

Debugging During Installations

If sudo apt install cuda fails, check whether any Nvidia-related packages are already installed with sudo apt list --installed | grep nvidia.

If there is a version conflict that says x but it is not going to be installed, try sudo apt install x to see what it reports. A message like x but y is to be installed means that a dependency asks for version x of the package, but apt has resolved the package to version y. This typically happens on a vendor-installed operating system that overrides the apt sources, so the package we are trying to install resolves to the vendor-specific version y. To see all the apt sources:

ls /etc/apt/sources.list.d

You can remove the ones you don't want, such as cuda-ubuntu2204-x86_64.list and cuda-ubuntu2204-x86_64.list.save.

You could also try sudo apt --fix-broken install, but check carefully what it is going to do.

If sudo apt install cuda can't find the package, you can also install with the .run file. Version 11.8 is recommended:

❯ sudo sh cuda_11.8.0_520.61.05_linux.run
===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-11.8/

Please make sure that
 -   PATH includes /usr/local/cuda-11.8/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.8/lib64, or, add /usr/local/cuda-11.8/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.8/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

To uninstall, run:

# To uninstall cuda
sudo /usr/local/cuda/bin/cuda-uninstaller
# To uninstall nvidia
sudo /usr/bin/nvidia-uninstall

Post-Installation

Environment Variables

Define the environment variables in .bashrc (or .zshrc if you use zsh), matching the version you installed:

export PATH="/usr/local/cuda-11.8/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

Systemwide Settings

  1. Make sure the nvidia-persistenced service is enabled (check with systemctl status nvidia-persistenced). If not, enable it.
  2. Disable hot-pluggable memory.

By default, the udev rules enable hot-pluggable memory. The default rules can be viewed in /lib/udev/rules.d/40-vm-hotadd.rules.

However, Nvidia doesn't work well with this default. To change it, copy the default rules file to /etc/udev/rules.d and edit the copy.

Do not change files directly in /lib/udev/rules.d/. A file with the same name in /etc/udev/rules.d overrides the default.

We therefore run sudo cp /lib/udev/rules.d/40-vm-hotadd.rules /etc/udev/rules.d and then remove the line that looks like:

SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"
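
The copy-and-edit step can be sketched as one helper. disable_memory_hotadd is a hypothetical name, and this version comments the rule out instead of deleting it so that the change is easy to undo.

```shell
#!/bin/sh
# Copy a udev rules file and comment out the memory hot-add rule in the copy.
# disable_memory_hotadd is a hypothetical helper name.
disable_memory_hotadd() {
    src=$1
    dst=$2
    # "&" in the replacement stands for the matched text, so the rule is
    # kept but prefixed with "# ".
    sed 's/^SUBSYSTEM=="memory", ACTION=="add"/# &/' "$src" >"$dst"
}

# Usage, from a root shell:
#   disable_memory_hotadd /lib/udev/rules.d/40-vm-hotadd.rules \
#                         /etc/udev/rules.d/40-vm-hotadd.rules
# A reboot is the simplest way to make sure the new rule file is in effect.
```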

Small Tweaks

sudo add-apt-repository -y ppa:lubomir-brindza/nautilus-typeahead
sudo apt install nautilus

Other Config

  1. Add the Chinese language by following this
  2. Set nautilus search to only 1 level
  3. Bind flameshot gui to a keyboard shortcut to take screenshots

Gnome Extensions:

- Desktop Icons NG (DING) by rastersoft (system)
- Freon by UshakovVasilii
- OpenWeather by skrewball
- Ubuntu AppIndicators by didrocks (system)
- Ubuntu Dock by didrocks (system)

CUDA Toolkit

sudo apt install nvidia-cuda-toolkit

APT Source

Install CMake by following this guide

Install additional codecs

sudo apt install ubuntu-restricted-extras

Issues

Device Issues

If mouse middle click scroll does not work:

Make CUDA Failed

When doing apt install, you might get:

Error! Bad return status for module build on kernel: 5.15.0-46-generic (x86_64)
Consult /var/lib/dkms/nvidia/515.65.01/build/make.log for more information.
dpkg: error processing package nvidia-dkms-515 (--configure):
 installed nvidia-dkms-515 package post-installation script subprocess returned error exit status 10

When consulting make.log, you would see:

ProblemType: Package
DKMSBuildLog:
 DKMS make.log for nvidia-515.65.01 for kernel 5.15.0-46-generic (x86_64)
 Thu Aug 11 03:40:04 AM EDT 2022
 make[1]: Entering directory '/usr/src/linux-headers-5.15.0-46-generic'
 test -e include/generated/autoconf.h -a -e include/config/auto.conf || (               \
 echo >&2;                                                      \
 echo >&2 "  ERROR: Kernel configuration is invalid.";          \
 echo >&2 "         include/generated/autoconf.h or include/config/auto.conf are missing.";\
 echo >&2 "         Run 'make oldconfig && make prepare' on kernel src to fix it.";     \
 echo >&2 ;                                                     \
 /bin/false)
 warning: the compiler differs from the one used to build the kernel
   The kernel was built by: gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
   You are using:           cc (Ubuntu 10.3.0-15ubuntu1) 10.3.0
 make -f ./scripts/Makefile.build obj=/var/lib/dkms/nvidia/515.65.01/build \
 single-build= \
 need-builtin=1 need-modorder=1
   ln -sf /var/lib/dkms/nvidia/515.65.01/build/nvidia/nv-kernel.o_binary /var/lib/dkms/nvidia/515.65.01/build/nvidia/nv-kernel.o
   ln -sf /var/lib/dkms/nvidia/515.65.01/build/nvidia-modeset/nv-modeset-kernel.o_binary /var/lib/dkms/nvidia/515.65.01/build/nvidia-modeset/nv-modeset-kernel.o

In the log, you might also see a warning that the kernel was built with gcc-11 but the driver is being compiled with an older gcc (10.3.0 in the log above). Switching the default compiler to gcc-11 and rerunning apt install will solve the problem.
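
To spot the compiler mismatch without reading the whole log, a sketch like this extracts the two major versions from the warning lines. compiler_majors is a hypothetical helper name, and it assumes the "The kernel was built by:" and "You are using:" lines look like the ones in the log above.

```shell
#!/bin/sh
# Read a DKMS make.log on stdin and print "<kernel gcc major> <current gcc major>".
# compiler_majors is a hypothetical helper name.
compiler_majors() {
    awk '
        /The kernel was built by:/ { split($NF, v, "."); kernel = v[1] }
        /You are using:/           { split($NF, v, "."); current = v[1] }
        END { if (kernel != "" && current != "") print kernel, current }
    '
}

# Usage:
#   compiler_majors < /var/lib/dkms/nvidia/515.65.01/build/make.log
#
# If the two numbers differ, one common way to switch the default compiler
# (an assumption, not taken from the log) is:
#   sudo apt install gcc-11
#   sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 110
```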

ZFS Storage Full

Use the zfs list -r -t snapshot -o name,used,referenced,creation bpool/BOOT command to see all snapshots. Use zfs list -r -t snapshot -o name,used,referenced,creation bpool/BOOT | tail -n 4 | cut -c 35-40 | xargs -n 1 sudo zsysctl state remove --system to remove the last 4 snapshots.
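
The cut -c 35-40 step depends on every dataset name having the same length. A sketch that splits on the snapshot suffix instead is shown below. It assumes snapshot names of the form dataset@autozsys_ID, which is inferred from the column range above rather than stated anywhere; snapshot_state_ids is a hypothetical helper name.

```shell
#!/bin/sh
# Read `zfs list -t snapshot` output on stdin and print the ID that follows
# "@autozsys_" in each snapshot name. Lines without that marker (such as
# the header row) are skipped.
snapshot_state_ids() {
    awk '{ n = split($1, parts, "@autozsys_"); if (n == 2) print parts[2] }'
}

# Usage, mirroring the pipeline above:
#   zfs list -r -t snapshot -o name,used,referenced,creation bpool/BOOT \
#       | snapshot_state_ids | tail -n 4 \
#       | xargs -n 1 sudo zsysctl state remove --system
```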

To check snapshots: zfs list -t snapshot

To remove all bpool snapshots: for i in $(sudo zfs list -t snapshot | grep bpool | awk '{print $1}'); do sudo zfs destroy -R $i;done

To remove all rpool snapshots: for i in $(sudo zfs list -t snapshot | grep rpool | awk '{print $1}'); do sudo zfs destroy -R $i;done

Error Occurred at Startup

You can run sudo dmesg or check /var/crash.

Davinci Resolve

Davinci Resolve reports "The GPU failed to perform image processing because of an error. Error code 999." This link gives the solution.

If the Nvidia GPU is used in on-demand mode, you have to request it explicitly. To do so, set the following environment variables:

export __NV_PRIME_RENDER_OFFLOAD=1
export __GLX_VENDOR_LIBRARY_NAME=nvidia

Davinci Resolve can then be launched from /opt/resolve/bin/resolve.

Other solutions (not working) involve:

Connecting to Public WiFi

After you set up Ethernet to connect to your laptop, the route table looks like this:

(base) ➜  ~ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.0.0.2        0.0.0.0         UG    20100  0        0 enp2s0
default         _gateway        0.0.0.0         UG    20600  0        0 wlp3s0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 enp2s0
100.64.0.0      0.0.0.0         255.255.240.0   U     600    0        0 wlp3s0
link-local      0.0.0.0         255.255.0.0     U     1000   0        0 enp2s0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

You can see that the default route via gateway 10.0.0.2 (enp2s0) has the lowest metric, 20100, and therefore the highest priority. This is a problem when connecting to a public WLAN. Usually you open https://_gateway to register on the WiFi, but because the WiFi route has lower priority, the request is sent through 10.0.0.2 instead. To resolve this, raise the metric of enp2s0 so that it has lower priority than wlp3s0.

This can be done by installing ifmetric and running sudo ifmetric enp2s0 30600. You can download an offline copy of the package and copy it to the server over ssh.

(base) ➜  ~ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    20600  0        0 wlp3s0
default         10.0.0.2        0.0.0.0         UG    30600  0        0 enp2s0
10.0.0.0        0.0.0.0         255.255.255.0   U     30600  0        0 enp2s0
100.64.0.0      0.0.0.0         255.255.240.0   U     600    0        0 wlp3s0
link-local      0.0.0.0         255.255.0.0     U     30600  0        0 enp2s0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

This method does not interrupt your current connection. The route table should reset automatically after a reboot.
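
To confirm which interface wins before and after the change, a sketch like this picks the default route with the lowest metric from route output. preferred_default_iface is a hypothetical helper name, and it assumes the column layout shown in the tables above (metric in column 5, interface in column 8).

```shell
#!/bin/sh
# Read `route` output on stdin and print the interface of the default route
# with the lowest metric, which is the one the kernel prefers.
preferred_default_iface() {
    awk '$1 == "default" {
             metric = $5 + 0
             if (!found || metric < best) { best = metric; iface = $8; found = 1 }
         }
         END { if (found) print iface }'
}

# Usage (plain `route`, since `route -n` prints 0.0.0.0 instead of "default"):
#   route | preferred_default_iface
```

With the first table above this prints enp2s0, and with the second table it prints wlp3s0.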

Apple Airpod Connection

Following this article:

To enable pairing of AirPods, change ControllerMode from its default value of dual to bredr by editing /etc/bluetooth/main.conf. Then restart the Bluetooth service with sudo /etc/init.d/bluetooth restart.
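
The edit to main.conf can be scripted as in the sketch below. set_controller_mode is a hypothetical helper name, and it assumes the file already contains a ControllerMode line, possibly commented out with a leading #, as in the stock configuration.

```shell
#!/bin/sh
# Set ControllerMode in a BlueZ main.conf, uncommenting the line if needed.
# set_controller_mode is a hypothetical helper name.
set_controller_mode() {
    mode=$1
    conf=${2:-/etc/bluetooth/main.conf}
    tmp=$(mktemp)
    sed -E "s/^#?[[:space:]]*ControllerMode[[:space:]]*=.*/ControllerMode = ${mode}/" \
        "$conf" >"$tmp" && cat "$tmp" >"$conf"
    rm -f "$tmp"
}

# Usage, from a root shell:
#   set_controller_mode bredr
#   /etc/init.d/bluetooth restart
```

Setting the value back to dual afterwards restores the default behaviour.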
