Ubuntu Server 20.04 randomly crashing

server

At least once a day my home server crashes. This is a recent build (7/16/2020) with all new hardware.

Machine Specs:

  • AMD Ryzen 5 3400G with Radeon Vega Graphics
  • B450 AORUS M
  • 32 GB RAM DIMM DDR4
  • 1TB SSD M2
  • 2 6TB HDD
  • 1 TB HDD
  • Ubuntu 20.04.1 LTS

I'm currently running the following apps, installed via APT:

  • Roon Server
  • Docker (the snap version has been uninstalled)
  • Samba
  • Restic for backups

In Docker I'm running:

  • Pi-hole (I have turned off the DNS resolver, so it's not a port issue there)
  • Portainer
  • Plex
  • Resilio Sync

I am not able to find much in the logs, but I did come across the following section that piqued my interest after the system crashed this afternoon. It doesn't tell me much, but maybe someone here can help get me headed in the right direction.

For some reason I can't copy and paste the actual log as I'm seeing it, so I've included a screenshot. In short, it looks like it's doing something with Docker, and then I get a run of '<0x00>' entries (bad memory locations?).

[Screenshot of the log excerpt]

After it crashes I have zero ability to interact with the system. The screen shows some information, but I don't know what it means or how to get at that data. If it crashes again I'll take a photo with my phone.

I am not a Linux/Ubuntu expert (but I'm pretty proficient with Windows) and have been learning as I go since last Thursday, when I built the machine and started installing Ubuntu.

What I've tried thus far:

  • I have made sure there is disk space available. None of the drives are even remotely full (30%-40% utilized), and RAM is showing 32GB available. When it crashed recently it was under little to no load; I was just streaming Roon in another room.
  • Docker seems to be running as expected. I unintentionally ended up with Docker installed via both apt and snap, which was causing some problems, but I seem to have remedied that (I think): I uninstalled both the snap and apt versions and ensured any remaining folders etc. were removed.
  • The BIOS shows all memory is loaded and recognized.
  • fdisk -l shows nothing odd; all drives look to be the right size with the right partitioning.
  • free -h shows 4Gi total for the swap file with 12Mi used, and RAM is showing 29Gi total and 28Gi available.
  • dmesg shows the following error quite a few times. Searching for it isn't yielding much.
    [ 2328.925902] BUG: unable to handle page fault for address: 0000000000c045c7
    [ 2328.925905] #PF: supervisor write access in kernel mode
    [ 2328.929589] RIP: 0010:fsnotify+0x63/0x3d0
    [ 2328.933164] #PF: error_code(0x0002) - not-present page
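
Since the machine is unresponsive after a crash, it may be easier to pull these lines out of the kernel log after rebooting. The following is a minimal sketch, not a definitive procedure: the helper name filter_page_faults is made up for illustration, the patterns are simply the strings from the excerpt above, and reading the previous boot assumes the systemd journal is kept across reboots on this install, which may not be the case.

```shell
#!/bin/sh
# Hypothetical helper: keep only the page-fault lines from kernel-log
# text supplied on stdin. The patterns come from the dmesg excerpt above.
filter_page_faults() {
    grep -E 'BUG: unable to handle page fault|#PF:|RIP:'
}

# Usage (left as comments so nothing runs automatically):
#   sudo dmesg | filter_page_faults            # current boot
#   journalctl -k -b -1 | filter_page_faults   # previous boot, if the journal is persistent
```

If the same RIP function shows up every time, that points more toward a software bug; if the faulting function and address vary between crashes, hardware becomes a more likely suspect.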

Any help or ideas anyone might have are greatly appreciated; this is getting somewhat annoying.

Edit: Per suggestions of @heynnema

sudo dmidecode -s bios-version returns F50

sysctl vm.swappiness returns vm.swappiness = 60

sudo lshw -C memory:

*-firmware
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: F50
       date: 11/27/2019
       size: 64KiB
       capacity: 16MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: 9
       slot: System board or motherboard
       size: 32GiB
     *-bank:0
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 1866 MHz (0.5 ns)
          product: F4-3200C16-8GVKB
          vendor: Unknown
          physical id: 0
          serial: 00000000
          slot: DIMM 0
          size: 8GiB
          width: 64 bits
          clock: 1866MHz (0.5ns)
     *-bank:1
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 1866 MHz (0.5 ns)
          product: F4-3200C16-8GVKB
          vendor: Unknown
          physical id: 1
          serial: 00000000
          slot: DIMM 1
          size: 8GiB
          width: 64 bits
          clock: 1866MHz (0.5ns)
     *-bank:2
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 1866 MHz (0.5 ns)
          product: F4-3200C16-8GVKB
          vendor: Unknown
          physical id: 2
          serial: 00000000
          slot: DIMM 0
          size: 8GiB
          width: 64 bits
          clock: 1866MHz (0.5ns)
     *-bank:3
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 1866 MHz (0.5 ns)
          product: F4-3200C16-8GVKB
          vendor: Unknown
          physical id: 3
          serial: 00000000
          slot: DIMM 1
          size: 8GiB
          width: 64 bits
          clock: 1866MHz (0.5ns)
  *-cache:0
       description: L1 cache
       physical id: b
       slot: L1 - Cache
       size: 384KiB
       capacity: 384KiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: c
       slot: L2 - Cache
       size: 2MiB
       capacity: 2MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=2
  *-cache:2
       description: L3 cache
       physical id: d
       slot: L3 - Cache
       size: 4MiB
       capacity: 4MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=3
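
The lshw output above is long, and the per-module details are easier to compare on one line each. The sketch below condenses lshw text into one line per bank (bank, product, size, clock). The function name summarize_dimms is hypothetical, and the parsing assumes the field layout shown in the output above; other lshw versions may format things differently.

```shell
#!/bin/sh
# Hypothetical helper: condense `lshw -C memory` text on stdin into one
# line per populated memory bank: bank, product, size, clock.
summarize_dimms() {
    awk '
        # Any new "*-section" header ends the previous bank.
        /^ *\*-/     { bank = ""; product = ""; size = "" }
        # A bank header starts collecting fields for that bank.
        /\*-bank:/   { bank = $1; sub(/^\*-/, "", bank) }
        bank != "" && $1 == "product:" { product = $2 }
        bank != "" && $1 == "size:"    { size = $2 }
        # The clock line is the last field of a bank in the output above.
        bank != "" && $1 == "clock:"   { print bank, product, size, $2; bank = "" }
    '
}

# Usage (left as a comment so nothing runs automatically):
#   sudo lshw -C memory | summarize_dimms
```

On the output above this would list all four banks with the same product string and a clock of 1866MHz, which makes it easy to see that the modules are identical and what speed they are currently running at.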

Best Answer

Your dmesg output shows kernel page fault errors.

BIOS

Gigabyte B450 AORUS M

You have BIOS version F50.

There's a newer BIOS available, version F51f, and it can be downloaded here.

Update video available here.

Note: Confirm that I have the correct web page for your model #.

Note: Have good backups before updating the BIOS.
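
To help confirm the board model and the installed BIOS version before downloading anything, the values can usually be read from sysfs without sudo. This is a rough sketch: show_board_info is a made-up name, the optional directory argument is only there so the function can be pointed somewhere else, and it assumes the kernel exposes DMI data under /sys/class/dmi/id, which is typical on PC hardware but not guaranteed.

```shell
#!/bin/sh
# Hypothetical helper: print board model and BIOS version/date from sysfs.
# The optional first argument overrides the directory; it defaults to the
# usual DMI location.
show_board_info() {
    dmi_dir=${1:-/sys/class/dmi/id}
    for field in board_vendor board_name bios_version bios_date; do
        if [ -r "$dmi_dir/$field" ]; then
            printf '%s: %s\n' "$field" "$(cat "$dmi_dir/$field")"
        else
            printf '%s: (not available)\n' "$field"
        fi
    done
}

# Usage (left as a comment so nothing runs automatically):
#   show_board_info
```

The bios_version line should match what sudo dmidecode -s bios-version reports, and board_name should match the model on the vendor's download page.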

memtest

Go to https://www.memtest86.com/ and download and run their free memtest to test your memory. Get at least one complete pass of all 4/4 tests to confirm the memory is good. This may take many hours to complete.

Update #1:

memtest has failed. We'll first update the BIOS, then retest with memtest, and troubleshoot memory if errors still occur.

Update #2:

[Three screenshots]

Update #3:

After updating the BIOS, memtest still failed. We tested various pairs of DIMMs in slots 1 & 2, and they all passed memtest. I believe there is a compatibility problem between the Ryzen CPU and the G.SKILL DIMMs when all four DIMMs are installed, so we swapped them for Corsair DIMMs.

memtest now runs all 4/4 tests with no errors!

Reference: CPU Support list https://www.gigabyte.com/us/Motherboard/B450-AORUS-M-rev-10/support#support-cpu

Reference: RAM Support list https://www.gigabyte.com/us/Motherboard/B450-AORUS-M-rev-10/support#support-doc
