Ubuntu – Mysterious minutes-long system freeze

16.10 dell freeze xps

My laptop currently freezes for long periods (about 5 minutes at a time). I took a photo of the screen to point out a few symptoms (a screenshot wasn't possible because of the freeze).

[Image: photo of the frozen screen]

So, what freezes:

  • The VM freezes (right side); it was in the process of shutting down
  • Websites won't load (background)
  • Can't ping websites (terminal window) and, after a while, can't enter text in terminal window either (notice 'open rectangle' text cursor)
  • File browser freezes and doesn't show folder content (Dolphin window)
  • Can't open Dash home

What doesn't freeze:

  • Can still move mouse
  • Can still put focus on window
  • Can still enter terminal with Ctrl+Alt+F1

Additional information:

  • There seem to be two stages: one during which I can still open new programs, for example, and one during which even that is no longer possible. I suspect the second stage starts when I try to view the contents of the home folder (~), but I might be completely off with that.

  • After about 5 minutes, the system unfreezes as if nothing ever happened.

  • It happens a few times a day. A reboot doesn't make it go away.

  • In at least one case (I'll try more as it keeps occurring), switching to a different Wi-Fi network instantly resolved the problem. Switching back to the original network didn't cause the problem to reappear (at least not immediately).

I don't know where to start looking, but reading around suggests the dmesg output might be a good place. Its contents can be found here. The relevant portion of /var/log/syslog can be found here. Both mention a firmware crash at [3125.851869], which corresponds to Jan 9 19:24:03.
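For anyone cross-referencing the two logs: a dmesg timestamp is the number of seconds since boot, so it can be mapped to wall-clock time once a single matching pair is known. Here is a minimal sketch of that arithmetic using the pair above. The year 2017 is an assumption (syslog doesn't print it), the function names are made up for illustration, and the result is only as precise as syslog's one-second resolution. Time spent suspended may also skew the mapping.

```python
from datetime import datetime, timedelta


def boot_time(wall_clock, uptime_seconds):
    """Estimate the boot time from one (wall clock, dmesg timestamp) pair."""
    return wall_clock - timedelta(seconds=uptime_seconds)


def dmesg_to_wall_clock(boot, uptime_seconds):
    """Map a dmesg timestamp (seconds since boot) to wall-clock time."""
    return boot + timedelta(seconds=uptime_seconds)


# Known pair from the logs: [3125.851869] <-> Jan 9 19:24:03.
# The year is assumed; syslog lines do not include it.
crash_wall_clock = datetime(2017, 1, 9, 19, 24, 3)
crash_uptime = 3125.851869

boot = boot_time(crash_wall_clock, crash_uptime)
print("Estimated boot time:", boot)

# Round trip: the crash timestamp should map back to the syslog time.
print("Crash at:", dmesg_to_wall_clock(boot, crash_uptime))
```

Any other dmesg timestamp from the same boot can be passed to dmesg_to_wall_clock to find the matching syslog lines.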

I'm running Ubuntu 16.10 on a new Dell XPS 13 (Kaby Lake). Let me know if there is any more information I can provide.


Edit

The dmesg log now mentions a hardware error:

[   38.276956] Key type id_legacy registered
[  300.462458] mce: [Hardware Error]: Machine check events logged
[  311.013944] SUPR0GipMap: fGetGipCpu=0x3
[  311.521449] vboxdrv: ffffffffc0000020 VMMR0.r0
[  311.706008] vboxdrv: ffffffffc0102020 VBoxDDR0.r0
[  311.799288] vboxdrv: ffffffffc0122020 VBoxEhciR0.r0
[  327.508305] wlp58s0: AP 88:03:55:f4:9c:e8 changed bandwidth, new config is 2462 MHz, width 1 (2462/0 MHz)
[  404.851340] vboxdrv: ffffffffc0000020 VMMR0.r0
[  404.984658] vboxdrv: ffffffffc0102020 VBoxDDR0.r0
[  746.410756] hrtimer: interrupt took 9058 ns

The contents of /var/log/mcelog can be found in this pastebin.
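To avoid scanning the full log by eye, dmesg lines can be filtered by keyword. This is a minimal sketch, assuming the default dmesg line format of a bracketed seconds-since-boot timestamp followed by the message. The keyword list and function names are my own choices, and the sample is just a few lines copied from the excerpt above.

```python
import re

# Matches lines such as "[  300.462458] mce: [Hardware Error]: ..."
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

SAMPLE = """\
[   38.276956] Key type id_legacy registered
[  300.462458] mce: [Hardware Error]: Machine check events logged
[  311.521449] vboxdrv: ffffffffc0000020 VMMR0.r0
[  746.410756] hrtimer: interrupt took 9058 ns
"""


def parse_dmesg(text):
    """Return a list of (seconds_since_boot, message) tuples."""
    entries = []
    for line in text.splitlines():
        match = DMESG_LINE.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def find_matches(entries, keywords=("hardware error", "firmware")):
    """Keep entries whose message contains any keyword (case-insensitive)."""
    return [
        (ts, msg)
        for ts, msg in entries
        if any(keyword in msg.lower() for keyword in keywords)
    ]


for ts, msg in find_matches(parse_dmesg(SAMPLE)):
    print(f"{ts:>12.6f}  {msg}")
```

In practice the text would come from the saved dmesg output rather than the embedded sample.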


Edit

There are some suggestions that the issue might be hard-drive related, so let me provide some information on that.

The system is running on an encrypted SSD (the whole system, not just the home folder), which is probably why it shows up as /dev/mapper/ubuntu--vg-root rather than under /dev/sda. In case it helps, the full output of df -l is:

Filesystem                  1K-blocks      Used Available Use% Mounted on
udev                          4003752         0   4003752   0% /dev
tmpfs                          805328     10204    795124   2% /run
/dev/mapper/ubuntu--vg-root 235927440 214041380   9831944  96% /
tmpfs                         4026636       292   4026344   1% /dev/shm
tmpfs                            5120         4      5116   1% /run/lock
tmpfs                         4026636         0   4026636   0% /sys/fs/cgroup
/dev/loop2                      77952     77952         0 100% /snap/ubuntu-core/1357
/dev/loop0                      76800     76800         0 100% /snap/ubuntu-core/423
/dev/loop1                     131968    131968         0 100% /snap/arduino-mhall119/3
/dev/nvme0n1p2                 483946    136447    322514  30% /boot
/dev/nvme0n1p1                 523248      3676    519572   1% /boot/efi
tmpfs                          805324       140    805184   1% /run/user/1000
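As a quick sanity check on the table above, the Use% column can be reproduced from the Used and Available columns. The sketch below assumes df computes used / (used + available) and rounds up to the next whole percent. The helper name use_percent is made up for illustration, and the numbers are copied from three rows of the output. It confirms that the root volume is at 96%, which may or may not be relevant to the freezes.

```python
def use_percent(used_kb, available_kb):
    """Percentage used, rounded up, using integer arithmetic only."""
    total = used_kb + available_kb
    # Ceiling division without floating point: -(-a // b)
    return -(-100 * used_kb // total)


# (used, available) in 1K blocks, copied from the df -l output
rows = {
    "/": (214041380, 9831944),
    "/boot": (136447, 322514),
    "/boot/efi": (3676, 519572),
}

for mount_point, (used, available) in rows.items():
    print(f"{mount_point:<10} {use_percent(used, available):>3}%")
```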

To find some health information, I ran gsmartcontrol. The drive's Basic Health Check is "unknown", and the last lines of the output read: Read NVMe SMART/Health Information failed: NVMe Status 0x4002

I get the same output when running sudo smartctl -a /dev/nvme0n1:

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.8.0-34-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       THNSN5256GPUK NVMe TOSHIBA 256GB
Serial Number:                      X64S14LCT18T
Firmware Version:                   5KDA4101
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Jan 13 19:05:21 2017 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x001e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning  Comp. Temp. Threshold:     78 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.0120W       -        -    3  3  3  3     5000   25000
 4 -   0.0060W       -        -    4  4  4  4   100000   70000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x4002

I can't find any information on this status.
