Linux


How-To's

Useful Commands

Misc


# blkid
List connected partitions/devices and their UUID/PARTUUID
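
If you only need a single tag for scripting (e.g. a UUID for fstab or a boot entry), blkid can print just the value - a quick sketch, with the device name as a placeholder:

# blkid -s UUID -o value /dev/sdx1
Print only the UUID of one partition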


# lsblk
List connected block devices and how they are used


# cat -n /etc/fstab
Cat file w/ line numbers


# sed -n '22,24p;30p' /etc/fstab >> /boot/loader/entries/arch.ini
Extract selected lines of text from a file and append them to another file


# setenforce 0
Puts SELinux in permissive mode FOR THIS SESSION ONLY; edit /etc/selinux/config and set SELINUX=disabled to keep it off after a reboot
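
To make it persist, the config line can be flipped in place - a sketch only; back up the file first, and the path assumes a standard SELinux install:

# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
Set SELINUX=disabled so it stays off after a reboot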


smartctl

smartctl --all /dev/sdx
Check full SMART status of a specific drive


smartctl -c /dev/sdx
Show the drive's SMART capabilities, including estimated self-test run times


smartctl -t <short|long|conveyance|select> /dev/sdx
Start test (background)


smartctl -t <short|long|conveyance|select> -C /dev/sdx
Start test (foreground)


smartctl -a /dev/sdx
View results (all)


smartctl -l selftest /dev/sdx
Report only test results

smartctl -o off|on /dev/sdx
Turn offline auto tests off/on for a drive (requires a device target)
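
To check whether a drive currently has the automatic offline routine enabled, the capabilities output can be grepped - the exact wording may vary by drive/firmware:

smartctl -c /dev/sdx | grep -i 'auto offline'
Show whether Auto Offline Data Collection is Enabled/Disabled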


mdadm

For setting up an array, see Using mdadm to create and manage an array.
mdadm --detail /dev/mdX
Check full details of a specific md/array
NOTE: May show up as /dev/md/raidX
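
For a quick overview of every array at once (including resync/rebuild progress), the kernel's md status file also works:

cat /proc/mdstat
Show array state, member devices, and any resync progress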


targetcli

//TODO


Troubleshooting

kernel BUG at drivers/ata/sata_mv.c:2118! (Journal)

Last Updated :: 9/25/18 @ 2206


System :: My system is a Supermicro w/ 24 3.5" HD bays, ten of which house refurb WD RE4 2TB drives in a RAID10 via mdadm, plus an internal SanDisk SSD as the OS drive. (Guess where the VM for this site is stored.)


Background :: /dev/sdg has been giving me issues. During my first lockup, it was the only drive with an activity light. I initially overlooked this and thought I was having a mobo or proc issue, since the logs were complaining about a CPU core getting stuck. I rebooted the unit, and everything went back to normal for a couple of weeks. After some time it locked up again, and this time smartctl was complaining about Currently unreadable (pending) sectors and Offline uncorrectable sectors. The SMART tests were having trouble running, but I figured that was more of an OS issue. Later, attempting an Offline Short test would hang for about an hour before failing on a Read Failure. Usually I'm a big fan of Occam's Razor, but for some reason this time it eluded me - I assumed the drive issues were caused by the CPU hangs. As time went on, my targetcli config would also get trashed after a reboot, and I would need to restore the config before my Failover Cluster could come back up.


Hypothesis :: smartctl's offline auto tests were causing the server to lock up.


Process :: I removed /dev/sdg from the mdadm array, but the server would still lock up. This continued for a while, and the system logs kept complaining about /dev/sdg's bad sectors - about every half hour. I even cleared the metadata off the drive itself with fdisk, hoping the system would forget about it, but to no avail - my thinking being that if the drive isn't in use, it can't hang the system. However, as this continued, I wondered why smartctl was still complaining about the drive (or drives at this point, as I fear /dev/sdg may have affected its paired drive). Apparently, by default, smartctl runs offline auto tests constantly on drives. From what I've read (need citation), this process is considered obsolete, but it is still an actively used concept.
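
For reference, a sketch of how that removal and metadata wipe can be done with mdadm itself (/dev/mdX is a placeholder for the actual array device):

mdadm /dev/mdX --fail /dev/sdg --remove /dev/sdg
Mark the drive as failed and pull it from the array
mdadm --zero-superblock /dev/sdg
Wipe the md superblock so the system no longer sees the drive as an array member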


Current Test :: smartctl -o off /dev/sdX
I turned off the offline auto tests on every single drive, and the system has now been running for a few hours without any hiccups (and faster as well, even with the array in a resync).
Yes, I also removed the failed drive.
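
A sketch of applying that to every drive in one pass (adjust the glob to match your device naming):

for d in /dev/sd?; do smartctl -o off "$d"; done
Disable the automatic offline test routine on each sd* device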


Notes :: I haven't read the source for the sata_mv driver yet, but if this continues, that is the next step.