The wrong sysadmin
28 April, 2015
NOTE: This was replicated from the Unix Diary thread at http://nixers.net
Dear Unix diary,
today I've been a bad sysadmin. It just happened. I host my own git repository, and earlier this evening I was working on my CRUX port tree when I decided to commit and push my work. But this time something went wrong, and git wouldn't let me push any reference. Amongst all the messages git returned, I saw this one:
remote: fatal: write error: No space left on device
Fucking shit. I instantly imagined what was happening: my /var partition wasn't correctly sized upon creation. This is where I host my website, gopherhole, git repo, pictures, videos, ... every 'production' service. And after serving me well for several years, it was now full.
Fortunately, I had set up all my partitions on top of LVM, and left around 200GiB unallocated, just in case things went wrong. And they did.
So here I am, staring at my red prompt, typing a few commands:
root ~# df -h
Filesystem Size Used Available Use% Mounted on
mdev 1.0M 0 1.0M 0% /dev
shm 499.4M 0 499.4M 0% /dev/shm
/dev/dm-1 4.0G 797.9M 3.2G 20% /
tmpfs 99.9M 208.0K 99.7M 0% /run
cgroup_root 10.0M 0 10.0M 0% /sys/fs/cgroup
/dev/sda1 96.8M 14.5M 77.3M 16% /boot
/dev/mapper/vg0-var 50.0G 50.0G 20.0K 100% /var
/dev/mapper/vg0-home 100.0G 12.9G 85.2G 13% /home
/dev/mapper/vg0-data 600.0G 346.7G 252.1G 58% /data
tmpfs 499.4M 0 499.4M 0% /tmp
tmpfs 499.4M 32.4M 467.0M 6% /home/z3bra/tmp
/dev/mapper/vg0-data 600.0G 346.7G 252.1G 58% /var/lib/mpd/music
root ~# mount | grep /var
/dev/mapper/vg0-var on /var type xfs (rw,relatime,attr2,inode64,noquota)
root ~# lvs
LV VG Attr LSize
data vg0 -wi-ao---- 600.00g
home vg0 -wi-ao---- 100.00g
root vg0 -wi-ao---- 4.00g
swap vg0 -wi-ao---- 1.00g
var vg0 -wi-ao---- 50.00g
root ~# vgs
VG #PV #LV #SN Attr VSize VFree
vg0 1 5 0 wz--n- 931.41g 176.41g
Ok, so it's not the first time this has happened, remember? You already grew your /home partition, and it went well! Just do the same with /var! It works without a reboot!
What were those commands again?
root ~# lvextend -L +20G vg0/var
Extending logical volume var to 70.00 GiB
63e74d07f000-63e74d2c1000 r-xp 00000000 fd:01 8430401 /lib/libdevmapper.so.1.02: mlock failed: Out of memory
63e74d2c6000-63e74d4cb000 r-xp 00000000 fd:01 8430404 /lib/libdevmapper-event.so.1.02: mlock failed: Out of memory
Logical volume var successfully resized
Internal error: Reserved memory (9064448) not enough: used 9084928. Increase activation/reserved_memory?
root ~# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data vg0 -wi-ao---- 600.00g
home vg0 -wi-ao---- 100.00g
root vg0 -wi-ao---- 4.00g
swap vg0 -wi-ao---- 1.00g
var vg0 -wi-ao---- 70.00g
root ~# xfs_growfs -d /var
meta-data=/dev/mapper/vg0-var isize=256 agcount=4, agsize=3276800 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0
data = bsize=4096 blocks=13107200, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal bsize=4096 blocks=6400, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
data blocks changed from 13107200 to 18350080
root ~# df -h
Filesystem Size Used Available Use% Mounted on
mdev 1.0M 0 1.0M 0% /dev
shm 499.4M 0 499.4M 0% /dev/shm
/dev/dm-1 4.0G 797.9M 3.2G 20% /
tmpfs 99.9M 208.0K 99.7M 0% /run
cgroup_root 10.0M 0 10.0M 0% /sys/fs/cgroup
/dev/sda1 96.8M 14.5M 77.3M 16% /boot
/dev/mapper/vg0-var 70.0G 50.0G 20.0G 71% /var
/dev/mapper/vg0-home 100.0G 12.9G 85.2G 13% /home
/dev/mapper/vg0-data 600.0G 346.7G 252.1G 58% /data
tmpfs 499.4M 0 499.4M 0% /tmp
tmpfs 499.4M 32.4M 467.0M 6% /home/z3bra/tmp
/dev/mapper/vg0-data 600.0G 346.7G 252.1G 58% /var/lib/mpd/music
Phew... I'm safe now! So what the hell was going on? I decided to investigate a bit further, to see what I should watch for next time. That's how I realised that I had made a HUGE mistake...
root ~# cd /var/
root var# du -sh *
48.5G backup
156.7M cache
0 db
0 empty
228.8M git
5.7M gopher
4.5G lib
0 local
0 lock
7.9M log
0 mail
0 run
40.0K spool
0 tmp
1.1G www
root var# cd backup/
root backup# du -sh *
12.0K bin
20.0K etc
48.5G out
20.0K usr
84.0K var
root backup# mountpoint out
out is not a mountpoint
root backup# cd out/
root out# ll
total 50841516
drwxr-sr-x 2 backup users 4.0K Apr 28 02:11 ./
drwxr-sr-x 8 backup users 4.0K Feb 2 20:24 ../
-rw-r--r-- 1 backup users 5.3G Apr 25 07:43 data
-rw-r--r-- 1 backup users 0 Apr 25 07:43 data.0.BAK
-rw-r--r-- 1 backup users 12.0G Apr 26 04:37 homedir
-rw-r--r-- 1 backup users 12.0G Apr 22 04:43 homedir.0.BAK
-rw-r--r-- 1 backup users 12.0G Apr 25 05:00 homedir.1.BAK
-rw-r--r-- 1 backup users 44.0K Apr 26 04:42 homedir.2.BAK
-rw-r--r-- 1 backup users 1.2G Apr 28 02:11 production
-rw-r--r-- 1 backup users 1.2G Apr 21 02:10 production.0.BAK
-rw-r--r-- 1 backup users 1.2G Apr 22 02:11 production.1.BAK
-rw-r--r-- 1 backup users 1.2G Apr 23 02:11 production.2.BAK
-rw-r--r-- 1 backup users 1.2G Apr 24 02:11 production.3.BAK
-rw-r--r-- 1 backup users 1.2G Apr 25 02:12 production.4.BAK
-rw-r--r-- 1 backup users 0 Apr 26 02:11 production.5.BAK
-rw-r--r-- 1 backup users 5.3M Apr 27 02:12 production.6.BAK
-rw-r--r-- 1 backup users 0 Apr 28 02:11 production.7.BAK
My backup system doesn't check whether it is saving to a mountpoint or not. Shit. For a whole week, all my backups were created in my /var partition instead of the backup USB drive meant for this purpose. And they filled it up pretty quickly.
My backup system sends me a mail after each backup, explaining how it went. Whether it saved to a mountpoint or not is written in there. I just stopped reading them. Silly me.
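A simple guard at the top of the backup script would have caught this. Here is a minimal sketch of the kind of check that was missing; the function name and error message are my own invention, not my actual script:

```shell
# Refuse to run a backup unless the destination directory is actually
# a mounted filesystem. mountpoint(1) ships with util-linux.
backup_guard() {
    dest=$1
    if ! mountpoint -q "$dest"; then
        echo "ERROR: $dest is not a mountpoint, refusing to back up" >&2
        return 1
    fi
}
```

Called before writing anything, this turns a silent week of misplaced backups into a loud, immediate failure.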
I realise that this issue could have been easily solved by mounting my backup disk elsewhere, moving the files, then remounting it where it should be. But I didn't. Instead, I grew a partition that didn't need it (the backups filled 48GiB out of the 50GiB allocated to /var), and this partition can't be shrunk anymore, as XFS filesystems can only be grown.
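For the record, the recovery I should have done looks roughly like this. The device name and mountpoints are hypothetical stand-ins, since I haven't shown my actual backup drive above, and all of it needs root:

```shell
# Sketch of the fix I skipped: relocate the stranded backups instead of
# growing /var. /dev/sdb1 and the paths are illustrative assumptions.
recover_backups() {
    mount /dev/sdb1 /mnt/usb &&       # mount the backup drive somewhere temporary
    mv /var/backup/out/* /mnt/usb/ && # move the misplaced backups onto it
    umount /mnt/usb &&                # detach the temporary mount
    mount /dev/sdb1 /var/backup/out   # remount it where the backup job expects it
}
```

No lvextend, no irreversible XFS growth, and /var drops back to a sane usage on its own.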
So today I learnt two things, the hard way:
- Don't do anything until you know what's going on
- Configure system checks and READ THEM
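The second lesson is the easier one to automate. A minimal sketch of such a check, meant to run from cron and mail its output; the 90% default threshold is an illustrative choice, not a recommendation:

```shell
# Warn about any filesystem whose usage crosses a threshold (in percent).
check_usage() {
    limit=${1:-90}
    df -P | awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                 # strip the % sign from the Use% column
        if (use + 0 >= limit)
            printf "WARNING: %s is %s%% full (%s)\n", $6, use, $1
    }'
}
```

Run daily, `check_usage 90` would have flagged /var days before git refused to push, provided I actually read the mail.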
I hope you'll learn from my mistakes. For now I think I'll just print this over my desktop, as a reminder:
root ~# df -h /var/
Filesystem Size Used Available Use% Mounted on
/dev/mapper/vg0-var 70.0G 1.5G 68.5G 2% /var