Monochromatic

z3bra, the stripes apart

The wrong sysadmin

28 April, 2015

NOTE: This was replicated from the Unix Diary thread at http://nixers.net

Dear Unix diary,

today I've been a bad sysadmin. It just happened. I host my own git repository, and earlier this evening I was working on my CRUX ports tree when I decided to commit and push my work. But this time something went wrong, and git wouldn't let me push any reference. Amongst all the messages returned by git, I saw this one:

remote: fatal: write error: No space left on device

Fucking shit. I instantly imagined what was happening: my /var partition wasn't correctly sized upon creation. This is where I host my website, gopherhole, git repo, pictures, videos, ... every 'production' service. And after serving me well for several years, it was now full.

Fortunately, I had set up all my partitions on top of LVM, and left something like 200GiB unallocated, just in case things went wrong. And they did.
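
If you're wondering what that setup looks like, it's nothing fancy: you just don't hand out the whole volume group when creating the logical volumes. A rough sketch (the device name is made up for the example, and the sizes are from memory):

root ~# pvcreate /dev/sda2
root ~# vgcreate vg0 /dev/sda2
root ~# lvcreate -L 50G -n var vg0     # size for today's needs, not forever
root ~# lvcreate -L 100G -n home vg0
root ~# vgs                            # whatever isn't allocated stays free, for days like this one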

So here I am, staring at my red prompt, typing a few commands:

root ~# df -h
Filesystem                Size      Used Available Use% Mounted on
mdev                      1.0M         0      1.0M   0% /dev
shm                     499.4M         0    499.4M   0% /dev/shm
/dev/dm-1                 4.0G    797.9M      3.2G  20% /
tmpfs                    99.9M    208.0K     99.7M   0% /run
cgroup_root              10.0M         0     10.0M   0% /sys/fs/cgroup
/dev/sda1                96.8M     14.5M     77.3M  16% /boot
/dev/mapper/vg0-var      50.0G     50.0G     20.0K 100% /var
/dev/mapper/vg0-home    100.0G     12.9G     85.2G  13% /home
/dev/mapper/vg0-data    600.0G    346.7G    252.1G  58% /data
tmpfs                   499.4M         0    499.4M   0% /tmp
tmpfs                   499.4M     32.4M    467.0M   6% /home/z3bra/tmp
/dev/mapper/vg0-data    600.0G    346.7G    252.1G  58% /var/lib/mpd/music

root ~# mount | grep /var
/dev/mapper/vg0-var on /var type xfs (rw,relatime,attr2,inode64,noquota)

root ~# lvs
  LV   VG   Attr       LSize
  data vg0  -wi-ao---- 600.00g
  home vg0  -wi-ao---- 100.00g
  root vg0  -wi-ao----   4.00g
  swap vg0  -wi-ao----   1.00g
  var  vg0  -wi-ao----  50.00g

root ~# vgs
  VG   #PV #LV #SN Attr   VSize   VFree
  vg0    1   5   0 wz--n- 931.41g 176.41g

Ok, so this isn't the first time it has happened, remember? You already grew your /home partition, and it went well! Just do the same with /var! It works without a reboot!

What were those commands again?

root ~# lvextend -L +20G vg0/var
  Extending logical volume var to 70.00 GiB
  63e74d07f000-63e74d2c1000 r-xp 00000000 fd:01 8430401                    /lib/libdevmapper.so.1.02: mlock failed: Out of memory
  63e74d2c6000-63e74d4cb000 r-xp 00000000 fd:01 8430404                    /lib/libdevmapper-event.so.1.02: mlock failed: Out of memory
  Logical volume var successfully resized
  Internal error: Reserved memory (9064448) not enough: used 9084928. Increase activation/reserved_memory?
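
Those mlock and "reserved memory" warnings looked scary, but the resize went through anyway. If I read them right, they just mean lvm's activation/reserved_memory setting in /etc/lvm/lvm.conf is a bit too small for my setup; something like this should quiet them, though I haven't tried it yet:

# /etc/lvm/lvm.conf -- untested guess on my part
activation {
    # the default is 8192 (KiB); give the tools a bit more room to mlock
    reserved_memory = 16384
}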

root ~# lvs
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data vg0  -wi-ao---- 600.00g
  home vg0  -wi-ao---- 100.00g
  root vg0  -wi-ao----   4.00g
  swap vg0  -wi-ao----   1.00g
  var  vg0  -wi-ao----  70.00g

root ~# xfs_growfs -d /var
meta-data=/dev/mapper/vg0-var    isize=256    agcount=4, agsize=3276800 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0
data     =                       bsize=4096   blocks=13107200, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=6400, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 13107200 to 18350080
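
Note for next time: if I remember right, lvextend can do both steps in one go with its -r (--resizefs) flag, which calls the filesystem-specific grow tool for you. Something along the lines of:

root ~# lvextend -r -L +20G vg0/var

Two separate commands work fine too, it's just one more thing to forget.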

root ~# df -h
Filesystem                Size      Used Available Use% Mounted on
mdev                      1.0M         0      1.0M   0% /dev
shm                     499.4M         0    499.4M   0% /dev/shm
/dev/dm-1                 4.0G    797.9M      3.2G  20% /
tmpfs                    99.9M    208.0K     99.7M   0% /run
cgroup_root              10.0M         0     10.0M   0% /sys/fs/cgroup
/dev/sda1                96.8M     14.5M     77.3M  16% /boot
/dev/mapper/vg0-var      70.0G     50.0G     20.0G  71% /var
/dev/mapper/vg0-home    100.0G     12.9G     85.2G  13% /home
/dev/mapper/vg0-data    600.0G    346.7G    252.1G  58% /data
tmpfs                   499.4M         0    499.4M   0% /tmp
tmpfs                   499.4M     32.4M    467.0M   6% /home/z3bra/tmp
/dev/mapper/vg0-data    600.0G    346.7G    252.1G  58% /var/lib/mpd/music

Phew... I'm safe now! So what the hell was going on? I decided to investigate a bit further, to see what I should watch out for next time. That's how I realised I had made a HUGE mistake...

root ~# cd /var/
root var# du -sh *
48.5G   backup
156.7M  cache
0       db
0       empty
228.8M  git
5.7M    gopher
4.5G    lib
0       local
0       lock
7.9M    log
0       mail
0       run
40.0K   spool
0       tmp
1.1G    www

root var# cd backup/

root backup# du -sh *
12.0K   bin
20.0K   etc
48.5G   out
20.0K   usr
84.0K   var

root backup# mountpoint out
out is not a mountpoint

root backup# cd out/

root out# ll
total 50841516
drwxr-sr-x    2 backup   users       4.0K Apr 28 02:11 ./
drwxr-sr-x    8 backup   users       4.0K Feb  2 20:24 ../
-rw-r--r--    1 backup   users       5.3G Apr 25 07:43 data
-rw-r--r--    1 backup   users          0 Apr 25 07:43 data.0.BAK
-rw-r--r--    1 backup   users      12.0G Apr 26 04:37 homedir
-rw-r--r--    1 backup   users      12.0G Apr 22 04:43 homedir.0.BAK
-rw-r--r--    1 backup   users      12.0G Apr 25 05:00 homedir.1.BAK
-rw-r--r--    1 backup   users      44.0K Apr 26 04:42 homedir.2.BAK
-rw-r--r--    1 backup   users       1.2G Apr 28 02:11 production
-rw-r--r--    1 backup   users       1.2G Apr 21 02:10 production.0.BAK
-rw-r--r--    1 backup   users       1.2G Apr 22 02:11 production.1.BAK
-rw-r--r--    1 backup   users       1.2G Apr 23 02:11 production.2.BAK
-rw-r--r--    1 backup   users       1.2G Apr 24 02:11 production.3.BAK
-rw-r--r--    1 backup   users       1.2G Apr 25 02:12 production.4.BAK
-rw-r--r--    1 backup   users          0 Apr 26 02:11 production.5.BAK
-rw-r--r--    1 backup   users       5.3M Apr 27 02:12 production.6.BAK
-rw-r--r--    1 backup   users          0 Apr 28 02:11 production.7.BAK

My backup system doesn't check whether it saves to a mountpoint or not. Shit. For a whole week, all my backups were created in my /var partition instead of on the backup USB drive meant for this purpose. And it filled it up pretty quickly.
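
The fix on the backup side is a one-liner. Something like this at the top of the script would have saved me (the target directory is the real one, the rest is just a sketch):

#!/bin/sh
# refuse to run if the backup drive isn't actually mounted there
DEST=/var/backup/out
if ! mountpoint -q "$DEST"; then
    echo "$DEST is not a mountpoint, aborting backup" >&2
    exit 1
fi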

My backup system sends me a mail after each backup, explaining how it went. Whether or not it saved to a mountpoint is written right there in it. I just stopped checking. Silly me.

I realise that this issue could have been easily solved by mounting my backup disk elsewhere, moving the files, and then remounting it where it should be. But I didn't. Instead, I grew a partition that didn't need to be grown (the backups filled 48GiB out of the 50GiB allocated to /var), and that partition can't be shrunk anymore, as it's an XFS filesystem.
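
For the record, the sane fix would have looked something like this (the device name is a guess, I'd have to check what the USB drive shows up as):

root ~# mount /dev/sdb1 /mnt                  # the backup USB drive, temporarily
root ~# mv /var/backup/out/* /mnt/            # free up /var
root ~# umount /mnt
root ~# mount /dev/sdb1 /var/backup/out       # back where it belongs

Instead, I burnt 20GiB of spare extents for nothing.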

So today I learnt two things, the hard way:

  1. Don't do anything until you know what's going on
  2. Configure system checks and READ THEM (see the sketch below)
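
On the second point, here is the kind of check I mean: a minimal sketch, assuming local mail delivery works, meant to run daily from cron (the 90% threshold is arbitrary):

#!/bin/sh
# only sends a mail when a filesystem crosses the threshold,
# so when one does arrive, it's actually worth reading
threshold=90
full=$(df -P | awk -v t="$threshold" 'NR > 1 && $5+0 >= t')
[ -n "$full" ] && printf '%s\n' "$full" | mail -s "disk almost full on $(hostname)" root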

I hope you'll learn from my mistakes. For now I think I'll just keep this printed over my desktop, as a reminder:

root ~# df -h /var/
Filesystem                Size      Used Available Use% Mounted on
/dev/mapper/vg0-var      70.0G      1.5G     68.5G   2% /var