A full /var partition destroyed 3 hours of my life!

Borden Rhodes-2
I tried booting up into Debian and got all sorts of systemd breakages
apparently because my /var partition was full. That's fair, but the
pain started when Debian frustrated any attempt to free up space. I'm
wondering if this is a 'feature' that needs removing or if there might
be a bug in the underlying filesystem. I really don't want any more
finite hours of my life (or anyone else's) lost to this problem if I
can find the cause.

One of the culprits in my full /var partition was a 3 gig syslog file
which has only been getting bigger since January despite running
logrotate -f. I try to run it this time but I'm told that it can't
rotate anything because there's no space left on the device. OK, Plan
B. Another thing that the interwebs say is to run apt-get clean to
sweep out downloaded packages, of which I collected hundreds of
megabytes. Again, this command failed because there was no space left
on the device.

Since there's almost no documentation as to what can be safely rm'd in
/var without breaking your system, I decide the least risky choice is
to sudo rm -rf the offending 3-gig syslog file from single-user mode
and the systemd debug shell. But *THIS* command failed because there
was 'no space left on the device'. Is this right? Does rm need space
on a drive to free other space? If so, how on earth can you fix a full
partition if you can't remove anything from it?!

Since Debian can't delete files from its own partitions, I have to
boot from an Ubuntu DVD. I'm able to rm -rf the syslog file from that,
but when I reboot into Debian, I get the same 'no space left on
device' errors. That's weird, so I df -h to figure out what's going on
and df correctly reports a 5G var partition, of which under 3G are now
used and avail space is 0G. Whoa, wait, what?!?! How can 5G - 3G =
0G?!

I start blindly casting whatever btrfs spells I can find on the
Internet to fix 'no space left on device' errors. One of them
eventually works and df -h correctly reports the free space in my /var
partition and Debian boots normally again.

My question, therefore, is whether this is a btrfs bug that got
triggered by the full /var partition or whether Debian is designed to
break irrecoverably when /var fills up. Any ideas of what happened?

Re: A full /var partition destroyed 3 hours of my life!

Michael Biebl-3
[rsyslog maintainer speaking here]

Am 15.11.2016 um 06:00 schrieb Borden Rhodes:
> One of the culprits in my full /var partition was a 3 gig syslog file
> which has only been getting bigger since January despite running
> logrotate -f. I try to run it this time but I'm told that it can't

I'd be interested to find out why log rotation was not done
automatically. Do you have cron installed and running?
Do you have /etc/cron.daily/logrotate which works when executed and a
corresponding /etc/logrotate.d/rsyslog?

Any idea why logrotate was not run or failed to do its job?
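These questions can be answered with a few commands. A hypothetical checklist, assuming a systemd system where the cron daemon's unit is named 'cron' (Debian's name) and that logrotate keeps its state in /var/lib/logrotate/status (the usual Debian location); both are assumptions to verify on your own machine.

```shell
#!/bin/sh
# Hypothetical checklist: is anything actually running the daily logrotate job?
check_logrotate_schedule() {
    # Is the cron daemon running? Prints "active" or "inactive".
    systemctl is-active cron

    # Do the daily job and the rsyslog stanza exist, and is the job executable?
    ls -l /etc/cron.daily/logrotate /etc/logrotate.d/rsyslog

    # When did logrotate last rotate syslog? (assumed Debian state file path)
    grep syslog /var/lib/logrotate/status \
        || echo "no syslog entry found in the logrotate state file"
}
```

Run it as root with 'check_logrotate_schedule'. A syslog entry in the state file dated back in January would point at the job not running at all rather than at the rsyslog stanza.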

> My question, therefore, is whether this is a btrfs bug that got
> triggered by the full /var partition or whether Debian is designed to
> break irrecoverably when /var fills up. Any ideas of what happened?
>

That sounds like a btrfs issue. Which kernel is that?
I do remember btrfs having problems when the disk runs full.

--
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?


Re: A full /var partition destroyed 3 hours of my life!

Peter Ludikovsky
In reply to this post by Borden Rhodes-2
Am 15.11.2016 um 06:00 schrieb Borden Rhodes:
> I start blindly casting whatever btrfs spells I can find on the
> Internet to fix 'no space left on device' errors. One of them
> eventually works and df -h correctly reports the free space in my /var
> partition and Debian boots normally again.
>
> My question, therefore, is whether this is a btrfs bug that got
> triggered by the full /var partition or whether Debian is designed to
> break irrecoverably when /var fills up. Any ideas of what happened?
>

Does anything on the Debian Wiki on Btrfs [1] seem familiar? Other than
that I can only guess, but maybe check the SMART information of your
disk(s) for excessive errors, as it _could_ be that defective sectors
prevent Btrfs from doing its COW magic.


[1] https://wiki.debian.org/Btrfs#WARNINGS


btrfs filesystem full problems (was Re: A full /var partition destroyed 3 hours of my life!)

Jonathan Dowland
In reply to this post by Borden Rhodes-2
On Tue, Nov 15, 2016 at 12:00:37AM -0500, Borden Rhodes wrote:
> I tried booting up into Debian and got all sorts of systemd breakages
> apparently because my /var partition was full.
...
> I start blindly casting whatever btrfs spells...

Aha! btrfs!

> My question, therefore, is whether this is a btrfs bug that got
> triggered by the full /var partition or whether Debian is designed to
> break irrecoverably when /var fills up. Any ideas of what happened?

It sounds like btrfs specific behaviour. It would be interesting to know
what kernel version and btrfs version you were using, if only to confirm
my suspicion that even the versions in Debian are not suitable for use in
production.

I'm going to guess that it was a series of 'btrfs balance' commands that
fixed things for you.

--
Jonathan Dowland
Please do not CC me, I am subscribed to the list.

Re: A full /var partition destroyed 3 hours of my life!

Mart van de Wege
In reply to this post by Borden Rhodes-2
Borden Rhodes <[hidden email]> writes:

> Since there's almost no documentation as to what can be safely rm'd in
> /var without breaking your system, I decide the least risky choice is
> to sudo rm -rf the offending 3-gig syslog file from single-user mode
> and the systemd debug shell. But *THIS* command failed because there
> was 'no space left on the device'. Is this right? Does rm need space
> on a drive to free other space?

Yes, this is right. The problem is not 'rm', the problem is that you use
sudo without understanding why it is set up like that: sudo logs the
command it executes to /var/log/auth.log


--
"We will need a longer wall when the revolution comes."
    --- AJS, quoting an uncertain source.

Re: A full /var partition destroyed 3 hours of my life!

Eduardo M KALINOWSKI-4
In reply to this post by Borden Rhodes-2
On 15-11-2016 03:00, Borden Rhodes wrote:
> My question, therefore, is whether this is a btrfs bug that got
> triggered by the full /var partition or whether Debian is designed to
> break irrecoverably when /var fills up. Any ideas of what happened?

First, you mentioned the crucial bit of information (that it's a btrfs
filesystem) only at the end. Also, you've left out important things such
as your running kernel and (to a lesser extent) version of btrfs-tools
(or btrfs-progs in newer systems).

As others have pointed out, btrfs is the culprit. Take a look at these links
to try to understand what might have happened:
https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_get_.22No_space_left_on_device.22_errors.2C_but_df_says_I.27ve_got_lots_of_space
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
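To see the mismatch those pages describe, it can help to put df's figure next to btrfs' own accounting. This is a hypothetical helper, not something posted in the thread; it assumes the affected filesystem is mounted at /var and is run as root. 'btrfs filesystem df' lists the space allocated to data and metadata chunks ("total") against the space actually used inside them, and 'btrfs filesystem usage' (newer btrfs-progs) also shows how much raw space is still unallocated. If nothing is unallocated and metadata is nearly full, that would be consistent with the 'no space left on device' errors above, though that reading is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: compare df's view of a btrfs mount point with
# btrfs' own chunk-level accounting. Run as root.
show_space() {
    mnt="${1:-/var}"   # assumed mount point of the "full" filesystem

    # Generic totals; on btrfs "Avail" can read 0 while "Used" looks low.
    df -h "$mnt"

    # Data vs. metadata: space allocated to chunks ("total") vs. used inside them.
    btrfs filesystem df "$mnt"

    # Overall picture, including "Device unallocated" (newer btrfs-progs only).
    btrfs filesystem usage "$mnt"
}
```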


--
Love is being stupid together.
                -- Paul Valery

Eduardo M KALINOWSKI
[hidden email]

Re: A full /var partition destroyed 3 hours of my life!

Borden Rhodes-2
In reply to this post by Borden Rhodes-2
> [rsyslog maintainer speaking here]
>
> Am 15.11.2016 um 06:00 schrieb Borden Rhodes:
>> One of the culprits in my full /var partition was a 3 gig syslog file
>> which has only been getting bigger since January despite running
>> logrotate -f. I try to run it this time but I'm told that it can't
>
> I'd be interested to find out why log rotation was not done
> automatically. Do you have cron installed and running?
> Do you have /etc/cron.daily/logrotate which works when executed and a
> corresponding /etc/logrotate.d/rsyslog?
>
> Any idea why logrotate was not run or failed to do its job?

Here's the contents of /etc/cron.daily/logrotate:
#!/bin/sh

test -x /usr/sbin/logrotate || exit 0
/usr/sbin/logrotate /etc/logrotate.conf

and /etc/logrotate.d/rsyslog:
/var/log/syslog
{
        rotate 7
        daily
        missingok
        notifempty
        delaycompress
        compress
        postrotate
                invoke-rc.d rsyslog rotate > /dev/null
        endscript
}

/var/log/mail.info
/var/log/mail.warn
/var/log/mail.err
/var/log/mail.log
/var/log/daemon.log
/var/log/kern.log
/var/log/auth.log
/var/log/user.log
/var/log/lpr.log
/var/log/cron.log
/var/log/debug
/var/log/messages
{
        rotate 4
        weekly
        missingok
        notifempty
        compress
        delaycompress
        sharedscripts
        postrotate
                invoke-rc.d rsyslog rotate > /dev/null
        endscript
}

Both looked normal to me and, without knowing more about the structure
of logrotate config files, I didn't dig further. When I run logrotate -f,
it finishes without complaining, but syslog doesn't seem to
get smaller. I think it just kept getting bigger.
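One way to see why forcing the rotation did not shrink syslog is to ask logrotate to explain itself. A sketch, assuming the stanza lives in /etc/logrotate.d/rsyslog as posted above; '-d' is logrotate's debug mode, which reports what it would do without changing the logs or its state file. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical debugging step for a syslog that never shrinks.
debug_syslog_rotation() {
    conf="${1:-/etc/logrotate.d/rsyslog}"

    # -d: debug mode, no changes to the logs or the state file;
    # -f: force rotation, as in the original attempt.
    # 2>&1 merges the debug messages into stdout so they can be paged.
    logrotate -d -f "$conf" 2>&1

    # With the posted config (delaycompress), a working rotation leaves
    # syslog.1, syslog.2.gz, ... next to syslog.
    ls -l /var/log/syslog*
}
```

Run as root, e.g. 'debug_syslog_rotation | less'. If syslog.1 exists and is the file that keeps growing, rsyslog was probably never told to reopen its log after rotation.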

>> My question, therefore, is whether this is a btrfs bug that got
>> triggered by the full /var partition or whether Debian is designed to
>> break irrecoverably when /var fills up. Any ideas of what happened?
>>
>
> That sounds like a btrfs issue. Which kernel is that?
> I do remember btrfs having problems when the disk runs full.

I'm running a 4.8.0-1-amd64 kernel. I'm on the testing branch. It
makes me feel better knowing that it may be a btrfs bug (or at least
not part of the Linux design) since that's a rough edge I can (try to)
work around by checking /var every so often. Still, "A Cowboy's Guide
to Cleaning /var and /tmp" would help in cases where some process gets
greedy with space.
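Checking /var every so often can be automated, for example from a cron job. A minimal sketch; the function names and the 90% threshold are arbitrary assumptions. Since df's numbers are not always reliable on btrfs (as this thread shows), a clean result from it is a hint, not a guarantee, and 'btrfs filesystem usage' is worth a look as well.

```shell
#!/bin/sh
# Print the "Capacity" column of POSIX df output for a mount point, without "%".
pct_used() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn on stderr and return 1 once usage reaches the threshold (default 90%).
check_space() {
    mnt="${1:-/var}"
    limit="${2:-90}"
    used=$(pct_used "$mnt")
    [ -n "$used" ] || return 2   # df failed or printed nothing usable
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $mnt is ${used}% full" >&2
        return 1
    fi
    echo "$mnt is ${used}% full"
}
```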

>> My question, therefore, is whether this is a btrfs bug that got
>> triggered by the full /var partition or whether Debian is designed to
>> break irrecoverably when /var fills up. Any ideas of what happened?
>>
>
> Does anything on the Debian Wiki on Btrfs [1] seem familiar? Other than
> that I can only guess, but maybe check the SMART information of your
> disk(s) for excessive errors, as it _could_ be that defective sectors
> prevent Btrfs from doing its COW magic.

I don't think it's that, unless smartctl is lying to me. It passes all
of the tests and the only historical failure (which I think has almost
always been there) is an airflow warning. Error logs are empty. If I
start getting strange behaviour, I can do a more comprehensive SMART
scan.
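A more comprehensive SMART scan would usually mean running the drive's extended self-test. A sketch using smartctl from smartmontools; /dev/sda is an assumed device name (substitute the disk underneath the LVM volume group), and the function names are hypothetical.

```shell
#!/bin/sh
# Read-only SMART summary for one disk.
smart_report() {
    dev="${1:-/dev/sda}"
    smartctl -H "$dev"            # overall health self-assessment
    smartctl -l error "$dev"      # the drive's error log
    smartctl -l selftest "$dev"   # results of previous self-tests
}

# Start an extended self-test; it runs on the drive itself, and the result
# appears later in "smartctl -l selftest".
start_long_test() {
    smartctl -t long "${1:-/dev/sda}"
}
```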

> [1] https://wiki.debian.org/Btrfs#WARNINGS

Nothing seems on point here. My configuration is btrfs partitions
within an LVM within an MBR hard drive. I'm not doing any fancy RAID
or anything.

Thank you for the hints!

RE: btrfs filesystem full problems (was Re: A full /var partition destroyed 3 hours of my life!)

Borden Rhodes-2
In reply to this post by Jonathan Dowland
> It sounds like btrfs specific behaviour. It would be interesting to know
> what kernel version and btrfs version you were using, if only to confirm
> my suspicion that even the versions in Debian are not suitable for use in
> production.
>
> I'm going to guess that it was a series of 'btrfs balance' commands that
> fixed things for you.

Correct you are! The various incantations used different filters and
one of them worked. I have no idea what filters are and I would die a
happy man without needing to know.

I use Debian testing, so it's whatever kernel and btrfs packages that
were in that as of yesterday.

> Yes, this is right. The problem is not 'rm', the problem is that you use
> sudo without understanding why it is set up like that: sudo logs the
> command it executes to /var/log/auth.log

Makes sense. So why did I get the exact same problem when I enabled
the debug-shell? Unless it's also lying to me, doesn't it boot into a
proper root shell?

Re: btrfs filesystem full problems (was Re: A full /var partition destroyed 3 hours of my life!)

Jonathan Dowland
On Tue, Nov 15, 2016 at 02:25:55PM -0500, Borden Rhodes wrote:
> Correct you are! The various incantations used different filters and
> one of them worked. I have no idea what filters are and I would die a
> happy man without needing to know.

Last time I used 'btrfs balance', I had to run it with increasing (or
decreasing) values for the -fi argument (iirc), as the initial values freed up
just enough working space for the subsequent values to work within, and the
subsequent values would not work when I first tried them. If that makes sense.
I was left with a very sour taste in my mouth.
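The "-fi" above is from memory; the filters most likely meant are the usage filters, '-dusage=N' for data chunks (and '-musage=N' for metadata), which restrict a balance to chunks that are at most N percent full. A minimal sketch of the "increasing values" approach, assuming the filesystem is mounted at /var; the function name and the list of thresholds are arbitrary assumptions, not values from the thread.

```shell
#!/bin/sh
# Incremental balance: start with a low usage threshold so the first pass
# only touches (nearly) empty data chunks, then raise it step by step.
incremental_balance() {
    mnt="${1:-/var}"
    for pct in 0 5 10 20 40 60; do
        echo "balancing $mnt: data chunks at most ${pct}% used"
        if ! btrfs balance start -dusage="$pct" "$mnt"; then
            echo "balance at ${pct}% failed; stopping" >&2
            return 1
        fi
    done
}
```

Run as root, e.g. 'incremental_balance /var'. The 0% pass only has to drop empty chunks, so it should need almost no free space, and each pass can return raw space that the later, larger passes need to work within.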

Nowadays I only use btrfs for a development space (as a docker storage backend)
and when it goes pear-shaped, I just blow it away and start again; there's
nothing stored there which can't be recreated.

> I use Debian testing, so it's whatever kernel and btrfs packages that
> were in that as of yesterday.

Ah ok, thanks.

--
Jonathan Dowland
Please do not CC me, I am subscribed to the list.
