Bug#404143: Fans unreliable under load, permanent memory leak

Sven Luther
On Fri, Dec 22, 2006 at 12:53:09PM +0100, Marc 'HE' Brockschmidt wrote:

> severity 404143 critical
> thanks
>
> maximilian attems <[hidden email]> writes:
> > On Fri, Dec 22, 2006 at 11:28:29AM +0100, Marc 'HE' Brockschmidt wrote:
> >> Fix it or document it, I don't care. But the current state is not
> >> releasable.
> > we are not talking about "a" patch.
> > what you need is an backport of the 2.6.19 acpi release to 2.6.18.
>
> Read again what I wrote. I will not allow Debian to release with a
> Kernel that may damage hardware without even a notice in the release
> notes. If you are not able to fix it, note that you have provided a
> broken kernel.

Cool, let's delay etch a couple of weeks and move to a (now released) 2.6.19
kernel, to solve this issue.

Friendly,

Sven Luther


--
To UNSUBSCRIBE, email to [hidden email]
with a subject of "unsubscribe". Trouble? Contact [hidden email]


Bug#404143: Fans unreliable under load, permanent memory leak

Marc Brockschmidt-4
Sven Luther <[hidden email]> writes:

> On Fri, Dec 22, 2006 at 12:53:09PM +0100, Marc 'HE' Brockschmidt wrote:
>> maximilian attems <[hidden email]> writes:
>>> On Fri, Dec 22, 2006 at 11:28:29AM +0100, Marc 'HE' Brockschmidt wrote:
>>>> Fix it or document it, I don't care. But the current state is not
>>>> releasable.
>>> we are not talking about "a" patch.
>>> what you need is an backport of the 2.6.19 acpi release to 2.6.18.
>> Read again what I wrote. I will not allow Debian to release with a
>> Kernel that may damage hardware without even a notice in the release
>> notes. If you are not able to fix it, note that you have provided a
>> broken kernel.
> Cool, let's delay etch a couple of weeks and move to a (now released) 2.6.19
> kernel, to solve this issue.
Let's try again: Fix it *OR* explain in the release notes that the
kernel in etch is broken for some hardware.

Marc
--
Fachbegriffe der Informatik - Einfach erklärt
79: Usenet
       Ich habe zuviel Freizeit. (Florian Kuehnert)


Bug#404143: Fans unreliable under load, permanent memory leak

Ludovic Brenta-2
In reply to this post by Ludovic Brenta-2
Some more information.

1) On my machine, reading the temperature using, say, yacpi, causes
   one processor to process all the pending ACPI events.  On a
   uniprocessor machine, the machine would appear to hang for several
   seconds; not so on my dual-core machine :)

2) The large slab usage (1.1 GB) was in part due to the XFS cache data;
   all three of my machine's filesystems are XFS.  So the Acpi-State
   line in /proc/slabinfo is the really meaningful one.

Here is my complete log so far, with annotations.

2006-06-21T20:06:10: Slab:            30296 kB
2006-17-21T20:17:01: Slab:            37756 kB
2006-17-21T21:17:01: Slab:            48116 kB
2006-17-21T22:17:01: Slab:            55764 kB
2006-17-21T23:17:01: Slab:            69904 kB
-- Reboot with acpi=noirq: only one CPU found --
2006-24-21T23:24:10: Slab:            10444 kB
-- Reboot with pci=noacpi: only one CPU found --
2006-30-21T23:30:26: Slab:             9676 kB
2006-30-21T23:30:26: Acpi-State             0      0     80   48    1 : tunables  120   60    8 : slabdata      0      0      0
-- Reboot with no options: OK, both CPUs found --
2006-34-21T23:34:23: Slab:            10584 kB
2006-34-21T23:34:23: Acpi-State             0      0     80   48    1 : tunables  120   60    8 : slabdata      0      0      0
2006-17-22T00:17:01: Slab:            15424 kB
2006-17-22T00:17:01: Acpi-State         23088  23088     80   48    1 : tunables  120   60    8 : slabdata    481    481      0
2006-17-22T01:17:01: Slab:            29956 kB
2006-17-22T01:17:01: Acpi-State         59136  59136     80   48    1 : tunables  120   60    8 : slabdata   1232   1232      0
2006-17-22T02:17:01: Slab:            37764 kB
2006-17-22T02:17:01: Acpi-State         95088  95088     80   48    1 : tunables  120   60    8 : slabdata   1981   1981      0
2006-17-22T03:17:01: Slab:            45544 kB
2006-17-22T03:17:01: Acpi-State        130992 130992     80   48    1 : tunables  120   60    8 : slabdata   2729   2729      0
2006-17-22T04:17:01: Slab:            53328 kB
2006-17-22T04:17:01: Acpi-State        166944 166944     80   48    1 : tunables  120   60    8 : slabdata   3478   3478      0
2006-17-22T05:17:01: Slab:            61120 kB
2006-17-22T05:17:01: Acpi-State        202896 202896     80   48    1 : tunables  120   60    8 : slabdata   4227   4227      0
2006-17-22T06:17:01: Slab:            68904 kB
2006-17-22T06:17:01: Acpi-State        238800 238800     80   48    1 : tunables  120   60    8 : slabdata   4975   4975      0
2006-17-22T07:17:01: Slab:          1152624 kB
2006-17-22T07:17:01: Acpi-State        274656 274656     80   48    1 : tunables  120   60    8 : slabdata   5722   5722      0
2006-17-22T08:17:01: Slab:          1160376 kB
2006-17-22T08:17:01: Acpi-State        310608 310608     80   48    1 : tunables  120   60    8 : slabdata   6471   6471      0
2006-17-22T09:17:01: Slab:          1168168 kB
2006-17-22T09:17:01: Acpi-State        346464 346464     80   48    1 : tunables  120   60    8 : slabdata   7218   7218      0
2006-17-22T10:17:01: Slab:          1175892 kB
2006-17-22T10:17:01: Acpi-State        382176 382176     80   48    1 : tunables  120   60    8 : slabdata   7962   7962      0
2006-17-22T11:17:01: Slab:          1183660 kB
2006-17-22T11:17:01: Acpi-State        417984 417984     80   48    1 : tunables  120   60    8 : slabdata   8708   8708      0
2006-17-22T12:17:01: Slab:          1191400 kB
2006-17-22T12:17:01: Acpi-State        453744 453744     80   48    1 : tunables  120   60    8 : slabdata   9453   9453      0
2006-17-22T13:17:01: Slab:          1202924 kB
2006-17-22T13:17:01: Acpi-State        489696 489696     80   48    1 : tunables  120   60    8 : slabdata  10202  10202      0
-- Start yacpi, monitoring the temperature every second.
-- Note how the slab allocation drops by ~100M and then stays constant.
2006-17-22T14:17:01: Slab:          1097584 kB
2006-17-22T14:17:01: Acpi-State           109    144     80   48    1 : tunables  120   60    8 : slabdata      3      3      0
2006-17-22T15:17:01: Slab:          1097532 kB
2006-17-22T15:17:01: Acpi-State            45     96     80   48    1 : tunables  120   60    8 : slabdata      2      2      0
2006-17-22T16:17:01: Slab:          1097536 kB
2006-17-22T16:17:01: Acpi-State            75    144     80   48    1 : tunables  120   60    8 : slabdata      3      3      0
2006-17-22T17:17:01: Slab:          1097668 kB
2006-17-22T17:17:01: Acpi-State           141    144     80   48    1 : tunables  120   60    8 : slabdata      3      3      0
-- Stop the yacpi monitoring.
2006-17-22T18:17:01: Slab:          1098904 kB
2006-17-22T18:17:01: Acpi-State          5808   5808     80   48    1 : tunables  120   60    8 : slabdata    121    121      0
-- At this point the Acpi-State has started increasing again, but is still
-- small.  Most of the slab allocations are in the XFS caches (all three
-- filesystems on this computer are XFS).
-- To make sure the memory can be released, start a fairly large compilation
-- using both CPUs and 2x370 M of RAM.  Just before compilation:
2006-48-22T18:48:56: Slab:          1103244 kB
2006-48-22T18:48:56: Acpi-State         24528  24528     80   48    1 : tunables  120   60    8 : slabdata    511    511      0
-- A couple of minutes into the compilation, the fans have still not turned on
-- and the CPU is getting so hot it burns my hand.  Restart yacpi, monitoring
-- temperature every second.  The temp is 85°C (dangerous!!) One CPU starts
-- processing the backlog of ACPI events, the other continues the compilation.
-- Fans start.  Temperature drops to 71°C and stays there.
2006-00-22T19:00:44: Slab:           861828 kB
2006-00-22T19:00:44: Acpi-State            74     96     80   48    1 : tunables  120   60    8 : slabdata      2      2      0
-- End of compilation.  During the final packaging stages, the temperature has
-- dropped to 57°C as the CPUs were less used.  Stop the yacpi monitoring.
2006-07-22T19:07:13: Slab:           865660 kB
2006-07-22T19:07:13: Acpi-State            73     96     80   48    1 : tunables  120   60    8 : slabdata      2      2      0
2006-17-22T19:17:01: Slab:           865028 kB
2006-17-22T19:17:01: Acpi-State            71    144     80   48    1 : tunables  120   60    8 : slabdata      3      3      0
2006-17-22T20:17:01: Slab:           871224 kB
2006-17-22T20:17:01: Acpi-State         34704  34704     80   48    1 : tunables  120   60    8 : slabdata    723    723      0
2006-17-22T21:17:01: Slab:           879112 kB
2006-17-22T21:17:01: Acpi-State         69552  69552     80   48    1 : tunables  120   60    8 : slabdata   1449   1449      0
2006-17-22T22:17:01: Slab:           887908 kB
2006-17-22T22:17:01: Acpi-State        104784 104784     80   48    1 : tunables  120   60    8 : slabdata   2183   2183      0
2006-17-22T23:17:01: Slab:           896024 kB
2006-17-22T23:17:01: Acpi-State        139920 139968     80   48    1 : tunables  120   60    8 : slabdata   2915   2916      0
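The Acpi-State lines in the log above follow the /proc/slabinfo column order (cache name, active objects, total objects, object size, objects per slab, pages per slab, then the tunables and slabdata groups). As a rough sketch, and not something taken from the original report, the growth between two samples could be computed as below. The function names are made up for illustration, and the timestamp prefix is skipped rather than parsed, because its middle field appears to track the minute instead of the month.

```python
from typing import Optional, Tuple


def parse_acpi_state(line: str) -> Optional[Tuple[int, int]]:
    """Return (active_objs, objsize) from a logged Acpi-State line, or None.

    Hypothetical helper: it assumes the /proc/slabinfo column order
    name, active_objs, num_objs, objsize, ... and ignores anything
    that precedes the cache name (such as a timestamp prefix).
    """
    tokens = line.split()
    if "Acpi-State" not in tokens:
        return None
    i = tokens.index("Acpi-State")
    if len(tokens) < i + 4:
        return None
    active_objs = int(tokens[i + 1])
    objsize = int(tokens[i + 3])
    return active_objs, objsize


def growth_bytes(earlier: str, later: str) -> int:
    """Bytes of additional live Acpi-State objects between two log lines."""
    a = parse_acpi_state(earlier)
    b = parse_acpi_state(later)
    if a is None or b is None:
        raise ValueError("both lines must be Acpi-State samples")
    return b[0] * b[1] - a[0] * a[1]


if __name__ == "__main__":
    first = ("2006-17-22T00:17:01: Acpi-State 23088 23088 80 48 1 "
             ": tunables 120 60 8 : slabdata 481 481 0")
    second = ("2006-17-22T01:17:01: Acpi-State 59136 59136 80 48 1 "
              ": tunables 120 60 8 : slabdata 1232 1232 0")
    print(growth_bytes(first, second))
```

For the two hourly samples used here (00:17 and 01:17), this comes to 2883840 bytes, roughly 2.75 MiB of live Acpi-State objects gained in one hour.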




Bug#404143: Fans unreliable under load, permanent memory leak

Andreas Barth
In reply to this post by Sven Luther
* Sven Luther ([hidden email]) [061222 05:42]:

> On Fri, Dec 22, 2006 at 12:53:09PM +0100, Marc 'HE' Brockschmidt wrote:
> > maximilian attems <[hidden email]> writes:
> > > On Fri, Dec 22, 2006 at 11:28:29AM +0100, Marc 'HE' Brockschmidt wrote:
> > >> Fix it or document it, I don't care. But the current state is not
> > >> releasable.
> > > we are not talking about "a" patch.
> > > what you need is an backport of the 2.6.19 acpi release to 2.6.18.
> >
> > Read again what I wrote. I will not allow Debian to release with a
> > Kernel that may damage hardware without even a notice in the release
> > notes. If you are not able to fix it, note that you have provided a
> > broken kernel.
>
> Cool, let's delay etch a couple of weeks and move to a (now released) 2.6.19
> kernel, to solve this issue.

Sven, stop this! I can remember well how you promised that moving to
2.6.18 will magically solve almost all of our issues - 6 (or more)
release critical bugs against 2.6.18 don't show that this has worked so
well. Please try helping us on solutions rather than breaking things
again.


Please try to look at it from another perspective:

Consider you have bought such a laptop, and you install Debian. You have
even read the release notes first.  Everything works well.  Until one
day you notice your laptop gets too warm, and eventually even breaks
because of this.  On deeper research, you notice that this issue was
well-known to Debian, but they refused to deal with it at all. How would
you feel as a user? I think this is an unacceptable perspective.


Ok, what can we do?
1. ignore the problem,
2. document it in the release notes and README.Debian of the kernel,
3. prevent the kernel running on such buggy laptops [is this possible?],
4. backport ACPI from 2.6.19, or use 2.6.19,
5. isolate a smaller fix and apply it.

I personally consider options 1 and 4 to be unacceptable. Option 5 would
be the best, but I have yet to see that this is possible (or rather,
someone knowledgeable enough has time to do it).

So, we should at least document it inside of the release notes, and
README.Debian, and, if possible without being too invasive, get some
check inside the kernel to print a big warning on bootup, or even refuse
to work until some special parameter is used.


How does this proposal sound to the kernel team?



Cheers,
Andi
--
  http://home.arcor.de/andreas-barth/



Bug#404143: Fans unreliable under load, permanent memory leak

Sven Luther
On Sat, Dec 23, 2006 at 11:50:40AM +0100, Andreas Barth wrote:

> * Sven Luther ([hidden email]) [061222 05:42]:
> > On Fri, Dec 22, 2006 at 12:53:09PM +0100, Marc 'HE' Brockschmidt wrote:
> > > maximilian attems <[hidden email]> writes:
> > > > On Fri, Dec 22, 2006 at 11:28:29AM +0100, Marc 'HE' Brockschmidt wrote:
> > > >> Fix it or document it, I don't care. But the current state is not
> > > >> releasable.
> > > > we are not talking about "a" patch.
> > > > what you need is an backport of the 2.6.19 acpi release to 2.6.18.
> > >
> > > Read again what I wrote. I will not allow Debian to release with a
> > > Kernel that may damage hardware without even a notice in the release
> > > notes. If you are not able to fix it, note that you have provided a
> > > broken kernel.
> >
> > Cool, let's delay etch a couple of weeks and move to a (now released) 2.6.19
> > kernel, to solve this issue.
>
> Sven, stop this!

Why ? /me guesses that even though debian is about free software, there are
many who feel that freedom of speech is to be banned. Do you also follow that
line of thought ? Was it not enough that some people felt that i should be
burned at the stake for having sent mails while i was not at my best ?

Really, this kind of behavior is disgusting.

> I can remember well how you promised that moving to
> 2.6.18 will magically solve almost all of our issues - 6 (or more)
> release critical bugs against 2.6.18 don't show that this has worked so
> well. Please try helping us on solutions rather than breaking things
> again.

I did not promise any such thing. I simply stated at that time that there
were many RC issues which were already fixed in the 2.6.18 tree, and which
would be a pain to backport to the 2.6.17 tree. Quite a different thing, don't
you think ?

I personally will need to maintain 2.6.19+ backports to etch, because there is
no sane way to get Efika support in 2.6.18 without a lot of work.

> Please try to look at it from another perspective:
>
> Consider you have bought such a laptop, and you install Debian. You have
> even read the release notes first.  Everything works well.  Until one
> day you notice your laptop gets too warm, and eventually even breaks
> because of this.  On deeper research, you notice that this issue was
> well-known to Debian, but they refused to deal with it at all. How would
> you feel as a user? I think this is an unacceptable perspective.

Bah. hardware which can be broken by software is broken. That said, if in fact
this is not a bug of the bios as was first mentioned here, but that the linux
support is not able to cope with some not usual but legal features of acpi,
then it is another matter.

But you should *NEVER* try to stop discussion about the subject, or bash on
someone for writing a single sentence as i did.

Friendly,

Sven Luther



Re: Bug#404143: Fans unreliable under load, permanent memory leak

Frederik Schueler-2
In reply to this post by Maximilian Attems-3

Hi *,

this is indeed a severe issue which requires all our attention and care
to solve or circumvent so that nobody's box comes to any harm; you
know how expensive these laptops are.

I basically see 3 solutions/workarounds:

1. the brutal one: deactivate ACPI in 2.6.18, have the bios keep control
of the fans - better a noisy laptop until I upgrade the kernel than a
fried box.

2. port 2.6.19 ACPI - noop because way too much work, unless someone is
"crazy enough" to take on this task.

3. go for 2.6.19

Documenting arbitrary breakage in the release notes is not a solution,
just consider how well manuals are usually read (if at all). Users will
end up with damaged hardware and blame us for it.

We released woody with disabled ide dma due to somewhat similar issues
(boxes hanging), so disabling ACPI in 2.6.18 and going for a 2.6.19
based 4.0r1 ASAP seems the best thing to me personally, but this is of
course up for discussion.

Best regards
Frederik Schueler

--
ENOSIG


Re: Bug#404143: Fans unreliable under load, permanent memory leak

Sven Luther
On Sun, Dec 24, 2006 at 03:07:55AM +0100, Frederik Schueler wrote:

>
> Hi *,
>
> this is indeed a severe issue which requires all our attention and care
> to solve or circumvent in order for nobodies boxes to get any harm, you
> know how expensive these laptops are.
>
> I basically see 3 solutions/workarounds:
>
> 1. the brutal one: deactivate ACPI in 2.6.18, have the bios keep control
> of the fans - better a noisy laptop until I upgrade the kernel than a
> fried box.
>
> 2. port 2.6.19 ACPI - noop because way too much work, unless someone
> "crazy enough" to accomplish this task.
>
> 3. go for 2.6.19

As said, i can imagine another solution.

  4. Provide both a stable 2.6.18, and an easily usable backported 2.6.19
  (or newer) kernel, which would be built for etch, but built out of our
  trunk/unstable/testing archive.

Then we can add a bit of logic into d-i's base-installer, so that the kernel
installation step detects the laptops which have this problem (do we know how
to detect them ?), and inform the user and install the newer kernel.

Alternatively, we can go 1, create a -noacpi flavour usable on those laptops,
and install that flavour in d-i. This would probably be the easiest solution.

> Documenting arbitrary breakage in the release notes is not a solution,
> just consider how well manuals are usually read (if at all). Users will
> end with damaged hardware and blame us for it.

/me agrees.

> We released woody with disabled ide dma due to somewhat similar issues
> (boxes hanging), so disabling ACPI in 2.6.18 and going for a 2.6.19
> based 4.0r1 ASAP seems the best thing to me personally, but this is of
> course up for discussion.

I have been thinking of another solution, but since i am kind of ignored or
this is a subject a certain amount of the powers-who-be don't want me to
mention, i doubt it will be gaining much momentum. I am going to propose a
talk at fosdem about these ideas, where issues and everything else can be
discussed.

The idea goes as follows :

  1) We take the kernel out of the main debian archive, into a separate kernel
  pool. This pool would hold the kernel and all assorted modules or
  abi-depending packages. This pool would hold per-abi subpools
  (dists/kernel/2.6.18-3, dists/kernel/2.6.19-1, etc).

  2) Eventually, we have some symlink or mirroring logic which would allow the
  chosen kernel to be accessible from the main archives. This means we can
  prepare kernels in this kernel pool, test it, and once it is ready, do a
  one-pule moving of those packages (without rebuild) into the main pools.

  3) This pool will include both kernel .debs and .udebs. A further
  improvement would allow to split the d-i initramfs into two, having a single
  copy of the non-kernel specific stuff, and a per-flavour copy of the kernel
  initramfs stuff. This way, we move together the kernel and the module
  .udebs, and can easily switch d-i to change kernel version, or even build
  various d-i for various kernel versions. Furthermore this would avoid d-i
  trying to import 2.6.18-3 modules when you build a local 2.6.19-1 kernel,
  and simplify the whole .udeb version checking and downloading logic.

Well, there is more to it, and i will present that at fosdem, but i hope this
already gave you all a taste of what could be, and that these ideas will not
be rejected out of hand, just because they come from me.

Friendly,

Sven Luther



Re: Bug#404143: Fans unreliable under load, permanent memory leak

Moritz Mühlenhoff-2
In reply to this post by Frederik Schueler-2
On Sun, Dec 24, 2006 at 03:07:55AM +0100, Frederik Schueler wrote:

>
> Hi *,
>
> this is indeed a severe issue which requires all our attention and care
> to solve or circumvent in order for nobodies boxes to get any harm, you
> know how expensive these laptops are.
>
> I basically see 3 solutions/workarounds:
>
> 1. the brutal one: deactivate ACPI in 2.6.18, have the bios keep control
> of the fans - better a noisy laptop until I upgrade the kernel than a
> fried box.

Do you intend to disable ACPI entirely for all systems?

It appears to me that the affected HP models could be disabled on a per-case
basis using drivers/acpi/blacklist.c
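
As an illustration only of what a per-model check amounts to, here is a small Python sketch. It is not the C table in drivers/acpi/blacklist.c, and the vendor and product strings are assumed rather than taken from real machine data. The two model names are the ones that appear in the upstream bug titles quoted later in this thread; the table and function names are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical table of (vendor substring, product substring) pairs.
# The exact strings a real machine reports are an assumption here.
AFFECTED_MODELS: Tuple[Tuple[str, str], ...] = (
    ("hewlett-packard", "nx6125"),
    ("hewlett-packard", "nx6325"),
)


def is_blacklisted(vendor: str, product: str,
                   table: Iterable[Tuple[str, str]] = AFFECTED_MODELS) -> bool:
    """Return True if the machine matches an entry, ignoring case and padding."""
    vendor = vendor.strip().lower()
    product = product.strip().lower()
    return any(v in vendor and p in product for v, p in table)


if __name__ == "__main__":
    # Identification strings below are invented for the demonstration.
    print(is_blacklisted("Hewlett-Packard", "HP Compaq nx6125"))
    print(is_blacklisted("Example Vendor", "Example 1000"))
```

Substring matching keeps the table short but would also catch variants of a model that may not be affected, which is the incompleteness concern raised elsewhere in this thread.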

Cheers,
        Moritz



Bug#404143: Fans unreliable under load, permanent memory leak

Frans Pop-3
In reply to this post by Frederik Schueler-2
On Sunday 24 December 2006 03:07, Frederik Schueler wrote:
> 2. port 2.6.19 ACPI - noop because way too much work, unless someone
> "crazy enough" to accomplish this task.

Did you see that Bas Zoetekouw managed [1, #400488] to solve the problem
for his box by applying some selected patches from upstream?
Wouldn't that be an option?

I'd suggest asking other people that see the same issues to also test a
kernel with these patches and decide based on the results.

[1] http://lists.debian.org/debian-kernel/2006/12/msg00768.html


Bug#404143: Fans unreliable under load, permanent memory leak

Sven Luther
On Sun, Dec 24, 2006 at 02:48:27PM +0100, Frans Pop wrote:
> On Sunday 24 December 2006 03:07, Frederik Schueler wrote:
> > 2. port 2.6.19 ACPI - noop because way too much work, unless someone
> > "crazy enough" to accomplish this task.
>
> Did you see that Bas Zoetekouw managed [1, #400488] to solve the problem
> for his box by applying some selected patches from upstream?
> Wouldn't that be an option?

I thought i saw Maximilian say that there are indeed some patches, but that
the risk to destabilize the whole ACPI subsystem was too great this near to
the etch release. This is exactly the same kind of argument you are using in
d-i, don't you think ?

> I'd suggest asking other people that see the same issues to also test a
> kernel with these patches and decide based on the results.

No, what we would need is huge testing of these patches by people *WHO DIDN'T
SEE THE SAME ISSUES* to make sure there is no regression.

Friendly,

Sven Luther



Re: Bug#404143: Fans unreliable under load, permanent memory leak

Frederik Schueler-2
In reply to this post by Moritz Mühlenhoff-2
Hello,

On Sun, Dec 24, 2006 at 02:02:58PM +0100, Moritz Muehlenhoff wrote:
> Do you intend to disable ACPI entirely for all systems?
>
> It appears to me that the affected HP models could be disabled on a per-case
> basis using drivers/acpi/blacklist.c

This looks like a good idea to me, do we know which models are affected?

OTOH, I doubt we have a complete list of affected models, and who knows
what problems may arise for yet to be released laptops...

Best regards
Frederik Schueler

--
ENOSIG


Re: Bug#404143: Fans unreliable under load, permanent memory leak

Maximilian Attems-3
On Sun, Dec 24, 2006 at 03:31:15PM +0100, Frederik Schueler wrote:

> Hello,
>
> On Sun, Dec 24, 2006 at 02:02:58PM +0100, Moritz Muehlenhoff wrote:
> > Do you intend to disable ACPI entirely for all systems?
> >
> > It appears to me that the affected HP models could be disabled on a per-case
> > basis using drivers/acpi/blacklist.c
>
> This looks like a good idea to me, do we know which models are affected?
>
> OTOH, I doubt we have a complete list of affected models, and who knows
> what problems may arise for yet to be released laptops...

indeed this is a good way.
acpi patches have known side-effects so i would nack any hand-picking
of those.

do we have a report from an affected laptop that booting with noacpi
solves the thermal issues?

i don't agree with the fuss about this bug report nor with the severity.
for the sarge release kernel-image 2.6.8 did not boot on a wide range
of market available intel boards and there were overheating bug reports.
completely disabling acpi seems like an overreaction, based on the fact
that the affected laptops are quite specific. on the other hand i'm
delighted to see discussions about the linux-image upgrade in a stable
revision.

happy christmas

--
maks




Bug#404143: Fans unreliable under load, permanent memory leak

Moritz Mühlenhoff-2
In reply to this post by Frederik Schueler-2
Frederik Schueler wrote:

> Hello,
>
> On Sun, Dec 24, 2006 at 02:02:58PM +0100, Moritz Muehlenhoff wrote:
> > Do you intend to disable ACPI entirely for all systems?
> >
> > It appears to me that the affected HP models could be disabled on a per-case
> > basis using drivers/acpi/blacklist.c
>
> This looks like a good idea to me, do we know which models are affected?
> OTOH, I doubt we have a complete list of affected models,

Since HP supports Debian officially now, I'm sure Dann or someone else from
HP can provide us a list of affected models.

If not, we can contact Len Brown to get the ACPI-OEM-ID for HP and
blacklist all HP models.

> and who knows what problems may arise for yet to be released laptops...

Well, even Debian can't predict the future :-)
Plus, we can still address these in point updates.

Cheers,
        Moritz



Re: Bug#404143: Fans unreliable under load, permanent memory leak

Martin Michlmayr
* Moritz Muehlenhoff <[hidden email]> [2006-12-24 15:57]:
> Since HP supports Debian officially now

not on laptops.

> I'm sure Dann or someone else from HP can provide us a list of
> affected models.

--
Martin Michlmayr
http://www.cyrius.com/



Re: Bug#404143: Fans unreliable under load, permanent memory leak

Sven Luther
In reply to this post by Maximilian Attems-3
On Sun, Dec 24, 2006 at 03:42:46PM +0100, maximilian attems wrote:

> On Sun, Dec 24, 2006 at 03:31:15PM +0100, Frederik Schueler wrote:
> > Hello,
> >
> > On Sun, Dec 24, 2006 at 02:02:58PM +0100, Moritz Muehlenhoff wrote:
> > > Do you intend to disable ACPI entirely for all systems?
> > >
> > > It appears to me that the affected HP models could be disabled on a per-case
> > > basis using drivers/acpi/blacklist.c
> >
> > This looks like a good idea to me, do we know which models are affected?
> >
> > OTOH, I doubt we have a complete list of affected models, and who knows
> > what problems may arise for yet to be released laptops...
>
> indeed this is a good way.
> acpi patches have known side-effects so i would nack any hand-picking
> of those.
>
> do we have a report from an affected laptop that booting with noacpi
> solves the thermal issues?

Ah, neat, there is the noacpi option.

We could simply add this flag to affected laptops by d-i. No need to touch the
kernel or otherwise.

Friendly,

Sven Luther



Bug#404143: Fans unreliable under load, permanent memory leak

Frans Pop-3
In reply to this post by Sven Luther
On Sunday 24 December 2006 15:22, you wrote:
> This is exactly the same kind of
> argument you are using in d-i, don't you think ?

There is a difference between being conservative with fixes for minor
issues and fixes for issues that can fry people's hardware, don't you
think?

Of course care is needed for such changes and I would certainly encourage
a careful review and possibly some contact with upstream maintainers to
get a better feeling for feasibility and possible risks.

The sooner some action is taken on this, the earlier a kernel could be
uploaded (or made available for testing) and a call for testing be done
on the appropriate lists. If patches do cause regressions there would
still be time to revert them. After all, this is an RC issue and the
release will wait for it.


Re: Bug#404143: Fans unreliable under load, permanent memory leak

Jurij Smakov
In reply to this post by Frederik Schueler-2
On Sun, Dec 24, 2006 at 03:07:55AM +0100, Frederik Schueler wrote:

>
> Hi *,
>
> this is indeed a severe issue which requires all our attention and care
> to solve or circumvent in order for nobodies boxes to get any harm, you
> know how expensive these laptops are.
>
> I basically see 3 solutions/workarounds:
>
> 1. the brutal one: deactivate ACPI in 2.6.18, have the bios keep control
> of the fans - better a noisy laptop until I upgrade the kernel than a
> fried box.
>
> 2. port 2.6.19 ACPI - noop because way too much work, unless someone
> "crazy enough" to accomplish this task.
I have reviewed the information available on the thermal problems with
HP laptops, and it appears that there is a fairly conservative set of
patches which takes care of the problems (thanks to Bas for pointing
most of them out). I might have missed some upstream bugs, so please
let me know if there is anything else available on the issue. Below is
the summary, describing the relevant patches:

Bug #5534: No thermal events until acpi -t - HP nx6125
------------------------------------------------------
Summary: thermal events generated by the ACPI subsystem do not get
processed by the kernel because both the interrupt due to a thermal
event and event handler are managed by the same thread (kacpid). The
solution is to create a separate thread for the handler, so that the
processing of thermal events may happen asynchronously.
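
The following toy model in Python, which is not kernel code, sketches why a single thread can stall here. Everything in it (the Worker class, the simulate function, the timeout) is invented for illustration, and the way the interrupt job waits for the handler is an assumption chosen to make the stall visible. With one shared worker the handler cannot run while the interrupt job occupies the thread; with a second worker it runs right away.

```python
import queue
import threading
from typing import Callable


class Worker:
    """One thread draining a FIFO of jobs, loosely like a single work queue."""

    def __init__(self) -> None:
        self.jobs: "queue.Queue[Callable[[], None]]" = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self) -> None:
        while True:
            job = self.jobs.get()
            job()

    def submit(self, job: Callable[[], None]) -> None:
        self.jobs.put(job)


def simulate(interrupt_worker: Worker, notify_worker: Worker,
             timeout: float = 0.5) -> bool:
    """Return True if the handler ran while the interrupt job was waiting."""
    handled = threading.Event()
    result: "queue.Queue[bool]" = queue.Queue()

    def notify_handler() -> None:
        handled.set()  # stands in for "switch the fan on"

    def interrupt_job() -> None:
        # Queue the handler, then wait for it (assumed behaviour).
        notify_worker.submit(notify_handler)
        result.put(handled.wait(timeout))

    interrupt_worker.submit(interrupt_job)
    return result.get()


if __name__ == "__main__":
    shared = Worker()
    print("one thread:", simulate(shared, shared))
    print("two threads:", simulate(Worker(), Worker()))
```

In the single-worker case the handler still runs eventually, but only after the interrupt job gives up, which is loosely analogous to events being processed late.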

I have identified the following patches which appear to finally resolve
the problem:

#8951 from comment #159 Don't defer release of the global lock.
                                (applies to drivers/acpi/events/evmisc.c)
#8952 from comment #160 Create another workqueue for notify()
                                execution.
                                (applies to drivers/acpi/osl.c)

These patches presumably solve the problem, but the problem persists after
a suspend/resume cycle. Followup patches which are supposed to improve the
situation include:

#9631 from comment #171 Improved version of #8952, which prevents
                                flooding of certain machines with thermal
                                events (Linus owns one of those, so he was
                                very unhappy :-)
#9746 from comment #180 Some further improvements. AFAICT, supersedes
                                #9631 and #8952.

So, it looks like we need #8951 and #9746 from this bug. Both apply cleanly
to our 2.6.18-8 source.

Bug #7122: Thermal management problems - HPC nx6325
---------------------------------------------------
Summary: the fans do not come on properly after a suspend/resume cycle. It
looks like the ACPI logic which turns on the fans cannot cope with the fact
that the "power on" method for the fans may need to be executed a few times
before they actually turn on.
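To make the failure mode above concrete, here is a minimal userspace sketch
in C of a power-on routine that keeps evaluating the "power on" method and
re-checks the status until the fan really reports being on. It is not the
upstream patch: struct fan, fan_power_on(), fan_is_on() and
fan_power_on_retry() are hypothetical names, and the "hardware" is just a
counter that needs a given number of evaluations before it reports "on".

```c
#include <stdbool.h>

/* Hypothetical fan model (an assumption for illustration): the hardware
 * only reports "on" after the power-on method was evaluated `needed` times. */
struct fan {
    int needed;   /* evaluations required before the fan spins up */
    int calls;    /* evaluations performed so far */
};

/* Stand-in for evaluating the "power on" method of the fan. */
static void fan_power_on(struct fan *f)
{
    f->calls++;
}

/* Stand-in for querying the fan status: true once it really runs. */
static bool fan_is_on(const struct fan *f)
{
    return f->calls >= f->needed;
}

/* Evaluate "power on" up to max_tries times, checking the status after
 * each attempt. Returns the number of attempts used, or -1 if the fan
 * never came on within max_tries attempts. */
static int fan_power_on_retry(struct fan *f, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        fan_power_on(f);
        if (fan_is_on(f))
            return attempt;
    }
    return -1;
}
```

A caller that evaluates the method once and assumes success corresponds to
max_tries == 1 here, which fails for any fan that needs more than one attempt.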

The following patches appear to be relevant:

#9254 from comment #37 Reset number of resource references on resume
                                and make power on/off routines more strict and
                                robust.
#9255 from comment #38 Make ACPI suspend handlers run before the
                                _PTS/_GTS methods and ACPI resume handlers
                                run after the _WAK method.
#9263 from comment #41 A modification of #9254 to apply to 2.6.19-rc1-mm1

#9355 from comment #48 Implement power resource references as a list,
                                so that if two devices use the same power
                                resource, it cannot be disabled by two
                                subsequent calls from a single device.
                                Supersedes #9254 and #9263.
#9337 from comment #52 Improved final version of #9355.

We need #9255 and #9337 from this bug. They apply cleanly to 2.6.18-8.
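The idea behind tracking power resource references as a list rather than a
bare counter can be sketched as follows. This is a simplified userspace
model, not the code from #9355 or #9337: struct power_resource,
resource_get() and resource_put() are hypothetical names, and devices are
identified by plain integers.

```c
#include <stdbool.h>

#define MAX_REFS 8

/* Hypothetical model of a power resource shared by several devices. */
struct power_resource {
    int users[MAX_REFS];  /* ids of the devices currently referencing it */
    int nusers;
    bool on;
};

/* Return the index of `dev` in the reference list, or -1 if absent. */
static int find_user(const struct power_resource *r, int dev)
{
    for (int i = 0; i < r->nusers; i++)
        if (r->users[i] == dev)
            return i;
    return -1;
}

/* Device `dev` requests the resource. A repeated request from the same
 * device does not add a second reference. Returns false if the list is
 * full and the reference could not be recorded. */
static bool resource_get(struct power_resource *r, int dev)
{
    if (find_user(r, dev) < 0) {
        if (r->nusers == MAX_REFS)
            return false;
        r->users[r->nusers++] = dev;
    }
    r->on = true;
    return true;
}

/* Device `dev` releases the resource. It is switched off only when no
 * device references it any more, so a second release from the same
 * device cannot take it away from another user. */
static void resource_put(struct power_resource *r, int dev)
{
    int i = find_user(r, dev);

    if (i >= 0)
        r->users[i] = r->users[--r->nusers];  /* fill the hole with the last entry */
    if (r->nusers == 0)
        r->on = false;
}
```

With a plain counter, two releases from device 1 would drop the count from
two to zero and switch the resource off while device 2 still needs it; with
the list, the second release finds no entry for device 1 and changes nothing.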

Bug #7570: S3: fan doesn't work properly after resume
-----------------------------------------------------
Summary: one of the four fans is not turned on after a suspend/resume cycle.

Relevant patch:

#9802 from comment #8 When the 'force_power_state' flag is set, the
                                check whether the requested power state is the
                                same as the current one is skipped. In that case
                                the list of power resources to enable is the same
                                as the list of power resources to disable, so these
                                resources get enabled and then disabled again.

This patch may be included, even though the issue it fixes is not as critical
as the other ones. Applies fine to 2.6.18-8 too.
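The interaction described for this bug can be modelled in a few lines of C.
This is only a toy under stated assumptions, not the code touched by #9802:
struct device_power and transition() are hypothetical names, each power
state is given exactly one power resource for simplicity, and the
skip_disable_if_same guard is my own illustration of one way to avoid the
problem, which may or may not match what the patch does.

```c
#include <stdbool.h>

#define NSTATES 2

/* Hypothetical device model: one power resource per power state. */
struct device_power {
    int state;                  /* current power state, 0 = fully on */
    bool force_power_state;     /* skip the "already in that state" check */
    bool resource_on[NSTATES];  /* resource_on[s] powers state s */
};

/* Switch to `target`: enable the resources of the new state, then
 * disable those of the old one. With skip_disable_if_same set, the
 * disable step is skipped when old and new state are identical. */
static void transition(struct device_power *d, int target,
                       bool skip_disable_if_same)
{
    int old = d->state;

    /* Normal shortcut: nothing to do if the state does not change. */
    if (!d->force_power_state && old == target)
        return;

    d->resource_on[target] = true;
    if (!(skip_disable_if_same && old == target))
        d->resource_on[old] = false;  /* same resource if old == target */
    d->state = target;
}
```

In this model a forced transition from state 0 to state 0 without the guard
leaves resource_on[0] false, i.e. the device is nominally on while its power
resource has just been switched off.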

So far I have not tried building the kernel with these patches, but I think this is
a reasonable way to resolve the problem, as the resulting cumulative patch (attached)
is only 19K.

Best regards,
--
Jurij Smakov                                           [hidden email]
Key: http://www.wooyd.org/pgpkey/                      KeyID: C99E03CC

cumulative.patch (19K)

Bug#404143: Fans unreliable under load, permanent memory leak

Maximilian Attems-3
On Tue, Dec 26, 2006 at 06:09:02PM -0800, Jurij Smakov wrote:

> On Sun, Dec 24, 2006 at 03:07:55AM +0100, Frederik Schueler wrote:
> >
> > Hi *,
> >
> > this is indeed a severe issue which requires all our attention and care
> > to solve or circumvent so that nobody's box comes to any harm; you
> > know how expensive these laptops are.
> >
> > I basically see 3 solutions/workarounds:
> >
> > 1. the brutal one: deactivate ACPI in 2.6.18, have the bios keep control
> > of the fans - better a noisy laptop until I upgrade the kernel than a
> > fried box.
> >
> > 2. port 2.6.19 ACPI - noop because way too much work, unless someone
> > is "crazy enough" to accomplish this task.
>
> I have reviewed the information available on the thermal problems with
> HP laptops, and it appears that there is a fairly conservative set of
> patches which takes care of the problems (thanks to Bas for pointing
> most of them out). I might have missed some upstream bugs, so please
> let me know if there is anything else available on the issue. Below is
> the summary, describing the relevant patches:

i nack the mentioned patches!

backports are risky, as you can see again with net-r8169-1.patch,
which is a "localized" driver enhancement with big slowdown consequences
(#400524 and #403782). yes upstream has a fix for that and it should
land soon, but still no one else has bothered yet.

the acpi patches may solve the troubles with those stupid HP laptops,
but they _certainly_ have side effects.
if you look at today's acpi commits you see that they broke
a toshiba laptop.


back to the facts
* the sarge kernel was released with *huge* thermal problems
  and without any userspace help for early loading
* the etch 2.6.18 linux acpi supports *many* thermal boxes
  thermal hooks load modules at earliest possible stage
* acpi releases have regression tests that are only run
  for the complete release itself

the sanest way is to disable acpi for the affected laptops
and push a newer linux in a point release.
playing with acpi fire is not appropriate for a stable release.

 
--
maks



Re: Bug#404143: Fans unreliable under load, permanent memory leak

Jurij Smakov
On Wed, Dec 27, 2006 at 03:40:58AM +0100, maximilian attems wrote:

> > I have reviewed the information available on the thermal problems with
> > HP laptops, and it appears that there is a fairly conservative set of
> > patches which takes care of the problems (thanks to Bas for pointing
> > most of them out). I might have missed some upstream bugs, so please
> > let me know if there is anything else available on the issue. Below is
> > the summary, describing the relevant patches:
>
> i nack the mentioned patches!

Well, that's one in favor and one vote against then.
 
> backports are risky, as you can see again with net-r8169-1.patch,
> which is a "localized" driver enhancement with big slowdown consequences
> (#400524 and #403782). yes upstream has a fix for that and it should
> land soon, but still no one else has bothered yet.

That's because slower networking will not break your hardware.

> the acpi patches may solve the troubles with those stupid HP laptops,
> but they _certainly_ have side effects.
> if you look at today's acpi commits you see that they broke
> a toshiba laptop.

Do you have a reference for that? And we do have the possibility of testing
the changes pretty extensively by uploading to unstable and
specifically asking people to test.
 

> back to the facts
> * the sarge kernel was released with *huge* thermal problems
>   and without any userspace help for early loading
> * the etch 2.6.18 linux acpi supports *many* thermal boxes
>   thermal hooks load modules at earliest possible stage
> * acpi releases have regression tests that are only run
>   for the complete release itself
>
> the sanest way is to disable acpi for the affected laptops
> and push a newer linux in a point release.

Do you have a patch which does that? If one existed, I might
reconsider my position.

> playing with acpi fire is not appropriate for a stable release.

It's all about cost/benefit analysis. In my eyes the benefits of
introducing these patches significantly outweigh the possible
problems, given proper testing.

Best regards,
--
Jurij Smakov                                           [hidden email]
Key: http://www.wooyd.org/pgpkey/                      KeyID: C99E03CC



Re: Bug#404143: Fans unreliable under load, permanent memory leak

Maximilian Attems-3
On Tue, Dec 26, 2006 at 06:52:06PM -0800, Jurij Smakov wrote:
>  
> > backports are risky, as you can see again with net-r8169-1.patch,
> > which is a "localized" driver enhancement with big slowdown consequences
> > (#400524 and #403782). yes upstream has a fix for that and it should
> > land soon, but still no one else has bothered yet.
>
> That's because slower networking will not break your hardware.

why was that fact never rc for sarge?
#259481, #262383
 
> > the acpi patches may solve the troubles with those stupid HP laptops,
> > but they _certainly_ have side effects.
> > if you look at today's acpi commits you see that they broke
> > a toshiba laptop.
>
> Do you have a reference for that? And we do have the possibility of testing
> the changes pretty extensively by uploading to unstable and
> specifically asking people to test.

the dsdt of those hp notebooks is quite strange,
if you follow mjg59's posts you will have read a funny story:
http://mjg59.livejournal.com/67443.html

the reference is easily readable in the git-commits-mail,
if you are interested in a 2006 tarball, i can send it.

check b976fe19acc565e5137e6f12af7b6633a23e6b7c
it reverts your proposed patch.
 
> > and push a newer linux in a point release.
>
> Do you have a patch which does that? If one existed, I might
> reconsider my position.
 
no that is a release manager position. ;)
but i assume you mean a patch for drivers/acpi/blacklist.c
that should be fairly easy to create once we get dmidecode
output of the bug reporter.

fully untested:

diff --git a/drivers/acpi/blacklist.c b/drivers/acpi/blacklist.c
index f9c972b..669d81d 100644
--- a/drivers/acpi/blacklist.c
+++ b/drivers/acpi/blacklist.c
@@ -69,6 +69,9 @@ static struct acpi_blacklist_item acpi_blacklist[] __initdata = {
  "Incorrect _ADR", 1},
  {"ASUS\0\0", "P2B-S   ", 0, ACPI_DSDT, all_versions,
  "Bogus PCI routing", 1},
+ /* HP nx6125 */
+ {"Hewlett-Packard ", "68DTT Ver. F.0", 0xE0000, ACPI_DSDT, all_versions,
+ "Bogus fan support", 1},
 
  {""}
 };
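For readers unfamiliar with the blacklist, the following is a simplified
userspace sketch in C of how an entry like the one above is compared against
a table header. As far as I can tell it follows the general shape of the
check in drivers/acpi/blacklist.c (OEM ID, OEM table ID, then a revision
predicate), but struct table_header, struct blacklist_item and
item_matches() are reduced, hypothetical versions for illustration, and the
kernel code should be consulted for the real field layout.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Revision predicates, named after the ones used in the patch above. */
enum predicate { all_versions, less_than_or_equal, equal, greater_than_or_equal };

/* Only the ACPI table header fields needed for matching. */
struct table_header {
    char oem_id[6];        /* not NUL-terminated */
    char oem_table_id[8];  /* not NUL-terminated */
    uint32_t oem_revision;
};

/* Reduced blacklist entry (table type, reason and severity omitted). */
struct blacklist_item {
    char oem_id[7];        /* 6 characters plus NUL */
    char oem_table_id[9];  /* 8 characters plus NUL */
    uint32_t oem_revision;
    enum predicate pred;
};

/* True if the entry applies to the given table header. */
static bool item_matches(const struct blacklist_item *it,
                         const struct table_header *h)
{
    if (strncmp(it->oem_id, h->oem_id, 6) != 0)
        return false;
    if (strncmp(it->oem_table_id, h->oem_table_id, 8) != 0)
        return false;

    switch (it->pred) {
    case all_versions:          return true;
    case less_than_or_equal:    return h->oem_revision <= it->oem_revision;
    case equal:                 return h->oem_revision == it->oem_revision;
    case greater_than_or_equal: return h->oem_revision >= it->oem_revision;
    }
    return false;
}
```

Note that the OEM ID field of an ACPI table header is 6 bytes and the OEM
table ID 8 bytes, so a working entry would presumably need the exact strings
from the affected machine's DSDT header rather than a DMI vendor string.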

> > playing with acpi fire is not appropriate for a stable release.
>
> It's all about cost/benefit analysis. In my eyes the benefits of
> introducing these patches significantly outweigh the possible
> problems, given proper testing.

fully agreed.
the cost analysis of acpi patches seems quite high,
that's why we currently have the policy not to add any.
i hate to do name dropping, but that goes back to hch.

best regards

--
maks

