slapd-mtread crash on ppc64{,el} in stretch/sid

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

slapd-mtread crash on ppc64{,el} in stretch/sid

Ryan Tandy-4
Hi debian-powerpc,

Would a ppc64(el) porter be able to help me look at #866122? I have
requested a porterbox account but it's not gone through yet, and I am
unable to reproduce the issue at all in a qemu VM.

The openldap test suite is failing on ppc64 and ppc64el in stretch and
unstable: the slapd-mtread helper program segfaults (exit 139) in a
certain test case.

It looks like the builds tend to succeed on jessie's kernel and fail on
stretch's kernel:

* 2.4.44+dfsg-4 failed on ppc64el-osuosl-01
  - kernel 4.9.0-2-powerpc64le
  - gcc-6_6.3.0-13
  - binutils_2.28-3
* 2.4.44+dfsg-5 succeeded on ppc64el-unicamp-01
  - kernel 3.16.0-4-powerpc64le
  - gcc-6_6.3.0-18
  - binutils_2.28-5
* 2.4.44+dfsg-6 failed on ppc64el-osuosl-01
  - kernel 4.9.0-3-powerpc64le
  - gcc-6_6.3.0-19
  - binutils_2.28-6
* 2.4.44+dfsg-7 failed on ppc64el-unicamp-01
  - kernel 4.9.0-3-powerpc64le
  - gcc-6_6.3.0-19
  - binutils binutils_2.28-6

with one exception:

* 2.4.44+dfsg-5~bpo8+1 succeeded on ppc64el-osuosl-01
  - kernel 4.9.0-3-powerpc64le
  - gcc-4.9_4.9.2-10
  - binutils_2.25-5+deb8u1

The steps to build and run just the failing test are:

  apt-get build-dep openldap
  apt-get source openldap
  cd openldap-*/
  DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -T build
  cd debian/build/tests
  ./run -b bdb test060-mt-hot

The slapd-mtread output is in debian/build/tests/testrun/mtread.out.
The testrun folder also contains other logs and files from the test run.
The slapd-mtread source is in tests/progs/slapd-mtread.c.

Thanks in advance for any help you can offer.

Ryan

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

Breno Leitao-6
Hi Ryan,

On Mon, Jul 03, 2017 at 03:39:35PM -0700, Ryan Tandy wrote:
> Hi debian-powerpc,
>
> Would a ppc64(el) porter be able to help me look at #866122? I have
> requested a porterbox account but it's not gone through yet, and I am unable
> to reproduce the issue at all in a qemu VM.

You can also request a VM on the minicloud at
http://openpower.ic.unicamp.br/minicloud/ if you wish. They are usually
quick on creating accounts.

> The openldap test suite is failing on ppc64 and ppc64el in stretch and
> unstable: the slapd-mtread helper program segfaults (exit 139) in a certain
> test case.
>
> It looks like the builds tend to succeed on jessie's kernel and fail on
> stretch's kernel:

In fact, this problem seems to reproduce once in a while, thus, I would
not trust that it might be related to the kernel/gcc combination at this
very beginning. I am suspecting that it might be related to the amount
of threads created and the order, but I do not have a conclusion yet.
Still investigating.

>  apt-get build-dep openldap
>  apt-get source openldap
>  cd openldap-*/
>  DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -T build
>  cd debian/build/tests
>  ./run -b bdb test060-mt-hot

Nice. I was able to reproduce it and debug it further. The problem seems
to be related to a invalid branch/jump, the the next address is not
memory mapped, thus the segfault. The new address is completely random,
and definitely is wrong. The link register (LR), which is register that
shows the return of the branch (similar to call() on amd64) is always
constant when ALSR is disabled. Other than that, I also saw a stack
corruption, which caused the backtrace to be completely bogus.

Anyway, myself and a colleague are still investigating this problem. I will keep
you informed of the progress of this issue.

Thanks,
Breno

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

Breno Leitao-6
Ryan,

> Nice. I was able to reproduce it and debug it further. The problem seems
> to be related to a invalid branch/jump, the the next address is not
> memory mapped, thus the segfault. The new address is completely random,
> and definitely is wrong.

I think I found the real case of the problem here. There is an array
being allocated, and initialized until 'i'.

Later, we use a random number 'r' to access this array, and it will fail
if 'r' is bigger than 'i', since any value bigger than 'i' will not be
initialized. It might fail harder if 'r' is being allocated than 'nvalues', but
I didn't get this scenario during my debugs.

That is how I see the array and how we are trying to access it:

0                         i                        nvalues          RAND_MAX
|-------------------------|---------------------------|--------------|
|     initialized         |       mapped              | unmapped     |
|-------------------------|---------------------------|--------------|

I understand that forcing the values to be smaller than 'i' is what we want.

This is a patch I've created to fix this problem. The problem seems to happen
upstream also. If this patch is OK for you , I will create a pull request
to send this fix upstream.

---

diff --git a/tests/progs/slapd-mtread.c b/tests/progs/slapd-mtread.c
index c9c872918..15316b360 100644
--- a/tests/progs/slapd-mtread.c
+++ b/tests/progs/slapd-mtread.c
@@ -635,7 +635,7 @@ do_random( LDAP *ld,
        int     i = 0, do_retry = maxretries;
        char    *attrs[ 2 ];
        int     rc = LDAP_SUCCESS;
-       int     nvalues = 0;
+       int     maxnvalue, nvalues = 0;
        char    **values = NULL;
        LDAPMessage *res = NULL, *e = NULL;
        char    thrstr[BUFSIZ];
@@ -668,6 +668,7 @@ do_random( LDAP *ld,
                        values[ i ] = ldap_get_dn( ld, e );
                }
                values[ i ] = NULL;
+               maxnvalue = i;

                ldap_msgfree( res );

@@ -680,6 +681,7 @@ do_random( LDAP *ld,

                for ( i = 0; i < innerloop; i++ ) {
                        int     r = ((double)nvalues)*rand()/(RAND_MAX + 1.0);
+                       r %= maxnvalue;

                        do_read( ld, values[ r ],
                                srchattrs, noattrs, nobind, 1, maxretries,

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

John Paul Adrian Glaubitz
Hi Breno!

On 07/06/2017 09:31 PM, Breno Leitao wrote:
> I think I found the real case of the problem here. There is an array
> being allocated, and initialized until 'i'.
>
> Later, we use a random number 'r' to access this array, and it will fail
> if 'r' is bigger than 'i', since any value bigger than 'i' will not be
> initialized. It might fail harder if 'r' is being allocated than 'nvalues', but
> I didn't get this scenario during my debugs.

Interesting bug. I wonder how this bug is only triggering on ppc*. Is that
code that is run on these targets only?

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - [hidden email]
`. `'   Freie Universitaet Berlin - [hidden email]
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

Breno Leitao-6
Hey Adrian,

On Thu, Jul 06, 2017 at 10:44:21PM +0200, John Paul Adrian Glaubitz wrote:

> Hi Breno!
>
> On 07/06/2017 09:31 PM, Breno Leitao wrote:
> > I think I found the real case of the problem here. There is an array
> > being allocated, and initialized until 'i'.
> >
> > Later, we use a random number 'r' to access this array, and it will fail
> > if 'r' is bigger than 'i', since any value bigger than 'i' will not be
> > initialized. It might fail harder if 'r' is being allocated than 'nvalues', but
> > I didn't get this scenario during my debugs.
>
> Interesting bug. I wonder how this bug is only triggering on ppc*. Is that
> code that is run on these targets only?

In fact, doing further debugs I understand that the problem is somewhere
else, and what I am seeing is just a consequence of a prior error that was not
prior handled. This is the test case:

         for ( i = 0, e = ldap_first_entry( ld, res ); e != NULL; i++, e = ldap_next_entry( ld, e ) )
         {
                 values[ i ] = ldap_get_dn( ld, e );
         }
         values[ i ] = NULL;

         ...

         for ( i = 0; i < innerloop; i++ ) {
                     int     r = ((double)nvalues)*rand()/(RAND_MAX + 1.0);

                     do_read( ld, values[ r ],
                             srchattrs, noattrs, nobind, 1, maxretries,
                             delay, force, chaserefs, idx );
         }


In ppc64el case, ldap_first_entry() is returning NULL, thus values[], other
than values[0], but the code will contain garbage and we will interate over it
later.

That said, I understand that there are two bugs:

1) we shouldn't iterate over values[] if it is bogus.
2) ldap_first_entry() shouldn't return NULL (real problem)

So, answering your question. My patch *didn't* fix the real issue (2), but the
consequence of it. That explains why we do not see this on other systems.

Anyway, I am still investigating the real problem here, and Howard, from
OpenLdap, is kindly helping.

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

John Paul Adrian Glaubitz
On 07/07/2017 03:12 AM, Breno Leitao wrote:
> (...)
> That said, I understand that there are two bugs:
>
> 1) we shouldn't iterate over values[] if it is bogus.
> 2) ldap_first_entry() shouldn't return NULL (real problem)
>
> So, answering your question. My patch *didn't* fix the real issue (2), but the
> consequence of it. That explains why we do not see this on other systems.

Thanks for the explanation. Sounds quite tricky to track down.

> Anyway, I am still investigating the real problem here, and Howard, from
> OpenLdap, is kindly helping.

Great. Thank you so much for fixing this!

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - [hidden email]
`. `'   Freie Universitaet Berlin - [hidden email]
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

Ryan Tandy-4
In reply to this post by Breno Leitao-6
I spent some more time on this issue and I believe I isolated the
problem to the following line:

int     r = ((double)nvalues)*rand()/(RAND_MAX + 1.0);

http://sources.debian.net/src/openldap/2.4.44%2Bdfsg-7/tests/progs/slapd-mtread.c/#L682

Specifically, during the multi-threaded tests, casting 'nvalues' from
int to double occasionally comes up with the wrong result.

e.g.:
(double)19 -> 61.000000

Note that 'nvalues' is a local variable in the same function. There
should be zero chance of it being shared with another thread or anything
like that.

The bug reproduces consistently when building with stretch's toolchain
and running stretch's kernel, but I have not seen it at all with
jessie's kernel *nor* in a jessie chroot on stretch's kernel. I have
also reproduced it with clang as well as gcc.

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

Ryan Tandy-4
On Sat, Jul 08, 2017 at 11:15:48AM -0700, Ryan Tandy wrote:
>The bug reproduces consistently when building with stretch's toolchain
>and running stretch's kernel, but I have not seen it at all with
>jessie's kernel *nor* in a jessie chroot on stretch's kernel. I have
>also reproduced it with clang as well as gcc.

The newer compilers generate POWER8 vector instructions e.g.
mtvsrd/mfvsrd while the older ones do not. A kernel bug related to (or
triggered by) these instructions would meet the conditions I described.
Howard suggested we could be looking at a kernel bug around FP/VSR
save/restore.

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

Ryan Tandy-4
Control: notforwarded -1
Control: reassign -1 src:linux 4.8.7-1
Control: affects -1 src:openldap

Hi debian-kernel,

There seems to be a regression on powerpc64 (both endians) that can
corrupt the vector-scalar registers (VSRs) in a threaded program.

I believe the bad commit is this one:

4.9.0: https://github.com/torvalds/linux/commit/dc16b553c949e81f37555777dc7bab66d78285a7
4.8.6: https://github.com/linux-stable/linux-stable/commit/613036d9e91990f2043130ff8f78fd770432b3de

My program is not using transactional memory itself, but I don't
know what libc/libpthread might do internally.

4.8.7-1 is the first Debian kernel where I observe the crash. I rebuilt
that package with the above commit reverted and it went away.

The crash still occurs with 4.11.6-1 from unstable. I have not tested
anything more recent.

I still need to work up a reduced test case. For now the steps to
reproduce the crash with the openldap package are:

  apt-get build-dep openldap
  apt-get source openldap
  cd openldap-*/
  DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -T build
  cd debian/build/tests
  ./run -b bdb test060-mt-hot

Thanks
Ryan

Reply | Threaded
Open this post in threaded view
|

Re: slapd-mtread crash on ppc64{,el} in stretch/sid

Breno Leitao-6
hi Ryan,

On Sun, Jul 09, 2017 at 08:56:59AM -0700, Ryan Tandy wrote:
> There seems to be a regression on powerpc64 (both endians) that can corrupt
> the vector-scalar registers (VSRs) in a threaded program.
>
> I believe the bad commit is this one:
>
> 4.9.0: https://github.com/torvalds/linux/commit/dc16b553c949e81f37555777dc7bab66d78285a7
> 4.8.6: https://github.com/linux-stable/linux-stable/commit/613036d9e91990f2043130ff8f78fd770432b3de

Nice work!

> My program is not using transactional memory itself, but I don't know what
> libc/libpthread might do internally.

That is interesting. I also found a bug last week related to hardware
transactional memory, and it corrupts vector-scalar register 0 (vs0)
when using hardware transactional memory. In our case, every exception
would cause a wrong reclaim (grab the register values from the
checkpointed area), which will affect later, the recheckpoint (put the
stacked registers into the checkpointed area). This is specific for
hardware transactional memory.

The full description of the bug and a possible patch should be found at:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-July/160117.html

> 4.8.7-1 is the first Debian kernel where I observe the crash. I rebuilt that
> package with the above commit reverted and it went away.

Anyway, this is what I am going to investigate now:

1) If glibc's pthread method is using hardware transactional memory by
default.  I remember that upstream enabled it once and then disabled by
default.

2) Investigate this commit ID and check for a possible corruption
depending on the result above.

Reply | Threaded
Open this post in threaded view
|

Re: Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

Ryan Tandy-4
Control: tag -1 upstream

Hi Breno,

Today I built Linux 4.12 from upstream source and the test program still
crashes. I was looking at your fixes to initialize load_{fp,tm,vec} as
well as someone else fixing the CONFIG_ALIVEC typo but none of those
have helped.

I did confirm on this kernel that reverting 613036d9 still stops it from
crashing. Tomorrow I will try to narrow it down to a specific change.
There are only 4 hunks after all (the addition of msr_tm_active cannot
be reverted as there are more calls to it now).

It turns out it is _not_ compiler dependent. The test program compiled
in a jessie chroot succeeds in that chroot and then crashes if I run the
same binary in a stretch chroot. This also means I was wrong about the
m{t,f}vsrd instructions being related, as gcc-4.9 doesn't emit them (for
this particular program, at least).

On Mon, Jul 10, 2017 at 10:23:40AM -0300, Breno Leitao wrote:
>Anyway, this is what I am going to investigate now:
>
>1) If glibc's pthread method is using hardware transactional memory by
>default.  I remember that upstream enabled it once and then disabled by
>default.

objdump -d libpthread.so.0 output apparently lists some tbegin/tend
instructions, but I suppose those could be selected at runtime?

Thanks
Ryan

Reply | Threaded
Open this post in threaded view
|

Re: Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

Breno Leitao
hi Ryan,

On 07/11/2017 02:58 AM, Ryan Tandy wrote:
> Today I built Linux 4.12 from upstream source and the test program still
> crashes. I was looking at your fixes to initialize load_{fp,tm,vec} as well
> as someone else fixing the CONFIG_ALIVEC typo but none of those have helped.

Right, I tested it with the pending patches for HTM and the bug is still
there, so, I doubt is has been fixed already.

> I did confirm on this kernel that reverting 613036d9 still stops it from
> crashing. Tomorrow I will try to narrow it down to a specific change. There
> are only 4 hunks after all (the addition of msr_tm_active cannot be reverted
> as there are more calls to it now).

In fact I just did it and I found that the following patch fixes the
problem.  I am not able to understand why yet. Working on it right now.

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 9f3e2c932dcc..21bcb3b19758 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -231,7 +231,7 @@ void enable_kernel_fp(void)
 EXPORT_SYMBOL(enable_kernel_fp);
 
 static int restore_fp(struct task_struct *tsk) {
-       if (tsk->thread.load_fp || msr_tm_active(tsk->thread.regs->msr)) {
+       if (tsk->thread.load_fp) {
                load_fp_state(&current->thread.fp_state);
                current->thread.load_fp++;
                return 1;

> It turns out it is _not_ compiler dependent. The test program compiled in a
> jessie chroot succeeds in that chroot and then crashes if I run the same
> binary in a stretch chroot. This also means I was wrong about the m{t,f}vsrd
> instructions being related, as gcc-4.9 doesn't emit them (for this particular
> program, at least).

I  understand that glibc might have VSX instructions, so, even if your
application is not using VSX instructions, it might be required
depending on the glibc version you are using, although the problem seems
to be on the float point (FP) side.

> objdump -d libpthread.so.0 output apparently lists some tbegin/tend
> instructions, but I suppose those could be selected at runtime?

Correct. I checked and Debian is enabling HTM[1] to do lock
ellision. It is not a option that you can change on runtime, we would
need to reconfigure/recompile glibc if we want to disable it.  There is
currently an effort to use glibc tunnables to enable/disable lock
elision at runtime, but this is still under development.

Out of curiosity, how did you bisect the kernel to find that commit-id?
Did you do it automatically?

[1] https://buildd.debian.org/status/fetch.php?pkg=glibc&arch=ppc64el&ver=2.24-12&stamp=1497900384&raw=0 (Check for --enable-lock-elision)