64-bit subtract from vector unsigned int

Jeffrey Walton-3
Hi Everyone,

I'm porting a 64-bit algorithm to 32-bit PowerPC (an old PowerMac).
The algorithm is simple when 64-bit is available, but it gets a little
ugly under 32-bit.

PowerPC has a "Vector Subtract Carryout Unsigned Word" (vsubcuw),
https://www.nxp.com/docs/en/reference-manual/ALTIVECPEM.pdf. The
altivec intrinsics are vec_vsubcuw and vec_subc.

The problem is, I don't know how to use it. I've been experimenting
with it but I don't see the use (yet).

How does one use vsubcuw to implement a subtract with borrow?

Thanks in advance.

==========================================

Here's what an "add with carry" looks like. The addc simply adds the
carry into the result after transposing the carry bits from columns 1
and 3 to columns 0 and 2.

typedef __vector unsigned char uint8x16_p;
typedef __vector unsigned int uint32x4_p;
...

inline uint32x4_p VecAdd64(const uint32x4_p& vec1, const uint32x4_p& vec2)
{
    // 64-bit elements available at POWER7 with VSX, but addudm requires POWER8
#if defined(_ARCH_PWR8)
    return (uint32x4_p)vec_add((uint64x2_p)vec1, (uint64x2_p)vec2);
#else
    const uint8x16_p cmask = {4,5,6,7, 16,16,16,16, 12,13,14,15, 16,16,16,16};
    const uint32x4_p zero = {0, 0, 0, 0};

    uint32x4_p cy = vec_addc(vec1, vec2);
    cy = vec_perm(cy, zero, cmask);
    return vec_add(vec_add(vec1, vec2), cy);
#endif
}
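
For readers without AltiVec hardware, the carry fix-up in the #else branch above can be modeled in portable C++. This is only a sketch under stated assumptions: it assumes big-endian element order (lanes 0 and 2 hold the high words), and the names Lanes, split, first, second, model_add64 and check_add64 are hypothetical, not part of the original code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Four 32-bit lanes in big-endian element order: lanes 0 and 2 are the high
// words, lanes 1 and 3 the low words of two packed 64-bit values.
// All names in this sketch are hypothetical; none come from the original post.
using Lanes = std::array<std::uint32_t, 4>;

inline Lanes split(std::uint64_t x, std::uint64_t y) {
    return Lanes{{static_cast<std::uint32_t>(x >> 32), static_cast<std::uint32_t>(x),
                  static_cast<std::uint32_t>(y >> 32), static_cast<std::uint32_t>(y)}};
}
inline std::uint64_t first(const Lanes& v)  { return (std::uint64_t{v[0]} << 32) | v[1]; }
inline std::uint64_t second(const Lanes& v) { return (std::uint64_t{v[2]} << 32) | v[3]; }

// Scalar model of the non-POWER8 branch of VecAdd64.
inline Lanes model_add64(const Lanes& a, const Lanes& b) {
    Lanes sum{}, cy{}, out{};
    for (std::size_t i = 0; i < 4; ++i) {
        sum[i] = a[i] + b[i];                // vec_add: wraps modulo 2^32
        cy[i]  = (sum[i] < a[i]) ? 1u : 0u;  // vec_addc: carry out of each word
    }
    // vec_perm(cy, zero, cmask): the low-word carries move onto the high
    // words, and the low-word lanes are filled with zero.
    const Lanes moved{{cy[1], 0u, cy[3], 0u}};
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = sum[i] + moved[i];          // final vec_add folds the carry in
    return out;
}

// Usage: compare the lane model against plain 64-bit arithmetic.
inline void check_add64() {
    const Lanes r = model_add64(split(0x00000000FFFFFFFFull, 0xFFFFFFFFFFFFFFFFull),
                                split(1, 1));
    assert(first(r) == 0x0000000100000000ull);  // carry crosses into the high word
    assert(second(r) == 0u);                    // wraps modulo 2^64
}
```

The carry out of the high word is dropped, which matches ordinary 64-bit wraparound.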

==========================================

Here's what I have for subtract with borrow in terms of addition.
There are 4 loads and then 9 instructions. I know it is too
inefficient.

    const uint32x4_p mask = {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff};
    const uint8x16_p cmask = {4,5,6,7, 16,16,16,16, 12,13,14,15, 16,16,16,16};
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p  one = {0, 1, 0, 1};

    // one's complement, still need to add 1
    uint32x4_p comp = vec_andc(mask, vec2);

    uint32x4_p cy = vec_addc(one, comp);
    cy = vec_perm(cy, zero, cmask);
    comp = vec_add(vec_add(one, comp), cy);

    cy = vec_addc(vec1, comp);
    cy = vec_perm(cy, zero, cmask);
    return vec_add(vec_add(vec1, comp), cy);
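
The snippet above relies on the identity a - b = a + (~b + 1) modulo 2^64, with every 64-bit add built from 32-bit words plus a carry. A minimal scalar sketch of the same order of operations follows; Parts, make_parts, to_u64, add_parts, sub_via_add and check_sub_via_add are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type: one 64-bit value held as two 32-bit words.
struct Parts { std::uint32_t hi, lo; };

inline Parts make_parts(std::uint64_t x) {
    return Parts{static_cast<std::uint32_t>(x >> 32), static_cast<std::uint32_t>(x)};
}
inline std::uint64_t to_u64(Parts p) { return (std::uint64_t{p.hi} << 32) | p.lo; }

// 64-bit add from 32-bit words: the carry out of the low word feeds the high word.
inline Parts add_parts(Parts a, Parts b) {
    const std::uint32_t lo = a.lo + b.lo;
    const std::uint32_t carry = (lo < a.lo) ? 1u : 0u;
    return Parts{a.hi + b.hi + carry, lo};
}

// a - b computed as a + (~b + 1), in the same order as the post:
// complement, add {0, 1} with carry, then add the minuend with carry.
inline Parts sub_via_add(Parts a, Parts b) {
    Parts comp{~b.hi, ~b.lo};               // one's complement
    comp = add_parts(comp, Parts{0u, 1u});  // +1 gives the two's complement
    return add_parts(a, comp);
}

// Usage: a borrow out of the low word must reach the high word.
inline void check_sub_via_add() {
    const Parts r = sub_via_add(make_parts(0x0000000100000000ull), make_parts(1));
    assert(to_u64(r) == 0x00000000FFFFFFFFull);
}
```

Two carry-propagating adds are needed here, which is why this route costs more instructions than a direct subtract with borrow.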

Re: 64-bit subtract from vector unsigned int

Jeffrey Walton-3
On Tue, Apr 7, 2020 at 5:51 AM Jeffrey Walton <[hidden email]> wrote:

>
> Hi Everyone,
>
> I'm porting a 64-bit algorithm to 32-bit PowerPC (an old PowerMac).
> The algorithm is simple when 64-bit is available, but it gets a little
> ugly under 32-bit.
> ...
>
> Here's what an "add with carry" looks like. The addc simply adds the
> carry into the result after transposing the carry bits from columns 1
> and 3 to columns 0 and 2.
>
> typedef __vector unsigned char uint8x16_p;
> typedef __vector unsigned int uint32x4_p;
> ...
>
> inline uint32x4_p VecAdd64(const uint32x4_p& vec1, const uint32x4_p& vec2)
> {
>     // 64-bit elements available at POWER7 with VSX, but addudm requires POWER8
> #if defined(_ARCH_PWR8)
>     return (uint32x4_p)vec_add((uint64x2_p)vec1, (uint64x2_p)vec2);
> #else
>     const uint8x16_p cmask = {4,5,6,7, 16,16,16,16, 12,13,14,15, 16,16,16,16};
>     const uint32x4_p zero = {0, 0, 0, 0};
>
>     uint32x4_p cy = vec_addc(vec1, vec2);
>     cy = vec_perm(cy, zero, cmask);
>     return vec_add(vec_add(vec1, vec2), cy);
> #endif
> }

I think I found it... The complement of the carry was throwing me off:
vec_subc writes 1 when there is no borrow. Subtract with borrow needs
an extra vec_andc to un-complement the borrow:

    const uint8x16_p bmask = {4,5,6,7, 16,16,16,16, 12,13,14,15, 16,16,16,16};
    const uint32x4_p amask = {1, 1, 1, 1};
    const uint32x4_p zero = {0, 0, 0, 0};

    uint32x4_p bw = vec_subc(vec1, vec2);
    bw = vec_andc(amask, bw);
    bw = vec_perm(bw, zero, bmask);
    return vec_sub(vec_sub(vec1, vec2), bw);
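
The borrow handling above can be checked off-target with a scalar model. This is a sketch, not the library code: it assumes big-endian element order, and Lanes, split, first, second, model_sub64_perm and check_sub64_perm are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Lanes 0 and 2 are high words, lanes 1 and 3 low words (big-endian order).
// All names in this sketch are hypothetical.
using Lanes = std::array<std::uint32_t, 4>;

inline Lanes split(std::uint64_t x, std::uint64_t y) {
    return Lanes{{static_cast<std::uint32_t>(x >> 32), static_cast<std::uint32_t>(x),
                  static_cast<std::uint32_t>(y >> 32), static_cast<std::uint32_t>(y)}};
}
inline std::uint64_t first(const Lanes& v)  { return (std::uint64_t{v[0]} << 32) | v[1]; }
inline std::uint64_t second(const Lanes& v) { return (std::uint64_t{v[2]} << 32) | v[3]; }

// Scalar model of the vec_subc / vec_andc / vec_perm / vec_sub sequence.
inline Lanes model_sub64_perm(const Lanes& a, const Lanes& b) {
    Lanes diff{}, bw{}, out{};
    for (std::size_t i = 0; i < 4; ++i) {
        diff[i] = a[i] - b[i];                               // vec_sub: wraps modulo 2^32
        const std::uint32_t carry = (a[i] >= b[i]) ? 1u : 0u; // vec_subc: 1 means NO borrow
        bw[i] = 1u & ~carry;                                 // vec_andc(amask, bw): 1 means borrow
    }
    // vec_perm(bw, zero, bmask): low-word borrows move onto the high words.
    const Lanes moved{{bw[1], 0u, bw[3], 0u}};
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = diff[i] - moved[i];                         // outer vec_sub applies the borrow
    return out;
}

// Usage: both packed values need a borrow from the high word.
inline void check_sub64_perm() {
    const Lanes r = model_sub64_perm(split(0x0000000100000000ull, 0), split(1, 1));
    assert(first(r) == 0x00000000FFFFFFFFull);
    assert(second(r) == 0xFFFFFFFFFFFFFFFFull);  // 0 - 1 wraps modulo 2^64
}
```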

Jeff

Re: 64-bit subtract from vector unsigned int

Lennart Sorensen
In reply to this post by Jeffrey Walton-3
On Tue, Apr 07, 2020 at 05:51:54AM -0400, Jeffrey Walton wrote:

> Hi Everyone,
>
> I'm porting a 64-bit algorithm to 32-bit PowerPC (an old PowerMac).
> The algorithm is simple when 64-bit is available, but it gets a little
> ugly under 32-bit.
>
> PowerPC has a "Vector Subtract Carryout Unsigned Word" (vsubcuw),
> https://www.nxp.com/docs/en/reference-manual/ALTIVECPEM.pdf. The
> altivec intrinsics are vec_vsubcuw and vec_subc.
>
> The problem is, I don't know how to use it. I've been experimenting
> with it but I don't see the use (yet).
>
> How does one use vsubcuw to implement a subtract with borrow?

Does your 32 bit powerpc have altivec?  A lot do not.  It is certainly
not a universal feature.  As far as I remember, G4 and G5 powermacs have
it, but nothing older.

--
Len Sorensen

Re: 64-bit subtract from vector unsigned int

Romain Dolbeau
Le mar. 7 avr. 2020 à 14:35, Lennart Sorensen
<[hidden email]> a écrit :
> Does your 32 bit powerpc have altivec?  A lot do not.  It is certainly
> not a universal feature.  As far as I remember, G4 and G5 powermacs have
> it, but nothing older.

You remember correctly. Also, the G5 (PowerPC 970) is natively 64-bit
and will happily run a 64-bit kernel (and even userland) under Linux
(including Debian), so this applies to the G4 (PowerPC 74xx) only (and
to a G5 running OS X, or NetBSD, which doesn't have a 64-bit kernel /
userland yet).

[This is for PowerMacs; some newer, embedded, 32-bit PowerPC chips also
have AltiVec, although the one in the AmigaOne X5000 doesn't.]

Cordially,

--
Romain Dolbeau

Re: 64-bit subtract from vector unsigned int

Lennart Sorensen
On Tue, Apr 07, 2020 at 03:12:38PM +0200, Romain Dolbeau wrote:
> You remember correctly, and also G5 (PowerPC 970) are 64-bits natively
> and will happily run a 64 bits kernel (and even userland) in Linux
> (including Debian), so this is for G4 (PowerPC 74xx) only (and G5
> running OSX, or NetBSD which doesn't have a 64-bit kernel / userland
> yet).
>
> [this is for PowerMacs - some newer, embedded, 32 bits powerpc also
> have AltiVec, although the one in the AmigaOne X5000 doesn't].

Right, the X1000 had it, and the X5000 does not (but it is 64-bit),
and the A1222 doesn't even have PowerPC floating point.  They are such
a mess.

--
Len Sorensen

Re: 64-bit subtract from vector unsigned int

Jeffrey Walton-3
In reply to this post by Lennart Sorensen
On Tue, Apr 7, 2020 at 8:27 AM Lennart Sorensen
<[hidden email]> wrote:

>
> On Tue, Apr 07, 2020 at 05:51:54AM -0400, Jeffrey Walton wrote:
> > Hi Everyone,
> >
> > I'm porting a 64-bit algorithm to 32-bit PowerPC (an old PowerMac).
> > The algorithm is simple when 64-bit is available, but it gets a little
> > ugly under 32-bit.
> >
> > PowerPC has a "Vector Subtract Carryout Unsigned Word" (vsubcuw),
> > https://www.nxp.com/docs/en/reference-manual/ALTIVECPEM.pdf. The
> > altivec intrinsics are vec_vsubcuw and vec_subc.
> >
> > The problem is, I don't know how to use it. I've been experimenting
> > with it but I don't see the use (yet).
> >
> > How does one use vsubcuw to implement a subtract with borrow?
>
> Does your 32 bit powerpc have altivec?  A lot do not.  It is certainly
> not a universal feature.  As far as I remember, G4 and G5 powermacs have
> it, but nothing older.

Yes, this is an old PowerMac G4 with Power4. It has an Altivec unit,
but it is only 32-bit. Add, subtract, shift and rotate (and friends)
on 64-bit values are missing.

As old as the hardware is (circa 2000), that old PowerPC chip
outperforms some modern hardware, like Atoms, Celerons and low-end ARM
CPUs in modern gadgets.

Testing some algorithms, like Simon-128 and Speck-128, shows a need for
Altivec. For example, Integer-based Speck-128 was running at about 70
cpb. Altivec-based Speck-128 dropped to 10 cpb even with me doing all
the 64-bit fixups. (Speck-128 runs around 2.5 cpb when the native
hardware supports 64-bit operations, like on Power8).

Jeff

Re: 64-bit subtract from vector unsigned int

Mathieu Malaterre-4
Jeffrey,

On Wed, Apr 8, 2020 at 11:56 AM Jeffrey Walton <[hidden email]> wrote:

>
> On Tue, Apr 7, 2020 at 8:27 AM Lennart Sorensen
> <[hidden email]> wrote:
> >
> > On Tue, Apr 07, 2020 at 05:51:54AM -0400, Jeffrey Walton wrote:
> > > Hi Everyone,
> > >
> > > I'm porting a 64-bit algorithm to 32-bit PowerPC (an old PowerMac).
> > > The algorithm is simple when 64-bit is available, but it gets a little
> > > ugly under 32-bit.
> > >
> > > PowerPC has a "Vector Subtract Carryout Unsigned Word" (vsubcuw),
> > > https://www.nxp.com/docs/en/reference-manual/ALTIVECPEM.pdf. The
> > > altivec intrinsics are vec_vsubcuw and vec_subc.
> > >
> > > The problem is, I don't know how to use it. I've been experimenting
> > > with it but I don't see the use (yet).
> > >
> > > How does one use vsubcuw to implement a subtract with borrow?
> >
> > Does your 32 bit powerpc have altivec?  A lot do not.  It is certainly
> > not a universal feature.  As far as I remember, G4 and G5 powermacs have
> > it, but nothing older.
>
> Yes, this is an old PowerMac G4 with Power4. It has an Altivec unit,
> but it is only 32-bit. Add, subtract, shift and rotate (and friends)
> on 64-bit values are missing.
>
> As old as the hardware is (circa 2000), that old PowerPC chip
> outperforms some modern hardware, like Atoms, Celerons and low-end ARM
> cpu's in modern gadgets.
>
> Testing some algorithms, like Simon-128 and Speck-128, show a need for
> Altivec. For example, Integer-based Speck-128 was running at about 70
> cpb. Altivec-based Speck-128 dropped to 10 cpb even with me doing all
> the 64-bit fixups. (Speck-128 runs around 2.5 cpb when the native
> hardware supports 64-bit operations, like on Power8).

[Somewhat off-topic here.]

Did you ever try crc32 with altivec? crc32 with altivec in the
kernel is only for ppc64.

Re: 64-bit subtract from vector unsigned int

Jeffrey Walton-3
On Wed, Apr 8, 2020 at 7:31 AM Mathieu Malaterre <[hidden email]> wrote:

>
> Jeffrey,
>
> On Wed, Apr 8, 2020 at 11:56 AM Jeffrey Walton <[hidden email]> wrote:
> >
> > On Tue, Apr 7, 2020 at 8:27 AM Lennart Sorensen
> > <[hidden email]> wrote:
> > >
> > > On Tue, Apr 07, 2020 at 05:51:54AM -0400, Jeffrey Walton wrote:
> > > > Hi Everyone,
> > > >
> > > > I'm porting a 64-bit algorithm to 32-bit PowerPC (an old PowerMac).
> > > > The algorithm is simple when 64-bit is available, but it gets a little
> > > > ugly under 32-bit.
> > > >
> > > > PowerPC has a "Vector Subtract Carryout Unsigned Word" (vsubcuw),
> > > > https://www.nxp.com/docs/en/reference-manual/ALTIVECPEM.pdf. The
> > > > altivec intrinsics are vec_vsubcuw and vec_subc.
> > > >
> > > > The problem is, I don't know how to use it. I've been experimenting
> > > > with it but I don't see the use (yet).
> > > >
> > > > How does one use vsubcuw to implement a subtract with borrow?
> > >
> > > Does your 32 bit powerpc have altivec?  A lot do not.  It is certainly
> > > not a universal feature.  As far as I remember, G4 and G5 powermacs have
> > > it, but nothing older.
> >
> > Yes, this is an old PowerMac G4 with Power4. It has an Altivec unit,
> > but it is only 32-bit. Add, subtract, shift and rotate (and friends)
> > on 64-bit values are missing.
> >
> > As old as the hardware is (circa 2000), that old PowerPC chip
> > outperforms some modern hardware, like Atoms, Celerons and low-end ARM
> > cpu's in modern gadgets.
> >
> > Testing some algorithms, like Simon-128 and Speck-128, show a need for
> > Altivec. For example, Integer-based Speck-128 was running at about 70
> > cpb. Altivec-based Speck-128 dropped to 10 cpb even with me doing all
> > the 64-bit fixups. (Speck-128 runs around 2.5 cpb when the native
> > hardware supports 64-bit operations, like on Power8).
>
> [Somewhat off-topic here.]
>
> Did you ever try crc32 with altivec? crc32 with altivec in the
> kernel is only for ppc64.

No. I think the CRC32 support comes from Power8 and in-core crypto
using polynomial multiplies. Here's the fellow who has the reference
implementation and tutorial:
https://github.com/antonblanchard/crc32-vpmsum.

I don't use CRC32 much. I do have GCM mode using polynomial multiplies
(along with Power8 AES). It runs around 1.3 cpb.

Jeff

Re: 64-bit subtract from vector unsigned int

Romain Dolbeau
In reply to this post by Jeffrey Walton-3
Le mer. 8 avr. 2020 à 11:56, Jeffrey Walton <[hidden email]> a écrit :
> As old as the hardware is (circa 2000), that old PowerPC chip
> outperforms some modern hardware, like Atoms, Celerons and low-end ARM
> cpu's in modern gadgets.

They don't feel that fast anymore... Even a Raspberry Pi 3 will run
circles around my dual-1.25 GHz G4 (not that much faster per core, but
there are 4 of them), and even more so around the single G4. And an
RPi 4 is even faster per core. And while the PATA interface is faster
than the SD card of the Pi, it's not that great for I/O either. And they
don't have that much memory, either. The G5 is a whole different beast (I
have a quad, complete with the full complement of 16 GiB of ECC RAM,
just because I can).

> Testing some algorithms, like Simon-128 and Speck-128, show a need for
> Altivec. For example, Integer-based Speck-128 was running at about 70
> cpb. Altivec-based Speck-128 dropped to 10 cpb even with me doing all
> the 64-bit fixups. (Speck-128 runs around 2.5 cpb when the native
> hardware supports 64-bit operations, like on Power8).

Interesting. Maybe you could share your implementations in the
Supercop benchmark? (<http://bench.cr.yp.to/supercop.html>, there's
some help in "How to submit new software:").
Are you interested in just those algorithms or crypto in general? I
have an AltiVec implementation of Chacha20 in there that can probably
be beaten if you feel up to the challenge ;-)
(unfortunately, there are no published results on G4, or recent ones on
G5, as Supercop takes forever to run, and I've already blown a power
supply on my G5, so I'm reluctant to let it run for extended periods of
time).

Cordially,

--
Romain Dolbeau

Re: 64-bit subtract from vector unsigned int

Jeffrey Walton-3
In reply to this post by Jeffrey Walton-3
On Tue, Apr 7, 2020 at 7:51 AM Jeffrey Walton <[hidden email]> wrote:

>
> On Tue, Apr 7, 2020 at 5:51 AM Jeffrey Walton <[hidden email]> wrote:
> >
> > Hi Everyone,
> >
> > I'm porting a 64-bit algorithm to 32-bit PowerPC (an old PowerMac).
> > The algorithm is simple when 64-bit is available, but it gets a little
> > ugly under 32-bit.
> > ...
> >
> > Here's what an "add with carry" looks like. The addc simply adds the
> > carry into the result after transposing the carry bits from columns 1
> > and 3 to columns 0 and 2.
> >
> > typedef __vector unsigned char uint8x16_p;
> > typedef __vector unsigned int uint32x4_p;
> > ...
> >
> > inline uint32x4_p VecAdd64(const uint32x4_p& vec1, const uint32x4_p& vec2)
> > {
> >     // 64-bit elements available at POWER7 with VSX, but addudm requires POWER8
> > #if defined(_ARCH_PWR8)
> >     return (uint32x4_p)vec_add((uint64x2_p)vec1, (uint64x2_p)vec2);
> > #else
> >     const uint8x16_p cmask = {4,5,6,7, 16,16,16,16, 12,13,14,15, 16,16,16,16};
> >     const uint32x4_p zero = {0, 0, 0, 0};
> >
> >     uint32x4_p cy = vec_addc(vec1, vec2);
> >     cy = vec_perm(cy, zero, cmask);
> >     return vec_add(vec_add(vec1, vec2), cy);
> > #endif
> > }
>
> I think I found it... The complement of the carry was throwing me off.
> Subtract with borrow needs an extra vec_andc to un-complement the
> borrow:
>
>     const uint8x16_p bmask = {4,5,6,7, 16,16,16,16, 12,13,14,15, 16,16,16,16};
>     const uint32x4_p amask = {1, 1, 1, 1};
>     const uint32x4_p zero = {0, 0, 0, 0};
>
>     uint32x4_p bw = vec_subc(vec1, vec2);
>     bw = vec_andc(amask, bw);
>     bw = vec_perm(bw, zero, bmask);
>     return vec_sub(vec_sub(vec1, vec2), bw);

Sorry to dig up an old thread... I've been working with Steven Munroe,
a retired IBM engineer and the maintainer of pveclib
(https://github.com/munroesj52/pveclib). Munroe recommended avoiding the
load and permute, and using a shift instead.

Here is an updated VecSub64 routine.

typedef __vector unsigned int uint32x4_p;
...

#if defined(__BIG_ENDIAN__)
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {0, 1, 0, 1};
#else
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {1, 0, 1, 0};
#endif

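    uint32x4_p bw = vec_subc(vec1, vec2);   // per-word carry-out: 1 means no borrow
    uint32x4_p res = vec_sub(vec1, vec2);   // per-word difference, modulo 2^32
    bw = vec_andc(mask, bw);                // invert to get the borrow, low words only
    bw = vec_sld(bw, zero, 4);              // move each borrow one word over, onto the high word
    return vec_sub(res, bw);                // apply the borrow to the high word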
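
A scalar model may help show why the mask matters once the permute is replaced by a shift. This sketch covers only the big-endian branch; Lanes, split, first, second, sld4, model_sub64_sld and check_sub64_sld are hypothetical names, and sld4 stands in for vec_sld(a, b, 4).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Lanes 0 and 2 are high words, lanes 1 and 3 low words (big-endian order).
// All names in this sketch are hypothetical.
using Lanes = std::array<std::uint32_t, 4>;

inline Lanes split(std::uint64_t x, std::uint64_t y) {
    return Lanes{{static_cast<std::uint32_t>(x >> 32), static_cast<std::uint32_t>(x),
                  static_cast<std::uint32_t>(y >> 32), static_cast<std::uint32_t>(y)}};
}
inline std::uint64_t first(const Lanes& v)  { return (std::uint64_t{v[0]} << 32) | v[1]; }
inline std::uint64_t second(const Lanes& v) { return (std::uint64_t{v[2]} << 32) | v[3]; }

// Model of vec_sld(a, b, 4): shift the 32-byte concatenation a:b left by
// 4 bytes and keep the upper 16 bytes, i.e. lanes {a1, a2, a3, b0}.
inline Lanes sld4(const Lanes& a, const Lanes& b) {
    return Lanes{{a[1], a[2], a[3], b[0]}};
}

// Scalar model of the shift-based VecSub64 (big-endian branch).
inline Lanes model_sub64_sld(const Lanes& a, const Lanes& b) {
    const Lanes zero{{0u, 0u, 0u, 0u}};
    const Lanes mask{{0u, 1u, 0u, 1u}};  // keep borrows from the low words only
    Lanes res{}, bw{}, out{};
    for (std::size_t i = 0; i < 4; ++i) {
        const std::uint32_t carry = (a[i] >= b[i]) ? 1u : 0u;  // vec_subc
        res[i] = a[i] - b[i];                                  // vec_sub
        bw[i]  = mask[i] & ~carry;                             // vec_andc(mask, bw)
    }
    bw = sld4(bw, zero);  // lane 1 -> lane 0, lane 3 -> lane 2
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = res[i] - bw[i];
    return out;
}

// Usage: the second value borrows in its HIGH word (lane 2). Without the
// mask, the shift would push that borrow into lane 1 and corrupt the
// first value's low word.
inline void check_sub64_sld() {
    const Lanes r = model_sub64_sld(split(0, 0x0000000100000000ull),
                                    split(1, 0x0000000200000000ull));
    assert(first(r) == 0xFFFFFFFFFFFFFFFFull);
    assert(second(r) == 0xFFFFFFFF00000000ull);
}
```

The shift moves every lane, not just lanes 1 and 3, so the mask has to zero the high-word lanes before the shift rather than after it.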

Jeff