Bits from /me: A humble draft policy on "deep learning v.s. freedom"


Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi people,

A year ago I raised a topic on -devel, pointing out the
"deep learning v.s. software freedom" issue. We drew no
conclusion at that time, and Linux distros that care about
software freedom may still have doubts about some fundamental
questions, e.g. "is this piece of deep learning software
really free?"

People have been lazily evaluating this problem. Now that
a related package has entered my packaging radar, I think
I'd better write a draft and shed some light on a safe
area. Here is the first humble attempt:

  https://salsa.debian.org/lumin/deeplearning-policy
  (issue tracker is enabled)

This draft is conservative and deliberately overcautious,
and currently focuses only on software freedom. That's
exactly where we start, right?

Specifically, I defined 3 types of pre-trained machine
learning models / deep learning models:

  Free Model, ToxicCandy Model, Non-free Model

Developers who'd like to touch DL software should be
cautious about the "ToxicCandy" models. Details can be
found in my draft.

Apart from that, I pointed out in the draft that software
associated with any critical task should be considered
carefully, as deep neural networks introduce a new kind
of vulnerability: a network's response can be disrupted
or even controlled by carefully designed perturbations
added to the network input.
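
As an illustration (not part of the draft), here is a minimal sketch
of such a perturbation in the spirit of the fast gradient sign method.
It assumes PyTorch, and `model`, `x` and `label` are hypothetical
placeholders:

  import torch
  import torch.nn.functional as F

  def fgsm_perturb(model, x, label, eps=0.03):
      # Nudge the input in the direction that increases the loss;
      # the change is visually negligible yet often flips the prediction.
      x = x.clone().detach().requires_grad_(True)
      loss = F.cross_entropy(model(x), label)
      loss.backward()
      return (x + eps * x.grad.sign()).detach()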

Hence, I suggest that packaging intelligent software
must be discussed on -devel if the piece of software is
associated with any kind of critical task, including but
not limited to

  * authentication (e.g. login via face verification or
    identification)
  * program execution (e.g. intelligent voice assistants:
    "Hey, Siri! sudo rm -rf / --no-preserve-root")
  * physical object manipulation (e.g. mechanical
    arms in non-educational settings,
    self-driving cars / autopilot), etc.

See my draft for details.

The package that entered my packaging radar is nltk_data.
https://github.com/nltk/nltk_data
The two most widely used Python-based computational
linguistics toolkits, NLTK and Spacy, require these
data (datasets + models) to enable most of their
functionality.

Best,
Mo.


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Paul Wise via nm
On Tue, May 21, 2019 at 3:11 PM Mo Zhou wrote:

> I'd better write a draft and shed some light on a safe
> area. Here is the first humble attempt:
>
>   https://salsa.debian.org/lumin/deeplearning-policy

The policy looks good to me.

A couple of situations related to this policy:

https://bugs.debian.org/699609
https://ffmpeg.org/pipermail/ffmpeg-devel/2018-July/231828.html

--
bye,
pabs

https://wiki.debian.org/PaulWise


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Andreas Tille-5
In reply to this post by Mo Zhou
Hi Mo,

thanks again for all your effort on Deep Learning in Debian.
Please note that I'm not competent in this field.

On Tue, May 21, 2019 at 12:11:14AM -0700, Mo Zhou wrote:
>
>   https://salsa.debian.org/lumin/deeplearning-policy
>   (issue tracker is enabled)

Not sure whether this is sensible to add to the issue
tracker.
 
> See my draft for details.

Quoting from your section "Questions Not Easy to Answer"


  1. Must the dataset for training a Free Model be present in our archive?
     A Wikipedia dump is a frequently used free dataset in the computational
     linguistics field; is uploading a Wikipedia dump to our Archive sane?

I have no idea about the size of this kind of dump.  Recently I've read
that data sets for other programs tend towards 1GB.  In Debian Med I'm
maintaining metaphlan2-data at 204MB, which would be even larger if it
did not use a "data reduction" method that other DDs consider a bug
(#839925).

  2. Should we re-train the Free Models on buildd? This is crazy. Let's
     not do that right now.

If you ask me, bothering buildd with this task is insane.  However, I'm
positively convinced that we should ship the training data and be able
to train the models from it.

Kind regards

      Andreas.

--
http://fam-tille.de



Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Paul Wise via nm
Hi Paul,

They are added to the case study section. And I like
that question from ffmpeg-devel:

  Where is the source for all those numbers?

On 2019-05-21 08:02, Paul Wise wrote:

> On Tue, May 21, 2019 at 3:11 PM Mo Zhou wrote:
>
>> I'd better write a draft and shed some light on a safe
>> area. Here is the first humble attempt:
>>
>>   https://salsa.debian.org/lumin/deeplearning-policy
>
> The policy looks good to me.
>
> A couple of situations this related to this policy:
>
> https://bugs.debian.org/699609
> https://ffmpeg.org/pipermail/ffmpeg-devel/2018-July/231828.html


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Tzafrir Cohen
In reply to this post by Andreas Tille-5
Hi,

On 21/05/2019 12:07, Andreas Tille wrote:

> If you ask me, bothering buildd with this task is insane.  However, I'm
> positively convinced that we should ship the training data and be able
> to train the models from it.
>

Is there a way to prove (via a reproducible build or something
similar) that the results were obtained from that data set using the
specific algorithm?

I suppose that the answer is negative, but it would have been nice to
have that.

-- Tzafrir


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Ben Hutchings-3
In reply to this post by Mo Zhou
On Tue, 2019-05-21 at 00:11 -0700, Mo Zhou wrote:
[...]
> People have been lazily evaluating this problem. Now that a
> related package has entered my packaging radar, I think
> I'd better write a draft and shed some light on a safe
> area. Here is the first humble attempt:
>
>   https://salsa.debian.org/lumin/deeplearning-policy
>   (issue tracker is enabled)
[...]

Thanks for this.  Something I don't quite understand is the division
into 3 categories.  You write:

> 2. A ToxicCandy Model refers to a free software licensed model,
> trained from unknown or non-free dataset [...]

> 3. A model is Non-free Model as long as any of the following
> conditions is satisfied: (1) trained from unknown/non-free data [...]

Is category 2 intended to be a subset of category 3, or am I missing
some distinction?

Ben.

--
Ben Hutchings
Any sufficiently advanced bug is indistinguishable from a feature.



Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Paul Wise via nm
In reply to this post by Mo Zhou
On Tue, 2019-05-21 at 03:14 -0700, Mo Zhou wrote:

> They are added to the case study section.

Are there any other case studies we could add?

Has anyone repeated the training of Mozilla DeepSpeech for example?

Are deep learning models deterministically and reproducibly trainable?
If I re-train a model using the exact same input data on different
(GPU?) hardware will I get the same bits out at the end?

--
bye,
pabs

https://wiki.debian.org/PaulWise



Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Ben Hutchings-3
Hi Ben,

Good catch! I'm quite sure the 3 categories do not overlap with
each other. And I've fixed the language to make it logically
correct:

  A **ToxicCandy Model** refers to an explicitly free software licensed
  model, trained from unknown or non-free dataset.

  A model is **Non-free Model** as long as any of the following
  conditions is satisfied: (1) trained from unknown/non-free data and
  released WITHOUT explicit free software license declaration; ...

Category 2 is a special but common case: a warm-hearted upstream
wants to share the training results freely, but the results are
actually trained from non-free data, and the free software community
could never reproduce them with purely free material.

Category 3 is easier and more obvious to identify than category 2.

Fixed in the git repo.

On 2019-05-21 21:43, Ben Hutchings wrote:

> Thanks for this.  Something I don't quite understand is the division
> into 3 categories.  You write:
>
>> 2. A ToxicCandy Model refers to a free software licensed model,
>> trained from unknown or non-free dataset [...]
>
>> 3. A model is Non-free Model as long as any of the following
>> conditions is satisfied: (1) trained from unknown/non-free data [...]
>
> Is category 2 intended to be a subset of category 3, or am I missing
> some distinction?
>
> Ben.


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Paul Wise via nm
Hi Paul,

On 2019-05-21 23:52, Paul Wise wrote:
> Are there any other case studies we could add?

Anybody is welcome to open an issue and add more
cases to the document. I can dig into them in the
future.

> Has anyone repeated the training of Mozilla DeepSpeech for example?

Generally speaking, training is non-trivial and
requires expensive hardware. This fact will clearly
reduce the probability that "someone has tried to
reproduce it".

A real example that illustrates how hard reproducing a
**giant** model is, is BERT, one of the state-of-the-art
natural language representation models, which takes about
2 weeks to train on a TPU at a cost of about $500.

Cite:
https://github.com/google-research/bert#pre-training-tips-and-caveats

> Are deep learning models deterministically and reproducibly trainable?
> If I re-train a model using the exact same input data on different
> (GPU?) hardware will I get the same bits out at the end?

Making the training program reproducible is good practice for
everyone who trains / debugs neural networks. I once wrote a simple
deep learning framework with only the C++ STL and hence fell into
many pitfalls. Reproducibility is very important for debugging, as
mathematical bugs are much harder to diagnose than code bugs.

I wrote a dedicated section about reproducibility:
https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Tzafrir Cohen
Hi Tzafrir,

On 2019-05-21 19:58, Tzafrir Cohen wrote:
> Is there a way to prove (via a reproducible build or something
> similar) that the results were obtained from that data set using the
> specific algorithm?

I wrote a dedicated section about reproducibility:
https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility

> I suppose that the answer is negative, but it would have been nice to
> have that.

In simple cases, fixing the seed of the random number generator is enough.
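
For example (a sketch assuming NumPy and PyTorch; the seed value is
arbitrary), the "simple case" amounts to pinning every RNG the training
run touches and asking cuDNN for deterministic kernels:

  import random
  import numpy as np
  import torch

  def fix_seeds(seed=42):
      random.seed(seed)
      np.random.seed(seed)
      torch.manual_seed(seed)
      torch.cuda.manual_seed_all(seed)
      # Deterministic kernels can be slower, but they remove one
      # source of run-to-run drift.
      torch.backends.cudnn.deterministic = True
      torch.backends.cudnn.benchmark = False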

If any upstream has ever claimed that their project aims to be of high
quality, then failure to reproduce is very likely a fatal bug.

Reproducibility is also a headache among the machine learning and
deep learning communities. They are trying to improve the situation.
Everyone likes reproducible bits.


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Holger Levsen-2
On Tue, May 21, 2019 at 07:53:34PM -0700, Mo Zhou wrote:
> I wrote a dedicated section about reproducibility:
> https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility

nice, very!

Though you don't specify what 'reproducible' means. Given your last line
in this email (see below) it sounds like you mean 'bit by bit identical
results', but that's not clear at the above URL.

Still, very nice and easy to add!

> Everyone likes reproducible bits.

:))


--
tschau,
        Holger

-------------------------------------------------------------------------------
               holger@(debian|reproducible-builds|layer-acht).org
       PGP fingerprint: B8BF 5413 7B09 D35C F026 FE9D 091A B856 069A AA1C

Some people say that the climate crisis  is something that we all have created,
but  that is not true,  because if everyone is guilty  then no one is to blame.
And someone is to blame.  Some people, some companies,  some decision-makers in
particular, have known exactly what priceless values they have been sacrificing
to continue making unimaginable amounts of money.


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi Holger,

Yes, that section is about bit-by-bit reproducibility,
and identical hashsum is expected. Let's call it
"Bit-by-Bit reproducible".

I updated that section to make the definition
of "reproducible" explicit. And the strongest one
is discussed by default.

However, I'm not sure whether "bit-by-bit" is
easily broken for some obscure reason in a complex
system (e.g. floating point precision problems,
timestamps hidden in the stored model). And I've never
tried to compare my neural nets with hashsums...
I compare curves and digits instead ...
I need some time to think about it, verify, and
refine the definition.
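
For concreteness, the bit-by-bit check is nothing more than comparing
checksums of the serialized models from two independent runs. A sketch,
with hypothetical file names:

  import hashlib

  def sha256sum(path):
      with open(path, "rb") as f:
          return hashlib.sha256(f.read()).hexdigest()

  # Identical hashes => bit-by-bit reproducible; any embedded timestamp
  # or floating point drift breaks the equality immediately.
  print(sha256sum("model_run1.pt") == sha256sum("model_run2.pt"))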

On 2019-05-22 08:49, Holger Levsen wrote:

> On Tue, May 21, 2019 at 07:53:34PM -0700, Mo Zhou wrote:
>> I wrote a dedicated section about reproducibility:
>> https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility
>
> nice, very!
>
> Though you don't specify what 'reproducible' means. Given your last line
> in this email (see below) it sounds like you mean 'bit by bit identical
> results', but that's not clear at the above URL.
>
> Still, very nice and easy to add!
>
>> Everyone likes reproducible bits.
>
> :))


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Holger Levsen-2
On Wed, May 22, 2019 at 03:35:20AM -0700, Mo Zhou wrote:
> Yes, that section is about bit-by-bit reproducibility,
> and identical hashsum is expected. Let's call it
> "Bit-by-Bit reproducible".

cool!

> I updated that section to make the definition
> of "reproducible" explicit.

thank you!

> However, I'm not sure whether "bit-by-bit" is
> easily broken for some obscure reason in a complex
> system (e.g. floating point precision problems,
> timestamps hidden in the stored model). And I've never
> tried to compare my neural nets with hashsums...
> I compare curves and digits instead ...
> I need some time to think about it, verify, and
> refine the definition.

sure, take your time!


--
tschau,
        Holger

-------------------------------------------------------------------------------
               holger@(debian|reproducible-builds|layer-acht).org
       PGP fingerprint: B8BF 5413 7B09 D35C F026 FE9D 091A B856 069A AA1C

Dance like no one's watching. Encrypt like everyone is.


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Sam Hartman-3
In reply to this post by Mo Zhou
>>>>> "Mo" == Mo Zhou <[hidden email]> writes:

    Mo> Hi Holger, Yes, that section is about bit-by-bit
    Mo> reproducibility, and identical hashsum is expected. Let's call
    Mo> it "Bit-by-Bit reproducible".

    Mo> I updated that section to make the definition of "reproducible"
    Mo> explicit. And the strongest one is discussed by default.

    Mo> However, I'm not sure whether "bit-by-bit" is easily broken for
    Mo> some obscure reason in a complex system (e.g. floating point
    Mo> precision problems, timestamps hidden in the stored model). And
    Mo> I've never tried to compare my neural nets with hashsums...  I
    Mo> compare curves and digits instead ...  I need some time to think
    Mo> about it, verify, and refine the definition.

So, I think it's problematic to apply old assumptions to new areas.  The
reproducible builds world has gotten a lot further with bit-for-bit
identical builds than I ever imagined they would.

However, what's actually needed in the deep learning context is weaker
than bit-for-bit identical.  What we need is a way to validate that two
models are identical for some equality predicate that meets our security
and safety (and freedom) concerns.  Parallel computation in the
training, the sort of floating point issues you point to, and a lot of
other things may make bit-for-bit identical models hard to come by.

Obviously we need to validate the correctness of whatever comparison
function we use.  A checksum match is relatively easy to validate.
Something that, for example, understood floating point numbers would have
a greater potential for bugs than an implementation of, say, sha256.

So, yeah, bit-for-bit identical is great if we can get it.  But
validating these models is important enough that if we need to use a
different equality predicate it's still worth doing.
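
As an illustration only, such a weaker predicate could look like the
sketch below (it assumes the two files are PyTorch state dicts of
floating point tensors; the tolerances are arbitrary placeholders).
It also illustrates the point above: there is more code here to get
wrong than in a sha256 comparison.

  import torch

  def models_equivalent(path_a, path_b, rtol=1e-4, atol=1e-6):
      # Weaker than bit-for-bit: every parameter tensor must match
      # within tolerance, rather than the files hashing identically.
      a, b = torch.load(path_a), torch.load(path_b)
      if a.keys() != b.keys():
          return False
      return all(torch.allclose(a[k], b[k], rtol=rtol, atol=atol)
                 for k in a)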

--Sam


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Andy Simpkins-5
In reply to this post by Mo Zhou

On 22/05/2019 03:53, Mo Zhou wrote:

> Hi Tzafrir,
>
> On 2019-05-21 19:58, Tzafrir Cohen wrote:
>> Is there a way to prove (via a reproducible build or something
>> similar) that the results were obtained from that data set using the
>> specific algorithm?
> I wrote a dedicated section about reproducibility:
> https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility
>
>> I suppose that the answer is negative, but it would have been nice to
>> have that.
> In simple cases, fixing the seed of the random number generator is enough.
>
> If any upstream has ever claimed that their project aims to be of high
> quality, then failure to reproduce is very likely a fatal bug.
>
> Reproducibility is also a headache among the machine learning and
> deep learning communities. They are trying to improve the situation.
> Everyone likes reproducible bits.

I agree completely.

Your wording "The model /should/be reproducible with a fixed random
seed." feels
correct but wonder if guidance notes along the following lines should be
added?

     *unless* we can reproduce the same results from the same training
     data, you cannot classify a model as group 1, "Free Model", because
     verification that training has been carried out on a dataset
     explicitly licensed under a free software license cannot be
     achieved.  This should be treated as a severe bug and the entire
     suite should be classified as group 2, "ToxicCandy Model", until
     such time as verification is possible.

Finally, thank you for your work on this.

/Andy


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Sam Hartman-3
>>>>> "Andy" == Andy Simpkins <[hidden email]> writes:

    Andy> *unless* we can reproduce the same results from the same
    Andy> training data, you cannot classify a model as group 1, "Free
    Andy> Model", because verification that training has been carried
    Andy> out on a dataset explicitly licensed under a free software
    Andy> license cannot be achieved.  This should be treated as a
    Andy> severe bug and the entire suite should be classified as group
    Andy> 2, "ToxicCandy Model", until such time as verification is
    Andy> possible.

I don't think that's entirely true.
If we've done the training we can have confidence that it's free.
Reproducibility is still an issue, but is no more or less an issue than
with any other software.


Consider how we treat assets for games or web applications.  And yes
there are some confusing areas there and areas where we'd like to
improve.  But let's be consistent in what we demand from various
communities to be part of Debian.  Let's not penalize people for being
new and innovative.


--Sam


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Andy Simpkins-3
Sam.
Whilst I agree that "assets" in some packages may not have sources with them, and that the application may still be in main if it pulls in those assets from contrib or non-free, I am trying to suggest the same thing here.
If the data set is unknown, this is the *same* as a dependency on a random binary blob (music / fonts / game levels / textures etc.), and we wouldn't put that in main.

It is my belief that we consider training data sets as 'source' in much the same way....

/Andy

On 23 May 2019 16:33:24 BST, Sam Hartman <[hidden email]> wrote:
"Andy" == Andy Simpkins <[hidden email]> writes:

Andy> *unless* we can reproduce the same results from the same training
Andy> data, you cannot classify a model as group 1, "Free Model",
Andy> because verification that training has been carried out on a
Andy> dataset explicitly licensed under a free software license cannot
Andy> be achieved.  This should be treated as a severe bug and the
Andy> entire suite should be classified as group 2, "ToxicCandy Model",
Andy> until such time as verification is possible.

I don't think that's entirely true.
If we've done the training we can have confidence that it's free.
Reproducibility is still an issue, but is no more or less an issue than
with any other software.


Consider how we treat assets for games or web applications. And yes
there are some confusing areas there and areas where we'd like to
improve. But let's be consistent in what we demand from various
communities to be part of Debian. Let's not penalize people for being
new and innovative.


--Sam


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Sam Hartman-3
>>>>> "Andy" == Andy Simpkins <[hidden email]> writes:

    Andy> wouldn't put that in main.  It is my belief that we consider
    Andy> training data sets as 'source' in much the same way....  /Andy

I agree that we consider training data sets as source.

We require the binaries we ship to be buildable from source.
We typically, but not always (I'm sure I can go find some PDFs that are
not rebuilt), require that someone we trust rebuild those binaries.
So for deep learning models we would require that they be retrainable
and typically require that we have retrained them.
Reproducibility is nice, but it is not a requirement at this time.
Reproducibility makes it easy for us to convince ourselves that the
model can be retrained from the training data set.
It's the best and simplest way to do so.
It's not the only way to do so.

--Sam


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Don Armstrong
In reply to this post by Andy Simpkins-5
On Thu, 23 May 2019, Andy Simpkins wrote:
> Your wording "The model /should/ be reproducible with a fixed random
> seed." feels correct, but I wonder whether guidance notes along the
> following lines should be added?

Reproducing exact results from a deep learning model which requires
extensive computation is fairly hard. On top of knowing the exact state
of the system at the initiation of the computation, it requires the
entire computation to be deterministic. Deterministic execution may
require unacceptable performance tradeoffs in larger analyses.

Reproducible builds are hard enough, and those generally don't involve
coprocessors.

--
Don Armstrong                      https://www.donarmstrong.com

I made a bunch of stickers
to put on rooftops, and in secret tunnels.
"If you are reading this,
 then you are awesome"
 -- a softer world #569
    http://www.asofterworld.com/index.php?id=569


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Sam Hartman-3
Hi,

On 2019-05-22 12:43, Sam Hartman wrote:
> So, I think it's problematic to apply old assumptions to new areas.  The
> reproducible builds world has gotten a lot further with bit-for-bit
> identical builds than I ever imagined they would.

I overhauled the reproducibility section, and lowered the
reproducibility standard from "Bit-by-Bit" to "Numerically", which is
the most practical choice for now. Anyway, we can raise the bar in the
future if things get better in terms of reproducibility.

> However, what's actually needed in the deep learning context is weaker
> than bit-for-bit identical.  What we need is a way to validate that two
> models are identical for some equality predicate that meets our security
> and safety (and freedom) concerns.  Parallel computation in the
> training, the sort of floating point issues you point to, and a lot of
> other things may make bit-for-bit identical models hard to come by.

Indeed: I name this as "Numerically Reproducible":
https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility

> Obviously we need to validate the correctness of whatever comparison
> function we use.  A checksum match is relatively easy to validate.
> Something that, for example, understood floating point numbers would have
> a greater potential for bugs than an implementation of, say, sha256.
>
> So, yeah, bit-for-bit identical is great if we can get it.  But
> validating these models is important enough that if we need to use a
> different equality predicate it's still worth doing.

For now, we just need to compare the digits and the curves: train twice
without any modification, and see whether the curves and digits are the
same. Further measures, I think, depend on how this field evolves.
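
A sketch of that comparison, assuming the two runs dump their per-epoch
loss/accuracy digits into hypothetical CSV logs:

  import numpy as np

  def curves_match(log_a, log_b, tol=1e-3):
      # Same shape and same numbers (within tolerance) means the two
      # runs produced the same training curves.
      a = np.loadtxt(log_a, delimiter=",")
      b = np.loadtxt(log_b, delimiter=",")
      return a.shape == b.shape and np.allclose(a, b, atol=tol)

  # e.g. curves_match("run1_metrics.csv", "run2_metrics.csv")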
