Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi people,

A year ago I raised a topic on -devel, pointing out the
"deep learning vs. software freedom" issue. We drew no
conclusion at that time, and Linux distros that care about
software freedom may still have doubts about some fundamental
questions, e.g. "is this piece of deep learning software
really free?"

So far we have been lazily evaluating this problem. Now that a
related package has entered my packaging radar, I think
I'd better write a draft and shed some light on a safe
area. Here is the first humble attempt:

  https://salsa.debian.org/lumin/deeplearning-policy
  (issue tracker is enabled)

This draft is conservative and overkill, and currently
focuses only on software freedom. That's exactly where we
start, right?

Specifically, I defined 3 types of pre-trained machine
learning models / deep learning models:

  Free Model, ToxicCandy Model, Non-free Model

Developers who'd like to touch DL software should be
cautious about the "ToxicCandy" models. Details can be
found in my draft.

Apart from that, I pointed out in the draft that software
associated with any critical task should be considered
carefully, as deep neural networks introduce a new kind
of vulnerability: a network's response can be disrupted
or even controlled by carefully designed perturbations
added to the network input.

Hence, I suggest that packaging intelligent software
must be discussed on -devel if the piece of software is
associated with any kind of critical task, including but
not limited to

  * authentication (e.g. login via face verification or
    identification)
  * program execution (e.g. intelligent voice assistants:
    "Hey, Siri! sudo rm -rf / --no-preserve-root")
  * physical object manipulation (e.g. mechanical
    arms in non-educational settings,
    cars, i.e. autopilot), etc.

See my draft for details.

The package that entered my packaging radar is nltk_data:
https://github.com/nltk/nltk_data
The two most widely used Python-based computational
linguistics toolkits, NLTK and spaCy, require these
data (datasets + models) to enable most of their
functionality.
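
As a rough illustration (a minimal sketch, not taken from the
nltk_data packaging work; the "punkt" tokenizer model is just one
example resource shipped in nltk_data):

```
# Sketch: how NLTK pulls pieces of nltk_data at runtime. Without the
# downloaded resource, word_tokenize() raises a LookupError, which is
# why packaging nltk_data matters for offline use.
import nltk

nltk.download('punkt')  # fetches the tokenizer model into ~/nltk_data
print(nltk.word_tokenize("Debian ships software, but who ships the models?"))
```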

Best,
Mo.


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Paul Wise via nm
On Tue, May 21, 2019 at 3:11 PM Mo Zhou wrote:

> I'd better write a draft and shed some light on a safety
> area. Then here is the first humble attempt:
>
>   https://salsa.debian.org/lumin/deeplearning-policy

The policy looks good to me.

A couple of situations related to this policy:

https://bugs.debian.org/699609
https://ffmpeg.org/pipermail/ffmpeg-devel/2018-July/231828.html

--
bye,
pabs

https://wiki.debian.org/PaulWise


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Andreas Tille-5
In reply to this post by Mo Zhou
Hi Mo,

thanks again for all your effort on deep learning in Debian.
Please note that I'm not competent in this field.

On Tue, May 21, 2019 at 12:11:14AM -0700, Mo Zhou wrote:
>
>   https://salsa.debian.org/lumin/deeplearning-policy
>   (issue tracker is enabled)

Not sure whether this is sensible to be added to the issue
tracker.
 
> See my draft for details.

Quoting from your section "Questions Not Easy to Answer"


  1. Must the dataset for training a Free Model present in our archive?
     Wikipedia dump is a frequently used free dataset in the computational
     linguistics field, is uploading wikipedia dump to our Archive sane?

I have no idea about the size of this kind of dump.  Recently I've read
that data sets for other programs tend towards 1GB.  In
Debian Med I'm maintaining metaphlan2-data at 204MB, which would be
even larger if it did not use a "data reduction" method that is
considered a bug (#839925) by other DDs.

  2. Should we re-train the Free Models on buildd? This is crazy. Let's
     don't do that right now.

If you ask me, bothering buildd with this task is insane.  However, I'm
positively convinced that we should ship the training data and be able
to train the models from it.

Kind regards

      Andreas.

--
http://fam-tille.de


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mattias Wadenstein
On Tue, 21 May 2019, Andreas Tille wrote:

> Quoting from your section "Questions Not Easy to Answer"
>
>
>  1. Must the dataset for training a Free Model present in our archive?
>     Wikipedia dump is a frequently used free dataset in the computational
>     linguistics field, is uploading wikipedia dump to our Archive sane?
>
> I have no idea about the size of this kind of dump.

The current size of the wikimedia dumps is 18T, but that includes several
versions of the data (five dated versions are shipped for most dumps), etc. As
a sample, I think this[1] is the English pages' main text (not history or
metadata), which is 15G compressed.

1) https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20190501/enwiki-20190501-pages-articles.xml.bz2

/Mattias Wadenstein, mirror admin who also mirrors the wikimedia dumps


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Paul Wise via nm
Hi Paul,

They have been added to the case study section. And I like
that question from ffmpeg-devel:

  Where is the source for all those numbers?

On 2019-05-21 08:02, Paul Wise wrote:

> On Tue, May 21, 2019 at 3:11 PM Mo Zhou wrote:
>
>> I'd better write a draft and shed some light on a safety
>> area. Then here is the first humble attempt:
>>
>>   https://salsa.debian.org/lumin/deeplearning-policy
>
> The policy looks good to me.
>
> A couple of situations this related to this policy:
>
> https://bugs.debian.org/699609
> https://ffmpeg.org/pipermail/ffmpeg-devel/2018-July/231828.html


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Andreas Tille-5
Hi Andreas,

On 2019-05-21 09:07, Andreas Tille wrote:
> Not sure whether this is sensible to be added to the issue
> tracker.

I always abuse the issue tracker in my personal repositories.

> Quoting from your section "Questions Not Easy to Answer"
>
>
>   1. Must the dataset for training a Free Model present in our archive?
>      Wikipedia dump is a frequently used free dataset in the computational
>      linguistics field, is uploading wikipedia dump to our Archive sane?
>
> I have no idea about the size of this kind of dump.  Recently I've read
> that data sets for other programs tend into the direction of 1GB.  In
> Debian Med I'm maintaining metaphlan2-data with 204MB which would be
> even larger if there would not be some method for "data reduction" would
> be used that is considered a bug (#839925) by other DDs.

As pointed out by Mattias Wadenstein (thanks for the data point), the
wikipedia dump is large enough to challenge the .deb format (recent
threads).

>   2. Should we re-train the Free Models on buildd? This is crazy. Let's
>      don't do that right now.
>
> If you ask me bothering buildd with this task is insane.  However I'm
> positively convinced that we should ship the training data and be able
> to train the models from these.

It's always good if we can do these things purely within our archive.
However, sometimes it's just not easy to enforce: datasets used by DL
are generally large (several hundred MB to several TB, or even larger).


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Julien PUYDT-2
Hi

On 21 May 2019 13:45, Mo Zhou <[hidden email]> wrote:

It's always good if we can do these things purely with our archive.
However sometimes it's just not easy to enforce: datasets used by DL
are generally large, (several hundred MB ~ several TB or even larger). 


And even with the data, the training might need an awfully powerful box *and* weeks of computation *and* some of the algorithms aren't deterministic, so reproducibility is a problem, not only for Debian but for the scientific community at large.

jpuydt


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi,

Some additional data points:

* In order to train the most widely used convolutional neural network,
  I use 4 * GTX 1080Ti cards on an 8-card machine. The network occupies
  around 40 GiB of video memory during training.

* The GTX 1080 is about the lowest-end card acceptable for research or
  production. More common choices for well-funded groups are the Nvidia
  Titan X cards or Tesla cards.

* The state-of-the-art natural language representation model, BERT, takes
  2 weeks to train on a TPU at a cost of about $500.
  https://github.com/google-research/bert
  A CPU cannot do that in any reasonable amount of time.

For the reproducibility problem: in the definition of "Free Model",
I mentioned that the model *should be reproducible* with a fixed
random seed. This is also good practice for ML/DL engineers
and researchers.
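
To make "fixed random seed" concrete, here is a minimal sketch
(assuming PyTorch; neither the draft nor this thread prescribes a
framework, and the seed value is an arbitrary placeholder):

```
# Sketch: pin every source of randomness before training so that two
# runs on the same machine produce the same model.
import random
import numpy as np
import torch

SEED = 42  # any fixed integer works; 42 is a placeholder

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# GPU kernels add their own non-determinism; these flags trade speed
# for repeatability. Bit-exact results across *different* hardware are
# still not guaranteed.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```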

On 2019-05-21 12:10, [hidden email] wrote:

> Hi
>
> On 21 May 2019 13:45, Mo Zhou <[hidden email]> wrote:
>
>> It's always good if we can do these things purely with our archive.
>>
>> However sometimes it's just not easy to enforce: datasets used by DL
>>
>> are generally large, (several hundred MB ~ several TB or even
>> larger).
>
> And even with the data, the training might need an awfully powerful
> box *and* weeks of computation *and* some of the algorithms aren't
> deterministic, so reproducibility is a problem, not only for Debian
> but for the scientific community at large.
>
> jpuydt


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Ben Hutchings-3
In reply to this post by Mo Zhou
On Tue, 2019-05-21 at 00:11 -0700, Mo Zhou wrote:
[...]
> People do lazy execution on this problem. Now that a
> related package entered my packaging radar, and I think
> I'd better write a draft and shed some light on a safety
> area. Then here is the first humble attempt:
>
>   https://salsa.debian.org/lumin/deeplearning-policy
>   (issue tracker is enabled)
[...]

Thanks for this.  Something I don't quite understand is the division
into 3 categories.  You write:

> 2. A ToxicCandy Model refers to a free software licensed model,
> trained from unknown or non-free dataset [...]

> 3. A model is Non-free Model as long as any of the following
> conditions is satisfied: (1) trained from unknown/non-free data [...]

Is category 2 intended to be a subset of category 3, or am I missing
some distinction?

Ben.

--
Ben Hutchings
Any sufficiently advanced bug is indistinguishable from a feature.



Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Paul Wise via nm
In reply to this post by Mo Zhou
On Tue, 2019-05-21 at 03:14 -0700, Mo Zhou wrote:

> They are added to the case study section.

Are there any other case studies we could add?

Has anyone repeated the training of Mozilla DeepSpeech for example?

Are deep learning models deterministically and reproducibly trainable?
If I re-train a model using the exact same input data on different
(GPU?) hardware will I get the same bits out at the end?

--
bye,
pabs

https://wiki.debian.org/PaulWise



Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Ben Hutchings-3
Hi Ben,

Good catch! I'm quite sure the 3 categories are not meant to overlap
with each other, and I've fixed the language to make that logically
correct:

  A **ToxicCandy Model** refers to an explicitly free software licensed
  model, trained from unknown or non-free dataset.

  A model is **Non-free Model** as long as any of the following
  conditions is satisfied: (1) trained from unknown/non-free data and
  released WITHOUT explicit free software license declaration; ...

Category 2 is a special but common case: a warm-hearted upstream
wants to share the training results freely, but the results are
actually trained from non-free data, and the free software community
could never reproduce them with purely free material.

Category 3 is easier and more obvious to identify than category 2.

Fixed in the git repo.

On 2019-05-21 21:43, Ben Hutchings wrote:

> Thanks for this.  Something I don't quite understand is the division
> into 3 categories.  You write:
>
>> 2. A ToxicCandy Model refers to a free software licensed model,
>> trained from unknown or non-free dataset [...]
>
>> 3. A model is Non-free Model as long as any of the following
>> conditions is satisfied: (1) trained from unknown/non-free data [...]
>
> Is category 2 intended to be a subset of category 3, or am I missing
> some distinction?
>
> Ben.


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Paul Wise via nm
Hi Paul,

On 2019-05-21 23:52, Paul Wise wrote:
> Are there any other case studies we could add?

Anybody is welcome to open an issue and add more
cases to the document. I can dig into them in the
future.

> Has anyone repeated the training of Mozilla DeepSpeech for example?

Generally speaking, training is non-trivial and
requires expensive hardware. This fact will clearly
reduce the probability that "someone has tried to
reproduce it".

A real example that illustrates how hard reproducing a
**giant** model is, is BERT, one of the state-of-the-art
natural language representation models, which takes
2 weeks to train on a TPU at a cost of about $500.

Cite:
https://github.com/google-research/bert#pre-training-tips-and-caveats

> Are deep learning models deterministically and reproducibly trainable?
> If I re-train a model using the exact same input data on different
> (GPU?) hardware will I get the same bits out at the end?

Making the training program reproducible is good practice for everyone
who trains / debugs neural networks. I once wrote a simple deep learning
framework with only the C++ STL and fell into many pitfalls along the way.
Reproducibility is very important for debugging, as mathematical
bugs are much harder to diagnose than code bugs.

I wrote a dedicated section about reproducibility:
https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Paul Wise via nm
Hi,

On 2019-05-21 23:52, Paul Wise wrote:
> Has anyone repeated the training of Mozilla DeepSpeech for example?

By chance I found, in a pile of papers about attacking AI models, one
showing that Berkeley researchers have successfully attacked DeepSpeech:

IMHO, try not to ask AI to deal with any critical task unless one
understands the security risk. Maybe attacking AI models will
be what future hackers do?

```quote from https://arxiv.org/abs/1801.01944
Abstract

We construct targeted audio adversarial examples on automatic speech
recognition. Given any audio waveform, we can produce another that
is over 99.9% similar, but transcribes as any phrase we choose
(recognizing up to 50 characters per second of audio). We apply our
white-box iterative optimization-based attack to Mozilla's
implementation DeepSpeech end-to-end, and show it has a 100% success
rate. The feasibility of this attack introduce a new domain to study
adversarial examples.
```
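
The paper above uses a white-box iterative optimization attack on
audio. As a rough illustration of the general idea only (not the
paper's method), here is the textbook fast gradient sign method (FGSM)
on a classifier -- a minimal sketch, assuming a PyTorch model:

```
# Sketch of FGSM: a tiny perturbation derived from the model's own
# gradients can flip its prediction while staying nearly invisible in
# the input. Illustrative only; not the audio attack quoted above.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, label, epsilon=0.01):
    """Return x plus an adversarial perturbation of magnitude epsilon."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # Step in the direction that increases the loss the most.
    return (x + epsilon * x.grad.sign()).detach()
```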


Concern for: A humble draft policy on "deep learning v.s. freedom"

Osamu Aoki
In reply to this post by Mo Zhou
Hi,

On Tue, May 21, 2019 at 12:11:14AM -0700, Mo Zhou wrote:
> Hi people,

I see your good intention, but this is basically changing the status quo
for the requirements for main.

>   https://salsa.debian.org/lumin/deeplearning-policy
>   (issue tracker is enabled)

I read it ;-)

> This draft is conservative and overkilling, and currently
> only focus on software freedom. That's exactly where we
> start, right?

OK, but it can't be where we end up.

Before scientific "deep learning" data, we already have practical "deep
learning" data in our archive.

Please note that one of the most popular Japanese input methods, mozc,
will be kicked out of main for a start if we start enforcing this new
guideline.

> Specifically, I defined 3 types of pre-trained machine
> learning models / deep learning models:
>
>   Free Model, ToxicCandy Model. Non-free Model
>
> Developers who'd like to touch DL software should be
> cautious to the "ToxicCandy" models. Details can be
> found in my draft.

Labeling the situation with something like "ToxicCandy Model" makes a bad
impression on people, and I am afraid people may not make a rational
decision.  Is this characterization a correct and sane one?  At least,
it looks to me that this is changing the status quo of our policy and
practice severely.  So it is worth evaluating the idea without the labeling.

As long as the "data" comes in a form which allows us to modify it and
re-train it to make it better with a set of free software tools,
we shouldn't make it non-free, for sure.  That is my position, and I
think this is how we have operated as a project.  We never asked how
things were originally made.  The touchy question is how easy it should
be to modify and re-train, etc.

Let's list analogous cases.  We allow a photo of something in our archive
as wallpaper etc.  We don't require the object of the photo or the tool
used to make it to be FREE.  The Debian logo is one example, which was
created with Photoshop as I understand it.  Another analogy to consider
is how we allow independent copyright and license for dictionary-like
data which must have been processed from previously copyrighted (possibly
non-free) texts by a human brain and maybe with some script processing.
Packages such as opendict, *spell-*, dict-freedict-all, ... are in main.

I agree it is nice to have the base data in the package.  If you can,
please include the training data if it is a FREE set.  But it may be
unrealistic for Debian to get into the business of distributing many GB
of training data for this purpose.  You may be talking about data sizes
over tens of GB.  This is another thing you should realize -- so mandating
its inclusion is impractical, since it is not the focus point on which
Debian needs to spend its resources.

Let's talk about actual cases in main.

"mecab" is free a tool for Japanese text morphological analysis which
can create CRF optimized parameters from the marked-up training data.

(This is also the base of mozc, which uses such data to produce the
desired typing output for normal Japanese text input from the keyboard.)

One of the dictionaries for mecab is an 800MB compressed deb in main:
unidic-mecab, which is 2.2GB of data in text format containing CRF-optimized
parameters and other text data obtained by training. These texts and
parameters are triple-licensed BSD/LGPL/GPL. Re-training this is a very
straightforward application of the mecab tool with additional data only.
So this is as FREE as it can be under current practice, and we have it in main.
  https://unidic.ninjal.ac.jp/

When these CRF parameters were initially made, they used non-free data
(Japanese Government funded) available on multiple DVDs with a hefty price
and restrictions on its use and redistribution.  This base data for
training is as NON-FREE as it can be, so we don't distribute it.
  https://pj.ninjal.ac.jp/corpus_center/bccwj/dvd-index.html

In the case of MOZC, the original training data is only available inside
Google and not published by them.  Actually, tweaking the data is possible,
but consistently retraining this data in MOZC may not be a trivial
application of the mecab tool.  We are placing this in main now anyway,
since its data (CRF-optimized parameters and other text data) are
licensed under BSD-3-clause, and we have MOZC in main.

Regards,

Osamu


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi Osamu,

On 2019-06-08 18:43, Osamu Aoki wrote:
>> This draft is conservative and overkilling, and currently
>> only focus on software freedom. That's exactly where we
>> start, right?
>
> OK but it can't be where we end-up-with.

That's why I said the two words "conservative" and "overkilling".
In my blueprint we can actually loosen these restrictions bit
by bit with further case study.

> Before scientific "deep learning" data, we already have practical "deep
> learning" data in our archive.

Thanks for pointing them out. They are good case studies
to help me revise the DL-Policy.

> Please note one of the most popular Japanese input method mozc will be
> kicked out from main as a starter if we start enforcing this new
> guideline.

I'm in no position to irresponsibly enforce an experimental
policy without having done enough case studies.

>> Specifically, I defined 3 types of pre-trained machine
>> learning models / deep learning models:
>>
>>   Free Model, ToxicCandy Model. Non-free Model
>>
>> Developers who'd like to touch DL software should be
>> cautious to the "ToxicCandy" models. Details can be
>> found in my draft.
>
> With a labeling like "ToxicCandy Model" for the situation, it makes bad
> impression on people and I am afraid people may not be make rational
> decision.  Is this characterization correct and sane one?  At least,
> it looks to me that this is changing status-quo of our policy and
> practice severely.  So it is worth evaluating idea without labeling.

My motivation for the naming "ToxicCandy" is pure: to warn developers
about this special case, as it may lead to very difficult copyright
or software freedom questions. I admit that this name doesn't look
quite friendly. Maybe "SemiFree" looks better?

> As long as the "data" comes in the form which allows us to modify it and
> re-train it to make it better with a set of free software tools to do it,
> we shouldn't make it non-free, for sure.  That is my position and I
> think this was what we operated as the project.  We never asked how they
> are originally made.  The touchy question is how easy it should be to
> modify and re-train, etc.
>
> Let's list analogy cases.  We allow a photo of something on our archive
> as wallpaper etc.  We don't ask object of photo or tool used to make it
> to be FREE.  Debian logo is one example which was created by Photoshop
> as I understand.  Another analogy to consider is how we allow
> independent copyright and license for the dictionary like data which
> must have processed previous copyrighted (possibly non-free) texts by
> human brain and maybe with some script processing.  Packages such as
> opendict, *spell-*, dict-freedict-all, ... are in main.
>
> I agree it is nice to have base data in the package.  If you can, please
> include the training data if it is a FREE set.  But it may become
> unrealistic for Debian to getting into business of distributing many GB
> of training data for this purpose.  You may be talking data size being over
> 10s of GB.  This is another thing you should realize -- So mandating its
> inclusion is unpractical since it is not the focus point on which Debian
> needs to spend its resource.
>
> Let's talk about actual cases in main.
>
> "mecab" is free a tool for Japanese text morphological analysis which
> can create CRF optimized parameters from the marked-up training data.
>
> (This is also the base of mozc which uses such data to create desirable
> typing output in normal Japanese text input from the keyboard.)
>
> One of the dictionary for mecab is 800MB compressed deb in main:
> unidic-mecab which is 2.2GB data in text format containing CRF optimized
> parameters and other text data obtained by training. These text and
> parameters are triple licensed BSD/LGPL/GPL. Re-training this is very
> straight forward application of mecab tool with additional data only.
> So this is FREE as it can be in current practice and we have it in main.
>   https://unidic.ninjal.ac.jp/
>
> When these CRF parameters were initially made, it used non-free data
> (Japanese Government funded) available in multiple DVDs with hefty price
> and restriction on its use and its redistribution.  This base data for
> training is as NON-FREE as it can be so we don't distribute.
>   https://pj.ninjal.ac.jp/corpus_center/bccwj/dvd-index.html
>
> In case of MOZC, the original training data is only available in Google
> and not published by them.  Actually, tweaking data is possible but
> consistently retraining this data in MOZC may not be a trivial
> application of mecab tool.  We are placing this in main now, anyway
> since its data (CRF optimized parameters and other text data ) are
> licensed under BSD-3-clause and we have MOZC in main.

Thank you Osamu. These cases inspired me to find a better
balance point for the DL-Policy. I'll add these cases to the case
study section, and I'm going to add the following points to the DL-Policy:

1. Free datasets used to train a FreeModel are not required to be uploaded
   to our main section, for example those Osamu mentioned and the wikipedia
   dump. We are not a scientific data archiving organization, and these
   data would blow up our infra if we uploaded too much.

2. It's not required to re-train a FreeModel on our infra, because
   the outcome/cost ratio is impractical. The outcome is nearly zero
   compared to directly using a pre-trained FreeModel, while the cost
   is increased carbon dioxide in our atmosphere and wasted developer
   time. (Deep learning is producing much more carbon dioxide than we
   thought.)

   For classical probabilistic graphical models such as MRF or the
   CRF mentioned above, the training process might be trivial, but
   re-training is still not required.

For SemiFreeModel I still hesitate to make any decision. Once we let
them enter the main section, there will be many unreproducible
or hard-to-reproduce but surprisingly "legal" (in terms of DL-Policy)
files. Maybe this case is to some extent similar to artwork and fonts.
Further study is needed. And it's still not easy to find a balance point
for SemiFreeModel between usefulness and freedom.

Thanks,
Mo.


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Yao Wei (魏銘廷)-2
Hi,

>> With a labeling like "ToxicCandy Model" for the situation, it makes bad
>> impression on people and I am afraid people may not be make rational
>> decision.  Is this characterization correct and sane one?  At least,
>> it looks to me that this is changing status-quo of our policy and
>> practice severely.  So it is worth evaluating idea without labeling.
>
> My motivation for the naming "ToxicCandy" is pure: to warn developers
> about this special case as it may lead to very difficult copyright
> or software freedom questions. I admit that this name looks not
> quite friendly. Maybe "SemiFree" look better?

The term "ToxicCandy" reminds me of an existing term, "Tainted",
which is also used in the Linux kernel to describe a kernel
running with non-free modules loaded.

So... how about "Tainted Model"?

Just 2 cents,
Yao Wei

(This email is sent from a phone; sorry for HTML email if it happens.)


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Osamu Aoki
In reply to this post by Mo Zhou
Hi Mo,

On Sat, Jun 08, 2019 at 10:07:13PM -0700, Mo Zhou wrote:

> Hi Osamu,
>
> On 2019-06-08 18:43, Osamu Aoki wrote:
> >> This draft is conservative and overkilling, and currently
> >> only focus on software freedom. That's exactly where we
> >> start, right?
> >
> > OK but it can't be where we end-up-with.
>
> That's why I said the two words "conservative" and "overkilling".
> In my blueprint we can actually loosen these restrictions bit
> by bit with further case study.

Yes, we agree here!

> > Before scientific "deep learning" data, we already have practical "deep
> > learning" data in our archive.
>
> Thanks for pointing them out. They are good case study
> for me to revise the DL-Policy.
>
> > Please note one of the most popular Japanese input method mozc will be
> > kicked out from main as a starter if we start enforcing this new
> > guideline.
>
> I'm in no position of irresponsibly enforcing an experimental
> policy without having finished enough case study.

I noticed that, since you were thinking deeply enough, but I saw some danger
of other people making decisions too quickly based on the "labeling".

Please check our history on the following GRs:
 https://www.debian.org/vote/2004/vote_003
 https://www.debian.org/vote/2006/vote_004

We are stuck with "Further discussion" at this moment.

> >> Specifically, I defined 3 types of pre-trained machine
> >> learning models / deep learning models:
> >>
> >>   Free Model, ToxicCandy Model. Non-free Model
> >>
> >> Developers who'd like to touch DL software should be
> >> cautious to the "ToxicCandy" models. Details can be
> >> found in my draft.
> >
> > With a labeling like "ToxicCandy Model" for the situation, it makes bad
> > impression on people and I am afraid people may not be make rational
> > decision.  Is this characterization correct and sane one?  At least,
> > it looks to me that this is changing status-quo of our policy and
> > practice severely.  So it is worth evaluating idea without labeling.
>
> My motivation for the naming "ToxicCandy" is pure: to warn developers
> about this special case as it may lead to very difficult copyright
> or software freedom questions. I admit that this name looks not
> quite friendly. Maybe "SemiFree" look better?

Although I understand the intent of "SemiFree" or "Tainted" (by Yao), I
don't think these are a good choice.  We need to draw a line between
FREE(=main) and NON-FREE(non-free) as an organization.  I think there are
2 FREE models we are allowing for "main" as the current practice:

 * Pure      Free Model from pure free pre-train data only
 * Sanitized Free Model from free and non-free mixed pre-train data

And, we don't allow Non-Free Model in "main"

The question is when do you call it "sanitized" (or "distilled"), i.e.
clean enough to qualify for "main" ;-)

> > As long as the "data" comes in the form which allows us to modify it and
> > re-train it to make it better with a set of free software tools to do it,
> > we shouldn't make it non-free, for sure.  That is my position and I
> > think this was what we operated as the project.  We never asked how they
> > are originally made.  The touchy question is how easy it should be to
> > modify and re-train, etc.
> >
> > Let's list analogy cases.  We allow a photo of something on our archive
> > as wallpaper etc.  We don't ask object of photo or tool used to make it
> > to be FREE.  Debian logo is one example which was created by Photoshop
> > as I understand.  Another analogy to consider is how we allow
> > independent copyright and license for the dictionary like data which
> > must have processed previous copyrighted (possibly non-free) texts by
> > human brain and maybe with some script processing.  Packages such as
> > opendict, *spell-*, dict-freedict-all, ... are in main.

...

> Thank you Osamu. These cases inspired me on finding a better
> balance point for DL-Policy. I'll add these cases to the case
> study section, and I'm going to add the following points to DL-Policy:
>
> 1. Free datasets used to train FreeModel are not required to upload
>    to our main section, for example those Osamu mentioned and wikipedia
>    dump. We are not scientific data archiving organization and these
>    data will blow up our infra if we upload too much.
>
> 2. It's not required to re-train a FreeModel with our infra, because
>    the outcome/cost ratio is impractical. The outcome is nearly zero
>    compared to directly using a pre-trained FreeModel, while the cost
>    is increased carbon dioxide in our atmosphere and wasted developer
>    time. (Deep learning is producing much more carbon dioxide than we
>    thought).
>
>    For classical probablistic graph models such as MRF or the mentioned
>    CRF, the training process might be trivial, but re-training is still
>    not required.

... but re-training is highly desirable, in line with the spirit of
free software.

> For SemiFreeModel  I still hesitate to make any decision. Once we let
      SanitizedModel
> them enter the main section there will be many unreproducible
> or hard-to-reproduce but surprisingly "legal" (in terms of DL-Policy)
> files. Maybe this case is to some extent similar to artworks and fonts.
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                           YES.
> Further study needed. And it's still not easy to find a balance point
> for SemiFreeModel  between usefulness and freedom.
      SanitizedModel

Let's use SanitizedModel to be neutral.

We need to have some guideline principle for this sanitization process.
(I don't have an answer now)

This sanitization mechanism shouldn't be used to include obfuscated
binary blob equivalents.  That would be worse than the FIRMWARE case,
since it runs on the same CPU as the program code.

Although "Further Discussion" was the outcome, B in
https://www.debian.org/vote/2006/vote_004 is worth looking at:
  Strongly recommends that all non-programmatic works distribute the form
  that the copyright holder or upstream developer would actually use for
  modification. Such forms need not be distributed in the orig.tar.gz
  (unless required by license) but should be made available on upstream
  websites and/or using Debian project resources.

Please note this is "Strongly recommends ... should be made
available..." and not "must be made available ...".

Aside from the Policy/Guideline for the FREE/NON-FREE discussion, we also
need to address the spirit of reproducible builds.  It is nice to have a
checking mechanism for the validity and health of these MODELs.  I know
one of the Japanese keyboard input methods, "Anthy", is suffering some
regression in the upcoming release.  The fix was found too late, so I
uploaded to experimental since it contained too many changes while the
impact was subtle.  If we had a test suite with numerical score outputs,
we could have detected such upstream regressions.  It may be
unrealistic to aim for an exact match for such a probabilistic model,
but an objectively traceable measure is very desirable to have.

Osamu


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi Osamu,

On 2019-06-09 08:28, Osamu Aoki wrote:
> Although I understand the intent of "SemiFree" or "Tainted" (by Yao), I
> don't think these are a good choice.  We need to draw a line between
> FREE(=main) and NON-FREE(non-free) as a organization.  I think there are

There is no such line, as a big grey area exists. Pure-free models plus
pure-non-free models do not cover all the possible cases, but
Free + SemiFree + NonFree covers all possible cases.

SemiFree lies in a grey area because the ways people interpret it vary:

1. If one regards a model as a sort of human artifact such as artwork
   or a font, a free-software-licensed SemiFreeModel is free even if
   it's trained from non-free data. (Ah, yes, there is an MIT license!
   It's a free blob made by a human.)

2. If one regards a model as the product of a mathematical process
   such as training or compilation, a free-software-licensed
   SemiFreeModel is actually non-free. (Oops, where did these
   MIT-licensed digits come from and how can I reproduce them? Can I
   trust the source? What if the MIT-licensed model is trained from
   evil data but we don't know?)

I'm not going to draw a line across this grey area, or should I say
minefield. Personally, I prefer the second interpretation.

> 2 FREE models we are allowing for "main" as the current practice.
>
>  * Pure      Free Model from pure free pre-train data only
>  * Sanitized Free Model from free and non-free mixed pre-train data

Please don't make the definition of FreeModel complicated.
A FreeModel should be literally and purely free.
We can divide SemiFreeModel into several categories according to
future case studies and make the DL-Policy properly match practice.

> And, we don't allow Non-Free Model in "main"

I think no one would argue about NonFreeModel.

> Question is when do you call it "sanitized" (or "distilled") to be clean
> enough to qualify for "main" ;-)

I expect a model, once sanitized, to be purely free; for example, by
removing all non-free data from the training dataset and using only
free training data. Any single piece of non-free data pulls the model
into the minefield.

>> 2. It's not required to re-train a FreeModel with our infra, because
>>    the outcome/cost ratio is impractical. The outcome is nearly zero
>>    compared to directly using a pre-trained FreeModel, while the cost
>>    is increased carbon dioxide in our atmosphere and wasted developer
>>    time. (Deep learning is producing much more carbon dioxide than we
>>    thought).
>>
>>    For classical probablistic graph models such as MRF or the mentioned
>>    CRF, the training process might be trivial, but re-training is still
>>    not required.
>
> ... but re-training is highly desirable in line with the spirit of the
> free software.

I guess you didn't catch my point. In my definition of FreeModel and the
SemiFree/ToxicCandy model, providing a training script is mandatory. Any
model without a training script must be non-free. This requirement also
implies that the upstream must provide all information about the datasets
and the training process. Software freedom can be guaranteed even if
we don't always re-train the free models, as that would only waste
electricity. On the other hand, developers should check whether a model
provides such freedom, and local re-training as a verification step
is encouraged.

Enforcing re-training would be a painful decision and would drive
energetic contributors away, especially when a contributor refuses to
use Nvidia suckware.

> Let's use SanitizedModel to be neutral.

Once sanitized, a model should turn into a free model. If it doesn't,
then why sanitize the model at all?

> We need to have some guideline principle for this sanitization process.
> (I don't have an answer now)

I need case studies at this point.

> This sanitization mechanism shouldn't be used to include obfuscated
> binary blob equivalents.  It's worse than FIRMWARE case since it runs on
> the same CPU as the program code.
>
> Although "Further Discussion" was the outcome, B in
> https://www.debian.org/vote/2006/vote_004 is worth looking at:
>   Strongly recommends that all non-programmatic works distribute the form
>   that the copyright holder or upstream developer would actually use for
>   modification. Such forms need not be distributed in the orig.tar.gz
>   (unless required by license) but should be made available on upstream
>   websites and/or using Debian project resources.
>
> Please note this is "Strongly recommends ... should be made
> available..." and not "must be made available ...".

Umm....

> Aside from Policy/Guideline for FREE/NON-FREE discussion, we also need
> to address for the spirit of the reproducible build.  It is nice to have
> checking mechanism for the validity and health of these MODELs.  I know
> one of the Japanese keyboard input method "Anthy" is suffering some
> regression in the upcoming release.  The fix was found too late so I
> uploaded to experimental since it contained too many changes while
> impact was subtle.  If we had a test suite with numerical score outputs,
> we could have detected such regressions by the upstream.  It may be
> unrealistic to aim for exact match for such probabilistic model but
> objectively traceable measure is very desirable to have.

Isn't this checking mechanism part of the upstream's work? When developing
machine learning software, model reproducibility (two different runs
should produce very similar results) is important.

This reproducibility issue is quite different from that of code.
A software upstream doesn't compile a C++ program twice to see whether
the same hashsum is produced, because a mismatch would be a compiler bug.
For a machine learning program, if training produced a model with 95%
accuracy the first time but merely 30% accuracy on the second run,
it's a fatal bug in the program itself. (94% on the second run may
be acceptable.)
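
As a concrete illustration of such a tolerance-based check (a minimal
sketch; the function name, numbers and tolerance are placeholders, not
taken from any real package or test suite):

```
# Sketch: two training runs need not be bit-identical, but their
# evaluation scores should agree within a small margin.
def check_reproducibility(train_and_evaluate, tolerance=0.02):
    """train_and_evaluate() trains a model and returns its test accuracy."""
    acc_first = train_and_evaluate()   # e.g. 0.95
    acc_second = train_and_evaluate()  # e.g. 0.94 would pass, 0.30 would not
    assert abs(acc_first - acc_second) <= tolerance, (
        f"accuracy diverged between runs: {acc_first:.2f} vs {acc_second:.2f}")
```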


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Osamu Aoki
Hi,

Let's think in a bit different perspective.

What is the outcome of "Deep Learning"?  That's "knowledge".

If the dictionary of "knowledge" is expressed in a freely usable
software format with a free license, isn't that enough?

If you want more for your package, that's fine.  Please promote such a
program for your project.  (FYI: the reason I spent my time fixing
"anthy" for Japanese text input is that I didn't like the way "mozc"
looked like a sort of dump-ware by Google, containing a freely licensed
dictionary of "knowledge" without free base training data.)  But placing
some kind of fancy purist "Policy" wording to police other software
doesn't help FREE SOFTWARE.  We got rid of Netscape from Debian because
we now have a good, functional free alternative.

If you can make a model without any reliance on non-free base training
data for your project, that's great.

I think it is dangerous and counterproductive to deprive people of
access to useful software functionality by requiring that only free
data be used to obtain "knowledge".

Please note that re-training will not erase "knowledge".  It usually
just mixes new "knowledge" into the existing dictionary of "knowledge".
So the resulting dictionary of "knowledge" is not completely free of
the original training data.  We really need to treat this kind of
dictionary of "knowledge" in line with artwork --- not as software
code.

The training process itself may be mathematical, but the preparation of
training data and the iterative process of providing re-calibrating
data sets involves huge human input.

> Enforcing re-training will be a painful decision...

Hmmm... this may depend on what kind of re-training.

At least for unidic-mecab, re-training to add many new words to be
recognized by the morphological analyzer is an easier task.  People have
used unidic-mecab and a web crawler to create an even bigger dictionary
with minimal re-training work (mostly automated, I guess.)
  https://github.com/neologd/mecab-unidic-neologd/

I can't imagine re-creating the original core dictionary of "knowledge"
for Japanese text processing purely by training with newly provided free
data, since it takes too much human work, and I agree it is unrealistic
without a serious government or corporate sponsorship project.

Also, the "knowledge" for Japanese text processing should be able to
cover non-free texts.  Without using non-free texts as input data, how
do you know it works on them.

> Isn't this checking mechanism a part of upstream work? When developing
> machine learning software, the model reproduciblity (two different runs
> should produce very similar results) is important.

Do you always have the luxury of relying on such a friendly/active
upstream?  If so, I see no problem.  But what should we do if not?

Anthy's upstream is practically the Debian repo now.

Osamu


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi Osamu,

On 2019-06-09 13:48, Osamu Aoki wrote:
> Let's think in a bit different perspective.
>
> What is the outcome of "Deep Lerning".  That's "knowledge".

Don't mix everything into a single obscure word, "knowledge".
That thing is not representable through a programming language
or mathematical language, because we cannot define what
"knowledge" is in an unambiguous way. Squashing everything
into "knowledge" does exactly the inverse of what I'm doing.

> If the dictionary of "knowledge" is expressed in a freely usable
> software format with free license, isn't it enough?

A free license doesn't solve all my concerns. If we just treat
models as a sort of artwork, what if

1. upstream happened to license a model trained from non-free
   data under the GPL? Is upstream violating the GPL by not releasing
   the "source" (or the material necessary to reproduce the work)?

2. upstream trained a model on a private dataset that contains
   deliberately evil data, and released it under the MIT license?
   (Then malware has just sneaked into main?)

I have to consider all possible models and applications in
the whole machine learning and deep learning area. The experience
learned from input methods cannot cover all possible cases.

A pile of digits from a classical machine learning model is
generally interpretable. That means a human can understand what
each digit means (e.g. a conditional probability, a frequency, etc.).

A pile of digits from a deep neural network is basically not
interpretable -- humans cannot fully understand them. Something
malicious could hide in this pile of digits due to the complexity
of the non-linear mapping that the neural network has learned.

Proposed updates:

1. If a SemiFreeModel won't raise any security concern, we
   can accept it into the main section. For an imagined example,
   upstream foobar wrote an input method and trained a probabilistic
   model based on the developer's personal diary. The upstream released
   the model under a free license but didn't release his/her diary.
   Such a model is fine, as it doesn't incur any security problem.

2. A security-sensitive SemiFreeModel is prohibited from entering
   the main section. Why should we trust it if we cannot inspect
   everything about it?

Let me emphasize this again: don't forget security when talking
about machine learning models and deep learning models. Data
used to train an input method does no harm in any way, but data
used to train a model that controls authentication is ...
Security concerns are inevitable with the industrial application
of deep learning.

Maybe I'm just too sensitive after reading ~100 papers about
attacking/fooling machine learning models. Here is a ridiculous
example: [Adversarial Reprogramming of Neural Networks](https://arxiv.org/abs/1806.11146)

> If you want more for your package, that's fine.  Please promote such
> program for your project.  (FYI: the reason I spent my time for fixing
> "anthy" for Japanese text input is I didn't like the way "mozc" looked
> as a sort of dump-ware by Google containing the free license dictionary
> of "knowledge" without free base training data.)  But placing some kind
> of fancy purist "Policy" wording to police other software doesn't help
> FREE SOFTWARE.  We got rid of Netscape from Debian because we now have
> good functional free alternative.
>
> If you can make model without any reliance to non-free base training
> data for your project, that's great.

I'll create a subcategory under SemiFreeModel as an umbrella for input
methods and the like, to reduce the overkill level of the DL-Policy,
after reviewing the code myself. It may take some time because I have
to understand how things work.

> I think it's a dangerous and counter productive thing to do to deprive
> access to useful functionality of software by requesting to use only
> free data to obtain "knowledge".

The policy needs to balance not only usefulness/productivity but also
software freedom (as per the definition), reproducibility, security,
feasibility, and difficulty.

The first priority is software freedom rather than productivity
when we can only choose one, even if users will complain.
That's why our official ISO cannot ship the ZFS kernel module
or the very useful non-free firmware and the like.

> Please note that the re-training will not erase "knowledge".  It usually
> just mix-in new "knowledge" to the existing dictionary of "knowledge".
> So the resulting dictionary of "knowledge" is not completely free of
> the original training data.  We really need to treat this kind of
> dictionary of "knowledge" in line with artwork --- not as a software
> code.

My interpretation of "re-train" is "train from scratch again" instead
of "train incrementally". For neural networks the "incremental training"
process is called "fine-tuning".

I understand that you don't wish the DL-Policy to kick out input methods
and the like and let developers down, and this will be sorted out soon...

> Training process itself may be mathematical, but the preparation of
> training data and its iterative process of providing the re-calibrating
> data set involves huge human inputs.

I don't buy it, because I cannot ignore my concerns.

>> Enforcing re-training will be a painful decision...
>
> Hmmm... this may depends on what kind of re-training.

Within the DL-Policy's scope of discussion, the word "re-training"
has a global effect.

> At least for unidic-mecab, re-training to add many new words to be
> recognized by the morphological analyzer is an easier task.  People has
> used unidic-mecab and web crawler to create even bigger dictionary with
> minimal work of re-training (mostly automated, I guess.)
>   https://github.com/neologd/mecab-unidic-neologd/
>
> I can't imagine to re-create the original core dictionary of "knowledge"
> for Japanese text processing purely by training with newly provided free
> data since it takes too much human works and I agree it is unrealistic
> without serious government or corporate sponsorship project.
>
> Also, the "knowledge" for Japanese text processing should be able to
> cover non-free texts.  Without using non-free texts as input data, how
> do you know it works on them.

Understood. The information you provided is enough to help the DL-Policy
set up an area for input methods and prevent them from being kicked
out of the archive (given that the fundamental requirements hold).

>> Isn't this checking mechanism a part of upstream work? When developing
>> machine learning software, the model reproduciblity (two different runs
>> should produce very similar results) is important.
>
> Do you always have a luxury of relying on such friendly/active upstream?
> If so, I see no problem.  But what should we do if not?

Generally speaking, deep learning software that fails to reproduce
in any way is rubbish and should not be packaged. Special cases such
as input methods or board game models trained collectively by a
community may exist, but they cannot be used to derive the general
rule.

> Anthy's upstream is practically Debian repo now.
>
> Osamu
