Bits from /me: A humble draft policy on "deep learning v.s. freedom"


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi Osamu,

On 2019-06-09 08:28, Osamu Aoki wrote:
> Although I understand the intent of "SemiFree" or "Tainted" (by Yao), I
> don't think these are a good choice.  We need to draw a line between
> FREE(=main) and NON-FREE(non-free) as an organization.  I think there are

There is no such line, because a big grey area exists. Pure-free models
plus pure-non-free models don't cover all the possible cases, but
Free + SemiFree + NonFree does.

SemiFree lies in a grey area because the ways people interpret it vary:

1. If one regards a model as a sort of human artifact, such as artwork
   or a font, then a free-software-licensed SemiFreeModel is free even if
   it was trained on non-free data. (Ah, yes, there is an MIT license!
   It's a free blob made by a human.)

2. If one regards a model as the product of a mathematical process
   such as training or compilation, then a free-software-licensed
   SemiFreeModel is actually non-free. (Oops, where did these
   MIT-licensed digits come from, and how can I reproduce them?
   Can I trust the source? What if the MIT-licensed model was trained
   on evil data but we don't know?)

I'm not going to draw a line across this grey area, or shall I say
minefield. Personally, I prefer the second interpretation.

> 2 FREE models we are allowing for "main" as the current practice.
>
>  * Pure      Free Model from pure free pre-train data only
>  * Sanitized Free Model from free and non-free mixed pre-train data

Please don't make the definition of FreeModel complicated.
FreeModel should be literally and purely free.
We can divide SemiFreeModel into several categories according to
future case studies and make DL-Policy properly match practice.

> And, we don't allow Non-Free Model in "main"

I think no one would argue about NonFreeModel.

> Question is when do you call it "sanitized" (or "distilled") to be clean
> enough to qualify for "main" ;-)

I expect a model, once sanitized, to be purely free: for example, by
removing all non-free data from the training dataset and using only free
training data. Any single piece of non-free data pulls the model into
the minefield.

>> 2. It's not required to re-train a FreeModel with our infra, because
>>    the outcome/cost ratio is impractical. The outcome is nearly zero
>>    compared to directly using a pre-trained FreeModel, while the cost
>>    is increased carbon dioxide in our atmosphere and wasted developer
>>    time. (Deep learning is producing much more carbon dioxide than we
>>    thought).
>>
>>    For classical probabilistic graph models such as MRF or the mentioned
>>    CRF, the training process might be trivial, but re-training is still
>>    not required.
>
> ... but re-training is highly desirable in line with the spirit of the
> free software.

I guess you didn't catch my point. In my definition of FreeModel and the
SemiFree/ToxicCandy model, providing a training script is mandatory. Any
model without a training script must be non-free. This requirement also
implies that the upstream must provide all information about the datasets
and the training process. Software freedom can be guaranteed even if
we don't always re-train the free models, as that would only waste
electricity. On the other hand, developers should check whether a model
provides such freedom, and local re-training as a verification step
is encouraged.

Enforcing re-training would be a painful decision and would drive
energetic contributors away, especially those who refuse to use Nvidia
suckware.

> Let's use SanitizedModel to be neutral.

Once sanitized, a model should turn into a free model. If it doesn't,
then why sanitize it at all?

> We need to have some guideline principle for this sanitization process.
> (I don't have an answer now)

I need case studies at this point.

> This sanitization mechanism shouldn't be used to include obfuscated
> binary blob equivalents.  It's worse than FIRMWARE case since it runs on
> the same CPU as the program code.
>
> Although "Further Discussion" was the outcome, B in
> https://www.debian.org/vote/2006/vote_004 is worth looking at:
>   Strongly recommends that all non-programmatic works distribute the form
>   that the copyright holder or upstream developer would actually use for
>   modification. Such forms need not be distributed in the orig.tar.gz
>   (unless required by license) but should be made available on upstream
>   websites and/or using Debian project resources.
>
> Please note this is "Strongly recommends ... should be made
> available..." and not "must be made available ...".

Umm....

> Aside from Policy/Guideline for FREE/NON-FREE discussion, we also need
> to address for the spirit of the reproducible build.  It is nice to have
> checking mechanism for the validity and health of these MODELs.  I know
> one of the Japanese keyboard input method "Anthy" is suffering some
> regression in the upcoming release.  The fix was found too late so I
> uploaded to experimental since it contained too many changes while
> impact was subtle.  If we had a test suite with numerical score outputs,
> we could have detected such regressions by the upstream.  It may be
> unrealistic to aim for exact match for such probabilistic model but
> objectively traceable measure is very desirable to have.

Isn't this checking mechanism a part of upstream's work? When developing
machine learning software, model reproducibility (two different runs
should produce very similar results) is important.

This reproducibility issue is quite different from that of code. A
software upstream doesn't compile a C++ program twice to see whether the
same hashsum is produced, because a mismatch would indicate a compiler
bug. For a machine learning program, if the first training run produced
a model with 95% accuracy but the second run reached merely 30%, that is
a fatal bug in the program itself. (94% on the second run may be
acceptable.)
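
In code terms, the check is roughly the following. This is only a
minimal sketch of such a reproducibility test, assuming scikit-learn is
available; the dataset, model, and 5% tolerance are placeholders, not
part of any existing policy or tool:

    # Train the same pipeline twice and require the accuracies to agree
    # within a tolerance; a huge gap indicates a bug in the training code.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def train_once(seed):
        X, y = load_digits(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
        model = LogisticRegression(max_iter=1000)
        model.fit(X_tr, y_tr)
        return model.score(X_te, y_te)

    acc1 = train_once(seed=0)
    acc2 = train_once(seed=1)
    # 95% vs 94% is acceptable; 95% vs 30% is a fatal bug.
    assert abs(acc1 - acc2) < 0.05, f"not reproducible: {acc1:.3f} vs {acc2:.3f}"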


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Osamu Aoki
Hi,

Let's think in a bit different perspective.

What is the outcome of "Deep Learning"?  That's "knowledge".

If the dictionary of "knowledge" is expressed in a freely usable
software format with a free license, isn't it enough?

If you want more for your package, that's fine.  Please promote such a
program for your project.  (FYI: the reason I spent my time fixing
"anthy" for Japanese text input is that I didn't like the way "mozc"
looked like a sort of dump-ware by Google, containing the freely
licensed dictionary of "knowledge" without free base training data.)
But placing some kind of fancy purist "Policy" wording to police other
software doesn't help FREE SOFTWARE.  We got rid of Netscape from Debian
because we now have a good functional free alternative.

If you can make a model without any reliance on non-free base training
data for your project, that's great.

I think it's dangerous and counterproductive to deprive users of useful
software functionality by requiring that only free data be used to
obtain "knowledge".

Please note that re-training will not erase "knowledge".  It usually
just mixes new "knowledge" into the existing dictionary of "knowledge".
So the resulting dictionary of "knowledge" is not completely free of
the original training data.  We really need to treat this kind of
dictionary of "knowledge" in line with artwork --- not as software
code.

The training process itself may be mathematical, but the preparation of
training data and the iterative process of providing the re-calibrating
data sets involve huge human input.

> Enforcing re-training will be a painful decision...

Hmmm... this may depend on what kind of re-training.

At least for unidic-mecab, re-training to add many new words to be
recognized by the morphological analyzer is an easier task.  People have
used unidic-mecab and a web crawler to create an even bigger dictionary
with minimal re-training work (mostly automated, I guess):
  https://github.com/neologd/mecab-unidic-neologd/

I can't imagine re-creating the original core dictionary of "knowledge"
for Japanese text processing purely by training with newly provided free
data, since it takes too much human work, and I agree it is unrealistic
without a serious government or corporate sponsorship project.

Also, the "knowledge" for Japanese text processing should be able to
cover non-free texts.  Without using non-free texts as input data, how
do you know it works on them.

> Isn't this checking mechanism a part of upstream work? When developing
> machine learning software, the model reproducibility (two different runs
> should produce very similar results) is important.

Do you always have the luxury of relying on such a friendly/active
upstream?  If so, I see no problem.  But what should we do if not?

Anthy's upstream is practically the Debian repo now.

Osamu


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi Osamu,

On 2019-06-09 13:48, Osamu Aoki wrote:
> Let's think in a bit different perspective.
>
> What is the outcome of "Deep Learning"?  That's "knowledge".

Don't mix everything into a single obscure word, "knowledge".
That thing is not representable in a programming language
or a mathematical language, because we cannot define what
"knowledge" is in an unambiguous way. Squashing everything
into "knowledge" does exactly the inverse of what I'm doing.

> If the dictionary of "knowledge" is expressed in a freely usable
> software format with free license, isn't it enough?

A free license doesn't solve all my concerns. If we just treat
models as a sort of artwork, what if:

1. upstream happened to license a model trained from non-free
   data under the GPL. Is upstream violating the GPL by not releasing
   the "source" (or the material necessary to reproduce the work)?

2. upstream trained a model on a private dataset that contains
   deliberately evil data, and released it under the MIT license.
   (Then malware has just sneaked into main?)

I have to consider all possible models and applications in
the whole machine learning and deep learning area. The experience
gained from input methods cannot cover all possible cases.

A pile of digits from a classical machine learning model is
generally interpretable. That means a human can understand what
each digit means (e.g. a conditional probability, a frequency, etc.).

A pile of digits from a deep neural network is basically not
interpretable -- humans cannot fully understand them. Something
malicious could hide in this pile of digits due to the complexity
of the non-linear mapping that the neural network has learned.
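
To make the contrast concrete, here is a minimal sketch, assuming
scikit-learn; the toy data and the two models are placeholders chosen
only to illustrate the point:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(200, 20))   # toy word-count features
    y = rng.integers(0, 2, size=200)         # toy binary labels

    nb = MultinomialNB().fit(X, y)
    # Each entry is log P(word | class): a human-readable statistic
    # that a reviewer can inspect and sanity-check.
    print(nb.feature_log_prob_[0][:5])

    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=200).fit(X, y)
    # Each entry is just one weight of a learned non-linear mapping;
    # on its own it tells a reviewer essentially nothing.
    print(mlp.coefs_[0][:2, :5])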

Proposed updates:

1. If a SemiFreeModel won't raise any security concern, we
   can accept it into the main section. For an imagined example,
   upstream foobar wrote an input method and trained a probabilistic
   model on the developer's personal diary. The upstream released
   the model under a free license but didn't release his/her diary.
   Such a model is fine as it doesn't incur any security problem.

2. A security-sensitive SemiFreeModel is prohibited from entering
   the main section. Why should we trust it if we cannot inspect
   everything about it?

Let me emphasize this again: don't forget security when talking
about machine learning models and deep learning models. Data
used to train an input method doesn't do harm in any way, but data
used to train a model that controls authentication is ...
Security concerns are inevitable with the industrial application
of deep learning.

Maybe I'm just too sensitive after reading ~100 papers about
attacking/fooling machine learning models. Here is a ridiculous
example: [Adversarial Reprogramming of Neural Networks]
(https://arxiv.org/abs/1806.11146)

> If you want more for your package, that's fine.  Please promote such
> program for your project.  (FYI: the reason I spent my time for fixing
> "anthy" for Japanese text input is I didn't like the way "mozc" looked
> as a sort of dump-ware by Google containing the free license dictionary
> of "knowledge" without free base training data.)  But placing some kind
> of fancy purist "Policy" wording to police other software doesn't help
> FREE SOFTWARE.  We got rid of Netscape from Debian because we now have
> good functional free alternative.
>
> If you can make model without any reliance to non-free base training
> data for your project, that's great.

I'll create a subcategory under SemiFreeModel as an umbrella for input
methods and the like to reduce the overkill in DL-Policy, after
reviewing the code myself. It may take some time because I have
to understand how things work.

> I think it's a dangerous and counter productive thing to do to deprive
> access to useful functionality of software by requesting to use only
> free data to obtain "knowledge".

The policy needs to balance not only usefulness/productivity but also
software freedom (as per the definition), reproducibility, security,
doability, possibility, and difficulty.

The first priority is software freedom instead of productivity
when we can only choose one, even if users will complain.
That's why our official ISO cannot ship the ZFS kernel module
or the very useful non-free firmware and the like.

> Please note that the re-training will not erase "knowledge".  It usually
> just mix-in new "knowledge" to the existing dictionary of "knowledge".
> So the resulting dictionary of "knowledge" is not completely free of
> the original training data.  We really need to treat this kind of
> dictionary of "knowledge" in line with artwork --- not as a software
> code.

My interpretation of "re-train" is "train from scratch again" instead
of "train incrementally". For neural networks the "incremental training"
process is called "fine-tuning".

I understand that you don't wish DL-Policy to kick out input methods
and the like and discourage developers; this will be sorted out soon...

> Training process itself may be mathematical, but the preparation of
> training data and its iterative process of providing the re-calibrating
> data set involves huge human inputs.

I don't buy it because I cannot neglect my concerns.

>> Enforcing re-training will be a painful decision...
>
> Hmmm... this may depends on what kind of re-training.

Within DL-Policy's scope of discussion, the word "re-training" has
a global effect.

> At least for unidic-mecab, re-training to add many new words to be
> recognized by the morphological analyzer is an easier task.  People has
> used unidic-mecab and web crawler to create even bigger dictionary with
> minimal work of re-training (mostly automated, I guess.)
>   https://github.com/neologd/mecab-unidic-neologd/
>
> I can't imagine to re-create the original core dictionary of "knowledge"
> for Japanese text processing purely by training with newly provided free
> data since it takes too much human works and I agree it is unrealistic
> without serious government or corporate sponsorship project.
>
> Also, the "knowledge" for Japanese text processing should be able to
> cover non-free texts.  Without using non-free texts as input data, how
> do you know it works on them.

Understood. The information you provided is enough to help DL-Policy
set up an area for input methods and prevent them from being kicked
out of the archive (given that the fundamental requirements hold).

>> Isn't this checking mechanism a part of upstream work? When developing
>> machine learning software, the model reproducibility (two different runs
>> should produce very similar results) is important.
>
> Do you always have a luxury of relying on such friendly/active upstream?
> If so, I see no problem.  But what should we do if not?

Generally speaking, deep learning software that fails to reproduce
in any way is rubbish and should not be packaged. Special cases such
as input methods or board game models trained collectively by a
community may exist, but they cannot be used to derive the general
rule.

> Anthy's upstream is practically Debian repo now.
>
> Osamu


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Sam Hartman-3
In reply to this post by Mo Zhou
>>>>> "Mo" == Mo Zhou <[hidden email]> writes:


    >>> Specifically, I defined 3 types of pre-trained machine learning
    >>> models / deep learning models:
    >>>
    >>> Free Model, ToxicCandy Model. Non-free Model
    >>>
    >>> Developers who'd like to touch DL software should be cautious to
    >>> the "ToxicCandy" models. Details can be found in my draft.
    >>
    >> With a labeling like "ToxicCandy Model" for the situation, it
    >> makes a bad impression on people and I am afraid people may not
    >> make rational decisions.  Is this characterization a correct and
    >> sane one?  At least, it looks to me that this is changing the
    >> status quo of our policy and practice severely.  So it is worth
    >> evaluating the idea without the labeling.

    Mo> My motivation for the naming "ToxicCandy" is pure: to warn
    Mo> developers about this special case as it may lead to very
    Mo> difficult copyright or software freedom questions. I admit that
    Mo> this name does not look quite friendly. Maybe "SemiFree" looks
    Mo> better?

I really like the term toxic candy.
In two words it explains both that the model is appealing and
problematic.

If there are subdivisions of toxic candy that we decide are free, we
should come back and revisit and perhaps narrow toxic candy to the
problematic cases.

--Sam


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Sam Hartman-3
In reply to this post by Osamu Aoki
>>>>> "Osamu" == Osamu Aoki <[hidden email]> writes:

    Osamu> Hi, Let's think in a bit different perspective.

    Osamu> What is the outcome of "Deep Learning"?  That's "knowledge".

    Osamu> If the dictionary of "knowledge" is expressed in a freely
    Osamu> usable software format with free license, isn't it enough?

Unfortunately, I don't think this model applies well to deep learning.

That free knowledge in the dictionary isn't going to have hidden,
potentially designed-in security flaws.

For a Japanese input method, it might not matter.  If the wrong
character is chosen, I think the user can notice and correct it.

However, for other deep learning models, if the output decision is
wrong, the effect can be security sensitive.
And without the training data it is very difficult to change the future
behavior of the program, or even to understand whether there are back
doors.

In your dictionary example, if I have the resulting dictionary, I can
modify it even if I don't understand how we got there.
That's not really true for deep learning models that I cannot retrain.

--Sam


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Osamu Aoki
In reply to this post by Mo Zhou
Hi,

On Sun, Jun 09, 2019 at 09:27:42AM -0700, Mo Zhou wrote:
> Hi Osamu,
>
> On 2019-06-09 13:48, Osamu Aoki wrote:
> > Let's think in a bit different perspective.
...
... (I have some explanation for GPL-contamination concern later)
...
> Let me emphasize this again: Don't forget security when talking
> about machine learning models and deep learning models. Data
> used to train input method don't harm in any way, but data
> used to train a model that controls authentication is ...
> Security concern is inevitable along with industrial application
> of deep learning.

Very true. Are you talking about things like facial recognition?

My immediate concern was relatively shallow learning models, where the
meaning of each resulting parameter is mostly self-explanatory and the
data are in ASCII.  In other words, transparency is there.

> Maybe I'm just too sensitive after reading ~100 papers about
> attacking/fooling machine learning models. Here is a ridiculous
> example: [Adversarial Reprogramming of Neural Networks]
> (https://arxiv.org/abs/1806.11146)

I see.

I am not even sure having FREE and open base data is enough, after
reading the first few lines of the linked text.  The input data
distributed by upstream could contain steganography which deep learning
may pick up while a human reviewer of the input data overlooks it.
Then bad things can happen with a decision which uses this set of
seemingly nice data with deep learning.

I think one of the important questions is the transparency
(=ability to inspect and test with independent data) of the resulting
data.  I am no expert on this subject, though.  This is just my gut
feeling.

(Of course, another thing is the trust in the people who offer the base
data.)

This reminds me of the "trusting C code" situation.  We need the source,
YES.  We need the source of the compiler, YES.  But these aren't enough.
If the bootstrapping of the compiler was tainted, the trustworthiness of
C code can't be secured.  We must have GDB to inspect the compiled
result to be sure.

Even shallow data like Japanese input strings can contain an intentional
twist which could be biased.  The string "64" may be linked to the
string "Tiananmen".   I don't see that yet ;-)  This kind of rogue data,
which may upset some people, is a minor concern since it is like an
Easter egg in C code.  We have enough transparency of the data in this
case.

I am wondering about another shallow-learning data case such as a
Bayesian spam filter.  I can't agree to package pre-trained binary data
made from an unknown source text dataset, even if upstream tells us this
data is under a FREE license.  This is like binary firmware.  If the
pre-trained binary data is dumped in a readable and meaningful ASCII
text format, it may be OK if it is licensed under a FREE license.  Here
again, transparency of the data is important.
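
To illustrate what such a readable dump might look like, here is a
minimal sketch in Python, where the token statistics and the file format
are made-up placeholders rather than any real spam filter's data:

    # Dump Bayesian spam-filter statistics as plain, reviewable ASCII.
    token_stats = {
        "viagra":  {"spam": 1893, "ham": 2},
        "debian":  {"spam": 4,    "ham": 977},
        "invoice": {"spam": 211,  "ham": 135},
    }

    with open("bayes-model.txt", "w") as f:
        f.write("# token\tspam_count\tham_count\n")
        for token, c in sorted(token_stats.items()):
            # Each line is a self-explanatory frequency a reviewer can
            # inspect, unlike an opaque binary blob.
            f.write(f"{token}\t{c['spam']}\t{c['ham']}\n")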

> > If you want more for your package, that's fine.  Please promote such
> > program for your project.  (FYI: the reason I spent my time for fixing
> > "anthy" for Japanese text input is I didn't like the way "mozc" looked
> > as a sort of dump-ware by Google containing the free license dictionary
> > of "knowledge" without free base training data.)  But placing some kind
> > of fancy purist "Policy" wording to police other software doesn't help
> > FREE SOFTWARE.  We got rid of Netscape from Debian because we now have
> > good functional free alternative.
> >
> > If you can make model without any reliance to non-free base training
> > data for your project, that's great.
>
> I'll create a subcategory under SemiFreeModel as an umbrella for input
> methods and alike to reduce the overkilling level of DL-Policy. After
> reviewing the code by myself. It may take some time because I have
> to understand how things work.
>
> > I think it's a dangerous and counter productive thing to do to deprive
> > access to useful functionality of software by requesting to use only
> > free data to obtain "knowledge".
>
> The policy needs to balance not only usefulness/productivity but also
> software freedom (as per definition), reproducibility, security,
> doability, possibility and difficulties.
>
> The first priority is software freedom instead of productivity
> when we can only choose one, even if users will complain.
> That's why our official ISO cannot ship ZFS kernel module
> and very useful non-free firmware or alike.
>
> > Please note that the re-training will not erase "knowledge".  It usually
> > just mix-in new "knowledge" to the existing dictionary of "knowledge".
> > So the resulting dictionary of "knowledge" is not completely free of
> > the original training data.  We really need to treat this kind of
> > dictionary of "knowledge" in line with artwork --- not as a software
> > code.
>
> My interpretation of "re-train" is "train from scratch again" instead
> of "train increamentaly". For neural networks the "incremental training"
> process is called "fine-tune".

I see.

> I understand that you don't wish DL-Policy to kick off input methods
> or alike and make developers down, and this will be sorted out soon...
>
> > Training process itself may be mathematical, but the preparation of
> > training data and its iterative process of providing the re-calibrating
> > data set involves huge human inputs.
>
> I don't buy it because I cannot neglect my concerns.
>
> >> Enforcing re-training will be a painful decision...
> >
> > Hmmm... this may depends on what kind of re-training.
>
> Based on DL-Policy's scope of discussion, that "re-training" word
> have global effects.

I see.

> > At least for unidic-mecab, re-training to add many new words to be
> > recognized by the morphological analyzer is an easier task.  People have
> > used unidic-mecab and web crawler to create even bigger dictionary with
> > minimal work of re-training (mostly automated, I guess.)
> >   https://github.com/neologd/mecab-unidic-neologd/
> >
> > I can't imagine to re-create the original core dictionary of "knowledge"
> > for Japanese text processing purely by training with newly provided free
> > data since it takes too much human works and I agree it is unrealistic
> > without serious government or corporate sponsorship project.
> >
> > Also, the "knowledge" for Japanese text processing should be able to
> > cover non-free texts.  Without using non-free texts as input data, how
> > do you know it works on them.
>
> Understood. The information you provided is enough to help DL-Policy
> set up an area for input methods and prevent them from being kicked
> out from archive (given the fundamental requirements hold).
>
> >> Isn't this checking mechanism a part of upstream work? When developing
> >> machine learning software, the model reproducibility (two different runs
> >> should produce very similar results) is important.
> >
> > Do you always have a luxury of relying on such friendly/active upstream?
> > If so, I see no problem.  But what should we do if not?
>
> Generally speaking a deep learning software that fails to reproduce
> in any way is rubbish and should not be packaged. Special cases such
> as input methods or board game models trained collectively by a
> community may exist but they cannot be used to conclude the general
> law.

I am not sure if it is right to segregate things just by use case.
(It's OK to use this as a guideline for deciding where to spend time.)

I also think it is important to have a transparency test for any of
these data.  It's like binary vs. source code.

Just to be sure:

mecab is not a Japanese input method tool.  mecab is a generic
morphological analysis tool which checks only the nearest-neighbor word.

mozc is the input method which seems to use a mecab code variant within
it, with its own dictionary data.

mecab with any one of the mecab dictionaries can be used to tokenize
Japanese text into words (some even generate data for proper
pronunciation information).  This is sometimes the first step in
scanning Japanese text for any analytical process.
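
For readers unfamiliar with mecab, here is a minimal sketch of that
tokenization step, assuming the mecab-python3 bindings and a system
dictionary are installed (the sample sentence is only a placeholder):

    import MeCab

    tagger = MeCab.Tagger()   # uses the installed default dictionary
    text = "日本語の文を単語に分割します。"
    # parse() returns one line per token: the surface form plus readable
    # features such as part of speech and reading.
    print(tagger.parse(text))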

They are also used to crawl the net to find unknown words and add them
to a new, updated, fine-tuned dictionary with default property settings.

The upstream corpus of mecab-unidic contains not only government papers
but also published ordinary books, journals, and newspaper articles.
Thus it can't be distributed as Free.  Also, the initial base words and
some pronunciation/intonation instruction data came from proprietary
dictionaries, so it was not FREE data initially.  But after good efforts
by the government agency and people around it to negotiate with the
original dictionary data suppliers, the dictionary publisher agreed to
distribute this scope of data under 3 licenses (BSD/LGPL/GPL) to make it
available and compatible in most use cases.  So there is no
GPL-contamination issue here.  Also, this dictionary has no definitions
of word meanings, which the dictionary publisher didn't wish to release
under a FREE license.  That is the sanitization process of these mecab
data.  GPL compatibility is not the issue with mecab-unidic.

Its extension mecab-unidic-neologd is under the Apache license, which I
guess is OK if we take mecab-unidic under the BSD license.

Regards,

Osamu


Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Sam Hartman-3
Hello guys,

On 2019-06-10 13:14, Sam Hartman wrote:
> I really like the term toxic candy.
> In two words it explains both that the model is appealing and
> problematic.

So let's keep this name :-)

> If there are subdivisions of toxic candy that we decide are free, we
> should come back and revisit and perhaps narrow toxic candy to the
> problematic cases.

Various recent comments remind me of many details, mentioned or
not mentioned, that I have overlooked. So I'm going to refactor
the whole document and make it much deeper and more complete.
This will be a big update and will take some time.
