Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi Andy,

Thanks for your comments.

On 2019-05-23 09:28, Andy Simpkins wrote:
> Your wording "The model /should/be reproducible with a fixed random seed." feels
> correct but wonder if guidance notes along the following lines should be added?
>
>     *unless* we can reproduce the same results, from the same training data,
>     you cannot classify as group 1, "Free Model", because verification that
>     training has been carried out on the dataset explicitly licensed under a
>     free software license can not be achieved.  This should be treated as a
>     severe bug and the entire suite should be classified as group 2,
>     "ToxicCandy Model", until such time that verification is possible.

Ummm... this is actually a bit cruel to upstream, and I think there
is still some misunderstanding. I've updated the document and made the
following points clear:

- "Numerically Reproducible" is the default reproducibility definition
  in this context:

  https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility

- A Free Model should be Numerically Reproducible, or at least a
  locally-trained model should reach performance (e.g. accuracy)
  similar to that of the original one.

  Similar results are acceptable; the bar of "identical" is not always
  reachable. (A minimal seed-reproducibility sketch follows this list.)

- The datasets used for training a "ToxicCandy" may be private/non-free,
  and not everybody can access them. (This case is more likely a result
  of problematic upstream licensing, but it sometimes happens.)

  One got a free model from the internet. That little candy tastes sweet.
  One wanted to make this candy at home with the provided recipe, but
  surprisingly found out that non-free ingredients are inevitable.
    -- ToxicCandy
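
A minimal sketch (mine, not part of the policy text) of what "reproducible
with a fixed random seed" means in practice, using PyTorch. Exact
bit-by-bit equality additionally depends on hardware, library versions and
deterministic kernels, which is why "Numerically Reproducible" only asks
for agreement within a tolerance:

```python
# Sketch: "Numerically Reproducible" training under a fixed random seed.
# Toy model and synthetic data; the tolerance check is the point here.
import torch


def train_tiny_model(seed: int) -> torch.Tensor:
    """Train a toy linear regressor and return its final weights."""
    torch.manual_seed(seed)              # fix all torch RNG state
    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(64, 4)               # synthetic data, also seeded
    y = x.sum(dim=1, keepdim=True)
    for _ in range(100):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return model.weight.detach().clone()


w1 = train_tiny_model(seed=42)
w2 = train_tiny_model(seed=42)
# Two runs with the same seed agree within a small tolerance, even when
# they are not guaranteed to be bit-by-bit identical across machines.
print(torch.allclose(w1, w2, atol=1e-6))
```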

Is the updated document clearer?

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Sam Hartman-3
Hi Sam,

On 2019-05-23 15:33, Sam Hartman wrote:
> I don't think that's entirely true.

Yes, that's a bit cruel to upstream.

> Reproducibility is still an issue, but is no more or less an issue than
> with any other software.

Bit-by-bit reproducibility is not practical for now. The refined
definition of "Numerically Reproducible" is much better, and I use it
as the default definition.

https://salsa.debian.org/lumin/deeplearning-policy#neural-network-reproducibility

> Consider how we treat assets for games or web applications.  And yes
> there are some confusing areas there and areas where we'd like to
> improve.  But let's be consistent in what we demand from various
> communities to be part of Debian.  Let's not penalize people for being
> new and innovative.

Agreed.

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Andy Simpkins-3
Hi Andy,

On 2019-05-23 17:52, Andy Simpkins wrote:
> Sam.
> Whilst I agree that "assets" in some packages may not have sources
> with them and the application may still be in main if it pulls in
> those assets from contrib or non-free,
> I am trying to suggest the same thing here. If the data set is unknown
> this is the *same* as a dependency on a random binary blob (music /
> fonts / game levels / textures etc) and we wouldn't put that in main.

The "ToxicCandy Model" is used to cover a special case. Both "ToxicCandy"
and "Non-free" models cannot enter our main section, as stated by
DL-Policy #1 from the beginning.

> It is my belief that we consider training data sets as 'source' in
> much the same way....

We can interpret training data as a sort of "source" indeed. But
sometimes we even have trouble with free "source". The Wikipedia dump
is a frequently used free corpus in the computational linguistics
field. Do we really want to upload the Wikipedia dump to the
archive whenever some Free Model to be packaged is trained on it?

The Wikipedia dump is so large that it challenges our .deb format
(see recent threads).

See (Difficulties -- Dataset Size):
https://salsa.debian.org/lumin/deeplearning-policy#difficulties-questions-not-easy-to-answer

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Sam Hartman-3
On 2019-05-23 17:58, Sam Hartman wrote:
> So for deep learning models we would require that they be retrainable
> and typically require that we have retrained them.

These two difficulties make the above point hard to achieve:
https://salsa.debian.org/lumin/deeplearning-policy#difficulties-questions-not-easy-to-answer

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Paul Wise via nm
In reply to this post by Sam Hartman-3
On Fri, May 24, 2019 at 1:58 AM Sam Hartman wrote:

> So for deep learning models we would require that they be retrainable
> and typically require that we have retrained them.

I don't think it is currently feasible for Debian to retrain the
models. I don't think we have any buildds with GPUs yet. I don't know
about the driver situation but for example I doubt any deep learning
folks using the nvidia hardware mentioned in deeplearning-policy are
using the libre nouveau drivers. The driver situation for TPUs might
be better though? Either way I think a cross-community effort for
retraining and reproducibility of models would be better than Debian
having to do any retraining.

--
bye,
pabs

https://wiki.debian.org/PaulWise

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
On 2019-05-24 15:59, Paul Wise wrote:
> On Fri, May 24, 2019 at 1:58 AM Sam Hartman wrote:
>
>> So for deep learning models we would require that they be retrainable
>> and typically require that we have retrained them.
>
> I don't think it is currently feasible for Debian to retrain the
> models.

Infeasible, for sure.

> I don't think we have any buildds with GPUs yet.

The non-free NVIDIA driver is inevitable;
AMD GPUs and OpenCL are not sane choices.

> I don't know
> about the driver situation but for example I doubt any deep learning
> folks using the nvidia hardware mentioned in deeplearning-policy are
> using the libre nouveau drivers.

No need to doubt: Nouveau can never support CUDA well,
unless NVIDIA someday rethinks everything.

Some good Xeon CPUs can train models as well,
and a well-optimized linear algebra library
(e.g. MKL, OpenBLAS) helps a lot. But generally
CPU training takes at least 10x longer to
finish (except for some toy networks).

> The driver situation for TPUs might
> be better though?

I don't know any software details about TPUs.

> Either way I think a cross-community effort for
> retraining and reproducibility of models would be better than Debian
> having to do any retraining.

Sounds like a good way to go, but not today.
Let's do lazy execution at this point, and
see how this subject evolves and how other
FOSS communities think.

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Adam Borowski-3
In reply to this post by Mo Zhou
On Thu, May 23, 2019 at 11:37:41PM -0700, Mo Zhou wrote:
> - The datasets used for training a "ToxicCandy" may be
>   private/non-free and not everybody can access them. (This case is more
>   likely a result of problematic upstream licensing, but it sometimes
> happens).
>
>   One got a free model from the internet. That little candy tastes sweet.
>   One wanted to make this candy at home with the provided recipe, but
>   surprisingly found out that non-free ingredients are inevitable.
>     -- ToxicCandy

I'm not so sure this model would be unacceptable.  It's no different from
a game's image being a photo of a tree in your garden -- not reproducible by
anyone but you (or someone you invite).  Or a word-frequency list produced
by analyzing the results of a Google search.

At some point, the work becomes an entity on its own rather than the result
of processing some dataset.

A more ridiculous argument: the input is a project requirement sheet, the
neural network being four pieces of wetware, working for 3 months.  Do you
insist on _this_ being reproducible, or would you accept the product as free
software?  Sufficiently advanced artificial intelligence might not be that
different.


喵!
--
⢀⣴⠾⠻⢶⣦⠀ Latin:   meow 4 characters, 4 columns,  4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek:   μεου 4 characters, 4 columns,  8 bytes
⢿⡄⠘⠷⠚⠋  Runes:   ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄⠀⠀⠀⠀ Chinese: 喵   1 character,  2 columns,  3 bytes <-- best!

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Paul Wise via nm
In reply to this post by Mo Zhou
On Fri, 2019-05-24 at 03:14 -0700, Mo Zhou wrote:

> Non-free nvidia driver is inevitable.
> AMD GPUs and OpenCL are not sane choices.

So no model which cannot be CPU-trained is suitable for Debian main.

> Don't doubt. Nouveau can never support CUDA well.

There is coriander but nouveau doesn't support OpenCL 1.2 yet.

https://github.com/hughperkins/coriander

> Some good Xeon CPUs can train models as well,
> and a well optimized linear algebra library
> helps a lot (e.g. MKL, OpenBLAS). But generally
> CPU training takes at least 10x longer time to
> finish. (except some toy networks)

So only toy networks can enter Debian main?

> Sounds like a good way to go. But not today.
> Let's do lazy execution at this point, and
> see how this subject evolves and how other
> FOSS communities think.

Agreed, that sounds reasonable, similar to how repro builds went.

--
bye,
pabs

https://wiki.debian.org/PaulWise


RE: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

PICCA Frederic-Emmanuel
What about IBM POWER9 with pocl?

It seems that this is better than the latest NVIDIA GPU.

Cheers

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Sam Hartman-3
In reply to this post by Paul Wise via nm
>>>>> "Paul" == Paul Wise <[hidden email]> writes:

    Paul> On Fri, May 24, 2019 at 1:58 AM Sam Hartman wrote:
    >> So for deep learning models we would require that they be
    >> retrainable and typically require that we have retrained them.

    Paul> I don't think it is currently feasible for Debian to retrain
    Paul> the models. I don't think we have any buildds with GPUs yet. I
    Paul> don't know about the driver situation but for example I doubt
    Paul> any deep learning folks using the nvidia hardware mentioned in
    Paul> deeplearning-policy are using the libre nouveau drivers. The
    Paul> driver situation for TPUs might be better though? Either way I
    Paul> think a cross-community effort for retraining and
    Paul> reproducibility of models would be better than Debian having
    Paul> to do any retraining.

I wonder whether we'd accept a developer's assertion that some large PDF
in a source package could be rebuilt, without actually rebuilding it on
every upload.
I think we probably would.

I think something similar might be acceptable here.

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Holger Levsen-2
On Fri, May 24, 2019 at 10:43:34AM -0400, Sam Hartman wrote:
> I wonder whether we'd accept a developer's assertion that some large pdf
> in a source package could be rebuilt without actually rebuilding it  on
> every upload.
> I think we probably would.

I don't think so, actually. AFAIK we don't accept this, and we treat
such bugs as serious (though quite probably those bugs might be
tagged buster-ignore right now).


--
tschau,
        Holger

-------------------------------------------------------------------------------
               holger@(debian|reproducible-builds|layer-acht).org
       PGP fingerprint: B8BF 5413 7B09 D35C F026 FE9D 091A B856 069A AA1C

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Paul Wise via nm
In reply to this post by Sam Hartman-3
On Fri, 2019-05-24 at 10:43 -0400, Sam Hartman wrote:

> I wonder whether we'd accept a developer's assertion that some large pdf
> in a source package could be rebuilt without actually rebuilding it  on
> every upload.

As I understand it, ftp-master policy is that things in main be
buildable from source using only tools in main, not that everything in
main is actually built from source at `debian/rules build` time.

There are plenty of things in the archive that we do not build from
source on the buildds, firmware-linux-free for example.

Obviously the best way to prove things are buildable from source is to
actually build from source and do it as often as possible.

Personally I'd like:

 * A standard build profile used when building everything from source.
 * A way to tell debian/rules to build everything from source.
 * A build toolchain option to make use of these.
 * A requirement that things not built from source come in a separate
   component tarball of the source package, using the multi-tarball
   feature of the v3 Debian source package format.
 * More upstream separation of build products from source.
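
Not Paul's words, but a hypothetical illustration of the idea: a small
build-helper script (Python here, matching the transcript elsewhere in
this thread) that honors such a "build everything from source" profile
via the existing DEB_BUILD_PROFILES environment variable. The profile
name "fullsource" and the file paths are assumptions, not an existing
standard:

```python
# Hypothetical sketch: rebuild an expensive asset (e.g. a model or a
# big PDF) only when a "fullsource" build profile is active; otherwise
# reuse the prebuilt copy shipped in a separate component tarball.
import os
import shutil
import subprocess

profiles = os.environ.get("DEB_BUILD_PROFILES", "").split()

if "fullsource" in profiles:
    # Build everything from source, however expensive it is.
    subprocess.run(["make", "-C", "assets", "all"], check=True)
else:
    # Reuse the prebuilt artifact from the component tarball.
    shutil.copy("prebuilt/model.bin", "build/model.bin")
```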

> I think we probably would.

Personally I do not think it would be acceptable to not build large
PDFs from source. I doubt the PDF build process could be problematic
enough that we couldn't do it on current buildds.

--
bye,
pabs

https://wiki.debian.org/PaulWise


Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Adam Borowski-3
Hi Adam,

On 2019-05-24 10:19, Adam Borowski wrote:

>
> I'm not so sure this model would be unacceptable.  It's no different than
> a game's image being a photo of a tree in your garden -- not reproducible by
> anyone but you (or someone you invite).  Or, a wordlist frequency produced
> by analyzing results of a google search.
>
> At some point, the work becomes an entity on its own rather than the result
> of processing some dataset.
>
> A more ridiculous argument: the input is a project requirement sheet, the
> neural network being four pieces of wetware, working for 3 months.  Do you
> insist on _this_ being reproducible, or would you accept the product as free
> software?  Sufficiently advanced artificial intelligence might be not that
> different.

This is exactly the difficult question #3. The definition of ToxicCandy
is prepared for the future, and we currently lack a concrete example
of this case. (A ToxicCandy Model that nobody plans to upload to the
archive is not a valid case.)

Let me first make the definition of the safest area clear. Only then
should we try to explore more complicated cases ...

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Paul Wise via nm
Hi Paul,

On 2019-05-24 11:50, Paul Wise wrote:
> On Fri, 2019-05-24 at 03:14 -0700, Mo Zhou wrote:
>
>> Non-free nvidia driver is inevitable.
>> AMD GPUs and OpenCL are not sane choices.
>
> So no model which cannot be CPU-trained is suitable for Debian main.

I already pointed that out a year ago. Modern DL frameworks
support different computation devices, typically CPU and GPU (CUDA),
and CUDA training is typically tens or hundreds of times faster than
CPU training. I already raised the question of whether a model
is really free if training it on purely free data with free software
takes a year, but merely an hour with non-free software.
In that historical thread people thought this was not a solvable
problem, so I didn't write much about it.

My word "can't" means "cannot finish within a reasonable time frame".
If I could live for 1e9 years, I'd definitely say non-free
software is not necessary, even if training on a weak i3 CPU
took a comparatively short period, say, 100 years.

I updated Difficulty #2 and mentioned this. The packages on my
radar are not likely to suffer from the hard problems.

> https://github.com/hughperkins/coriander

Added to watch list.
 
>> Some good Xeon CPUs can train models as well,
>> and a well optimized linear algebra library
>> helps a lot (e.g. MKL, OpenBLAS). But generally
>> CPU training takes at least 10x longer time to
>> finish. (except some toy networks)
>
> So only toy networks can enter Debian main?

Not exactly. Some useful stuff can be trained by
CPUs within a reasonable timeframe. We can analyze
such models once we have concrete cases.

NLTK-data contains some good examples of useful
"Free Models". I haven't finished inspecting its
contents, but some of its components can meet
the high standard of "Free Model".

As I said at the beginning, the initial draft policy is conservative
and overkilling. We can revise it to let more models pass
in the future if people request so.

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by PICCA Frederic-Emmanuel
Hi PICCA,

On 2019-05-24 12:01, PICCA Frederic-Emmanuel wrote:
> What about ibm power9 with pocl ?
>
> it seems that this is better than the latest NVIDIA GPU.

The typical workload in training neural networks consists of linear
operations such as general matrix-matrix multiplication and
convolution.

I know nothing about pocl, but it is hard for a CPU to beat a GPU
at these highly parallelizable linear operations. Try a 4096x4096
matrix multiplication and you will easily see the difference.

E.g. with my CPU, an i5-7440HQ (mid-range mobile CPU), and my GPU, an
NVIDIA 940MX (junk), the junk GPU (CUDA) is about 100x faster than
the CPU (MKL):

~ ❯❯❯ optirun ipython3
Python 3.7.3 (default, Apr  3 2019, 05:39:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch as th

In [2]: x = th.rand(4096, 4096)

In [3]: %time x@x
CPU times: user 1.65 s, sys: 38.7 ms, total: 1.69 s
Wall time: 449 ms
Out[3]:
tensor([[1015.7596, 1004.2767, 1001.6245,  ..., 1026.8447,  996.3105,
         1002.7847],
        [1047.8833, 1014.3856, 1020.8246,  ..., 1055.3224, 1021.6126,
         1031.0334],
        [1049.3168, 1027.7637, 1030.9961,  ..., 1054.3218, 1015.3804,
         1031.6709],
        ...,
        [1039.6516, 1024.6678, 1021.1326,  ..., 1047.0674, 1015.1402,
         1029.5969],
        [1020.1988,  994.0073, 1005.5823,  ..., 1015.6786,  990.2491,
         1008.1358],
        [1022.9388,  991.9886,  990.4608,  ..., 1013.9000,  998.8676,
         1007.8554]])

In [4]: x = x.cuda()

In [5]: %time x@x
CPU times: user 1.1 ms, sys: 174 µs, total: 1.27 ms
Wall time: 2.67 ms
Out[5]:
tensor([[1015.7591, 1004.2764, 1001.6254,  ..., 1026.8447,  996.3105,
         1002.7841],
        [1047.8838, 1014.3846, 1020.8243,  ..., 1055.3209, 1021.6123,
         1031.0328],
        [1049.3174, 1027.7644, 1030.9971,  ..., 1054.3210, 1015.3800,
         1031.6727],
        ...,
        [1039.6511, 1024.6686, 1021.1323,  ..., 1047.0674, 1015.1404,
         1029.5974],
        [1020.1982,  994.0067, 1005.5826,  ..., 1015.6784,  990.2482,
         1008.1347],
        [1022.9395,  991.9879,  990.4588,  ..., 1013.9014,  998.8687,
         1007.8544]], device='cuda:0')

Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
In reply to this post by Paul Wise via nm
Hi,

On 2019-05-21 23:52, Paul Wise wrote:
> Has anyone repeated the training of Mozilla DeepSpeech for example?

By chance, in a pile of papers on attacking AI models, I found one
showing that Berkeley researchers have successfully attacked DeepSpeech:

   https://arxiv.org/pdf/1801.01944.pdf

IMHO, try not to ask AI to deal with any critical task unless one
understands the security risk. Maybe attacking AI models will be
what future hackers do? (A toy illustration of the adversarial-example
idea follows the quote below.)

```quote from https://arxiv.org/abs/1801.01944
Abstract

We construct targeted audio adversarial examples on automatic speech
recognition. Given any audio waveform, we can produce another that
is over 99.9% similar, but transcribes as any phrase we choose
(recognizing up to 50 characters per second of audio). We apply our
white-box iterative optimization-based attack to Mozilla's
implementation DeepSpeech end-to-end, and show it has a 100% success
rate. The feasibility of this attack introduces a new domain to study
adversarial examples.
```
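
Not from the paper: below is only a toy sketch of the classic FGSM
(fast gradient sign method) idea on an image classifier, to illustrate
what an "adversarial example" is. The paper's actual attack is an
iterative optimization over audio waveforms; the model and data here
are random placeholders:

```python
# Toy FGSM sketch -- NOT the paper's audio attack, just the general
# adversarial-example idea: nudge the input along the gradient sign.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(          # placeholder classifier
    torch.nn.Flatten(),
    torch.nn.Linear(28 * 28, 10),
)
model.eval()

x = torch.rand(1, 1, 28, 28)          # placeholder "image"
y = torch.tensor([3])                 # its assumed true label
x.requires_grad_(True)

loss = F.cross_entropy(model(x), y)   # loss w.r.t. the true label
loss.backward()                       # gradient of the loss w.r.t. x

eps = 0.1                             # perturbation budget
x_adv = (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

# x_adv stays visually close to x, yet often flips the prediction.
print(model(x).argmax(dim=1).item(), model(x_adv).argmax(dim=1).item())
```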

Concern for: A humble draft policy on "deep learning v.s. freedom"

Osamu Aoki
In reply to this post by Mo Zhou
Hi,

On Tue, May 21, 2019 at 12:11:14AM -0700, Mo Zhou wrote:
> Hi people,

I see your good intention but this is basically changing status-quo for
the main requirement.

>   https://salsa.debian.org/lumin/deeplearning-policy
>   (issue tracker is enabled)

I read it ;-)

> This draft is conservative and overkilling, and currently
> only focus on software freedom. That's exactly where we
> start, right?

OK, but it can't be where we end up.

Before scientific "deep learning" data, we already have practical "deep
learning" data in our archive.

Please note that one of the most popular Japanese input methods, mozc,
will be kicked out of main as a starter if we start enforcing this new
guideline.

> Specifically, I defined 3 types of pre-trained machine
> learning models / deep learning models:
>
>   Free Model, ToxicCandy Model. Non-free Model
>
> Developers who'd like to touch DL software should be
> cautious to the "ToxicCandy" models. Details can be
> found in my draft.

With a labeling like "ToxicCandy Model" for the situation, it makes a bad
impression on people, and I am afraid people may not make rational
decisions.  Is this characterization a correct and sane one?  At least,
it looks to me that this is changing the status quo of our policy and
practice severely.  So it is worth evaluating the idea without labeling.

As long as the "data" comes in a form which allows us to modify it and
re-train it to make it better with a set of free software tools, we
shouldn't make it non-free, for sure.  That is my position, and I think
this is how we have operated as a project.  We never asked how things
were originally made.  The touchy question is how easy it should be to
modify and re-train, etc.

Let's list analogy cases.  We allow a photo of something in our archive
as wallpaper etc.  We don't ask for the object of the photo or the tool
used to make it to be FREE.  The Debian logo is one example; it was
created with Photoshop, as I understand.  Another analogy to consider is
how we allow independent copyright and license for dictionary-like data
which must have been processed from previously copyrighted (possibly
non-free) texts by a human brain, maybe with some script processing.
Packages such as opendict, *spell-*, dict-freedict-all, ... are in main.

I agree it is nice to have the base data in the package.  If you can,
please include the training data if it is a FREE set.  But it may be
unrealistic for Debian to get into the business of distributing many GB
of training data for this purpose.  You may be talking about data sizes
over tens of GB.  This is another thing you should realize -- mandating
its inclusion is impractical, since it is not the focus point on which
Debian needs to spend its resources.

Let's talk about actual cases in main.

"mecab" is a free tool for Japanese text morphological analysis which
can create CRF-optimized parameters from marked-up training data.

(This is also the base of mozc, which uses such data to create desirable
typing output for normal Japanese text input from the keyboard.)

One of the dictionaries for mecab is an 800MB compressed deb in main:
unidic-mecab, which is 2.2GB of data in text format containing
CRF-optimized parameters and other text data obtained by training.
These texts and parameters are triple-licensed BSD/LGPL/GPL.
Re-training this is a very straightforward application of the mecab
tool with additional data only.  So this is as FREE as it can be in
current practice, and we have it in main.
  https://unidic.ninjal.ac.jp/

When these CRF parameters were initially made, they used non-free data
(Japanese Government funded) available on multiple DVDs with a hefty
price and restrictions on its use and redistribution.  This base data
for training is as NON-FREE as it can be, so we don't distribute it.
  https://pj.ninjal.ac.jp/corpus_center/bccwj/dvd-index.html

In the case of MOZC, the original training data is only available inside
Google and not published by them.  Actually, tweaking the data is
possible, but consistently retraining this data in MOZC may not be a
trivial application of the mecab tool.  We are placing this in main now
anyway, since its data (CRF-optimized parameters and other text data)
are licensed under BSD-3-clause, and we have MOZC in main.

Regards,

Osamu

Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Mo Zhou
Hi Osamu,

On 2019-06-08 18:43, Osamu Aoki wrote:
>> This draft is conservative and overkilling, and currently
>> only focus on software freedom. That's exactly where we
>> start, right?
>
> OK but it can't be where we end-up-with.

That's why I used the words "conservative" and "overkilling".
In my blueprint we can actually loosen these restrictions bit
by bit with further case study.

> Before scientific "deep learning" data, we already have practical "deep
> learning" data in our archive.

Thanks for pointing them out. They are good case studies
for me to use in revising the DL-Policy.

> Please note one of the most popular Japanese input method mozc will be
> kicked out from main as a starter if we start enforcing this new
> guideline.

I'm in no position to irresponsibly enforce an experimental
policy without having finished enough case study.

>> Specifically, I defined 3 types of pre-trained machine
>> learning models / deep learning models:
>>
>>   Free Model, ToxicCandy Model. Non-free Model
>>
>> Developers who'd like to touch DL software should be
>> cautious to the "ToxicCandy" models. Details can be
>> found in my draft.
>
> With a labeling like "ToxicCandy Model" for the situation, it makes bad
> impression on people and I am afraid people may not be make rational
> decision.  Is this characterization correct and sane one?  At least,
> it looks to me that this is changing status-quo of our policy and
> practice severely.  So it is worth evaluating idea without labeling.

My motivation for the name "ToxicCandy" is pure: to warn developers
about this special case, as it may lead to very difficult copyright
or software freedom questions. I admit that the name does not look
quite friendly. Maybe "SemiFree" looks better?

> As long as the "data" comes in the form which allows us to modify it and
> re-train it to make it better with a set of free software tools to do it,
> we shouldn't make it non-free, for sure.  That is my position and I
> think this was what we operated as the project.  We never asked how they
> are originally made.  The touchy question is how easy it should be to
> modify and re-train, etc.
[...]

Thank you Osamu. These cases inspired me to find a better
balance point for the DL-Policy. I'll add them to the case
study section, and I'm going to add the following points to DL-Policy:

1. Free datasets used to train a FreeModel are not required to be
   uploaded to our main section, for example those Osamu mentioned and
   the Wikipedia dump. We are not a scientific data archiving
   organization, and such data would blow up our infra if we uploaded
   too much of it.

2. It is not required to re-train a FreeModel on our infra, because
   the outcome/cost ratio is impractical. The outcome is nearly zero
   compared to directly using a pre-trained FreeModel, while the cost
   is increased carbon dioxide in our atmosphere and wasted developer
   time. (Deep learning produces much more carbon dioxide than we
   thought.)

   For classical probabilistic graphical models such as MRF or the
   mentioned CRF, the training process might be trivial, but
   re-training is still not required.

For SemiFreeModel I still hesitate to make any decision. Once we let
such models enter the main section, there will be many unreproducible
or hard-to-reproduce but surprisingly "legal" (in terms of DL-Policy)
files. Maybe this case is to some extent similar to artworks and fonts.
Further study is needed. And it is still not easy to find a balance
point for SemiFreeModel between usefulness and freedom.

Thanks,
Mo.

Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

"Yao Wei (魏銘廷)"-2
Hi,

>> With a labeling like "ToxicCandy Model" for the situation, it makes bad
>> impression on people and I am afraid people may not be make rational
>> decision.  Is this characterization correct and sane one?  At least,
>> it looks to me that this is changing status-quo of our policy and
>> practice severely.  So it is worth evaluating idea without labeling.
>
> My motivation for the naming "ToxicCandy" is pure: to warn developers
> about this special case as it may lead to very difficult copyright
> or software freedom questions. I admit that this name looks not
> quite friendly. Maybe "SemiFree" look better?

About the term ToxicCandy: it reminds me of an existing term,
"Tainted", which is also used by the Linux kernel to describe a
kernel running with non-free modules.

So... how about "Tainted Model"?

Just 2 cents,
Yao Wei

(This email is sent from a phone; sorry for HTML email if it happens.)

Re: Concern for: A humble draft policy on "deep learning v.s. freedom"

Osamu Aoki
In reply to this post by Mo Zhou
Hi Mo,

On Sat, Jun 08, 2019 at 10:07:13PM -0700, Mo Zhou wrote:

> Hi Osamu,
>
> On 2019-06-08 18:43, Osamu Aoki wrote:
> >> This draft is conservative and overkilling, and currently
> >> only focus on software freedom. That's exactly where we
> >> start, right?
> >
> > OK but it can't be where we end-up-with.
>
> That's why I said the two words "conservative" and "overkilling".
> In my blueprint we can actually loosen these restrictions bit
> by bit with further case study.

Yes, we agree here!

> > Before scientific "deep learning" data, we already have practical "deep
> > learning" data in our archive.
>
> Thanks for pointing them out. They are good case study
> for me to revise the DL-Policy.
>
> > Please note one of the most popular Japanese input method mozc will be
> > kicked out from main as a starter if we start enforcing this new
> > guideline.
>
> I'm in no position of irresponsibly enforcing an experimental
> policy without having finished enough case study.

I noticed that, since you were thinking deeply enough, but I saw some
danger of other people making decisions too quickly based on the
"labeling".

Please check our history on the following GRs:
 https://www.debian.org/vote/2004/vote_003
 https://www.debian.org/vote/2006/vote_004

We are stuck at "Further discussion" at this moment.

> >> Specifically, I defined 3 types of pre-trained machine
> >> learning models / deep learning models:
> >>
> >>   Free Model, ToxicCandy Model. Non-free Model
> >>
> >> Developers who'd like to touch DL software should be
> >> cautious to the "ToxicCandy" models. Details can be
> >> found in my draft.
> >
> > With a labeling like "ToxicCandy Model" for the situation, it makes bad
> > impression on people and I am afraid people may not be make rational
> > decision.  Is this characterization correct and sane one?  At least,
> > it looks to me that this is changing status-quo of our policy and
> > practice severely.  So it is worth evaluating idea without labeling.
>
> My motivation for the naming "ToxicCandy" is pure: to warn developers
> about this special case as it may lead to very difficult copyright
> or software freedom questions. I admit that this name looks not
> quite friendly. Maybe "SemiFree" look better?

Although I understand the intent of "SemiFree" or "Tainted" (by Yao), I
don't think these are good choices.  We need to draw a line between
FREE (= main) and NON-FREE (non-free) as an organization.  I think there
are 2 FREE models we are allowing in "main" as the current practice:

 * Pure      Free Model: from purely free pre-train data only
 * Sanitized Free Model: from free and non-free mixed pre-train data

And we don't allow a Non-Free Model in "main".

The question is when we can call a model "sanitized" (or "distilled")
enough to qualify for "main" ;-)

> > As long as the "data" comes in the form which allows us to modify it and
> > re-train it to make it better with a set of free software tools to do it,
> > we shouldn't make it non-free, for sure.  That is my position and I
> > think this was what we operated as the project.  We never asked how they
> > are originally made.  The touchy question is how easy it should be to
> > modify and re-train, etc.

...

> Thank you Osamu. These cases inspired me on finding a better
> balance point for DL-Policy. I'll add these cases to the case
> study section, and I'm going to add the following points to DL-Policy:
>
> 1. Free datasets used to train FreeModel are not required to upload
>    to our main section, for example those Osamu mentioned and wikipedia
>    dump. We are not scientific data archiving organization and these
>    data will blow up our infra if we upload too much.
>
> 2. It's not required to re-train a FreeModel with our infra, because
>    the outcome/cost ratio is impractical. The outcome is nearly zero
>    compared to directly using a pre-trained FreeModel, while the cost
>    is increased carbon dioxide in our atmosphere and wasted developer
>    time. (Deep learning is producing much more carbon dioxide than we
>    thought).
>
>    For classical probablistic graph models such as MRF or the mentioned
>    CRF, the training process might be trivial, but re-training is still
>    not required.

... but re-training is highly desirable, in line with the spirit of
free software.

> For SemiFreeModel  I still hesitate to make any decision. Once we let
      SanitizedModel
> them enter the main section there will be many unreproducible
> or hard-to-reproduce but surprisingly "legal" (in terms of DL-Policy)
> files. Maybe this case is to some extent similar to artworks and fonts.
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                           YES.
> Further study needed. And it's still not easy to find a balance point
> for SemiFreeModel  between usefulness and freedom.
      SanitizedModel

Let's use SanitizedModel to be neutral.

We need to have some guiding principle for this sanitization process.
(I don't have an answer now.)

This sanitization mechanism shouldn't be used to include obfuscated
binary-blob equivalents.  That would be worse than the FIRMWARE case,
since such a blob runs on the same CPU as the program code.

Although "Further Discussion" was the outcome, B in
https://www.debian.org/vote/2006/vote_004 is worth looking at:
  Strongly recommends that all non-programmatic works distribute the form
  that the copyright holder or upstream developer would actually use for
  modification. Such forms need not be distributed in the orig.tar.gz
  (unless required by license) but should be made available on upstream
  websites and/or using Debian project resources.

Please note this is "Strongly recommends ... should be made
available..." and not "must be made available ...".

Aside from the Policy/Guideline FREE/NON-FREE discussion, we also need
to address the spirit of reproducible builds.  It would be nice to have
a checking mechanism for the validity and health of these MODELs.  I
know one of the Japanese keyboard input methods, "Anthy", is suffering
from a regression in the upcoming release.  The fix was found too late,
so I uploaded it to experimental, since it contained too many changes
while the impact was subtle.  If we had a test suite with numerical
score outputs, we could have detected such upstream regressions.  It
may be unrealistic to aim for an exact match for such a probabilistic
model, but an objectively traceable measure is very desirable to have.
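
To illustrate (my sketch, not Osamu's proposal): such an objectively
traceable measure could be a tolerance-based regression test. Here
`evaluate()` is a hypothetical function returning a numerical score
(e.g. accuracy) for the packaged model on a fixed, versioned test set,
and the baseline score and tolerance are made-up numbers:

```python
# Sketch of a tolerance-based regression check for a probabilistic
# model. An exact-match check would be unrealistic; bounded drift is not.

BASELINE_SCORE = 0.923   # hypothetical score from the last good upload
TOLERANCE = 0.005        # allowed numerical drift between builds


def evaluate() -> float:
    """Placeholder: score the model on a fixed test set."""
    raise NotImplementedError


def check_regression() -> None:
    score = evaluate()
    if score < BASELINE_SCORE - TOLERANCE:
        raise SystemExit(
            f"regression: score {score:.4f} is below baseline "
            f"{BASELINE_SCORE:.4f} minus tolerance {TOLERANCE:.4f}"
        )
    print(f"ok: score {score:.4f} is within tolerance of the baseline")


if __name__ == "__main__":
    check_regression()
```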

Osamu
