Bug#820119: tidy reports valid NCR as invalid

victory.deb

Package: www.debian.org
Severity: wishlist

https://www.w3.org/International/questions/qa-controls#support
HTML, XHTML and XML 1.0 do not support the C0 range,
except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A,
and CR (Carriage Return) U+000D.
The C1 range is supported, i.e. you can encode the controls directly
or represent them as NCRs (Numeric Character References).

*
https://www.w3.org/International/questions/qa-controls#background
The control codes in the range U+0080-U+009F are known as the "C1" range.
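
As a rough illustration of the rule quoted above, here is a minimal Perl sketch that sorts a code point into the C0 or C1 range. The function name `classify_control` and its return labels are made up for this example; the C0 range is taken to be U+0000-U+001F, and the C1 range is the U+0080-U+009F given in the quoted text.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a code point following the W3C text above.
# Returns 'C0-allowed', 'C0-forbidden', 'C1' or 'other'.
sub classify_control {
    my ($cp) = @_;
    # HT, LF and CR are the only C0 controls the quoted text allows.
    return 'C0-allowed'   if $cp == 0x09 || $cp == 0x0A || $cp == 0x0D;
    return 'C0-forbidden' if $cp >= 0x00 && $cp <= 0x1F;
    # C1 controls, including BREAK PERMITTED HERE (U+0082).
    return 'C1'           if $cp >= 0x80 && $cp <= 0x9F;
    return 'other';
}

printf "U+%04X: %s\n", $_, classify_control($_) for (0x09, 0x0B, 0x82, 0x41);
```

By this reading, `&#130;` (U+0082) falls in the C1 range and is therefore a valid NCR.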

unfortunately, no tidy option seems to suppress this warning :(
the latest source uses the same code (line 1165 onward):
https://github.com/htacg/tidy-html5/blob/master/src/lexer.c


--
victory
no need to CC me :-)

Bug#820119: tidy reports valid NCR as invalid

Frank Lichtenheld
2016-04-05 18:12 GMT+02:00 victory <[hidden email]>:
>
> Package: www.debian.org

I assume you wanted to report this against tidy, not www.debian.org?

> Severity: wishlist
>
> https://www.w3.org/International/questions/qa-controls#support
> HTML, XHTML and XML 1.0 do not support the C0 range,
> except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A,
> and CR (Carriage Return) U+000D.
> The C1 range is supported, i.e. you can encode the controls directly
> or represent them as NCRs (Numeric Character References).
>
> *
> https://www.w3.org/International/questions/qa-controls#background
> The control codes in the range U+0080-U+009F are known as the "C1" range.
>
> unfortunately no option seems to eliminate this :(
> latest source use the same code (line 1165-)
> https://github.com/htacg/tidy-html5/blob/master/src/lexer.c
>
>
> --
> victory
> no need to CC me :-)
>


--
Frank Lichtenheld <[hidden email]>

Bug#820119: tidy reports valid NCR as invalid

victory.deb
On Tue, 5 Apr 2016 20:16:53 +0200
Frank Lichtenheld wrote:

> I assume you wanted to report this against tidy, not www.debian.org?

if so, I always report to the upstream, not the debian's one

see https://www-master.debian.org/build-logs/tidy/
files w/ 142bytes are caused by the issue
(other langs do not have the page [international/l10n/po/pl])

as this is an issue about managing the site,
you have some choices:
1) fix tidy (upstream or package) and use an appropriate option
2) suppress tidy's output (pipe it to sed or use a locally modified tidy)
3) tamper with the po file
4) ignore this and accept the current situation forever
   (until the Last-Translator changes)

--
victory
no need to CC me :-)

Bug#820119: tidy reports valid NCR as invalid

Frank Lichtenheld
2016-04-06 18:52 GMT+02:00 victory <[hidden email]>:

> On Tue, 5 Apr 2016 20:16:53 +0200
> Frank Lichtenheld wrote:
>
>> I assume you wanted to report this against tidy, not www.debian.org?
>
> if so, I always report to the upstream, not the debian's one
>
> see https://www-master.debian.org/build-logs/tidy/
> files w/ 142bytes are caused by the issue
> (other langs do not have the page [international/l10n/po/pl])

Okay, that paragraph would have been helpful in the original mail to
understand the context of your statement.

Regards,
  Frank

--
Frank Lichtenheld <[hidden email]>

Bug#820119: tidy reports valid NCR as invalid

AYANOKOUZI, Ryuunosuke
Control: tag -1 + patch

Dear all,

At Wed, 6 Apr 2016 20:48:15 +0200,
Frank Lichtenheld wrote:

>
> 2016-04-06 18:52 GMT+02:00 victory <[hidden email]>:
> > On Tue, 5 Apr 2016 20:16:53 +0200
> > Frank Lichtenheld wrote:
> >
> >> I assume you wanted to report this against tidy, not www.debian.org?
> >
> > if so, I always report to the upstream, not the debian's one
> >
> > see https://www-master.debian.org/build-logs/tidy/
> > files w/ 142bytes are caused by the issue
> > (other langs do not have the page [international/l10n/po/pl])
>
> Okay, that paragraph would have been helpful in the original mail to
> understand the contexts of your statement.

In HTML, "BREAK PERMITTED HERE" followed by "SPACE" can be replaced with a plain " ",
because browsers already allow a line break between words separated by a space.

How about applying this patch?
It replaces "&#130; " in the translator's name with " ".
The patch is not fully tested, but I hope it works.
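
The attached patch is not reproduced in this thread, so the following is only a guess at the kind of substitution described: a minimal sketch, assuming the name arrives as a plain string containing the literal text `&#130; `. The helper name `strip_bph_before_space` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the NCR for U+0082 when a space follows it,
# leaving just the space (where a browser may break the line anyway).
sub strip_bph_before_space {
    my ($name) = @_;
    $name =~ s/&#130; / /g;
    return $name;
}

print strip_bph_before_space('Given&#130; Family'), "\n";
```

A reference that is not followed by a space is left untouched by this sketch.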

Sincerely yours,
Ryuunosuke Ayanokouzi
--
AYANOKOUZI, Ryuunosuke <[hidden email]>

Attachments: fix_NCR_130.patch (1K), attachment1 (484 bytes)

Bug#821159: dupe of #821155

Mark Hindley
In reply to this post by Frank Lichtenheld

merge 821159
package apt-cacher
thanks

Bug#820119: closed by Debian WWW CVS <webmaster@debian.org> (reply to debian-www@lists.debian.org) (Debian WWW CVS commit by djpig fixes #820119)

victory.deb
In reply to this post by victory.deb
On Fri, 20 May 2016 21:18:09 +0000
Debian Bug Tracking System wrote:

> @@ -117,6 +117,8 @@ sub transform_translator {
>          $name =~ s/\s*<.*//;
>          $name =~ s/&(?!#)/&amp;/g;
>          $name =~ s/=\?.*?\?=//g;
> +        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
> +        $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
>          $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>          $name = '' if $name =~ m/\@/;
>          return $name;

the comment is wrong;
as I said, BPH (i.e. the C1 range) is supported.
tidy is simply wrong here, so you had to strip the character yourselves
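
For reference, the substitution from the quoted commit can be tried in isolation. This is only a sketch: the wrapper `strip_bph` is a hypothetical name, and only the regular expression itself comes from the diff above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution in the quoted commit.
# It removes U+0082 written as a decimal NCR, a hexadecimal NCR
# (either with optional leading zeros), or as the literal character.
sub strip_bph {
    my ($name) = @_;
    $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
    return $name;
}

for my $sample ('A&#130;B', 'A&#000130;B', 'A&#x82;B', "A\x{82}B", 'A&#1300;B') {
    print length(strip_bph($sample)), "\n";
}
```

Other references in the C1 range, and unrelated numbers such as `&#1300;`, pass through unchanged.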

--
victory
no need to CC me :-)

Bug#820119: restores original characters instead of taking care of every time numeric references coming up

victory.deb

first, it is stupid to complain about names which are valid.
it is also stupid to handle each occurrence one by one as it comes up.
as the pages are all utf-8 now, there is no need to keep such references;
this patch restores the original characters instead of numeric references

patch below:
Index: english/international/l10n/scripts/gen-files.pl
===================================================================
--- english/international/l10n/scripts/gen-files.pl (revision 232)
+++ english/international/l10n/scripts/gen-files.pl (working copy)
@@ -3,6 +3,7 @@
 use strict;
 use File::Path;
 use Getopt::Long;
+use Encode qw(encode);
 
 use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
 
@@ -117,8 +118,7 @@
         $name =~ s/\s*<.*//;
         $name =~ s/&(?!#)/&amp;/g;
         $name =~ s/=\?.*?\?=//g;
-        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
-        $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
+        $name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
         $name = 'DDTP' if $name eq 'Debian Description Translation Project';
         $name = '' if $name =~ m/\@/;
         return $name;
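
The effect of the new substitution can be sketched outside gen-files.pl as follows. The helper name `decode_ncr` is hypothetical, and the second substitution (for hexadecimal references) is an addition for illustration; the patch above handles only the decimal form.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: replace numeric character references with the
# UTF-8 bytes of the character they name.
sub decode_ncr {
    my ($name) = @_;
    # Decimal form, as in the patch above.
    $name =~ s/&#(\d+);/encode("UTF-8", chr($1))/ge;
    # Hexadecimal form: an assumption added here, not part of the patch.
    $name =~ s/&#x([0-9a-f]+);/encode("UTF-8", chr(hex($1)))/gie;
    return $name;
}

my $bytes = decode_ncr('Ren&#233; &#128513;');
print length($bytes), " bytes\n";
```

Note that decoding every reference also turns `&#38;` or `&#60;` into a raw `&` or `<`, so a real version would probably want to leave those alone.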


--
victory
no need to CC me :-)

Bug#820119: restores original characters instead of taking care of every time numeric references coming up

Laura Arjona Reina-4
Hi

El 13/01/17 a las 11:34, victory escribió:

>
> first, it is stupid to blame about names which are valid.
> it is also stupid that taking care of each occurrences coming up.
> as pages are all utf-8 now, no need to keep such references,
> this patch restores original characters instead of numeric references
>
> patch below:
> Index: english/international/l10n/scripts/gen-files.pl
> ===================================================================
> --- english/international/l10n/scripts/gen-files.pl (revision 232)
> +++ english/international/l10n/scripts/gen-files.pl (working copy)
> @@ -3,6 +3,7 @@
>  use strict;
>  use File::Path;
>  use Getopt::Long;
> +use Encode qw(encode);
>  
>  use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
>  
> @@ -117,8 +118,7 @@
>          $name =~ s/\s*<.*//;
>          $name =~ s/&(?!#)/&amp;/g;
>          $name =~ s/=\?.*?\?=//g;
> -        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
> -        $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
> +        $name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
>          $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>          $name = '' if $name =~ m/\@/;
>          return $name;
>
>

Thanks for all the work in these and other validation/tidy issues in
the website.

I've done some tests and I'm afraid I cannot merge the patch yet.

Using perl to encode to UTF-8 as you propose makes tidy happy, but
there is another script that is run on the files, "validate", which produces
these errors:

Line 10, character 12:  non SGML character number 130

If we use numeric entities, tidy complains about &#000130; unless we
suppress the character as we do now.

For the emoji in translator name, "validate" complains in any case:

* Using numeric entities: with the current message received:

"128513" is not a character number in the document character set

* Encoding to UTF8 as the proposed patch:

Line 10, character 29:  non SGML character number 65533

I've produced two small files:

https://cosas.larjona.net/validate.utf8.html
https://cosas.larjona.net/validate.ncr.html

and ran them through the online validator at https://validator.w3.org/

I'll try to see if we can use https://validator.w3.org/source/ and get
better "tidy" and "validate" tools from there.

For now, I've fixed the comment in the gen-files.pl:

--- english/international/l10n/scripts/gen-files.pl     20 May 2016 21:15:45 -0000      1.97
+++ english/international/l10n/scripts/gen-files.pl     14 Jan 2017 12:41:06 -0000
@@ -117,7 +117,10 @@
         $name =~ s/\s*<.*//;
         $name =~ s/&(?!#)/&amp;/g;
         $name =~ s/=\?.*?\?=//g;
-        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
+        # BREAK PERMITTED HERE (U+0082) is allowed in HTML 4.01.
+        # but the "tidy" tool that we use complains about them,
+        # so we just remove those characters for now, until better solution
+        # see Bug #820119
         $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
         $name = 'DDTP' if $name eq 'Debian Description Translation Project';
         $name = '' if $name =~ m/\@/;

Best regards
--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona

Bug#820119: restores original characters instead of taking care of every time numeric references coming up

Laura Arjona Reina-3
Hi again.

I think my conclusion was silly: I was only considering encoding the whole string.
But we can encode the &#000130; and leave the emoji as a numeric entity.
Victory is right; I'll try to think more clearly later and merge the patch today. (Now afk, sorry).



Laura Arjona Reina
https://wiki.debian.org/LauraArjona

Bug#820119: restores original characters instead of taking care of every time numeric references coming up

victory.deb
On Sat, 14 Jan 2017 13:55:16 +0100
Laura Arjona Reina wrote:

> Victory is right

no, i misinterpreted validate as tidy :p

what the patch does:
  if both
    (charset is "utf-8")
  and
    ([the last 56 chars of an error] is
      "is not a character number in the document character set\n")
  are satisfied,
  then the current iteration is skipped
    (push(@errors, $_) will not be processed in this case)
  and processing continues with the next error
patch for git:///debwww/cron: scripts/validate below:

@@ -392,10 +392,13 @@ foreach $file (@files) {
         if ($#error < 5) {
 
             next;
 
         } elsif ($error[4] eq 'E' || $error[4] eq 'X') {
+    next if($charset eq "utf-8" &&
+    substr($error[5],-56) eq
+    "is not a character number in the document character set\n");
 
             push(@errors, $_);
 
             # If the DOCTYPE is bad, bail out
             last if ($error[5] eq " unrecognized {{DOCTYPE}}; unable to check document\n");
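
The same condition can be written as a standalone predicate, which makes it easy to try on sample messages. This is a sketch under assumptions: the name `is_ignorable_error` and the `$SUFFIX` variable are hypothetical, and the function compares against the suffix length rather than the hard-coded 56 (the two are equal for this message).

```perl
use strict;
use warnings;

# The message tail the patch above looks for (56 characters with the newline).
my $SUFFIX = "is not a character number in the document character set\n";

# Hypothetical predicate: true when an error message should be skipped
# instead of being pushed onto @errors.
sub is_ignorable_error {
    my ($charset, $message) = @_;
    return 0 unless defined $charset && $charset eq 'utf-8';
    # Guard against messages shorter than the suffix.
    return 0 if length($message) < length($SUFFIX);
    return substr($message, -length($SUFFIX)) eq $SUFFIX ? 1 : 0;
}

my $msg = qq{ "128513" is not a character number in the document character set\n};
print is_ignorable_error('utf-8', $msg) ? "skip\n" : "report\n";
```

Keeping the charset check first means pages in other encodings still get the error reported.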


--
victory
no need to CC me :-)

Bug#820119: [www.debian.org] validation errors: cannot convert character reference to number X because character not in internal character set

Laura Arjona Reina-4
In reply to this post by victory.deb
Hello all
Now that we are using the more modern tool onsgmls instead of nsgmls in our
"validate" script:

https://anonscm.debian.org/cgit/debwww/cron.git/tree/scripts/validate

I've returned to this bug.

The output of the validate script for the files containing "emojis" didn't
change much:

**** Errors validating
        /srv/www.debian.org/www/international/l10n/po/en_GB.it.html: ***
Line 122, character 357:  cannot convert character reference to number
        128513 because character not in internal character set

I was a bit surprised that we are still getting these errors, because if I run the file through
the online W3C validator https://validator.w3.org/ or even a manual onsgmls
command on the machine that builds the website:

onsgmls -E0 -s /path/to/dtd /path/to/file

in both cases I don't get any error.
So I've looked at the "validate" script and played a bit with the options set
there, and I'd like to bring to your attention the lines L363-376:

    # Determine whether we're dealing with HTML or XHTML and set the SP
    # environment accordingly.
    if ($xhtml{$htmlLevel}) {
        $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
        $ENV{'SP_ENCODING'} = 'xml';
    } else {
        $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
        if (defined $charset) {
            $ENV{'SP_ENCODING'} = $charset;
        } else {
            $ENV{'SP_ENCODING'} = "ISO-8859-1";
        }
    }
    $ENV{'SP_CHARSET_FIXED'} = 1;

If I comment out this last line (thus letting onsgmls run in non-fixed mode), I
get no errors validating the file.

I've read the documentation about these options:

http://openjade.sourceforge.net/doc/charset.htm

but frankly I don't understand it very much.

I've done:

larjona@wolkenstein:~$ sudo -u debwww env | grep SP_

and it returns nothing, so I guess only the environment set in the "validate" script
is taken into account; if we don't set the variables there, the defaults apply.

I've modified and run a copy of the validate script, making it print some values
when checking a file, and document type is correctly detected (HTML 4.01
Strict), as well as charset (utf-8).

I'm not sure I can safely comment out line 376

    $ENV{'SP_CHARSET_FIXED'} = 1;

to avoid the errors, or even comment out the whole paragraph, and trust onsgmls to
do the right thing.

Can anybody with more experience in this help?

Thanks
--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona

Bug#820119: tidy reports valid NCR as invalid

Neil Roeth-2
In reply to this post by victory.deb
Laura asked for my help on this issue.  What I found is that setting the
environment variable SP_CHARSET_FIXED to 1 makes the onsgmls program use
the Unicode 2.0 character set, as the referenced web page says. 
However, it uses only the first 65536 characters (the iso10646-ucs-2
character set), so character number 128513 triggers the error since it
is outside that range.  In order to make that work, you need to ensure
SP_CHARSET_FIXED is unset in the validate script.  However, XML files
need SP_CHARSET_FIXED set.  So, I suggest something like this (patch
attached):

    if ($xhtml{$htmlLevel}) {
        $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
        $ENV{'SP_CHARSET_FIXED'} = 1;
        $ENV{'SP_ENCODING'} = 'xml';
    } else {
        $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
        if (defined $charset) {
            $ENV{'SP_BCTF'} = $charset;
        } else {
            $ENV{'SP_BCTF'} = "utf-8";
        }
    }

That also changes the default character set for HTML from ISO-8859-1 to
UTF-8 because the former is not a valid BCTF option.  It appears the
validate script only uses that default if there is not a character set
defined in the HTML file itself and there is no character set option
passed to the script.

I didn't set up the whole web site build on my machine to test if this
change has any negative effects on pages other than en_GB.it.html , so
it needs broader testing.
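
To find other pages that could hit the same limit, one could scan for numeric character references above the 65536-character repertoire described earlier. The following is a minimal sketch, not part of the attached patch; `ncrs_outside_ucs2` and `UCS2_MAX` are hypothetical names.

```perl
use strict;
use warnings;

# Highest code point within the first 65536 characters.
use constant UCS2_MAX => 0xFFFF;

# Hypothetical helper: return the code points of all numeric character
# references (decimal or hexadecimal) in $html that exceed UCS2_MAX.
sub ncrs_outside_ucs2 {
    my ($html) = @_;
    my @found;
    while ($html =~ /&#(?:x([0-9a-f]+)|(\d+));/gi) {
        my $cp = defined $1 ? hex($1) : $2;
        push @found, $cp if $cp > UCS2_MAX;
    }
    return @found;
}

print join(', ', ncrs_outside_ucs2('A &#128513; B &#130; C &#x1F601;')), "\n";
```

Character number 128513 is reported by this check, while 130 is not, which matches the behaviour seen with SP_CHARSET_FIXED set.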



Attachment: validate.patch (799 bytes)

Bug#820119: tidy reports valid NCR as invalid

Neil Roeth
In reply to this post by victory.deb
Sorry, I should have used my Debian email address on that last update. 
I'm the maintainer of the opensp package which provides the onsgmls
executable used by validate.

--
Neil Roeth