Bug#820119: tidy reports valid NCR as invalid


Bug#820119: tidy reports valid NCR as invalid

victory.deb

Package: www.debian.org
Severity: wishlist

https://www.w3.org/International/questions/qa-controls#support
HTML, XHTML and XML 1.0 do not support the C0 range,
except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A,
and CR (Carriage Return) U+000D.
The C1 range is supported, i.e. you can encode the controls directly
or represent them as NCRs (Numeric Character References).

*
https://www.w3.org/International/questions/qa-controls#background
The control codes in the range U+0080-U+009F are known as the "C1" range.

Unfortunately, no tidy option seems to suppress this report :(
The latest source uses the same code (from line 1165):
https://github.com/htacg/tidy-html5/blob/master/src/lexer.c
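
For illustration, the rule quoted from the W3C article can be written down as a small check. This is only a sketch in Python: the function name and the returned labels are made up for this example, and it says nothing about how tidy itself classifies characters.

```python
# C0 controls that HTML, XHTML and XML 1.0 do accept: HT, LF, CR.
C0_ALLOWED = {0x09, 0x0A, 0x0D}


def control_status(code_point: int) -> str:
    """Classify a code point according to the W3C statement quoted above."""
    if 0x00 <= code_point <= 0x1F:
        # C0 range: unsupported, except for the three whitespace controls.
        return "c0-allowed" if code_point in C0_ALLOWED else "c0-unsupported"
    if 0x80 <= code_point <= 0x9F:
        # C1 range: may be encoded directly or written as an NCR.
        return "c1-supported"
    return "not-c0-or-c1"


# &#130; is U+0082 (BREAK PERMITTED HERE), which lies in the C1 range.
print(control_status(130))
```

Under this reading, a reference such as &#130; is valid markup, which is the point of this report.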


--
victory
no need to CC me :-)


Bug#820119: tidy reports valid NCR as invalid

Frank Lichtenheld
2016-04-05 18:12 GMT+02:00 victory <[hidden email]>:
>
> Package: www.debian.org

I assume you wanted to report this against tidy, not www.debian.org?

> Severity: wishlist
>
> https://www.w3.org/International/questions/qa-controls#support
> HTML, XHTML and XML 1.0 do not support the C0 range,
> except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A,
> and CR (Carriage Return) U+000D.
> The C1 range is supported, i.e. you can encode the controls directly
> or represent them as NCRs (Numeric Character References).
>
> *
> https://www.w3.org/International/questions/qa-controls#background
> The control codes in the range U+0080-U+009F are known as the "C1" range.
>
> unfortunately no option seems to eliminate this :(
> latest source use the same code (line 1165-)
> https://github.com/htacg/tidy-html5/blob/master/src/lexer.c
>
>
> --
> victory
> no need to CC me :-)
>


--
Frank Lichtenheld <[hidden email]>


Bug#820119: tidy reports valid NCR as invalid

victory.deb
On Tue, 5 Apr 2016 20:16:53 +0200
Frank Lichtenheld wrote:

> I assume you wanted to report this against tidy, not www.debian.org?

if so, I would have reported it upstream (as I always do), not against the Debian package

see https://www-master.debian.org/build-logs/tidy/
the 142-byte files there are caused by this issue
(other languages do not have the page [international/l10n/po/pl])

as this is an issue about managing the site,
you have some choices:
1) fix tidy (upstream or in the package) and use the appropriate option
2) filter tidy's output (pipe it through sed or use a locally modified tidy)
3) tamper with the po file
4) ignore this and accept the current situation forever
   (until the Last-Translator changes)

--
victory
no need to CC me :-)


Bug#820119: tidy reports valid NCR as invalid

Frank Lichtenheld
2016-04-06 18:52 GMT+02:00 victory <[hidden email]>:

> On Tue, 5 Apr 2016 20:16:53 +0200
> Frank Lichtenheld wrote:
>
>> I assume you wanted to report this against tidy, not www.debian.org?
>
> if so, I always report to the upstream, not the debian's one
>
> see https://www-master.debian.org/build-logs/tidy/
> files w/ 142bytes are caused by the issue
> (other langs do not have the page [international/l10n/po/pl])

Okay, that paragraph would have been helpful in the original mail to
understand the context of your statement.

Regards,
  Frank

--
Frank Lichtenheld <[hidden email]>


Bug#820119: tidy reports valid NCR as invalid

AYANOKOUZI, Ryuunosuke
Control: tag -1 + patch

Dear all,

At Wed, 6 Apr 2016 20:48:15 +0200,
Frank Lichtenheld wrote:

>
> 2016-04-06 18:52 GMT+02:00 victory <[hidden email]>:
> > On Tue, 5 Apr 2016 20:16:53 +0200
> > Frank Lichtenheld wrote:
> >
> >> I assume you wanted to report this against tidy, not www.debian.org?
> >
> > if so, I always report to the upstream, not the debian's one
> >
> > see https://www-master.debian.org/build-logs/tidy/
> > files w/ 142bytes are caused by the issue
> > (other langs do not have the page [international/l10n/po/pl])
>
> Okay, that paragraph would have been helpful in the original mail to
> understand the contexts of your statement.

In HTML, "BREAK PERMITTED HERE" followed by "SPACE" can be replaced by a
plain " ", because browsers already allow a line break between words
separated by a space.

How about applying this patch?
It replaces "&#130; " in the translator's name with " ".
The patch is not fully tested, but I hope it will work.

Sincerely yours,
Ryuunosuke Ayanokouzi
--
AYANOKOUZI, Ryuunosuke <[hidden email]>

Attachments: fix_NCR_130.patch (1K), attachment1 (484 bytes)

Processed: Re: Bug#820119: tidy reports valid NCR as invalid

Debian Bug Tracking System
In reply to this post by victory.deb
Processing control commands:

> tag -1 + patch
Bug #820119 [www.debian.org] tidy reports valid NCR as invalid
Added tag(s) patch.

--
820119: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=820119
Debian Bug Tracking System
Contact [hidden email] with problems


Bug#820119: marked as done (tidy reports valid NCR as invalid)

Debian Bug Tracking System
In reply to this post by victory.deb
Your message dated Fri, 20 May 2016 21:15:45 +0000
with message-id <[hidden email]>
and subject line Debian WWW CVS commit by djpig fixes #820119
has caused the Debian Bug report #820119,
regarding tidy reports valid NCR as invalid
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [hidden email]
immediately.)


--
820119: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=820119
Debian Bug Tracking System
Contact [hidden email] with problems



This bug was closed by djpig in the webwml CVS repository:

https://www.debian.org/devel/website/using_cvs

Note that it might take some time until www.debian.org has been updated.

CVSROOT: /cvs/webwml
Module name: webwml
Changes by: djpig 16/05/20 21:15:45

Modified files:
        english/international/l10n/scripts: gen-files.pl

Log message:
        Remove BREAK PERMITTED HERE (Closes: #820119)

Bug#820119: closed by Debian WWW CVS <webmaster@debian.org> (reply to debian-www@lists.debian.org) (Debian WWW CVS commit by djpig fixes #820119)

victory.deb
In reply to this post by victory.deb
On Fri, 20 May 2016 21:18:09 +0000
Debian Bug Tracking System wrote:

> @@ -117,6 +117,8 @@ sub transform_translator {
>          $name =~ s/\s*<.*//;
>          $name =~ s/&(?!#)/&amp;/g;
>          $name =~ s/=\?.*?\?=//g;
> +        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
> +        $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
>          $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>          $name = '' if $name =~ m/\@/;
>          return $name;

the comment is wrong;
as I said, BPH (like the rest of the C1 range) is supported.
tidy is just stupid here, so you had to strip the character yourselves
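
For readers who do not read Perl regexes fluently, here is a rough Python rendering of the substitution in the quoted commit. It is a sketch only: the function name and the sample name "Jan Kowalski" are invented for the example, and the real code remains the Perl line in gen-files.pl.

```python
import re

# Matches U+0082 written as a decimal NCR (&#130;, leading zeros allowed),
# as a hexadecimal NCR (&#x82;, leading zeros allowed), or as the raw
# character. IGNORECASE covers &#X82; as well, like the /i flag in Perl.
BPH_PATTERN = re.compile(r"(?:&#0*130;|&#x0*82;|\u0082)", re.IGNORECASE)


def strip_bph(name: str) -> str:
    """Remove every form of BREAK PERMITTED HERE from a translator name."""
    return BPH_PATTERN.sub("", name)


# Hypothetical translator name containing the reference tidy complains about.
print(strip_bph("Jan&#000130; Kowalski"))
```

Note that the substitution works around tidy's report by deleting the character; it does not make the character any less valid.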

--
victory
no need to CC me :-)


Bug#820119: restores original characters instead of taking care of every time numeric references coming up

victory.deb

first, it is stupid to complain about names which are valid.
it is also stupid to handle each occurrence one by one as it comes up.
as the pages are all utf-8 now, there is no need to keep such references;
this patch restores the original characters instead of keeping numeric references

patch below:
Index: english/international/l10n/scripts/gen-files.pl
===================================================================
--- english/international/l10n/scripts/gen-files.pl (revision 232)
+++ english/international/l10n/scripts/gen-files.pl (working copy)
@@ -3,6 +3,7 @@
 use strict;
 use File::Path;
 use Getopt::Long;
+use Encode qw(encode);
 
 use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
 
@@ -117,8 +118,7 @@
         $name =~ s/\s*<.*//;
         $name =~ s/&(?!#)/&amp;/g;
         $name =~ s/=\?.*?\?=//g;
-        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
-        $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
+        $name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
         $name = 'DDTP' if $name eq 'Debian Description Translation Project';
         $name = '' if $name =~ m/\@/;
         return $name;
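
The idea of the patch, turning each decimal numeric reference back into the character it stands for, can be sketched in Python as follows. This is an illustration, not the patch itself: the function name is made up, and the range guard is an addition of this sketch that the Perl one-liner does not have.

```python
import re

# Decimal references only, as in the patch; hexadecimal ones are left alone.
DECIMAL_NCR = re.compile(r"&#(\d+);")


def decode_decimal_ncrs(name: str) -> str:
    """Replace decimal NCRs such as &#130; with the characters they denote."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1))
        if code_point > 0x10FFFF:
            # Not a Unicode code point: keep the reference unchanged.
            return match.group(0)
        return chr(code_point)

    return DECIMAL_NCR.sub(replace, name)


decoded = decode_decimal_ncrs("A&#130;B &#128513;")
# The Perl patch encodes to UTF-8 inside the substitution; here the string
# is decoded first and encoded once when it is written out.
print(decoded.encode("utf-8"))
```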


--
victory
no need to CC me :-)


Bug#820119: restores original characters instead of taking care of every time numeric references coming up

Laura Arjona Reina-4
Hi

On 13/01/17 at 11:34, victory wrote:

>
> first, it is stupid to blame about names which are valid.
> it is also stupid that taking care of each occurrences coming up.
> as pages are all utf-8 now, no need to keep such references,
> this patch restores original characters instead of numeric references
>
> patch below:
> Index: english/international/l10n/scripts/gen-files.pl
> ===================================================================
> --- english/international/l10n/scripts/gen-files.pl (revision 232)
> +++ english/international/l10n/scripts/gen-files.pl (working copy)
> @@ -3,6 +3,7 @@
>  use strict;
>  use File::Path;
>  use Getopt::Long;
> +use Encode qw(encode);
>  
>  use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
>  
> @@ -117,8 +118,7 @@
>          $name =~ s/\s*<.*//;
>          $name =~ s/&(?!#)/&amp;/g;
>          $name =~ s/=\?.*?\?=//g;
> -        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
> -        $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
> +        $name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
>          $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>          $name = '' if $name =~ m/\@/;
>          return $name;
>
>

Thanks for all the work in these and other validation/tidy issues in
the website.

I've done some tests and I'm afraid I cannot merge the patch yet.

Using perl to encode to UTF-8 as you propose makes tidy happy, but
another script that is run on the files, "validate", produces
these errors:

Line 10, character 12:  non SGML character number 130

If we use numeric entities, tidy complains about &#000130; unless we
suppress the character as we do now.

For the emoji in a translator name, "validate" complains in either case:

* Using numeric entities, we get the message we currently receive:

"128513" is not a character number in the document character set

* Encoding to UTF-8 as in the proposed patch:

Line 10, character 29:  non SGML character number 65533

I've produced two small files:

https://cosas.larjona.net/validate.utf8.html
https://cosas.larjona.net/validate.ncr.html

and ran them through the online validator at https://validator.w3.org/

I'll try to see if we can use https://validator.w3.org/source/ and get
better "tidy" and "validate" tools from there.

For now, I've fixed the comment in the gen-files.pl:

--- english/international/l10n/scripts/gen-files.pl     20 May 2016 21:15:45 -0000      1.97
+++ english/international/l10n/scripts/gen-files.pl     14 Jan 2017 12:41:06 -0000
@@ -117,7 +117,10 @@
         $name =~ s/\s*<.*//;
         $name =~ s/&(?!#)/&amp;/g;
         $name =~ s/=\?.*?\?=//g;
-        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
+        # BREAK PERMITTED HERE (U+0082) is allowed in HTML 4.01.
+        # but the "tidy" tool that we use complains about them,
+        # so we just remove those characters for now, until better solution
+        # see Bug #820119
         $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
         $name = 'DDTP' if $name eq 'Debian Description Translation Project';
         $name = '' if $name =~ m/\@/;

Best regards
--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona


Bug#820119: restores original characters instead of taking care of every time numeric references coming up

Laura Arjona Reina-3
Hi again.

I think my conclusion was silly: I was only considering encoding the whole string.
But we can encode the &#000130; and leave the emoji as a numeric entity.
Victory is right. I'll try to think more clearly later and merge the patch today. (Now afk, sorry).



Laura Arjona Reina
https://wiki.debian.org/LauraArjona


Bug#820119: restores original characters instead of taking care of every time numeric references coming up

victory.deb
On Sat, 14 Jan 2017 13:55:16 +0100
Laura Arjona Reina wrote:

> Victory is right

no, I mistook validate for tidy :p

what the patch does:
  if both of
    (charset is "utf-8")
  and
    ([the last 56 chars of an error] is
      "is not a character number in the document character set\n")
  are satisfied,
  then the current loop iteration is skipped
    (push(@errors, $_) is not executed in this case)
  and processing continues with the next error

patch for git:///debwww/cron: scripts/validate below:

@@ -392,10 +392,13 @@ foreach $file (@files) {
         if ($#error < 5) {
 
             next;
 
         } elsif ($error[4] eq 'E' || $error[4] eq 'X') {
+    next if($charset eq "utf-8" &&
+    substr($error[5],-56) eq
+    "is not a character number in the document character set\n");
 
             push(@errors, $_);
 
             # If the DOCTYPE is bad, bail out
             last if ($error[5] eq " unrecognized {{DOCTYPE}}; unable to check document\n");
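
The condition added by this patch can be sketched on its own in Python. The function name is hypothetical and the surrounding loop of the real "validate" script is not reproduced; only the test that decides whether an error is dropped is shown.

```python
# The message tail the patch looks for; it is exactly 56 characters long,
# including the trailing newline, which is why the Perl code uses
# substr($error[5], -56).
SUFFIX = "is not a character number in the document character set\n"


def is_false_positive(charset: str, message: str) -> bool:
    """Return True when the patch would skip this validator error."""
    return charset == "utf-8" and message[-56:] == SUFFIX


# An emoji reference reported for a UTF-8 page is ignored ...
print(is_false_positive("utf-8", ' "128513" ' + SUFFIX))
# ... while any other error is still collected.
print(is_false_positive("utf-8", " non SGML character number 130\n"))
```

Comparing a fixed-length tail rather than the whole message means the quoted character number in front of it does not matter.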


--
victory
no need to CC me :-)


Bug#820119: marked as done (tidy reports valid NCR as invalid)

Debian Bug Tracking System
In reply to this post by victory.deb
Your message dated Fri, 10 Nov 2017 17:39:43 +0100
with message-id <[hidden email]>
and subject line Re: Bug#820119: tidy reports valid NCR as invalid
has caused the Debian Bug report #820119,
regarding tidy reports valid NCR as invalid
to be marked as done.



--
820119: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=820119
Debian Bug Tracking System
Contact [hidden email] with problems



Hello Neil
Thank you very much for the research and for the patch.

Yesterday I created an additional "validate_sp" script including your
patch, that was run right after the usual one, to be able to compare the
output of both programs.

Everything went OK, and validate_sp reported the same other validation
issues (the real ones), but not this kind of "false positive".

Today I've just merged the patch in the usual validate script and
removed the temporary ones.

This bug can be closed, then. From tomorrow on we shouldn't get any more
"Tidy validation failed" messages due to emoji characters.

Thanks again!

--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona