Bug#959474: Issues with Chinese language (all variants) when building some pages in buster


Laura Arjona Reina-4
Package: www.debian.org
Severity: normal
User: [hidden email]
Usertags: scripts
X-Debbugs-CC: [hidden email]
X-Debbugs-CC: [hidden email]

Hi all,

TL;DR

Some Chinese pages have issues when they are built on a buster machine.
We need to fix those issues (at least the "Malformed UTF-8 character
[...] at ../../bin/tocn.pl [...]" ones) so DSA can upgrade the
www-master machine to buster. See the summary of the log at the bottom
to know which files produce this error.
I have no idea how to fix the issues, so any help from the Chinese
team or web team mates is greatly appreciated.
Additional issues may arise (e.g. I haven't tested the release-notes
or doc-manual yet), so any help with testing is welcome too. Please
create bug reports for each different issue or update the existing ones.
Thanks!

LONG VERSION

I've done a test build of the /english and /chinese subdirs on a buster
machine, and I have noticed some warnings/errors related to some (not
all) of the Chinese pages.

It would be desirable to upgrade the www-master machine to buster as
soon as possible, so any help with this (from website or Chinese team
members) is much appreciated.

Below you can find an extract of the build log, including only the
files for which I got some error or warning message.

After the build, I have compared the problematic HTML files of a build
in stretch and a build in buster with a diff tool, to see if there were
significant changes in the html output due to these issues.

Here are my results:

* For the messages of the type ", [zh_TW]Invalid UTF8: " when building,
I couldn't see any difference between the output of a stretch build and
the output of a buster build.

I would say this is not a blocker for the buster upgrade of www-master.

* For the messages of the type "Malformed UTF-8 character [...] at
../../bin/tocn.pl [...]" I have seen important changes in the HTML diff.
I think the output in the buster build is totally broken (fortunately,
there are not many files in that situation).

I would say this is a blocker for the buster upgrade of www-master, but
I would prefer somebody from the Chinese team to confirm (try to build
those files on a buster machine, and review the output).

Additional notes:

* I have only tested the wml build, not the rest of the cron scripts
that run on www-master. I will try to do it in the following days, but
if you already know of any that work well (e.g. release-notes,
doc-manuals...) just tell me so I can skip them.

* When I build files on my machines, something is wrong in my
environment and the .po files are not integrated every time, so for
example the Chinese pages I build show the menus and footnote in
English. Therefore, if there is any issue with the encoding of the .po
files themselves, I guess I cannot detect it until I fix my particular
issue :/

* The local build that I make uses the SAMPLE_FILES that are needed in
some folders; so additional issues may arise when we use the actual
files that are generated at runtime in the often and lessoften cron jobs.

That's all for now, I think. Thanks for your patience reading and for
your help!

Kind regards,
--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona


--- extract of the build log file

/chinese

Processing
donations.wml:
[zh_CN]Invalid UTF8:
ïŒŒç‚¹å‡»â€œæ·»åŠ åˆ°èŽ­ç‰©èœŠâ€ïŒŒç„¶åŽå®Œæˆå‰©äœ™è¿‡çš‹ã€‚
, [zh_TW]Invalid UTF8:
ïŒŒç‚¹å‡»â€œæ·»åŠ åˆ°èŽ­ç‰©èœŠâ€ïŒŒç„¶åŽå®Œæˆå‰©äœ™è¿‡çš‹ã€‚
, [zh_HK]Invalid UTF8:
ïŒŒç‚¹å‡»â€œæ·»åŠ åˆ°èŽ­ç‰©èœŠâ€ïŒŒç„¶åŽå®Œæˆå‰©äœ™è¿‡çš‹ã€‚
.

make[1]: Entering directory '/webwml/chinese/Bugs'
Processing Reporting.wml: [zh_CN]Invalid UTF8:
°äž€æ¬¡ç€ºäŸ‹äŒšè¯çš„过皋。</li>
, [zh_TW]Invalid UTF8: °äž€æ¬¡ç€ºäŸ‹äŒšè¯çš„过皋。</li>
, [zh_HK]Invalid UTF8: °äž€æ¬¡ç€ºäŸ‹äŒšè¯çš„过皋。</li>
.

make[2]: Entering directory '/webwml/chinese/News/2000'

Processing 20000815.wml:
[zh_CN]Invalid UTF8: µ·å€–朋友的錎力協助包括
, [zh_TW]Invalid UTF8: µ·å€–朋友的錎力協助包括
, [zh_HK]Invalid UTF8: µ·å€–朋友的錎力協助包括
.

make[2]: Entering directory '/webwml/chinese/News/2009'
Processing 20090214.wml: [zh_CN]Invalid UTF8: šSun SPARC (sparc)、
, [zh_TW]Invalid UTF8: šSun SPARC (sparc)、
, [zh_HK]Invalid UTF8: šSun SPARC (sparc)、
.

make[2]: Entering directory '/webwml/chinese/News/weekly'

copying index.zh-cn.html to ../../../../www/News/weekly/./2002/48
Processing index.wml: [zh_CN]Malformed UTF-8 character (unexpected end
of string) in substitution (s///) at ../../bin/tocn.pl line 13, <> line 146.
Malformed UTF-8 character (unexpected end of string) in substitution
(s///) at ../../bin/tocn.pl line 15, <> line 146.
panic: do_trans_simple_utf8 line 362 at ../../bin/tocn.pl line 20, <>
line 146.
, [zh_TW]Invalid UTF8: å‘
, [zh_HK]Invalid UTF8: å‘
.
copying index.zh-cn.html to ../../../../www/News/weekly/./2002/49

copying index.zh-cn.html to ../../../../www/News/weekly/./2003/09
Processing index.wml: [zh_CN]Invalid UTF8: –‡æª”描述了埞安裝
, [zh_TW]Invalid UTF8: –‡ä»¶æè¿°äº†åŸžå®‰è£
, [zh_HK]Invalid UTF8: –‡ä»¶æè¿°äº†åŸžå®‰è£
.
copying index.zh-cn.html to ../../../../www/News/weekly/./2003/10
Processing index.wml: [zh_CN]Invalid UTF8: ˆ‘们的<a
href="../../../../events/talks">挔讲页面</a>来吗
, [zh_TW]Invalid UTF8: ˆ‘们的<a
href="../../../../events/talks">挔讲页面</a>来吗
, [zh_HK]Invalid UTF8: ˆ‘们的<a
href="../../../../events/talks">挔讲页面</a>来吗
.
copying index.zh-cn.html to ../../../../www/News/weekly/./2012/15

make[1]: Entering directory '/webwml/chinese/devel'

Processing
testing.wml:
[zh_CN],
[zh_TW]Invalid
UTF8: ˆ°äº† 4
個䞍打算曎新的軟件包因爲它們會砎壞䟝賎。<q>(0)</q> 是無
, [zh_HK]Invalid
UTF8: ˆ°äº† 4
個䞍打算曎新的軟件包因爲它們會砎壞䟝賎。<q>(0)</q> 是無
.

make[2]: Entering directory '/webwml/chinese/devel/join'
Processing index.wml: [zh_CN]Malformed UTF-8 character: \xe9\x98\x0a
(unexpected non-continuation byte 0x0a, 2 bytes after start byte 0xe9;
need 3 bytes, got 2) in substitution (s///) at ../../bin/tocn.pl line
108, <> line 52.
, [zh_TW], [zh_HK].
copying index.zh-cn.html to ../../../../www/devel/join
copying index.zh-hk.html to ../../../../www/devel/join
copying index.zh-tw.html to ../../../../www/devel/join

make[1]: Entering directory '/webwml/chinese/international'
Processing index.wml: [zh_CN]Malformed UTF-8 character: \xe9\x98\x0a
(unexpected non-continuation byte 0x0a, 2 bytes after start byte 0xe9;
need 3 bytes, got 2) in substitution (s///) at ../bin/tocn.pl line 108,
<> line 89.
, [zh_TW]Invalid UTF8: …皋序
, [zh_HK]Invalid UTF8: …皋序
.

make[2]: Entering directory '/webwml/chinese/international/Chinese'

Processing thanks.wml: [zh_CN]Invalid UTF8: «™é»žçš„朋友
, [zh_TW]Invalid UTF8: «™é»žçš„朋友
, [zh_HK]Invalid UTF8: «™é»žçš„朋友
.

make[1]: Entering directory '/webwml/chinese/intro'
Processing about.wml: [zh_CN], [zh_TW], [zh_HK]panic: swash_fetch got
swatch of unexpected bit width, slen=512, needents=64 at ../bin/tohk.pl
line 131, <> line 95.
.

make -C legal install
make[1]: Entering directory '/webwml/chinese/legal'
Processing index.wml: [zh_CN]Malformed UTF-8 character: \xe9\x98\x0a
(unexpected non-continuation byte 0x0a, 2 bytes after start byte 0xe9;
need 3 bytes, got 2) in substitution (s///) at ../bin/tocn.pl line 108,
<> line 68.
, [zh_TW], [zh_HK].
copying index.zh-cn.html to ../../../www/legal
copying index.zh-hk.html to ../../../www/legal
copying index.zh-tw.html to ../../../www/legal

make[1]: Entering directory '/webwml/chinese/releases'

Processing proposed-updates.wml: [zh_CN],
[zh_TW]Invalid UTF8: ‰èƒœæœ€çµ‚到達 proposed-updates
, [zh_HK]Invalid UTF8: ‰èƒœæœ€çµ‚到達 proposed-updates
.

make[2]: Entering directory '/webwml/chinese/releases/hamm'
Processing HOWTO.upgrade.wml: [zh_CN], [zh_TW]Malformed UTF-8 character:
\xe5\x8c\x0a (unexpected non-continuation byte 0x0a, 2 bytes after start
byte 0xe5; need 3 bytes, got 2) in substitution (s///) at
../../bin/totw.pl line 111, <> line 71.
, [zh_HK].
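
The "need 3 bytes, got 2" wording in the messages above follows from
how a UTF-8 start byte announces the length of its sequence. A minimal
Python sketch (not part of webwml or wml; both function names are made
up for illustration) that reproduces those numbers for the
\xe9\x98\x0a case:

```python
def expected_utf8_length(start_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence needs, judged by its first byte."""
    if start_byte < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= start_byte <= 0xDF:
        return 2
    if 0xE0 <= start_byte <= 0xEF:
        return 3  # most CJK ideographs fall in this range
    if 0xF0 <= start_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{start_byte:02x} is not a valid UTF-8 start byte")


def bytes_present(data: bytes, start: int) -> int:
    """Count the start byte plus the continuation bytes (0x80-0xBF) after it."""
    need = expected_utf8_length(data[start])
    got = 1
    while (got < need and start + got < len(data)
           and 0x80 <= data[start + got] <= 0xBF):
        got += 1
    return got


sample = b"\xe9\x98\x0a"  # the bytes quoted in the tocn.pl message
need = expected_utf8_length(sample[0])
got = bytes_present(sample, 0)
print(f"need {need} bytes, got {got}")
```

The newline (0x0a) directly after 0x98 is what the message calls the
"unexpected non-continuation byte": the third byte of the character is
missing from the input that tocn.pl receives.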


Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Axel Beckert-8
Control: clone -1 -2
Control: reassign -2 wml 2.12.2~ds1-2
Control: retitle -2 wml: Regression in "htmlstrip -O2" (default) with Chinese language

Hi,

Boyuan Yang wrote:
> Thanks for raising this issue.

Thanks from me, too. I wasn't aware of such a regression, sorry.

> These build errors might have multiple causes,
> but I stripped the issue down to a (possible) regression of wml. Let's fix
> this issue first before talking about others.
>
> =======================================
> $ wml --version
> This is WML Version 2.12.2
> Copyright (c) 1996-2001 Ralf S. Engelschall.
> Copyright (c) 1999-2001 Denis Barbier.
>
> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> GNU General Public License for more details.
> $ cat /etc/issue
> Debian GNU/Linux bullseye/sid \n \l
>
> $ cat a.wml
> <p>
> 包
> </p>
> $ hexdump -C a.wml
> 00000000  3c 70 3e 0a e5 8c 85 0a  3c 2f 70 3e 0a           |<p>.....</p>.|
> 0000000d
> $ wml a.wml > test.txt
> $ cat test.txt
> <p>
> �
> </p>
> $ hexdump -C test.txt
> 00000000  3c 70 3e 0a e5 8c 0a 3c  2f 70 3e 0a              |<p>....</p>.|
> 0000000c
> $
[…]
> I am using Debian Unstable but similar things also happen in Buster.

Can confirm that this is a regression between Stretch and Buster. :-(

> The single character in the a.wml above is U+5305 [1], namely "CJK Unified
> Ideograph-5305", a commonly-used Chinese character. Its UTF-8 encoding is
> "0xE5 0x8C 0x85". However after wml transformation, only "0xE5 0x8C" was kept
> and the "0x85" was dropped. That's surely a regression.

Ack. Figured out that it's pass 8 of 9 passes in WML:

→ cat a.wml | wml -p1-8
<p>

</p>
→ cat a.wml | wml -p1-7
<p>

</p>
→ cat a.wml | wml -p1-7,9
<p>

</p>
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip



Pass 8 is htmlstrip, something similar to uglifyjs, but for HTML.

Since that pass should be only for delivery performance and disk space
reasons, it likely can be left out easily.

So I see multiple ways to more or less quickly fix this issue in the
Debian web:

* Always call wml with "-p1-7,9".
* Call wml with "-p1-7,9" if any of the affected languages is built.
* Add <nostrip>…</nostrip> containers in the header and footer
  templates for the affected languages.

To be more precise, it's the optimisation level 2 of htmlstrip:

→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 0

→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 1

→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 2



The man page says:

       Level 2:
           Good stripping: Same as level 1 plus compression of
           multiple whitespaces (more then one in sequence) to single
           whitespaces [txt,tag] and stripping of trailing whitespaces
           at the of of a line [txt,tag,pre].
           
           This level is the default because while providing good
           optimization the HTML markup is not destroyed and remains
           human readable.

So instead of skipping htmlstrip completely, "-O1" could be passed to
wml everywhere I suggested passing "-p1-7,9", as wml hands this option
on to htmlstrip:

→ cat a.wml | wml -O1
<p>

</p>

> I cc-ed the wml maintainer in Debian. Axel, is there any possibility to solve
> this regression in both Sid/Testing and Stable?

I think the above is a good first workaround on buster. With this
mail, I clone the bug report and will try to figure out what change in
htmlstrip caused the regression and/or how it can be fixed.

I currently have issues building more recent upstream versions of WML,
though, which is the reason why wml in Unstable hasn't seen an update
yet. A more recent version is in git, but IIRC there was another
release or two recently, which I haven't looked at yet.

                Regards, Axel
--
 ,''`.  |  Axel Beckert <[hidden email]>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE


Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Yao Wei (魏銘廷)
In reply to this post by Laura Arjona Reina-4
Package: www.debian.org
Followup-For: Bug #959474

Hi,

After a bit of investigation of the Perl source code (5.31.11, downloaded
from upstream) I found that it handles whitespace in a surprising way
when `feature unicode_strings` is turned on.  I am not a Perl person and
I haven't executed the source code yet, so my interpretation might be
wrong.

When `unicode_strings` is on, `in_uni_8_bit` should be true internally,
and in three places (pp.c:6040, pp.c:6076, pp.c:6114) `isSPACE_L1` is
called to check whether the character being examined is whitespace,
which includes checking whether the character is 0x85 or 0xA0
(handy.h:1611).  In the case of the character 包, the last byte of its
3-byte UTF-8 encoding is 0x85, hence the problem.
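
The hypothesis above can be reproduced outside Perl. The following
Python sketch is not the htmlstrip code; it only assumes that each byte
is handled as one Latin-1 character and that trailing whitespace is
stripped from every line (as the level 2 description in the man page
says). The names strip_trailing, L1_WHITESPACE and ASCII_WHITESPACE are
made up for illustration:

```python
# ASCII whitespace only, versus ASCII whitespace plus the two Latin-1
# characters named above: NEL (0x85) and NBSP (0xA0).
ASCII_WHITESPACE = " \t\n\r\f\v"
L1_WHITESPACE = ASCII_WHITESPACE + "\x85\xa0"


def strip_trailing(raw: bytes, whitespace: str) -> bytes:
    """Strip trailing whitespace per line, treating every byte as one character."""
    text = raw.decode("latin-1")  # one byte becomes exactly one code point
    lines = [line.rstrip(whitespace) for line in text.split("\n")]
    return "\n".join(lines).encode("latin-1")


source = "<p>\n包\n</p>\n".encode("utf-8")  # same content as a.wml in the report

broken = strip_trailing(source, L1_WHITESPACE)     # 0x85 counts as whitespace
intact = strip_trailing(source, ASCII_WHITESPACE)  # 0x85 is left alone

print(source.hex(" "))
print(broken.hex(" "))
```

With the Latin-1 whitespace set, the line holding 包 loses its final
0x85 byte, which matches the hexdump of test.txt in the original report;
with the ASCII-only set the bytes pass through unchanged.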

-- System Information:
Debian Release: bullseye/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.6.0-1-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled


Bug#959474: Follow-up fix for wml in Debian Stable?

Boyuan Yang-2
In reply to this post by Laura Arjona Reina-4
Hi Axel,

I just tested the new wml 2.12.2~ds1-3 on Chinese translations for website
(webwml). It looks like the previous bug has been properly fixed.

Since the webmaster team is trying to upgrade the machine from Debian 9 to
Debian 10, it should be better if we have this fix pushed into stable soon.
Can you make a stable update for package wml with this fix?

--
Thanks,
Boyuan Yang


Bug#959474: Bug#959761: Follow-up fix for wml in Debian Stable?

Axel Beckert-8
Hi Boyuan,

Boyuan Yang wrote:
> I just tested the new wml 2.12.2~ds1-3 on Chinese translations for website
> (webwml). It looks like the previous bug has been properly fixed.

Thanks a lot for testing and verifying!

> Since the webmaster team is trying to upgrade the machine from Debian 9 to
> Debian 10, it should be better if we have this fix pushed into stable soon.
> Can you make a stable update for package wml with this fix?

As mentioned on IRC (not sure if you're on #debian-www, probably not),
this is my plan.

I will have to wait until wml 2.12.2~ds1-3 migrates to testing, though.
That should happen within 2 or 3 days, once autopkgtest has been run and
passed.

Laura said on IRC, though, that the webmasters might not want to wait
until the next stable update.

But maybe I can get it to stable-proposed-updates soon and they can
use it from there, so that wouldn't cause much of a lag.

(While I was writing this mail, it was decided on #debian-www that
they will use one of the workarounds, likely the "-O1" one.)

                Regards, Axel
--
 ,''`.  |  Axel Beckert <[hidden email]>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE


Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Laura Arjona Reina-4
Hi all

As a workaround for the Debian website, until wml 2.12.2~ds1-3 or higher
arrives in stable, I have added the option "-O1" to the options passed
to wml for Chinese, in the /chinese/Make.lang file:


+# Add "-O1"  to wml to be passed to htmlstrip, to avoid malformed UTF-8
+# see bug #959474
+# This option needs to be kept in Chinese until wml 2.12.2~ds1-3 or higher
+# arrives to Debian stable
+
+WMLOPTIONSZH = -O1

 WMLOUTPUT = -o UNDEFuZH@uCNuCNHKuCNTW:$(*F).zh-cn.html.tmp@g+w \
        -o UNDEFuZH@uHKuCNHKuHKTWuTWHK:$(*F).zh-hk.html.tmp@g+w \
@@ -54,7 +60,7 @@ WMLPROLOG = --prolog=$(FORMAT_ZH)
 # Remove initial blank line due "[ZH::]" in $(TEMPLDIR)/common_tags.wml,
 # an unfortunate but necessary workaround of a bug in slice < 1.3.9
 WMLEPILOG = --epilog=$(STRIP_INITIAL_BLANK_LINE)
-WML = wml $(WMLOPTIONS) $(WMLOUTPUT) $(WMLPROLOG) $(WMLEPILOG)
+WML = wml $(WMLOPTIONS) $(WMLOPTIONSZH) $(WMLOUTPUT) $(WMLPROLOG) $(WMLEPILOG)

I have compared the results of builds in stretch and buster both with
and without the option, and there are no changes in stretch, and the
UTF-8 issues are fixed in buster with the option (by the way, thanks
Boyuan for the additional fixes you did to mitigate the error).
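
For anyone who wants to repeat this kind of check, one way to spot
truncated characters in the generated pages is to decode every output
file strictly as UTF-8. This is only a sketch and not the method used
for the comparison above; the function name and the "*.html" pattern
are assumptions:

```python
from pathlib import Path
from typing import List


def find_invalid_utf8(root: Path, pattern: str = "*.html") -> List[Path]:
    """Return the files under root matching pattern that are not valid UTF-8."""
    broken = []
    for path in sorted(root.rglob(pattern)):
        try:
            path.read_bytes().decode("utf-8")  # strict decoding is the default
        except UnicodeDecodeError:
            broken.append(path)
    return broken
```

Calling find_invalid_utf8(Path("www")) on a build tree (the directory
name is an example) would list the pages in which a multi-byte character
lost one of its bytes; an empty list means every file decoded cleanly.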

So, I think that Bug#959474 can be closed, but I'll leave it open until
we actually migrate to Buster and see the results in www.debian.org
"live" :-)

Thanks everybody for your work!

Kind regards,
--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona


Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Axel Beckert-8
Hi,

Laura Arjona Reina wrote:
> I have compared the results of builds in stretch and buster both with
> and without the option, and there are no changes in stretch, and the
> UTF-8 issues are fixed in buster with the option

Thanks for these tests.

> So, I think that Bug#959474 can be closed, but I'll leave it open until
> we actually migrate to Buster and see the results in www.debian.org
> "live" :-)

Just to be sure: I should still provide a stable update for buster,
right?

(Sorry, was a bit busy IRL and nearly forgot about this open "to do"
item. So thanks for the reminder.)

                Regards, Axel
--
 ,''`.  |  Axel Beckert <[hidden email]>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE


Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Laura Arjona Reina-4
Hi

On 7/6/20 at 16:02, Axel Beckert wrote:

> Just to be sure: I should still provide a stable update for buster,
> right?
>

I don't know if the type of bug qualifies for a stable update.

For www.debian.org, we'll be using the -O1 workaround for building the
Chinese pages, and that's about optimization, we don't lose any
functionality, so I think we can wait for bullseye.

Boyuan, please correct me if I am wrong...

Kind regards,
--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona


Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Boyuan Yang-2
On Sunday, 2020-06-07 at 21:23 +0200, Laura Arjona Reina wrote:
> Hi
>
> On 7/6/20 at 16:02, Axel Beckert wrote:
>
> > Just to be sure: I should still provide a stable update for buster,
> > right?
> >
>
> I don't know if the type of bug qualifies for a stable update.

If I were the maintainer, I would give it a try to make the stable
update. (Why not?)

> For www.debian.org, we'll be using the -O1 workaround for building the
> Chinese pages, and that's about optimization, we don't lose any
> functionality, so I think we can wait for bullseye.
>
> Boyuan, please correct me if I am wrong...

If we have the workaround applied, website building with Chinese
contents should not be an issue anymore.

--
Thanks,
Boyuan Yang