We've had a report that Korean display got a lot worse in the
snapshots between r5002 and r5003 <000001c513c9$e63ae1e0$aa000059@ktd>.
Quoth Simon:
Looking at the code, I think I can see why this is happening. This
is to do with RDB's idea that when the user selects `use font
encoding' and a font with a DBCS encoding, the terminal code should
simply store the individual bytes in individual character cells and
rely on do_text() being passed a string of these so that TextOut()
can reconstitute pairs of DBCS bytes into double-width characters.
As far as I can tell, terminal.c does not mark the first byte of a
DBCS character stored in this way. Therefore, the mechanism is
fundamentally dependent on a do_text() run happening to begin at
the correct point mod 2! Hence the comment in the mail referenced
above, which said that there was already some breakage when the
cursor moved over a double-byte character - the half-character under
the cursor cannot be properly redrawn. Owing to `font-overflow',
though, when you move the cursor over a double-byte character we now
redraw a lot of text to the right of that as well, and if the cursor
is on the first half of the character then this is bound to be
incorrect mod 2; so the problem shows up a lot more readily. I'd bet
that the same breakage could have been seen in previous versions if
the window was covered and re-exposed when the cursor was in a
problem position.
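
The mod-2 dependence can be illustrated with a small sketch. This is not PuTTY code: pair_from() is a hypothetical stand-in for what a DBCS-aware text call does when handed a run of single-byte cells, and the byte values are chosen only for illustration. It assumes an encoding in which lead and trail bytes share the same value range, so nothing but the starting offset says which byte is which.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a renderer that glues consecutive bytes into
 * double-byte characters, starting at cell index `start`. Each pair is
 * written to `out` as (first << 8) | second; the number of pairs is
 * returned. A trailing unpaired byte is simply left over.
 */
size_t pair_from(const unsigned char *cells, size_t ncells,
                 size_t start, unsigned *out)
{
    size_t n = 0;
    size_t i;

    assert(start <= ncells);
    for (i = start; i + 1 < ncells; i += 2)
        out[n++] = ((unsigned)cells[i] << 8) | cells[i + 1];
    return n;
}
```

With the four cells C7 D1 B1 DB, a run starting at cell 0 yields the intended pairs C7D1 and B1DB, whereas a run starting at cell 1 yields the single pair D1B1, which is the trail byte of one character glued to the lead byte of the next.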
A real fix for this would involve implementing proper DBCS support,
by detecting DBCS lead bytes in the terminal.c input data stream and
storing both bytes in the same character cell using the existing
UCSWIDE mechanism. I have occasionally wondered about doing this: I
envisage that we would co-opt the top half of the unsigned long
space (never used by any flavour of Unicode/UCS ever) to provide
more than enough fake character encodings for the purpose.
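
A minimal sketch of that idea follows, under stated assumptions. None of these names come from terminal.c: DBCS_FLAG, WIDE_PLACEHOLDER (standing in for the role UCSWIDE plays), is_lead_byte() and the EUC-style lead-byte range are all hypothetical, and a real implementation would also have to carry a pending lead byte across input buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker in the top half of a 32-bit value space. */
#define DBCS_FLAG        0x80000000UL
/* Hypothetical stand-in for the right-hand-cell marker (UCSWIDE's role). */
#define WIDE_PLACEHOLDER 0x7FFFFFFFUL

/* Assumption: an EUC-style encoding whose lead bytes are 0xA1..0xFE. */
static int is_lead_byte(unsigned char c)
{
    return c >= 0xA1 && c <= 0xFE;
}

/*
 * Scan an input buffer and store each double-byte character in ONE
 * cell, followed by a placeholder cell for its right-hand half.
 * `cells` must have room for `len` entries. Returns cells written.
 */
size_t store_dbcs(const unsigned char *in, size_t len, unsigned long *cells)
{
    size_t n = 0, i = 0;

    assert(cells != NULL);
    while (i < len) {
        if (is_lead_byte(in[i]) && i + 1 < len) {
            cells[n++] = DBCS_FLAG | ((unsigned long)in[i] << 8) | in[i + 1];
            cells[n++] = WIDE_PLACEHOLDER;
            i += 2;
        } else {
            /* Single-byte character (or a lone lead byte at buffer end). */
            cells[n++] = in[i++];
        }
    }
    return n;
}

/* Find the cell where the character covering column x begins. */
size_t cell_start(const unsigned long *cells, size_t x)
{
    return (x > 0 && cells[x] == WIDE_PLACEHOLDER) ? x - 1 : x;
}
```

Because the character boundary is now recorded in the cells themselves, a redraw that begins at either half of a double-width character can step back to its first cell rather than guessing from the run offset.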
Of course, if we were going to support DBCSes in terminal.c it would
also be good to be able to support them properly, by translating
them to Unicode on input.
Summary: I think this has always been broken, and now it's merely
more obviously broken. I regret the effect on CJK users who had
found the previous behaviour worked just about well enough, but I
don't think a hurried fix is in the general interest.
UTF-8 mode should work reasonably well, so the workaround is to use
UTF-8 where possible (perhaps via something such as luit or screen).