Ticket #4501 (closed defect: fixed)
Unicode escaped RTF to XML problem
| Reported by: | sanyi | Owned by: | kovidgoyal |
|---|---|---|---|
| Priority: | minor | Milestone: | |
| Component: | EPUB Output | Version: | trunk |
| Keywords: | rtf, unicode | Cc: |
Description
I saw rtf2xml is an external project but I prefer to open here a ticket, I hope it is O.K. Also I am not sure if I chose the best 'Component' for this ticket.
The defect: Suppose the RTF document contains Unicode escaped sequences like \u538, this case is handled well by rtf2html.
However as you can read here: ?http://en.wikipedia.org/wiki/Rich_Text_Format "For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead."
So the problem is \u538? will be decoded in the respective character but followed by the ? character. The converter should skip the ? character since is not meant to be shown.
In the rtf2html is a script named correct_unicode.py this one seems to do/correct something similar. I would gladly modify it but I fear from the overhead of putting together all the development environment (I am not enough experienced with unix and python modules to do this in short time).

