Ticket #4501 (closed defect: fixed)
Unicode escaped RTF to XML problem
| Reported by: | sanyi | Owned by: | kovidgoyal |
|---|---|---|---|
| Priority: | minor | Milestone: | |
| Component: | EPUB Output | Version: | trunk |
| Keywords: | rtf, unicode | Cc: | |
Description
I see that rtf2xml is an external project, but I prefer to open a ticket here; I hope that is OK. I am also not sure I chose the best 'Component' for this ticket.
The defect: suppose the RTF document contains Unicode escape sequences like \u538. This case is handled well by rtf2xml.
However as you can read here: http://en.wikipedia.org/wiki/Rich_Text_Format "For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead."
So the problem is that \u538? is decoded to the correct character but is still followed by the ? character. The converter should skip the ?, since it is not meant to be shown.
rtf2xml has a script named correct_unicode.py that seems to correct something similar. I would gladly modify it, but I fear the overhead of putting together a whole development environment (I am not experienced enough with Unix and Python modules to do this quickly).
Attachments
Change History
comment:2 Changed 8 months ago by sanyi
I tried; I may have missed something, but things are not really working. So I started to play around with just the module itself. The code is rather complex; I would rather rewrite it than understand it, lol.
A very ugly and ignorant solution is:
- in tokenize.py, line 88, change self.utf_exp = re.compile(r"\\u(-?\d{3,6})") to self.utf_exp = re.compile(r"\\u(-?\d{3,6})\??"); notice the \?? outside of regex group 1.
It is ugly because in more usual situations the question mark is a real, normal character. But it will help in at least some situations (e.g. WordPad on Windows 7 writes some characters like this).
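To see why this regex change helps and where it falls short, here is a minimal sketch. Only the pattern is taken from the comment above; the decode_escapes helper is hypothetical and is not part of rtf2xml.

```python
import re

# Pattern from the comment above, with the optional "?" outside group 1.
utf_exp = re.compile(r"\\u(-?\d{3,6})\??")


def decode_escapes(text):
    """Replace each RTF Unicode escape, plus one trailing '?', with its character."""
    def repl(match):
        code = int(match.group(1))
        if code < 0:
            # RTF stores N as a signed 16-bit number, so high values appear negative.
            code += 65536
        return chr(code)
    return utf_exp.sub(repl, text)


print(decode_escapes(r"a\u539?b"))   # the fallback "?" is dropped
print(decode_escapes(r"\u539??"))    # only one "?" is dropped, the other survives
```

The drawback is visible in the pattern itself: it drops a "?" after every escape, so a question mark that is real text (for example in a file that declares \uc0) would be lost as well.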
The correct way would be a dynamic tokenizer algorithm that uses the \ucN indicator, where N is the number of replacement characters that follow each \uN Unicode escape (as one can read in Word2007RTFSpec9.doc, page 15).
There are some more errors with the code pages; I can write up some of my experience here if that is helpful.
comment:3 Changed 8 months ago by kovidgoyal
One possible solution is to preprocess the rtf text before passing it to rtf2xml to remove the trailing characters. If you can write a python module that does that, I will add it to calibre.
comment:4 Changed 8 months ago by sanyi
I am working on it. To be 100% correct I have to implement some basic RTF parsing. I hope I am able to deliver it this weekend.
comment:5 Changed 8 months ago by sanyi
I have attached the basic parser, which modifies the Unicode escape syntax to keep both rtf2xml and compliant RTF readers happy. At this point the parser is not fault tolerant; it will raise exceptions for non-compliant RTF files. I tested the code but, as always, more testing is needed.
comment:6 Changed 8 months ago by kovidgoyal
I'll merge it before the next release. Are you ok with licensing it as GPL3?
comment:7 Changed 8 months ago by kovidgoyal
- Status changed from new to closed
- Resolution set to fixed
Fixed in branch trunk. The fix will be in the next release.
comment:9 Changed 7 months ago by sanyi
- Status changed from closed to reopened
- Resolution fixed deleted
Hello, there is one more problem in the rtf2xml code regarding Unicode support. It seems the developer of this script did not fully understand the \uc token and its use in conjunction with the \uxxx Unicode tokens. We added the RTF preprocessing to remove the trailing (equivalent) characters after each Unicode character because we presumed rtf2xml does not do this. The actual method was to remove all trailing characters, taking the current \uc parameter into account, and then set \uc0 globally for the RTF file, meaning no trailing characters after the Unicode ones.
However, it seems the developer did at some point partially notice this problem. In his case the Unicode character was followed not by an ASCII representation but by an 8-bit character encoded in hexadecimal using the \'xx token. His fix was to remove any number of 8-bit characters after the Unicode character, regardless of the \uc setting. This is completely wrong.
Suppose we have \uc0 (which is the case now, after the preprocessing), meaning no trailing characters after the Unicode character (the preprocessing already removed them), and we have something like \u537\'e3 (a Unicode character and an 8-bit character). This results in the removal of the 8-bit character, even though it has nothing to do with the preceding Unicode character.
In my opinion correct_unicode.py and its calls from ParseRtf.py should be removed from calibre, because there is no need for such a step, which also introduces a bug. As for the use of rtf2xml in other projects without this preprocessing of the files, correct_unicode.py has to be completely rethought.
Regards, Sanyi
comment:10 Changed 7 months ago by kovidgoyal
It should be easy to fix the removal of the extra character in rtf2xml. Can you do it, or should I?
comment:11 Changed 7 months ago by sanyi
I am sorry, I think I was not clear enough. We do not have to remove any other characters, since rtfPreprocess.py already does this in a manner which keeps both rtf2xml and other readers happy, because the RTF remains specification compliant.
At first we assumed rtf2xml does not do something similar. But it does, and in a very bad way. rtf2xml removes trailing characters only if they are 8-bit characters (represented by the \'xx token), and it removes one or more of them regardless of the \uc setting (please read the comment in correct_unicode.py).
It should delete any kind of trailing characters or tokens, not just the 8-bit tokens, while taking into account the \uc setting, which tells us how many trailing characters there actually are.
But this is already done by rtfPreprocess.py, so at this point the only thing needed is to deactivate that erroneous step (correct_unicode) in rtf2xml.
I hope I was clear this time :)
comment:12 Changed 7 months ago by sanyi
Or maybe you understood it from the start and I was the one who misunderstood your reply. My English is bad, sorry. Please fix it on your side; I still don't have a good setup to test modifications, maybe some day...
comment:13 Changed 7 months ago by kovidgoyal
I had a look at the code in rtf2xml and the regular expression used to match \u escapes is
r"\\u(-?\d{3,6}) {0,1}"
As far as I can see, that won't match anything except a \u escape and a trailing space character. I've modified it to remove the match on the trailing space character, which seems to be a bug to me. Can you attach a test RTF showing this behavior?
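A quick way to check what that pattern consumes is to run it on a few inputs. This is only an illustrative sketch; the remainder helper and the sample strings are made up for the demonstration.

```python
import re

# Pattern quoted in the comment above.
utf_exp = re.compile(r"\\u(-?\d{3,6}) {0,1}")


def remainder(sample):
    """Return what is left of sample after the pattern matches at its start."""
    match = utf_exp.match(sample)
    return sample[match.end():]


for sample in (r"\u539 x", r"\u539?x", r"\u539\'e2"):
    print(repr(sample), "leaves", repr(remainder(sample)))
```

Only the space is consumed along with the escape; a "?" fallback or a \'xx token after the escape is left in place by this pattern.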
comment:14 Changed 7 months ago by sanyi
OK, at this point I am again uncertain whether you understand what I am saying.
OK, here is a short RTF file:
{\rtf1\ansi\ansicpg1250\deff0\deflang1048{\fonttbl{\f0\fnil\fcharset238{\*\fname Arial;}Arial CE;}{\f1\fnil\fcharset0 Calibri;}}
{\*\generator Msftedit 5.41.21.2509;}\viewkind4\uc1\pard\sa200\sl276\slmult1\f0\fs22\'e3\u539?\'e2\'e3\u539?\'e2\u539?\'e2\u539?\'e2\u539?\'e2\u539?\u539?\lang9\f1\par
}
As you can see, we have the \uc1 tag, which tells us how many replacement characters follow each Unicode \uddd token (character); in our case it is one character, and it is always the question mark. We also have a second special character notation, the 8-bit character, in our case \'e3 or \'e2.
Before I wrote the RTF preprocessor, rtf2xml converted the Unicode tokens and the 8-bit characters to their HTML equivalents, but the question mark was never removed.
So my correction was to delete any trailing characters in accordance with the \ucx setting, reset \ucx to \uc0 (no trailing characters after Unicode characters) for the whole document, and put a space after each \uddd token; this is the token terminator and it is handled well by rtf2xml. So we have no more trailing characters and we remain specification compliant. The result looks like:
{\rtf1\uc0\ansi\ansicpg1250\deff0\deflang1048{\fonttbl{\f0\fnil\fcharset238{\*\fname Arial;}Arial CE;}{\f1\fnil\fcharset0 Calibri;}}
{\*\generator Msftedit 5.41.21.2509;}\viewkind4\pard\sa200\sl276\slmult1\f0\fs22\'e3\u539 \'e2\'e3\u539 \'e2\u539 \'e2\u539 \'e2\u539 \'e2\u539 \u539 \lang9\f1\par
}
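The preprocessing steps described above can be sketched roughly as follows. This is not the attached rtfPreprocess.py; it is a simplified illustration with hypothetical names (TOKEN, CONTROL, preprocess). It assumes that \uc is scoped to the enclosing group, that every skipped token counts as one fallback character, and it ignores binary data and malformed input.

```python
import re

# One token: \'xx hex escape, control word (with optional numeric parameter
# and delimiting space), control symbol, brace, or a single plain character.
TOKEN = re.compile(
    r"\\'[0-9a-fA-F]{2}|\\[a-zA-Z]+(?:-?\d+)? ?|\\[^a-zA-Z]|[{}]|[^\\{}]"
)
CONTROL = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")


def preprocess(rtf):
    """Drop the fallback tokens after each Unicode escape and declare uc0 once."""
    out = []
    uc_stack = [1]  # \uc defaults to 1; one entry per open group (assumption)
    to_skip = 0     # fallback tokens still to drop after the last \uN
    for match in TOKEN.finditer(rtf):
        tok = match.group(0)
        if tok == "{":
            uc_stack.append(uc_stack[-1])
            to_skip = 0
            out.append(tok)
            continue
        if tok == "}":
            if len(uc_stack) > 1:
                uc_stack.pop()
            to_skip = 0
            out.append(tok)
            continue
        if to_skip > 0:
            to_skip -= 1  # each skipped token counts as one fallback character
            continue
        word = CONTROL.fullmatch(tok)
        name, param = word.groups() if word else (None, None)
        if name == "uc" and param is not None:
            uc_stack[-1] = int(param)
            # The token is dropped; if it carried the delimiting space, hand it
            # to a preceding control word so following text does not fuse with it.
            if tok.endswith(" ") and out and CONTROL.fullmatch(out[-1]):
                out[-1] = out[-1].rstrip(" ") + " "
        elif name == "u" and param is not None:
            out.append("\\u" + param + " ")  # the space terminates the token
            to_skip = uc_stack[-1]
        elif name == "rtf":
            bare = tok.rstrip(" ")
            out.extend([bare, "\\uc0" + tok[len(bare):]])
        else:
            out.append(tok)
    return "".join(out)


print(preprocess(r"{\rtf1\ansi\uc1\pard\'e3\u539?\'e2\u539?\u539?\lang9\par}"))
```

Run on the body of the sample file above, the sketch drops each "?" fallback, keeps the \'e3 and \'e2 tokens, removes \uc1, and places a single \uc0 right after \rtf1.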
Up to this point everything was perfect. But please read the comment in correct_unicode.py (a file from the rtf2xml project):
"""
corrects sequences such as \u201c\'F0\'BE
Where \'F0\'BE has to be eliminated.
"""
\'F0\'BE are tokens representing 8-bit characters and should be removed only if the \ucx setting tells us so; in the preprocessed RTF, \uc0 tells us not to remove anything anymore!
So, as an example, \u539 \'e2\'e3\u539 will result in \u539 \u539, so we just ate two characters...
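The effect can be reproduced with a small stand-in. The code below is not the real correct_unicode.py; it is a hypothetical simulation of the behaviour its docstring describes, namely dropping every \'xx token that directly follows a \uN escape.

```python
import re

# Hypothetical stand-in: a \uN escape (with optional delimiting space)
# followed by one or more \'xx tokens, which are all discarded.
eat_hex = re.compile(r"(\\u-?\d{3,6} ?)(?:\\'[0-9a-fA-F]{2})+")


def simulate_correct_unicode(text):
    """Keep the Unicode escape and drop the hex tokens that follow it."""
    return eat_hex.sub(r"\1", text)


print(simulate_correct_unicode(r"\u539 \'e2\'e3\u539 "))
```

On preprocessed input, where \uc0 says nothing follows the escape, the two 8-bit characters are real text and are lost.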
What I suggested is this: correct_unicode.py was intended to handle this problem, but it was written without a complete understanding, probably based on some particular case and not on the M$ documentation, so it is completely bogus; and with my rtfPreprocess.2.py in place it is also completely useless.
The simplest removal is, I think, to comment out the following part in ParseRtf.py:
correct_uni_obj = rtf2xml.correct_unicode.CorrectUnicode(
    in_file = self.__temp_file,
    bug_handler = RtfInvalidCodeException,
    copy = self.__copy,
    run_level = self.__run_level,
    exception_handler = InvalidRtfException,
)
correct_uni_obj.correct_unicode()
self.__bracket_match('correct_unicode_info')
comment:15 Changed 7 months ago by kovidgoyal
- Status changed from reopened to closed
- Resolution set to fixed
The output from converting your attached RTF example is the same, with or without correct_unicode, but I agree that it seems superfluous, so I have removed it.
comment:16 Changed 7 months ago by sanyi
OK, thanks! One more step towards a good rtf2xml :) There are some problems converting the 8-bit characters; this does not work well with all code pages. At this point I manually replace these tokens with UTF characters. I will think about better options once I have some spare time, and I will let you know. I also have to study this multiple-convention madness some more...
The output was the same because you did not apply the preprocessing, so the Unicode character was followed by the question mark, not by 8-bit characters.


You don't need to set up a development environment to hack on calibre. calibre acts as its own development environment. Just follow the instructions in the User Manual on setting up a development environment. Feel free to ask if you have questions.