Ticket #2846 (closed enhancement: fixed)

Opened 13 months ago

Last modified 12 months ago

Convert common unicode punctuation to ascii.

Reported by: bmfrosty Owned by: john
Priority: minor Milestone:
Component: EPUB Output Version: trunk
Keywords: conversion punctuation Cc:

Description

It appears that some ebook reader hardware has trouble with some Unicode punctuation symbols, particularly left and right single and double quotes. There are other marks that may also be present that it would be logical to convert.

Cursory research found the following:

Smart Quotes:

E2 80 9C E2 80 9D

Apostrophes and single quotes as:

E2 80 98 E2 80 99

In ASCII, the double quote is 0x22 and the apostrophe is 0x27. A search and replace in the displayed text areas might be enough. It could possibly even be done globally, assuming that neither the HTML nor the CSS portions use any of those Unicode sequences.
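A minimal sketch of that search and replace in Python 3, assuming the text has already been decoded from UTF-8. The names `QUOTE_MAP` and `straighten_quotes` are made up for illustration and are not part of calibre.

```python
# Hypothetical helper, not calibre code: map the four "smart" quotes
# named in this ticket to their ASCII equivalents (0x22 and 0x27).
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # E2 80 9C, left double quotation mark
    "\u201d": '"',  # E2 80 9D, right double quotation mark
    "\u2018": "'",  # E2 80 98, left single quotation mark
    "\u2019": "'",  # E2 80 99, right single quotation mark
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)


# The byte sequences listed above decode to the expected code points.
assert bytes.fromhex("E2 80 9C").decode("utf-8") == "\u201c"
assert bytes.fromhex("E2 80 99").decode("utf-8") == "\u2019"

print(straighten_quotes("\u201cDon\u2019t panic,\u201d she said."))
```

Working on decoded text rather than raw bytes avoids accidentally matching part of a longer multi-byte sequence.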

It looks like a pretty good list can be found here:

 http://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128

Of additional note: E2 80 9B and E2 80 9F are the single and double high-reversed-9 quotation marks, and E2 80 90 through E2 80 95 are various types of hyphens and dashes. There are also various spaces in that chart, but I'm unsure if they're used much or at all.
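The extra marks could be folded in the same way. This is only a sketch: `EXTRA_MAP` and `fold_punctuation` are hypothetical names, and the choice to map every hyphen and dash to "-" and every fixed-width space to a plain space is an assumption, not something calibre is known to do.

```python
# Hypothetical extension, not calibre code.
# U+2010..U+2015 are the hyphens and dashes (E2 80 90 through E2 80 95).
EXTRA_MAP = {cp: "-" for cp in range(0x2010, 0x2016)}
EXTRA_MAP[0x201B] = "'"  # E2 80 9B, single high-reversed-9 quotation mark
EXTRA_MAP[0x201F] = '"'  # E2 80 9F, double high-reversed-9 quotation mark
# U+2000..U+200A are the fixed-width spaces in the same chart.
EXTRA_MAP.update({cp: " " for cp in range(0x2000, 0x200B)})


def fold_punctuation(text):
    """Replace the listed Unicode punctuation and spaces with ASCII."""
    return text.translate(EXTRA_MAP)


print(fold_punctuation("pages 10\u201312, a\u2009thin\u2009space"))
```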

A little bit of further googling found this article:

 http://pivotallabs.com/users/cheister/blog/articles/603-unicode-transliteration-to-ascii

which linked to a project called 'unidecode' which can be found here:

 http://rubyforge.org/projects/unidecode/

Change History

  Changed 13 months ago by bmfrosty

Found the original version of unidecode in Perl, along with an article about it by the author. May be useful:

 http://interglacial.com/~sburke/tpj/as_html/tpj22.html

  Changed 12 months ago by john

  • owner changed from kovidgoyal to john
  • status changed from new to accepted

Thanks for all those links. This is something I've been thinking about for quite some time. I've ported the unidecode module to Python and I plan to integrate it into calibre after the 0.6 release.

  Changed 12 months ago by john

  • status changed from accepted to closed
  • resolution set to fixed

0.6 is out and I've committed my changes to accommodate this feature. It should show up in 0.6.1. To enable it, use --asciiize on the command line, or use the "Transliterate unicode characters to ASCII." option in the look and feel section of the conversion dialog in the GUI.

It does a bit more than just convert common punctuation: it replaces all non-ASCII characters. So "Михаил Горбачёв" will become "Mikhail Gorbachiov".

  Changed 12 months ago by kovidgoyal

@john: What's the suitability of this for ASCIIizing file names to address for example #2650 ?

  Changed 12 months ago by john

Replying to kovidgoyal:

@john: What's the suitability of this for ASCIIizing file names to address for example #2650 ?

Perfect. Here is a basic example of how and what kind of output it produces.

>>> from calibre.ebooks.unidecode.unidecoder import Unidecoder
>>> udc = Unidecoder()
>>> print udc.decode('Призрак Александра Вольфа, Призрак Александра Вольфа, Газданов; Гайто')
Prizrak Alieksandra Vol'fa, Prizrak Alieksandra Vol'fa, Gazdanov; Gaito

If it runs into a character it can't convert, it will replace it with a ?. The only thing you need to check for is characters in the output that are invalid in a file name, e.g. strip /*[]+... Running the output through sanitize_file_name would take care of it.
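For illustration, the kind of cleanup meant here might look like the sketch below. `strip_unsafe` is a hypothetical stand-in, not calibre's sanitize_file_name, and the character set and the "_" replacement are assumptions.

```python
import re

# Hypothetical stand-in for illustration only; calibre's own
# sanitize_file_name may use a different character set and rules.
_UNSAFE = re.compile(r'[\\/:*?"<>|\[\]+]')


def strip_unsafe(name, replacement="_"):
    """Replace characters that are commonly invalid in file names."""
    cleaned = _UNSAFE.sub(replacement, name)
    # Trailing dots and spaces are problematic on Windows.
    return cleaned.rstrip(". ")


print(strip_unsafe("Prizrak Alieksandra Vol'fa / Gazdanov; Gaito?"))
```

Since "?" is in the unsafe set, any character the transliteration could not convert is also replaced here.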

  Changed 12 months ago by kovidgoyal

Implemented in branch trunk. The fix will be in the next release.
