Ticket #4721 (closed defect: fixed)
Bulk import stops because of utf-8 errors
| Reported by: | EgnaledKnarf | Owned by: | kovidgoyal |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | Default | Version: | trunk |
| Keywords: | bulk import | Cc: | |
Description
I cannot get Calibre to import my book directories in bulk because it cancels the import as soon as it hits a filename it cannot interpret as utf-8.
Correct behaviour would be to either:
- skip the file and put its name in a list for the user to peruse
- ask the user what to do with this file (use different encoding, new filename, skip file, etc)
- replace the uninterpretable characters with something innocuous
The real error here is that Calibre insists on interpreting all filenames as utf-8, even when they are clearly in Latin-1 or some other encoding. The real solution therefore would be to add some intelligence to the name mangling code to check for possible file name encodings and default to ASCII with placeholders if the encoding cannot be determined.
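A minimal sketch of that idea in Python 3, not Calibre's actual code: `decode_filename` and its `candidates` parameter are hypothetical names, and the candidate list is only an assumption about which encodings are worth trying. Trying encodings in order is a heuristic, since a byte string can be valid in several encodings at once.

```python
def decode_filename(raw, candidates=("utf-8", "cp1252")):
    """Decode a raw filename (bytes) and report which encoding was used.

    Hypothetical helper: tries each candidate encoding in order and falls
    back to ASCII with U+FFFD placeholders when none of them fits.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    # Last resort: keep the ASCII bytes, mark every other byte with U+FFFD
    return raw.decode("ascii", errors="replace"), "ascii"
```

Latin-1 is left out of the candidate list on purpose: it maps every possible byte to a character, so it never fails and the placeholder fallback would never be reached.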
An example of this type of error looks like this:
ERROR: Path error (on behemoth)
The specified directory could not be processed
details:
('utf8', '/path/Name_of_Author/S\xf8k_etter_b\xf8ker.txt', 37, 42, 'unsupported Unicode code range')
That looks like Latin-1 or Windows-1252 to me. It is clearly not utf-8 (which would look like 'S\xc3\xb8k_etter_b\xc3\xb8ker.txt').
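The byte-level claim can be checked in a few lines of Python 3. This is only an illustration using the base name from the error report; `is_valid_utf8` is a hypothetical helper, not something Calibre provides.

```python
# The file name from the error report, without the directory part
raw = b"S\xf8k_etter_b\xf8ker.txt"


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


valid = is_valid_utf8(raw)      # False: 0xf8 cannot start a UTF-8 sequence
name = raw.decode("latin-1")    # 'Søk_etter_bøker.txt'
as_utf8 = name.encode("utf-8")  # b'S\xc3\xb8k_etter_b\xc3\xb8ker.txt'
```

In both Latin-1 and Windows-1252 the byte 0xf8 is the letter ø, which UTF-8 encodes as the two bytes 0xc3 0xb8.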
Instead of dropping out immediately it could try to:
- determine the character set http://www.feedparser.org/docs/character-encoding.html
- replace the unknown characters with something else, e.g. 'S#k_etter_b#ker.txt'
- skip them altogether, e.g. 'Sk_etter_bker.txt'
- ask the user
- etc...
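A rough sketch of how the replace, drop, and skip-the-file options could keep a bulk import going, written for Python 3. The function `import_names` and the `on_error` option names are made up for illustration and are not part of Calibre; only the `errors="replace"` and `errors="ignore"` handlers are standard Python.

```python
def import_names(raw_names, on_error="skip"):
    """Decode raw filenames (bytes) as UTF-8 without aborting the batch.

    on_error (hypothetical option names):
      "skip"    - leave the file out and record it for the user to review
      "replace" - put '#' in place of every undecodable byte
      "drop"    - remove undecodable bytes from the name
    Returns (decoded_names, skipped_raw_names).
    """
    if on_error not in ("skip", "replace", "drop"):
        raise ValueError("unknown on_error mode: %r" % (on_error,))
    imported, skipped = [], []
    for raw in raw_names:
        try:
            imported.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            if on_error == "skip":
                skipped.append(raw)
            elif on_error == "replace":
                # errors="replace" yields U+FFFD, which is swapped for '#'
                text = raw.decode("utf-8", errors="replace")
                imported.append(text.replace("\ufffd", "#"))
            else:  # "drop"
                imported.append(raw.decode("utf-8", errors="ignore"))
    return imported, skipped
```

With `on_error="skip"` the second return value is the list of names the user would be shown after the import finishes, so one bad filename no longer stops the whole run.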

While I can probably fix this, you should be aware that having files in different character encodings on the same filesystem is a bug, indicating either a misconfigured system or misbehaving applications.