Friday, January 18, 2008

Converting Project Gutenberg books for the SONY Reader

UPDATE: I now have a proxy for the Project Gutenberg web site that converts the books on the fly, here: http://1-800-magic.blogspot.com/2008/01/gutenberg-for-sony-pre-alpha.html


I am in the process of writing a proxy web site that would be a projection of gutenberg.org, but would add an option to download each book in LRF format (which I would construct on the fly from the text files).

Today I finished the proxy part and started experimenting with the converter. Luckily, someone has already written a program to convert text into LRF, which is available here: http://www.sven.de/librie_files/makelrf3.zip.

So all I need to do is preprocess the text file to remove unnecessary line breaks

(otherwise processed book ends up having a
jagged appearance
on the smaller screen, because they break
the line both at
the screen end, and at the end of line).
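
The unwrapping step can be sketched in a few lines. This is a minimal Python 3 sketch under stated assumptions, not the exact logic of the script in this post: the name `unwrap` and the `short_line` parameter are made up for illustration, blank lines are assumed to separate paragraphs, and any line shorter than 50 characters is assumed to be verse, a heading, or the end of a paragraph, so it keeps its own line break. Unlike the script, this sketch also collapses runs of blank lines into one.

```python
def unwrap(text, short_line=50):
    """Join hard-wrapped lines into single-line paragraphs.

    Hypothetical helper: blank lines separate paragraphs, and a line
    shorter than `short_line` characters ends the current output line.
    """
    out = []      # finished output lines
    current = []  # pieces of the output line being assembled
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            # Blank line: finish the pending paragraph, if any.
            if current:
                out.append(' '.join(current))
                current = []
            # Keep at most one blank line between paragraphs.
            if out and out[-1] != '':
                out.append('')
            continue
        current.append(line)
        if len(line) < short_line:
            # Short line: probably verse or a paragraph end, so break here.
            out.append(' '.join(current))
            current = []
    if current:
        out.append(' '.join(current))
    return '\n'.join(out)
```

Long lines are glued together with single spaces, so the Reader can wrap them at its own screen width, while short lines such as poetry stay one per line.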


The simplest way to experiment is to just write a script, and the simplest way to write a script is to use Python. So here it goes; you can copy it from here. It assumes that all the makelrf files are in c:\bin, and the books are written to c:\books.

The script takes the book either as a number (140) or as a full URL to the text file (http://www.gutenberg.org/files/140/140.txt).
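
The number-or-URL rule can be isolated into a small function. This is a Python 3 sketch; the name `book_url` is hypothetical, and accepting `https://` as well as `http://` is my assumption rather than something the script does.

```python
def book_url(arg):
    """Return the text-file URL for a book number, or pass a URL through.

    Hypothetical helper; the URL pattern is the one used in this post.
    """
    if arg.startswith(('http://', 'https://')):
        return arg
    return 'http://www.gutenberg.org/files/%s/%s.txt' % (arg, arg)
```

For example, `book_url('140')` gives `http://www.gutenberg.org/files/140/140.txt`, and a full URL comes back unchanged.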


import os
import sys
import urllib
import tempfile

DESCR_TEXT = 'The Project Gutenberg EBook of '
AUTHOR_TEXT = 'Author: '
AUTHOR_LEN = len(AUTHOR_TEXT)
TITLE_TEXT = 'Title: '
TITLE_LEN = len(TITLE_TEXT)

def main(argv):
    if len(argv) != 2:
        print 'Usage: gutenbergtolrf.py number or URL'
        print 'The output goes into c:\\books'
        return

    # Accept either a book number or a full URL
    url = argv[1]
    if not url.startswith('http://'):
        url = ('http://www.gutenberg.org/files/%s/%s.txt'
               % (argv[1], argv[1]))

    (fd, temp_file_name) = tempfile.mkstemp(
        suffix = '.txt', text = True)

    url_file = urllib.urlopen(url)

    title = None
    author = None
    description = None

    # Read URL, and convert everything into
    # single-line paragraphs. Also parse out
    # title, author and description
    first_line = True
    for l in url_file:
        # Strip the line ending, whether CRLF or LF
        l = l.rstrip('\r\n')
        if l:
            if not description and l.startswith(DESCR_TEXT):
                description = l

            if not author and l.startswith(AUTHOR_TEXT):
                author = l[AUTHOR_LEN:]
            if not title and l.startswith(TITLE_TEXT):
                title = l[TITLE_LEN:]

            # Join this line to the previous one with a space
            if first_line:
                first_line = False
            else:
                os.write(fd, ' ')

            os.write(fd, l)

            # This could be a poetry stanza,
            # treat short lines differently
            if len(l) < 50:
                os.write(fd, '\r\n')
                first_line = True
        else:
            # An empty line ends the paragraph
            if first_line:
                os.write(fd, '\r\n')
            else:
                os.write(fd, '\r\n\r\n')
            first_line = True

    os.close(fd)

    if not (author and title and description):
        print 'Could not parse the file!'
        os.remove(temp_file_name)
        return

    os.chdir('c:\\bin')
    target = 'c:\\books\\%s.lrf' % title

    if os.path.exists(target):
        os.remove(target)

    print temp_file_name
    print 'Title: ' + title
    print 'Author: ' + author
    print 'Description: ' + description
    print 'Converting to LRF: ' + target

    # Values that may contain spaces are wrapped
    # in double quotes by hand
    os.spawnv(os.P_WAIT, 'c:\\bin\\makelrf.exe',
              ['c:\\bin\\makelrf.exe',
               '-d', '"%s"' % description,
               '-a', '"%s"' % author,
               '-t', '"%s"' % title,
               '-o', '"%s"' % target,
               temp_file_name])

    os.remove(temp_file_name)


if __name__ == '__main__':
    main(sys.argv)
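
One thing the script does not handle: it uses the book title directly as the output file name, and a title may contain characters that Windows does not allow in file names (such as `:` or `?`). A possible fix, sketched here in Python 3, is to clean the title first. The helper `safe_filename`, its `replacement` parameter, and the `'untitled'` fallback are all assumptions of mine, not part of the script or of makelrf.

```python
import re

def safe_filename(title, replacement='_'):
    """Make a book title usable as a Windows file name (hypothetical helper)."""
    # Replace characters Windows rejects in file names, plus control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.strip().rstrip('. ')
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or 'untitled'
```

With this in place, the target path would be built from `safe_filename(title)` instead of `title`, while the unmodified title is still passed to makelrf with `-t`.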

2 comments:

Alex said...

Hi Sergey,

I invite you to join our Sony Reader community where you can find plenty of additional information regarding how to convert content to LRF.

Keep up the great work!

Cheers,
Alex

Igor Skochinsky said...

You should consider using libprs500 for LRF generation, it's much much more powerful than makelrf. It's in Python to boot :)
https://libprs500.kovidgoyal.net/