1-800-MAGIC: Converting Project Gutenberg books to SONY Reader

Friday, January 18, 2008

Converting Project Gutenberg books to SONY Reader

UPDATE: I now have a proxy for the Project Gutenberg web site that converts the books on the fly, here: http://1-800-magic.blogspot.com/2008/01/gutenberg-for-sony-pre-alpha.html

I am in the process of writing a proxy web site that would be a projection of gutenberg.org, but would add an option to download each book in LRF format (which I would construct on the fly from the text files).

Today I finished the proxy part and started experimenting with the converter. Luckily, someone has already written a program to convert text into LRF, which is available here: http://www.sven.de/librie_files/makelrf3.zip.

So all I need to do is preprocess the text file to remove unnecessary line breaks


(otherwise processed book ends up having a
jagged appearance
on the smaller screen, because they break
the line both at
the screen end, and at the end of line).
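
As a rough illustration of that preprocessing step, the sketch below joins the hard-wrapped lines of each paragraph into a single line, and leaves short lines (likely verse or headings) on their own. It is written for Python 3 and kept separate from the script further down, which uses Python 2 syntax. The function name `unwrap_paragraphs` and the `short_line` parameter are assumptions made for this example; the 50-character default mirrors the threshold used in the script.

```python
def unwrap_paragraphs(lines, short_line=50):
    """Join hard-wrapped lines into one output line per paragraph.

    Blank input lines separate paragraphs and are kept as empty
    output lines. An input line shorter than `short_line` characters
    ends the current output line, so verse keeps its line breaks.
    """
    out = []      # finished output lines
    current = []  # pieces of the output line being built
    for raw in lines:
        # Accept both CRLF and bare LF line endings
        line = raw.rstrip('\r\n')
        if not line:
            # Blank line: finish the paragraph, keep the separator
            if current:
                out.append(' '.join(current))
                current = []
            out.append('')
            continue
        current.append(line)
        if len(line) < short_line:
            # Short line: treat it as the end of an output line
            out.append(' '.join(current))
            current = []
    if current:
        out.append(' '.join(current))
    return out
```

Calling `unwrap_paragraphs(open('140.txt'))` on a downloaded text file (the file name here is only an example) would return a list of lines ready to be written back out with whatever line ending the converter expects.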

The simplest way to experiment is to just write a script, and the simplest way to write a script is to use Python. So here it goes; you can copy it from here. It assumes that all makelrf files are in c:\bin, and that the books are output into c:\books.

It takes the book as a number (140) or as a full URL to the text file (http://www.gutenberg.org/files/140/140.txt).


import os
import sys
import urllib
import tempfile

DESCR_TEXT = 'The Project Gutenberg EBook of '
AUTHOR_TEXT = 'Author: '
AUTHOR_LEN = len(AUTHOR_TEXT)
TITLE_TEXT = 'Title: '
TITLE_LEN = len(TITLE_TEXT)

def main(argv):
  if len(argv) != 2:
    print 'Usage: gutenbergtolrf.py number or URL'
    # Escape the backslash: '\b' would otherwise be a backspace character
    print 'The output goes into c:\\books'
    return

  url = argv[1]
  if not url.startswith('http://'):
    url = ('http://www.gutenberg.org/files/%s/%s.txt'
           % (argv[1], argv[1]))

  (fd, temp_file_name) = tempfile.mkstemp(
      suffix = '.txt', text = True)

  url_file = urllib.urlopen(url)

  title = None
  author = None
  description = None

  # Read URL, and convert everything into
  # single-line paragraphs. Also parse out
  # title, author and description
  first_line = True
  for l in url_file:
    # Strip the line ending, whether it is CRLF or a bare LF
    l = l.rstrip('\r\n')
    if l:
      if not description and l.startswith(DESCR_TEXT):
        description = l

      if not author and l.startswith(AUTHOR_TEXT):
        author = l[AUTHOR_LEN:]
      if not title and l.startswith(TITLE_TEXT):
        title = l[TITLE_LEN:]

      if first_line:
        first_line = False
      else:
        os.write(fd, ' ')

      os.write(fd, l)

      # This could be a poetry stanza,
      # treat short lines differently
      if len(l) < 50:
        os.write(fd, '\r\n')
        first_line = True
    else:
      if first_line:
        os.write(fd, '\r\n')
      else:
        os.write(fd, '\r\n\r\n')
      first_line = True

  url_file.close()
  os.close(fd)

  if not (author and title and description):
    print 'Could not parse the file!'
    os.remove(temp_file_name)
    return

  os.chdir('c:\\bin')
  target = 'c:\\books\\%s.lrf' % title

  if os.path.exists(target):
    os.remove(target)

  print temp_file_name
  print 'Title: ' + title
  print 'Author: ' + author
  print 'Description: ' + description
  print 'Converting to LRF: ' + target

  os.spawnv(os.P_WAIT, 'c:\\bin\\makelrf.exe', 
            ['c:\\bin\\makelrf.exe',
             '-d', '"%s"' % description,
             '-a', '"%s"' % author,
             '-t', '"%s"' % title,
             '-o', '"%s"' % target,
             temp_file_name])

  os.remove(temp_file_name)


if __name__ == '__main__':
  main(sys.argv)
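
One thing to watch for: the script uses the book title directly as the output file name, so a title containing a character that Windows does not allow in file names (such as `:` or `?`) would produce an invalid path. Below is a minimal sketch of one way to clean the title first. It is written for Python 3, and the helper name `safe_filename` and its `replacement` parameter are hypothetical, not part of the script above. It does not handle reserved device names such as `CON` or `NUL`.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, replacement='_'):
    """Return a version of `title` usable as a Windows file name."""
    cleaned = _INVALID_CHARS.sub(replacement, title)
    # Windows also rejects names that end in a space or a dot
    cleaned = cleaned.strip().rstrip('. ')
    # Fall back to a placeholder if nothing usable is left
    return cleaned or 'untitled'
```

With a helper like this, the target path could be built as `'c:\\books\\%s.lrf' % safe_filename(title)`, while the unmodified title is still passed to makelrf as the `-t` argument.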

2 comments:

Alex said...: Hi Sergey,

I invite you to join our Sony Reader community where you can find plenty of additional information regarding how to convert content to LRF.

Keep up the great work!

Cheers,
Alex; January 18, 2008 at 1:41 AM
Igor Skochinsky said...: You should consider using libprs500 for LRF generation, it's much much more powerful than makelrf. It's in Python to boot :)
https://libprs500.kovidgoyal.net/; January 18, 2008 at 6:51 AM
