Internet Archive BookReader plack server

Last year, I had the good fortune to get acquainted with the great work that Open Library does. It's part of the Internet Archive, which is itself a library. So, libraries are not (yet) dead, it seems. Brewster Kahle's Long Now Talk explains it much better than I can, so take 90 minutes to listen to it.

The most interesting part of Open Library (IMHO) is the Internet Archive BookReader, a JavaScript application which allows users to browse scanned books on-line. For quite some time, I wanted to install something similar to provide web access to our collection of scanned documents. I found instructions for serving IA-like books from your own cluster, but I didn't have a cluster, and converting all documents to the IA book format seemed like overhead which I would like to avoid.

Instead, I decided to write an image server for the JavaScript front-end using plack. I mean, it's basically a directory with images, right? Oh, how wrong can I be? :-)

It turns out that we have pictures in multiple formats (so sorting them required removing the common prefix and using only the number to get the correct order), and most of them are scanned images in pdf documents. Here are all the types of documents which can be automatically collected into a book for on-line browsing:

  • images of scanned pages
  • multiple pdf files with a single image per page
  • single pdf file with one image for each page
  • single pdf file with more than one (usually 4) horizontal bitmap strips for each page
  • normal pdf documents which contain text and need rendering to bitmap
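
To make the sorting step mentioned above more concrete, here is a minimal sketch of the idea: strip the prefix shared by all file names and order pages by the number that follows. It is written in Python purely for illustration and is not the code from my plack server; the function name sort_pages and the scan_N.jpg file names are made up for this example.

```python
import os
import re


def sort_pages(filenames):
    """Order page image names by the number left after the shared prefix."""
    # Longest leading string shared by every name (character based).
    prefix = os.path.commonprefix(filenames)
    # The shared prefix may swallow leading digits of the page number
    # (e.g. scan_1, scan_10, scan_11 share "scan_1"), so give them back.
    prefix = prefix.rstrip("0123456789")

    def page_number(name):
        # First run of digits after the prefix; names without one sort first.
        match = re.match(r"\d+", name[len(prefix):])
        return int(match.group()) if match else 0

    return sorted(filenames, key=page_number)


if __name__ == "__main__":
    # Hypothetical file names, only to show numeric rather than text order.
    print(sort_pages(["scan_10.jpg", "scan_2.jpg", "scan_1.jpg"]))
```

A plain alphabetical sort would put scan_10.jpg before scan_2.jpg, which is exactly the wrong page order this step avoids.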

Source code of my plack server for the Internet Archive BookReader is on GitHub, so if you want to take a look, hop over there...