How do I create a full text search index of pdf files?

Asked by skorned (97 points) June 25th, 2009

I’m creating a site to list and filter a set of PDF files that I will store locally on my server. The file names and details will be stored in a MySQL database for easy filtering (the files are PDFs of past exam papers, so they need to be filterable by parameters such as exam year and subject).
I also want to give users the option to search through the text contents of the files. To that end, how can I create and store the index?
Most of the files I want to index consist of text. Some files contain scanned copies of the text as images, with each page being one image, but I guess I’ll look into OCR later. However, if you could recommend a solution for that too, that would be great.
Otherwise, could you just recommend a plain-text indexer for the PDF format, please?
And while you’re at it, if you know of any website or tool that does good conversions from PDF to HTML while maintaining layout and pictures, that would be a great help too, so I can offer users a link to an HTML equivalent of the files if they can’t view PDFs.
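
To make the kind of index being asked about concrete, here is a rough sketch of a plain-text inverted index in Python. It is only an illustration under some assumptions: the text has already been extracted from each PDF by a separate step (a command-line tool such as pdftotext is one option), and the file names and sample text below are made up. The function names `tokenize`, `build_index`, and `search` are hypothetical, not part of any library.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of file names whose text contains it."""
    index = defaultdict(set)
    for filename, text in documents.items():
        for word in tokenize(text):
            index[word].add(filename)
    return index


def search(index, query):
    """Return the files that contain every word in the query (AND search)."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample data standing in for text extracted from the PDFs.
documents = {
    "physics_2007.pdf": "Calculate the velocity of the projectile.",
    "maths_2008.pdf": "Calculate the integral of the function.",
}
index = build_index(documents)
print(sorted(search(index, "calculate velocity")))  # ['physics_2007.pdf']
```

An in-memory dictionary like this would not survive a restart, so on a real site the word-to-file mapping would probably live in a database table next to the file details. Another route, since the details are already in MySQL, would be to store the extracted text in a column and let MySQL's own FULLTEXT index handle the matching.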
