General Question

How do I create a full text search index of pdf files?

Asked by skorned (97 points) June 25th, 2009

I’m creating a site to list and filter a set of pdf files that I will store locally on my server. The file names and details will be stored in a MySQL database for easy filtering (the files are pdfs of past exam papers, so various parameters like the exam year and subject need to be filtered).
I also want to give users the option to search through the text contents of the files. To that end, how can I create and store the index?
Most of the files I want to index consist of text. Some files have scanned copies of the text as images, so that each page is one image, but I guess I’ll go into OCR later. However, if you could recommend a solution for that too, that’ll be great.
Otherwise, just a plain text indexer for the PDF format, please?
And while you’re at it, if you know any website or tool that does good conversions from PDF to HTML while maintaining layout and pictures, that’ll be of great help too, so I can offer users a link to an HTML equivalent of the files if they can’t view PDFs…

1 Answer

jumpo7

You have a major project on your hands there. I saw your other post about filtering on the file names using PHP. As suggested there, you could use a db, and the methods recommended can be used to parse the file names into the db.

I, however, would use PHP to read the directory and parse the names into the db. Using Excel as an intermediate step is a manual process. Read up on PHP’s directory and file reading functions (see the readdir documentation on php.net). Then you can use the split or explode functions to parse the file name directly into the db.
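
To make the idea concrete, here is a rough sketch of that directory-to-db step. It is written in Python rather than PHP, and it uses an in-process SQLite database as a stand-in for MySQL. The file naming scheme (`year_subject_paper.pdf`), the `papers` table and the function names are all assumptions made up for illustration; adjust them to however your files are actually named.

```python
import os
import sqlite3


def parse_paper_name(filename):
    """Split a name like '2008_physics_paper1.pdf' into (year, subject, paper).

    The 'year_subject_paper.pdf' scheme is a hypothetical convention.
    Returns None for anything that does not match it.
    """
    stem, ext = os.path.splitext(filename)
    if ext.lower() != ".pdf":
        return None
    parts = stem.split("_")  # plays the role of PHP's explode("_", ...)
    if len(parts) != 3 or not parts[0].isdigit():
        return None
    return int(parts[0]), parts[1], parts[2]


def index_directory(directory, conn):
    """Read the directory and store one row per recognised PDF name.

    Returns the number of file names that matched the naming scheme.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "filename TEXT PRIMARY KEY, year INTEGER, subject TEXT, paper TEXT)"
    )
    recognised = 0
    for name in sorted(os.listdir(directory)):
        parsed = parse_paper_name(name)
        if parsed is None:
            continue  # skip files that do not follow the scheme
        year, subject, paper = parsed
        # INSERT OR REPLACE keeps the script safe to re-run on the same folder
        conn.execute(
            "INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?)",
            (name, year, subject, paper),
        )
        recognised += 1
    conn.commit()
    return recognised
```

Calling `index_directory("/path/to/pdfs", sqlite3.connect("papers.db"))` would fill the table in one pass, with no spreadsheet in between. The same shape carries over to PHP with readdir, explode and a MySQL insert.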

As for reading text out of a PDF file, there are some scripts on hotscripts that may help. You would be better off starting with work that is already done (see the hotscripts PDF manipulation listings). This will require you to read up on how to use the script you choose… a lot of work is in your future.
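
Once the text is out of the PDFs, the index itself can be fairly simple. Below is a minimal sketch of an inverted index (word to set of file names), again in Python rather than PHP. It assumes the text has already been extracted by whatever script or tool you settle on; the function names and the sample file names are hypothetical. Since your metadata is already in MySQL, a FULLTEXT index on a column holding the extracted text may be another route worth looking into.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index from {filename: extracted_text}.

    The result maps each word to the set of file names containing it.
    """
    index = defaultdict(set)
    for filename, text in documents.items():
        for word in tokenize(text):
            index[word].add(filename)
    return index


def search(index, query):
    """Return the sorted file names that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())  # keep only files with all words
    return sorted(matches)


# Hypothetical extracted text, keyed by file name
documents = {
    "2008_physics_paper1.pdf": "Newton's laws of motion",
    "2008_physics_paper2.pdf": "Laws of thermodynamics",
    "2008_chemistry_paper1.pdf": "Organic chemistry",
}
index = build_index(documents)
print(search(index, "laws motion"))
```

This only answers "which files contain all of these words"; ranking, phrase search and the scanned-image files are left out of the sketch.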
