Welcome to the Mythicsoft Q&A site for:

- Agent Ransack
- FileLocator Lite
- FileLocator Pro

Please feel free to ask any questions about these products or answer other community members' questions.


I would like to locate all PDF files which do not have OCR'ed text in them so I can run OCR on those files.

I've tried several regular expressions for "blank" or "null" and haven't found one that works.


1 Answer


You'll need to search for ALL words in a file and then sort by hit count, e.g.

File name:       *.pdf
Containing Text: *
Look In:         C:\Sample\Folder

This should find ALL words in the PDF files. You can now sort the list by the Hits column to identify documents with a very low hit count.

[Image: Empty Document Search]

You'll see from the image that the document without any text, scan.pdf, has a very low hit count: just the two words from its file name (i.e. scan and pdf).
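If you copy the file names and hit counts out of the results list, the same check can be scripted. The following is only a rough sketch in Python, not a FileLocator Pro feature: the `results` pairs are made-up sample data, the function names are hypothetical, and the way the file name is split into words is an assumption based on the scan.pdf example above (FileLocator Pro may tokenize names differently).

```python
import re


def filename_word_count(name: str) -> int:
    """Count alphanumeric word runs in a file name, e.g. 'scan.pdf' -> 2.

    Assumption: words are runs of letters and digits; the real search
    engine may split file names differently.
    """
    return len(re.findall(r"[A-Za-z0-9]+", name))


def likely_needs_ocr(results, slack=0):
    """Return the names whose hit count is no higher than the number of
    words in their own file name (plus an optional slack)."""
    return [
        name
        for name, hits in results
        if hits <= filename_word_count(name) + slack
    ]


# Hypothetical (file name, hit count) pairs copied from a results list.
results = [("scan.pdf", 2), ("report.pdf", 1840), ("invoice_2019.pdf", 3)]

print(likely_needs_ocr(results))  # ['scan.pdf', 'invoice_2019.pdf']
```

Raising `slack` by a few hits may help if some scanned files carry a stray word or two of metadata text, but that threshold is a guess you would need to tune against your own files.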

If you're searching among a lot of large PDF documents you might want to limit the search to the first 20 or so lines of data so that FileLocator Pro doesn't load the whole document into memory, e.g.

File name:       *.pdf
Containing Text: LINES:0-20 *
Look In:         C:\Sample\Folder

If you want to actually hide the documents with a high hit count you can use a column filter to filter them out:

[Image: Hit count filter]

Thanks for posting this! This is incredibly useful for locating image-only PDFs and/or the ones that are not OCR'ed.