0 votes

Is it possible to use the RegExp functionality in Agent Ransack to search multi-line PDF files and return only those which contain a single word?
i.e. exactly one word in the whole document, even though the document has multiple lines, excluding hits which have multiple words.

e.g.: a single word appearing in a PDF document of 5 lines.



but this wouldn't match because it has more than one word.




I want to reject matches where there is more than one word in the entire document.
The hits I'm getting back so far just seem to come from searching line by line. I need to search over the entire document, including carriage returns.

I've tried the (?s) flag and the ^ $ and \A \z anchors, but I only seem to be able to get the search to function in multi-line mode. I've also tried (?-m) to turn off multi-line mode.

by (40 points)

1 Answer

+1 vote
Best answer

Agent Ransack matches regex on a line-by-line basis, i.e. each line is matched independently of the others. Multi-line regex is only available in FileLocator Pro.
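
To illustrate why the anchors don't help here, this is a minimal Python sketch of the difference between matching each line independently and matching the whole text at once. It is not Agent Ransack's implementation; the function names and the sample strings are made up for illustration, and it works on text that has already been extracted from the PDF.

```python
import re

# One word, optionally surrounded by whitespace (including line breaks).
SINGLE_WORD = re.compile(r"\s*\w+\s*")

def matches_per_line(text):
    """Line-by-line matching: a hit if ANY single line is exactly one word."""
    return any(SINGLE_WORD.fullmatch(line) for line in text.splitlines())

def matches_whole_document(text):
    """Whole-document matching: a hit only if the ENTIRE text is one word."""
    return SINGLE_WORD.fullmatch(text) is not None

# Hypothetical extracted text: five lines, one word in total.
one_word = "\n\nhello\n\n\n"
# Hypothetical extracted text: one word per line, two words in total.
many_words = "hello\nworld\n"

print(matches_per_line(one_word), matches_whole_document(one_word))
print(matches_per_line(many_words), matches_whole_document(many_words))
```

With line-by-line matching both documents produce a hit, because each line is tested on its own; only the whole-document match rejects the second one.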

I think the best option is to search for '*' as a Boolean expression and sort on the 'Hits' column to group files with the lowest number of hits, e.g.

File name:       *.pdf
Containing Text: *
Look In:         C:\Search
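
As a rough sketch of the same ranking idea outside the tool, the Python below counts hits per document and sorts ascending. It assumes that each non-empty line counts as one hit, which is a simplification and may not match how Agent Ransack counts exactly; the file names and text are hypothetical, and the text is assumed to be already extracted from the PDFs.

```python
def count_hits(text):
    """Count non-empty lines (assumption: one hit per non-empty line)."""
    return sum(1 for line in text.splitlines() if line.strip())

def rank_by_hits(docs):
    """Return document names sorted by hit count, lowest first."""
    return sorted(docs, key=lambda name: count_hits(docs[name]))

# Hypothetical extracted text keyed by file name.
docs = {
    "a.pdf": "hello\nworld\nfoo",
    "b.pdf": "\n\nhello\n\n",
    "c.pdf": "one\ntwo",
}

for name in rank_by_hits(docs):
    print(name, count_hits(docs[name]))
```

Files with a single hit sort to the top, which are the candidates for single-word documents.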

However, one issue you will have is that the PDF Reader adds ' Page 1 ' style markers, which will count as matches. You can mitigate this with the Pro version by switching to the installed IFilter instead (Configuration->Extensions->PDF Settings):

[Screenshot: PDF Settings]

You should get something like this:

[Screenshot: Hits column]

by (31.4k points)