
I'm searching about 3,000 documents on a regular basis. As these documents are mostly backups (i.e. revisions) of a common evolving text, they have a very, very high degree of similarity between them. Often the difference between two documents boils down to a single changed sentence -- though sometimes entire pages of text have been introduced or removed.

As a result, when I search for a phrase across these 3,000 documents, it will typically appear in about 80% of the documents, if it appears at all. But usually, I only want to see the set of different ways that the target phrase is employed. If a document uses the phrase in exactly the same sentence fragment or context (hit) as a document that has already appeared in the results, I don't need to see it again.

For example, if I search for "wilderness", I would ideally like to see:

SearchResultDocument268.doc
hit: "escaped into the wilderness"

SearchResultDocument745.doc
hit: "through the wilderness of Amazonia"

SearchResultDocument3929.doc
hit: "escaped into the wilderness"
hit: "through the wilderness of Amazonia"
hit: "no known wilderness was larger"

And so on, and so on, to reveal the hundred or so unique employments of the target phrase across these 3,000 documents.

As it now stands, only about 2% of the 1,800 documents returned by this search introduce the target phrase in a new textual context.

BTW, because I'm only looking at the immediate context of the target phrase, it really doesn't matter to me which specific document is returned to demonstrate how the target phrase is used. In other words, I don't really care if the search results show:

SearchResultDocument268.doc
hit: "escaped into the wilderness"

Or if they show:

SearchResultDocument433.doc
hit: "escaped into the wilderness"

Either one, or any other doc that employs the target phrase identically, would be fine.

If this type of perspective, or something similar, is not possible with FileLocator Pro, is it possible by using FileLocator Pro in conjunction with some other utility?

Many, many thanks.

asked by (460 points)

1 Answer

 
Best answer

You can show each unique term+context by combining regular expressions with the Keyword Summary report. It's not exactly what you asked for, but it should help.

Switch to regular expression and search for:

.{0,20}wilderness.{0,20}

This will show the term wilderness with up to 20 characters of context on either side. If you then go to reports and choose the Keywords Summary report, it will show all unique instances of your search term within its context.
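To see what that pattern captures, here is a minimal Python sketch of the same idea outside FileLocator Pro. It is only an illustration: the `unique_contexts` helper and the sample strings are made up for this example, and Python's regex engine may treat line breaks differently from FileLocator Pro (here `.` does not cross a newline, so the context stops at the end of the line).

```python
import re

# Same pattern as above: the term with up to 20 characters of context per side.
PATTERN = re.compile(r".{0,20}wilderness.{0,20}")

def unique_contexts(texts):
    """Return the distinct term+context strings, in order of first appearance."""
    seen = {}  # dict used as an ordered set
    for text in texts:
        for match in PATTERN.finditer(text):
            seen.setdefault(match.group(0), None)
    return list(seen)

# Hypothetical sample data standing in for the contents of three documents.
samples = [
    "They escaped into the wilderness.",
    "They escaped into the wilderness.",
    "A road through the wilderness of Amazonia.",
]

for context in unique_contexts(samples):
    print(repr(context))
```

The two identical sentences collapse into a single entry, which is roughly what the report does for the real search. Note that the context is cut by character count, so it can start in the middle of a word.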

What it won't show is the first file in which each term+context appeared. For that, simply search again for the specific term+context you are interested in, which will list only the files that contain it.
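If you would rather get the first file for every context in one pass, a small script can do it. This is a rough sketch under assumptions, not a FileLocator Pro feature: the `first_file_per_context` function, the `documents` mapping, and the file names are all hypothetical, and it assumes the document text has already been extracted to plain strings.

```python
import re

def first_file_per_context(documents, term, width=20):
    """Map each distinct term+context string to the first file containing it.

    `documents` maps a file name to its text and is scanned in insertion order.
    """
    pattern = re.compile(r".{0,%d}%s.{0,%d}" % (width, re.escape(term), width))
    first_seen = {}
    for name, text in documents.items():
        for match in pattern.finditer(text):
            # setdefault keeps the earliest file and ignores later duplicates.
            first_seen.setdefault(match.group(0), name)
    return first_seen

# Hypothetical file names and contents, for illustration only.
documents = {
    "a.doc": "escaped into the wilderness",
    "b.doc": "escaped into the wilderness",
    "c.doc": "no known wilderness was larger",
}

for context, name in first_file_per_context(documents, "wilderness").items():
    print(name, "hit:", repr(context))
```

Since any document that uses the phrase identically is as good as another, reporting only the first one found is enough here.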

answered by (67.2k points)
Extraordinary! Absolutely extraordinary! This does EXACTLY what I wanted!! I really never thought it would be possible!

Thank you so, so much Dave!
...