0 votes

Searches in HTML files can miss intended matches when characters are coded as HTML entities, e.g., "<" coded as "&lt;".

**** Example text:
In the browser, the HTML file appears as:

"This is an example"
X = Y - (Z << 4)

In the HTML source, that text is coded as:

&quot;This is an example&quot;
<br />
X = Y - (Z &lt;&lt; 4)

**** Example FLP search strings, plain text:
Containing text: "This is an example"
Containing text: X = Y - (Z << 4)

**** Search results:
FLP won't find the HTML file, as those search strings won't match the text in the file.

Are there any easy ways to work around this problem?

In HTML, characters that can be coded as "&...;" are called "entities".

It would be cool if FLP could search HTML files both with the search strings as specified and with the search strings encoded as entities. I realize this is problematic, as a web page may not encode every possible character as an entity, e.g., it may leave a double quote as a literal character (not "&quot;") but encode "<" as "&lt;".
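To illustrate the idea (this is not an FLP feature, just a sketch of generating the alternate search strings by hand), something like the following Python could produce the plain and entity-encoded forms of a query. The function name `entity_variants` is made up for this example, and it only covers the "all or nothing" cases: it will not produce every mixed form where a page encodes some characters and not others.

```python
import html


def entity_variants(query: str) -> list[str]:
    """Return the query as typed plus entity-encoded forms, without duplicates."""
    candidates = [
        query,                            # as typed
        html.escape(query, quote=False),  # only &, < and > encoded
        html.escape(query, quote=True),   # quotes encoded as well
    ]
    # dict.fromkeys keeps the first occurrence of each string, in order
    return list(dict.fromkeys(candidates))


for q in ['"This is an example"', 'X = Y - (Z << 4)']:
    print(entity_variants(q))
```

Each variant could then be pasted into "Containing text" as a separate search, or combined with OR if the search syntax in use allows it.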

Why search html source files?

There are web-sites that have useful info, but for searching within their web-pages, Google's search capabilities aren't as powerful as File Locator Pro's, e.g., FLP supports regex.
So, I've downloaded the web-sites using wget, then I search the html files using File Locator Pro (FLP).
However, HTML tags can be interspersed in the plain-text I'm searching for, as in the examples I posted earlier. Those tags can prevent FLP from finding intended matches.

Use Case

My particular use-case is mirroring web-sites, and searching the html files using FLP.
The best solution for me is probably to just convert the web-site mirror to plain text, then use FLP to search the plain text (not the HTML source).
It still might be useful for FLP to have a search option for HTML files, in which FLP converts the HTML file to plain text and the plain text is searched instead of the HTML source.


One solution for searching html files might be to convert them to plain text, then search the plain text. Converting them to plain text would involve replacing "entities" with the characters they represent, removing formatting, etc. I anticipate this conversion could be complicated, and difficult or impossible to do perfectly. But a useful solution may be possible.
For example, Lynx converts HTML to plain-text (almost plain-text; there's still text coloring).
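
As a rough sketch of what such a conversion involves (not how Lynx or FLP does it), Python's standard `html.parser` module can strip tags and decode entities. The class name `TextExtractor`, the helper `html_to_text`, and the list of tags treated as line breaks are my own choices for this example; a real converter would need to handle much more (tables, encodings, malformed markup).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, dropping tags and decoding entities."""

    # Tags treated as line breaks; an assumed, incomplete list.
    BLOCK_TAGS = {"br", "p", "div", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode &lt;, &quot;, etc.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(source: str) -> str:
    parser = TextExtractor()
    parser.feed(source)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = '&quot;This is an example&quot;\n<br />\nX = Y - (Z &lt;&lt; 4)'
print(html_to_text(sample))
```

Run over a wget mirror, the output files could then be searched with the plain-text strings from the example above.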

by (30 points)

1 Answer

0 votes

In the Configuration->Advanced->IFilter settings you can include IFilters for HTML/XML files which, assuming they're installed, will do the conversion you request:

HTML IFilter

by (31.3k points)
This is just what I needed.

BTW: Dave, in your edits to clean up my posts, the HTML source example was not showing the HTML entities. I edited it to fix that.
Perfect, thanks.