
Thanks for the prompt answer. This explains everything well.

But searching my codebase by index has been working well enough for my needs.

Could you share the general rules used for splitting text into tokens?


1 Answer

Best answer

Standard tokens are alphanumeric sequences, i.e. sequences of letters and/or digits. Any other character is treated as a break and ends the token.

There are some exceptions. The tokenizer tries to keep the following character sequences together as a single token, even though they contain separators:

  • Acronyms (e.g. U.S.A.)
  • Internal apostrophes (e.g. O'Reilly)
  • Company names (e.g. AT&T)
  • Email addresses (e.g. person@host.com)
  • Host names (e.g. server.domain)
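
To make the rules above concrete, here is a rough Python sketch that approximates them with a single regular expression. It is not the product's actual tokenizer: the patterns, their ordering, and the `tokenize` helper are all assumptions made for illustration, and the real indexer may treat edge cases differently.

```python
import re

# Hypothetical approximation of the rules described above.
# Alternatives are tried left to right, so the special cases come
# before the plain alphanumeric fallback.
TOKEN_RE = re.compile(
    r"""
    [A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+  # email address
  | (?:[A-Za-z]\.){2,}                                   # acronym, e.g. U.S.A.
  | [A-Za-z0-9]+(?:&[A-Za-z0-9]+)+                       # company name, e.g. AT&T
  | [A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+                      # host name, e.g. server.domain
  | [A-Za-z0-9]+(?:'[A-Za-z0-9]+)+                       # internal apostrophe, e.g. O'Reilly
  | [A-Za-z0-9]+                                         # standard alphanumeric token
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens found in text; anything unmatched acts as a break."""
    return [match.group(0) for match in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Email person@host.com about O'Reilly, AT&T and the U.S.A. at server.domain (file-name 42)"
    print(tokenize(sample))
```

In this sketch the hyphen in `file-name` is a break, so it yields two tokens, while each of the listed exceptions survives as one token.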