0 votes

Thanks for the prompt answer. This explains everything well.

But searching my codebase by index has been working well enough for my needs.

Could you tell me the general rules used for splitting text into tokens?

by (60 points)

1 Answer

0 votes
Best answer

Standard tokens are alphanumeric sequences, i.e. runs of letters and/or digits. Any other character is treated as a break and ends the current token.

There are some exceptions: the tokenizer tries to keep the following sequences together as a single token, even though they contain characters that would normally act as breaks:

  • Acronyms (e.g. U.S.A.)
  • Internal apostrophes (e.g. O'Reilly)
  • Company names (e.g. AT&T)
  • Email addresses (e.g. person@host.com)
  • Host names (e.g. server.domain)
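
To make these rules concrete, here is a rough Python sketch that approximates them with a single regular expression. This is not the tokenizer's actual implementation: the patterns, the ASCII-only character classes, and the names TOKEN_RE and tokenize are my own assumptions for illustration, so the real tokenizer may well differ on edge cases.

```python
import re

# Hypothetical approximation of the rules above (ASCII only).
# Alternatives are tried in order, so the special cases come before the
# plain alphanumeric fallback.
TOKEN_RE = re.compile(
    r"""
      [A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*@[A-Za-z0-9]+(?:[.-][A-Za-z0-9]+)+  # email address
    | (?:[A-Za-z]\.){2,}(?![A-Za-z0-9])   # acronym: single letters, each followed by a dot
    | [A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+     # host name: dot-separated labels
    | [A-Za-z]+&[A-Za-z]+                 # company name joined by an ampersand
    | [A-Za-z]+(?:'[A-Za-z]+)+            # word with internal apostrophes
    | [A-Za-z0-9]+                        # standard token: letters and/or digits
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens found in text; anything unmatched acts as a break."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    sample = "Mail person@host.com about AT&T, O'Reilly and the U.S.A. office at server.domain!"
    print(tokenize(sample))
    # ['Mail', 'person@host.com', 'about', 'AT&T', "O'Reilly", 'and', 'the',
    #  'U.S.A.', 'office', 'at', 'server.domain']
```

With this sketch, something like "foo-bar_baz 42" falls through to the last alternative and comes out as four separate tokens (foo, bar, baz, 42), because the hyphen, underscore, and space all count as breaks.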
by (31.4k points)