Standard tokens are alphanumeric sequences, i.e. sequences of digits and/or letters. Any other character is treated as a break and ends the current token.
There are some exceptions. The tokenizer tries to keep each of the following as a single token, even though it contains separator characters:
- Acronyms (e.g. U.S.A.)
- Internal apostrophes (e.g. O'Reilly)
- Company names (e.g. AT&T)
- Email addresses (e.g. email@example.com)
- Host names (e.g. server.domain)
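
The rules above can be approximated with a single regular expression. The following is a minimal sketch, not the tokenizer's actual grammar: the function name `tokenize` and the individual patterns are assumptions chosen for illustration, they handle ASCII letters and digits only, and the real tokenizer may differ in edge cases.

```python
import re

# Hypothetical approximation of the rules described above.
# Alternatives are tried in order, so the special cases come
# before the plain alphanumeric fallback.
TOKEN_RE = re.compile(
    r"""
    [A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+   # email addresses
  | (?:[A-Za-z]\.){2,}                                     # acronyms, e.g. U.S.A.
  | [A-Za-z0-9]+(?:&[A-Za-z0-9]+)+                         # company names, e.g. AT&T
  | [A-Za-z0-9]+(?:'[A-Za-z0-9]+)+                         # internal apostrophes
  | [A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+                        # host names
  | [A-Za-z0-9]+                                           # plain alphanumeric token
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens found in text; any unmatched character acts as a break."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    sample = "O'Reilly emailed email@example.com about AT&T and the U.S.A. via server.domain!"
    for token in tokenize(sample):
        print(token)
```

The order of the alternatives matters in this sketch: if the plain alphanumeric pattern came first, `AT&T` would be split into `AT` and `T` before the company-name pattern had a chance to match.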