There are roughly two classes of tokenization algorithms:
Top-down tokenization:
we define a standard and write rules to implement that standard (see the first sketch below).
Bottom-up tokenization:
we use simple statistics of letter sequences to break up words into subword tokens (see the second sketch below).
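A minimal sketch of the top-down approach, assuming a toy standard in which words (possibly containing internal hyphens or apostrophes), numbers, and individual punctuation marks each count as tokens. The regex and function name are illustrative, not a reference implementation:

```python
import re

def rule_based_tokenize(text):
    # Toy standard: a token is either a word (allowing internal
    # hyphens/apostrophes, as in "don't" or "state-of-the-art")
    # or a single non-whitespace punctuation character.
    pattern = r"\w+(?:[-']\w+)*|[^\w\s]"
    return re.findall(pattern, text)

print(rule_based_tokenize("Don't panic, it's 2024!"))
# ["Don't", 'panic', ',', "it's", '2024', '!']
```

The rules live entirely in the pattern: changing the standard (say, splitting off clitics like 't) means rewriting the rules by hand.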
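And a minimal sketch of the bottom-up idea, in the style of byte-pair encoding: repeatedly count adjacent symbol pairs across a corpus and merge the most frequent pair into a new subword symbol. The toy corpus and helper names here are assumptions for illustration:

```python
from collections import Counter

def most_frequent_pair(vocab):
    # vocab maps a word, represented as a tuple of symbols, to its corpus count
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    # Rewrite every word, replacing each occurrence of the pair
    # with a single merged symbol.
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, with "_" as an end-of-word marker
vocab = {
    ("l", "o", "w", "_"): 5,
    ("l", "o", "w", "e", "r", "_"): 2,
    ("n", "e", "w", "e", "s", "t", "_"): 6,
}
for _ in range(3):
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print("merged:", pair)
```

No linguistic standard is written down anywhere; the subword inventory emerges purely from the pair statistics of the training corpus.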