Efficiently computing & storing token n-grams from large corpora
Compute n-gram statistics and fit n-gram language models over the pre-tokenized text corpora used to train large language models.
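
As a rough illustration of what "n-gram statistics" over pre-tokenized text can mean, the sketch below counts n-grams over documents given as lists of integer token IDs and derives a maximum-likelihood bigram estimate from the counts. The names `count_ngrams` and `bigram_mle` are hypothetical and are not taken from this project. The in-memory `Counter` is an assumption made for brevity and would not scale to large corpora.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_ngrams(docs: Iterable[Sequence[int]], n: int) -> Counter:
    """Count token n-grams per document; n-grams never span document boundaries."""
    if n < 1:
        raise ValueError("n must be >= 1")
    counts: Counter = Counter()
    for tokens in docs:
        # Zipping n staggered slices of the sequence yields each n-gram as a tuple.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def bigram_mle(bigrams: Counter, prev: int, nxt: int) -> float:
    """Maximum-likelihood estimate of P(nxt | prev) from bigram counts (no smoothing)."""
    # Total count of bigrams whose first token is `prev`.
    context_total = sum(c for (first, _), c in bigrams.items() if first == prev)
    if context_total == 0:
        return 0.0  # unseen context; a real model would back off or smooth
    return bigrams[(prev, nxt)] / context_total


if __name__ == "__main__":
    # Illustrative token IDs only; a real corpus would be streamed from disk.
    corpus = [[1, 2, 1, 3], [2, 1]]
    bigrams = count_ngrams(corpus, 2)
    print(bigrams.most_common(3))
    print(bigram_mle(bigrams, 1, 2))
```

The linear scan in `bigram_mle` keeps the sketch short. An efficient implementation would more plausibly index counts by context or use a compact on-disk structure.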