Process Large Corpora Using a Python Generator

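Suppose you have a large text corpus and the file is too big to load into the RAM of your computer all at once.

Here is a solution for processing large corpora using a Python generator. The class below reads the file one line at a time and yields the tokens of each line, so the whole file is never held in memory.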

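class CorpusProcessing:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        # open the file in a with block so it is closed when iteration ends
        with open(self.data_path) as corpus_file:
            for line in corpus_file:
                # do your processing here
                # here I am doing whitespace tokenization
                tokens = line.split()
                yield tokens


process = CorpusProcessing('large_corpora.txt')
for tokens in process:
    # each iteration gives the token list of one line
    print(tokens)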
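
To illustrate how the streamed tokens might be consumed, here is a minimal sketch that counts token frequencies over the corpus. The `count_tokens` helper, the UTF-8 encoding, and the tiny temporary file standing in for a large corpus are assumptions made for this example; they are not part of the original solution.

```python
import os
import tempfile
from collections import Counter


class CorpusProcessing:
    """Stream a text file line by line, yielding whitespace tokens."""

    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        # Only one line is held in memory at a time.
        # Assumption: the corpus is UTF-8 encoded.
        with open(self.data_path, encoding="utf-8") as corpus_file:
            for line in corpus_file:
                yield line.split()


def count_tokens(corpus):
    """Hypothetical helper: count tokens without loading the whole corpus."""
    counts = Counter()
    for tokens in corpus:
        counts.update(tokens)
    return counts


# A tiny temporary file stands in for a large corpus (illustration only).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the cat sat\nthe dog sat\n")
    sample_path = tmp.name

corpus = CorpusProcessing(sample_path)
counts = count_tokens(corpus)
print(counts.most_common(2))

# __iter__ reopens the file on every call, so the corpus can be iterated again.
line_count = sum(1 for _ in corpus)
print(line_count)

os.remove(sample_path)
```

Because the file is opened inside `__iter__` and not in `__init__`, the same object can be looped over more than once, which helps when a job needs several passes over the corpus.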


Thanks To

  • Faruk Ahmad vai, for pushing me to learn Python generators