nltk.corpus.reader.aligned module
- class nltk.corpus.reader.aligned.AlignedCorpusReader
Bases: CorpusReader
Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
- __init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')
Construct a new Aligned Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = AlignedCorpusReader(root, r'.*\.txt')
- Parameters:
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
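As a minimal sketch of the constructor, the snippet below builds a throwaway corpus directory and points an AlignedCorpusReader at it with a regexp for fileids. The file names, the sample sentences, and the helper demo_fileids are hypothetical; the three-line layout (source sentence, target sentence, alignment pairs) is an assumption modelled on NLTK's comtrans corpus rather than something this page specifies.

```python
import os
import tempfile

from nltk.corpus.reader.aligned import AlignedCorpusReader


def demo_fileids():
    """Create a temporary corpus and return the fileids the reader selects."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical corpus file: source line, target line, alignment line.
        with open(os.path.join(root, "sample.txt"), "w", encoding="latin1") as handle:
            handle.write("das Haus\nthe house\n0-0 1-1\n")
        # A file that the regexp below should not select.
        with open(os.path.join(root, "notes.md"), "w", encoding="latin1") as handle:
            handle.write("not part of the corpus\n")

        # fileids is given as a regexp here; a list of names also works.
        reader = AlignedCorpusReader(root, r".*\.txt")
        return reader.fileids()


print(demo_fileids())
```

Passing the pattern as a raw string keeps the escaped dot intact, so only names ending in .txt are treated as corpus files.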
- aligned_sents(fileids=None)
- Returns:
the given file(s) as a list of AlignedSent objects.
- Return type:
list(AlignedSent)
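The following sketch shows what aligned_sents() might return for a small file. The corpus contents and the helper demo_aligned_sents are hypothetical, and the file layout (source sentence, target sentence, then a line of i-j alignment pairs) is assumed from the comtrans corpus format. The words, mots, and alignment attributes belong to nltk.translate.AlignedSent.

```python
import os
import tempfile

from nltk.corpus.reader.aligned import AlignedCorpusReader


def demo_aligned_sents():
    """Read two hypothetical sentence pairs and return them as a plain list."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "sample.txt"), "w", encoding="latin1") as handle:
            # Each block: source sentence, target sentence, alignment pairs.
            handle.write("das Haus\nthe house\n0-0 1-1\n")
            handle.write("ein Buch\na book\n0-0 1-1\n")

        reader = AlignedCorpusReader(root, r".*\.txt")
        # Materialise the result before the temporary directory is removed.
        return list(reader.aligned_sents())


for sent in demo_aligned_sents():
    # words: source tokens, mots: target tokens, alignment: index pairs.
    print(sent.words, sent.mots, sorted(sent.alignment))
```

Converting the result to a list inside the with block matters only because this sketch deletes its corpus directory afterwards; with a permanent corpus the returned object can be iterated whenever needed.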
- class nltk.corpus.reader.aligned.AlignedSentCorpusView
Bases: StreamBackedCorpusView
A specialized corpus view for aligned sentences.
AlignedSentCorpusView objects are typically created by AlignedCorpusReader (not directly by nltk users).
- __init__(corpus_file, encoding, aligned, group_by_sent, word_tokenizer, sent_tokenizer, alignedsent_block_reader)
Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.
- Parameters:
fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.
startpos – The file position at which the view will start reading. This can be used to skip over preface sections.
encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).
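Although views are normally created by AlignedCorpusReader, building one by hand can help show how the constructor arguments fit together. This is only a sketch: the sample file and the helper demo_view are hypothetical, the tokenizers mirror the reader's documented defaults, and setting both aligned and group_by_sent to True is assumed to match what aligned_sents() does internally.

```python
import os
import tempfile

from nltk.corpus.reader.aligned import AlignedSentCorpusView
from nltk.corpus.reader.util import read_alignedsent_block
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer


def demo_view():
    """Build a view over one hypothetical file and return its items."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "sample.txt")
        with open(path, "w", encoding="utf8") as handle:
            handle.write("das Haus\nthe house\n0-0 1-1\n")

        view = AlignedSentCorpusView(
            corpus_file=path,
            encoding="utf8",
            aligned=True,        # assumption: yields AlignedSent objects
            group_by_sent=True,
            word_tokenizer=WhitespaceTokenizer(),
            sent_tokenizer=RegexpTokenizer("\n", gaps=True),
            alignedsent_block_reader=read_alignedsent_block,
        )
        # Read everything while the temporary file still exists.
        return list(view)


items = demo_view()
print(len(items), items[0].words, items[0].mots)
```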