nltk.corpus.reader.aligned module

class nltk.corpus.reader.aligned.AlignedCorpusReader

Bases: CorpusReader

Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.

__init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')

Construct a new Aligned Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = AlignedCorpusReader(root, r'.*\.txt')

Parameters:

root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.

aligned_sents(fileids=None)

Returns: the given file(s) as a list of AlignedSent objects.
Return type: list(AlignedSent)

sents(fileids=None)

Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type: list(list(str))

words(fileids=None)

Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)
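The three accessors above can be compared on a tiny throwaway corpus. The sketch below assumes the three-line block layout that the default alignedsent_block_reader works with (a source sentence, a target sentence, then alignment pairs such as 0-0 1-1). The file name demo.txt and its contents are invented for illustration.

```python
import os
import tempfile

from nltk.corpus.reader.aligned import AlignedCorpusReader

# Hypothetical one-block corpus: source line, target line, alignment line.
SAMPLE = "the house\ndas Haus\n0-0 1-1\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "demo.txt"), "w", encoding="latin1") as f:
        f.write(SAMPLE)

    # fileids is given as a regexp; sep and the tokenizers keep their defaults.
    reader = AlignedCorpusReader(root, r".*\.txt")

    # Materialize the lazy corpus views while the temporary file still exists.
    words = list(reader.words())
    sents = [list(sent) for sent in reader.sents()]
    aligned = list(reader.aligned_sents())

first = aligned[0]
print(words)                   # flat list of source-side tokens
print(sents)                   # the same tokens grouped by sentence
print(first.words, first.mots, sorted(first.alignment))
```

In this sketch, words() and sents() expose only the source side of each pair; the target sentence and the alignment are reachable through the AlignedSent objects returned by aligned_sents().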

class nltk.corpus.reader.aligned.AlignedSentCorpusView

Bases: StreamBackedCorpusView

A specialized corpus view for aligned sentences. AlignedSentCorpusView objects are typically created by AlignedCorpusReader (not directly by nltk users).

__init__(corpus_file, encoding, aligned, group_by_sent, word_tokenizer, sent_tokenizer, alignedsent_block_reader)

Create a new corpus view based on the file corpus_file. Blocks are read with alignedsent_block_reader and tokenized with sent_tokenizer and word_tokenizer. See the class documentation for more information.

Parameters:

corpus_file – The path to the file that is read by this corpus view. corpus_file can either be a string or a PathPointer.
encoding – The unicode encoding that should be used to read the file’s contents.
aligned – If true, each block is returned as an AlignedSent object.
group_by_sent – If true (and aligned is false), the words of each sentence are grouped into their own list; otherwise they are returned as a flat list of words.

read_block(stream)

Read a block from the input stream.

Parameters: stream (stream) – an input stream
Returns: a block of tokens from the input stream
Return type: list(any)
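To make the notion of a "block" concrete, here is a minimal standard-library sketch of reading one aligned-sentence block at a time. It is a simplified stand-in, not NLTK's implementation: the helper names read_aligned_block and parse_block are hypothetical, and the three-line layout (source, target, alignment pairs) is an assumption about the input file.

```python
import io
import re


def read_aligned_block(stream):
    """Collect non-blank lines up to and including an alignment line."""
    lines = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines between blocks
        lines.append(line)
        if re.match(r"\d+-\d+", line):
            break  # an alignment line such as "0-0 1-1" closes the block
    return lines


def parse_block(lines):
    """Split a three-line block into source tokens, target tokens, and index pairs."""
    source, target, links = lines
    pairs = [tuple(int(i) for i in link.split("-")) for link in links.split()]
    return source.split(), target.split(), pairs


# Invented sample data: two blocks separated by a blank line.
stream = io.StringIO(
    "the house\ndas Haus\n0-0 1-1\n"
    "\n"
    "a book\nein Buch\n0-0 1-1\n"
)
first = parse_block(read_aligned_block(stream))
second = parse_block(read_aligned_block(stream))
print(first)
print(second)
```

Each call consumes exactly one block from the stream, which is the same contract a corpus view relies on when it reads a file lazily.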
