nltk.corpus.reader.dependency module

class nltk.corpus.reader.dependency.DependencyCorpusReader

Bases: SyntaxCorpusReader

__init__(root, fileids, encoding='utf8', word_tokenizer=<nltk.tokenize.simple.TabTokenizer object>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>)

Parameters:

root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods. This parameter is described here but does not appear in the constructor signature shown above.
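The four accepted forms of encoding can be illustrated with a small standalone sketch. The helper resolve_encoding below is hypothetical (it is not part of NLTK), and it assumes that list entries are tested with re.match against the file identifier. It only mirrors the lookup rules described above, where a result of None stands for "process as non-unicode byte strings".

```python
import re


def resolve_encoding(spec, file_id):
    """Hypothetical helper: return the encoding name for file_id, or None."""
    # A string applies to every file; None means "no decoding" for every file.
    if spec is None or isinstance(spec, str):
        return spec
    # A dictionary maps file identifiers to encoding names.
    if isinstance(spec, dict):
        return spec.get(file_id)
    # A list holds (regexp, encoding) tuples; the first matching tuple wins.
    # Assumption: matching is done with re.match (anchored at the start).
    for pattern, name in spec:
        if re.match(pattern, file_id):
            return name
    return None


# Illustrative file identifiers and encodings (made up for this sketch).
as_string = resolve_encoding("utf8", "train.dp")
as_dict = resolve_encoding({"train.dp": "latin-1"}, "test.dp")
as_list = resolve_encoding(
    [(r"legacy/.*", "latin-1"), (r".*\.dp", "utf8")], "legacy/old.dp"
)
```

In the list form the order of the tuples matters: legacy/old.dp matches both patterns here, and the first one is used.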

parsed_sents(fileids=None)

sents(fileids=None)

tagged_sents(fileids=None)

tagged_words(fileids=None)

words(fileids=None)
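As a minimal sketch of how these accessors relate, the example below writes a tiny corpus to a temporary directory and reads it back. The file name sample.dp, the sample sentences, and the four tab-separated columns (word, tag, head index, relation) with blank lines between sentences are assumptions made for illustration. The example requires NLTK to be installed.

```python
import os
import tempfile

from nltk.corpus.reader.dependency import DependencyCorpusReader

# Assumed layout: one token per line, columns word / tag / head / relation,
# sentences separated by a blank line.
SAMPLE = (
    "John\tNNP\t2\tSUB\n"
    "loves\tVBZ\t0\tROOT\n"
    "Mary\tNNP\t2\tOBJ\n"
    "\n"
    "She\tPRP\t2\tSUB\n"
    "sleeps\tVBZ\t0\tROOT\n"
)

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.dp"), "w", encoding="utf8") as handle:
        handle.write(SAMPLE)

    # fileids is given implicitly, as a regular expression over file paths.
    reader = DependencyCorpusReader(root, r".*\.dp")

    # Materialize everything while the temporary files still exist.
    words = list(reader.words())
    tagged_words = list(reader.tagged_words())
    sents = [list(sent) for sent in reader.sents()]
    tagged_sents = [list(sent) for sent in reader.tagged_sents()]
    graphs = list(reader.parsed_sents())

print(words)
print(tagged_sents)
print(graphs[0].root["word"])
```

In this sketch words() and tagged_words() give flat sequences, sents() and tagged_sents() group the same tokens by sentence, and parsed_sents() yields one nltk.parse.DependencyGraph per sentence.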

class nltk.corpus.reader.dependency.DependencyCorpusView

Bases: StreamBackedCorpusView

__init__(corpus_file, tagged, group_by_sent, dependencies, chunk_types=None, encoding='utf8')

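Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information. Note that this description uses the parameter names of the base class StreamBackedCorpusView: in the signature above the file is passed as corpus_file, and block_reader and startpos do not appear.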

Parameters:

fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.
startpos – The file position at which the view will start reading. This can be used to skip over preface sections.
encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).
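The boolean arguments in the signature are not described above. The sketch below shows one way to explore them, assuming that tagged controls whether (word, tag) pairs are returned and that group_by_sent controls whether tokens are grouped per sentence. The file name sample.dp and the three-column layout (word, tag, head index) are illustrative assumptions, and NLTK must be installed.

```python
import os
import tempfile

from nltk.corpus.reader.dependency import DependencyCorpusView

# Assumed layout: word / tag / head index, blank line between sentences.
SAMPLE = (
    "John\tNNP\t2\n"
    "loves\tVBZ\t0\n"
    "\n"
    "Mary\tNNP\t2\n"
    "sleeps\tVBZ\t0\n"
)

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "sample.dp")
    with open(path, "w", encoding="utf8") as handle:
        handle.write(SAMPLE)

    # Plain words, not grouped by sentence.
    flat_view = DependencyCorpusView(
        path, tagged=False, group_by_sent=False, dependencies=False
    )
    flat_words = list(flat_view)

    # (word, tag) pairs, one list per sentence.
    grouped_view = DependencyCorpusView(
        path, tagged=True, group_by_sent=True, dependencies=False
    )
    tagged_by_sentence = [list(sent) for sent in grouped_view]

print(flat_words)
print(tagged_by_sentence)
```

In everyday use the view is normally created indirectly through the DependencyCorpusReader methods rather than constructed by hand.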

read_block(stream)

Read a block from the input stream.

Parameters: stream (stream) – an input stream
Returns: a block of tokens from the input stream
Return type: list(any)
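To make the idea of "a block of tokens" concrete, here is a standalone sketch that does not use NLTK. The function read_block_sketch is hypothetical. It assumes that one block is one sentence, that sentences are separated by blank lines, and that each line carries tab-separated fields with the word and tag first. The actual method may differ in detail.

```python
import io


def read_block_sketch(stream):
    """Hypothetical stand-in: read one blank-line-delimited sentence."""
    lines = []
    while True:
        line = stream.readline()
        if not line:  # end of input
            break
        if not line.strip():  # blank line
            if lines:  # it closes the current sentence
                break
            continue  # skip leading blank lines
        lines.append(line.rstrip("\n"))
    # Keep only the first two tab-separated fields as a (word, tag) pair.
    return [tuple(fields.split("\t")[:2]) for fields in lines]


stream = io.StringIO(
    "John\tNNP\t2\n"
    "loves\tVBZ\t0\n"
    "\n"
    "Mary\tNNP\t2\n"
)
first = read_block_sketch(stream)
second = read_block_sketch(stream)
```

Each call consumes exactly one block and leaves the stream positioned at the start of the next one, which is what allows a stream-backed view to read a large file incrementally.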
