nltk.corpus.reader.conll module

Read CoNLL-style chunk files.

class nltk.corpus.reader.conll.ConllCorpusReader[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus. By default, columns are split on runs of whitespace; use the separator argument to specify an explicit delimiter string (e.g. ' ').
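As a stdlib-only sketch of how such a file is interpreted (not using NLTK itself, and with invented sample data), the following snippet splits CoNLL-style text into one grid per sentence:

```python
# A minimal sketch of the CoNLL grid format: sentences are separated by
# blank lines, each line is one word, and columns (split on whitespace
# by default) carry the annotations.
def read_grids(text, separator=None):
    """Return one grid (a list of column-value lists) per sentence."""
    grids = []
    for block in text.strip().split("\n\n"):
        grid = [line.split(separator) for line in block.splitlines()]
        grids.append(grid)
    return grids

sample = """Confidence NN B-NP
remains VBZ B-VP
strong JJ B-ADJP

Prices NNS B-NP
rose VBD B-VP"""

grids = read_grids(sample)
print(len(grids))   # 2 sentences
print(grids[0][0])  # ['Confidence', 'NN', 'B-NP']
```

With columntypes=('words', 'pos', 'chunk'), the reader would interpret the three columns of each row as the word, its part-of-speech tag, and its chunk tag.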

@todo: Add support for reading from corpora where different parallel files contain different columns.

@todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (eg words() and parsed_sents() could both share the same grid corpus view object).

@todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (eg parsed_documents()).

WORDS = 'words'

column type for words

POS = 'pos'

column type for part-of-speech tags

TREE = 'tree'

column type for parse trees

CHUNK = 'chunk'

column type for chunk structures

NE = 'ne'

column type for named entities

SRL = 'srl'

column type for semantic role labels

IGNORE = 'ignore'

column type for column that should be ignored

COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')

A tuple of all column types supported by the CoNLL corpus reader.

__init__(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.Tree'>, tagset=None, separator=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

words(fileids=None)[source]
sents(fileids=None)[source]
tagged_words(fileids=None, tagset=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
chunked_words(fileids=None, chunk_types=None, tagset=None)[source]
chunked_sents(fileids=None, chunk_types=None, tagset=None)[source]
parsed_sents(fileids=None, pos_in_tree=None, tagset=None)[source]
srl_spans(fileids=None)[source]
srl_instances(fileids=None, pos_in_tree=None, flatten=True)[source]
iob_words(fileids=None, tagset=None)[source]
Returns

a list of word/tag/IOB tuples

Return type

list(tuple)

Parameters

fileids (None or str or list) – the list of fileids that make up this corpus
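As a stdlib-only illustration of these triples (NLTK itself is not required, and the sentence is invented), the IOB column tags each word as B(egin), I(nside), or O(utside) a chunk; consecutive B-X/I-X tags can be grouped back into chunks:

```python
# Sketch of the word/tag/IOB triples returned by iob_words(), plus a
# helper that groups them into chunks by their B-/I-/O prefixes.
triples = [
    ("Confidence", "NN", "B-NP"),
    ("remains", "VBZ", "B-VP"),
    ("strong", "JJ", "B-ADJP"),
]

def iob_chunks(triples):
    """Group consecutive B-X / I-X words into (chunk_type, words) pairs."""
    chunks = []
    for word, tag, iob in triples:
        if iob.startswith("B-"):
            chunks.append((iob[2:], [word]))
        elif iob.startswith("I-") and chunks:
            chunks[-1][1].append(word)
        # 'O' words belong to no chunk and are skipped here
    return chunks

print(iob_chunks(triples))
# [('NP', ['Confidence']), ('VP', ['remains']), ('ADJP', ['strong'])]
```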

iob_sents(fileids=None, tagset=None)[source]
Returns

a list of lists of word/tag/IOB tuples

Return type

list(list)

Parameters

fileids (None or str or list) – the list of fileids that make up this corpus

class nltk.corpus.reader.conll.ConllSRLInstance[source]

Bases: object

An SRL instance from a CoNLL corpus, which identifies and provides labels for the arguments of a single verb.

__init__(tree, verb_head, verb_stem, roleset, tagged_spans)[source]
verb

A list of the word indices of the words that compose the verb whose arguments are identified by this instance. This will contain multiple word indices when multi-word verbs are used (e.g. ‘turn on’).

verb_head

The word index of the head word of the verb whose arguments are identified by this instance. E.g., for a sentence that uses the verb ‘turn on,’ verb_head will be the word index of the word ‘turn’.

arguments

A list of (argspan, argid) tuples, specifying the location and type of each argument identified by this instance. argspan is a (start, end) tuple, indicating that the argument consists of words[start:end].

tagged_spans

A list of (span, id) tuples, specifying the location and type for each of the arguments, as well as the verb pieces, that make up this instance.

tree

The parse tree for the sentence containing this instance.

words

A list of the words in the sentence containing this instance.
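The span convention used by these attributes can be illustrated with plain Python (the sentence, spans, and role labels below are invented for illustration):

```python
# Invented example of the (argspan, argid) convention: each span is a
# (start, end) pair, and the argument text is words[start:end].
words = ["The", "cat", "turned", "on", "the", "light"]
verb = [2, 3]  # multi-word verb 'turned on' -> two word indices
arguments = [((0, 2), "ARG0"), ((4, 6), "ARG1")]

labeled = {argid: words[start:end] for (start, end), argid in arguments}
print(labeled)  # {'ARG0': ['The', 'cat'], 'ARG1': ['the', 'light']}
print([words[i] for i in verb])  # ['turned', 'on']
```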

pprint()[source]
class nltk.corpus.reader.conll.ConllSRLInstanceList[source]

Bases: list

A set of SRL instances for a single sentence.

__init__(tree, instances=())[source]
pprint(include_tree=False)[source]
class nltk.corpus.reader.conll.ConllChunkCorpusReader[source]

Bases: nltk.corpus.reader.conll.ConllCorpusReader

A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.

__init__(root, fileids, chunk_types, encoding='utf8', tagset=None, separator=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
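A minimal usage sketch for this reader, again with an invented file name and invented sample data: since the column layout is fixed at words/pos/chunk, only the chunk types to extract need to be specified.

```python
import os
import tempfile

from nltk.corpus.reader.conll import ConllChunkCorpusReader

# Invented three-column sample (word, POS tag, chunk tag).
data = (
    "He PRP B-NP\n"
    "reckons VBZ B-VP\n"
    "growth NN B-NP\n"
)
root = tempfile.mkdtemp()
with open(os.path.join(root, "sample.conll"), "w", encoding="utf8") as f:
    f.write(data)

reader = ConllChunkCorpusReader(root, ["sample.conll"],
                                chunk_types=("NP", "VP"))
# chunked_sents() returns one shallow parse tree per sentence, with the
# requested chunk types as subtrees under the root label (default 'S').
tree = reader.chunked_sents()[0]
print(tree)
```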