nltk.corpus.reader.bnc module

Corpus reader for the XML version of the British National Corpus.

class nltk.corpus.reader.bnc.BNCCorpusReader[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the British National Corpus.

For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

The full version of the BNC is available at https://www.ota.ox.ac.uk/desc/2554

If you extracted the archive to a directory called BNC, then you can instantiate the reader as:

BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
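
To see which files a fileids pattern like the one above would select, the following sketch applies it to a few relative paths using only the standard library. It assumes the pattern has to match the whole path relative to root. The sample paths and the select_fileids helper are hypothetical and are not a listing of the real archive.

```python
import re

# The pattern from the instantiation example above.
FILEIDS_PATTERN = r'[A-K]/\w*/\w*\.xml'

# Hypothetical relative paths, used only for illustration.
candidates = [
    'A/A0/A00.xml',   # letter directory / subdirectory / file.xml
    'K/KS/KSW.xml',   # same shape, last allowed letter
    'L/L0/L00.xml',   # top-level directory outside A-K
    'A/A0/A00.txt',   # wrong extension
    'A/A00.xml',      # missing the middle directory level
]

def select_fileids(pattern, paths):
    """Return the paths that the pattern matches in full."""
    return [p for p in paths if re.fullmatch(pattern, p)]

selected = select_fileids(FILEIDS_PATTERN, candidates)
print(selected)
```

Only the first two paths are selected; narrowing the character class (for example to [A-B]) is one way to load a subset of the corpus.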
__init__(root, fileids, lazy=True)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
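
The encoding rules above can be sketched as a small lookup function. This is an illustration of the documented behaviour, not the reader's actual code: resolve_encoding is a hypothetical name, and a return value of None stands for "process the file as byte strings".

```python
import re

def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings."""
    if encoding is None or isinstance(encoding, str):
        # None: byte strings everywhere; str: one encoding for all files.
        return encoding
    if isinstance(encoding, dict):
        # Identifiers missing from the dictionary fall back to None.
        return encoding.get(file_id)
    # Otherwise treat it as a list of (regexp, encoding) tuples;
    # the first tuple whose regexp matches the identifier wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):
            return name
    return None

rules = [(r'A/.*', 'utf8'), (r'B/.*', 'latin-1')]
print(resolve_encoding(rules, 'A/A0/A00.xml'))
print(resolve_encoding(rules, 'C/C0/C00.xml'))
```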

words(fileids=None, strip_space=True, stem=False)[source]
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

Parameters
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • stem – If true, then use word stems instead of word strings.

tagged_words(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

Parameters
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.

  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • stem – If true, then use word stems instead of word strings.
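
The effect of c5, strip_space, and stem on a single token can be sketched with the standard library alone. The sketch assumes that word elements in the BNC XML carry c5, pos, and hw (headword) attributes; word_token and its tag argument are hypothetical names chosen for illustration, and the sample element is made up.

```python
import xml.etree.ElementTree as ET

def word_token(elt, tag=None, strip_space=True, stem=False):
    """Turn one word element into a string or a (word, tag) tuple."""
    word = elt.text or ''
    if strip_space or stem:
        word = word.strip()
    if stem:
        # Assumed: 'hw' holds the headword; fall back to the surface form.
        word = elt.get('hw', word)
    if tag == 'c5':
        # Detailed tags.
        return (word, elt.get('c5'))
    if tag == 'pos':
        # Simplified tags; assumed fallback to 'c5' when 'pos' is absent.
        return (word, elt.get('pos', elt.get('c5')))
    return word

# A made-up word element in the assumed BNC XML shape.
w = ET.fromstring('<w c5="NN2" hw="dog" pos="SUBST">dogs </w>')

print(repr(word_token(w)))                     # plain word, space stripped
print(repr(word_token(w, strip_space=False)))  # trailing space kept
print(repr(word_token(w, stem=True)))          # headword instead of word
print(word_token(w, tag='c5'))                 # detailed tag
print(word_token(w, tag='pos'))                # simplified tag
```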

sents(fileids=None, strip_space=True, stem=False)[source]
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

Parameters
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • stem – If true, then use word stems instead of word strings.

tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

Parameters
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.

  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • stem – If true, then use word stems instead of word strings.

class nltk.corpus.reader.bnc.BNCSentence[source]

Bases: list

A list of words, augmented by an attribute num used to record the sentence identifier (the n attribute from the XML).

__init__(num, items)[source]
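
A list that carries a sentence identifier can be sketched as below. NumberedSentence is a hypothetical stand-in written for illustration; it follows the description above rather than reproducing the library's source.

```python
class NumberedSentence(list):
    """Hypothetical stand-in: a list that also records a sentence identifier."""

    def __init__(self, num, items):
        super().__init__(items)
        self.num = num  # e.g. the value of the n attribute in the XML

sent = NumberedSentence('42', ['The', 'dog', 'barked', '.'])
print(sent.num, len(sent), sent[1])

# Slicing a list subclass yields a plain list, so the identifier is not
# carried over to the slice.
head = sent[:2]
print(type(head).__name__)
```

Because the object is still a list, it works anywhere a plain sentence list is expected; the identifier is simply extra metadata.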
class nltk.corpus.reader.bnc.BNCWordView[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream-backed corpus view specialized for use with the BNC corpus.

tags_to_ignore = {'align', 'event', 'gap', 'pause', 'pb', 'shift', 'unclear', 'vocal'}

Elements with these tags are ignored. For their descriptions, refer to the BNC technical documentation, for example http://www.natcorp.ox.ac.uk/docs/URG/ref-vocal.html
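
How such a set might be used when collecting the words of a sentence is sketched below with the standard library. The element names w (word) and c (punctuation), the sample XML, and the sentence_words helper are assumptions made for illustration; real files may also contain wrapper elements that this simplified sketch does not handle.

```python
import xml.etree.ElementTree as ET

TAGS_TO_IGNORE = {'align', 'event', 'gap', 'pause', 'pb', 'shift', 'unclear', 'vocal'}

def sentence_words(s_elt):
    """Collect the text of word and punctuation children, skipping ignored tags."""
    words = []
    for child in s_elt:
        if child.tag in ('w', 'c'):
            words.append((child.text or '').strip())
        elif child.tag not in TAGS_TO_IGNORE:
            # Assumption for this sketch: anything else is unexpected.
            raise ValueError('Unexpected element %s' % child.tag)
    return words

# A made-up spoken-style sentence with a pause and a vocal event.
s = ET.fromstring(
    '<s n="1"><w>Well </w><pause/><w>yes</w><vocal desc="laugh"/><c>.</c></s>'
)
print(sentence_words(s))
```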

__init__(fileid, sent, tag, strip_space, stem)[source]
Parameters
  • fileid – The name of the underlying file.

  • sent – If true, include sentence bracketing.

  • tag – The name of the tagset to use, or None for no tags.

  • strip_space – If true, strip spaces from word tokens.

  • stem – If true, then substitute stems for words.

title

Title of the document.

author

Author of the document.

editor

Editor of the document.

resps

Statement of responsibility for the document.

handle_header(elt, context)[source]
handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns

The view value corresponding to elt.

Parameters
  • elt (ElementTree) – The element that should be converted.

  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
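
The format of the context string can be illustrated by walking a small tree and joining the tags of each element's ancestors with forward slashes. This sketch only shows what the string looks like; iter_with_context is a hypothetical helper and does not describe how the corpus view computes the context internally.

```python
import xml.etree.ElementTree as ET

def iter_with_context(elt, parents=()):
    """Yield (element, context) pairs, where context joins tags with '/'."""
    path = parents + (elt.tag,)
    yield elt, '/'.join(path)
    for child in elt:
        yield from iter_with_context(child, path)

root = ET.fromstring('<foo><bar><baz/></bar><qux/></foo>')
contexts = [context for _, context in iter_with_context(root)]
print(contexts)
```

A handler can then branch on the context, for example treating an element differently depending on whether its context ends in 'bar/baz'.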

handle_word(elt)[source]
handle_sent(elt)[source]