nltk.corpus.reader.markdown module

class nltk.corpus.reader.markdown.CategorizedMarkdownCorpusReader[source]

Bases: CategorizedCorpusReader, MarkdownCorpusReader

A reader for markdown corpora whose documents are divided into categories based on their file identifiers.

Based on nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader: https://www.nltk.org/_modules/nltk/corpus/reader/api.html#CategorizedCorpusReader

__init__(*args, cat_field='tags', **kwargs)[source]

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the MarkdownCorpusReader constructor.
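The argument routing described above can be illustrated without NLTK installed. This is a minimal sketch: the helper name `split_category_kwargs` and the sample `encoding` keyword are assumptions made for the example, and only the three categorization keyword names come from the description.

```python
# Keyword names taken from the constructor description above.
CATEGORY_KWARGS = ("cat_pattern", "cat_map", "cat_file")


def split_category_kwargs(kwargs):
    """Separate categorization keywords from the remaining keywords.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    category = {k: v for k, v in kwargs.items() if k in CATEGORY_KWARGS}
    remaining = {k: v for k, v in kwargs.items() if k not in CATEGORY_KWARGS}
    return category, remaining


category, remaining = split_category_kwargs(
    {"cat_pattern": r"(\w+)/.*", "encoding": "utf8"}
)
print(category)   # keywords a categorized reader would consume
print(remaining)  # keywords left for the underlying markdown reader
```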

blockquote_reader(stream)[source]
blockquotes(fileids=None, categories=None)[source]
categories(fileids=None)[source]

Return a list of the categories that are defined for this corpus, or for the given file(s) if fileids is specified.

code_block_reader(stream)[source]
code_blocks(fileids=None, categories=None)[source]
concatenated_view(reader, fileids, categories)[source]
fileids(categories=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that make up the given category(s) if specified.
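The two lookups, categories for files and files for categories, are inverses over one file-to-category mapping. The sketch below shows that relationship with a plain dictionary. The mapping contents and the function names `lookup_categories` and `lookup_fileids` are hypothetical; the real reader may build and query its mapping differently.

```python
# Hypothetical file-to-category mapping, similar in spirit to a cat_map argument.
cat_map = {
    "notes/a.md": ["python", "nlp"],
    "notes/b.md": ["nlp"],
    "notes/c.md": ["recipes"],
}


def lookup_categories(fileids=None):
    """Return sorted categories for the whole map, or for the given file(s)."""
    if fileids is None:
        fileids = list(cat_map)
    elif isinstance(fileids, str):
        fileids = [fileids]
    return sorted({cat for f in fileids for cat in cat_map[f]})


def lookup_fileids(categories=None):
    """Return sorted file ids for the whole map, or for the given category(s)."""
    if categories is None:
        return sorted(cat_map)
    if isinstance(categories, str):
        categories = [categories]
    wanted = set(categories)
    # Keep a file if it carries at least one of the wanted categories.
    return sorted(f for f, cats in cat_map.items() if wanted & set(cats))


print(lookup_categories("notes/a.md"))
print(lookup_fileids("nlp"))
```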

image_reader(stream)[source]
images(fileids=None, categories=None)[source]
list_reader(stream)[source]
lists(fileids=None, categories=None)[source]
metadata(fileids=None, categories=None)[source]
metadata_reader(stream)[source]
paras(fileids=None, categories=None)[source]
Returns:

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type:

list(list(list(str)))

raw(fileids=None, categories=None)[source]
Parameters:

fileids – A list specifying the fileids that should be used.

Returns:

the given file(s) as a single string.

Return type:

str

section_reader(stream)[source]
sections(fileids=None, categories=None)[source]
sents(fileids=None, categories=None)[source]
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

words(fileids=None, categories=None)[source]
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)
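The return types documented for paras(), sents(), and words() nest inside one another. The sketch below uses hand-written sample data, not reader output, to show how the three shapes relate. A real reader tokenizes each view itself, so this illustrates the shapes only.

```python
from itertools import chain

# Hand-written sample in the shape documented for paras(): list(list(list(str))).
paras = [
    [["Hello", ",", "world", "."], ["Second", "sentence", "."]],
    [["A", "new", "paragraph", "."]],
]

# Flattening one level gives the shape documented for sents(): list(list(str)).
sents = list(chain.from_iterable(paras))

# Flattening again gives the shape documented for words(): list(str).
words = list(chain.from_iterable(sents))

print(len(paras), len(sents), len(words))
```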

class nltk.corpus.reader.markdown.CodeBlock[source]

Bases: MarkdownBlock

__init__(language, *args)[source]
property lines
property paras
property sents
class nltk.corpus.reader.markdown.Image

Bases: tuple

Image(label, src, title)

static __new__(_cls, label, src, title)

Create new instance of Image(label, src, title)

label

Alias for field number 0

src

Alias for field number 1

title

Alias for field number 2

class nltk.corpus.reader.markdown.Link

Bases: tuple

Link(label, href, title)

static __new__(_cls, label, href, title)

Create new instance of Link(label, href, title)

href

Alias for field number 1

label

Alias for field number 0

title

Alias for field number 2

class nltk.corpus.reader.markdown.List

Bases: tuple

List(is_ordered, items)

static __new__(_cls, is_ordered, items)

Create new instance of List(is_ordered, items)

is_ordered

Alias for field number 0

items

Alias for field number 1
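The "Bases: tuple" and "Alias for field number N" entries suggest that Image, Link, and List behave like named tuples. The sketch below defines local stand-ins with the same field order rather than importing from NLTK, and the sample values are invented for illustration.

```python
from collections import namedtuple

# Local stand-ins with the same field order as documented above;
# these are not imported from NLTK.
Image = namedtuple("Image", "label, src, title")
Link = namedtuple("Link", "label, href, title")
List = namedtuple("List", "is_ordered, items")

img = Image(label="logo", src="img/logo.png", title=None)
link = Link(label="docs", href="https://www.nltk.org/", title=None)
todo = List(is_ordered=False, items=["milk", "eggs"])

# Fields are reachable by name or by position ("Alias for field number N").
assert img.src == img[1]
assert link.href == link[1]

# Like any tuple, an instance unpacks positionally.
label, src, title = img
print(label, src, title)
print(todo.is_ordered, todo.items)
```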

class nltk.corpus.reader.markdown.MarkdownBlock[source]

Bases: object

__init__(content)[source]
property paras
property raw
property sents
property words
class nltk.corpus.reader.markdown.MarkdownCorpusReader[source]

Bases: PlaintextCorpusReader

__init__(*args, parser=None, **kwargs)[source]

Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:

>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, r'.*\.txt')
Parameters:
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.

  • sent_tokenizer – Tokenizer for breaking paragraphs into sentences.

  • para_block_reader – The block reader used to divide the corpus into paragraph blocks.

class nltk.corpus.reader.markdown.MarkdownSection[source]

Bases: MarkdownBlock

__init__(heading, level, *args)[source]
nltk.corpus.reader.markdown.comma_separated_string_args(func)[source]

A decorator that allows a function to be called with a single string of comma-separated values which become individual function arguments.
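One way to read this description is sketched below. The decorator `split_comma_args` and the function `collect` are simplified stand-ins written for this example. They follow the one-sentence description literally, and NLTK's own comma_separated_string_args may differ in its details, for instance in how it handles keyword arguments or non-string values.

```python
from functools import wraps


def split_comma_args(func):
    """Illustrative decorator: expand comma-separated string arguments.

    A simplified stand-in for this example, not NLTK's implementation.
    """

    @wraps(func)
    def wrapper(*args):
        expanded = []
        for arg in args:
            if isinstance(arg, str):
                # "a, b" contributes two arguments: "a" and "b".
                expanded.extend(part.strip() for part in arg.split(","))
            else:
                # Non-string arguments pass through unchanged.
                expanded.append(arg)
        return func(*expanded)

    return wrapper


@split_comma_args
def collect(*names):
    """Return the received positional arguments as a list."""
    return list(names)


print(collect("intro, usage, faq"))
```

Using functools.wraps keeps the wrapped function's name and docstring intact, which matters for generated API pages like this one.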

nltk.corpus.reader.markdown.read_parse_blankline_block(stream, parser)[source]