nltk.corpus.reader.markdown module¶

class nltk.corpus.reader.markdown.CategorizedMarkdownCorpusReader[source]¶

Bases: CategorizedCorpusReader, MarkdownCorpusReader

A reader for markdown corpora whose documents are divided into categories based on their file identifiers.

Based on nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader: https://www.nltk.org/_modules/nltk/corpus/reader/api.html#CategorizedCorpusReader

__init__(*args, cat_field='tags', **kwargs)[source]¶: Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the MarkdownCorpusReader constructor.
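The routing of keyword arguments described above can be pictured with a small standard-library sketch. The helper `split_category_kwargs` and the sample values are hypothetical; they only illustrate how categorization arguments could be separated from the rest and are not NLTK's implementation.

```python
# Names of the categorization arguments listed in the description above.
CATEGORY_ARGS = ("cat_pattern", "cat_map", "cat_file")


def split_category_kwargs(kwargs):
    """Return (category_kwargs, remaining_kwargs) without mutating the input.

    Hypothetical helper: it only illustrates how keyword arguments could be
    routed to two different base-class constructors.
    """
    category_kwargs = {k: v for k, v in kwargs.items() if k in CATEGORY_ARGS}
    remaining_kwargs = {k: v for k, v in kwargs.items() if k not in CATEGORY_ARGS}
    return category_kwargs, remaining_kwargs


category_kwargs, remaining_kwargs = split_category_kwargs(
    {"cat_pattern": r"(\w+)/.*", "encoding": "utf8"}
)
print(category_kwargs)
print(remaining_kwargs)
```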

blockquote_reader(stream)[source]¶

blockquotes(fileids=None, categories=None)[source]¶

categories(fileids=None)[source]¶: Return a list of the categories that are defined for this corpus, or for the given file(s) if fileids is specified.

code_block_reader(stream)[source]¶

code_blocks(fileids=None, categories=None)[source]¶

concatenated_view(reader, fileids, categories)[source]¶

fileids(categories=None)[source]¶: Return a list of file identifiers for the files that make up this corpus, or that make up the given category or categories if specified.
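As an illustration of categories derived from file identifiers, the sketch below applies a regular expression to each file identifier and uses the first capture group as the category, which is how the `cat_pattern` argument of CategorizedCorpusReader is described. The file identifiers, the pattern, and the helper names are invented for the example, and the code does not import NLTK.

```python
import re
from collections import defaultdict

# Hypothetical file identifiers laid out as "<category>/<name>.md".
FILEIDS = ["news/a.md", "news/b.md", "blog/c.md"]
CAT_PATTERN = r"(\w+)/.*"


def build_category_map(fileids, cat_pattern):
    """Map each category to the sorted file identifiers that belong to it."""
    by_category = defaultdict(list)
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match:  # files that do not match get no category
            by_category[match.group(1)].append(fileid)
    return {cat: sorted(ids) for cat, ids in by_category.items()}


def categories(cat_map):
    """All categories, sorted, analogous to categories() with no arguments."""
    return sorted(cat_map)


def fileids(cat_map, category):
    """File identifiers for one category, analogous to fileids(categories=...)."""
    return cat_map.get(category, [])


cat_map = build_category_map(FILEIDS, CAT_PATTERN)
print(categories(cat_map))
print(fileids(cat_map, "news"))
```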

image_reader(stream)[source]¶

images(fileids=None, categories=None)[source]¶

link_reader(stream)[source]¶

links(fileids=None, categories=None)[source]¶

list_reader(stream)[source]¶

lists(fileids=None, categories=None)[source]¶

metadata(fileids=None, categories=None)[source]¶

metadata_reader(stream)[source]¶

paras(fileids=None, categories=None)[source]¶

Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type: list(list(list(str)))
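The nested return type is easier to see on a small value. The sketch below uses a hand-written stand-in for a `paras()` result (the text is invented, and nothing here calls NLTK) and flattens it level by level into the shapes documented for `sents()` and `words()`.

```python
from itertools import chain

# Hand-written stand-in shaped like list(list(list(str))):
# paragraphs -> sentences -> word strings.
paras = [
    [["Markdown", "is", "simple", "."], ["It", "is", "popular", "."]],
    [["NLTK", "reads", "it", "."]],
]

# Dropping the paragraph level gives list(list(str)), the shape of sents().
sents = list(chain.from_iterable(paras))

# Dropping the sentence level gives list(str), the shape of words().
words = list(chain.from_iterable(sents))

print(len(paras), len(sents), len(words))
```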

raw(fileids=None, categories=None)[source]¶

Parameters: fileids – A list specifying the fileids that should be used.
Returns: the given file(s) as a single string.
Return type: str

section_reader(stream)[source]¶

sections(fileids=None, categories=None)[source]¶

sents(fileids=None, categories=None)[source]¶

Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type: list(list(str))

words(fileids=None, categories=None)[source]¶

Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)

class nltk.corpus.reader.markdown.CodeBlock[source]¶

Bases: MarkdownBlock

__init__(language, *args)[source]¶

property lines¶

property paras¶

property sents¶

class nltk.corpus.reader.markdown.Image¶

Bases: tuple

Image(label, src, title)

static __new__(_cls, label, src, title)¶: Create new instance of Image(label, src, title)

label¶: Alias for field number 0

src¶: Alias for field number 1

title¶: Alias for field number 2
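Because `Image` is a tuple subclass with named fields, each value can be read by name or by position. The sketch below rebuilds an equivalent type with `collections.namedtuple` so that it runs without NLTK; the field values are invented.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented Image(label, src, title).
Image = namedtuple("Image", "label, src, title")

img = Image(label="Logo", src="img/logo.png", title="Project logo")

# Named access and positional access refer to the same fields.
assert img.label == img[0]
assert img.src == img[1]
assert img.title == img[2]

# Like any tuple, it unpacks in field order.
label, src, title = img
print(label, src, title)
```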

class nltk.corpus.reader.markdown.Link¶

Bases: tuple

Link(label, href, title)

static __new__(_cls, label, href, title)¶: Create new instance of Link(label, href, title)

href¶: Alias for field number 1

label¶: Alias for field number 0

title¶: Alias for field number 2

class nltk.corpus.reader.markdown.List¶

Bases: tuple

List(is_ordered, items)

static __new__(_cls, is_ordered, items)¶: Create new instance of List(is_ordered, items)

is_ordered¶: Alias for field number 0

items¶: Alias for field number 1
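A `List` value pairs an ordering flag with its items. As a sketch of how such a value might be consumed, the code below defines a stand-in namedtuple with the documented fields and a hypothetical `render` helper that turns it back into Markdown. Treating `items` as a list of strings is an assumption made for the example.

```python
from collections import namedtuple

# Stand-in with the same fields as the documented List(is_ordered, items).
List = namedtuple("List", "is_ordered, items")


def render(md_list):
    """Render a List as Markdown lines (hypothetical helper, not part of NLTK)."""
    lines = []
    for number, item in enumerate(md_list.items, start=1):
        marker = f"{number}." if md_list.is_ordered else "-"
        lines.append(f"{marker} {item}")
    return "\n".join(lines)


steps = List(is_ordered=True, items=["install", "import", "read"])
notes = List(is_ordered=False, items=["fast", "small"])

print(render(steps))
print(render(notes))
```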

class nltk.corpus.reader.markdown.MarkdownBlock[source]¶

Bases: object

__init__(content)[source]¶

property paras¶

property raw¶

property sents¶

property words¶

class nltk.corpus.reader.markdown.MarkdownCorpusReader[source]¶

Bases: PlaintextCorpusReader

__init__(*args, parser=None, **kwargs)[source]¶

Construct a new corpus reader for a set of Markdown documents located at the given root directory. The parameters below are those of the base class PlaintextCorpusReader, whose example usage is:

>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, r'.*\.txt')

Parameters:

root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.
sent_tokenizer – Tokenizer for breaking paragraphs into sentences.
para_block_reader – The block reader used to divide the corpus into paragraph blocks.

class nltk.corpus.reader.markdown.MarkdownSection[source]¶

Bases: MarkdownBlock

__init__(heading, level, *args)[source]¶

nltk.corpus.reader.markdown.comma_separated_string_args(func)[source]¶: A decorator that allows a function to be called with a single string of comma-separated values which become individual function arguments.
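One way to build a decorator with this behavior is sketched below. It is a simplified illustration written for this page, not NLTK's implementation: the names `comma_separated` and `select` are hypothetical, and the sketch only expands a call that passes exactly one string argument.

```python
from functools import wraps


def comma_separated(func):
    """Let func("a, b, c") behave like func("a", "b", "c").

    Simplified, hypothetical sketch; NLTK's own decorator may differ in detail.
    """

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Only a call with exactly one string argument is expanded.
        if len(args) == 1 and isinstance(args[0], str):
            args = tuple(part.strip() for part in args[0].split(","))
        return func(*args, **kwargs)

    return wrapper


@comma_separated
def select(*names):
    """Return the received names as a list (hypothetical example function)."""
    return list(names)


print(select("news, blog,wiki"))
print(select("news", "blog"))
```

Using `functools.wraps` keeps the decorated function's name and docstring intact, which matters for generated documentation such as this page.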

nltk.corpus.reader.markdown.read_parse_blankline_block(stream, parser)[source]¶
