nltk.corpus.reader.markdown module¶
- class nltk.corpus.reader.markdown.CategorizedMarkdownCorpusReader[source]¶
Bases:
CategorizedCorpusReader, MarkdownCorpusReader
A reader for markdown corpora whose documents are divided into categories based on their file identifiers.
Based on nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader: https://www.nltk.org/_modules/nltk/corpus/reader/api.html#CategorizedCorpusReader
- __init__(*args, cat_field='tags', **kwargs)[source]¶
Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the MarkdownCorpusReader constructor.
- categories(fileids=None)[source]¶
Return a list of the categories that are defined for this corpus, or for the given file(s) if fileids is specified.
- fileids(categories=None)[source]¶
Return a list of file identifiers for the files that make up this corpus, or for the files in the given categories if categories is specified.
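The two lookups above are inverses of each other: one maps files to categories, the other maps categories to files. The following is a minimal, stdlib-only sketch of that relationship, not NLTK's implementation. It assumes a hypothetical cat_pattern-style regular expression whose first group names the category. The file identifiers and the helper names (`build_category_maps`, `categories_for`, `fileids_for`) are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical file identifiers; a real corpus would supply its own.
FILEIDS = ["news/2021-budget.md", "news/2022-election.md", "sports/cup-final.md"]

# Assumed cat_pattern-style regexp: group 1 is taken as the category name.
CAT_PATTERN = r"(\w+)/.*"


def build_category_maps(fileids, cat_pattern):
    """Return (file -> categories, category -> files) lookup tables."""
    file_to_cats = defaultdict(set)
    cat_to_files = defaultdict(set)
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match is None:
            continue  # in this sketch, non-matching files get no category
        category = match.group(1)
        file_to_cats[fileid].add(category)
        cat_to_files[category].add(fileid)
    return file_to_cats, cat_to_files


def categories_for(file_to_cats, fileids=None):
    """Categories of the whole corpus, or only of the given file(s)."""
    if fileids is None:
        fileids = list(file_to_cats)
    elif isinstance(fileids, str):
        fileids = [fileids]
    return sorted(set().union(*(file_to_cats.get(f, set()) for f in fileids)))


def fileids_for(cat_to_files, categories=None):
    """Files of the whole corpus, or only of the given category or categories."""
    if categories is None:
        return sorted(set().union(*cat_to_files.values()))
    if isinstance(categories, str):
        categories = [categories]
    return sorted(set().union(*(cat_to_files.get(c, set()) for c in categories)))


file_to_cats, cat_to_files = build_category_maps(FILEIDS, CAT_PATTERN)
print(categories_for(file_to_cats))                         # ['news', 'sports']
print(categories_for(file_to_cats, "sports/cup-final.md"))  # ['sports']
print(fileids_for(cat_to_files, "news"))
```

Keeping both tables in sync at construction time makes either lookup a cheap dictionary access, which is the design this sketch assumes.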
- paras(fileids=None, categories=None)[source]¶
- Returns:
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
- Return type:
list(list(list(str)))
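The nested return type can be hard to picture, so here is a small sketch of the shape only. The data is hand-built for illustration and does not come from a real corpus or from NLTK's tokenizers.

```python
# Hand-built stand-in for a value of type list(list(list(str))):
# paragraphs -> sentences -> word strings.
paras = [
    [["NLTK", "reads", "corpora", "."], ["It", "tokenizes", "them", "."]],
    [["Markdown", "is", "plain", "text", "."]],
]

# Flatten one level: every sentence, regardless of paragraph.
sents = [sent for para in paras for sent in para]

# Flatten two levels: every word, regardless of sentence or paragraph.
words = [word for para in paras for sent in para for word in sent]

print(len(paras), len(sents), len(words))  # 2 3 13
```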
- raw(fileids=None, categories=None)[source]¶
- Parameters:
fileids – A list specifying the fileids that should be used.
- Returns:
the given file(s) as a single string.
- Return type:
str
- class nltk.corpus.reader.markdown.CodeBlock[source]¶
Bases:
MarkdownBlock
- property lines¶
- property paras¶
- property sents¶
- class nltk.corpus.reader.markdown.Image¶
Bases:
tuple
Image(label, src, title)
- static __new__(_cls, label, src, title)¶
Create new instance of Image(label, src, title)
- label¶
Alias for field number 0
- src¶
Alias for field number 1
- title¶
Alias for field number 2
- class nltk.corpus.reader.markdown.Link¶
Bases:
tuple
Link(label, href, title)
- static __new__(_cls, label, href, title)¶
Create new instance of Link(label, href, title)
- href¶
Alias for field number 1
- label¶
Alias for field number 0
- title¶
Alias for field number 2
- class nltk.corpus.reader.markdown.List¶
Bases:
tuple
List(is_ordered, items)
- static __new__(_cls, is_ordered, items)¶
Create new instance of List(is_ordered, items)
- is_ordered¶
Alias for field number 0
- items¶
Alias for field number 1
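Image, Link, and List are plain named tuples, so each field can be read by name or by the position given in its "Alias for field number" entry. The sketch below rebuilds equivalent stand-ins with collections.namedtuple rather than importing them from NLTK. The field values are made up, and the assumption that items holds a list of strings is illustrative only.

```python
from collections import namedtuple

# Stand-ins that mirror the documented field order; not imported from NLTK.
Image = namedtuple("Image", "label, src, title")
Link = namedtuple("Link", "label, href, title")
List = namedtuple("List", "is_ordered, items")

# Hypothetical values for illustration.
link = Link(label="NLTK", href="https://www.nltk.org/", title=None)
image = Image(label="logo", src="img/logo.png", title="Project logo")
todo = List(is_ordered=False, items=["write docs", "add tests"])

# Access by name and by index is interchangeable.
print(link.href == link[1])       # True
print(image.label == image[0])    # True
print(todo.items == todo[1])      # True

# Named tuples unpack like ordinary tuples.
label, href, title = link
print(label, href, title)
```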
- class nltk.corpus.reader.markdown.MarkdownBlock[source]¶
Bases:
object
- property paras¶
- property raw¶
- property sents¶
- property words¶
- class nltk.corpus.reader.markdown.MarkdownCorpusReader[source]¶
Bases:
PlaintextCorpusReader
- __init__(*args, parser=None, **kwargs)[source]¶
Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:
>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, r'.*\.txt')
- Parameters:
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.
sent_tokenizer – Tokenizer for breaking paragraphs into sentences.
para_block_reader – The block reader used to divide the corpus into paragraph blocks.
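When fileids is given as a regexp, it is safest to write it as a raw string with the dot escaped. The sketch below shows why, using only the re module. The candidate names are hypothetical, and re.fullmatch is an assumption made for illustration; the exact matching rule NLTK applies is not stated in this reference.

```python
import re

# Hypothetical relative paths under a corpus root.
candidates = ["intro.md", "notes/ch1.md", "README.txt", "cmd"]

# Escaped dot: only names that really end in ".md".
strict = r".*\.md"
# Unescaped dot: "." matches any character, so "cmd" slips through.
loose = r".*.md"

strict_hits = [name for name in candidates if re.fullmatch(strict, name)]
loose_hits = [name for name in candidates if re.fullmatch(loose, name)]

print(strict_hits)  # ['intro.md', 'notes/ch1.md']
print(loose_hits)   # ['intro.md', 'notes/ch1.md', 'cmd']
```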
- class nltk.corpus.reader.markdown.MarkdownSection[source]¶
Bases:
MarkdownBlock