While looking for a way to automatically generate descriptions for pages, I stumbled onto a markdown AST in Python. It lets me walk the markdown page and pull out only paragraph text. This skips headings and code fences. Paragraphs nested inside blockquotes or list items are still paragraph nodes, so their text is picked up as well.
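```python
import commonmark
import frontmatter

post = frontmatter.load("post.md")
parser = commonmark.Parser()
ast = parser.parse(post.content)

paragraphs = ""
# walker() yields (node, entering) pairs
for node, _entering in ast.walker():
    if node.t == "paragraph":
        # literal is None when the paragraph starts with inline markup
        text = node.first_child.literal
        if text is not None:
            paragraphs += " " + text
```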
It's also super fast. Previously I was rendering to HTML and using BeautifulSoup to get only the paragraphs. Using the commonmark AST was about 5x faster on my site.
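Since the goal is a page description, the collected paragraph text usually still needs trimming. Here is a minimal sketch using the standard library's `textwrap.shorten`. The `make_description` name and the 160-character default are my own assumptions for illustration, not something the rest of this post depends on.

```python
import textwrap


def make_description(paragraphs: str, width: int = 160) -> str:
    """Hypothetical helper: collapse whitespace and cut at a word boundary."""
    # shorten() normalizes runs of whitespace and appends the placeholder
    # only when it actually has to truncate
    return textwrap.shorten(paragraphs, width=width, placeholder="...")


# made-up sample text standing in for the joined paragraphs
sample = " In looking for a way to generate descriptions   I found a markdown ast. " * 5
description = make_description(sample, width=60)
print(description)
```

Text that already fits within `width` comes back unchanged apart from the whitespace cleanup.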
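When I originally wrote this post, I did not realize that commonmark yields some nodes more than once. I still do not fully understand why, though it looks like the walker visits container nodes such as paragraphs twice, once on entering and once on exiting. I have had success deduplicating them based on the source position of the node with the snippet below.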
```python
from itertools import compress

import commonmark
import frontmatter

post = frontmatter.load("post.md")
parser = commonmark.Parser()
ast = parser.parse(post.content)

# find all paragraph nodes; walker() yields (node, entering) pairs
paragraph_nodes = [
    n
    for n, _entering in ast.walker()
    if n.t == "paragraph" and n.first_child.literal is not None
]

# commonmark yields paragraph nodes more than once, dedupe based on sourcepos
sourcepos = [p.sourcepos for p in paragraph_nodes]

# find first occurrence of node based on source position
unique_mask = [sourcepos.index(s) == i for i, s in enumerate(sourcepos)]

# deduplicate paragraph_nodes based on unique source position
unique_paragraph_nodes = list(compress(paragraph_nodes, unique_mask))

paragraphs = " ".join(p.first_child.literal for p in unique_paragraph_nodes)
```
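To see what the mask is doing without a markdown file, here is a small stand-alone sketch of the same first-occurrence trick on made-up data. `FakeNode` and its values are hypothetical stand-ins for commonmark nodes, and the nested-list shape of `sourcepos` is an assumption for illustration.

```python
from collections import namedtuple
from itertools import compress

# hypothetical stand-in for a commonmark node, with only the fields used here
FakeNode = namedtuple("FakeNode", ["sourcepos", "text"])

nodes = [
    FakeNode([[1, 1], [1, 10]], "first para"),
    FakeNode([[1, 1], [1, 10]], "first para"),   # same position as the node above
    FakeNode([[3, 1], [3, 11]], "second para"),
    FakeNode([[3, 1], [3, 11]], "second para"),  # same position as the node above
]

sourcepos = [n.sourcepos for n in nodes]

# True only where a source position shows up for the first time
unique_mask = [sourcepos.index(s) == i for i, s in enumerate(sourcepos)]

# keep the nodes whose mask entry is True
unique_nodes = list(compress(nodes, unique_mask))

text = " ".join(n.text for n in unique_nodes)
print(unique_mask)
print(text)
```

`list.index` rescans the list for every node, which is fine for a single post. For very long pages, tracking already-seen positions in a set (after converting them to tuples) would avoid the repeated scans.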