In looking for a way to automatically generate descriptions for pages I stumbled into a markdown ast in python. It allows me to go over the markdown page and get only paragraph text. This will ignore headings, blockquotes, and code fences.
import commonmark import frontmatter post = frontmatter.load("post.md") parser = commonmark.Parser() ast = parser.parse(post.content) paragraphs = '' for node in ast.walker(): if node[0].t == "paragraph": paragraphs += " " paragraphs += node[0].first_child.literal
It’s also super fast, previously I was rendering to html and using beautifulsoup to get only the paragraphs. Using the commonmark ast was about 5x faster on my site.
When I originally wrote this post, I did not realize at the time that commonmark duplicates nodes. I still do not understand why, but I have had success duplicating them based on the source position of the node with the snippet below.
