BeautifulSoup is a DOM-like library for Python. It's quite useful for manipulating HTML. Here is an example that uses find_all to find HTML headings. I stole the regex from Stack Overflow, but who doesn't?
Make an example
sample.html
Let's make a sample.html file with the following contents. It mainly has some headings, <h1> and <h2> tags, that I want to be able to find.
<!DOCTYPE html>
<html lang="en">
  <body>
    <h1>hello</h1>
    <p>this is a paragraph</p>
    <h2>second heading</h2>
    <p>this is also a paragraph</p>
    <h2>third heading</h2>
    <p>this is the last paragraph</p>
  </body>
</html>
Get the headings with BeautifulSoup
Let's import our packages, read in our sample.html using pathlib, and find all headings using BeautifulSoup.
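import re
from pathlib import Path

from bs4 import BeautifulSoup

# Parse the file with the lxml parser (requires the lxml package).
soup = BeautifulSoup(Path('sample.html').read_text(), features="lxml")

# A compiled regex passed to find_all is matched against each tag name.
headings = soup.find_all(re.compile("^h[1-6]$"))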
And what we get is a list of bs4.element.Tag objects.
>>> print(headings)
[<h1>hello</h1>, <h2>second heading</h2>, <h2>third heading</h2>]
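To see why that regex only picks up headings, it helps to run it by itself. BeautifulSoup filters tag names with the pattern's search() method, so the ^ and $ anchors are what keep names like head or html from slipping through. Here is a small sketch using only the standard library; the list of tag names is made up for illustration.

```python
import re

# Same pattern as in the find_all call: "h" followed by one digit from 1 to 6,
# anchored so nothing may come before or after it.
heading_pattern = re.compile("^h[1-6]$")

# Hypothetical list of tag names, chosen only for illustration.
tag_names = ["h1", "h2", "h6", "h7", "hr", "head", "html", "p"]

# Use search() to mirror how BeautifulSoup applies a compiled regex.
matches = [name for name in tag_names if heading_pattern.search(name)]
print(matches)  # ['h1', 'h2', 'h6']
```

Only h1 through h6 survive. h7 is rejected because 7 is outside the [1-6] range, and hr, head, and html fail because no digit follows the h.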
I recently added a heading_link plugin to markata. You might notice the π's next to each heading on this page; that is powered by this exact technique.
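Here is a rough sketch of how heading links could be built on top of find_all. This is not the actual markata plugin code. The slug rule, the "#" link text, and the inline HTML string are all assumptions made for the example, and it uses Python's built-in html.parser so it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# A trimmed-down copy of sample.html, inlined so the sketch is self-contained.
html = """<h1>hello</h1>
<p>this is a paragraph</p>
<h2>second heading</h2>"""

# html.parser ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(html, "html.parser")

for heading in soup.find_all(re.compile("^h[1-6]$")):
    # Hypothetical slug rule: lowercase the text and join words with hyphens.
    slug = "-".join(heading.get_text().lower().split())
    heading["id"] = slug

    # Append a small anchor that links back to the heading itself.
    link = soup.new_tag("a", href=f"#{slug}")
    link.string = "#"
    heading.append(link)

print(soup)
```

Each heading ends up with an id and a trailing link pointing at that id, while the paragraphs are left alone. A real plugin would also need to handle duplicate heading text, which this sketch ignores.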