Regex to get all the h1, h2, h3, h4 headings from an HTML page

I am working on a simple scraper.
I have managed to build the workflow that scrapes an entire article in HTML format from a URL.
However, I am not interested in the entire article. I only want to extract the subheadings (h1, h2, h3, h4, etc.) and arrange them, maybe in a repeating group like this:

Thank you guys for the help!

<h[1-6][^>]*>(.*?)</h[1-6]>


<h([1-6])[^>]*>([\s\S]+?)<\/h\1>
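will handle some more cases, such as headings that span several lines, and it requires the closing tag to match the opening level. The full match includes the tags themselves; group 1 is the heading level and group 2 is the content.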
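
For anyone who wants to try that pattern outside of a workflow tool, here is a minimal sketch in Python. The function name `extract_headings` and the sample HTML are made up for illustration, and stripping nested inline tags afterwards is an extra step I am assuming you would want, not something the pattern does by itself.

```python
import re

# Same pattern as above: group 1 is the heading level,
# group 2 is everything between the opening and closing tag.
HEADING_RE = re.compile(r"<h([1-6])[^>]*>([\s\S]+?)<\/h\1>", re.IGNORECASE)


def extract_headings(html):
    """Return (level, text) pairs for every h1-h6 heading matched in html."""
    headings = []
    for match in HEADING_RE.finditer(html):
        level = int(match.group(1))
        # Drop nested inline tags such as <a> or <em>, then collapse whitespace.
        text = re.sub(r"<[^>]+>", "", match.group(2))
        headings.append((level, " ".join(text.split())))
    return headings


# Hypothetical sample input, not taken from the original post.
sample = '<h1 class="title">Main title</h1><p>Intro</p><h2>First <em>section</em></h2>'
print(extract_headings(sample))
```

Keeping the level next to the text should make it easier to indent or group the subheadings later, for example when feeding them into a list.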

Usually what you want is better done with an HTML parser.
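
As a rough illustration of the parser approach, here is a sketch that uses Python's built-in `html.parser` module. That module is only a stand-in I picked because it needs no installation; the class name `HeadingCollector` and the helper `collect_headings` are hypothetical. One practical difference from a regex is that headings inside HTML comments are skipped.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1-h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._parts = []    # text chunks seen inside that heading

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore headings nested in a heading.
        if self._level is None and tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None

    def handle_data(self, data):
        # Text of nested inline tags (<em>, <a>, ...) is kept, the tags are not.
        if self._level is not None:
            self._parts.append(data)


def collect_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical sample input, not taken from the original post.
page = "<!-- <h2>Hidden</h2> --><h1>Main title</h1><h2>First <em>section</em></h2>"
print(collect_headings(page))
```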


Do you know an HTML parser that I can use for this use case?

I know that htmlparser2 is a very fast library that many projects use as a starting point.

Are you using any scraping API service to scrape the articles, or have you created your own scraper?
