How to make something like PHP in Python. Part 3

Strip tags like strip_tags in PHP

Fullstack CTO
2 min read · Feb 8, 2023

If you have recently moved from PHP to Python, you may struggle at first with the different mindset and feature set. I’ll make your job easier by showing you how to strip HTML tags from text.

To strip HTML tags from a string in Python, you can use the re (regular expression) module and the re.sub function. Here is an example:

import re

def strip_tags(html):
    # Remove everything that looks like a tag: '<', some non-'<' characters, then '>'
    return re.sub(r'<[^<]+?>', '', html)

This function takes an HTML string as input and returns it with everything that looks like a tag removed. The regular expression '<[^<]+?>' matches a '<', followed by one or more characters that are not '<', up to the nearest '>'. The re.sub function replaces every occurrence of this pattern with an empty string. Note that this is pattern matching, not parsing: HTML entities such as &amp; are left as they are, the text inside script and style elements is kept, and a '>' inside an attribute value ends the match early.
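To make the behaviour described above concrete, here is a small sketch that runs the same pattern on a few inputs. The helper strip_tags_and_unescape is a hypothetical name introduced only for illustration; it uses html.unescape from the standard library to decode entities after the tags are removed.

```python
import html
import re

TAG_RE = re.compile(r"<[^<]+?>")


def strip_tags(markup):
    """Remove anything that looks like a tag; entities are left untouched."""
    return TAG_RE.sub("", markup)


def strip_tags_and_unescape(markup):
    """Hypothetical helper: strip tags, then decode entities such as &amp;."""
    return html.unescape(strip_tags(markup))


print(strip_tags("<p>Hello <b>world</b></p>"))            # Hello world
print(strip_tags("<p>Tom &amp; Jerry</p>"))               # Tom &amp; Jerry
print(strip_tags_and_unescape("<p>Tom &amp; Jerry</p>"))  # Tom & Jerry
# A '>' inside an attribute value fools the pattern:
print(strip_tags('<a title="a>b">x</a>'))                 # b">x
```

The last call shows why the regex is best kept for simple, trusted markup.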

If performance is a concern, there are a few ways to optimize the strip_tags function. Here are a couple of options:

1. Compile the regular expression for better performance:

import re

# Compile once at import time and reuse the pattern object on every call
pattern = re.compile(r'<[^<]+?>')

def strip_tags(html):
    return pattern.sub('', html)

Compiling the regular expression once, before it is used, avoids handing the pattern string to the re module on every call, which helps if the function is called many times. Keep in mind that the re module already caches recently used patterns, so the gain is usually modest.
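Whether precompiling pays off is easy to measure. The sketch below times both variants with timeit; the sample input and the repeat count are made-up values, and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

PATTERN = r"<[^<]+?>"
COMPILED = re.compile(PATTERN)
# Made-up sample input: a small fragment repeated 100 times
SAMPLE = "<div><p>Hello <b>world</b></p></div>" * 100


def strip_with_module_function(markup):
    # re.sub looks the pattern up in the module's internal cache on each call
    return re.sub(PATTERN, "", markup)


def strip_with_compiled(markup):
    # The pattern object is reused directly, with no cache lookup
    return COMPILED.sub("", markup)


t_module = timeit.timeit(lambda: strip_with_module_function(SAMPLE), number=1000)
t_compiled = timeit.timeit(lambda: strip_with_compiled(SAMPLE), number=1000)
print(f"re.sub:       {t_module:.4f} s")
print(f"compiled.sub: {t_compiled:.4f} s")
```

Both functions return the same result, so any difference in the timings comes only from how the pattern is obtained.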

2. Use the lxml library instead of the re module:

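from lxml import html as lxml_html

def strip_tags(markup):
    # Parse the markup into an element tree, then join all of its text nodes
    return lxml_html.fromstring(markup).text_content()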

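The lxml library is a third-party package (pip install lxml) that parses the document properly instead of pattern-matching it. This makes it more robust than a regular expression: it copes with malformed markup and decodes entities. Keep in mind that text_content() also returns the text inside script and style elements, and that parsing has its own cost, so measure before assuming it is faster on your data.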
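If installing lxml is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This swaps in a different parser from the one used above, so treat it as an illustration, not a drop-in equivalent; the names _TextExtractor and strip_tags_parser are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags_parser(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags_parser("<p>Hello <b>world</b></p>"))  # Hello world
print(strip_tags_parser("<p>Tom &amp; Jerry</p>"))     # Tom & Jerry
```

Like text_content(), this sketch keeps the text inside script and style elements; skipping them would require tracking the current tag in handle_starttag and handle_endtag.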

What is the best way to remove HTML tags: lxml or a regular expression?

The choice between using lxml and regular expressions for stripping HTML tags depends on several factors, including the size and complexity of the HTML document, the performance requirements, and the ease of use.

In general, if you need to handle complex HTML documents, lxml is a better choice. It provides a more robust and efficient HTML parsing engine, which can handle malformed HTML, correct errors in the document, and extract information in a structured way.

On the other hand, if the HTML document is small, simple, and well-formed, and performance is a concern, regular expressions can be a good choice. They are simple, fast, and lightweight, and can strip HTML tags quickly.

In summary, if you need to process large and complex HTML documents, lxml is likely to be the best choice, while if you are processing small and simple HTML documents, regular expressions may be the way to go.


Written by Fullstack CTO

CTO and co-founder at NEWHR & Geekjob
