Regular Expressions (REGEX) - Examples

Preface

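The previous article covered how Python defines regular expressions. Now let's look at some worked examples; this article will be updated as more examples are added. To make each pattern easier to explain, most of them are written with re.VERBOSE.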
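
As a quick reminder of what re.VERBOSE changes, here is a minimal sketch. The date pattern and the sample text are made up for illustration and do not come from the examples below; the point is only that the compact and the verbose form compile to the same pattern.

```python
import re

# Compact form: correct, but hard to annotate.
compact = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Same pattern with re.VERBOSE: whitespace and '#' comments inside
# the pattern are ignored, so each part can carry its own note.
verbose = re.compile(r"""
    (\d{4})     # year,  group(1)
    -           # literal '-'
    (\d{2})     # month, group(2)
    -           # literal '-'
    (\d{2})     # day,   group(3)
""", re.VERBOSE)

sample = "released on 2021-03-15"   # hypothetical sample text
print(compact.search(sample).groups())
print(verbose.search(sample).groups())
```

One consequence worth keeping in mind: because whitespace is ignored in verbose mode, a literal space in the pattern has to be written as \s, [ ] or an escaped space.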

Example 1: Web scraping, extracting the chapter titles and link URLs from a novel's table of contents

 
from urllib.request import urlopen
 
url = 'https://www.ptwxz.com/html/11/11175/'
html = urlopen(url).read().decode('gbk')
>>> html
('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 
...
 '<h1>万界最强道长最新章节</h1>\r\n'
 '</div>\r\n'
...
 '<li><a href="8011051.html">第1章 倚天峰上</a></li>\r\n'
 '<li><a href="8011052.html">第2章 万界道士</a></li>\r\n'
...
 '<li><a href="8050881.html">第55章 只身诱敌</a></li>\r\n'
 '<li><a href="8052105.html">第56章 终章</a></li>\r\n'
...
 '</body>\r\n'
 '</html>\r\n')

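Normally we would reach for bs4.BeautifulSoup here. It looks like the simpler option, but in practice that is not always the case; for this page a regular expression gives a more direct result.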

Extracting the book title

 
import re
# <h1>万界最强道长最新章节</h1>
title_regex = re.compile(r"""
    <h1>        # <h1>
    (.*?)       # 万界最强道长, group(1)
    .{4}        # 最新章节
    </h1>       # </h1>
""", re.VERBOSE)
 
title = title_regex.search(html).group(1)
print(f'Book title: {title}')
Book title: 万界最强道长
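
The snippet above assumes the page always contains a matching heading; if it does not, search returns None and .group(1) raises AttributeError. A minimal offline sketch of a more defensive version follows. The helper name extract_title and the sample_html string are assumptions made for illustration, not part of the original script.

```python
import re

# Hypothetical snippet standing in for the downloaded index page.
sample_html = '<div>\r\n<h1>万界最强道长最新章节</h1>\r\n</div>\r\n'

title_regex = re.compile(r"""
    <h1>        # opening tag
    (.*?)       # book title, group(1)
    .{4}        # fixed 4-character suffix after the title
    </h1>       # closing tag
""", re.VERBOSE)

def extract_title(html):
    """Return the book title, or None when no matching <h1> is found."""
    m = title_regex.search(html)
    return m.group(1) if m else None

print(extract_title(sample_html))
print(extract_title('<p>no heading here</p>'))
```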

Extracting each chapter's link and title

 
# <li><a href="8011051.html">第1章 倚天峰上</a></li>
chapter_regex = re.compile(r"""
    <li><a      # <li><a
    \s+         # ' '
    href="      # href="
    (.+?)"      # 8011051.html, group(1) link
    >           # >
    (.+?)       # 第1章 倚天峰上, group(2) chapter title
    </a></li>   # </a></li>
""", re.VERBOSE)
 
chapters = [(url+m.group(1), m.group(2)) for m in chapter_regex.finditer(html)]
for chapter in chapters:
    print(chapter)
('https://www.ptwxz.com/html/11/11175/8011051.html', '第1章 倚天峰上')
('https://www.ptwxz.com/html/11/11175/8011052.html', '第2章 万界道士')
...
('https://www.ptwxz.com/html/11/11175/8050881.html', '第55章 只身诱敌')
('https://www.ptwxz.com/html/11/11175/8052105.html', '第56章 终章')
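
The list comprehension builds each URL by string concatenation, which works because every href on this page is a bare file name. A small variation, sketched here on a hypothetical two-line excerpt, uses urllib.parse.urljoin instead so that absolute or root-relative hrefs would also resolve correctly. The names base_url and sample_html are introduced only for this sketch.

```python
import re
from urllib.parse import urljoin

base_url = 'https://www.ptwxz.com/html/11/11175/'

# Hypothetical excerpt standing in for the downloaded index page.
sample_html = (
    '<li><a href="8011051.html">第1章 倚天峰上</a></li>\r\n'
    '<li><a href="8011052.html">第2章 万界道士</a></li>\r\n'
)

chapter_regex = re.compile(r"""
    <li><a\s+href="
    ([^"]+)         # link, group(1): everything up to the closing quote
    ">
    (.+?)           # chapter title, group(2)
    </a></li>
""", re.VERBOSE)

# urljoin resolves each href against the index page URL.
chapters = [
    (urljoin(base_url, m.group(1)), m.group(2))
    for m in chapter_regex.finditer(sample_html)
]
for chapter in chapters:
    print(chapter)
```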

Extracting the content of the first chapter

 
import re
from urllib.request import urlopen
 
url = 'https://www.ptwxz.com/html/11/11175/8011051.html'
html = urlopen(url).read().decode('gbk')
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...
&nbsp;&nbsp;&nbsp;&nbsp;青天白日，浩浩诸峰。<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;悠悠钟声，回荡山间。
...
<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;陈玄一身子一倾，倒在了地上。
</div>
...
</html>

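We take &nbsp;&nbsp;&nbsp;&nbsp; as the starting point and </div> as the end point to pull out the chapter content, then split it into paragraphs on <br\s*?/><br\s*?/>&nbsp;&nbsp;&nbsp;&nbsp;, where \s*? stands in for the optional whitespace inside each <br /> tag.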

 
regex = re.compile("""
    (?<=&nbsp;&nbsp;&nbsp;&nbsp;)
    .*?
    (?=</div>)
""", re.VERBOSE | re.DOTALL)
m = regex.search(html)
text = '\n'.join(re.split(r"<br\s*?/><br\s*?/>&nbsp;&nbsp;&nbsp;&nbsp;", m.group().strip()))
# text = re.sub(r"<br\s*?/><br\s*?/>&nbsp;&nbsp;&nbsp;&nbsp;", '\n', m.group().strip())
print(text)
青天白日，浩浩诸峰。
悠悠钟声，回荡山间。
正是清晨时分，倚天峰上，钟声三响，人影绰绰。
...
陈玄一身子一倾，倒在了地上。
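
The same lookbehind/lookahead idea can be tried offline. The sketch below runs on a hypothetical fragment shaped like the chapter page; sample_html, INDENT and body_regex are names chosen for this illustration. It also allows whitespace around the separator, an addition not present in the original split pattern, so a line break before <br /> does not end up inside a paragraph.

```python
import re

# Hypothetical fragment shaped like the chapter page shown above.
sample_html = (
    '<div id="content">\n'
    '&nbsp;&nbsp;&nbsp;&nbsp;青天白日，浩浩诸峰。<br /><br />'
    '&nbsp;&nbsp;&nbsp;&nbsp;悠悠钟声，回荡山间。\n'
    '<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;陈玄一身子一倾，倒在了地上。\n'
    '</div>\n'
)

INDENT = '&nbsp;' * 4   # the four non-breaking spaces that open a paragraph

body_regex = re.compile(r"""
    (?<=&nbsp;&nbsp;&nbsp;&nbsp;)   # start right after the first indent
    .*?                             # chapter body, as short as possible
    (?=</div>)                      # stop right before </div>
""", re.VERBOSE | re.DOTALL)

m = body_regex.search(sample_html)
body = m.group().strip() if m else ''

# Split on the double <br /> plus indent, tolerating surrounding whitespace.
paragraphs = re.split(r"\s*<br\s*/><br\s*/>\s*" + INDENT, body)
print('\n'.join(paragraphs))
```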

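Example 2: Extracting all class definitions and their docstrings from a Python script

Here we use the __init__.py of the tkinter library as the example.

Reading the file contents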

 
import re
import pathlib
import tkinter
 
base = tkinter.__path__[0]
path = pathlib.Path(base).joinpath('__init__.py')
with open(path, 'rt', encoding='utf-8') as f:   # Python source files default to UTF-8
    script = f.read()

Defining the class pattern

 
# class xxx (yyy) : """zzz"""
class_pattern = r''' 
    \bclass             # begin of a word   
    \s+?                # space
    [\w]+?              # identifier xxx   
    \s*?                # space   
    (                   # group 1   
        \(              #   (   
        .*?             #   yyy   
        \)              #   )   
    )?                  # group 1 may not exist   
    \s*?                # space   
    :                   # :   
    (                   # group 2   
        \s*?            # space   
        (["]{3}|[']{3}) # group 3, DOC-STRING   
        .*?             # zzz   
        \3              # same as group 3   
    )?                  # maybe no DOC-STRING
'''
class_regex = re.compile(class_pattern, re.VERBOSE | re.DOTALL)
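
Before running a pattern like this on a whole file, it can help to try it on a few hand-written lines. The sketch below is a variation, not the pattern above: it uses named groups (name, bases, quote, doc), which are an assumption of this illustration, with (?P=quote) playing the role of the \3 backreference. The sample_script lines are also made up.

```python
import re

class_regex = re.compile(r'''
    \bclass \s+ (?P<name>\w+)           # class keyword and identifier
    \s* (?P<bases>\( .*? \))?           # optional (base, ...) list
    \s* :                               # colon ending the header
    (?:                                 # optional docstring
        \s* (?P<quote>"{3}|'{3})        #   opening triple quote
        (?P<doc>.*?)                    #   docstring text
        (?P=quote)                      #   must close with the same quotes
    )?
''', re.VERBOSE | re.DOTALL)

# Hypothetical script text, built line by line to avoid nested triple quotes.
sample_script = "\n".join([
    'class Plain:',
    '    pass',
    '',
    'class Event(Base):',
    '    """Container for the properties of an event."""',
    '',
    'class Variable:',
    "    '''Value holder.",
    '',
    "    Second line.'''",
])

found = [
    (m.group('name'), m.group('bases'), m.group('doc'))
    for m in class_regex.finditer(sample_script)
]
for item in found:
    print(item)
```

Groups that did not take part in a match come back as None, which is how a class without a base list or without a docstring shows up in the result.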

Extracting the content

 
classes = [m.group() for m in class_regex.finditer(script)]
for c in classes:
    print(c)
class EventType(str, enum.Enum):
class Event:
    """Container for the properties of an event.
 
    Instances of this type are generated if one of the following events occurs:
 
    KeyPress, KeyRelease - for keyboard events
    ButtonPress, ButtonRelease, Motion, Enter, Leave, MouseWheel - for mouse events
    Visibility, Unmap, Map, Expose, FocusIn, FocusOut, Circulate,
    Colormap, Gravity, Reparent, Property, Destroy, Activate,
    Deactivate - for window events.
 
    If a callback function for one of these events is registered
    using bind, bind_all, bind_class, or tag_bind, the callback is
    called with an Event as first argument. It will have the
    following attributes (in braces are the event types for which
    the attribute is valid):
 
        serial - serial number of event
    num - mouse button pressed (ButtonPress, ButtonRelease)
    focus - whether the window has the focus (Enter, Leave)
    height - height of the exposed window (Configure, Expose)
    width - width of the exposed window (Configure, Expose)
    keycode - keycode of the pressed key (KeyPress, KeyRelease)
    state - state of the event as a number (ButtonPress, ButtonRelease,
                            Enter, KeyPress, KeyRelease,
                            Leave, Motion)
    state - state as a string (Visibility)
    time - when the event occurred
    x - x-position of the mouse
    y - y-position of the mouse
    x_root - x-position of the mouse on the screen
             (ButtonPress, ButtonRelease, KeyPress, KeyRelease, Motion)
    y_root - y-position of the mouse on the screen
             (ButtonPress, ButtonRelease, KeyPress, KeyRelease, Motion)
    char - pressed character (KeyPress, KeyRelease)
    send_event - see X/Windows documentation
    keysym - keysym of the event as a string (KeyPress, KeyRelease)
    keysym_num - keysym of the event as a number (KeyPress, KeyRelease)
    type - type of the event as a number
    widget - widget in which the event occurred
    delta - delta of wheel movement (MouseWheel)
    """
class Variable:
    """Class to define value holders for e.g. buttons.
 
    Subclasses StringVar, IntVar, DoubleVar, BooleanVar are specializations
    that constrain the type of the value returned from get()."""
 
...
 
class LabelFrame(Widget):
    """labelframe widget."""
class PanedWindow(Widget):
    """panedwindow widget."""

Example 3: Converting string content into a list of records

Suppose we have a block of reference entries like the following:

 
import re
 
text = ''.join("""
[1] ShijunWangRonald M.Summe, Medical Image Analysis, Volume
16, Issue 5, July 2012, pp. 933-951 https://www.sciencedirect.
com/science/article/pii/S1361841512000333
[2] Dupuytren’s contracture, By Mayo Clinic Staff, https://
www.mayoclinic.org/diseases-conditions/dupuytrenscontracture/
symptoms-causes/syc-20371943
[3] Mean and standard deviation. http://www.bmj.com/about-bmj/
resources-readers/
publications/statistics-square-one/2-
mean-and-standard-deviation
[4] Interquartile Range IQR http://www.mathwords.com/i/
interquartile_range.htm
[5] Why are tree-based models robust to outliers? https://www.
quora.com/Why-are-tree-
based-models-robustto-
outliers
[6] https://www.dummies.com/education/math/statistics/howto-
interpret-a-correlation-coefficient-r/
[7] https://www.medicalnewstoday.com/releases/11856.php
[8] Scikit Learn Auc metrics: http://scikit-learn.org/stable/
modules/generated/sklearn.metrics.auc.html
[9] Scikit Learn Library RoC and AUC scores: http://
scikit-learn.
org/stable/modules/generated/sklearn.metrics.roc_auc_
score.html
""".strip().splitlines())

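What we want is the number, the description, and the URL of each entry as separate fields. Note that the wrapped lines have already been joined into a single string above.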

 
regex = re.compile(r"""
    \[              # [
        (\d+?)      #   integer
    ]               # ]
    \s+?            # at least one space or more
    (.*?)           # any characters
    \s*?            # maybe no space or more
    (https?://.+?)  # simple http(s) match
    (?=\[|$)        # end with '[' or end of string, not included
""", re.VERBOSE | re.DOTALL)
 
for lst in regex.findall(text):
    print('\n'.join(lst))

Because the result we want is a list, we call findall; to make the printed output easier to read, the fields of each entry are put on separate lines.

1. ShijunWangRonald M.Summe, Medical Image Analysis, Volume16, Issue 5, July 2012, pp. 933-951

https://www.sciencedirect.com/science/article/pii/S1361841512000333

2. Dupuytren’s contracture, By Mayo Clinic Staff,

https://www.mayoclinic.org/diseases-conditions/dupuytrenscontracture/symptoms-causes/syc-20371943

3. Mean and standard deviation.

http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/2-mean-and-standard-deviation

4. Interquartile Range IQR

http://www.mathwords.com/i/interquartile_range.htm

5. Why are tree-based models robust to outliers?

https://www.quora.com/Why-are-tree-based-models-robustto-outliers

6. https://www.dummies.com/education/math/statistics/howto-interpret-a-correlation-coefficient-r/

7. https://www.medicalnewstoday.com/releases/11856.php

8. Scikit Learn Auc metrics:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

9. Scikit Learn Library RoC and AUC scores:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
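
If named fields are more convenient than positional tuples, one possible variation is to use named groups with finditer and groupdict. The sketch below runs on two entries taken from the list above; the group names number, description and url are assumptions of this illustration.

```python
import re

# Two entries copied from the reference list above, joined into one string.
text = ''.join("""
[4] Interquartile Range IQR http://www.mathwords.com/i/
interquartile_range.htm
[7] https://www.medicalnewstoday.com/releases/11856.php
""".strip().splitlines())

regex = re.compile(r"""
    \[ (?P<number>\d+) \]       # entry number in brackets
    \s+
    (?P<description>.*?)        # free text, may be empty
    \s*
    (?P<url>https?://.+?)       # URL, up to the next entry
    (?=\[|$)                    # next '[' or end of string, not consumed
""", re.VERBOSE | re.DOTALL)

# One dict per entry, keyed by group name.
records = [m.groupdict() for m in regex.finditer(text)]
for record in records:
    print(record)
```

As with the original pattern, this relies on '[' never appearing inside a description or a URL; an entry that contains one would be cut short.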
