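Preface
Earlier we covered how regular expressions are defined in Python; now let's look at some worked examples. This article will be updated as more examples are added. To make each pattern easier to explain, most of them are written in re.VERBOSE mode.
Example 1: Scraping web page content, the title and link URL of each chapter in a novel's table of contents
- Read the web page of the novel's table of contents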
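As a quick reminder of what re.VERBOSE changes, here is a minimal sketch. The date pattern and the sample strings are hypothetical and only serve as illustration: whitespace and "#" comments inside the pattern are ignored, so a literal space has to be written explicitly.

```python
import re

# Compact form: correct, but hard to read at a glance.
compact = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# Same pattern in re.VERBOSE mode: blanks and "#" comments are ignored.
verbose = re.compile(r"""
    (\d{4})   # year,  group(1)
    -         # literal hyphen
    (\d{2})   # month, group(2)
    -         # literal hyphen
    (\d{2})   # day,   group(3)
""", re.VERBOSE)

sample = 'released on 2012-07-15'   # hypothetical sample text
assert compact.search(sample).groups() == verbose.search(sample).groups()
print(verbose.search(sample).groups())   # ('2012', '07', '15')

# A literal space must be spelled out, for example as "[ ]" or "\s".
spaced = re.compile(r"""
    Volume [ ] (\d+)   # "[ ]" is a real space; the other blanks are ignored
""", re.VERBOSE)
print(spaced.search('Volume 16').group(1))   # 16
```

Inside a character class, whitespace keeps its literal meaning even in verbose mode, which is why "[ ]" works here.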
from urllib.request import urlopen
url = 'https://www.ptwxz.com/html/11/11175/'
html = urlopen(url).read().decode('gbk')
>>> html
('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
...
'<h1>万界最强道长最新章节</h1>\r\n'
'</div>\r\n'
...
'<li><a href="8011051.html">第1章 倚天峰上</a></li>\r\n'
'<li><a href="8011052.html">第2章 万界道士</a></li>\r\n'
...
'<li><a href="8050881.html">第55章 只身诱敌</a></li>\r\n'
'<li><a href="8052105.html">第56章 终章</a></li>\r\n'
...
'</body>\r\n'
'</html>\r\n')
A common choice for this job is bs4.BeautifulSoup. It looks simpler, but in practice that is not always the case; here a regular expression gives a shorter solution.
- Extract the book title
import re
# <h1>万界最强道长最新章节</h1>
title_regex = re.compile(r"""
    <h1>      # <h1>
    (.*?)     # 万界最强道长, group(1)
    .{4}      # 最新章节 (the four-character suffix)
    </h1>     # </h1>
""", re.VERBOSE)
title = title_regex.search(html).group(1)
print(f'Book title: {title}')
Book title: 万界最强道长
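The title extraction can be tried without network access. The following is a minimal sketch under stated assumptions: sample_html is a hypothetical excerpt that mimics the page markup, and extract_title is an illustrative helper name that is not part of the original code. It spells out the fixed suffix instead of using .{4} and returns None when the heading is missing, rather than raising AttributeError.

```python
import re

# Hypothetical excerpt mimicking the markup of the table-of-contents page.
sample_html = '<div>\r\n<h1>万界最强道长最新章节</h1>\r\n</div>\r\n'

title_regex = re.compile(r"""
    <h1>        # opening tag
    (.*?)       # book title, group(1)
    最新章节    # fixed suffix, spelled out instead of .{4}
    </h1>       # closing tag
""", re.VERBOSE)

def extract_title(html):
    """Return the book title, or None when no matching <h1> is found."""
    m = title_regex.search(html)
    return m.group(1) if m else None

print(extract_title(sample_html))           # 万界最强道长
print(extract_title('<p>no heading</p>'))   # None
```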
- Extract the link and title of each chapter
# <li><a href="8011051.html">第1章 倚天峰上</a></li>
chapter_regex = re.compile(r"""
    <li><a       # <li><a
    \s+          # whitespace between the tag name and the attribute
    href="       # href="
    (.+?)"       # 8011051.html, group(1): link
    >            # >
    (.+?)        # 第1章 倚天峰上, group(2): chapter title
    </a></li>    # </a></li>
""", re.VERBOSE)
chapters = [(url+m.group(1), m.group(2)) for m in chapter_regex.finditer(html)]
for chapter in chapters:
    print(chapter)
('https://www.ptwxz.com/html/11/11175/8011051.html', '第1章 倚天峰上')
('https://www.ptwxz.com/html/11/11175/8011052.html', '第2章 万界道士')
...
('https://www.ptwxz.com/html/11/11175/8050881.html', '第55章 只身诱敌')
('https://www.ptwxz.com/html/11/11175/8052105.html', '第56章 终章')
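The chapter extraction can be sketched offline as well. In the variant below, sample_html is a hypothetical two-line excerpt, the group names link and name are illustrative, and urllib.parse.urljoin replaces plain string concatenation; this is one possible way to write it, not the only one.

```python
import re
from urllib.parse import urljoin

base_url = 'https://www.ptwxz.com/html/11/11175/'

# Hypothetical excerpt with the same structure as the real page.
sample_html = (
    '<li><a href="8011051.html">第1章 倚天峰上</a></li>\r\n'
    '<li><a href="8011052.html">第2章 万界道士</a></li>\r\n'
)

chapter_regex = re.compile(r"""
    <li><a \s+ href="
    (?P<link> [^"]+ )     # relative link, everything up to the closing quote
    ">
    (?P<name> [^<]+ )     # chapter title, everything up to the next tag
    </a></li>
""", re.VERBOSE)

chapters = [(urljoin(base_url, m['link']), m['name'])
            for m in chapter_regex.finditer(sample_html)]
for chapter in chapters:
    print(chapter)
```

Named groups make the list comprehension self-describing, and urljoin also copes with absolute or root-relative href values, which plain concatenation would not.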
- Extract the content of the first chapter
import re
from urllib.request import urlopen
url = 'https://www.ptwxz.com/html/11/11175/8011051.html'
html = urlopen(url).read().decode('gbk')
...
青天白日,浩浩诸峰。<br /><br /> 悠悠钟声,回荡山间。
...
<br /><br /> 陈玄一身子一倾,倒在了地上。
</div>
...
</html>
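Take the four &nbsp; entities that open the chapter text as the starting point (every paragraph starts with them in the page source; they show up as a blank in the excerpt above) and </div> as the end point to extract the chapter content, then split it into paragraphs on <br\s*?/><br\s*?/> followed by the same four entities. The \s*? stands in for the optional whitespace inside each <br /> tag.
regex = re.compile(r"""
    (?<=&nbsp;&nbsp;&nbsp;&nbsp;)   # preceded by the paragraph indent, not included
    .*?                             # chapter text, as short as possible
    (?=</div>)                      # followed by </div>, not included
""", re.VERBOSE | re.DOTALL)
m = regex.search(html)
text = '\n'.join(re.split(r"<br\s*?/><br\s*?/>&nbsp;&nbsp;&nbsp;&nbsp;", m.group().strip()))
# text = re.sub(r"<br\s*?/><br\s*?/>&nbsp;&nbsp;&nbsp;&nbsp;", '\n', m.group().strip())
print(text)
青天白日,浩浩诸峰。
悠悠钟声,回荡山间。
正是清晨时分,倚天峰上,钟声三响,人影绰绰。
...
陈玄一身子一倾,倒在了地上。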
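The lookbehind/lookahead extraction and the paragraph split can be checked on a small sample. This is a minimal sketch under an explicit assumption: each paragraph in the page source starts with four &nbsp; entities. sample_html, INDENT, and split_paragraphs are hypothetical names introduced only for this illustration.

```python
import re

INDENT = '&nbsp;' * 4   # assumed paragraph indent in the page source

# Hypothetical excerpt; the second separator has no space inside <br/>.
sample_html = (
    '<div id="content">\r\n'
    + INDENT + '青天白日,浩浩诸峰。<br /><br />'
    + INDENT + '悠悠钟声,回荡山间。<br/><br/>'
    + INDENT + '陈玄一身子一倾,倒在了地上。\r\n'
    '</div>\r\n'
)

body_regex = re.compile(rf"""
    (?<={INDENT})   # preceded by the indent, not part of the match
    .*?             # chapter body, as short as possible
    (?=</div>)      # followed by </div>, not part of the match
""", re.VERBOSE | re.DOTALL)

def split_paragraphs(html):
    """Return the chapter paragraphs, or an empty list if nothing matches."""
    m = body_regex.search(html)
    if m is None:
        return []
    return re.split(rf'<br\s*/><br\s*/>{INDENT}', m.group().strip())

paragraphs = split_paragraphs(sample_html)
print('\n'.join(paragraphs))
```

Python requires a lookbehind to have a fixed width, which the four literal entities satisfy; re.DOTALL lets .*? run across line breaks inside the chapter body.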
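Example 2: Extracting every class definition and its docstring from a Python script
Here we use __init__.py from the tkinter library as the example.
- Read the file content
import re
import pathlib
import tkinter
base = tkinter.__path__[0]
path = pathlib.Path(base).joinpath('__init__.py')
# Python source files are UTF-8 by default, so state the encoding explicitly
with open(path, 'rt', encoding='utf-8') as f:
    script = f.read()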
- Define the class pattern
# class xxx (yyy) : """zzz"""
class_pattern = r'''
    \bclass             # word boundary, then the keyword class
    \s+?                # whitespace
    [\w]+?              # identifier xxx
    \s*?                # optional whitespace
    (                   # group 1
    \(                  # (
    .*?                 # yyy
    \)                  # )
    )?                  # group 1 may be absent
    \s*?                # optional whitespace
    :                   # :
    (                   # group 2
    \s*?                # optional whitespace
    (["]{3}|[']{3})     # group 3, opening docstring quotes
    .*?                 # zzz
    \3                  # the same quotes as group 3
    )?                  # the docstring may be absent
'''
class_regex = re.compile(class_pattern, re.VERBOSE | re.DOTALL)
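Before running a pattern like this on a large file, it can help to try it on a tiny script. The sketch below is a variation, not the pattern above: it uses named groups (name, bases, quote, doc are illustrative names), anchors the keyword at the start of a line with re.MULTILINE so that the word "class" inside prose is less likely to match, and works on a hypothetical sample_script.

```python
import re

# Hypothetical script used only for this illustration.
sample_script = '''
class Plain:
    pass

class Base(object):
    """Base docstring."""

class Child(Base, metaclass=type):
    """Child summary.

    More detail.
    """
'''

class_regex = re.compile(r'''
    ^ [ \t]* class \s+                # keyword at the start of a line
    (?P<name> \w+ )                   # class name
    \s*
    (?: \( (?P<bases> .*? ) \) )?     # optional base list
    \s* :
    (?:                               # optional docstring
        \s*
        (?P<quote> ["]{3} | [']{3} )  # opening quotes
        (?P<doc> .*? )                # docstring body
        (?P=quote)                    # matching closing quotes
    )?
''', re.VERBOSE | re.DOTALL | re.MULTILINE)

found = [(m['name'], m['bases'], m['doc'])
         for m in class_regex.finditer(sample_script)]
for name, bases, doc in found:
    summary = doc.strip().splitlines()[0] if doc else None
    print(name, bases, summary)
```

Groups that do not take part in a match come back as None, which is how a class without bases or without a docstring shows up here.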
- Extract the content
classes = [m.group() for m in class_regex.finditer(script)]
for c in classes:
    print(c)
class EventType(str, enum.Enum):
class Event:
"""Container for the properties of an event.
Instances of this type are generated if one of the following events occurs:
KeyPress, KeyRelease - for keyboard events
ButtonPress, ButtonRelease, Motion, Enter, Leave, MouseWheel - for mouse events
Visibility, Unmap, Map, Expose, FocusIn, FocusOut, Circulate,
Colormap, Gravity, Reparent, Property, Destroy, Activate,
Deactivate - for window events.
If a callback function for one of these events is registered
using bind, bind_all, bind_class, or tag_bind, the callback is
called with an Event as first argument. It will have the
following attributes (in braces are the event types for which
the attribute is valid):
serial - serial number of event
num - mouse button pressed (ButtonPress, ButtonRelease)
focus - whether the window has the focus (Enter, Leave)
height - height of the exposed window (Configure, Expose)
width - width of the exposed window (Configure, Expose)
keycode - keycode of the pressed key (KeyPress, KeyRelease)
state - state of the event as a number (ButtonPress, ButtonRelease,
Enter, KeyPress, KeyRelease,
Leave, Motion)
state - state as a string (Visibility)
time - when the event occurred
x - x-position of the mouse
y - y-position of the mouse
x_root - x-position of the mouse on the screen
(ButtonPress, ButtonRelease, KeyPress, KeyRelease, Motion)
y_root - y-position of the mouse on the screen
(ButtonPress, ButtonRelease, KeyPress, KeyRelease, Motion)
char - pressed character (KeyPress, KeyRelease)
send_event - see X/Windows documentation
keysym - keysym of the event as a string (KeyPress, KeyRelease)
keysym_num - keysym of the event as a number (KeyPress, KeyRelease)
type - type of the event as a number
widget - widget in which the event occurred
delta - delta of wheel movement (MouseWheel)
"""
class Variable:
"""Class to define value holders for e.g. buttons.
Subclasses StringVar, IntVar, DoubleVar, BooleanVar are specializations
that constrain the type of the value returned from get()."""
...
class LabelFrame(Widget):
"""labelframe widget."""
class PanedWindow(Widget):
"""panedwindow widget."""
Example 3: Converting string content into a list of data
- Suppose we have a reference list like the following
import re
text = ''.join("""
[1] ShijunWangRonald M.Summe, Medical Image Analysis, Volume
16, Issue 5, July 2012, pp. 933-951 https://www.sciencedirect.
com/science/article/pii/S1361841512000333
[2] Dupuytren’s contracture, By Mayo Clinic Staff, https://
www.mayoclinic.org/diseases-conditions/dupuytrenscontracture/
symptoms-causes/syc-20371943
[3] Mean and standard deviation. http://www.bmj.com/about-bmj/
resources-readers/
publications/statistics-square-one/2-
mean-and-standard-deviation
[4] Interquartile Range IQR http://www.mathwords.com/i/
interquartile_range.htm
[5] Why are tree-based models robust to outliers? https://www.
quora.com/Why-are-tree-
based-models-robustto-
outliers
[6] https://www.dummies.com/education/math/statistics/howto-
interpret-a-correlation-coefficient-r/
[7] https://www.medicalnewstoday.com/releases/11856.php
[8] Scikit Learn Auc metrics: http://scikit-learn.org/stable/
modules/generated/sklearn.metrics.auc.html
[9] Scikit Learn Library RoC and AUC scores: http://
scikit-learn.
org/stable/modules/generated/sklearn.metrics.roc_auc_
score.html
""".strip().splitlines())
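- What we want is the number, the description, and the URL as separate fields; at this point all the lines have already been joined into one string.
regex = re.compile(r"""
    \[              # [
    (\d+?)          # reference number
    ]               # ]
    \s+?            # one or more whitespace characters
    (.*?)           # description, any characters (may be empty)
    \s*?            # optional whitespace
    (https?://.+?)  # simple http(s) match
    (?=\[|$)        # stop before the next '[' or at the end of the string, not included
""", re.VERBOSE | re.DOTALL)
for num, desc, url in regex.findall(text):
    # number and description on one line, URL on the next; an empty description is skipped
    print(f'{num}. ' + '\n'.join(part for part in (desc, url) if part))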
- Because the result we want is a list, we call the findall function; for readability, each entry is then printed over separate lines.
1. ShijunWangRonald M.Summe, Medical Image Analysis, Volume16, Issue 5, July 2012, pp. 933-951
https://www.sciencedirect.com/science/article/pii/S1361841512000333
2. Dupuytren’s contracture, By Mayo Clinic Staff,
https://www.mayoclinic.org/diseases-conditions/dupuytrenscontracture/symptoms-causes/syc-20371943
3. Mean and standard deviation.
http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/2-mean-and-standard-deviation
4. Interquartile Range IQR
http://www.mathwords.com/i/interquartile_range.htm
5. Why are tree-based models robust to outliers?
https://www.quora.com/Why-are-tree-based-models-robustto-outliers
6. https://www.dummies.com/education/math/statistics/howto-interpret-a-correlation-coefficient-r/
7. https://www.medicalnewstoday.com/releases/11856.php
8. Scikit Learn Auc metrics:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
9. Scikit Learn Library RoC and AUC scores:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
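To end up with structured data rather than printed lines, the matches can be turned into a list of dictionaries. The sketch below is one way to do that under stated assumptions: the two-entry text is a hypothetical shortened sample, the group names num, desc, and url are illustrative, and the URL is assumed to contain neither whitespace nor "[".

```python
import re

# Hypothetical two-entry sample, already joined into a single string.
text = (
    '[1] Interquartile Range IQR '
    'http://www.mathwords.com/i/interquartile_range.htm'
    '[2] https://www.medicalnewstoday.com/releases/11856.php'
)

ref_regex = re.compile(r"""
    \[ (?P<num> \d+ ) \]        # [n]
    \s*
    (?P<desc> .*? )             # description, possibly empty
    \s*
    (?P<url> https?://\S+? )    # URL, assumed to contain no whitespace
    (?= \[ | $ )                # stop before the next "[" or at the end
""", re.VERBOSE | re.DOTALL)

refs = [
    {'num': int(m['num']), 'desc': m['desc'], 'url': m['url']}
    for m in ref_regex.finditer(text)
]
for ref in refs:
    print(ref)
```

Using finditer with named groups keeps each field addressable by name; the trade-off of the lookahead is that a URL containing "[" would be cut short.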