Regular Expression HOWTO

Introduction

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.

The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.

Simple Patterns

We’ll start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, we’ll begin with the most common task: matching characters.

For a detailed explanation of the computer science underlying regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on writing compilers.

Matching Characters

Most letters and characters will simply match themselves. For example, the regular expression test will match the string test exactly. (You can enable a case-insensitive mode that would let this RE match Test or TEST as well; more about this later.)

There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. Much of this document is devoted to discussing various metacharacters and what they do.

Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.

. ^ $ * + ? { } [ ] \ | ( )

The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.

You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'. If the caret appears elsewhere in a character class, it does not have special meaning. For example: [5^] will match either a '5' or a '^'.
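
The character-class rules above can be checked directly in the interpreter. The following is a minimal sketch; the sample strings are made up for illustration, and findall() is a function covered later in this HOWTO.

```python
import re

# [abc] and [a-c] describe the same set of characters.
same_set = re.findall(r'[abc]', 'abcdef') == re.findall(r'[a-c]', 'abcdef')

# Inside a class, '$' is an ordinary character.
dollar_class = re.findall(r'[akm$]', 'make $5')

# A leading '^' complements the set; anywhere else it is a literal caret.
not_five = re.findall(r'[^5]', '1525')
five_or_caret = re.findall(r'[5^]', '5^2')

print(same_set, dollar_class, not_five, five_or_caret)
```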

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.

Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.

Let’s take an example: \w matches any alphanumeric character. If the regex pattern is expressed in bytes, this is equivalent to the class [a-zA-Z0-9_]. If the regex pattern is a string, \w will match all the characters marked as letters in the Unicode database provided by the unicodedata module. You can use the more restricted definition of \w in a string pattern by supplying the re.ASCII flag when compiling the regular expression.
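
As a rough illustration of the three cases just described, the sketch below runs \w+ as a string pattern, as a string pattern with re.ASCII, and as a bytes pattern. The sample text is an assumption chosen to contain two non-ASCII letters.

```python
import re

# Sample text with two accented letters, written with escapes to avoid
# any ambiguity about the source file's encoding.
text = 'caf\u00e9_1 na\u00efve'

unicode_words = re.findall(r'\w+', text)                 # str pattern, Unicode rules
ascii_words = re.findall(r'\w+', text, re.ASCII)         # str pattern, ASCII-only \w
byte_words = re.findall(rb'\w+', text.encode('utf-8'))   # bytes pattern

print(unicode_words)
print(ascii_words)
print(byte_words)
```

With re.ASCII or a bytes pattern, the accented letters fall outside [a-zA-Z0-9_], so each word is split at that point.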

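The following list of special sequences isn’t complete. For a complete list of sequences and expanded class definitions for Unicode string patterns, see the last part of Regular Expression Syntax in the Standard Library reference. In general, the Unicode versions match any character that’s in the appropriate category in the Unicode database. The class equivalents given below hold for bytes patterns and for string patterns compiled with re.ASCII.

\d  Matches any decimal digit; equivalent to the class [0-9].
\D  Matches any non-digit character; equivalent to the class [^0-9].
\s  Matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S  Matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w  Matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_].
\W  Matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_].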

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.

The final metacharacter in this section is '.'. It matches anything except a newline character, and there’s an alternate mode (re.DOTALL) where it will match even a newline. '.' is often used where you want to match “any character”.
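
A small sketch of the difference the re.DOTALL mode makes; the three-character test string is an assumption for illustration.

```python
import re

text = 'a\nb'   # 'a', a newline, 'b'

default_match = re.match(r'a.b', text)             # '.' refuses the newline
dotall_match = re.match(r'a.b', text, re.DOTALL)   # '.' accepts the newline

print(default_match)
print(dotall_match)
```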

Repeating Things

Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn’t be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.

The first metacharacter for repeating things that we’ll look at is *. * doesn’t match the literal character '*'; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ca*t will match 'ct' (0 'a' characters), 'cat' (1 'a'), 'caaat' (3 'a' characters), and so forth.

Repetitions such as * are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions.

A step-by-step example will make this more obvious. Let’s consider the expression a[bcd]*b. This matches the letter 'a', zero or more letters from the class [bcd], and finally ends with a 'b'. Now imagine matching this RE against the string 'abcbd'.

Step  Matched  Explanation
1     a        The a in the RE matches.
2     abcbd    The engine matches [bcd]*, going as far as it can, which is to the end of the string.
3     Failure  The engine tries to match b, but the current position is at the end of the string, so it fails.
4     abcb     Back up, so that [bcd]* matches one less character.
5     Failure  Try b again, but the current position is at the last character, which is a 'd'.
6     abc      Back up again, so that [bcd]* is only matching bc.
7     abcb     Try b again. This time the character at the current position is 'b', so it succeeds.

The end of the RE has now been reached, and it has matched 'abcb'. This demonstrates how the matching engine goes as far as it can at first, and if no match is found it will then progressively back up and retry the rest of the RE again and again. It will back up until it has tried zero matches for [bcd]*, and if that subsequently fails, the engine will conclude that the string doesn’t match the RE at all.
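
The outcome of the walkthrough can be confirmed with a short sketch. It only shows the final result, not the engine's intermediate steps; the second test string is an assumption added to show the case where every retry fails.

```python
import re

p = re.compile(r'a[bcd]*b')

# Greedy [bcd]* first takes 'bcbd', then backs up until a final 'b' fits.
m = p.match('abcbd')
print(m.group(), m.span())

# No 'b' is available after any amount of backing up, so there is no match.
no_match = p.match('acd')
print(no_match)
```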

Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever’s being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match 'cat' (1 'a'), 'caaat' (3 'a's), but won’t match 'ct'.

There are two more repeating qualifiers. The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either 'homebrew' or 'home-brew'.

The most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match 'a/b', 'a//b', and 'a///b'. It won’t match 'ab', which has no slashes, or 'a////b', which has four.

You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity.

Readers of a reductionist bent may notice that the three other qualifiers can all be expressed using this notation. {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It’s better to use *, +, or ? when you can, simply because they’re shorter and easier to read.
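
The four repetition forms and their {m,n} equivalents can be compared side by side. This is a sketch under one assumption: the helper matches() is hypothetical and uses re.fullmatch() so that each candidate must match in its entirety.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches completely."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

words = ['ct', 'cat', 'caaat']
star = matches(r'ca*t', words)        # zero or more 'a'
plus = matches(r'ca+t', words)        # one or more 'a'
optional = matches(r'home-?brew', ['homebrew', 'home-brew', 'home--brew'])
bounded = matches(r'a/{1,3}b', ['ab', 'a/b', 'a//b', 'a///b', 'a////b'])

# The {m,n} spellings select exactly the same strings.
same_as_star = matches(r'ca{0,}t', words) == star
same_as_plus = matches(r'ca{1,}t', words) == plus

print(star, plus, optional, bounded, same_as_star, same_as_plus)
```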

Using Regular Expressions

Now that we’ve looked at some simple regular expressions, how do we actually use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.

Compiling Regular Expressions

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

>>> import re
>>> p = re.compile('ab*')
>>> p
re.compile('ab*')

re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now a single example will do:

>>> p = re.compile('ab*', re.IGNORECASE)

The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. (There are applications that don’t need REs at all, so there’s no need to bloat the language specification by including them.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.

Putting REs in strings keeps the Python language simpler, but has one disadvantage which is the topic of the next section.

The Backslash Plague

As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.

Let’s say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, escape any backslashes and other metacharacters by preceding them with a backslash; this gives \\section, which is the string that must be passed to re.compile(). However, to express this as a Python string literal, both backslashes must be escaped again.

Characters     Stage
\section       Text string to be matched
\\section      Escaped backslash for re.compile()
"\\\\section"  Escaped backslashes for a string literal

In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.

The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
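
A quick sketch makes the difference between the two notations concrete. The sample LaTeX text is an assumption for illustration.

```python
import re

# A raw string keeps the backslash; a regular string interprets the escape.
raw_len = len(r"\n")       # 2: a backslash and 'n'
cooked_len = len("\n")     # 1: a single newline character

# Both literals spell the same regular expression, \\section.
same_pattern = "\\\\section" == r"\\section"

# The compiled RE matches a literal backslash followed by 'section'.
p = re.compile(r"\\section")
m = p.search(r"see \section{Introduction}")

print(raw_len, cooked_len, same_pattern, m.group())
```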

In addition, special escape sequences that are valid in regular expressions, but not valid as Python string literals, now result in a DeprecationWarning and will eventually become a SyntaxError, which means the sequences will be invalid if raw string notation or escaping the backslashes isn’t used.

Regular string   Raw string
"ab*"            r"ab*"
"\\\\section"    r"\\section"
"\\w+\\s+\\1"    r"\w+\s+\1"

Performing Matches

Once you have an object representing a compiled regular expression, what do you do with it? Pattern objects have several methods and attributes. Only the most significant ones will be covered here; consult the re docs for a complete listing.

Method/Attribute  Purpose
match()           Determine if the RE matches at the beginning of the string.
search()          Scan through a string, looking for any location where this RE matches.
findall()         Find all substrings where the RE matches, and return them as a list.
finditer()        Find all substrings where the RE matches, and return them as an iterator.

match() and search() return None if no match can be found. If they’re successful, a match object instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

You can learn about this by interactively experimenting with the re module. If you have tkinter available, you may also want to look at Tools/demo/redemo.py, a demonstration program included with the Python distribution. It allows you to enter REs and strings, and displays whether the RE matches or fails. redemo.py can be quite useful when trying to debug a complicated RE.

This HOWTO uses the standard Python interpreter for its examples. First, run the Python interpreter, import the re module, and compile a RE:

>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')

Now, you can try matching various strings against the RE [a-z]+. An empty string shouldn’t match at all, since + means ‘one or more repetitions’. match() should return None in this case, which will cause the interpreter to print no output. You can explicitly print the result of match() to make this clear.

>>> p.match("")
>>> print(p.match(""))
None

Now, let’s try it on a string that it should match, such as tempo. In this case, match() will return a match object, so you should store the result in a variable for later use.

>>> m = p.match('tempo')
>>> m
<re.Match object; span=(0, 5), match='tempo'>

Now you can query the match object for information about the matching string. Match object instances also have several methods and attributes; the most important ones are:

Method/Attribute  Purpose
group()           Return the string matched by the RE
start()           Return the starting position of the match
end()             Return the ending position of the match
span()            Return a tuple containing the (start, end) positions of the match

Trying these methods will soon clarify their meaning:

>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)

group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both start and end indexes in a single tuple. Since the match() method only checks if the RE matches at the start of a string, start() will always be zero. However, the search() method of patterns scans through the string, so the match may not start at zero in that case.

>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)
<re.Match object; span=(4, 11), match='message'>
>>> m.group()
'message'
>>> m.span()
(4, 11)

In actual programs, the most common style is to store the match object in a variable, and then check if it was None. This usually looks like:

p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print('Match found: ', m.group())
else:
    print('No match')
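
The skeleton above leaves the pattern elided. One way to fill it in, purely as an assumption for illustration, is to reuse the [a-z]+ pattern from the earlier session; the function name describe() is hypothetical.

```python
import re

p = re.compile(r'[a-z]+')   # assumed pattern, borrowed from the earlier examples

def describe(text):
    """Report whether the pattern matches at the start of text."""
    m = p.match(text)
    if m:
        # A match object is always truthy, so this branch means success.
        return 'Match found: ' + m.group()
    return 'No match'

print(describe('tempo'))
print(describe('::: message'))
```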

Two pattern methods return all of the matches for a pattern. findall() returns a list of matching strings:

>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']

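Note that the following, written without the r prefix, currently returns the same result, although recent Python versions warn about the unrecognized escape sequence \d: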

>>> p = re.compile('\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']

Note:

The r prefix, making the literal a raw string literal, is needed in this example because escape sequences in a normal “cooked” string literal that are not recognized by Python, as opposed to regular expressions, now result in a DeprecationWarning and will eventually become a SyntaxError. See The Backslash Plague.

findall() has to create the entire list before it can be returned as the result. The finditer() method returns a sequence of match object instances as an iterator:

>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator  
<callable_iterator object at 0x...>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
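
Because finditer() yields match objects rather than plain strings, each result carries both the matched text and its position. A minimal sketch, reusing the string from the session above:

```python
import re

p = re.compile(r'\d+')

# Collect the matched text together with its span for every match.
found = [(m.group(), m.span())
         for m in p.finditer('12 drummers drumming, 11 ... 10 ...')]

print(found)
```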

Module-Level Functions

You don’t have to create a pattern object and call its methods; the re module also provides top-level functions called match(), search(), findall(), sub(), and so forth. These functions take the same arguments as the corresponding pattern method with the RE string added as the first argument, and still return either None or a match object instance.

>>> print(re.match(r'From\s+', 'Fromage amk'))
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')  
<re.Match object; span=(0, 5), match='From '>

Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE won’t need to parse the pattern again and again.

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.
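
The two styles give the same results, so the choice is mostly about convenience. The sketch below compares them on a small list of lines; the sample data is an assumption.

```python
import re

lines = ['From amk Thu May 14', 'Fromage amk', 'From  bob']

# Style 1: compile once, then call the pattern's method inside the loop.
from_re = re.compile(r'From\s+')
compiled_hits = [line for line in lines if from_re.match(line)]

# Style 2: call the module-level function with the RE string each time.
module_hits = [line for line in lines if re.match(r'From\s+', line)]

print(compiled_hits == module_hits, compiled_hits)
```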

Compilation Flags

Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I. (If you’re familiar with Perl’s pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
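
As a sketch of combining flags with bitwise OR, the following compiles the same pattern with the long and the short flag names. The two-line sample text is an assumption.

```python
import re

text = 'From: a\nfrom: b'

long_form = re.compile(r'^from', re.IGNORECASE | re.MULTILINE)
short_form = re.compile(r'^from', re.I | re.M)

no_flags = re.findall(r'^from', text)         # case-sensitive, start of string only
ignore_case = re.findall(r'^from', text, re.I)
both_flags = long_form.findall(text)

print(no_flags, ignore_case, both_flags)
print(long_form.flags == short_form.flags)
```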

Here’s a table of the available flags, followed by a more detailed explanation of each one.

Flag                         Meaning
ASCII, A                     Make several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
DOTALL, S                    Make . match any character, including newlines.
IGNORECASE, I                Do case-insensitive matches.
LOCALE, L                    Do a locale-aware match.
MULTILINE, M                 Multi-line matching, affecting ^ and $.
VERBOSE, X (for ‘extended’)  Enable verbose REs, which can be organized more cleanly and understandably.

For example, here’s a RE that uses re.VERBOSE; see how much easier it is to read?

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

Without the verbose setting, the RE would look like this:

charref = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")

In the above example, Python’s automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it’s still more difficult to understand than the version using re.VERBOSE.
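
To confirm that the two spellings describe the same pattern, the sketch below runs both against a handful of sample strings. The samples are assumptions chosen to hit each alternative plus two non-matches.

```python
import re

verbose = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

compact = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

samples = ['&#65;', '&#0101;', '&#x41;', '&amp;', '&#;']
verbose_results = [bool(verbose.fullmatch(s)) for s in samples]
compact_results = [bool(compact.fullmatch(s)) for s in samples]

print(verbose_results)
print(verbose_results == compact_results)
```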

More Pattern Power

So far we’ve only covered a part of the features of regular expressions. In this section, we’ll cover some new metacharacters, and how to use groups to retrieve portions of the text that was matched.

More Metacharacters

There are some metacharacters that we haven’t covered yet. Most of them will be covered in this section.

Some of the remaining metacharacters to be discussed are zero-width assertions. They don’t cause the engine to advance through the string; instead, they consume no characters at all, and simply succeed or fail. For example, \b is an assertion that the current position is located at a word boundary; the position isn’t changed by the \b at all. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times.
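
A short sketch of \b in action; the sample sentences are assumptions. The last line shows that a successful \b on its own consumes nothing: the match it reports is empty.

```python
import re

p = re.compile(r'\bclass\b')

whole_word = p.search('no class at all')               # 'class' as its own word
inside_word = p.search('the declassified algorithm')   # 'class' inside a longer word

# \b alone succeeds at the start of 'word' but covers zero characters.
boundary = re.search(r'\b', 'word')

print(whole_word.span(), inside_word, boundary.span())
```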

Grouping

Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a ':', like this:

From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com

This can be handled by writing a regular expression which matches an entire header line, and has one group which matches the header name, and another group which matches the header’s value.

Groups are marked by the '(', ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of ab.

(In other words, whatever '(' and ')' enclose is treated as a single unit, unlike the characters inside a [ ] class; the example below shows this.)

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)

Groups indicated with '(', ')' also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to group(), start(), end(), and span(). Groups are numbered starting with 0. Group 0 is always present; it’s the whole RE, so match object methods all have group 0 as their default argument. Later we’ll see how to express groups that don’t capture the span of text that they match.

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.

(For instance, the example below has only groups 0 through 2: the pattern contains two opening parentheses, so there are two subgroups in addition to group 0.)

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

group() can be passed multiple group numbers at a time, in which case it will return a tuple containing the corresponding values for those groups.

>>> m.group(2,1,2)
('b', 'abc', 'b')

The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.

>>> m.groups()
('abc', 'b')
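
Returning to the header lines shown at the start of this section, the sketch below splits each one into a name and a value with two groups. The pattern is an assumption written for these sample lines; it is not a full RFC-822 parser and ignores details such as folded headers.

```python
import re

# Group 1: everything up to the first ':'; group 2: the rest of the line.
header_re = re.compile(r'([^:]+):\s*(.*)')

headers = [
    'From: author@example.com',
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
    'MIME-Version: 1.0',
    'To: editor@example.com',
]

parsed = [header_re.match(line).groups() for line in headers]

for name, value in parsed:
    print(name, '->', value)
```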

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python’s string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE.

For example, the following RE detects doubled words in a string.

>>> p = re.compile(r'\b(\w+)\s+\1\b')
>>> p.search('Paris in the the spring').group()
'the the'

Backreferences like this aren’t often useful for just searching through a string — there are few text formats which repeat data in this way — but you’ll soon find out that they’re very useful when performing string substitutions.
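
As a brief preview of the substitution case, the sketch below uses the pattern's sub() method, which this part of the HOWTO has only mentioned by name, to collapse a doubled word into a single copy. The replacement string r'\1' refers to whatever group 1 matched.

```python
import re

p = re.compile(r'\b(\w+)\s+\1\b')

# Replace each doubled word with one copy of the word captured in group 1.
fixed = p.sub(r'\1', 'Paris in the the spring')

# The trailing \b prevents a false hit when the second word merely
# starts with the first one.
no_double = p.search('the then')

print(fixed, no_double)
```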

Non-capturing and Named Groups

(Note: this section is fairly complicated, and I did not fully understand it.)

Elaborate REs may use many groups, both to capture substrings of interest, and to group and structure the RE itself. In complex REs, it becomes difficult to keep track of the group numbers. There are two features which help with this problem. Both of them use a common syntax for regular expression extensions, so we’ll look at that first.

Perl 5 is well known for its powerful additions to standard regular expressions. For these new features the Perl developers couldn’t choose new single-keystroke metacharacters or new special sequences beginning with \ without making Perl’s regular expressions confusingly different from standard REs. If they chose & as a new metacharacter, for example, old expressions would be assuming that & was a regular character and wouldn’t have escaped it by writing \& or [&].

Perl 5以其对标准正则表达式的强大增强而闻名。对于这些新特性Perl开发人员不能选择以\开头但没有不同于标准RE的令人困惑的Perl正则表达式的新的单击键元字符或特殊序列。如果他们选择了&作为一种新的元字符,例如,旧的表达式将假设&是一个正则字符,并通过写\&[&]以使&不转义。

The solution chosen by the Perl developers was to use (?...) as the extension syntax. ? immediately after a parenthesis was a syntax error because the ? would have nothing to repeat, so this didn’t introduce any compatibility problems. The characters immediately after the ? indicate what extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).

Perl开发人员选择的解决方案是使用(?...)作为扩展语法。?后面紧跟着一个括号是一个语法错误,因为?没有什么可以重复的,所以这并没有带来任何兼容性问题。?紧后字符表示使用的是哪个扩展名,因此(?=foo)是一个东西(一个积极的前向断言,下一节将介绍前向断言),(?:foo)是另一个东西(一个包含子表达式foo的非捕获组)。

Python supports several of Perl’s extensions and adds its own extension syntax on top of Perl’s. If the first character after the question mark is a P, you know that it’s an extension that’s specific to Python.

Python支持几个Perl扩展,并将扩展语法添加到Perl扩展语法中。如果问号后面的第一个字符是P,那么您就知道它是Python特有的扩展。

Now that we’ve looked at the general extension syntax, we can return to the features that simplify working with groups in complex REs.

现在我们已经了解了一般的扩展语法,我们可以返回到在复杂REs中简化使用组的特性。

Sometimes you’ll want to use a group to denote a part of a regular expression, but aren’t interested in retrieving the group’s contents. You can make this fact explicit by using a non-capturing group: (?:...), where you can replace the ... with any other regular expression.

有时,您可能希望使用组来表示正则表达式的一部分,但对检索组的内容不感兴趣。您可以通过使用非捕获组(?:...)来明确这一事实,其中可以将...替换为任何其它正则表达式。

>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()

Note:

Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as *, and nest it within other groups (capturing or non-capturing). (?:...) is particularly useful when modifying an existing pattern, since you can add new groups without changing how all the other groups are numbered. It should be mentioned that there’s no performance difference in searching between capturing and non-capturing groups; neither form is any faster than the other.

除了您无法检索组匹配的内容之外,非捕获组的行为与捕获组完全相同;您可以将任何内容放入其中,使用重复元字符(例如*)重复它,并将其嵌入其他组(捕捉组或非捕捉组)。(?:...)在修改现有模式时特别有用,因为您可以添加新的组,而不必更改所有其他组的编号方式。需要指出的是,捕获组和非捕获组的搜索性能没有区别;这两种形式都不如另一种形式快。
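
To illustrate the point about numbering, here is a minimal sketch in which an existing two-group pattern is extended with a prefix. Both patterns and the sample strings are made up for illustration. Because the new alternation is wrapped in (?:...), the numeric groups keep the numbers 1 and 2.

```python
import re

# Original pattern: two capturing groups, numbered 1 and 2.
original = re.compile(r'(\d+)-(\d+)')

# Extended pattern: the new prefix alternation is wrapped in (?:...),
# so the two numeric groups keep their numbers.
extended = re.compile(r'(?:ID|No)\s*(\d+)-(\d+)')

m1 = original.match('12-34')
m2 = extended.match('ID 12-34')
print(m1.groups())               # ('12', '34')
print(m2.groups())               # ('12', '34')
print(m2.group(1), m2.group(2))  # 12 34
```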

A more significant feature is named groups: instead of referring to them by numbers, groups can be referenced by a name.

一个更重要的特性是命名组:可以通过名称引用组,而不是通过数字引用组。

The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group. Named groups behave exactly like capturing groups, and additionally associate a name with a group. The match object methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group’s name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

指定组的语法是特定于python的扩展之一:(?P<name>...)。显然,name是组的名称。命名组的行为与捕获组完全相同,并且将名称与组关联起来。处理捕获组的match object方法都接受通过数字引用组的整数或包含所需组名称的字符串。命名组仍然是给定的数字,所以您可以通过两种方式检索关于一个组的信息:

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'

Additionally, you can retrieve named groups as a dictionary with groupdict():

此外,作为一个字典,您可通过groupdict()检索命名组:

>>> m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe')
>>> m.groupdict()
{'first': 'Jane', 'last': 'Doe'}

Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers. Here’s an example RE from the imaplib module:

命名组很方便,因为它让您使用容易记住的名称,而不必记住数字。以下是来自imaplib模块的一个例子:

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

It’s obviously much easier to retrieve m.group('zonem'), instead of having to remember to retrieve group 9.

显然,检索m.group('zonem')要容易得多,而不必记住检索第9组。
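
The comparison can be tried out directly. The sketch below reproduces the pattern quoted above so that it is self-contained, and applies it to a made-up sample string (the date and time zone are illustrative assumptions, not data from imaplib).

```python
import re

# The pattern quoted above, reproduced so that this sketch is self-contained.
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Made-up sample input, for illustration only.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

print(m.group('zonem'))  # 00
print(m.group(9))        # 00, the same group looked up by number
print(m.group('day'), m.group('mon'), m.group('year'))  # 17 Jul 1996
```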

The syntax for backreferences in an expression such as (...)\1 refers to the number of the group. There’s naturally a variant that uses the group name instead of the number. This is another Python extension: (?P=name) indicates that the contents of the group called name should again be matched at the current point. The regular expression for finding doubled words, \b(\w+)\s+\1\b can also be written as \b(?P<word>\w+)\s+(?P=word)\b:

(...)\1等表达式中的反向引用语法指的是组的数量。很自然,有一种变体使用组名而不是数字。这是另一个Python扩展:(?P=name)表示名为name的组的内容应该在当前点再次匹配。查找双引号的正则表达式\b(\w+)\s+\1\b也可以写成\b(?P<word>\w+)\s+(?P=word)\b

>>> p = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
>>> p.search('Paris in the the spring').group()
'the the'

Lookahead Assertions

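Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:

(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn’t advance at all; the rest of the pattern is tried right where the assertion started.

(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.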

另一个零宽度断言是前向断言。前向断言有正反两种形式,如下所示:
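
As a minimal sketch of the two forms, (?=...) for a positive lookahead and (?!...) for a negative one, the following shows that a lookahead checks what comes next without consuming it. The patterns and the sample string are chosen purely for illustration.

```python
import re

# Positive lookahead: match a word only if a '.' follows; the '.' is not consumed.
m = re.search(r'\w+(?=[.])', 'news.rc')
print(m.group(), m.span())  # news (0, 4)

# Negative lookahead: match whole words that are NOT followed by a '.'.
words = re.findall(r'\b\w+\b(?![.])', 'news.rc')
print(words)  # ['rc']
```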

To make this concrete, let’s look at a case where a lookahead is useful. Consider a simple pattern to match a filename and split it apart into a base name and an extension, separated by a . character. For example, in news.rc, news is the base name, and rc is the filename’s extension.

为了使其更具体,我们来看一个使用前向的情况。考虑一个简单的模式来匹配一个文件名,并将其拆分为一个基本名和一个扩展名,中间用.分隔。例如,在news.rc中,news是基本名,rc是文件名的扩展名。

The pattern to match this is quite simple:

匹配的模式很简单:

.*[.].*$

Notice that the . needs to be treated specially because it’s a metacharacter, so it’s inside a character class to only match that specific character. Also notice the trailing $; this is added to ensure that all the rest of the string must be included in the extension. This regular expression matches foo.bar and autoexec.bat and sendmail.cf and printers.conf.

注意.需要特别处理,因为它是一个元字符,所以它在一个字符类中(即用[]包裹成[.],当然也可加\来转义写成\.),只匹配特定的字符。还要注意结尾的$;这是为了确保所有其余的字符串必须包含在扩展中。这个正则表达式匹配foo.barautoexec.batsendmail.cfprinters.conf

Now, consider complicating the problem a bit; what if you want to match filenames where the extension is not bat? Some incorrect attempts:

现在,考虑一下把问题复杂化一点;如果要匹配扩展名不是bat的文件名,该怎么办?一些错误的尝试:

.*[.][^b].*$

The first attempt above tries to exclude bat by requiring that the first character of the extension is not a b. This is wrong, because the pattern also doesn’t match foo.bar.

上面的第一次尝试通过要求扩展的第一个字符不是b来排除bat。这是错误的,因为模式也不匹配foo.bar

.*[.]([^b]..|.[^a].|..[^t])$

The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension isn’t b; the second character isn’t a; or the third character isn’t t. This accepts foo.bar and rejects autoexec.bat, but it requires a three-letter extension and won’t accept a filename with a two-letter extension such as sendmail.cf. We’ll complicate the pattern again in an effort to fix it.

当您试图通过要求匹配下列情况之一来修补第一个解决方案时,表达式变得更混乱:扩展的第一个字符不是b;第二个字母不是a;或者第三个字符不是t。这接受foo.bar并拒绝autoexec.bat,但它需要三个字母的扩展名,不接受两个字母扩展名的文件名,如sendmail.cf。为了修复它,我们将再次使模式复杂化。

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$

In the third attempt, the second and third letters are both made optional in order to allow matching extensions shorter than three characters, such as sendmail.cf.

在第三次尝试中,第二和第三个字母都是可选的,以便允许匹配短于三个字符的扩展名,比如sendmail.cf

The pattern’s getting really complicated now, which makes it hard to read and understand. Worse, if the problem changes and you want to exclude both bat and exe as extensions, the pattern would get even more complicated and confusing.

这个模式现在变得非常复杂,很难阅读和理解。更糟糕的是,如果问题发生了变化,并且您想要同时排除batexe扩展,那么模式将变得更加复杂和混乱。

A negative lookahead cuts through all this confusion:

一个反的前向断言打破所有的混乱:

.*[.](?!bat$)[^.]*$

The negative lookahead means: if the expression bat$ doesn’t match at this point, try the rest of the pattern; if bat$ does match, the whole pattern will fail. The trailing $ is required to ensure that something like sample.batch, where the extension only starts with bat, will be allowed. The [^.]* makes sure that the pattern works when there are multiple dots in the filename.

反的前向断言的意义:如果表达式bat此时不匹配,则尝试其余的模式;如果bat$确实匹配,那么整个模式将失败。后面的$是用来确保类似sample.batch的内容,这里扩展名仅以bat开始,将被允许。[^.]*确保当文件名中有多个点时模式有效。

Excluding another filename extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes filenames that end in either bat or exe:

排除另一个文件名扩展现在很容易;只需在断言中添加它作为替代。以下模式排除以batexe结尾的文件名:

.*[.](?!bat$|exe$)[^.]*$
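
The final pattern can be checked against the filenames mentioned in this section. This is a sketch: the name not_bat_or_exe and the extra filename setup.exe are assumptions added for illustration.

```python
import re

# The final pattern from above: reject extensions that are exactly 'bat' or 'exe'.
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$)[^.]*$')

# Filenames from the text, plus a made-up 'setup.exe'.
filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'printers.conf',
             'sample.batch', 'setup.exe']

accepted = [name for name in filenames if not_bat_or_exe.match(name)]
print(accepted)
# ['foo.bar', 'sendmail.cf', 'printers.conf', 'sample.batch']
```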

Modifying Strings

Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:

到目前为止,我们只是对一个静态字符串执行搜索。正则表达式还通常用于以各种方式修改字符串,使用以下模式方法:

Method/Attribute    Purpose
split()             Split the string into a list, splitting it wherever the RE matches
sub()               Find all substrings where the RE matches, and replace them with a different string
subn()              Does the same thing as sub(), but returns the new string and the number of replacements
方法/属性 目的
split() 将字符串拆分为一个列表,在任何与RE匹配的地方将其拆分
sub() 找到所有与RE匹配的子字符串,并用不同的字符串替换它们
subn() 执行与sub()相同的操作,但返回新字符串和替换的数量

Splitting Strings

The split() method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It’s similar to the split() method of strings but provides much more generality in the delimiters that you can split by; string split() only supports splitting by whitespace or by a fixed string. As you’d expect, there’s a module-level re.split() function, too.

模式的split()方法将字符串在任何与RE匹配的地方分开,返回片段列表。它类似于字符串的split()方法,但是在分隔符中提供了更多的通用性,您可以根据这些分隔符进行分割;字符串的split()只支持空格或固定字符串的分割。如您所料,还有一个模块级的re.split()函数。

You can limit the number of splits made, by passing a value for maxsplit. When maxsplit is nonzero, at most maxsplit splits will be made, and the remainder of the string is returned as the final element of the list. In the following example, the delimiter is any sequence of non-alphanumeric characters.

您可以通过为maxsplit传递一个值来限制分割的数量。当maxsplit为非零时,将最多执行maxsplit次分割,并将字符串(分割得到的)部分作为列表的最后一个元素返回。在下面的示例中,分隔符是任何非字母数字字符序列。

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']

Sometimes you’re not only interested in what the text between delimiters is, but also need to know what the delimiter was. If capturing parentheses are used in the RE, then their values are also returned as part of the list. Compare the following calls:

有时,您不仅对分隔符之间的文本是什么感兴趣,还需要知道分隔符是什么。如果捕获括号在RE中使用,那么它们的值也作为列表的一部分返回。比较以下调用:

>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']

The module-level function re.split() adds the RE to be used as the first argument, but is otherwise the same.

模块级的函数re.split()添加RE作为第一个参数使用,但在其他方面是相同的。

>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']
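
Capturing the delimiter is handy when the delimiter itself carries meaning. The following sketch uses a hypothetical tokenizer for a simple arithmetic expression: only the operator sits inside the capturing group, so the surrounding whitespace is dropped while the operator is kept.

```python
import re

# Hypothetical tokenizer: split on an arithmetic operator, ignoring the
# whitespace around it. Only the operator is inside the capturing group,
# so only the operator is kept in the result.
tokens = re.split(r'\s*([-+*/])\s*', '3 + 4*5 - 6')
print(tokens)  # ['3', '+', '4', '*', '5', '-', '6']
```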

Search and Replace

Another common task is to find all the matches for a pattern, and replace them with a different string. The sub() method takes a replacement value, which can be either a string or a function, and the string to be processed.

另一个常见的任务是查找模式的所有匹配项,并用不同的字符串替换它们。sub()方法接受一个替换值(可以是字符串或函数)和要处理的字符串。

Here’s a simple example of using the sub() method. It replaces colour names with the word colour:

下面是一个使用sub()方法的简单示例。它将颜色名称替换为colour:

>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'

The subn() method does the same work, but returns a 2-tuple containing the new string value and the number of replacements that were performed:

subn()方法执行相同的工作,但返回一个二元组,其中包含新的字符串值和执行的替换数量:

>>> p = re.compile('(blue|white|red)')
>>> p.subn('colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn('colour', 'no colours at all')
('no colours at all', 0)

Empty matches are replaced only when they’re not adjacent to a previous empty match.

只有当空匹配项与前一个空匹配项不相邻时,才会替换它们。

>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b--d-'

If replacement is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by the corresponding group in the RE. This lets you incorporate portions of the original text in the resulting replacement string.

如果replacement是一个字符串,则处理其中的任何反斜杠转义。也就是说,\n转换为单个换行字符,\r转换为回车符,等等。未知的转义如\&则被单独保留。反向引用如\6,由RE中相应组匹配的子字符串替换。这使您可以将原始文本的一部分合并到结果替换字符串中。

This example matches the word section followed by a string enclosed in {, }, and changes section to subsection:

这个例子匹配了section这个单词,后面跟着一个用{}括起来的字符串,并将section改为subsection

>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First} section{second}')
'subsection{First} subsection{second}'

There’s also a syntax for referring to named groups as defined by the (?P<name>...) syntax. \g<name> will use the substring matched by the group named name, and \g<number> uses the corresponding group number. \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement string such as \g<2>0. (\20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'.) The following substitutions are all equivalent, but use all three variations of the replacement string.

还有一个语法用于引用由(?P<name>...)语法定义的命名组。\g<name>将使用与名为name的组匹配的子字符串,\g<number>将使用相应的组号。因此,\g<2>等价于\2,但在替换字符串如\g<2>0中没有歧义(\20会被解释为对第20组的引用,而不是对第2组的引用,后者后跟字面的字符'0'。)。下面的替换都是等价的,使用替换字符串的所有三种变体。

>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}','section{First}')
'subsection{First}'
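
The ambiguity that \g<number> resolves can be seen in a short sketch. The one-group pattern and the input string are made up for illustration. With only one group in the pattern, \10 is read as a reference to a nonexistent group 10, which the re module reports as an error.

```python
import re

p = re.compile(r'(\d)')

# \g<1>0 means "group 1, then a literal 0".
result = p.sub(r'\g<1>0', 'a1b2')
print(result)  # a10b20

# \10 is read as a reference to group 10, which this pattern does not have.
try:
    p.sub(r'\10', 'a1b2')
    outcome = 'no error'
except re.error as exc:
    outcome = 'error'
    print('re.error:', exc)
```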

replacement can also be a function, which gives you even more control. If replacement is a function, the function is called for every non-overlapping occurrence of pattern. On each call, the function is passed a match object argument for the match and can use this information to compute the desired replacement string and return it.

replacement也可以是一个函数,它可以给你更多的控制。如果replacement是一个函数,则对pattern的每个非重叠出现调用该函数。在每次调用时,函数都会被传递一个match object参数,用于匹配,并且可以使用该信息来计算所需的替换字符串并返回它。

In the following example, the replacement function translates decimals into hexadecimal:

在下面的例子中,替换函数将小数转换为十六进制:

>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'

When using the module-level re.sub() function, the pattern is passed as the first argument. The pattern may be provided as an object or as a string. If you need to specify regular expression flags, you can use a pattern object as the first parameter, use embedded modifiers in the pattern string, or pass the flags keyword argument that re.sub() accepts in current Python 3 releases; e.g. re.sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'.

当使用模块级的re.sub()函数时,模式作为第一个参数传递。模式可以以对象或字符串提供;如果需要指定正则表达式标志,则必须使用模式对象作为第一个参数,或者在模式字符串中使用嵌入的修饰符,例如,sub("(?i)b+", "x", "bbbb BBBB")返回'x x'
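
The ways of supplying a flag to the module-level function can be compared side by side. This sketch assumes a current Python 3 release, where re.sub() also accepts a flags keyword argument; all three calls are expected to give the same result.

```python
import re

text = 'bbbb BBBB'

# 1. Embedded modifier in the pattern string.
a = re.sub('(?i)b+', 'x', text)

# 2. Compiled pattern object carrying the flag.
b = re.sub(re.compile('b+', re.IGNORECASE), 'x', text)

# 3. The flags keyword argument of re.sub().
c = re.sub('b+', 'x', text, flags=re.IGNORECASE)

print(a, '|', b, '|', c)  # x x | x x | x x
```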

Common Problems

Regular expressions are a powerful tool for some applications, but in some ways their behaviour isn’t intuitive and at times they don’t behave the way you may expect them to. This section will point out some of the most common pitfalls.

正则表达式对于某些应用程序来说是一个强大的工具,但在某些方面,它们的行为并不直观,有时它们的行为也不像您期望的那样。本节将指出一些最常见的陷阱。

Use String Methods

Sometimes using the re module is a mistake. If you’re matching a fixed string, or a single character class, and you’re not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine.

有时使用re模块是一个错误。如果您匹配的是一个固定的字符串或一个字符类,并没有使用任何re特性,比如IGNORECASE'标志,那么可能就不需要正则表达式的全部功能。字符串有几个方法来执行固定字符串的操作,它们通常要快得多,因为实现是一个单独的小的C循环,并为此进行了优化,而不是大型的、更一般化的正则表达式引擎。

One example might be replacing a single fixed string with another one; for example, you might replace word with deed. re.sub() seems like the function to use for this, but consider the replace() method. Note that replace() will also replace word inside words, turning swordfish into sdeedfish, but the naive RE word would have done that, too. (To avoid performing the substitution on parts of words, the pattern would have to be \bword\b, in order to require that word have a word boundary on either side. This takes the job beyond replace()’s abilities.)

一个例子可能是用另一个字符串替换一个固定的字符串;例如,你可以将word替换为deedre.sub()似乎是用于此目的的函数,但请考虑replace()方法。注意,replace()也将替换word内部单词,将swordfish转换为sdeedfish,但朴素RE的word也会这样做。(为了避免对单词的部分进行替换,模式必须是\bword\b,以便要求word的两边都有单词边界。这使工作超出了replace()的能力范围。
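
The three approaches described above can be compared on a small made-up input:

```python
import re

text = 'swordfish word'

plain = text.replace('word', 'deed')          # string method
naive = re.sub('word', 'deed', text)          # naive RE, same behaviour
bounded = re.sub(r'\bword\b', 'deed', text)   # whole words only

print(plain)    # sdeedfish deed
print(naive)    # sdeedfish deed
print(bounded)  # swordfish deed
```

Only the last form needs the re module; for the first two results, replace() is the simpler choice.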

Another common task is deleting every occurrence of a single character from a string or replacing it with another single character. You might do this with something like re.sub('\n', ' ', S), but translate() is capable of doing both tasks and will be faster than any regular expression operation can be.

另一个常见的任务是从字符串中删除单个字符的任何出现,或者用另一个字符替换它。你可以用re.sub('\n', ' ', S),但是translate()能够完成这两个任务,而且比任何正则表达式操作都要快。
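
A minimal sketch of both tasks with translate() follows. It uses str.maketrans() to build the translation tables; the sample string S is made up for illustration.

```python
import re

S = 'line one\nline two\n'

# Replace every newline with a space.
replaced = S.translate(str.maketrans('\n', ' '))
print(repr(replaced))  # 'line one line two '

# Delete every newline: the third argument of str.maketrans() lists
# the characters to remove.
deleted = S.translate(str.maketrans('', '', '\n'))
print(repr(deleted))  # 'line oneline two'

# The regular expression version gives the same result as the first call.
same = re.sub('\n', ' ', S) == replaced
print(same)  # True
```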

In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.

简而言之,在转到re模块之前,请考虑是否可以使用更快更简单的字符串方法来解决您的问题。

match() versus search()

The match() function only checks whether the RE matches at the beginning of the string, while search() scans forward through the string for a match. It’s important to keep this distinction in mind. Remember, match() only reports a successful match that starts at position 0; if the match wouldn’t start at position 0, match() will not report it.

match()函数只检查字符串开头的RE是否匹配,而search()将向前扫描整个字符串以查找匹配。记住这一点很重要。记住,search()将只报告一个从0开始的成功匹配;如果匹配不是从零开始,search()将不会报告它。

>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None

On the other hand, search() will scan forward through the string, reporting the first match it finds.

另一方面,search()将向前扫描字符串,报告它找到的第一个匹配项。

>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)

Sometimes you’ll be tempted to keep using re.match(), and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.

有时您可能会忍不住继续使用re.match(),然后添加.*到你的RE的前面。抵制这种诱惑,用re.search()来代替。正则表达式编译器对REs进行一些分析,以加快查找匹配项的过程。其中一个分析指出了匹配的第一个字符必须是什么;例如,以Crow开头的模式必须与以'C'开头的模式匹配。该分析允许引擎快速扫描字符串,寻找起始字符,只有在找到'C'时才尝试完全匹配。

Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.

添加.*破坏了这个优化,需要扫描到字符串的末尾,然后回溯查找匹配的RE。使用re.search()代替。
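
The two approaches can be compared on the earlier example string. Both find the text, but the reported spans differ, because the .* prefix becomes part of the match. This sketch shows the behaviour only and does not measure speed.

```python
import re

s = 'insuperable'

m1 = re.match('.*super', s)   # works, but forces a scan plus backtracking
m2 = re.search('super', s)    # preferred

print(m1.span())  # (0, 7)
print(m2.span())  # (2, 7)
```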

Greedy versus Non-Greedy

When repeating a regular expression, as in a*, the resulting action is to consume as much of the string as possible. This fact often bites you when you’re trying to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag doesn’t work because of the greedy nature of .*.

当重复一个正则表达式时,如在a*中,结果操作是匹配尽可能多的模式。当您试图匹配一对平衡的分隔符时,如围绕HTML标记的尖括号,这一事实常常困扰您。由于.*的贪婪特性,匹配单个HTML标记的朴素模式无法工作。

>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>

The RE matches the '<' in '<html>', and the .* consumes the rest of the string. There’s still more left in the RE, though, and the > can’t match at the end of the string, so the regular expression engine has to backtrack character by character until it finds a match for the >. The final match extends from the '<' in '<html>' to the '>' in '</title>', which isn’t what you want.

RE与'<html>'中的'<'匹配,而且.*消耗字符串的其余部分。不过,在RE中还有更多内容,而且>在字符串末尾无法匹配,因此正则表达式引擎必须逐字符回溯,直到找到>的匹配。最后的匹配从'<html>'中的'<'扩展到'</title>'中的'>',这不是您想要的。

In this case, the solution is to use the non-greedy quantifiers *?, +?, ??, or {m,n}?, which match as little text as possible. In the above example, the '>' is tried immediately after the first '<' matches, and when it fails, the engine advances a character at a time, retrying the '>' at every step. This produces just the right result:

在这种情况下,解决方案是使用非贪婪限定符的*?+???{m,n}?,它匹配的文本越少越好。在上面的例子中,'>'在第一个'<'匹配后立即尝试,当它失败时,引擎一次前进一个字符,每一步都重新尝试'>'。这产生了正确的结果:

>>> print(re.match('<.*?>', s).group())
<html>
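
The difference is easier to see with findall(), which returns every match. The sketch below also includes a negated character class, <[^>]*>, as an alternative that is not part of the original text; for this input it gives the same tags as the non-greedy form.

```python
import re

s = '<html><head><title>Title</title>'

greedy = re.findall('<.*>', s)
lazy = re.findall('<.*?>', s)
negated = re.findall('<[^>]*>', s)   # alternative: anything except '>'

print(greedy)   # ['<html><head><title>Title</title>']
print(lazy)     # ['<html>', '<head>', '<title>', '</title>']
print(negated)  # ['<html>', '<head>', '<title>', '</title>']
```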

(Note that parsing HTML or XML with regular expressions is painful. Quick-and-dirty patterns will handle common cases, but HTML and XML have special cases that will break the obvious regular expression; by the time you’ve written a regular expression that handles all of the possible cases, the patterns will be very complicated. Use an HTML or XML parser module for such tasks.)

(注意,用正则表达式解析HTML或XML是很痛苦的。快速而不规范的模式将处理常见的情况,但是HTML和XML有特殊的情况,它们将破坏明显的正则表达式;当您编写了一个处理所有可能情况的正则表达式时,模式将变得非常复杂。使用HTML或XML解析器模块来完成这些任务。)

Using re.VERBOSE

By now you’ve probably noticed that regular expressions are a very compact notation, but they’re not terribly readable. REs of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand.

到目前为止,您可能已经注意到正则表达式是一种非常紧凑的表示法,但是它们的可读性不是很好。中等复杂度的REs可能成为反斜杠、括号和元字符的冗长集合,使它们难于阅读和理解。

For such REs, specifying the re.VERBOSE flag when compiling the regular expression can be helpful, because it allows you to format the regular expression more clearly.

对于此类REs,在编译正则表达式时请指定re.VERBOSE标志很有用,因为它允许您更清晰地格式化正则表达式。

The re.VERBOSE flag has several effects. Whitespace in the regular expression that isn’t inside a character class is ignored. This means that an expression such as dog | cat is equivalent to the less readable dog|cat, but [a b] will still match the characters 'a', 'b', or a space. In addition, you can also put comments inside a RE; comments extend from a # character to the next newline. When used with triple-quoted strings, this enables REs to be formatted more neatly:

re.VERBOSE标记有几个效果。不在字符类内的正则表达式中的空白将被忽略。这意味着dog | cat这样的表达式等价于可读性较差的dog|cat,但是[a b]仍然会匹配字符'a''b'或空格。另外,还可以把注释放在RE里面;注释从#字符扩展到下一个换行符。当与三引号的字符串一起使用时,这使REs的格式更简洁:

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

This is far more readable than:

这比以下更易读:

pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
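
Both spellings compile to patterns that behave the same way, which can be checked with a short sketch. The names verbose and compact and the sample header line are assumptions made for illustration.

```python
import re

verbose = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = 'From:author@example.com   '   # made-up header line
mv = verbose.match(line)
mc = compact.match(line)

print(mv.groupdict())  # {'header': 'From', 'value': 'author@example.com'}
print(mv.groupdict() == mc.groupdict())  # True
```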

Feedback

Regular expressions are a complicated topic. Did this document help you understand them? Were there parts that were unclear, or problems you encountered that weren’t covered here? If so, please send suggestions for improvements to the author.

正则表达式是一个复杂的主题。这份文档对你理解他们有帮助吗?是否有不清楚的地方,或者您遇到的问题没有在这里讨论?如果是,请将改进建议发送给作者。

The most complete book on regular expressions is almost certainly Jeffrey Friedl’s Mastering Regular Expressions, published by O’Reilly. Unfortunately, it exclusively concentrates on Perl and Java’s flavours of regular expressions, and doesn’t contain any Python material at all, so it won’t be useful as a reference for programming in Python. (The first edition covered Python’s now-removed regex module, which won’t help you much.) Consider checking it out from your library.

关于正则表达式最完整的书几乎可以肯定是由O'Reilly出版的Jeffrey Friedl的《精通正则表达式》。不幸的是,它只关注Perl和Java风格的正则表达式,根本不包含任何Python内容,所以作为Python编程的参考并没有什么用处。(第一版覆盖了Python现在已删除的regex模块,这对您帮助不大。)考虑从你的书目删除。