pyspark.sql.functions.regexp_substr

pyspark.sql.functions.regexp_substr(str, regexp)

Returns the first substring that matches the Java regex regexp within the string str. If the regular expression is not found, the result is null.

New in version 3.5.0.

Parameters
str : Column or column name

target column to work on.

regexp : Column or column name

regex pattern to apply.

Returns
Column

the first substring that matches a Java regex within the string str.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("1a 2b 14m", r"\d+")], ["str", "regexp"])

Example 1: Returns the first substring in the str column (passed as a column name) that matches the regex pattern \d+ (one or more digits).

>>> df.select('*', sf.regexp_substr('str', sf.lit(r'\d+'))).show()
+---------+------+-----------------------+
|      str|regexp|regexp_substr(str, \d+)|
+---------+------+-----------------------+
|1a 2b 14m|   \d+|                      1|
+---------+------+-----------------------+

Example 2: Returns the first substring in the str column (passed as a column name) that matches the regex pattern mmm (three consecutive ‘m’ characters). No such substring exists, so the result is NULL.

>>> df.select('*', sf.regexp_substr('str', sf.lit(r'mmm'))).show()
+---------+------+-----------------------+
|      str|regexp|regexp_substr(str, mmm)|
+---------+------+-----------------------+
|1a 2b 14m|   \d+|                   NULL|
+---------+------+-----------------------+

Example 3: Returns the first substring in str (passed as a column name) that matches the regex pattern stored in regexp (passed as a Column).

>>> df.select('*', sf.regexp_substr("str", sf.col("regexp"))).show()
+---------+------+--------------------------+
|      str|regexp|regexp_substr(str, regexp)|
+---------+------+--------------------------+
|1a 2b 14m|   \d+|                         1|
+---------+------+--------------------------+

Example 4: Returns the first substring in str (passed as a Column) that matches the regex pattern stored in regexp (passed as a column name).

>>> df.select('*', sf.regexp_substr(sf.col("str"), "regexp")).show()
+---------+------+--------------------------+
|      str|regexp|regexp_substr(str, regexp)|
+---------+------+--------------------------+
|1a 2b 14m|   \d+|                         1|
+---------+------+--------------------------+
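
The first-match-or-null behaviour described above can be approximated in plain Python, which may help when reasoning about a pattern without a Spark session. This is only a sketch: regexp_substr_sketch is a hypothetical helper, not part of PySpark, and it uses Python's re module as a stand-in for the Java regex engine, so syntax and edge cases (for example empty matches) may differ. Returning None for a None input is an assumption about how null inputs behave.

```python
import re
from typing import Optional


def regexp_substr_sketch(s: Optional[str], pattern: Optional[str]) -> Optional[str]:
    """Hypothetical plain-Python analogue of regexp_substr (not a PySpark API)."""
    # Assumption: a null input yields a null result.
    if s is None or pattern is None:
        return None
    # re.search scans left to right and stops at the first match.
    match = re.search(pattern, s)
    # No match maps to None, mirroring the NULL shown in Example 2.
    return match.group(0) if match else None


print(regexp_substr_sketch("1a 2b 14m", r"\d+"))  # first run of digits
print(regexp_substr_sketch("1a 2b 14m", r"mmm"))  # no match
```

As in Examples 1 and 2, the first call yields the leading digit run and the second yields no value.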