pyspark.sql.functions.regexp_substr

pyspark.sql.functions.regexp_substr(str, regexp)

Returns the first substring of the string str that matches the Java regex regexp. If the regular expression does not match anywhere in str, the result is null.

New in version 3.5.0.

Parameters

str (Column or column name): target column to work on.
regexp (Column or column name): regex pattern to apply.

Returns

Column: the first substring that matches a Java regex within the string str, or null if there is no match.
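The return-value rule described above (first match, otherwise null) can be modelled in plain Python as a rough sketch. The helper name `first_match` is hypothetical and not part of PySpark, and it uses Python's `re` module rather than Java's regex engine, so pattern syntax can differ. Edge cases such as null inputs or empty matches are not modelled here.

```python
import re
from typing import Optional


def first_match(text: str, pattern: str) -> Optional[str]:
    """Hypothetical plain-Python model of the rule: first match, else None.

    Assumption: Python's `re` stands in for Java regex here; the two
    engines agree on simple patterns such as \\d+ but are not identical.
    """
    # re.search scans left to right and stops at the leftmost match.
    m = re.search(pattern, text)
    # group(0) is the whole matched substring; None plays the role of null.
    return m.group(0) if m else None


print(first_match("1a 2b 14m", r"\d+"))  # 1
print(first_match("1a 2b 14m", r"mmm"))  # None
```

The two calls mirror the first two examples below: the pattern `\d+` stops at the leftmost run of digits, and a pattern with no match yields the null-like value.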

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("1a 2b 14m", r"\d+")], ["str", "regexp"])

Example 1: Returns the first substring in the str column that matches the regex pattern (\d+) (one or more digits).

>>> df.select('*', sf.regexp_substr('str', sf.lit(r'\d+'))).show()
+---------+------+-----------------------+
|      str|regexp|regexp_substr(str, \d+)|
+---------+------+-----------------------+
|1a 2b 14m|   \d+|                      1|
+---------+------+-----------------------+

Example 2: Returns null because the str column contains no match for the regex pattern (mmm) (three consecutive ‘m’ characters).

>>> df.select('*', sf.regexp_substr('str', sf.lit(r'mmm'))).show()
+---------+------+-----------------------+
|      str|regexp|regexp_substr(str, mmm)|
+---------+------+-----------------------+
|1a 2b 14m|   \d+|                   NULL|
+---------+------+-----------------------+

Example 3: Passes str as a column name and regexp as a Column; returns the first substring in str that matches the pattern stored in the regexp column.

>>> df.select('*', sf.regexp_substr("str", sf.col("regexp"))).show()
+---------+------+--------------------------+
|      str|regexp|regexp_substr(str, regexp)|
+---------+------+--------------------------+
|1a 2b 14m|   \d+|                         1|
+---------+------+--------------------------+

Example 4: Passes str as a Column and regexp as a column name; returns the first substring in str that matches the pattern stored in the regexp column.

>>> df.select('*', sf.regexp_substr(sf.col("str"), "regexp")).show()
+---------+------+--------------------------+
|      str|regexp|regexp_substr(str, regexp)|
+---------+------+--------------------------+
|1a 2b 14m|   \d+|                         1|
+---------+------+--------------------------+