pyspark.sql.functions.regexp_substr
- pyspark.sql.functions.regexp_substr(str, regexp)
Returns the first substring that matches the Java regex regexp within the string str. If the regular expression is not found, the result is null.
New in version 3.5.0.
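- Parameters
str: Column or column name
the target column to search.
regexp: Column or column name
the Java regex pattern to apply.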
- Returns
Column
the first substring that matches a Java regex within the string str.
Examples
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("1a 2b 14m", r"\d+")], ["str", "regexp"])
Example 1: Returns the first substring in the column named str that matches the regex pattern \d+ (one or more digits), passed as a literal.
>>> df.select('*', sf.regexp_substr('str', sf.lit(r'\d+'))).show()
+---------+------+-----------------------+
|      str|regexp|regexp_substr(str, \d+)|
+---------+------+-----------------------+
|1a 2b 14m|   \d+|                      1|
+---------+------+-----------------------+
Example 2: Searches the column named str for the regex pattern mmm (three consecutive ‘m’ characters). The pattern does not occur, so the result is NULL.
>>> df.select('*', sf.regexp_substr('str', sf.lit(r'mmm'))).show()
+---------+------+-----------------------+
|      str|regexp|regexp_substr(str, mmm)|
+---------+------+-----------------------+
|1a 2b 14m|   \d+|                   NULL|
+---------+------+-----------------------+
Example 3: Returns the first substring in the column named str that matches the regex pattern stored in the regexp column, passed as a Column.
>>> df.select('*', sf.regexp_substr("str", sf.col("regexp"))).show()
+---------+------+--------------------------+
|      str|regexp|regexp_substr(str, regexp)|
+---------+------+--------------------------+
|1a 2b 14m|   \d+|                         1|
+---------+------+--------------------------+
Example 4: Returns the first substring in the str column, passed as a Column, that matches the regex pattern stored in the column named regexp.
>>> df.select('*', sf.regexp_substr(sf.col("str"), "regexp")).show()
+---------+------+--------------------------+
|      str|regexp|regexp_substr(str, regexp)|
+---------+------+--------------------------+
|1a 2b 14m|   \d+|                         1|
+---------+------+--------------------------+
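
For readers who want to reason about the "first match, otherwise null" behavior without a Spark session, the sketch below is a rough plain-Python analogy. It is not how Spark implements the function: `first_match` is a hypothetical helper introduced here for illustration, it uses Python's `re` module rather than Java regex, and the two dialects can differ on edge cases such as empty matches or unusual syntax.

```python
import re
from typing import Optional


def first_match(text: str, pattern: str) -> Optional[str]:
    """Hypothetical helper: return the first substring of `text` that
    matches `pattern`, or None when nothing matches.

    Uses Python's `re` dialect, which only approximates Java regex.
    """
    match = re.search(pattern, text)  # scans left to right for the first match
    return match.group(0) if match else None


# Same inputs as the examples above.
digits = first_match("1a 2b 14m", r"\d+")
missing = first_match("1a 2b 14m", r"mmm")
print(digits)   # 1
print(missing)  # None
```

The helper mirrors Examples 1 and 2: the leftmost run of digits is returned even though longer matches appear later in the string, and a pattern that never occurs produces `None`, the Python counterpart of the NULL shown in the Spark output.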