Regular expressions are fun. Two concepts come up again and again: matching something only if it is followed by a certain pattern (look ahead), and matching something only if it is preceded by a certain pattern (look behind). Let's go through some examples to see what I mean.
LOOK AHEAD
LOOK AHEAD means: I want to see what lies ahead first, and then decide whether to do something now.
Syntax of regex pattern: pattern1(?=pattern2)
Where pattern1 is the pattern for the part that we ACTUALLY want to capture, and pattern2 is the pattern that MUST be found immediately after it. Pattern2 is only checked; it does not become part of the match. Logically speaking,
IF PATTERN2 is FOUND, then print/get/capture/show PATTERN1
Example: Take the string 'Hello World'
Aim: I want to find 'Hello' ONLY IF it is followed by ' World'
>>> re.search(r'\w+(?= World)', 'Hello World').group()
Result : 'Hello'
Explanation:
- The \w+ is Pattern1
- The ' World' (including the leading space) is Pattern2
- r'\w+(?= World)' means: find anything that fits \w+, but only IF it is immediately followed by ' World'
Rules:
- Pattern2 needs to be inside the parentheses, right after the ?=
- If pattern2 itself contains a literal parenthesis, it needs to be escaped, either with a backslash or by putting it in brackets, i.e. \( or [(], and \) or [)]
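To make the rules above concrete, here is a minimal sketch of look ahead in Python's re module. The variable names and the 'Hello There' and 'print(x)' strings are made up for illustration; only the 'Hello World' pattern comes from the example above.

```python
import re

# Pattern1 is \w+, Pattern2 is ' World' (with the leading space).
pattern = re.compile(r'\w+(?= World)')

match = pattern.search('Hello World')
found = match.group()   # only Pattern1 is returned
span = match.span()     # the match stops before ' World'

# If Pattern2 is missing, nothing is matched at all.
missing = pattern.search('Hello There')

# A literal parenthesis inside Pattern2 has to be escaped: \( or [(]
before_paren = re.search(r'\w+(?=\()', 'print(x)').group()

print(found, span, missing, before_paren)
```

The span shows that the look ahead part is checked but not consumed, so ' World' is still available to whatever comes next in the pattern.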
LOOK BEHIND
It's just the opposite of the above. It means: I want to see what lies behind me, and then decide whether to do something.
Syntax of regex pattern: (?<=pattern1)pattern2
Where pattern2 is the pattern for the part that we ACTUALLY want to capture, and pattern1 is the pattern that MUST be found immediately before it.
IF PATTERN1 is FOUND, then print/get/capture/show PATTERN2
Example : We will take the same example. 'Hello World'
Aim : I want to find 'World' ONLY if it is preceded by 'Hello ' (The space here also counts)
>>> re.search(r'(?<=Hello )\w+', 'Hello World').group()
Result : 'World'
Rules:
- Pattern1 needs to be inside the parentheses, right after the ?<=
- If pattern1 itself contains a literal parenthesis, it needs to be escaped, i.e. \( or [(], and \) or [)]
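Here is a small sketch of look behind along the same lines. One extra point that is not in the rules above: Python's built-in re module only accepts a look behind of fixed width, so something like \s+ inside it is rejected. The variable names and the 'Goodbye World' and 'f(x)' strings are invented for illustration.

```python
import re

# Pattern1 is 'Hello ' (with the trailing space), Pattern2 is \w+.
word = re.search(r'(?<=Hello )\w+', 'Hello World').group()

# Without 'Hello ' directly in front, there is no match.
no_match = re.search(r'(?<=Hello )\w+', 'Goodbye World')

# A literal parenthesis inside the look behind is escaped the same way.
after_paren = re.search(r'(?<=\()\w+', 'f(x)').group()

# The built-in re module needs a fixed-width look behind,
# so a variable-width pattern such as \s+ raises re.error.
try:
    re.compile(r'(?<=Hello\s+)\w+')
    fixed_width_required = False
except re.error:
    fixed_width_required = True

print(word, no_match, after_paren, fixed_width_required)
```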
LOOK AHEAD & LOOK BEHIND COMBINED
Consume only if it is surrounded by the things we want.
Syntax of regex pattern: (?<=pattern_behind)pattern_middle(?=pattern_ahead)
Where pattern_middle is the pattern for the part that we ACTUALLY want to capture. Pattern_behind and Pattern_ahead are patterns which need to be found as MANDATORY.
IF PATTERN_AHEAD AND IF PATTERN_BEHIND are BOTH found, consume PATTERN_MIDDLE
Example: We will take a new example, 'Hello My World'
Aim: I want to find any word that occurs between 'Hello' and 'World'
>>> re.search(r'(?<=Hello )\w+(?= World)', 'Hello My World').group()
Result : 'My'
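The same combined pattern also works with re.findall and re.sub. The sketch below uses a longer, made-up sentence and a made-up replacement word to show that only the middle word is touched.

```python
import re

between = re.compile(r'(?<=Hello )\w+(?= World)')

# Only words with 'Hello ' behind AND ' World' ahead are returned.
text = 'Hello My World, Hello Big World, Hello Small Planet'
words = between.findall(text)

# The surrounding words are checked but never consumed,
# so a substitution replaces the middle word only.
swapped = between.sub('Your', 'Hello My World')

print(words)    # ['My', 'Big']
print(swapped)  # Hello Your World
```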
More examples:
Problem : Replace every run of special symbols that sits between two alphanumeric characters with a single space.
Target string : 'This$#is% Matrix# %!'
>>> re.sub(r'(?<=[a-zA-Z0-9])[$#@%^\s]+(?=[a-zA-Z0-9])', ' ', 'This$#is% Matrix# %!')
Result : 'This is Matrix# %!'
Explanation: For an alphanumeric character, we use [a-zA-Z0-9]. To find the special characters between two of them, we use a combo of look ahead and look behind. The trailing '# %!' is left alone because no alphanumeric character follows it.
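If you need this clean-up in more than one place, the pattern can be wrapped in a small function. This is only a sketch: the names INNER_SYMBOLS and collapse_inner_symbols are made up here, and the symbol list is the same one used in the example above, so anything outside it (such as '!') is left untouched.

```python
import re

# Hypothetical names; the pattern itself is the one from the example above.
INNER_SYMBOLS = re.compile(r'(?<=[a-zA-Z0-9])[$#@%^\s]+(?=[a-zA-Z0-9])')

def collapse_inner_symbols(text):
    """Replace each run of the listed symbols between two alphanumerics with one space."""
    return INNER_SYMBOLS.sub(' ', text)

cleaned = collapse_inner_symbols('This$#is% Matrix# %!')
print(cleaned)  # This is Matrix# %!
```

Symbols at the very start or end of the string have nothing alphanumeric on one side, so they fail the look behind or the look ahead and stay as they are.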