11 September, 2018

Regular Expressions. Let's not be greedy.

I recently faced a situation where I wanted to extract parts of some HTML code. But then I realized I had no control over how much text the regex was giving me back.


Regular expressions have these quantifiers:

*  + ?

They are all greedy. That means they will always try to match as much text as possible.

Let's see an example. 


>>> import re
>>> r = re.search('<.*>', '<a><b><ab>')
>>> r.group()
'<a><b><ab>'


Now I didn't ask for the whole deal here. I was just trying to extract '<a>' from the above text.
So we need to use the ? in partnership with the *. Putting ? right after a quantifier makes it non-greedy, so it matches as little as possible.

>>> r = re.search('<.*?>', '<a><b><ab>')
>>> r.group()
'<a>'
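
The same idea carries over to the other two quantifiers. Here is a minimal sketch, not taken from the session above, that compares the greedy and non-greedy form of each one. The sample string 'aaaa' and the variable names are just made up for illustration.

```python
import re

text = '<a><b><ab>'

# Greedy: .* runs all the way to the last '>' in the string.
greedy_tags = re.findall(r'<.*>', text)

# Non-greedy: .*? stops at the first '>' it can, so each tag matches on its own.
lazy_tags = re.findall(r'<.*?>', text)

# + and ? follow the same rule when a ? is appended to them.
plus_greedy = re.search(r'a+', 'aaaa').group()    # takes every 'a'
plus_lazy = re.search(r'a+?', 'aaaa').group()     # takes the minimum, one 'a'
opt_greedy = re.search(r'a?', 'aaaa').group()     # takes the optional 'a'
opt_lazy = re.search(r'a??', 'aaaa').group()      # prefers to take nothing

print(greedy_tags)
print(lazy_tags)
print(repr(plus_greedy), repr(plus_lazy), repr(opt_greedy), repr(opt_lazy))
```

Using re.findall instead of re.search here is only a convenience to show every match at once.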


Another example:

The first case uses the greedy .* without the ?. The second uses the non-greedy .*?


>>> r = re.search('(?:http://.*/)', 'http://www.google.com/search/query/')
>>> r.group()
'http://www.google.com/search/query/'


>>> r = re.search('(http://.*?/)', 'http://www.google.com/search/query/')
>>> r.group()
'http://www.google.com/'
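
As a side note, a non-greedy quantifier is not the only way to stop at the first slash. A common alternative, which is my own addition and not something the example above relies on, is a negated character class that simply refuses to cross a '/'. A rough sketch, with variable names chosen only for illustration:

```python
import re

url = 'http://www.google.com/search/query/'

# Non-greedy version, same pattern idea as in the example above.
lazy = re.search(r'http://.*?/', url).group()

# Alternative: [^/]* matches any run of characters that are not '/',
# so it cannot run past the first slash after the host name.
negated = re.search(r'http://[^/]*/', url).group()

print(lazy)
print(negated)
```

Both patterns give the same result on this input. The negated class just states the stopping condition explicitly instead of relying on the quantifier being non-greedy.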
