How to extract email from a Web page using Python

Thursday, December 11th, 2008

Building an email extractor for a Web page is easy with Python and regular expressions (regex).

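The first task is to download the content of the Web page. Use the urlopen function from the urllib.request module (in Python 2 it lived in the urllib module), read the response, and decode the bytes into a string.

from urllib.request import urlopen  # Python 3; Python 2 used "from urllib import urlopen"
# read() returns bytes, so decode them to get a string the regex can search
text = urlopen('http://the.web.url').read().decode('utf-8', errors='replace')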
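
A slightly more defensive version of this step might look like the sketch below. The helper name fetch_text, the 10-second timeout, and the UTF-8 fallback are assumptions for illustration, not part of the original recipe. It uses the charset the server declares, when there is one, instead of assuming an encoding.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download url and return its body as a str (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header; fall back to UTF-8 (assumption)
        charset = response.headers.get_content_charset() or 'utf-8'
    # errors='replace' keeps going when the page contains bytes invalid in that charset
    return raw.decode(charset, errors='replace')
```

It would be called as text = fetch_text('http://the.web.url'), after which text is an ordinary string.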

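Second, define a regular expression that matches email addresses and compile it into a variable named pattern. Note that the pattern accepts only two-letter country codes and a fixed list of top-level domains, so addresses under any other top-level domain are not matched.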

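import re

pattern = re.compile(
    r"[\w!#$%&'*+/=?^_`{|}~-]+" +             # local part: first atom
    r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*" +      # further dot-separated atoms
    r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+" +      # domain labels, each followed by a dot
    r"(?:[a-z]{2}|" +                         # two-letter country code, or
    r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b",
    re.IGNORECASE                             # domain names are case-insensitive
)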
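
To see what such a pattern accepts and rejects, it can be tried on a few strings. The sketch below uses a variant of the pattern that writes the two-letter branch as [a-z]{2} and compiles with re.IGNORECASE; both choices, and the sample addresses, are assumptions for illustration.

```python
import re

pattern = re.compile(
    r"[\w!#$%&'*+/=?^_`{|}~-]+"
    r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+"
    r"(?:[a-z]{2}|"
    r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b",
    re.IGNORECASE,
)

samples = [
    "alice@example.com",             # generic TLD from the list
    "bob.smith@mail.example.co.uk",  # dotted local part, country-code TLD
    "carol@example.dev",             # TLD neither two letters nor in the list
    "not-an-address",                # no @ at all
]
for sample in samples:
    # fullmatch succeeds only if the whole string is one address
    print(sample, '->', bool(pattern.fullmatch(sample)))
```

The first two samples match and the last two do not, which shows the main limitation of a fixed list of top-level domains.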

Lastly, call the compiled pattern's findall method on the text downloaded in the first step.

pattern.findall(text)

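It returns a Python list of all email addresses found in that Web page. Because the pattern contains only non-capturing groups, each list item is the complete matched address.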
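
findall reports every occurrence, so an address that appears several times on a page (for example in both a mailto: link and its visible text) shows up several times in the list. Below is a minimal sketch of removing duplicates while keeping first-seen order. The list found is a made-up stand-in for the result of pattern.findall(text), and lowercasing addresses before comparing them is an assumption that suits most mail systems but is not guaranteed for the part before the @.

```python
# Hypothetical result of pattern.findall(text); the addresses are made up
found = ["Alice@example.com", "bob@example.org", "alice@example.com"]

# dict keys are unique and keep insertion order (Python 3.7+), so this
# drops repeats while preserving the order of first appearance
unique = list(dict.fromkeys(address.lower() for address in found))
print(unique)
```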

This is the full code listing:

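import re
from urllib.request import urlopen  # Python 3; Python 2 used "from urllib import urlopen"

def extractEmail(theUrl):
    # Download the page and decode the bytes so the regex can search a string
    text = urlopen(theUrl).read().decode('utf-8', errors='replace')
    pattern = re.compile(
        r"[\w!#$%&'*+/=?^_`{|}~-]+" +             # local part: first atom
        r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*" +      # further dot-separated atoms
        r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+" +      # domain labels, each followed by a dot
        r"(?:[a-z]{2}|" +                         # two-letter country code, or
        r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b",
        re.IGNORECASE                             # domain names are case-insensitive
    )
    # findall returns every non-overlapping match as a list of strings
    return pattern.findall(text)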
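
The function above lets any download failure propagate to the caller. One possible way to wrap it is sketched below. The wrapper name safe_extract and the choice to return an empty list on failure are assumptions, and the extractor is passed in as an argument so the sketch stays independent of the listing.

```python
from urllib.error import URLError


def safe_extract(extractor, url):
    """Call extractor(url); return [] if the page cannot be fetched (hypothetical wrapper)."""
    try:
        return extractor(url)
    except (URLError, ValueError) as error:
        # URLError covers network and HTTP errors (HTTPError is a subclass);
        # ValueError is what urlopen raises for a string that is not a URL at all
        print('could not fetch %s: %s' % (url, error))
        return []
```

With the listing loaded, it could be used as emails = safe_extract(extractEmail, 'http://the.web.url').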

Please use this code wisely. Thank you.
