In this post, we make a small modification to the previous PHP script so that it performs targeted email extraction.
First, revisit the source of the targeted web page. It contains repeated blocks of agent contacts, each with a name, email and phone number, for a total of 10 blocks per page.
The plan is to have the script "cut out" each block and store it in an array, then extract the name, email and phone number from each block.
As you can see, each block starts with a <div class="negotiators-wrapper"> tag and ends with </div></div>. Note that the two </div> tags are separated by a carriage return and line feed in this example.
So here is the code for this example:
<?php
// Pattern for one agent block; the "s" modifier lets "." match across line breaks.
define('TARGET_BLOCK', '~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');

interface HttpScraper
{
    public function parse($body, $head);
}

class Scraper implements HttpScraper
{
    public function parse($body, $head)
    {
        if ($head == 200) {
            $p = preg_match_all(TARGET_BLOCK, $body, $blocks);
            if ($p) {
                foreach ($blocks[0] as $block) {
                    // Keys are quoted strings; bare name/email/phone would be read as undefined constants.
                    $agent = array();
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    echo "<pre>";
                    print_r($agent);
                    echo "</pre>";
                }
            }
        }
    }

    public function matchPattern($pattern, $content, $pos)
    {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }
        return null; // nothing matched in this block
    }
}

class HttpCurl
{
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;

    public function __construct($p = null)
    {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        }
        $this->setParser($p);
    }

    public function get($url)
    {
        return $this->request($url);
    }

    protected function request($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);
        // Hand the response body and HTTP status code to the parser.
        $this->runParser($this->_body, $this->getStatus());
    }

    public function getStatus()
    {
        return $this->_info['http_code'];
    }

    public function getHeader()
    {
        return $this->_info;
    }

    public function getBody()
    {
        return $this->_body;
    }

    public function __destruct()
    {
    }

    public function setParser($p)
    {
        if ($p === null || $p instanceof HttpScraper || is_callable($p)) {
            $this->_parser = $p;
        }
    }

    public function runParser($content, $header)
    {
        if ($this->_parser !== null) {
            if ($this->_parser instanceof HttpScraper) {
                $this->_parser->parse($content, $header);
            } else {
                call_user_func($this->_parser, $content, $header);
            }
        }
    }
}
?>
How it works:
First, I define TARGET_BLOCK to get the block as discussed earlier.
define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
The carriage return and line feed are matched with \r\n, which works for me when running XAMPP under Windows 7. Also, the regular expression "s" modifier is added at the end of the pattern so that "." also matches newline characters, which lets (.*?) span the multiple lines of a block.
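To see what the "s" modifier and the literal \r\n each contribute, here is a small sketch run against invented markup. The sample HTML (including the negotiators-info container) and the \s* variant are assumptions for illustration, not something taken from the target page.

```php
<?php
// Hypothetical sample block, joined with Windows line endings (\r\n).
$html = implode("\r\n", [
    '<div class="negotiators-wrapper">',
    '<div class="negotiators-info">',
    '<div class="negotiators-email">agent@example.com</div></div>',
    '</div>',
]);

// Body of the TARGET_BLOCK pattern from the post, without delimiters or modifiers.
$pattern = '<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>';

// Without "s", "." stops at the line feed, so the block cannot be matched.
$withoutS = preg_match('~' . $pattern . '~', $html);

// With "s", "." also matches \r and \n, and the whole block is captured.
$withS = preg_match('~' . $pattern . '~s', $html, $m);

// The same markup with Unix line endings: the literal \r\n no longer matches.
$unixHtml = str_replace("\r\n", "\n", $html);
$unixOriginal = preg_match('~' . $pattern . '~s', $unixHtml);

// Assumed variant: \s* accepts \r\n, \n, or indentation between the closing tags.
$tolerant = '~<div class="negotiators-wrapper">(.*?)</div>\s*</div>~s';
$unixTolerant = preg_match($tolerant, $unixHtml);

var_dump($withoutS, $withS, $unixOriginal, $unixTolerant);
```

If the target page is served with different line endings than the ones seen here, a whitespace-tolerant pattern along these lines may be easier to keep working than the literal \r\n.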
To get the name, I define NAME as
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
Note that there is a URL before ">(Name)</a></div>, and it is different in each block. preg_match() will therefore capture two sets of data: first the partial URL, and second the targeted name. We ignore the URL information in this case.
Getting the email and phone number is straightforward.
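A minimal sketch of how the two capture groups line up in the match array. The slug and name below are invented for illustration; only the pattern comes from the post.

```php
<?php
// Hypothetical block content; "jane-doe" and "Jane Doe" are made-up values.
$block = '<div class="negotiators-name"><a href="/negotiator/jane-doe">Jane Doe</a></div>';
$namePattern = '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~';

$slug = null;
$name = null;
if (preg_match($namePattern, $block, $match)) {
    // $match[0] is the whole match, $match[1] the URL fragment, $match[2] the name.
    $slug = $match[1];
    $name = $match[2];
}

echo $slug, ' => ', $name, PHP_EOL;
```

This is why the script asks for position 2 when reading the name. If the URL fragment is never needed, the first group could instead be written as a non-capturing group, (?:.*?), in which case the name would be found at position 1.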
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
There is just a slight change to class Scraper:
class Scraper implements HttpScraper
{
    public function parse($body, $head)
    {
        if ($head == 200) {
            $p = preg_match_all(TARGET_BLOCK, $body, $blocks);
            if ($p) {
                foreach ($blocks[0] as $block) {
                    // Keys are quoted strings; bare name/email/phone would be read as undefined constants.
                    $agent = array();
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    echo "<pre>";
                    print_r($agent);
                    echo "</pre>";
                }
            }
        }
    }

    public function matchPattern($pattern, $content, $pos)
    {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }
        return null; // nothing matched in this block
    }
}
First, parse() matches the blocks and copies them into an array; then we extract the name, email and phone number from each one. The rest of the code remains unchanged.
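The two-step flow (cut out the blocks, then pull fields from each) can also be tried without any HTTP request. The sketch below is a variation that returns the agents as an array instead of printing them. The extractAgents() and sampleBlock() helpers, the negotiators-info container, and all sample values are hypothetical; only the regular expressions are copied from the post.

```php
<?php
// Patterns copied from the post; each field is paired with its capture-group index.
$blockPattern = '~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s';
$fieldPatterns = [
    'name'  => ['~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~', 2],
    'email' => ['~<div class="negotiators-email">(.*?)</div>~', 1],
    'phone' => ['~<div class="negotiators-phone">(.*?)</div>~', 1],
];

// Hypothetical helper: returns one associative array per matched block.
function extractAgents(string $body, string $blockPattern, array $fieldPatterns): array
{
    $agents = [];
    if (!preg_match_all($blockPattern, $body, $blocks)) {
        return $agents;
    }
    foreach ($blocks[0] as $block) {
        $agent = [];
        foreach ($fieldPatterns as $field => [$pattern, $group]) {
            // A field that is missing from the block is stored as null.
            $agent[$field] = preg_match($pattern, $block, $m) ? $m[$group] : null;
        }
        $agents[] = $agent;
    }
    return $agents;
}

// Hypothetical markup generator that mimics the structure described in the post.
function sampleBlock(string $slug, string $name, string $email, string $phone): string
{
    return implode("\r\n", [
        '<div class="negotiators-wrapper">',
        '<div class="negotiators-info">',
        '<div class="negotiators-name"><a href="/negotiator/' . $slug . '">' . $name . '</a></div>',
        '<div class="negotiators-email">' . $email . '</div>',
        '<div class="negotiators-phone">' . $phone . '</div></div>',
        '</div>',
    ]);
}

$body = sampleBlock('jane-doe', 'Jane Doe', 'jane@example.com', '555-0100')
      . "\r\n"
      . sampleBlock('john-roe', 'John Roe', 'john@example.com', '555-0101');

$agents = extractAgents($body, $blockPattern, $fieldPatterns);
print_r($agents);
```

Returning an array rather than echoing inside the loop is only one possible design; it tends to make it easier to pass the results on to other code later.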
For this tutorial, we run test.php and print out the results.
That's it!!
With all this information, you can write a more personalized email to your targeted recipients. You can use email management software such as ListMailPro, or any up-to-date autoresponder, to import your list and send out mass email.
So far our script can extract emails from one page. To extract a large quantity of emails, it needs to crawl all of the targeted pages and grab as much information as possible. This will be discussed next.