In this final part of PHP/cURL email extractor, I will show you how to store extracted data into MySQL database. You can store email addresses and contact information collected not just from one website, but also from various websites into the same database.
You might want to store email collected based on your purpose. For example, if you have a real estate website and a internet shopping website, then information collected should be stored into two different categories (tables in MySQL database).
First, you need to activate XAMPP on your PC, both Apache and MySQL. At browser URL, go to "http://localhost/phpmyadmin/". Go to top menu bar and select "Database". To create a new database for our tutorial, enter "email_collection" and press "Create" button, as shown in the picture below.
You can download the source file for PHP cURL Email Extractor from here.
Note: Check out the sample code at bottom of this article.
Go to "email_collection" database and you will find that it is still empty. Select "SQL", cut and paste query below into blank space and click "Go".
CREATE TABLE IF NOT EXISTS `contact_info` ( `id` int(12) NOT NULL AUTO_INCREMENT, `name` varchar(128) NOT NULL, `email` varchar(128) NOT NULL, `phone` varchar(128) NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `email` (`email`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
A table "contact_info" is created under database "email_collection".
Click "contact_info" and there are four data columns. Here we create a column "id" that will automatically increment each time a new entry is added to the table. "name", "email" and "phone" will store data collected. The data column is defined as "UNIQUE KEY" so that no the same emails will store more than once in the database. For example, you have collected and stored an email into database, the same email might be found when you run the script on another website. In this case, we just keep the first and ignore the second one.
So now here is the action: I modified the httpcurl.php file.How it works:
<?php define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s'); define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~'); define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~'); define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~'); define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~'); define('PARSE_CONTENT', TRUE); interface MySQLTable { public function addData($info); } class EmailDatabase extends mysqli implements MySQLTable { private $_table = 'contact_info'; public function __construct() { $host = 'localhost'; $user = 'root'; $pass = ''; $dbname = 'email_collection'; parent::__construct($host, $user, $pass, $dbname); } public function setTableName($name) { $this->_table = $name; } public function addData($info) { $sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone) '; $sql .= 'VALUES (\'' . $info[name] . '\', \'' . $info[email] . '\', \'' . $info[phone]. '\')'; return $this->query($sql); } public function query($query, $mode = MYSQLI_STORE_RESULT) { $this->ping(); $res = parent::query($query, $mode); return $res; } } interface HttpScraper { public function parse($body, $head); } class Scraper implements HttpScraper { private $_table; public function __construct($t = null) { $this->setTable($t); } public function __destruct() { if ($this->_table !== null) { $this->_table = null; } } public function setTable($t) { if ($t === null || $t instanceof MySQLTable) $this->_table = $t; } public function getTable() { return $this->_table; } public function parse($body, $head) { if ($head == 200) { $p = preg_match_all(TARGET_BLOCK, $body, $blocks); if ($p) { foreach($blocks[0] as $block) { $agent[name] = $this->matchPattern(NAME, $block, 2); $agent[email] = $this->matchPattern(EMAIL, $block, 1); $agent[phone] = $this->matchPattern(PHONE, $block, 1); // echo "<pre>"; print_r($agent); echo "</pre>"; $this->_table->addData($agent); } } } } public function matchPattern($pattern, $content, $pos) { if (preg_match($pattern, $content, $match)) { return $match[$pos]; } } } class HttpCurl { protected $_cookie, $_parser, $_timeout; private $_ch, $_info, $_body, $_error; public function __construct($p = null) { if (!function_exists('curl_init')) { throw new Exception('cURL not enabled!'); } $this->setParser($p); } public function get($url, $status = FALSE) { $this->request($url); if ($status === TRUE) { return $this->runParser($this->_body, $this->getStatus()); } } protected function request($url) { $ch = curl_init($url); curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); curl_setopt($ch, CURLOPT_MAXREDIRS, 5); curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); curl_setopt($ch, CURLOPT_URL, $url); $this->_body = curl_exec($ch); $this->_info = curl_getinfo($ch); $this->_error = curl_error($ch); curl_close($ch); } public function getStatus() { return $this->_info[http_code]; } public function getHeader() { return $this->_info; } public function getBody() { return $this->_body; } public function __destruct() { } public function setParser($p) { if ($p === null || $p instanceof HttpScraper || is_callable($p)) $this->_parser = $p; } public function runParser($content, $header) { if ($this->_parser !== null) { if ($this->_parser instanceof HttpScraper) $this->_parser->parse($content, $header); else call_user_func($this->_parser, $content, $header); } } } ?>
A new class "EmailDatabase" has been added to our code. This class implements MySQLTable which has one function addData() in the interface.
The constructor of EmailDatase class will connect to MySQL through username and password given. Under XAMPP environment, host will be localhost, username is root and no password needed.
public function __construct() { $host = 'localhost'; $user = 'root'; $pass = ''; $dbname = 'email_collection'; parent::__construct($host, $user, $pass, $dbname); }
The addData() will write data array $info that has name, email and phone collected, into table "contact_info".
public function addData($info) { $sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone) '; $sql .= 'VALUES (\'' . $info[name] . '\', \'' . $info[email] . '\', \'' . $info[phone]. '\')'; return $this->query($sql); } public function query($query, $mode = MYSQLI_STORE_RESULT) { $this->ping(); $res = parent::query($query, $mode); return $res; } }
The Scraper class also changed to handle the new database.
private $_table; public function __construct($t = null) { $this->setTable($t); } public function __destruct() { if ($this->_table !== null) { $this->_table = null; } } public function setTable($t) { if ($t === null || $t instanceof MySQLTable) $this->_table = $t; }
The constructor of Scraper class will set the database passed during instantiation.
Slight changes to function parse() where we insert addData() function to write to database.
$this->_table->addData($agent);
For our test.php:
<?php include 'httpcurl.php'; $target = "http://<domain name>/negotiators?page="; $startPage = $target . "1"; $scrapeContent = new Scraper; $firstPage = new HttpCurl(); $firstPage->get($startPage); if ($firstPage->getStatus() === 200) { $lastPage = $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1); } $db = new EmailDatabase; $scrapeContent = new Scraper($db); $pages = new HttpCurl($scrapeContent); for($i=1; $i <= $lastPage; $i++) { $targetPage = $target . $i; $pages->get($targetPage, PARSE_CONTENT); } ?>
Only minor changes with $db object created and passded to $scrapeContent.
$db = new EmailDatabase; $scrapeContent = new Scraper($db); $pages = new HttpCurl($scrapeContent);
When we run the program, all data will be stored into database.
In fact, there are 1600++ data stored in the database!
What that, most likely you just need to change this portion to extract different websites.
define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s'); define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~'); define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~'); define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~'); define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~'); define('PARSE_CONTENT', TRUE);
That is all for PHP email extractor. Remember, don't spam your email recipients!
So far we are scrapping on text. I will use the same code to demonstrate how to download and store images from websites in next article.
Code:
1. httpcurl.php
<?php /******************************************************** * These are website specific matching pattern * * Change these matching patterns for each websites * * Else you will not get any results * ********************************************************/ define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s'); define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~'); define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~'); define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~'); define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~'); define('PARSE_CONTENT', TRUE); // Interface MySQLTable interface MySQLTable { public function addData($info); } // Class EmailDatabase // Use the code below to crease table /***************************************************** CREATE TABLE IF NOT EXISTS `contact_info` ( `id` int(12) NOT NULL AUTO_INCREMENT, `name` varchar(128) NOT NULL, `email` varchar(128) NOT NULL, `phone` varchar(128) NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `email` (`email`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; *******************************************************/ class EmailDatabase extends mysqli implements MySQLTable { private $_table = 'contact_info'; // set default table // Connect to database public function __construct() { $host = 'localhost'; $user = 'root'; $pass = ''; $dbname = 'email_collection'; parent::__construct($host, $user, $pass, $dbname); } // Use this function to change to another table public function setTableName($name) { $this->_table = $name; } // Write data to table public function addData($info) { $sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone) '; $sql .= 'VALUES (\'' . $info[name] . '\', \'' . $info[email] . '\', \'' . $info[phone]. '\')'; return $this->query($sql); } // Execute MySQL query here public function query($query, $mode = MYSQLI_STORE_RESULT) { $this->ping(); $res = parent::query($query, $mode); return $res; } } // Interface HttpScraper interface HttpScraper { public function parse($body, $head); } // Class Scraper class Scraper implements HttpScraper { private $_table; // Store MySQL table if want to write to database. public function __construct($t = null) { $this->setTable($t); } // Delete table info at descructor public function __destruct() { if ($this->_table !== null) { $this->_table = null; } } // Set table info to private variable $_table public function setTable($t) { if ($t === null || $t instanceof MySQLTable) $this->_table = $t; } // Get table info public function getTable() { return $this->_table; } // Parse function public function parse($body, $head) { if ($head == 200) { $p = preg_match_all(TARGET_BLOCK, $body, $blocks); if ($p) { foreach($blocks[0] as $block) { $agent[name] = $this->matchPattern(NAME, $block, 2); $agent[email] = $this->matchPattern(EMAIL, $block, 1); $agent[phone] = $this->matchPattern(PHONE, $block, 1); // echo "<pre>"; print_r($agent); echo "</pre>"; $this->_table->addData($agent); } } } } // Return matched info public function matchPattern($pattern, $content, $pos) { if (preg_match($pattern, $content, $match)) { return $match[$pos]; } } } // Class HttpCurl class HttpCurl { protected $_cookie, $_parser, $_timeout; private $_ch, $_info, $_body, $_error; // Check curl activated // Set Parser as well public function __construct($p = null) { if (!function_exists('curl_init')) { throw new Exception('cURL not enabled!'); } $this->setParser($p); } // Get web page and run parser public function get($url, $status = FALSE) { $this->request($url); if ($status === TRUE) { return $this->runParser($this->_body, $this->getStatus()); } } // Run cURL to get web page source file protected function request($url) { $ch = curl_init($url); curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); curl_setopt($ch, CURLOPT_MAXREDIRS, 5); curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); curl_setopt($ch, CURLOPT_URL, $url); $this->_body = curl_exec($ch); $this->_info = curl_getinfo($ch); $this->_error = curl_error($ch); curl_close($ch); } // Get http_code public function getStatus() { return $this->_info[http_code]; } // Get web page header information public function getHeader() { return $this->_info; } // Get web page content public function getBody() { return $this->_body; } public function __destruct() { } // set parser, either object or callback function public function setParser($p) { if ($p === null || $p instanceof HttpScraper || is_callable($p)) $this->_parser = $p; } // Execute parser public function runParser($content, $header) { if ($this->_parser !== null) { if ($this->_parser instanceof HttpScraper) $this->_parser->parse($content, $header); else call_user_func($this->_parser, $content, $header); } } } ?>
2. test.php
<?php include 'httpcurl.php'; // include lib file $target = "http://<website domain>/negotiators?page="; // Set our target's url, remember not to include nu,ber in pagination $startPage = $target . "1"; // Set first page $scrapeContent = new Scraper; $firstPage = new HttpCurl(); $firstPage->get($startPage); // get first page content if ($firstPage->getStatus() === 200) { $lastPage = $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1); // get total page info from first page } $db = new EmailDatabase; // can be excluded if do not want to write to database $scrapeContent = new Scraper($db); // // can be excluded as well $pages = new HttpCurl($scrapeContent); // Looping from first page to last and parse each and every pages to database for($i=1; $i <= $lastPage; $i++) { $targetPage = $target . $i; $pages->get($targetPage, PARSE_CONTENT); } ?>