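In this article I'll discuss how to use a PHP/cURL web spider to download and save image files. I'll use the earlier email-extraction script as the example. With a few modifications, the same script can extract product information and images from shopping sites such as ebay.com or amazon.com into a database of your choice. We can also pull business information, text and images from directory sites into your own website.
Here are a few things to consider when extracting image files and storing them in a database:
1) Different websites, different pages, and even a single page can use several image file formats (JPEG, PNG, GIF, etc.).
If we want to build a common database of images collected from different websites, our PHP web spider script needs to be able to convert them into the file format we want.
2) Every image has a different file size.
Some images may be very large and some very small. Our PHP web spider script needs to be able to shrink large files to a smaller size. Scaling a large image down is not a problem, but scaling a small image up gives poor quality.
3) We need a naming convention for image files.
Each website names its image files differently; some file names are long, some short. Before storing image files in our folder, we need to rename them.
4) We need to add a column to the MySQL database and link each image to its related information.
So, let's get started...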
Note: Check out the sample code at the bottom of this article.
First, let's look at the delimiters of the matching pattern for image files.
I added the IMAGE matching pattern to the scraper.php file.
define('TARGET_BLOCK', '~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('IMAGE', '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~');
define('PARSE_CONTENT', TRUE);
define('IMAGE_DIR', 'c:\\xampp\\htdocs\\scraper\\image\\');
I also added IMAGE_DIR, which is the path of the folder where the image files are stored.
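To see what the IMAGE pattern captures, here is a minimal sketch run against a made-up HTML fragment. The markup, the name jane-doe, the file path and the base URL http://example.com/ are all hypothetical; only the pattern itself comes from scraper.php. Note that the pattern consumes the leading slash of src, so group 2 is a site-relative path, and this sketch assumes it has to be joined with the site's base URL before the image can be fetched.

```php
<?php
// Pattern copied from scraper.php.
$imagePattern = '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~';

// Hypothetical fragment shaped like the markup the pattern expects.
$block = '<div class="negotiators-photo"><a href="/negotiator/jane-doe">'
       . '<img src="/files/photos/jane-doe.jpg" alt=""></a></div>';

// Hypothetical base URL of the scraped site (an assumption for this sketch).
$baseUrl = 'http://example.com/';

$imagePath = null;
$imageUrl  = null;
if (preg_match($imagePattern, $block, $match)) {
    $imagePath = $match[2];             // path without the leading slash
    $imageUrl  = $baseUrl . $imagePath; // assumed way to rebuild a full URL
}

echo $imagePath, "\n";
echo $imageUrl, "\n";
```

Group 1 holds the agent's slug from the link, which is not used for the image itself.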
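I added an image-handling class, Image, to the image.php script. Simon Jarvis wrote the original in 2006; you can find the original code at http://www.white-hat-web-design.co.uk/blog/resizing-images-with-php . I modified some parts to fit our script.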
<?php
class Image {
    private $_image;
    private $_imageFormat;

    // Load an image file and pick the decoder that matches its format
    public function load($imageFile) {
        $imageInfo = getimagesize($imageFile);
        $this->_imageFormat = $imageInfo[2];
        if ($this->_imageFormat === IMAGETYPE_JPEG) {
            $this->_image = imagecreatefromjpeg($imageFile);
        } elseif ($this->_imageFormat === IMAGETYPE_GIF) {
            $this->_image = imagecreatefromgif($imageFile);
        } elseif ($this->_imageFormat === IMAGETYPE_PNG) {
            $this->_image = imagecreatefrompng($imageFile);
        }
    }

    // Save the image in the requested format (JPEG by default)
    public function save($imageFile, $_imageFormat = IMAGETYPE_JPEG, $compression = 75, $permissions = null) {
        if ($_imageFormat == IMAGETYPE_JPEG) {
            imagejpeg($this->_image, $imageFile, $compression);
        } elseif ($_imageFormat == IMAGETYPE_GIF) {
            imagegif($this->_image, $imageFile);
        } elseif ($_imageFormat == IMAGETYPE_PNG) {
            imagepng($this->_image, $imageFile);
        }
        if ($permissions != null) {
            chmod($imageFile, $permissions);
        }
    }

    public function getWidth() {
        return imagesx($this->_image);
    }

    public function getHeight() {
        return imagesy($this->_image);
    }

    // Resize to a given height, keeping the aspect ratio
    public function resizeToHeight($height) {
        $ratio = $height / $this->getHeight();
        $width = $this->getWidth() * $ratio;
        $this->resize($width, $height);
    }

    // Resize to a given width, keeping the aspect ratio
    public function resizeToWidth($width) {
        $ratio = $width / $this->getWidth();
        $height = $this->getHeight() * $ratio;
        $this->resize($width, $height);
    }

    // Scale both dimensions by a percentage
    public function scale($scale) {
        $width = $this->getWidth() * $scale / 100;
        $height = $this->getHeight() * $scale / 100;
        $this->resize($width, $height);
    }

    private function resize($width, $height) {
        // GD expects integer dimensions; the ratio maths above can produce floats
        $width = (int) round($width);
        $height = (int) round($height);
        $newImage = imagecreatetruecolor($width, $height);
        imagecopyresampled($newImage, $this->_image, 0, 0, 0, 0, $width, $height, $this->getWidth(), $this->getHeight());
        $this->_image = $newImage;
    }
}
?>
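Load an image file with load(); getWidth() and getHeight() return the image's width and height. Before saving the image with save(), you can use resizeToWidth(), resizeToHeight() or scale() to adjust its width and height.
The save() function can also convert the image file format. I've put together a demonstration, and you can try it yourself as well.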
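load() chooses a decoder by looking at index 2 of the array returned by getimagesize(), which holds one of PHP's IMAGETYPE_* constants. The sketch below shows that check in isolation, using getimagesizefromstring() on a tiny 1×1 GIF embedded as base64 so it needs no network access or files. The helper name describeImage() is made up for illustration.

```php
<?php
// Hypothetical helper: report the width, height and format of raw image bytes.
function describeImage(string $bytes): ?array
{
    $info = getimagesizefromstring($bytes);
    if ($info === false) {
        return null; // not a recognisable image
    }
    return [
        'width'  => $info[0],
        'height' => $info[1],
        'type'   => $info[2], // an IMAGETYPE_* constant
        'ext'    => image_type_to_extension($info[2], false),
    ];
}

// A 1x1 GIF, base64-encoded so the example is self-contained.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');

$details = describeImage($gif);
printf("%dx%d %s\n", $details['width'], $details['height'], $details['ext']);
```

Checking the type this way before decoding makes it possible to skip unsupported formats instead of passing them on to the resize step.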
We add a new column, "image", to the MySQL table "contact_info".
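One way to add that column is an ALTER TABLE statement run through mysqli. This is a sketch only: the column type VARCHAR(64) and the helper name buildAddImageColumnSql() are assumptions, not taken from the original script. The file names generated later are a 32-character md5 hash plus an extension, so 64 characters leaves room.

```php
<?php
// Hypothetical helper that builds the ALTER TABLE statement.
function buildAddImageColumnSql(string $table): string
{
    // Table names cannot be bound as parameters, so accept only safe identifiers.
    if (!preg_match('/^[A-Za-z0-9_]+$/', $table)) {
        throw new InvalidArgumentException('Unexpected table name: ' . $table);
    }
    // VARCHAR(64) is an assumed size, not something the original specifies.
    return "ALTER TABLE `$table` ADD COLUMN `image` VARCHAR(64) NOT NULL DEFAULT ''";
}

$sql = buildAddImageColumnSql('contact_info');
echo $sql, "\n";

// To apply it, run the statement once against the database, for example:
// $db = new mysqli('localhost', 'root', '', 'email_collection');
// $db->query($sql);
```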
Then we add the value $info['image'] in the addData() function of the EmailDatabase class.
class EmailDatabase extends mysqli implements MySQLTable {
    private $_table = 'contact_info'; // set default table

    // Connect to database
    public function __construct() {
        $host = 'localhost';
        $user = 'root';
        $pass = '';
        $dbname = 'email_collection';
        parent::__construct($host, $user, $pass, $dbname);
    }

    // Use this function to change to another table
    public function setTableName($name) {
        $this->_table = $name;
    }

    // Write data to table
    public function addData($info) {
        $values = array();
        foreach (array('name', 'email', 'phone', 'image') as $key) {
            // Escape each value so quotes in the data cannot break the statement
            $values[] = "'" . $this->real_escape_string((string) $info[$key]) . "'";
        }
        $sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone, image) ';
        $sql .= 'VALUES (' . implode(', ', $values) . ')';
        return $this->query($sql);
    }

    // Execute MySQL query here
    public function query($query, $mode = MYSQLI_STORE_RESULT) {
        $this->ping();
        $res = parent::query($query, $mode);
        return $res;
    }
}
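Because addData() assembles its SQL from strings, scraped values need care: a name containing an apostrophe (O'Brien, say) can break a statement that is concatenated without escaping. A prepared statement sidesteps the problem. The following is a sketch under assumptions, not the original author's code: buildInsertSql() and insertContact() are hypothetical names, and it assumes the four-column contact_info table described in this article.

```php
<?php
// Hypothetical helper: build the parameterised INSERT for a given table.
function buildInsertSql(string $table): string
{
    // Identifiers cannot be bound as parameters, so validate the table name.
    if (!preg_match('/^[A-Za-z0-9_]+$/', $table)) {
        throw new InvalidArgumentException('Unexpected table name: ' . $table);
    }
    return "INSERT IGNORE INTO `$table` (name, email, phone, image) VALUES (?, ?, ?, ?)";
}

// Hypothetical helper: insert one scraped record using bound parameters.
function insertContact(mysqli $db, string $table, array $info): bool
{
    $stmt = $db->prepare(buildInsertSql($table));
    if ($stmt === false) {
        return false;
    }
    // bind_param() takes its arguments by reference, so use real variables.
    $name  = (string) ($info['name'] ?? '');
    $email = (string) ($info['email'] ?? '');
    $phone = (string) ($info['phone'] ?? '');
    $image = (string) ($info['image'] ?? '');
    $stmt->bind_param('ssss', $name, $email, $phone, $image);
    $ok = $stmt->execute();
    $stmt->close();
    return $ok;
}

echo buildInsertSql('contact_info'), "\n";
```

With an open connection, insertContact($db, 'contact_info', $agent) would take the place of the string-built query.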
I also added a saveImage() function to the Scraper class.
class Scraper implements HttpScraper {
    private $_table; // Store MySQL table if want to write to database.

    public function __construct($t = null) {
        $this->setTable($t);
    }

    // Delete table info at destructor
    public function __destruct() {
        if ($this->_table !== null) {
            $this->_table = null;
        }
    }

    // Set table info to private variable $_table
    public function setTable($t) {
        if ($t === null || $t instanceof MySQLTable)
            $this->_table = $t;
    }

    // Get table info
    public function getTable() {
        return $this->_table;
    }

    // Parse function
    public function parse($body, $head) {
        if ($head == 200) {
            $p = preg_match_all(TARGET_BLOCK, $body, $blocks);
            if ($p) {
                foreach ($blocks[0] as $block) {
                    $agent = array();
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                    $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
                    $this->_table->addData($agent);
                }
            }
        }
    }

    // Return matched info
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }
    }

    // Download, resize and store an image; returns the new file name
    public function saveImage($imageUrl, $imageType = IMAGETYPE_GIF) {
        if (!file_exists(IMAGE_DIR)) {
            mkdir(IMAGE_DIR, 0777, true);
        }
        if ($imageType === IMAGETYPE_JPEG) {
            $fileExt = 'jpg';
        } elseif ($imageType === IMAGETYPE_GIF) {
            $fileExt = 'gif';
        } elseif ($imageType === IMAGETYPE_PNG) {
            $fileExt = 'png';
        }
        $newImageName = md5($imageUrl) . '.' . $fileExt;
        $image = new Image();
        $image->load($imageUrl);
        $image->resizeToWidth(100);
        $image->save(IMAGE_DIR . $newImageName, $imageType);
        return $newImageName;
    }
}
The saveImage() function first checks whether the image folder exists, and creates it if not.
if (!file_exists(IMAGE_DIR)) {
    mkdir(IMAGE_DIR, 0777, true);
}
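A slightly more defensive version of that check can be sketched as follows. ensureDirectory() is a hypothetical helper: it uses is_dir() rather than file_exists() so that a regular file with the same name is not mistaken for the folder, and it reports failure instead of carrying on silently. The default permission 0755 is an assumption; the original uses 0777.

```php
<?php
// Hypothetical helper: make sure a directory exists, creating it if needed.
function ensureDirectory(string $dir, int $mode = 0755): void
{
    if (is_dir($dir)) {
        return;
    }
    // The second is_dir() covers the case where another process created
    // the directory between the check above and the mkdir() call.
    if (!@mkdir($dir, $mode, true) && !is_dir($dir)) {
        throw new RuntimeException('Could not create directory: ' . $dir);
    }
}

// Usage with a throwaway directory under the system temp folder.
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'scraper_images_' . uniqid();
ensureDirectory($dir);
echo is_dir($dir) ? "created\n" : "missing\n";
rmdir($dir); // clean up the example directory
```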
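In this exercise, all images are automatically converted to GIF format (you can change the default format yourself). You can also pass $imageType to convert to another format.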
if ($imageType === IMAGETYPE_JPEG) {
    $fileExt = 'jpg';
} elseif ($imageType === IMAGETYPE_GIF) {
    $fileExt = 'gif';
} elseif ($imageType === IMAGETYPE_PNG) {
    $fileExt = 'png';
}
The file name also has to match the converted format.
$newImageName = md5($imageUrl) . '.' . $fileExt;
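Here I use the PHP function md5() to hash the image's URL and append the file extension to create the new file name. You could also switch to something like the time() function to fit your own naming standard.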
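The extension lookup and the naming step can be combined into one small function. This sketch uses hypothetical names, extensionFor() and buildImageName(), and, unlike the if/elseif chain in saveImage(), it throws on an unsupported type instead of leaving $fileExt undefined. The URL is a made-up example.

```php
<?php
// Hypothetical helper: map an IMAGETYPE_* constant to a file extension.
function extensionFor(int $imageType): string
{
    $map = [
        IMAGETYPE_JPEG => 'jpg',
        IMAGETYPE_GIF  => 'gif',
        IMAGETYPE_PNG  => 'png',
    ];
    if (!isset($map[$imageType])) {
        throw new InvalidArgumentException('Unsupported image type: ' . $imageType);
    }
    return $map[$imageType];
}

// Hypothetical helper: derive a fixed-length file name from the image URL.
function buildImageName(string $imageUrl, int $imageType = IMAGETYPE_GIF): string
{
    return md5($imageUrl) . '.' . extensionFor($imageType);
}

$name = buildImageName('http://example.com/files/photos/jane-doe.jpg');
echo $name, "\n";
```

Because the name is derived from the URL, scraping the same image twice overwrites the same file rather than creating a duplicate. A time()-based name would produce a new file on every run.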
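Then we use the load() function to load the image. In this example, the original file is 130×130. I call resizeToWidth(100) to scale it down to 100×100, and then save the image to the specified folder.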
$image = new Image();
$image->load($imageUrl);
$image->resizeToWidth(100);
$image->save(IMAGE_DIR . $newImageName, $imageType);
return $newImageName;
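The arithmetic behind resizeToWidth() is a single ratio. The sketch below isolates it in a hypothetical function, fitToWidth(), and adds one thing the Image class does not do: it refuses to enlarge an image that is already narrower than the target, since scaling a small image up gives poor quality.

```php
<?php
// Hypothetical helper: compute [width, height] for a proportional resize
// to $targetWidth, never enlarging the source image.
function fitToWidth(int $srcWidth, int $srcHeight, int $targetWidth): array
{
    if ($srcWidth <= 0 || $srcHeight <= 0 || $targetWidth <= 0) {
        throw new InvalidArgumentException('Dimensions must be positive.');
    }
    if ($srcWidth <= $targetWidth) {
        return [$srcWidth, $srcHeight]; // already small enough, keep as is
    }
    $ratio  = $targetWidth / $srcWidth;
    $height = max(1, (int) round($srcHeight * $ratio));
    return [$targetWidth, $height];
}

[$w, $h] = fitToWidth(130, 130, 100);
echo $w, 'x', $h, "\n"; // 100x100, as in the example above
```

Rounding to whole pixels here also keeps the values valid for imagecreatetruecolor(), which expects integer dimensions.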
The returned file name is stored in MySQL by the parse() function.
public function parse($body, $head) {
    if ($head == 200) {
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);
        if ($p) {
            foreach ($blocks[0] as $block) {
                $agent = array();
                $agent['name'] = $this->matchPattern(NAME, $block, 2);
                $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                // saveImage() returns the new file name, which goes into the image column
                $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
                $this->_table->addData($agent);
            }
        }
    }
}
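When you run the script, you'll see that MySQL stores the image file name for each real estate agent.
Then take a look in the image directory: you'll find that the downloaded images are all 100×100 GIFs!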
Code:
1. httpcurl.php
<?php
// Class HttpCurl
class HttpCurl {
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;

    // Check that cURL is enabled and set the parser
    public function __construct($p = null) {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        }
        $this->setParser($p);
    }

    // Get web page and run parser
    public function get($url, $status = FALSE) {
        $this->request($url);
        if ($status === TRUE) {
            return $this->runParser($this->_body, $this->getStatus());
        }
    }

    // Run cURL to get web page source file
    protected function request($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);
    }

    // Get http_code
    public function getStatus() {
        return $this->_info['http_code'];
    }

    // Get web page header information
    public function getHeader() {
        return $this->_info;
    }

    // Get web page content
    public function getBody() {
        return $this->_body;
    }

    public function __destruct() {
    }

    // Set parser, either object or callback function
    public function setParser($p) {
        if ($p === null || $p instanceof HttpScraper || is_callable($p))
            $this->_parser = $p;
    }

    // Execute parser
    public function runParser($content, $header) {
        if ($this->_parser !== null) {
            if ($this->_parser instanceof HttpScraper)
                $this->_parser->parse($content, $header);
            else
                call_user_func($this->_parser, $content, $header);
        }
    }
}
?>
2. image.php
<?php
class Image {
    private $_image;
    private $_imageFormat;

    // Load an image file and pick the decoder that matches its format
    public function load($imageFile) {
        $imageInfo = getimagesize($imageFile);
        $this->_imageFormat = $imageInfo[2];
        if ($this->_imageFormat === IMAGETYPE_JPEG) {
            $this->_image = imagecreatefromjpeg($imageFile);
        } elseif ($this->_imageFormat === IMAGETYPE_GIF) {
            $this->_image = imagecreatefromgif($imageFile);
        } elseif ($this->_imageFormat === IMAGETYPE_PNG) {
            $this->_image = imagecreatefrompng($imageFile);
        }
    }

    // Save the image in the requested format (JPEG by default)
    public function save($imageFile, $_imageFormat = IMAGETYPE_JPEG, $compression = 75, $permissions = null) {
        if ($_imageFormat == IMAGETYPE_JPEG) {
            imagejpeg($this->_image, $imageFile, $compression);
        } elseif ($_imageFormat == IMAGETYPE_GIF) {
            imagegif($this->_image, $imageFile);
        } elseif ($_imageFormat == IMAGETYPE_PNG) {
            imagepng($this->_image, $imageFile);
        }
        if ($permissions != null) {
            chmod($imageFile, $permissions);
        }
    }

    public function getWidth() {
        return imagesx($this->_image);
    }

    public function getHeight() {
        return imagesy($this->_image);
    }

    // Resize to a given height, keeping the aspect ratio
    public function resizeToHeight($height) {
        $ratio = $height / $this->getHeight();
        $width = $this->getWidth() * $ratio;
        $this->resize($width, $height);
    }

    // Resize to a given width, keeping the aspect ratio
    public function resizeToWidth($width) {
        $ratio = $width / $this->getWidth();
        $height = $this->getHeight() * $ratio;
        $this->resize($width, $height);
    }

    // Scale both dimensions by a percentage
    public function scale($scale) {
        $width = $this->getWidth() * $scale / 100;
        $height = $this->getHeight() * $scale / 100;
        $this->resize($width, $height);
    }

    private function resize($width, $height) {
        // GD expects integer dimensions; the ratio maths above can produce floats
        $width = (int) round($width);
        $height = (int) round($height);
        $newImage = imagecreatetruecolor($width, $height);
        imagecopyresampled($newImage, $this->_image, 0, 0, 0, 0, $width, $height, $this->getWidth(), $this->getHeight());
        $this->_image = $newImage;
    }
}
?>
3. scraper.php
<?php
/********************************************************
 * These are website-specific matching patterns.        *
 * Change these matching patterns for each website,     *
 * or else you will not get any results.                *
 ********************************************************/
define('TARGET_BLOCK', '~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('IMAGE', '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~');
define('PARSE_CONTENT', TRUE);
define('IMAGE_DIR', 'c:\\xampp\\htdocs\\scraper\\image\\');

// Interface MySQLTable
interface MySQLTable {
    public function addData($info);
}

// Class EmailDatabase
// Use the code below to create the table.
// The `image` column is needed by addData(); its type and size are an assumption.
/*****************************************************
CREATE TABLE IF NOT EXISTS `contact_info` (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `name` varchar(128) NOT NULL,
  `email` varchar(128) NOT NULL,
  `phone` varchar(128) NOT NULL,
  `image` varchar(128) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
*******************************************************/
class EmailDatabase extends mysqli implements MySQLTable {
    private $_table = 'contact_info'; // set default table

    // Connect to database
    public function __construct() {
        $host = 'localhost';
        $user = 'root';
        $pass = '';
        $dbname = 'email_collection';
        parent::__construct($host, $user, $pass, $dbname);
    }

    // Use this function to change to another table
    public function setTableName($name) {
        $this->_table = $name;
    }

    // Write data to table
    public function addData($info) {
        $values = array();
        foreach (array('name', 'email', 'phone', 'image') as $key) {
            // Escape each value so quotes in the data cannot break the statement
            $values[] = "'" . $this->real_escape_string((string) $info[$key]) . "'";
        }
        $sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone, image) ';
        $sql .= 'VALUES (' . implode(', ', $values) . ')';
        return $this->query($sql);
    }

    // Execute MySQL query here
    public function query($query, $mode = MYSQLI_STORE_RESULT) {
        $this->ping();
        $res = parent::query($query, $mode);
        return $res;
    }
}

// Interface HttpScraper
interface HttpScraper {
    public function parse($body, $head);
}

// Class Scraper
class Scraper implements HttpScraper {
    private $_table; // Store MySQL table if want to write to database.

    public function __construct($t = null) {
        $this->setTable($t);
    }

    // Delete table info at destructor
    public function __destruct() {
        if ($this->_table !== null) {
            $this->_table = null;
        }
    }

    // Set table info to private variable $_table
    public function setTable($t) {
        if ($t === null || $t instanceof MySQLTable)
            $this->_table = $t;
    }

    // Get table info
    public function getTable() {
        return $this->_table;
    }

    // Parse function
    public function parse($body, $head) {
        if ($head == 200) {
            $p = preg_match_all(TARGET_BLOCK, $body, $blocks);
            if ($p) {
                foreach ($blocks[0] as $block) {
                    $agent = array();
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                    $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
                    // echo "<pre>"; print_r($agent); echo "</pre>";
                    $this->_table->addData($agent);
                }
            }
        }
    }

    // Return matched info
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }
    }

    // Download, resize and store an image; returns the new file name
    public function saveImage($imageUrl, $imageType = IMAGETYPE_GIF) {
        if (!file_exists(IMAGE_DIR)) {
            mkdir(IMAGE_DIR, 0777, true);
        }
        if ($imageType === IMAGETYPE_JPEG) {
            $fileExt = 'jpg';
        } elseif ($imageType === IMAGETYPE_GIF) {
            $fileExt = 'gif';
        } elseif ($imageType === IMAGETYPE_PNG) {
            $fileExt = 'png';
        }
        $newImageName = md5($imageUrl) . '.' . $fileExt;
        $image = new Image();
        $image->load($imageUrl);
        $image->resizeToWidth(100);
        $image->save(IMAGE_DIR . $newImageName, $imageType);
        return $newImageName;
    }
}
?>
4. extract.php
<?php
include 'image.php';
include 'scraper.php';
include 'httpcurl.php'; // include lib files

// Set our target's URL; remember not to include the page number used in pagination
$target = "http://<domain name>/negotiators?page=";
$startPage = $target . "1"; // Set first page

$scrapeContent = new Scraper;
$firstPage = new HttpCurl();
$firstPage->get($startPage); // get first page content

$lastPage = 0; // no pages are scraped if the page count cannot be read
if ($firstPage->getStatus() === 200) {
    // get total page info from first page
    $lastPage = (int) $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1);
}

$db = new EmailDatabase(); // can be excluded if you do not want to write to database
$scrapeContent = new Scraper($db); // can be excluded as well
$pages = new HttpCurl($scrapeContent);

// Loop from the first page to the last and parse every page into the database
for ($i = 1; $i <= $lastPage; $i++) {
    $targetPage = $target . $i;
    $pages->get($targetPage, PARSE_CONTENT);
}
?>