Downloading and Saving Images with a PHP/cURL Web Spider Script

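In this article I discuss how to download and save image files with a PHP/cURL web spider. I use the earlier email-address extraction script as the example. With some modifications, the same script can extract product information and images from shopping sites such as ebay.com or amazon.com and move them into a database of your choice. We can also extract business information, text, and images from directory sites for use on your own website.

Here are several considerations when extracting image files and storing them in a database:

1) Different websites, different pages, and even a single page can use many image file formats (JPEG, PNG, GIF, etc.).

If we want to build a common database of images collected from different websites, our PHP web spider script needs to be able to convert them into the file format we want.

2) Every image has a different file size.

Some images may be very large and others very small. Our PHP web spider script needs to be able to resize large images to smaller dimensions. Scaling a large image down is not a problem; scaling a small image up gives poor quality.

3) We need a naming convention for image files.

Each website names its image files differently; some names are long, some short. We need to rename these files before storing them in our folder.

4) We need to add a column to the MySQL database and link each image to its related information.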
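For the first consideration, the file extension in a URL is not always a reliable guide to the real format. The sketch below is not part of the original scripts: detectImageType() is a hypothetical helper that inspects the downloaded bytes with PHP's getimagesizefromstring() and reports one of the three formats this article handles, or null for anything else.

```php
<?php
// Hypothetical helper (not in the original scripts): report the real format
// of downloaded image data, regardless of the file extension in the URL.
function detectImageType(string $bytes): ?int
{
    // getimagesizefromstring() reads the image header; @ hides notices on non-image data.
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return null;
    }
    // Index 2 holds one of the IMAGETYPE_* constants.
    $supported = [IMAGETYPE_JPEG, IMAGETYPE_GIF, IMAGETYPE_PNG];
    return in_array($info[2], $supported, true) ? $info[2] : null;
}

// A 1x1 GIF, base64-encoded, used only as sample input.
$sampleGif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
var_dump(detectImageType($sampleGif) === IMAGETYPE_GIF);
var_dump(detectImageType('this is plainly not image data'));
```

A check like this could run before the image is handed to a conversion step, so unsupported files are skipped early.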

So let's get started...

Note: Check out the sample code at bottom of this article.

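First, let's look at the delimiters for the image-file matching pattern.

[Figure: delimiters for the image-file matching pattern]

I added the IMAGE matching pattern to the scraper.php file.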

define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('IMAGE', '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~');
define('PARSE_CONTENT', TRUE);
define('IMAGE_DIR', 'c:\\xampp\\htdocs\\scraper\\image\\');

I also added IMAGE_DIR, which is the path of the folder where image files are stored.
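To see what the IMAGE pattern captures, here is a small sketch that runs it against a made-up block of markup. The HTML sample and the variable names are assumptions for illustration; only the pattern itself comes from scraper.php.

```php
<?php
// Same IMAGE pattern as in scraper.php.
define('IMAGE', '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~');

// Made-up markup in the shape the pattern expects (an assumption, not real site data).
$block = '<div class="negotiators-photo"><a href="/negotiator/jane-doe">'
       . '<img src="/files/photos/jane.jpg" alt=""></a></div>';

if (preg_match(IMAGE, $block, $match)) {
    $profileSlug = $match[1]; // first capture group: the part after /negotiator/
    $imagePath   = $match[2]; // second capture group: the src path without its leading slash
    echo $profileSlug, "\n", $imagePath, "\n";
}
```

Note that the second group starts after the leading slash of the src attribute, which is why parse() asks matchPattern() for position 2.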

I added an image-handling class, Image, in the image.php script. Simon Jarvis wrote this code in 2006; you can find the original at http://www.white-hat-web-design.co.uk/blog/resizing-images-with-php . I modified some parts to fit our script.

<?php

class Image {   
private $_image; 
private $_imageFormat;   

public function load($imageFile) {   
	$imageInfo = getImageSize($imageFile); 
	$this->_imageFormat = $imageInfo[2]; 
	if( $this->_imageFormat === IMAGETYPE_JPEG ) {   
		$this->_image = imagecreatefromjpeg($imageFile); 
	} elseif( $this->_imageFormat === IMAGETYPE_GIF ) {  
		$this->_image = imagecreatefromgif($imageFile); 
	} elseif( $this->_imageFormat === IMAGETYPE_PNG ) {  
		$this->_image = imagecreatefrompng($imageFile); 
	} 
} 

public function save($imageFile, $_imageFormat=IMAGETYPE_JPEG, $compression=75, $permissions=null) {   
	if( $_imageFormat == IMAGETYPE_JPEG ) { 
		imagejpeg($this->_image,$imageFile,$compression); 
	} elseif ( $_imageFormat == IMAGETYPE_GIF ) {   
		imagegif($this->_image,$imageFile); 
	} elseif ( $_imageFormat == IMAGETYPE_PNG ) {   
		imagepng($this->_image,$imageFile); 
	} 
	if( $permissions != null) {   
		chmod($imageFile,$permissions); 
	} 
} 
	

public function getWidth() {   
	return imagesx($this->_image); 
} 

public function getHeight() {   
	return imagesy($this->_image); 
} 

public function resizeToHeight($height) {   
	$ratio = $height / $this->getHeight(); 
	$width = $this->getWidth() * $ratio; 
	$this->resize($width,$height); 
}   

public function resizeToWidth($width) { 
	$ratio = $width / $this->getWidth(); 
	$height = $this->getHeight() * $ratio; 
	$this->resize($width,$height); 
}   

public function scale($scale) { 
	$width = $this->getWidth() * $scale/100; 
	$height = $this->getHeight() * $scale/100; 
	$this->resize($width,$height); 
}   

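private function resize($width, $height) { 
	// GD expects whole-pixel dimensions, so round the computed sizes
	$width = max(1, (int) round($width)); 
	$height = max(1, (int) round($height)); 
	$newImage = imagecreatetruecolor($width, $height); 
	imagecopyresampled($newImage, $this->_image, 0, 0, 0, 0, $width, $height, $this->getWidth(), $this->getHeight()); 
	$this->_image = $newImage; 
}   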

}

?>

 

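Use load() to load an image file, and getWidth() and getHeight() to get the image's width and height. Before saving the image file with save(), you can use resizeToWidth(), resizeToHeight(), or scale() to adjust its width and height.

save() can also convert the image file format. I show one example here; you can try others yourself.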
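The arithmetic behind resizeToWidth() can be sketched without GD. scaledSize() below is a hypothetical helper, not a method of the Image class; it shows how the new height follows from the width ratio so the aspect ratio is preserved.

```php
<?php
// Hypothetical helper: compute the size an image would get when resized
// to a target width while keeping its aspect ratio.
function scaledSize(int $srcWidth, int $srcHeight, int $targetWidth): array
{
    $ratio = $targetWidth / $srcWidth;
    // Round to whole pixels and never go below 1 pixel.
    return [$targetWidth, max(1, (int) round($srcHeight * $ratio))];
}

print_r(scaledSize(130, 130, 100)); // a square source stays square
print_r(scaledSize(640, 480, 100)); // a 4:3 source keeps its 4:3 shape
```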

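We add a new column, "image", to the MySQL table "contact_info".

[Figure: adding the new column "image" to the MySQL table "contact_info"]

Then we add the value $info['image'] in the addData() function of the EmailDatabase class.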

class EmailDatabase extends mysqli implements MySQLTable	{
	private $_table = 'contact_info';     // set default table

	// Connect to database
	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}
	
	// Use this function to change to another table	
	public function setTableName($name)  {
		$this->_table = $name;
	}

	// Write data to table
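	public function addData($info)	{
		// Escape each scraped value before embedding it in the SQL string
		$values = array();
		foreach (array('name', 'email', 'phone', 'image') as $key) {
			$values[] = "'" . $this->real_escape_string((string) $info[$key]) . "'";
		}
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone, image) ';
		$sql .= 'VALUES (' . implode(', ', $values) . ')';
		return $this->query($sql);
	}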

	// Execute MySQL query here
	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}
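addData() assembles its SQL from scraped values. One alternative, sketched here as an assumption rather than as part of the original script, is to let mysqli bind the values through placeholders. buildInsertSql() is a hypothetical helper that only builds the statement text; the commented lines show how it might be used with a connected mysqli object.

```php
<?php
// Hypothetical helper: build an INSERT statement with one "?" placeholder per column.
function buildInsertSql(string $table, array $columns): string
{
    $placeholders = implode(', ', array_fill(0, count($columns), '?'));
    return 'INSERT IGNORE INTO ' . $table
         . ' (' . implode(', ', $columns) . ') VALUES (' . $placeholders . ')';
}

$sql = buildInsertSql('contact_info', ['name', 'email', 'phone', 'image']);
echo $sql, "\n";

// With a connected mysqli object $db, the statement could then be run as:
//   $stmt = $db->prepare($sql);
//   $stmt->bind_param('ssss', $info['name'], $info['email'], $info['phone'], $info['image']);
//   $stmt->execute();
```

The table and column names must still come from trusted code, since placeholders only cover values.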

 

I also added a saveImage() function to the Scraper class.

class Scraper implements HttpScraper	{
	private $_table;	

	// Store MySQL table if want to write to database.
    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	// Clear table info in the destructor
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

	// Set table info to private variable $_table
    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }
	
	// Get table info
	public function getTable()  {
		return $this->_table;
	}
	
	// Parse function
    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                    $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
                    $this->_table->addData($agent);
               }
            }
        }
    }
     
	// Return matched info
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }  
    }
	

	public function saveImage($imageUrl, $imageType = IMAGETYPE_GIF) {		
		if (!file_exists(IMAGE_DIR)) {
			mkdir(IMAGE_DIR, 0777, true);
		}
		
		if( $imageType === IMAGETYPE_JPEG ) { 
			$fileExt = 'jpg';
		} elseif ( $imageType === IMAGETYPE_GIF ) {   
			$fileExt = 'gif';
		} elseif ( $imageType === IMAGETYPE_PNG ) {   
			$fileExt = 'png';
		} 
		
		$newImageName = md5($imageUrl). '.' . $fileExt;
			
		$image = new Image(); 
		$image->load($imageUrl); 
		$image->resizeToWidth(100); 
		$image->save( IMAGE_DIR . $newImageName,  $imageType );  		
		return $newImageName;
	}	
	
	
}

saveImage() first checks whether the image folder already exists and creates it if not.

		if (!file_exists(IMAGE_DIR)) {
			mkdir(IMAGE_DIR, 0777, true);
		}

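In this exercise, all images are automatically converted to GIF format (you can change the default format yourself). The $imageType parameter can also be used to convert to another format.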

		if( $imageType === IMAGETYPE_JPEG ) { 
			$fileExt = 'jpg';
		} elseif ( $imageType === IMAGETYPE_GIF ) {   
			$fileExt = 'gif';
		} elseif ( $imageType === IMAGETYPE_PNG ) {   
			$fileExt = 'png';
		} 
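The if/elseif chain above leaves $fileExt undefined when an unexpected type is passed in. A lookup table is one way to make that case explicit. extensionFor() is a hypothetical helper written for this sketch, not a function from the original script.

```php
<?php
// Hypothetical helper: map an IMAGETYPE_* constant to a file extension,
// and fail loudly for types the scraper does not handle.
function extensionFor(int $imageType): string
{
    $map = [
        IMAGETYPE_JPEG => 'jpg',
        IMAGETYPE_GIF  => 'gif',
        IMAGETYPE_PNG  => 'png',
    ];
    if (!isset($map[$imageType])) {
        throw new InvalidArgumentException('Unsupported image type: ' . $imageType);
    }
    return $map[$imageType];
}

echo extensionFor(IMAGETYPE_GIF), "\n";
```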

The file name must also carry the extension of the converted format.

$newImageName = md5($imageUrl). '.' . $fileExt;

 

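Here I use the PHP function md5() to hash the image URL and append the file extension to create the new file name. You could instead use time() or something else to fit your own naming standard.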
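The two naming approaches can be compared side by side. Both function names below are hypothetical, and the random suffix in the time-based variant is an added assumption to avoid collisions when several images are saved within the same second.

```php
<?php
// Hypothetical helper: deterministic name, the same URL always maps to the same file.
function hashedImageName(string $imageUrl, string $fileExt): string
{
    return md5($imageUrl) . '.' . $fileExt;
}

// Hypothetical helper: time-based name with a short random suffix,
// so two images saved in the same second do not overwrite each other.
function timestampedImageName(string $fileExt): string
{
    return time() . '_' . bin2hex(random_bytes(4)) . '.' . $fileExt;
}

echo hashedImageName('files/photos/jane.jpg', 'gif'), "\n";
echo timestampedImageName('gif'), "\n";
```

The hashed form has the side effect that re-running the spider overwrites an existing file instead of creating a duplicate.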

Then we load the image with load(). In this example the original file is 130×130. I resize it to 100×100 by calling resizeToWidth(100), then save the image to the specified folder.

		$image = new Image(); 
		$image->load($imageUrl); 
		$image->resizeToWidth(100); 
		$image->save( IMAGE_DIR . $newImageName,  $imageType );  		
		return $newImageName;
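Because the IMAGE pattern captures the src path without its leading slash, the value passed to load() may be a site-relative path rather than a full URL. If that is the case on your target site, it would need to be joined with the site's base URL first. A minimal sketch, assuming a hypothetical helper absoluteImageUrl() and a placeholder base URL:

```php
<?php
// Hypothetical helper: turn a site-relative image path into an absolute URL.
// Paths that are already absolute are returned unchanged.
function absoluteImageUrl(string $baseUrl, string $path): string
{
    if (preg_match('~^https?://~i', $path)) {
        return $path;
    }
    return rtrim($baseUrl, '/') . '/' . ltrim($path, '/');
}

// example.com is a placeholder, not the site used in this article.
echo absoluteImageUrl('http://example.com/', 'files/photos/jane.jpg'), "\n";
```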

The returned file name is stored in MySQL by the parse() function.

    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                    $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
                    $this->_table->addData($agent);
               }
            }
        }
    }

 

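When you run the script, you can see that MySQL stores the image file name for each real estate agent.

[Figure: MySQL table storing the image file name for each real estate agent]

Then look in the image directory: the downloaded pictures are all 100×100 GIF files!

[Figure: the image directory]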

 

Code: 

1. httpcurl.php

<?php
 
 // Class HttpCurl
class HttpCurl {
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;
      
	// Check curl activated
	// Set Parser as well
    public function __construct($p = null) {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        } 
        $this->setParser($p);
    }
  
	// Get web page and run parser
    public function get($url, $status = FALSE) {
		$this->request($url);	
		if ($status === TRUE) {
			return $this->runParser($this->_body, $this->getStatus()); 
		}		
    }
  
	// Run cURL to get web page source file
    protected function request($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);   
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);      
    }
  
	// Get http_code
    public function getStatus() {
        return $this->_info['http_code'];
    }
      
	// Get web page header information
    public function getHeader() {
        return $this->_info;
    }
  
	// Get web page content
    public function getBody() {
        return $this->_body;
    }
      
    public function __destruct() {
    } 
      
	// set parser, either object or callback function
    public function setParser($p)   {
        if ($p === null || $p instanceof HttpScraper || is_callable($p))  
            $this->_parser = $p;
    }
  
	// Execute parser
    public function runParser($content, $header)    {
        if ($this->_parser !== null)
        {
            if ($this->_parser instanceof HttpScraper)
                $this->_parser->parse($content, $header);
            else
                call_user_func($this->_parser, $content, $header);
        }
    } 
}
  
?>

 

2. image.php

<?php

class Image {   
private $_image; 
private $_imageFormat;   

public function load($imageFile) {   
	$imageInfo = getImageSize($imageFile); 
	$this->_imageFormat = $imageInfo[2]; 
	if( $this->_imageFormat === IMAGETYPE_JPEG ) {   
		$this->_image = imagecreatefromjpeg($imageFile); 
	} elseif( $this->_imageFormat === IMAGETYPE_GIF ) {  
		$this->_image = imagecreatefromgif($imageFile); 
	} elseif( $this->_imageFormat === IMAGETYPE_PNG ) {  
		$this->_image = imagecreatefrompng($imageFile); 
	} 
} 

public function save($imageFile, $_imageFormat=IMAGETYPE_JPEG, $compression=75, $permissions=null) {   
	if( $_imageFormat == IMAGETYPE_JPEG ) { 
		imagejpeg($this->_image,$imageFile,$compression); 
	} elseif ( $_imageFormat == IMAGETYPE_GIF ) {   
		imagegif($this->_image,$imageFile); 
	} elseif ( $_imageFormat == IMAGETYPE_PNG ) {   
		imagepng($this->_image,$imageFile); 
	} 
	if( $permissions != null) {   
		chmod($imageFile,$permissions); 
	} 
} 
	

public function getWidth() {   
	return imagesx($this->_image); 
} 

public function getHeight() {   
	return imagesy($this->_image); 
} 

public function resizeToHeight($height) {   
	$ratio = $height / $this->getHeight(); 
	$width = $this->getWidth() * $ratio; 
	$this->resize($width,$height); 
}   

public function resizeToWidth($width) { 
	$ratio = $width / $this->getWidth(); 
	$height = $this->getHeight() * $ratio; 
	$this->resize($width,$height); 
}   

public function scale($scale) { 
	$width = $this->getWidth() * $scale/100; 
	$height = $this->getHeight() * $scale/100; 
	$this->resize($width,$height); 
}   

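private function resize($width, $height) { 
	// GD expects whole-pixel dimensions, so round the computed sizes
	$width = max(1, (int) round($width)); 
	$height = max(1, (int) round($height)); 
	$newImage = imagecreatetruecolor($width, $height); 
	imagecopyresampled($newImage, $this->_image, 0, 0, 0, 0, $width, $height, $this->getWidth(), $this->getHeight()); 
	$this->_image = $newImage; 
}   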

}

?>

 

3. scraper.php

<?php

/********************************************************
* These are website specific matching pattern           *
* Change these matching patterns for each websites      *
* Else you will not get any results                     *
********************************************************/
define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('IMAGE', '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~');
define('PARSE_CONTENT', TRUE);
define('IMAGE_DIR', 'c:\\xampp\\htdocs\\scraper\\image\\');
 
// Interface MySQLTable
interface MySQLTable	{
	public function addData($info);	
}

// Class EmailDatabase
// Use the code below to create the table
/*****************************************************
  CREATE TABLE IF NOT EXISTS `contact_info` (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `name` varchar(128) NOT NULL,
  `email` varchar(128) NOT NULL,
  `phone` varchar(128) NOT NULL,
  `image` varchar(128) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;
*******************************************************/
class EmailDatabase extends mysqli implements MySQLTable	{
	private $_table = 'contact_info';     // set default table

	// Connect to database
	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}
	
	// Use this function to change to another table	
	public function setTableName($name)  {
		$this->_table = $name;
	}

	// Write data to table
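	public function addData($info)	{
		// Escape each scraped value before embedding it in the SQL string
		$values = array();
		foreach (array('name', 'email', 'phone', 'image') as $key) {
			$values[] = "'" . $this->real_escape_string((string) $info[$key]) . "'";
		}
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone, image) ';
		$sql .= 'VALUES (' . implode(', ', $values) . ')';
		return $this->query($sql);
	}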

	// Execute MySQL query here
	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}


// Interface HttpScraper
interface HttpScraper
{
    public function parse($body, $head);
}
  
 // Class Scraper
class Scraper implements HttpScraper	{
	private $_table;	

	// Store MySQL table if want to write to database.
    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	// Clear table info in the destructor
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

	// Set table info to private variable $_table
    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }
	
	// Get table info
	public function getTable()  {
		return $this->_table;
	}
	
	// Parse function
    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                    $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
//                    echo "<pre>"; print_r($agent); echo "</pre>";
					$this->_table->addData($agent);
               }
            }
        }
    }
     
	// Return matched info
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }  
    }
	

	public function saveImage($imageUrl, $imageType = IMAGETYPE_GIF) {		
		if (!file_exists(IMAGE_DIR)) {
			mkdir(IMAGE_DIR, 0777, true);
		}
		
		if( $imageType === IMAGETYPE_JPEG ) { 
			$fileExt = 'jpg';
		} elseif ( $imageType === IMAGETYPE_GIF ) {   
			$fileExt = 'gif';
		} elseif ( $imageType === IMAGETYPE_PNG ) {   
			$fileExt = 'png';
		} 
	
		$path_parts = pathinfo($imageUrl);
		
		$newImageName = md5($imageUrl). '.' . $fileExt;
			
		$image = new Image(); 
		$image->load($imageUrl); 
		$image->resizeToWidth(100); 
		$image->save( IMAGE_DIR . $newImageName,  $imageType );  		
		return $newImageName;
	}	
	
	
}
 

?>

 

4. extract.php

<?php
include 'image.php';
include 'scraper.php';
include 'httpcurl.php';	// include lib file
   
$target = "http://<domain name>/negotiators?page=";	// Set our target's URL; do not include the page number used for pagination
$startPage = $target . "1";	// Set first page

$scrapeContent = new Scraper;
$firstPage = new HttpCurl();
$firstPage->get($startPage);   // get first page content

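$lastPage = 1;	// fall back to a single page if the pager cannot be read
if ($firstPage->getStatus() === 200) {
	$found = $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1);	// get total page info from first page
	if ($found !== null) {
		$lastPage = (int) $found;
	}
}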

$db = new EmailDatabase();	// can be excluded if do not want to write to database
$scrapeContent = new Scraper($db);	// can be excluded as well
$pages = new HttpCurl($scrapeContent);

// Looping from first page to last and parse each and every pages to database
for($i=1; $i <= $lastPage; $i++) { 
	$targetPage = $target . $i;
	$pages->get($targetPage, PARSE_CONTENT);
}

?>
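The page count in extract.php comes from the LASTPAGE pattern applied to the first page. The sketch below isolates that step with made-up pager markup. lastPageNumber() is a hypothetical helper, and the fallback to page 1 is an assumption about what a sensible default would be.

```php
<?php
// Same LASTPAGE pattern as in scraper.php.
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');

// Hypothetical helper: read the last page number from the pager markup,
// falling back to a default when the pager is missing or not numeric.
function lastPageNumber(string $html, int $default = 1): int
{
    if (preg_match(LASTPAGE, $html, $m) && preg_match('~^\d+$~', $m[1])) {
        return (int) $m[1];
    }
    return $default;
}

// Made-up pager markup (an assumption, not real site data).
$pager = '<li class="pager-last last"><a href="/negotiators?page=12">last</a></li>';
echo lastPageNumber($pager), "\n";
echo lastPageNumber('<p>no pager here</p>'), "\n";
```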

 

Last modified: Tuesday, 29 December 2020 07:17