With your PHP/MySQL environment set up in XAMPP, we can start writing a PHP script that retrieves the source of a web page. PHP offers several libraries for sending a request to a target web server and receiving the response. One of the most common is the cURL extension.
For now, we will create a very simple PHP/cURL class to request web pages from a server. After that, we can work on the returned source to scrape the information we need. We will modify and enhance this class as we go further.
First, create a folder named "scraper" under C:\xampp\htdocs. Then, using Notepad++, create a text file called httpcurl.php in C:\xampp\htdocs\scraper.
To enable syntax highlighting in Notepad++, select "Language"->"P"->"PHP".
Put the following code in httpcurl.php:
<?php
class HttpCurl
{
    private $_info, $_body, $_error;

    public function __construct()
    {
        // Fail early if the cURL extension is not available.
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        }
    }

    public function get($url)
    {
        $this->request($url);
    }

    protected function request($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body  = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);
    }

    public function getStatus()
    {
        // The array key must be a quoted string.
        return $this->_info['http_code'];
    }

    public function getHeader()
    {
        return $this->_info;
    }

    public function getBody()
    {
        return $this->_body;
    }

    public function __destruct()
    {
    }
}
?>
How it works
1. First, we create a cURL wrapper class called HttpCurl and declare three private variables: $_info, $_body and $_error.
class HttpCurl { private $_info, $_body, $_error;
2. In the class constructor, we check whether cURL is enabled in the PHP configuration. If it is not, the constructor throws an exception with the message "cURL not enabled!". In that case, enable the cURL extension in php.ini and restart XAMPP.
3. We create a public get() function, which calls a protected function request().
4. In the protected function request(), we call the cURL functions that request the web page.
First, we initialize cURL with curl_init() and store the returned cURL handle in $ch.
For the most basic request of a web page from a web server, we need to set two options.
curl_setopt($ch, CURLOPT_URL, $url) tells cURL that the requested web page is located at $url.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE) tells cURL to return the transfer as a string from curl_exec() instead of outputting it directly.
5. We execute the request with curl_exec($ch). The return value, which is the requested web page as a string (or FALSE if the request fails), is stored in the private variable $_body.
6. Before we scrape anything from a web page, we need to know the status of the request. After curl_exec(), we call curl_getinfo() to obtain the transfer information and store it in the private variable $_info.
7. Before we close the current cURL session, we check whether the transfer produced any error by calling curl_error(). $_error will be an empty string if there was no error.
8. Lastly, we call curl_close() to close the current session. At this stage, we have stored all the information about the requested web page.
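The steps above can also be sketched as a single function that reports failure explicitly. This is only an illustration, assuming PHP 7 or later: fetchPage() and the keys of the array it returns are hypothetical names and are not part of the HttpCurl class.

```php
<?php
// Hypothetical helper, not part of HttpCurl: the same cURL steps as
// request(), plus an explicit success flag.
function fetchPage(string $url): array
{
    if (!extension_loaded('curl')) {
        throw new RuntimeException('cURL extension is not loaded');
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $body  = curl_exec($ch);    // string on success, false on failure
    $info  = curl_getinfo($ch); // transfer information, including http_code
    $error = curl_error($ch);   // empty string when there was no error
    curl_close($ch);

    return array(
        'ok'    => $body !== false,
        'body'  => $body,
        'info'  => $info,
        'error' => $error,
    );
}
```

A caller could check $result['ok'] first and print $result['error'] when it is false, instead of trying to scrape an empty body.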
Next, we are ready to test our script and request a web page from our target server. Create another text file in the same directory, called test.php. You can run the file from the Windows Command Prompt or, for a simple program like this, from a web browser. Enter this URL to run the program: localhost/scraper/test.php.
<?php
include 'httpcurl.php';

$target = "http://<domain name>";
$page = new HttpCurl();
$page->get($target);

echo " Web Page Header<br>";
print_r($page->getHeader());
echo "<br>";

echo " Web Page Status<br>";
print_r($page->getStatus());
echo "<br>";

echo " Web Page Body<br>";
print_r($page->getBody());
?>
First, we include the file httpcurl.php in test.php. Then we instantiate the object with $page = new HttpCurl() and request the web page by calling get().
There are two important pieces of information we need before scraping a web page. One is the header information; the other is the content of the requested web page.
The header information (the transfer information collected by curl_getinfo()) is returned by getHeader() and displayed with print_r().
After replacing <domain name> with google.com (you can try any target domain), the output is:
Array
(
    [url] => http://www.google.com/
    [content_type] => text/html; charset=ISO-8859-1
    [http_code] => 200
    [header_size] => 1114
    [request_size] => 102
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 1
    [total_time] => 0.686
    [namelookup_time] => 0.031
    [connect_time] => 0.046
    [pretransfer_time] => 0.046
    [size_upload] => 0
    [size_download] => 51402
    [speed_download] => 74930
    [speed_upload] => 0
    [download_content_length] => 219
    [upload_content_length] => 0
    [starttransfer_time] => 0.093
    [redirect_time] => 0.094
    [certinfo] => Array
        (
        )

    [redirect_url] =>
)
The most important piece of header information is http_code.
If http_code is 200, the web server returned the page successfully, and we can proceed to scrape the information we want.
Since we only need the value of http_code, we call getStatus() in this case.
Web Page Status 200
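As a small illustration of checking the status before scraping, here is a sketch assuming PHP 7 or later. isSuccessStatus() is a hypothetical helper, and treating every 2xx code as success is an assumption that goes slightly beyond the 200 case described above.

```php
<?php
// Hypothetical helper: treat any 2xx status code as a successful request.
function isSuccessStatus(int $code): bool
{
    return $code >= 200 && $code < 300;
}

$status = 200; // in test.php this value would come from $page->getStatus()

if (isSuccessStatus($status)) {
    echo "Status $status: OK to scrape\n";
} else {
    echo "Status $status: request did not succeed\n";
}
```

With $status set to 200, this prints "Status 200: OK to scrape".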
If you want to see the content of the returned web page at this stage, just echo $page->getBody(). If you run it under XAMPP and open it in a browser, you will see the page rendered much as the original site displays it, because the web browser interprets the HTML tags in the returned source.
To view the actual source, right-click the page and select "View Source".
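Another option is to escape the HTML before printing it, so the browser shows the markup as text. The sketch below assumes PHP 7 or later. renderSource() is a hypothetical helper name, and the UTF-8 charset is an assumption about the fetched page.

```php
<?php
// Hypothetical helper: escape fetched HTML so the browser displays the
// source instead of rendering it.
function renderSource(string $body): string
{
    return '<pre>' . htmlspecialchars($body, ENT_QUOTES, 'UTF-8') . '</pre>';
}

$body = '<h1>Hello</h1>'; // stand-in for $page->getBody()
echo renderSource($body); // prints <pre>&lt;h1&gt;Hello&lt;/h1&gt;</pre>
```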
Dealing with Redirection
There is one more thing to consider before we proceed.
Your target web page might redirect to another URL. Even worse, some websites, intentionally or not, set up a “web spider trap”, where page A redirects to page B, page B redirects to page C, and page C redirects back to page A. Your script could get stuck in this infinite loop.
For example, if you change the domain name to "php8legs.com", you will get the output below with no source returned, because my website redirects php8legs.com to php8legs.com/en or php8legs.com/zh depending on your browser language setting.
Web Page Header
Array
(
    [url] => http://php8legs.com
    [content_type] => text/html; charset=utf-8
    [http_code] => 301
    [header_size] => 315
    [request_size] => 51
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 1.467
    [namelookup_time] => 0
    [connect_time] => 0.281
    [pretransfer_time] => 0.281
    [size_upload] => 0
    [size_download] => 0
    [speed_download] => 0
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 1.467
    [redirect_time] => 0
    [certinfo] => Array
        (
        )

    [redirect_url] => http://php8legs.com/en/
)
Web Page Status
301
Web Page Body
To handle this, we set CURLOPT_FOLLOWLOCATION to TRUE and CURLOPT_MAXREDIRS to a small number in request():
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
With these settings, our script follows at most 5 redirections for a web page and then gives up instead of looping forever.
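The transfer information can also tell us whether a redirect happened. The following sketch, assuming PHP 7 or later, reads only keys that appear in the curl_getinfo() output shown earlier. Both function names are hypothetical and are not part of the HttpCurl class.

```php
<?php
// Hypothetical helpers that inspect the array returned by curl_getinfo().

// True when the final response is a 3xx redirect that was not followed.
function isUnfollowedRedirect(array $info): bool
{
    $code = $info['http_code'] ?? 0;
    return $code >= 300 && $code < 400 && !empty($info['redirect_url']);
}

// Number of redirects cURL followed before the final response.
function redirectsFollowed(array $info): int
{
    return (int) ($info['redirect_count'] ?? 0);
}

// Values copied from the php8legs.com output shown earlier.
$info = array(
    'url'            => 'http://php8legs.com',
    'http_code'      => 301,
    'redirect_count' => 0,
    'redirect_url'   => 'http://php8legs.com/en/',
);

var_dump(isUnfollowedRedirect($info)); // bool(true)
var_dump(redirectsFollowed($info));    // int(0)
```

For the php8legs.com output, the first check is true because the 301 response was returned as is. Once CURLOPT_FOLLOWLOCATION is enabled, we would expect redirect_count to be greater than zero for such a page.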
Finally, our first draft of the HttpCurl class looks like this:
<?php
class HttpCurl
{
    private $_info, $_body, $_error;

    public function __construct()
    {
        // Fail early if the cURL extension is not available.
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        }
    }

    public function get($url)
    {
        $this->request($url);
    }

    protected function request($url)
    {
        $ch = curl_init();
        // Follow redirects, but stop after 5 of them.
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body  = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);
    }

    public function getStatus()
    {
        // The array key must be a quoted string.
        return $this->_info['http_code'];
    }

    public function getHeader()
    {
        return $this->_info;
    }

    public function getBody()
    {
        return $this->_body;
    }

    public function __destruct()
    {
    }
}
?>
Next we can proceed to scrape some information we need!