Friday, June 21, 2013

Extract URLs from a remote web page using PHP

Scraping data from websites is extremely popular nowadays. I have written a simple website parser class that grabs all the URLs from a website, and I am sharing the class below for everyone to see and have fun with.

We will use the parser class below to extract all image sources and hyperlinks from a website.
Usage:
Create an instance of the WebsiteParser class with the URL of the website you want to extract URLs from. Then call the getHrefLinks() and getImageSources() methods to extract hyperlinks and image sources respectively.
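
To give a rough idea of what this kind of extraction involves, here is a minimal sketch that pulls hyperlink targets and image sources out of an HTML string with PHP's built-in DOMDocument. It is not the WebsiteParser source: extractUrls() is a hypothetical helper, and the sample markup is made up for illustration. For a remote page, the HTML would first have to be downloaded (for example with cURL).

```php
<?php
/**
 * Hypothetical helper (not part of WebsiteParser): collect hyperlink
 * targets and image sources from an HTML string.
 */
function extractUrls(string $html): array
{
    $dom = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    $images = [];
    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = trim($image->getAttribute('src'));
        if ($src !== '') {
            $images[] = $src;
        }
    }

    // Drop duplicates and reindex so the result is a plain list.
    return [
        'links'  => array_values(array_unique($links)),
        'images' => array_values(array_unique($images)),
    ];
}

// Made-up sample page used only for this sketch.
$html = '<html><body>'
      . '<a href="/about">About</a>'
      . '<a href="http://example.com/">Home</a>'
      . '<a href="/about">About again</a>'
      . '<img src="logo.png" alt="">'
      . '</body></html>';

$result = extractUrls($html);
print_r($result);
```

Relative URLs such as /about are returned as written; resolving them against the page's base URL would be a separate step.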

View Demo :: Try it out and rate on phpclasses.org



4 comments:

  1. I added the proxy values to the curl_options array.

    Properties in the class:

    /**
     * Proxy Address
     * @var int
     */
    private $proxy;

    /**
     * Proxy Port
     * @var int
     */
    private $proxy_port;

    Constructor:

    /**
     * Class constructor
     * @param string $url Target Url to parse
     * @param string $link_type Link type to grab
     * @param int $proxy proxy address
     * @param int $proxy_port proxy port
     */
    function __construct($url, $link_type = 'all', $proxy = 0, $proxy_port = 0) {
        $this->target_url = $url;
        $this->setUrls();
        $this->setLinksType($link_type);
        if ($proxy > 0 && $proxy_port > 0) {
            $this->proxy = $proxy;
            $this->proxy_port = $proxy_port;
            $this->curl_options = array_merge(
                $this->curl_options,
                array('CURLOPT_PROXY' => $this->proxy, 'CURLOPT_PROXYPORT' => $this->proxy_port)
            );
        }
    } //__construct()

    Replies
    1. Thank you for your comment and for using the class. You can contribute that update to https://github.com/morshedalam/url-scraper-php. You can also check out the class at http://www.phpclasses.org/package/8113-PHP-Parse-and-extract-links-and-images-from-Web-pages.html.

  2. Oops, sorry: the "proxy" property is a string type!

    Replies
    1. Thank you for your comment and for using the class.
      You can contribute that update to https://github.com/morshedalam/url-scraper-php.

      You can also check out the class at http://www.phpclasses.org/package/8113-PHP-Parse-and-extract-links-and-images-from-Web-pages.html.

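
The proxy change discussed in the comments, with the string-type correction applied, might look like the standalone sketch below. It rests on assumptions that the post does not confirm: that the class keeps its cURL options in an array keyed by the CURLOPT_* constants and applies them with curl_setopt_array(). Under that assumption the keys need to be the constants themselves rather than quoted strings, and array_replace() is used because array_merge() renumbers integer keys. withProxy() is a hypothetical helper, and the address and port are placeholders.

```php
<?php
/**
 * Hypothetical helper: return a copy of the cURL options with proxy
 * settings added. Assumes the options are keyed by CURLOPT_* constants,
 * as curl_setopt_array() expects.
 */
function withProxy(array $curlOptions, string $proxy = '', int $proxyPort = 0): array
{
    // The proxy address is a string (host name or IP), so test for an
    // empty string instead of comparing it with a number.
    if ($proxy === '' || $proxyPort <= 0) {
        return $curlOptions;
    }

    // array_replace() preserves the integer CURLOPT_* keys;
    // array_merge() would renumber them from zero.
    return array_replace($curlOptions, [
        CURLOPT_PROXY     => $proxy,
        CURLOPT_PROXYPORT => $proxyPort,
    ]);
}

// Placeholder proxy address and port, for illustration only.
$options = withProxy([CURLOPT_RETURNTRANSFER => true], '127.0.0.1', 8080);
var_dump($options[CURLOPT_PROXY], $options[CURLOPT_PROXYPORT]);
```

Inside the constructor, the same idea would reduce to assigning the result of such a helper back to the options property when a proxy is supplied.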