Scraping Twitpics with PHP

Update! My Twitpic scraper (as well as the search API calls) has been integrated into NCSU’s Tweetgator. Check it out on GitHub!

A couple of months ago, IU East revamped its Twitter wall. We incorporated a codebase originally developed by NCSU, and then I extended it by adding inline hashtag searching and a Twitpic scraper.

At the time I wrote it, I could not find any other existing Twitpic scraper – Twitpic doesn’t have a formal API (or at least, it didn’t then; I don’t think it does at the time of this writing, either).

Effectively, this script (see after the jump) browses the Twitpic site, parses out the image IDs, and then rebuilds the Twitpic image URLs. It is somewhat rudimentary in that it neither caches nor actually downloads the images; anyone reading this may feel free to extend the code into something like that.

The script below is licensed under the GPL2, with all the requirements, freedoms, and obligations therein.


///// CURL down the Twitpic data /////////////////////////////
// See below for discussion on why we're not RegExing for images directly
// $url, $quantity, $format, $liFormat and the *_URL constants are assumed
// to be defined before this point.
$searchForPhotos = '';  // regex matching the photo link targets
$ch = curl_init($url);
$photoIDs = array();
$photos = array();

// Return the transfer as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$temp = curl_exec($ch);

// Collect the matches and keep only the first $quantity of them
preg_match_all($searchForPhotos, $temp, $photoIDs);
$photoIDs = array_slice($photoIDs[0], 0, $quantity);

$output = "";

switch ($format) {
  case "LI":
    // Render each photo ID through the list-item template
    foreach ($photoIDs as $id) {
      $output .= sprintf($liFormat, $id);
    }
    break;

  case "JSON":
    ///// Parse out the raw data into a usable format (JSON) //////
    foreach ($photoIDs as $id) {
      $photos[$id]["mini"] = sprintf(MINI_URL, $id);
      $photos[$id]["thumb"] = sprintf(THUMB_URL, $id);
      $photos[$id]["full"] = sprintf(LARGE_URL, $id);
      $photos[$id]["url"] = sprintf(PIC_URL, $id);
    }
    $output = json_encode($photos);
    break;
}
echo $output;


The main challenge I encountered initially was that Twitpic obfuscates the way images are served. It seems counter-intuitive to search for link tags instead of IMG tags, but if you try hotlinking directly to the IMG src, you’ll find that it doesn’t work (presumably blocked via .htaccess or something similar).
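
To illustrate the link-tag approach, here is a minimal sketch. The sample markup, the helper name extractPhotoIds, and the pattern itself are all assumptions for illustration; they are not Twitpic's actual page structure, nor the pattern used in the script above.

```php
<?php
// Hypothetical helper: pull photo IDs out of anchor hrefs rather than IMG tags.
// Assumes photo links look like <a href="/abc123">; the real markup may differ.
function extractPhotoIds($html, $quantity)
{
    $matches = array();
    // Capture group 1 holds the alphanumeric ID from each link target.
    preg_match_all('#<a[^>]+href="/([a-z0-9]+)"#i', $html, $matches);

    // Drop duplicates (a photo may be linked more than once), then limit.
    $ids = array_values(array_unique($matches[1]));
    return array_slice($ids, 0, $quantity);
}

// Made-up page fragment: two photos, one of them linked twice.
$sampleHtml = '<div><a href="/1a2b3c"><img src="thumb1.jpg"></a>'
            . '<a href="/4d5e6f"><img src="thumb2.jpg"></a>'
            . '<a href="/1a2b3c">permalink</a></div>';

print_r(extractPhotoIds($sampleHtml, 2));
```

Reading the capture group instead of the whole match means the IDs can be passed straight to sprintf without any further trimming.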

In the interest of fairness, it would probably be ideal to actually download and cache the images rather than hotlinking, and I would advise anyone who implements this script to do that.
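
One way such a cache could look is sketched below. This is not part of the original script: the function names, the cache directory, and the expiry policy are all hypothetical, and the fetcher is passed in as a callable so the download mechanism can be swapped out.

```php
<?php
// Hypothetical helper: return a local path for a remote image, downloading it
// only when no fresh cached copy exists. $fetch is any callable that takes a
// URL and returns the image bytes, or false on failure.
function cachedImagePath($url, $cacheDir, $maxAge, $fetch)
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0755, true);
    }

    // Hash the URL so the cache file name is filesystem-safe.
    $path = $cacheDir . '/' . md5($url);

    $isFresh = is_file($path) && (time() - filemtime($path)) < $maxAge;
    if (!$isFresh) {
        $data = call_user_func($fetch, $url);
        if ($data === false) {
            // Fall back to a stale copy if one exists; otherwise report failure.
            return is_file($path) ? $path : false;
        }
        file_put_contents($path, $data);
    }
    return $path;
}

// A cURL-based fetcher using the same transfer option as the script.
function curlFetch($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    return curl_exec($ch);
}
```

A call such as cachedImagePath($photos[$id]["thumb"], '/tmp/twitpic-cache', 3600, 'curlFetch') would then return a local file that can be served in place of the hotlinked image.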

Once the page scrape has been parsed for link targets, we take those matches and build new URLs. The photo URLs that work offsite differ from those shown on Twitpic’s own site, which outsources image hosting to Amazon’s cloud web services. The Twitpic /show/ URLs are about as close to an API / web service as it comes, so that’s what we have to work with!
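
The URL-building step can be sketched as follows. The constant names match those used in the script, but the values shown here are assumptions based on the /show/ URL scheme and should be checked against Twitpic itself; buildPhotoUrls is a hypothetical helper and the ID is made up.

```php
<?php
// Assumed URL templates; %s is replaced with the photo ID.
define('PIC_URL',   'http://twitpic.com/%s');
define('MINI_URL',  'http://twitpic.com/show/mini/%s');
define('THUMB_URL', 'http://twitpic.com/show/thumb/%s');
define('LARGE_URL', 'http://twitpic.com/show/large/%s');

// Build the set of URLs for one photo ID, using the same keys as the script.
function buildPhotoUrls($id)
{
    return array(
        'mini'  => sprintf(MINI_URL, $id),
        'thumb' => sprintf(THUMB_URL, $id),
        'full'  => sprintf(LARGE_URL, $id),
        'url'   => sprintf(PIC_URL, $id),
    );
}

$urls = buildPhotoUrls('1a2b3c');  // made-up ID
echo $urls['thumb'], "\n";         // prints http://twitpic.com/show/thumb/1a2b3c
```

Keeping the templates in constants means only one place needs updating if the URL scheme changes.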

The IU East Twitter wall uses a quick Ajax call to populate the Twitpic photos, which keeps the initial page load fast.