Scraping pictures from Tiffany’s website

Not that this is what’s kept me busy in technology for the last couple months, but it’s easy to post about as I try to recover from the holidays’ hiatus.

Tiffany’s (that high-end jewelry store) has many pictures of its products on its site. The photography is quite good, and most of the pictures have large, high-quality versions that make for good desktop wallpaper. However, it’s tedious to go to each link and save each picture individually – there are a lot of them.

Regular mirroring doesn’t work too well. The site uses a lot of JavaScript, in particular to pop up windows with the enlarged picture inside. Fortunately, most of Tiffany’s zoomed-in product pictures have URLs in a regular format, e.g. http://www.tiffany.com/images/products/zoom_images/12259417_xl.jpg. Rather than try to be “smart” and figure out a way to decipher the correct links from the website programmatically, I chose to be “dumb” and do a brute-force search on every possible number.

This saves me a lot of brain time, at the expense of computer time. Fortunately, the computer doesn’t value its time as much as I value mine, so it’s happy to run for a few days. It takes quite a while to test one hundred million possibilities. The script below isn’t my first attempt at trying to get them all, but it’s certainly the fastest. I ran it on a Linux box, and it’s mainly curl that makes it possible.

Some of the optimizations I made along the way:

  • A naive HTTP download for each possibility is quite slow, not so much from creating a new process each time, but from the network connection overhead and I/O.
  • curl is much better than wget for these kinds of things, as one can specify a pattern and range of numbers to use, and it will also reuse the same HTTP connection. Truly, “a Client that groks the URLs”.
  • It was faster to download only the HTTP headers and check the advertised content type (e.g. image/jpeg) with grep than to make a full HTTP request and examine the downloaded data to determine the file type (e.g. with file). The speedup was larger than I would have expected if it were due only to fewer bytes over the wire.
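
The header check from the last point can be sketched on its own. This is a minimal illustration, assuming the output of curl -I has been saved to a file; the function name is_jpeg_header and the file name headers.txt are made up for this example and are not part of the script that follows.

```shell
#!/bin/sh
# is_jpeg_header FILE
# Succeeds (exit status 0) if the saved HTTP response headers in FILE
# advertise a JPEG body. Hypothetical helper, not from the original script.
is_jpeg_header() {
  # Header names are case-insensitive, so match with -i; -q prints nothing.
  grep -qi '^content-type:[[:space:]]*image/jpeg' "$1"
}

# Example use (needs network access, so it is left commented out):
#   url=http://www.tiffany.com/images/products/zoom_images/12259417_xl.jpg
#   curl -s -I "$url" -o headers.txt &&
#     is_jpeg_header headers.txt &&
#     curl -s -O "$url"
```

Anchoring the match on the Content-Type header line is a little stricter than grepping for image/jpeg anywhere in the file, which is all the original check does.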

The script follows:

    #!/bin/bash
    # bash rather than sh: the "function" keyword below is not POSIX sh.

    # e.g. http://www.tiffany.com/images/products/zoom_images/12259417_xl.jpg
    START_ID=10000000

    CLEAN_LOCK=clean.lock
    CURL_LOCK1=curl1.lock
    CURL_LOCK2=curl2.lock
    CURL_LOCK3=curl3.lock
    CURL_LOCK4=curl4.lock

    function remove_non_jpegs {
      # Delete the non-JPEG files
      START_ID=$1
      END_ID=$2
      CLEAN_LOCK=$3
      echo "[`date +'%x %X'`]" Cleaning $START_ID-$END_ID

      ID=$START_ID
      while [ $ID -le $END_ID ];
      do
        # At this point $JPG holds only the saved HTTP headers, not image data
        JPG=${ID}_xl.jpg
        if [ -z "`grep -m 1 'image/jpeg' $JPG`" ]; then
          rm $JPG
        else
          echo $JPG is O.K.
          rm $JPG
          wget -q "http://www.tiffany.com/images/products/zoom_images/$JPG"
        fi
        ID=`expr $ID + 1`
      done

      rm -f $CLEAN_LOCK

      echo "[`date +'%x %X'`]" Done cleaning $START_ID-$END_ID
    }

    function slurp_data {
      # Fetch headers only (-I) for IDs $1..$2, one output file per ID
      curl -I -s "http://www.tiffany.com/images/products/zoom_images/[$1-$2]_xl.jpg" -o "#1_xl.jpg"
      rm -f $3
    }

    while [ $START_ID -lt 20000000 ];
    do
      END_ID=`expr $START_ID + 4000`

      INTERVAL=`expr $END_ID - $START_ID`
      INTERVAL=`expr $INTERVAL / 4`

      ONE_QUARTER_ID=`expr $START_ID + $INTERVAL`
      TWO_QUARTER_ID=`expr $ONE_QUARTER_ID + $INTERVAL`
      THREE_QUARTER_ID=`expr $TWO_QUARTER_ID + $INTERVAL`

      echo "[`date +'%x %X'`]" Getting $START_ID-$END_ID
      lockfile $CURL_LOCK1 $CURL_LOCK2 $CURL_LOCK3 $CURL_LOCK4
      slurp_data $START_ID $ONE_QUARTER_ID $CURL_LOCK1 &
      slurp_data `expr $ONE_QUARTER_ID + 1` $TWO_QUARTER_ID $CURL_LOCK2 &
      slurp_data `expr $TWO_QUARTER_ID + 1` $THREE_QUARTER_ID $CURL_LOCK3 &
      slurp_data `expr $THREE_QUARTER_ID + 1` $END_ID $CURL_LOCK4 &
      # Blocks until all four curl jobs have removed their lock files
      lockfile $CURL_LOCK1 $CURL_LOCK2 $CURL_LOCK3 $CURL_LOCK4
      rm -f $CURL_LOCK1 $CURL_LOCK2 $CURL_LOCK3 $CURL_LOCK4
      echo "[`date +'%x %X'`]" Done getting $START_ID-$END_ID

      lockfile $CLEAN_LOCK
      remove_non_jpegs $START_ID $END_ID $CLEAN_LOCK &

      START_ID=`expr $END_ID + 1`
    done
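
The quarter-splitting arithmetic in the main loop can also be written as a small standalone helper. The following is only a sketch, using POSIX arithmetic expansion instead of expr; the name split_range and its argument order are invented for this illustration, and the chunk boundaries it produces differ slightly from the ones the script above computes.

```shell
#!/bin/sh
# split_range START END PARTS
# Prints contiguous, non-overlapping "lo hi" pairs (inclusive) that cover
# START..END, using at most PARTS chunks. Hypothetical helper for illustration.
split_range() {
  start=$1
  end=$2
  parts=$3
  total=$(( end - start + 1 ))
  # Chunk size rounded up, so PARTS chunks are always enough to cover the range
  size=$(( (total + parts - 1) / parts ))
  lo=$start
  while [ "$lo" -le "$end" ]; do
    hi=$(( lo + size - 1 ))
    # The last chunk may be shorter than the others
    if [ "$hi" -gt "$end" ]; then
      hi=$end
    fi
    echo "$lo $hi"
    lo=$(( hi + 1 ))
  done
}
```

Each printed pair could then be handed to one background curl job, for example as the [lo-hi] range in the URL pattern.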

Kenzo Ramen mania

I’ve been going to Kenzo Ramen a bit too much lately, trying to bring other people there and solicit confirmation of whether this reaches the apex of “true ramen” that can be found in Asia. I also tend to eat two (large) bowls each time I go, which perhaps contributes to the feeling of dropping by too often. :)

This is the Galbi (Korean BBQ meat) and Shoyu (soy sauce) Ramen combo:
2005_1228_183231
The galbi’s good, and the shoyu ramen is almost refreshing; it’s such a simple broth.

I followed that order (I’m a glutton I know) with Kenzo’s newest addition, the pork-broth Tonkotsu Ramen:
2005_1228_183104
The broth for this is rich, with a fair bit of fat. I think I would have preferred this first instead of the Galbi/Shoyu combo. I love the egg too, though I wish I knew how to marinate one to make it slightly sweet like that.
