Not that this is what’s kept me busy in technology for the last couple months, but it’s easy to post about as I try to recover from the holidays’ hiatus.
Tiffany’s (that high-end jewelry store) has many pictures of its products on its site. The photography is quite good, and most of the pictures have large, high-quality versions that make for good desktop wallpaper. However, it’s tedious to go to each link and save each picture individually, since there are a lot of them.
Regular mirroring doesn’t work too well. The site uses a lot of JavaScript, in particular to pop up windows with the enlarged picture inside. Fortunately, most of Tiffany’s zoomed-in pictures of its products on the website have a regular format, e.g. http://www.tiffany.com/images/products/zoom_images/12259417_xl.jpg. Rather than try to be “smart” and figure out a way to decipher correct links from the website programmatically, I chose to be “dumb” and do a brute-force search on every possible number.
This saves me a lot of brain time, at the expense of computer time. Fortunately, the computer doesn’t value its time as much as I value mine, so it’s happy to run for a few days. It takes quite a while to test one hundred million possibilities. The script below isn’t my first attempt at trying to get them all, but it’s certainly the fastest. I ran it on a Linux box, and it’s mainly curl that makes it possible.
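The brute-force idea boils down to plugging every candidate number into the URL pattern. Here is a minimal sketch of that enumeration; the helper names `zoom_url` and `candidate_urls` are made up for illustration and do not appear in the script itself. It only prints URLs and never touches the network.

```shell
# zoom_url ID (hypothetical helper): print the zoom-image URL for one
# product ID, following the pattern quoted in the post.
zoom_url() {
    echo "http://www.tiffany.com/images/products/zoom_images/${1}_xl.jpg"
}

# candidate_urls START END (hypothetical helper): print one candidate URL
# per ID in START..END inclusive. Most of these will not exist on the
# server; telling hits from misses is a separate step.
candidate_urls() {
    id=$1
    while [ "$id" -le "$2" ]; do
        zoom_url "$id"
        id=$(( id + 1 ))
    done
}
```

Calling `candidate_urls 12259417 12259419` prints three URLs, one per line, starting with the example URL above.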
Some of the optimizations I made along the way:
- A naive HTTP download for each possibility is quite slow, not so much from creating a new process each time as from the network connection overhead and I/O.
- `curl` is much better than `wget` for these kinds of things, as one can specify a pattern and range of numbers to use, and it will also reuse the same HTTP connection. Truly, “a Client that groks the URLs”.
- It was faster to download only the HTTP header and look at the advertised file type (e.g. image/jpeg) using `grep` than to make a full HTTP request and examine the downloaded data to determine the file type (e.g. with `file`). The speedup was larger than I would have expected if it were due only to fewer bytes over the wire.
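The last two points can be sketched as a pair of small helpers. This is an illustration under assumptions, not the script's actual code: `range_url` and `is_jpeg_header` are hypothetical names, and the header check assumes the server sends a `Content-Type` line of the usual form.

```shell
# range_url START END (hypothetical helper): build a URL using curl's
# numeric range glob, so a single curl process walks START..END and can
# reuse its HTTP connection between requests.
range_url() {
    echo "http://www.tiffany.com/images/products/zoom_images/[$1-$2]_xl.jpg"
}

# is_jpeg_header (hypothetical helper): read saved HTTP response headers
# on stdin and succeed only if they advertise a JPEG body. The match is
# case-insensitive because header names are not case-sensitive.
is_jpeg_header() {
    grep -i -q '^content-type:[[:space:]]*image/jpeg'
}
```

With these, something like `curl -I -s "$(range_url 10000000 10001000)" -o "#1_xl.jpg"` would save one header file per ID, and `is_jpeg_header < 10000000_xl.jpg` would then say whether that ID is worth a full download.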
The script follows:
    #!/bin/bash
    # (bash rather than plain sh, because of the "function" keyword below)
    # e.g. http://www.tiffany.com/images/products/zoom_images/12259417_xl.jpg

    START_ID=10000000
    CLEAN_LOCK=clean.lock
    CURL_LOCK1=curl1.lock
    CURL_LOCK2=curl2.lock
    CURL_LOCK3=curl3.lock
    CURL_LOCK4=curl4.lock

    function remove_non_jpegs {
        # Delete the non-JPEG files
        START_ID=$1
        END_ID=$2
        CLEAN_LOCK=$3
        echo "[`date +'%x %X'`]" Cleaning $START_ID-$END_ID
        ID=$START_ID
        while [ $ID -le $END_ID ]; do
            JPG=${ID}_xl.jpg
            # At this point $JPG holds only the HTTP headers saved by curl -I
            if [ -z "`grep -m 1 'image/jpeg' $JPG`" ]; then
                rm $JPG
            else
                echo $JPG is O.K.
                # Replace the header file with the real image
                rm $JPG
                wget -q "http://www.tiffany.com/images/products/zoom_images/$JPG"
            fi
            ID=`expr $ID + 1`
        done
        rm -f $CLEAN_LOCK
        echo "[`date +'%x %X'`]" Done cleaning $START_ID-$END_ID
    }

    function slurp_data {
        # Fetch headers only for IDs $1..$2, one output file per ID
        curl -I -s "http://www.tiffany.com/images/products/zoom_images/[$1-$2]_xl.jpg" -o "#1_xl.jpg"
        rm -f $3
    }

    while [ $START_ID -lt 20000000 ]; do
        END_ID=`expr $START_ID + 4000`
        INTERVAL=`expr $END_ID - $START_ID`
        INTERVAL=`expr $INTERVAL / 4`
        ONE_QUARTER_ID=`expr $START_ID + $INTERVAL`
        TWO_QUARTER_ID=`expr $ONE_QUARTER_ID + $INTERVAL`
        THREE_QUARTER_ID=`expr $TWO_QUARTER_ID + $INTERVAL`

        echo "[`date +'%x %X'`]" Getting $START_ID-$END_ID

        # Take all four locks; each curl worker removes its own when done
        lockfile $CURL_LOCK1 $CURL_LOCK2 $CURL_LOCK3 $CURL_LOCK4
        slurp_data $START_ID $ONE_QUARTER_ID $CURL_LOCK1 &
        slurp_data `expr $ONE_QUARTER_ID + 1` $TWO_QUARTER_ID $CURL_LOCK2 &
        slurp_data `expr $TWO_QUARTER_ID + 1` $THREE_QUARTER_ID $CURL_LOCK3 &
        slurp_data `expr $THREE_QUARTER_ID + 1` $END_ID $CURL_LOCK4 &

        # Blocks until all four workers have released their locks
        lockfile $CURL_LOCK1 $CURL_LOCK2 $CURL_LOCK3 $CURL_LOCK4
        rm -f $CURL_LOCK1 $CURL_LOCK2 $CURL_LOCK3 $CURL_LOCK4
        echo "[`date +'%x %X'`]" Done getting $START_ID-$END_ID

        # Clean this batch in the background while the next one downloads
        lockfile $CLEAN_LOCK
        remove_non_jpegs $START_ID $END_ID $CLEAN_LOCK &

        START_ID=`expr $END_ID + 1`
    done
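The quarter-splitting arithmetic in the main loop can also be written as a general helper. The sketch below is an alternative, not what the script above does: it swaps the `lockfile` handshake for the shell's built-in `wait`, and both `split_range` and `run_parallel` are hypothetical names introduced here.

```shell
# split_range START END PARTS (hypothetical helper): print up to PARTS
# contiguous "lo hi" sub-ranges that together cover START..END inclusive.
split_range() {
    total=$(( $2 - $1 + 1 ))
    lo=$1
    i=0
    while [ "$i" -lt "$3" ]; do
        size=$(( total / $3 ))
        # Hand the remainder out one ID at a time to the first chunks.
        if [ "$i" -lt $(( total % $3 )) ]; then size=$(( size + 1 )); fi
        # Skip empty chunks when there are more parts than IDs.
        if [ "$size" -gt 0 ]; then
            hi=$(( lo + size - 1 ))
            echo "$lo $hi"
            lo=$(( hi + 1 ))
        fi
        i=$(( i + 1 ))
    done
}

# run_parallel WORKER START END PARTS (hypothetical helper): start one
# background WORKER per sub-range, passing it "lo hi" as arguments, then
# block with `wait` until every worker has exited.
run_parallel() {
    worker=$1
    for pair in $(split_range "$2" "$3" "$4" | tr ' ' ':'); do
        "$worker" "${pair%%:*}" "${pair##*:}" &
    done
    wait
}
```

For the first batch, `split_range 10000000 10004000 4` yields the same four sub-ranges the script computes with `expr`. A worker function that takes a start and end ID could then be run as `run_parallel my_worker 10000000 10004000 4`, where `my_worker` is whatever download function one chooses to define.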

