I do my own stock investing and purchases. One great read for ideas is TD Waterhouse’s Newcrest research, which TD makes available to those who have brokerage accounts with them. One has to log in to their website, navigate a few pages, and then look at the reports one at a time.
Two annoyances for me: 1) I don’t remember to log in every day and check, and 2) the session timeout for the website is 15 minutes, so I have to log back in after reading a longish paper.
Being me, I can’t stand it when something to do with computers should be possible and I’m not able to do it. I don’t want to navigate the website and read the PDFs one at a time; I’d much rather have them picked up automatically and sent to me via email, which I already have a habit of reading.
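The email half of that wish is the easy part. As a rough sketch only (Python's standard library, not necessarily the tooling used here; the addresses, subject line, and local SMTP host are all placeholder assumptions), a downloaded PDF can be wrapped in a message like this:

```python
from email.message import EmailMessage
import smtplib


def build_report_email(sender, recipient, report_name, pdf_bytes):
    """Wrap one downloaded PDF in an email message (nothing is sent here)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # Subject format is an arbitrary choice for this sketch.
    msg["Subject"] = f"Research report: {report_name}"
    msg.set_content(f"Attached: {report_name}")
    # Attach the raw PDF bytes as application/pdf.
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=report_name
    )
    return msg


def send_via_smtp(msg, host="localhost"):
    """Hand the message to an SMTP server; 'localhost' is an assumption."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

Building the message is kept separate from sending it, so the first function can be checked without a mail server.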
So after a few hours in the dead of night, and several more after some death-like sleep, I finally hit upon a setup that works for me. Here’s what I learned:
On my side
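- wget doesn’t save session cookies between calls by default: --save-cookies skips them unless --keep-session-cookies is also given, an option only newer versions have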
- curl is a more powerful and flexible tool than wget
- LiveHttpHeaders is a Mozilla plugin that lets one see exactly what the browser is sending and receiving - better than a logging proxy, for HTTPS purposes.
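To illustrate the session cookie point, here is a minimal sketch in Python rather than curl or wget (so it is not the setup described here). It assumes a cookie file path of your choosing; `open_jar` and `save_jar` are hypothetical helper names. The key detail is `ignore_discard=True`, which tells the standard library's cookie jar to keep cookies that have no expiry date, i.e. session cookies, when saving and loading.

```python
import http.cookiejar
import urllib.request


def open_jar(path):
    """Return (opener, jar); the jar is reloaded from `path` if the file exists."""
    jar = http.cookiejar.MozillaCookieJar(path)
    try:
        # ignore_discard=True also loads session cookies (no expiry date).
        jar.load(ignore_discard=True)
    except FileNotFoundError:
        pass  # first run: nothing has been saved yet
    # Every request made through this opener sends and stores cookies in `jar`.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def save_jar(jar):
    """Write the jar back to its file, session cookies included."""
    jar.save(ignore_discard=True)
```

Calling `jar.save()` without `ignore_discard=True` drops session cookies from the file, which is the same behaviour that makes a naive cookie file useless for a login flow like this one.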
On TD’s side
Their process to get to the market research goes something like this:
- Copy the user values in the login form to a hidden form on the same page, and submit that instead. The hidden form has an authorization token that is generated every time you load the front page, so you can’t just submit the form as the first step. Get a session cookie for your troubles.
- Get a pointless page where TD says they’re loading your profile. All it seems to do is reload to the user’s start page.
- From the start page, load the MarketsAndResearch page. It looks like we get another kind of cookie for our troubles: skipping this step yields “session timed out” messages. This yields a page composed of a few frames: header, side bar, and the main content page.
- Here’s where it gets a little weird. On loading the main content page, you get back an unusual hidden form. It’s prepopulated with a magic number (the form value is called “magicno”) and other form values called Blob1, Blob2, and Blob3. Blob3 is empty, but Blob1 and Blob2 each hold a really long string of characters (300+?). The form posts to an ASP page on another server, www.tdcanada.wallst.com. Said ASP (ASP.NET according to the HTTP headers) takes the magic values and gives back a redirect page. The redirect page has a few query parameters that identify you to the redirect target. Get a session cookie (presumably for this wallst.com server) for your troubles.
- After that, it’s pretty straightforward. All the session cookies have been obtained, so it’s a matter of parsing some HTML and JavaScript function calls to figure out which URL and what query parameters to use to get the report for the day you want. It’s a bit convoluted: the first call to get a report creates a window that calls a different URL (with similar arguments), and that window calls some “cgi-bin/upload.dll?” URL to finally get the PDF.
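The hidden-form steps above all follow the same pattern: fetch a page, pull out the form's action URL and input values, and post them back unchanged. Here is a hedged sketch of that pattern in Python (again, not the curl commands used here). The field names magicno, Blob1, Blob2 and Blob3 come from the page described above; `HiddenFormParser` and `build_form_post` are hypothetical names, and the sketch assumes the form of interest is the first one on the page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
import urllib.request


class HiddenFormParser(HTMLParser):
    """Collect the action URL and input values of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Empty or missing values (like Blob3) are sent as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms


def build_form_post(page_url, html):
    """Turn the first form in `html` into a POST request, values untouched."""
    parser = HiddenFormParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no <form> found in page")
    # The action may be relative, or absolute and on another server.
    target = urljoin(page_url, parser.action)
    body = urlencode(parser.fields).encode("ascii")
    # Supplying data makes urllib issue a POST.
    return urllib.request.Request(target, data=body)
```

The returned request would then be sent through a cookie-aware opener, so that whatever session cookie the target server hands back is kept for the next step.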
Applying a little reasoning, it looks like TD decided on a general-purpose Java portal, but their research people are running ASP.NET. So the Java portal goes to the trouble of contacting the ASP.NET server behind the scenes, agreeing on some magic values, and then giving the browser these values to use on the ASP.NET server. The ASP.NET server then knows that the user has already been given the thumbs up by the main portal, and tells the browser to go to a certain place, with a certain number in the query parameter, to get the session cookie.
Put that way, sounds like some kind of back-alley deal.
In the end, I’m just happy I was ultimately successful. Other than the learning experience, spending this much time to save such a small amount of manual labour seems pointless; I’m happy to have done it for the knowledge though.
I must admit, the TD website is a little more “kludgy” than I might have imagined. I was particularly surprised that their site causes IE to prompt the user about a potential security breach, because TD is reusing an SSL certificate that’s for a different TD server.