[colug-432] Data mining

Richard Hornsby richardjhornsby at gmail.com
Fri Dec 17 16:04:27 EST 2010

This type of web scraping was a major part of my job for about 5 years, but I'm missing something too -- they're doing something weird and I haven't nailed down what yet.  Going directly to the search results page with GET variables works fine, so there is no need to POST the form.


On that page is what appears to be a standard HTML link to the taxinfo page -- but the target TAXINFO script is looking for something.  The "stateful" things I can think of are: a session ID of some sort in the URL, a cookie, or a referrer.  I've tried all three and none of them works.  I also thought it might be a wget User-Agent problem, but that wasn't it either.

wget --referer="http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436" -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5" --load-cookies cookies "http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600"

At the moment I'm stumped, and I must be missing something I should know about.  I have to run to an event practice in a bit, but I'll keep messing with it.  There is a way -- we just haven't found it yet.
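
One thing the command above does not show is how the "cookies" file was created.  wget leaves session cookies (those with no expiry date) out of a saved jar unless --keep-session-cookies is given, and a servlet-style JSESSIONID is usually that kind of cookie.  The sketch below is a guess at a two-step fetch, not a confirmed fix.  It assumes that the server ties the "current parcel" to the session, and that curpid is the tax district plus the parcel number plus "00" (inferred from the single example in this thread).  The jar name, output file name, and function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: host and paths come from this thread; the rest is assumption.
BASE="http://franklincountyoh.metacama.com/do"
JAR="cookies.txt"    # hypothetical cookie jar file name
UA="Mozilla/5.0"     # any browser-like string; the UA did not seem to matter

# Build the search-results URL for a tax district and parcel number.
search_url() {
    printf '%s/searchByParcelId?taxDistrict=%s&parcelNbr=%s\n' "$BASE" "$1" "$2"
}

# Build the TAXINFO URL.  Assumption: curpid = district + parcel + "00",
# inferred from the one example 010 / 008436 -> 01000843600.
taxinfo_url() {
    printf '%s/selectDisplay?select=TAXINFO&curpid=%s%s00\n' "$BASE" "$1" "$2"
}

# Step 1: request the search page and save every cookie it sets,
#         session cookies included.
# Step 2: request TAXINFO with the same jar, sending the search page
#         as the referer.
fetch_taxinfo() {
    wget -q -U "$UA" --save-cookies "$JAR" --keep-session-cookies \
         -O /dev/null "$(search_url "$1" "$2")" || return 1
    wget -q -U "$UA" --load-cookies "$JAR" \
         --referer="$(search_url "$1" "$2")" \
         -O "taxinfo-$1-$2.html" "$(taxinfo_url "$1" "$2")"
}
```

Calling fetch_taxinfo 010 008436 would try the parcel from the example.  If the TAXINFO page still comes back empty, comparing the contents of the jar with the cookies a browser holds after the same search would show whether the session cookie is really the missing piece.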

On Dec 17, 2010, at 13:25 , Steve VanSlyck wrote:

> If you go to
> http://franklincountyoh.metacama.com/do/searchByParcelId
> then enter, for example, parcel no. 010-008436
> you are taken to
> http://franklincountyoh.metacama.com/do/searchByParcelId;jsessionid=2C5A034D6CE2B1FF816F095759EF5347
> (which contains what I assume to be a session ID).
> If you then click on the link labeled "Tax/Payment Info" you are taken to
> http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600 
> You can easily print that page to a PDF file if you like.
> So all that works fine.
> Yesterday, I was able to grab a large number of TAXINFO pages by
> (a) creating an HTML page on my desktop with a link to each of the parcels 
> I was interested in, and
> (b) creating a PDF file using Acrobat to grab the page and one level down.
> It didn't work at first, so I then
> (a) opened the page, grabbed the session ID link,
> (b) left the page open,
> (c) pasted the link into my local HTML page,
> (d) and then created the PDF file from Acrobat as before.
> It worked like a dream. Well, last night anyway.
> Today, however, it will not work at all. I've tried using the same session 
> ID, using a fresh session ID, opening Acrobat first, opening Firefox (or 
> IE) first, and so on.
> I don't know enough of what's going on under the hood, however, to recreate 
> my earlier success. I don't even know if the need for a session ID was 
> the problem or what I did (whatever it was) that made it go.
> Are there any Internet experts here that can help? I really don't want to 
> download 200 pages manually.
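
For the 200-page batch described in the quoted message, one possible approach is to generate a URL list in which each parcel's search URL comes immediately before its TAXINFO URL, and then hand the whole list to a single wget run.  wget keeps cookies in memory for the duration of one invocation, so both requests for a parcel would share a session.  This is a rough sketch under two assumptions: that the server only needs the search to have happened earlier in the same session, and that curpid is the district plus the parcel number plus "00".  The function name and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch only: host and paths come from this thread; the rest is assumption.
BASE="http://franklincountyoh.metacama.com/do"

# Read parcel IDs such as "010-008436" from stdin, one per line, and print
# the search URL followed by the TAXINFO URL for each, so that the search
# always precedes the page that appears to depend on it.
parcel_urls() {
    while IFS= read -r parcel; do
        [ -n "$parcel" ] || continue      # skip blank lines
        d=${parcel%%-*}                   # text before the dash: tax district
        n=${parcel#*-}                    # text after the dash: parcel number
        printf '%s/searchByParcelId?taxDistrict=%s&parcelNbr=%s\n' "$BASE" "$d" "$n"
        printf '%s/selectDisplay?select=TAXINFO&curpid=%s%s00\n' "$BASE" "$d" "$n"
    done
}
```

With the parcel numbers in a file named parcels.txt, running parcel_urls < parcels.txt > urls.txt and then wget --wait=2 -i urls.txt would fetch the pages in order, pausing two seconds between requests to go easy on the county's server.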
