[colug-432] Data mining

Richard Hornsby richardjhornsby at gmail.com
Fri Dec 17 16:04:27 EST 2010


This type of web scraping was a major part of my job for about 5 years, but I'm missing something too: they're doing something weird and I haven't quite nailed down what yet.  Going directly to the search results page using GET variables works fine, so there is no need to POST the form.

http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436
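
For a list of parcels, the same GET URL can be generated rather than typed. This is only a sketch: the helper name search_url is made up for illustration, it assumes parcel numbers are written as district-parcel (for example 010-008436), and it leaves off the explicit :80 because that is the default port for http.

```python
from urllib.parse import urlencode

BASE = "http://franklincountyoh.metacama.com/do"


def search_url(parcel_id: str) -> str:
    """Build the searchByParcelId URL for a parcel written as 'DDD-PPPPPP'.

    Hypothetical helper: the split on '-' assumes the district-parcel
    format used in this thread (e.g. '010-008436').
    """
    district, parcel = parcel_id.split("-")
    query = urlencode({"taxDistrict": district, "parcelNbr": parcel})
    return f"{BASE}/searchByParcelId?{query}"


print(search_url("010-008436"))
```

Looping this over the 200 parcel numbers would give the list of search pages to fetch.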

On that page is what appears to be a standard HTML link to the taxinfo page, but the target TAXINFO script is looking for something.  The "stateful" things I can think of are a session ID of some sort in the URL, a cookie, or a referrer.  I've tried all three and none of them works.  I also suspected a wget User-Agent problem, but that wasn't it either.

wget --referer="http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436" -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5" --load-cookies cookies "http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600"
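
One variation I have not verified against the site: fetch the search page and the TAXINFO page through a single cookie-holding session, so that any session cookie set by the first response is sent back with the second request along with the referer. The sketch below uses only the Python standard library. The function names are hypothetical, the User-Agent string is a placeholder, and the curpid rule (district + parcel + "00") is a guess from the single example in this thread.

```python
import http.cookiejar
import urllib.request

BASE = "http://franklincountyoh.metacama.com/do"
USER_AGENT = "Mozilla/5.0"  # placeholder; substitute a full browser string if needed


def curpid(parcel_id: str) -> str:
    """Guess the curpid value for a parcel written as 'DDD-PPPPPP'.

    Assumption: inferred only from 010-008436 -> 01000843600.
    """
    district, parcel = parcel_id.split("-")
    return district + parcel + "00"


def make_opener():
    """Return (opener, jar); the opener stores and replays cookies via the jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener, jar


def fetch_taxinfo(parcel_id: str, opener) -> bytes:
    """Request the search page, then the TAXINFO page, in the same session."""
    district, parcel = parcel_id.split("-")
    search = f"{BASE}/searchByParcelId?taxDistrict={district}&parcelNbr={parcel}"
    # First request: lets the server set whatever cookies it wants.
    opener.open(search).read()
    # Second request: same cookie jar, with the search page as the referer.
    taxinfo = f"{BASE}/selectDisplay?select=TAXINFO&curpid={curpid(parcel_id)}"
    request = urllib.request.Request(taxinfo, headers={"Referer": search})
    with opener.open(request) as response:
        return response.read()


# Example (performs real HTTP requests, so it is left commented out):
# opener, jar = make_opener()
# html = fetch_taxinfo("010-008436", opener)
```

If the server is tracking state by some other means, this will fail the same way the wget command does, but it does rule out a stale or empty cookie file.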


At the moment I'm stumped and must be missing something I should know about.  I have to run to an event practice in a bit, but I'll keep messing with it.  There is a way -- we just haven't found it yet.



On Dec 17, 2010, at 13:25 , Steve VanSlyck wrote:

> If you go to
> 
> http://franklincountyoh.metacama.com/do/searchByParcelId
> 
> then enter, for example, parcel no. 010-008436
> 
> you are taken to
> 
> http://franklincountyoh.metacama.com/do/searchByParcelId;jsessionid=2C5A034D6CE2B1FF816F095759EF5347
> 
> (which contains what I assume to be a session ID).
> 
> If you then click on the link labeled "Tax/Payment Info" you are taken to
> 
> http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600
> 
> You can easily print that page to a PDF file if you like.
> 
> So all that works fine.
> 
> Yesterday, I was able to grab a large number of TAXINFO pages by
> (a) creating an HTML page on my desktop with a link to each of the parcels 
> I was interested in, and
> (b) creating a PDF file using Acrobat (grabbing the page and one level down).
> 
> It didn't work at first, so I then
> 
> (a) opened the page, grabbed the session ID link,
> (b) left the page open,
> (c) pasted the link into my local HTML page,
> (d) and then created the PDF file from Acrobat as before.
> 
> It worked like a dream. Well, last night anyway.
> 
> Today, however, it will not work at all. I've tried using the same session 
> ID, using a fresh session ID, opening Acrobat first, opening Firefox (or 
> IE) first, and so on.
> 
> I don't know enough of what's going on under the hood, however, to recreate 
> my earlier success. I don't even know if the need for a session ID was 
> the problem or what I did (whatever it was) that made it go.
> 
> Are there any Internet experts here that can help? I really don't want to 
> download 200 pages manually.
> 
> 
> _______________________________________________
> colug-432 mailing list
> colug-432 at colug.net
> http://lists.colug.net/mailman/listinfo/colug-432



