[colug-432] Data mining
Richard Hornsby
richardjhornsby at gmail.com
Fri Dec 17 16:04:27 EST 2010
This type of web scraping was a major part of my job for about 5 years, but I'm missing something too. They're doing something odd, and I haven't nailed down what yet. Going directly to the search results page using GET variables works fine; there's no need to POST the form.
http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436
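For a batch of parcels, the direct URL above can be generated rather than typed. A minimal Python sketch, assuming every parcel number has the DISTRICT-NUMBER form of 010-008436; the name search_url is only illustrative and not anything the site defines:

```python
from urllib.parse import urlencode

# Search endpoint taken from the URL above (the explicit :80 is the default
# HTTP port, so it is left off here).
SEARCH_BASE = "http://franklincountyoh.metacama.com/do/searchByParcelId"


def search_url(parcel):
    """Build the direct search URL for a parcel written like '010-008436'.

    ASSUMPTION: the part before the dash is the tax district and the part
    after it is the parcel number, as in the one example in this thread.
    """
    district, _, number = parcel.partition("-")
    if not (district and number):
        raise ValueError(f"expected DISTRICT-NUMBER, got {parcel!r}")
    query = urlencode({"taxDistrict": district, "parcelNbr": number})
    return SEARCH_BASE + "?" + query
```

Calling search_url("010-008436") rebuilds the URL above, minus the :80.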
On that page is what appears to be a standard HTML link to the TAXINFO page, but the target script is looking for something more. The "stateful" things I can think of are a session ID of some sort in the URL, a cookie, or a referrer. I've tried all three and none of them work. I also thought it might be a wget User-Agent problem, but that wasn't it either.
wget --referer="http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436" -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5" --load-cookies cookies "http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600"
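The command above exercises the referer, the User-Agent and a cookie file. The remaining candidate, a session ID carried in the URL itself, would look like the ;jsessionid=... suffix in Steve's message below. A rough Python sketch of pulling that value out of one URL and attaching it to another; the helper names are hypothetical, and whether this server honors an ID passed this way is untested:

```python
from urllib.parse import urlsplit, urlunsplit

MARKER = ";jsessionid="


def extract_jsessionid(url):
    """Return the session ID embedded in the URL path, or None if absent."""
    path = urlsplit(url).path
    _, found, session_id = path.partition(MARKER)
    return session_id if found else None


def with_jsessionid(url, session_id):
    """Append ;jsessionid=<id> to the path, ahead of any ?query string."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path + MARKER + session_id))
```

Note that the suffix belongs to the path, so it goes before the question mark, not at the end of the query string.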
At the moment I'm stumped; I have to be missing something I should know about. I have to run to an event practice in a bit, but I'll keep messing with it. There is a way; we just haven't found it yet.
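One possibility not ruled out above is that the session lives in a cookie that never made it into the cookies file (wget only writes session cookies to disk when --save-cookies is combined with --keep-session-cookies). A sketch in Python of keeping both requests in one cookie-carrying session follows. It is untested against this site, and it rests on two assumptions: that the search page is what sets the session cookie, and that curpid is the parcel digits plus a trailing "00", inferred from the single example 010-008436 -> 01000843600. The function names are illustrative only.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

BASE = "http://franklincountyoh.metacama.com/do/"
# Any browser-like string; whether the server checks it is unknown.
USER_AGENT = "Mozilla/5.0"


def search_url(parcel):
    """Direct search URL for a parcel written like '010-008436'."""
    district, _, number = parcel.partition("-")
    return BASE + "searchByParcelId?" + urlencode(
        {"taxDistrict": district, "parcelNbr": number})


def taxinfo_url(parcel):
    """TAXINFO URL for the same parcel."""
    # ASSUMPTION: curpid = parcel digits + "00" (one example only).
    curpid = parcel.replace("-", "") + "00"
    return BASE + "selectDisplay?" + urlencode(
        {"select": "TAXINFO", "curpid": curpid})


def make_opener():
    """Opener whose in-memory cookie jar persists across its requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def fetch_taxinfo(opener, parcel, timeout=30):
    """Load the search page first, then TAXINFO, reusing the same cookies."""
    referer = search_url(parcel)
    # First request: any cookie the server sets lands in the opener's jar.
    opener.open(referer, timeout=timeout).read()
    # Second request: same jar, plus the search page as the Referer.
    request = urllib.request.Request(
        taxinfo_url(parcel), headers={"Referer": referer})
    return opener.open(request, timeout=timeout).read()
```

To try it on a list of parcels, create one opener with make_opener() and call fetch_taxinfo(opener, parcel) for each, writing the returned bytes to a file per parcel. If the pages still come back without the tax data, the cookie and the referer can both be crossed off together.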
On Dec 17, 2010, at 13:25 , Steve VanSlyck wrote:
> If you go to
>
> http://franklincountyoh.metacama.com/do/searchByParcelId
>
> then enter, for example, parcel no. 010-008436
>
> you are taken to
>
> http://franklincountyoh.metacama.com/do/searchByParcelId;jsessionid=2C5A034D6CE2B1FF816F095759EF5347
>
> (which contains what I assume to be a session ID).
>
> If you then click on the link labeled "Tax/Payment Info" you are taken to
>
> http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600
> You can easily print that page to a PDF file if you like.
>
> So all that works fine.
>
> Yesterday, I was able to grab a large number of TAXINFO pages by
> (a) creating an HTML page on my desktop with a link to each of the parcels
> I was interested in, and
> (b) creating a PDF file using Acrobat (to grab the page and one level down).
>
> It didn't work at first, so I then
>
> (a) opened the page, grabbed the session ID link,
> (b) left the page open,
> (c) pasted the link into my local HTML page,
> (d) and then created the PDF file from Acrobat as before.
>
> It worked like a dream. Well, last night anyway.
>
> Today, however, it will not work at all. I've tried using the same session
> ID, using a fresh session ID, opening Acrobat first, opening Firefox (or
> IE) first, and so on.
>
> I don't know enough of what's going on under the hood, however, to recreate
> my earlier success. I don't even know if the need for a session ID was
> the problem or what I did (whatever it was) that made it go.
>
> Are there any Internet experts here that can help? I really don't want to
> download 200 pages manually.