[colug-432] Data mining

Steve VanSlyck s.vanslyck at spamcop.net
Fri Dec 17 17:48:36 EST 2010


Thanks Rich. Here's something odd.

When I open the original HTML file which worked, and release all of the 
links except the first one which had a session ID that displayed in 
Firefox (unless I've gone mad again), THAT file works!

Creating a new one does not, but using the original does.

Does this reveal anything useful?

----- Original Message -----
From: Richard Hornsby <richardjhornsby at gmail.com>
To: Central OH Linux User Group - 432xx <colug-432 at colug.net>
Date: Fri, 17 Dec 2010 15:04:27 -0600
Subject: Re: [colug-432] Data mining

> 
> This type of web scraping was a major part of my job for about 5 years,
> but I'm missing something also -- they're doing something weird and I
> haven't quite nailed down what yet.  Going directly to the search results
> page using GET variables works fine, no need to post the form.
> 
> http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436
> 
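For anyone scripting this, the GET form of the search can be built from the two parcel fields. This is only a minimal Python sketch: the URL and the parameter names taxDistrict and parcelNbr come from the message above, while the helper names (split_parcel, search_url) are made up for illustration.

```python
from urllib.parse import urlencode

# Host and path as quoted in the thread (the explicit :80 is the default port).
BASE = "http://franklincountyoh.metacama.com/do"


def split_parcel(parcel_no):
    """Split '010-008436' into ('010', '008436'), keeping leading zeros."""
    district, number = parcel_no.split("-", 1)
    return district, number


def search_url(tax_district, parcel_nbr):
    """Build the search-results URL using GET variables instead of a form POST."""
    query = urlencode({"taxDistrict": tax_district, "parcelNbr": parcel_nbr})
    return f"{BASE}/searchByParcelId?{query}"


print(search_url(*split_parcel("010-008436")))
```

The values are handled as strings throughout so the leading zeros are not lost.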
> On that page is what appears to be a standard HTML link to the taxinfo
> page -- but the target TAXINFO script is looking for something - the
> "stateful" things I can think of are: a sessionid of some sort in the
> URL, a cookie, or a referrer.  I've tried all three and none are working.
> The other thing I thought was that it was a wget UA problem, but that
> wasn't it either.
> 
> wget \
>   --referer="http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436" \
>   -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5" \
>   --load-cookies cookies \
>   "http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600"
> 
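The same request can be sketched in Python with an in-memory cookie jar, so that whatever cookie the search page sets is sent back on the TAXINFO request. Note that a cookies file written by wget --save-cookies leaves out session cookies unless --keep-session-cookies is also given; whether that matters for this site is not established in the thread. The function names below are hypothetical, and the idea that the server wants the search page visited first in the same session is an assumption, not something confirmed here.

```python
import urllib.request
from http.cookiejar import CookieJar

BASE = "http://franklincountyoh.metacama.com/do"
# Browser User-Agent string copied from the wget command above.
USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) AppleWebKit/533.18.1 "
    "(KHTML, like Gecko) Version/5.0.2 Safari/533.18.5"
)


def make_opener():
    """Return (opener, jar); the jar keeps every cookie, session cookies included."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener, jar


def fetch_taxinfo(opener, tax_district, parcel_nbr, curpid, timeout=30):
    """Visit the search page first, then request TAXINFO with it as the Referer.

    Assumption: the server ties TAXINFO to state created by the search request.
    """
    search = (
        f"{BASE}/searchByParcelId?taxDistrict={tax_district}&parcelNbr={parcel_nbr}"
    )
    with opener.open(search, timeout=timeout) as resp:
        resp.read()  # the body is not needed, only the cookies it sets
    request = urllib.request.Request(
        f"{BASE}/selectDisplay?select=TAXINFO&curpid={curpid}",
        headers={"Referer": search},
    )
    with opener.open(request, timeout=timeout) as resp:
        return resp.read()


# Example call (performs real network requests, so it is left commented out):
# opener, jar = make_opener()
# html = fetch_taxinfo(opener, "010", "008436", "01000843600")
```

Printing the contents of the jar after the first request would show which cookie, if any, the site actually sets.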
> 
> At the moment I'm stumped and have to be missing something I should know
> about.  I have to run to an event practice in a bit, but I'll keep
> messing with it.  There is a way -- we just haven't found it yet.
> 
> 
> 
> On Dec 17, 2010, at 13:25 , Steve VanSlyck wrote:
> 
> > If you go to
> > 
> > http://franklincountyoh.metacama.com/do/searchByParcelId
> > 
> > then enter, for example, parcel no. 010-008436
> > 
> > you are taken to
> > 
> > http://franklincountyoh.metacama.com/do/searchByParcelId;jsessionid=2C5A034D6CE2B1FF816F095759EF5347
> > 
> > (which contains what I assume to be a session ID).
> > 
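The ";jsessionid=..." piece is the usual way Java servlet containers carry a session ID in the URL when no cookie is available. A small Python sketch for pulling that ID out of a URL and attaching it to another one follows. The helper names are hypothetical, and whether the selectDisplay page honors an ID passed this way is an assumption.

```python
import re

# Matches the session ID carried as a path parameter, e.g. ";jsessionid=2C5A..."
_SESSION_RE = re.compile(r";jsessionid=([0-9A-Za-z]+)")


def extract_session_id(url):
    """Return the jsessionid value from a URL, or None if there is none."""
    match = _SESSION_RE.search(url)
    return match.group(1) if match else None


def with_session_id(url, session_id):
    """Insert ';jsessionid=ID' after the path and before any query string."""
    path, sep, query = url.partition("?")
    return f"{path};jsessionid={session_id}{sep}{query}"


landing = (
    "http://franklincountyoh.metacama.com/do/searchByParcelId"
    ";jsessionid=2C5A034D6CE2B1FF816F095759EF5347"
)
sid = extract_session_id(landing)
print(sid)
print(with_session_id(
    "http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600",
    sid,
))
```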
> > If you then click on the link labeled "Tax/Payment Info" you are taken to
> > 
> > http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600
> > 
> > You can easily print that page to a PDF file if you like.
> > 
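Comparing the parcel number 010-008436 with curpid=01000843600, the curpid looks like the district, then the parcel number, then "00". That pattern is a guess from this single example and is not confirmed anywhere in the thread, so the suffix is left as a parameter in this Python sketch. The function names are made up for illustration.

```python
BASE = "http://franklincountyoh.metacama.com/do"


def curpid_from_parcel(parcel_no, suffix="00"):
    """Turn '010-008436' into '01000843600'.

    Assumption: curpid = district + parcel number + a fixed '00' suffix,
    inferred from the one example in this thread.
    """
    district, number = parcel_no.split("-", 1)
    if not (district.isdigit() and number.isdigit()):
        raise ValueError(f"unexpected parcel number: {parcel_no!r}")
    return f"{district}{number}{suffix}"


def taxinfo_url(parcel_no):
    """Build the TAXINFO page URL for one parcel number."""
    return f"{BASE}/selectDisplay?select=TAXINFO&curpid={curpid_from_parcel(parcel_no)}"


print(taxinfo_url("010-008436"))
```

Checking a second known parcel by hand would show quickly whether the "00" suffix holds in general.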
> > So all that works fine.
> > 
> > Yesterday, I was able to grab a large number of TAXINFO pages by
> > (a) creating an HTML page on my desktop with a link to each of the
> > parcels I was interested in, and
> > (b) creating a PDF file using Acrobat (to grab the page and one level
> > down).
> > 
> > It didn't work at first, so I then
> > 
> > (a) opened the page, grabbed the session ID link,
> > (b) left the page open,
> > (c) pasted the link into my local HTML page,
> > (d) and then created the PDF file from Acrobat as before.
> > 
> > It worked like a dream. Well, last night anyway.
> > 
> > Today, however, it will not work at all. I've tried using the same
> > session ID, using a fresh session ID, opening Acrobat first, opening
> > Firefox (or IE) first, and so on.
> > 
> > I don't know enough of what's going on under the hood, however, to
> > recreate my earlier success. I don't even know if the need for a
> > session ID was the problem or what I did (whatever it was) that made
> > it go.
> > 
> > Are there any Internet experts here that can help? I really don't
> > want to download 200 pages manually.
> > 
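Once a single page can be fetched reliably, the 200 downloads come down to a loop. Below is a minimal Python sketch of that loop. It assumes some fetch(parcel) function that returns the page bytes (hypothetical, to be supplied by whatever method ends up working), saves one HTML file per parcel, and pauses between requests to go easy on the county's server.

```python
import time
from pathlib import Path


def download_all(parcels, fetch, out_dir="taxinfo", delay=1.0):
    """Fetch each parcel's page and save it as <out_dir>/<parcel>.html.

    `fetch` is any callable taking a parcel number and returning bytes.
    Returns (saved_paths, failures) so failed parcels can be retried later.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for parcel in parcels:
        try:
            body = fetch(parcel)
        except OSError as exc:  # urllib's URLError and HTTPError are OSError subclasses
            failed.append((parcel, str(exc)))
        else:
            target = out / f"{parcel}.html"
            target.write_bytes(body)
            saved.append(target)
        time.sleep(delay)  # be polite: one request per `delay` seconds
    return saved, failed
```

Keeping the failures in a list rather than stopping at the first error means one expired session does not throw away the pages already saved.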
> > 
> > _______________________________________________
> > colug-432 mailing list
> > colug-432 at colug.net
> > http://lists.colug.net/mailman/listinfo/colug-432
> 
> 