[colug-432] Data mining

Steve VanSlyck s.vanslyck at spamcop.net
Fri Dec 17 17:50:29 EST 2010


Correction: "release" below should read "REPLACE" (all of the links instead of the first one).

----- Original Message -----
From: "Steve VanSlyck" <s.vanslyck at spamcop.net>
To: "Central OH Linux User Group - 432xx" <colug-432 at colug.net>
Date: Fri, 17 Dec 2010 17:48:36 -0500
Subject: Re: [colug-432] Data mining

> Thanks Rich. Here's something odd.
> 
> When I open the original HTML file which worked, and release all of the 
> links except the first one which had a session ID that displayed in 
> Firefox (unless I've gone mad again), THAT file works!
> 
> Creating a new one does not, but using the original does.
> 
> Does this reveal anything useful?
> 
> ----- Original Message -----
> From: Richard Hornsby <richardjhornsby at gmail.com>
> To: Central OH Linux User Group - 432xx <colug-432 at colug.net>
> Date: Fri, 17 Dec 2010 15:04:27 -0600
> Subject: Re: [colug-432] Data mining
> 
> > 
> > This type of web scraping was a major part of my job for about 5 years,
> > but I'm missing something also -- they're doing something weird and I
> > haven't quite nailed down what yet.  Going directly to the search results
> > page using GET variables works fine, no need to post the form.
> > 
> > http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436
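
For anyone scripting this, the GET form of the search can be built from the two query parameters visible in the URL above. This is only a minimal Python sketch: the names `BASE` and `search_url` are made up for illustration, and the explicit `:80` is dropped on the assumption that the default HTTP port behaves the same.

```python
from urllib.parse import urlencode

# Assumption: host and path copied from the URL quoted in this thread,
# with the redundant ":80" omitted.
BASE = "http://franklincountyoh.metacama.com/do"


def search_url(tax_district: str, parcel_nbr: str) -> str:
    """Build the search-results URL using GET variables (no form POST)."""
    # Keep both values as strings so leading zeros are preserved.
    query = urlencode({"taxDistrict": tax_district, "parcelNbr": parcel_nbr})
    return f"{BASE}/searchByParcelId?{query}"


if __name__ == "__main__":
    print(search_url("010", "008436"))
```

The values are passed as strings rather than integers because the district and parcel numbers carry leading zeros.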
> > 
> > On that page is what appears to be a standard HTML link to the taxinfo
> > page -- but the target TAXINFO script is looking for something - the
> > "stateful" things I can think of are: a sessionid of some sort in the
> > URL, a cookie, or a referrer.  I've tried all three and none are working.
> > The other thing I thought was that it was a wget UA problem, but that
> > wasn't it either.
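
One way to carry all of that state in a single client is to load the search page first, so that any cookie the server sets lands in a cookie jar, and then request the TAXINFO page through the same opener with the Referer and User-Agent headers set. The sketch below uses only the Python standard library. The function names are hypothetical, and nothing in this thread confirms that this sequence satisfies whatever the TAXINFO script checks.

```python
import http.cookiejar
import urllib.request


def make_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def taxinfo_request(taxinfo_url, referer, user_agent="Mozilla/5.0"):
    """Build a request that sends Referer and User-Agent headers."""
    return urllib.request.Request(
        taxinfo_url,
        headers={"Referer": referer, "User-Agent": user_agent},
    )


def fetch_taxinfo(search_url, taxinfo_url):
    """Visit the search page, then the TAXINFO page, in one cookie session.

    Assumption: the server ties the TAXINFO page to a cookie set by the
    search page. That is a guess, not something verified here.
    """
    opener, _jar = make_opener()
    with opener.open(search_url, timeout=30) as resp:
        resp.read()  # discard the body; the point is to collect cookies
    with opener.open(taxinfo_request(taxinfo_url, search_url), timeout=30) as resp:
        return resp.read()
```

Calling `fetch_taxinfo` with the search URL and the TAXINFO URL from this thread would make the two requests in order; whether the second one returns the real page is exactly the open question.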
> > 
> > wget --referer="http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436" \
> >   -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5" \
> >   --load-cookies cookies \
> >   "http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600"
> > 
> > 
> > At the moment I'm stumped and have to be missing something I should know
> > about.  I have to run to an event practice in a bit, but I'll keep
> > messing with it.  There is a way -- we just haven't found it yet.
> > 
> > 
> > 
> > On Dec 17, 2010, at 13:25 , Steve VanSlyck wrote:
> > 
> > > If you go to
> > > 
> > > http://franklincountyoh.metacama.com/do/searchByParcelId
> > > 
> > > then enter, for example, parcel no. 010-008436
> > > 
> > > you are taken to
> > > 
> > > 
> > > http://franklincountyoh.metacama.com/do/searchByParcelId;jsessionid=2C5A034D6CE2B1FF816F095759EF5347
> > > 
> > > (which contains what I assume to be a session ID).
> > > 
> > > If you then click on the link labeled "Tax/Payment Info" you are taken to
> > > 
> > > http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600
> > > 
> > > You can easily print that page to a PDF file if you like.
> > > 
> > > So all that works fine.
> > > 
> > > Yesterday, I was able to grab a large number of TAXINFO pages by
> > > (a) creating an HTML page on my desktop with a link to each of the parcels
> > > I was interested in, and
> > > (b) creating a PDF file using Acrobat to grab the page (and one level down).
> > > 
> > > It didn't work at first, so I then
> > > 
> > > (a) opened the page, grabbed the session ID link,
> > > (b) left the page open,
> > > (c) pasted the link into my local HTML page,
> > > (d) and then created the PDF file from Acrobat as before.
> > > 
> > > It worked like a dream. Well, last night anyway.
> > > 
> > > Today, however, it will not work at all. I've tried using the same session
> > > ID, using a fresh session ID, opening Acrobat first, opening Firefox (or
> > > IE) first, and so on.
> > > 
> > > I don't know enough of what's going on under the hood, however, to recreate
> > > my earlier success. I don't even know if the need for a session ID was
> > > the problem or what I did (whatever it was) that made it go.
> > > 
> > > Are there any Internet experts here that can help? I really don't want to
> > > download 200 pages manually.
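
Rather than maintaining 200 links by hand, the TAXINFO URLs could be generated from a list of parcel numbers. The sketch below assumes that `curpid` is the tax district, then the parcel number, then a literal "00". That pattern is inferred only from the single example in this message (parcel 010-008436 and curpid 01000843600) and may not hold for every parcel. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: host and path copied from the TAXINFO URL quoted in this thread.
BASE = "http://franklincountyoh.metacama.com/do"


def curpid_from_parcel(parcel: str, suffix: str = "00") -> str:
    """Turn 'DDD-NNNNNN' into a curpid value.

    Assumption: curpid = district + number + '00', guessed from one example.
    """
    district, sep, number = parcel.strip().partition("-")
    if not sep or not (district.isdigit() and number.isdigit()):
        raise ValueError(f"expected digits-digits, got {parcel!r}")
    return f"{district}{number}{suffix}"


def taxinfo_url(parcel: str) -> str:
    """Build the TAXINFO page URL for one parcel number."""
    query = urlencode({"select": "TAXINFO", "curpid": curpid_from_parcel(parcel)})
    return f"{BASE}/selectDisplay?{query}"


if __name__ == "__main__":
    # Replace with the real list of parcels of interest.
    for parcel in ["010-008436"]:
        print(taxinfo_url(parcel))
```

This only produces the URLs. Fetching them still runs into the session problem described in this thread.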
> > > 
> > > 
> > > _______________________________________________
> > > colug-432 mailing list
> > > colug-432 at colug.net
> > > http://lists.colug.net/mailman/listinfo/colug-432

