[colug-432] Data mining
Steve VanSlyck
s.vanslyck at spamcop.net
Fri Dec 17 17:50:29 EST 2010
REPLACE all of the links instead of the first one....
----- Original Message -----
From: "Steve VanSlyck" <s.vanslyck at spamcop.net>
To: "Central OH Linux User Group - 432xx" <colug-432 at colug.net>
Date: Fri, 17 Dec 2010 17:48:36 -0500
Subject: Re: [colug-432] Data mining
> Thanks Rich. Here's something odd.
>
> When I open the original HTML file which worked, and release all of the
> links except the first one which had a session ID that displayed in
> Firefox (unless I've gone mad again), THAT file works!
>
> Creating a new one does not, but using the original does.
>
> Does this reveal anything useful?
>
> ----- Original Message -----
> From: Richard Hornsby <richardjhornsby at gmail.com>
> To: Central OH Linux User Group - 432xx <colug-432 at colug.net>
> Date: Fri, 17 Dec 2010 15:04:27 -0600
> Subject: Re: [colug-432] Data mining
>
> >
> > This type of web scraping was a major part of my job for about 5 years,
> > but I'm missing something also -- they're doing something weird and I
> > haven't quite nailed down what yet. Going directly to the search results
> > page using GET variables works fine, no need to post the form.
> >
> > http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436
> >
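For anyone who wants to build that GET URL for a whole list of parcels, here is a minimal sketch. The helper name search_url is made up for illustration, and it assumes every parcel number has the same district-dash-number shape as the one example in this thread (010-008436).

```python
from urllib.parse import urlencode

# Base URL copied from the message above, including the explicit :80 port.
SEARCH_BASE = "http://franklincountyoh.metacama.com:80/do/searchByParcelId"


def search_url(parcel: str) -> str:
    """Build the GET search URL for a parcel written like '010-008436'.

    ASSUMPTION: the digits before the dash are the tax district and the
    digits after it are the parcel number, as in the thread's one example.
    """
    district, sep, number = parcel.strip().partition("-")
    if not sep or not district.isdigit() or not number.isdigit():
        raise ValueError(f"expected DISTRICT-NUMBER, got {parcel!r}")
    # urlencode keeps the dict's insertion order: taxDistrict, then parcelNbr.
    query = urlencode({"taxDistrict": district, "parcelNbr": number})
    return f"{SEARCH_BASE}?{query}"


if __name__ == "__main__":
    print(search_url("010-008436"))
```

Rejecting anything without a dash is deliberate: a silently malformed URL would be harder to spot in a batch of 200 than an error up front.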
> > On that page is what appears to be a standard HTML link to the taxinfo
> > page -- but the target TAXINFO script is looking for something - the
> > "stateful" things I can think of are: a sessionid of some sort in the
> > URL, a cookie, or a referrer. I've tried all three and none are working.
> > The other thing I thought was that it was a wget UA problem, but that
> > wasn't it either.
> >
> > wget --referer="http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436" \
> >   -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5" \
> >   --load-cookies cookies \
> >   "http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600"
> >
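The same three pieces (referer, user agent, cookies) can be sketched in Python with only the standard library. This is not a claim that it gets past whatever the site is checking; it just puts the moving parts in one place. One difference from the wget line: instead of loading cookies from a file, it visits the search page first in the same session so that any cookie the server sets is carried along. The function names are made up for illustration.

```python
import http.cookiejar
import urllib.request

# URLs and user-agent string copied from the wget command above.
SEARCH = ("http://franklincountyoh.metacama.com:80/do/searchByParcelId"
          "?taxDistrict=010&parcelNbr=008436")
TAXINFO = ("http://franklincountyoh.metacama.com/do/selectDisplay"
           "?select=TAXINFO&curpid=01000843600")
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) "
              "AppleWebKit/533.18.1 (KHTML, like Gecko) "
              "Version/5.0.2 Safari/533.18.5")


def make_opener():
    """Return (opener, jar); the jar collects cookies from every response."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def taxinfo_request(referer: str = SEARCH) -> urllib.request.Request:
    """Build (but do not send) the TAXINFO request with Referer and UA set."""
    return urllib.request.Request(
        TAXINFO, headers={"Referer": referer, "User-Agent": USER_AGENT})


def fetch_taxinfo() -> bytes:
    """Hit the search page first, then the TAXINFO page, sharing cookies."""
    opener, _jar = make_opener()
    search = urllib.request.Request(SEARCH, headers={"User-Agent": USER_AGENT})
    with opener.open(search, timeout=30) as resp:
        resp.read()  # body is discarded; only the cookies matter here
    with opener.open(taxinfo_request(), timeout=30) as resp:
        return resp.read()
```

Calling fetch_taxinfo() performs two real HTTP requests, so it is left out of any automatic run here.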
> >
> > At the moment I'm stumped, and I have to be missing something I should
> > know about. I have to run to an event practice in a bit, but I'll keep
> > messing with it. There is a way -- we just haven't found it yet.
> >
> >
> >
> > On Dec 17, 2010, at 13:25 , Steve VanSlyck wrote:
> >
> > > If you go to
> > >
> > > http://franklincountyoh.metacama.com/do/searchByParcelId
> > >
> > > then enter, for example, parcel no. 010-008436
> > >
> > > you are taken to
> > >
> > > http://franklincountyoh.metacama.com/do/searchByParcelId;jsessionid=2C5A034D6CE2B1FF816F095759EF5347
> > >
> > > (which contains what I assume to be a session ID).
> > >
> > > If you then click on the link labeled "Tax/Payment Info" you are taken
> > > to
> > >
> > > http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600
> > >
> > > You can easily print that page to a PDF file if you like.
> > >
> > > So all that works fine.
> > >
> > > Yesterday, I was able to grab a large number of TAXINFO pages by
> > > (a) creating an HTML page on my desktop with a link to each of the
> > > parcels I was interested in, and
> > > (b) creating a PDF file using Acrobat (to grab the page and one level
> > > down).
> > >
> > > It didn't work at first, so I then
> > >
> > > (a) opened the page, grabbed the session ID link,
> > > (b) left the page open,
> > > (c) pasted the link into my local HTML page,
> > > (d) and then created the PDF file from Acrobat as before.
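The local page of links described in these steps could be generated rather than typed by hand. A rough sketch follows, with two loud assumptions. First, the curpid value is guessed to be the district and parcel digits followed by "00", which is inferred from the single example in this thread (010-008436 and 01000843600) and may not hold for other parcels. Second, the names curpid and links_page are made up for illustration. The session link, if given, goes first, matching the order in the steps above.

```python
from html import escape

# Base URL copied from the TAXINFO link quoted earlier in the thread.
TAXINFO_BASE = "http://franklincountyoh.metacama.com/do/selectDisplay"


def curpid(parcel: str) -> str:
    """Turn '010-008436' into '01000843600'.

    ASSUMPTION: curpid is the parcel's digits with '00' appended. This is
    inferred from one example only and may be wrong for other parcels.
    """
    digits = parcel.strip().replace("-", "")
    if not digits.isdigit():
        raise ValueError(f"unexpected parcel number: {parcel!r}")
    return digits + "00"


def links_page(parcels, session_link=None) -> str:
    """Return an HTML page with one TAXINFO link per parcel.

    If session_link is given (the ;jsessionid=... URL copied from the
    browser), it is placed before the parcel links.
    """
    lines = ["<html><body>"]
    if session_link:
        lines.append(f'<p><a href="{escape(session_link)}">session</a></p>')
    for parcel in parcels:
        url = f"{TAXINFO_BASE}?select=TAXINFO&curpid={curpid(parcel)}"
        # escape() turns & into &amp; so the href is valid HTML.
        lines.append(f'<p><a href="{escape(url)}">{escape(parcel)}</a></p>')
    lines.append("</body></html>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(links_page(["010-008436"]))
```

Redirecting the printed output to a file gives a page to point Acrobat at. Whether the session link still matters is exactly the open question in this thread, so it is left optional.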
> > >
> > > It worked like a dream. Well, last night anyway.
> > >
> > > Today, however, it will not work at all. I've tried using the same
> > > session ID, using a fresh session ID, opening Acrobat first, opening
> > > Firefox (or IE) first, and so on.
> > >
> > > I don't know enough about what's going on under the hood, however, to
> > > recreate my earlier success. I don't even know whether the need for a
> > > session ID was the problem, or what I did (whatever it was) that made
> > > it go.
> > >
> > > Are there any Internet experts here who can help? I really don't want
> > > to download 200 pages manually.
> > >
> > >
> > > _______________________________________________
> > > colug-432 mailing list
> > > colug-432 at colug.net
> > > http://lists.colug.net/mailman/listinfo/colug-432