[colug-432] Data mining
Steve VanSlyck
s.vanslyck at spamcop.net
Fri Dec 17 17:48:36 EST 2010
Thanks Rich. Here's something odd.
When I open the original HTML file which worked, and replace all of the
links except the first one which had a session ID that displayed in
Firefox (unless I've gone mad again), THAT file works!
Creating a new one does not, but using the original does.
Does this reveal anything useful?
----- Original Message -----
From: Richard Hornsby <richardjhornsby at gmail.com>
To: Central OH Linux User Group - 432xx <colug-432 at colug.net>
Date: Fri, 17 Dec 2010 15:04:27 -0600
Subject: Re: [colug-432] Data mining
>
> This type of web scraping was a major part of my job for about 5 years,
> but I'm missing something also -- they're doing something weird and I
> haven't quite nailed down what yet. Going directly to the search results
> page using GET variables works fine, no need to post the form.
>
> http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436
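
The direct-GET approach described above can be sketched as follows. This is a minimal illustration, assuming only the two query parameters visible in the URL (`taxDistrict` and `parcelNbr`); the helper name `search_url` is hypothetical, and the default port is used instead of the explicit `:80`.

```python
from urllib.parse import urlencode

# Base URL taken from the thread; the explicit ":80" is dropped because
# it is the default port for http.
SEARCH_BASE = "http://franklincountyoh.metacama.com/do/searchByParcelId"


def search_url(tax_district, parcel_nbr):
    """Build the search-results URL from the two GET variables.

    Both values are passed as strings so that leading zeros survive.
    (Hypothetical helper, not part of any site API.)
    """
    query = urlencode({"taxDistrict": tax_district, "parcelNbr": parcel_nbr})
    return f"{SEARCH_BASE}?{query}"


print(search_url("010", "008436"))
```

Keeping the district and parcel number as strings matters here, since converting them to integers would strip the leading zeros the site expects.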
>
> On that page is what appears to be a standard HTML link to the taxinfo
> page -- but the target TAXINFO script is looking for something - the
> "stateful" things I can think of are: a sessionid of some sort in the
> URL, a cookie, or a referrer. I've tried all three and none are working.
> The other thing I thought was that it was a wget UA problem, but that
> wasn't it either.
>
> wget --referer="http://franklincountyoh.metacama.com:80/do/searchByParcelId?taxDistrict=010&parcelNbr=008436" \
>   -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5" \
>   --load-cookies cookies \
>   "http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600"
>
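
One way to send all three "stateful" pieces from a single client is sketched below: a cookie jar shared across requests, plus explicit Referer and User-Agent headers. This is only a sketch of the kind of attempt described above, not a confirmed fix, since the thread reports that none of the three worked. The helper names (`make_opener`, `make_request`, `fetch_taxinfo`) are hypothetical, and the idea that the search page sets a session cookie is an untested assumption.

```python
import urllib.request
from http.cookiejar import CookieJar

# URLs and User-Agent string copied from the thread.
SEARCH_URL = ("http://franklincountyoh.metacama.com/do/searchByParcelId"
              "?taxDistrict=010&parcelNbr=008436")
TAXINFO_URL = ("http://franklincountyoh.metacama.com/do/selectDisplay"
               "?select=TAXINFO&curpid=01000843600")
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK) "
              "AppleWebKit/533.18.1 (KHTML, like Gecko) "
              "Version/5.0.2 Safari/533.18.5")


def make_opener(jar):
    """Return an opener that stores and resends cookies held in `jar`."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib/x.y" User-Agent.
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def make_request(url, referer=None):
    """Build a GET request, optionally carrying a Referer header."""
    req = urllib.request.Request(url)
    if referer is not None:
        req.add_header("Referer", referer)
    return req


def fetch_taxinfo(opener):
    """Visit the search page first, then request the TAXINFO page.

    ASSUMPTION: the first response sets whatever cookie the second
    request needs. Performs real network access when called.
    """
    opener.open(make_request(SEARCH_URL)).close()
    with opener.open(make_request(TAXINFO_URL, referer=SEARCH_URL)) as resp:
        return resp.read()
```

Calling `fetch_taxinfo(make_opener(CookieJar()))` performs both requests with the same jar, so any cookie set by the first response is sent with the second. Session cookies live only in memory here, which sidesteps the question of whether a saved cookie file contains them.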
>
> At the moment I'm stumped and have to be missing something I should know
> about. I have to run to an event practice in a bit, but I'll keep
> messing with it. There is a way -- we just haven't found it yet.
>
>
>
> On Dec 17, 2010, at 13:25 , Steve VanSlyck wrote:
>
> > If you go to
> >
> > http://franklincountyoh.metacama.com/do/searchByParcelId
> >
> > then enter, for example, parcel no. 010-008436
> >
> > you are taken to
> >
> > http://franklincountyoh.metacama.com/do/searchByParcelId;jsessionid=2C5A034D6CE2B1FF816F095759EF5347
> >
> > (which contains what I assume to be a session ID).
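
The `;jsessionid=...` suffix is the form Java servlet containers conventionally use to carry a session in the URL path when cookies are not relied on. A small sketch of pulling that ID out of one URL and attaching it to another follows. The helper names `extract_jsessionid` and `with_jsessionid` are hypothetical, and whether this site accepts a reused ID is not established in the thread.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches the path parameter, e.g. ";jsessionid=2C5A...".
_JSESSIONID = re.compile(r";jsessionid=([0-9A-Za-z.\-]+)", re.IGNORECASE)


def extract_jsessionid(url):
    """Return the session ID embedded in `url`, or None if absent."""
    match = _JSESSIONID.search(url)
    return match.group(1) if match else None


def with_jsessionid(url, session_id):
    """Return `url` with ";jsessionid=<id>" appended to its path.

    Any ID already in the path is replaced; the query string is kept
    after the path parameter.
    """
    parts = urlsplit(url)
    path = _JSESSIONID.sub("", parts.path) + ";jsessionid=" + session_id
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment))


sid = extract_jsessionid(
    "http://franklincountyoh.metacama.com/do/searchByParcelId"
    ";jsessionid=2C5A034D6CE2B1FF816F095759EF5347")
print(with_jsessionid(
    "http://franklincountyoh.metacama.com/do/selectDisplay"
    "?select=TAXINFO&curpid=01000843600", sid))
```

The path parameter goes before the `?`, so the rewritten TAXINFO link keeps its `select` and `curpid` query values unchanged.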
> >
> > If you then click on the link labeled "Tax/Payment Info" you are taken to
> >
> > http://franklincountyoh.metacama.com/do/selectDisplay?select=TAXINFO&curpid=01000843600
> > You can easily print that page to a PDF file if you like.
> >
> > So all that works fine.
> >
> > Yesterday, I was able to grab a large number of TAXINFO pages by
> > (a) creating an HTML page on my desktop with a link to each of the
> > parcels I was interested in, and
> > (b) creating a PDF file using Acrobat to grab the page and one level
> > down.
> >
> > It didn't work at first, so I then
> >
> > (a) opened the page, grabbed the session ID link,
> > (b) left the page open,
> > (c) pasted the link into my local HTML page,
> > (d) and then created the PDF file from Acrobat as before.
> >
> > It worked like a dream. Well, last night anyway.
> >
> > Today, however, it will not work at all. I've tried using the same
> > session ID, using a fresh session ID, opening Acrobat first, opening
> > Firefox (or IE) first, and so on.
> >
> > I don't know enough of what's going on under the hood, however, to
> > recreate my earlier success. I don't even know if the need for a
> > session ID was the problem or what I did (whatever it was) that made
> > it go.
> >
> > Are there any Internet experts here that can help? I really don't want
> > to download 200 pages manually.
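
For a batch of roughly 200 parcels, the TAXINFO links could be generated from a list of parcel numbers rather than typed by hand. The sketch below assumes, from the single example in this thread (parcel 010-008436 giving `curpid=01000843600`), that `curpid` is the district, then the parcel number, then a trailing `00`. That pattern is an inference from one data point and may not hold for every parcel. The function names are hypothetical, and the output is a plain page of links in the style of the local HTML page described above.

```python
from html import escape
from urllib.parse import urlencode

TAXINFO_BASE = "http://franklincountyoh.metacama.com/do/selectDisplay"


def curpid_from_parcel(parcel):
    """Turn '010-008436' into '01000843600'.

    ASSUMPTION: curpid = district + parcel number + '00', inferred from
    the one example in the thread. Raises ValueError if `parcel` does
    not contain exactly one hyphen.
    """
    district, number = parcel.split("-")
    return district + number + "00"


def taxinfo_url(parcel):
    """Build the TAXINFO URL for one parcel number."""
    query = urlencode({"select": "TAXINFO",
                       "curpid": curpid_from_parcel(parcel)})
    return f"{TAXINFO_BASE}?{query}"


def links_page(parcels):
    """Return a minimal HTML page with one link per parcel."""
    rows = [
        '<a href="{}">{}</a><br>'.format(escape(taxinfo_url(p), quote=True),
                                         escape(p))
        for p in parcels
    ]
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>\n"


print(links_page(["010-008436"]))
```

Generating the page only removes the typing; whether the server then serves each TAXINFO page still depends on the session behaviour discussed in this thread.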
> >
> >
> > _______________________________________________
> > colug-432 mailing list
> > colug-432 at colug.net
> > http://lists.colug.net/mailman/listinfo/colug-432