I have a question for the group (or the Pavuk authors).
I am considering using Pavuk for a project that would require some
modification to how it runs. I have only basic knowledge of C, and the
source code is a bit complex for me. If someone could point me to a few
things, I should be able to take it from there.
I want to spider a large list of websites (1000+) and then process the content.
1) I would like to fetch the list of sites to spider and the various
parameters (domains to skip, extensions, depth, etc.) from a database
instead of a text file. Where in the code would it be best to set this up
so that the appropriate internal variables are set for Pavuk?
2) I want to post-process the HTML that I get (no need to save it to
disk). I imagine that at some point all the HTML is in a variable that I
can easily access and then pass to a few functions of my own.
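To make the two points above more concrete, here is a rough sketch of what
I have in mind. None of these names come from Pavuk: the struct
`site_job` and the function `count_anchor_tags` are hypothetical, and I am
only assuming that the fetched page is available somewhere as a buffer
plus a length. The struct stands for one row I would load from my database,
and the function is the kind of thing I would like to call on the HTML
while it is still in memory.

```c
#include <stddef.h>

/* Hypothetical: one row loaded from my database.
 * This is NOT a Pavuk structure, just the settings I need per site. */
struct site_job {
    const char *start_url;        /* where spidering begins */
    const char *skip_domains;     /* comma-separated domains to skip */
    const char *skip_extensions;  /* comma-separated extensions to skip */
    int max_depth;                /* how deep to follow links */
};

/* Hypothetical post-processing hook: given a page held in memory,
 * count opening anchor tags ("<a" or "<A" followed by whitespace).
 * I am assuming I can get at the HTML as (buffer, length). */
size_t count_anchor_tags(const char *html, size_t len)
{
    size_t count = 0;
    size_t i;

    /* i + 2 < len keeps html[i + 2] inside the buffer */
    for (i = 0; i + 2 < len; i++) {
        char next = html[i + 2];
        if (html[i] == '<' &&
            (html[i + 1] == 'a' || html[i + 1] == 'A') &&
            (next == ' ' || next == '\t' || next == '\n' || next == '\r')) {
            count++;
        }
    }
    return count;
}
```

What I am really looking for is the place in Pavuk where I could fill in
its own settings from something like `site_job`, and the place where I
could hand the downloaded page to a function like this one.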
Anybody feel like helping out a newbie still learning his way around C??
Thanks!!
-N