I believe that at least the first 4 or 5 items are easy to do. I'll have to experiment a bit on the 6th.
I'm out of town on business just now, but as soon as I get back, I'll post a sample command line that should do most of what you want.
Marty
-----Original Message-----
From: Greg Hazel [mailto:gah@...]
Sent: Sunday, October 19, 2003 2:48 PM
To: pavuk@yahoogroups.com
Subject: pavuk command line option questions
I have a fairly simple task to do: crawling a website for certain types of
links.
However, wget seems unable to do this efficiently (and, honestly, it segfaults
about 50% of the time on the --spider run). The command I worked up looked
something like:
    wget -r -A htm,html http://www.domain.com --output-file=baz;
    wget -r --spider http://www.domain.com --output-file=baz &&
      [complicated perl script to grep baz for the urls I wanted]
I've looked at the pavuk options for several hours now, and perhaps I'm missing
something that someone with more experience can point out.
I'll take the goals one step at a time, as it is a very simple concept with a
few preferred features that seem like they might be possible:
1. crawl a single website (no external domains) for all links to .foo or .bar
files
2. print these links to a file, or stdout
3. keep these links external
4. don't download the files, as they may be very large.
5. if the indexes or anything else need to be saved to disk (or if that improves
subsequent crawls of the same site) then that's ok.
6. download a specified range of each .foo .bar file (but always download all of
the indexes)
Is any or all of this possible? They're pretty much in order of importance.
Thanks in advance.
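
For what it's worth, goals 1 to 4 and 6 above can be sketched without pavuk or wget. The following is only a rough Python illustration using the standard library, not a pavuk recipe. The names LinkCollector, classify_links and range_request are made up for this example, and it assumes "no external domains" means that only pages on the starting host are crawled, while .foo/.bar links are reported wherever they point. Nothing here touches the network: the caller fetches each page and decides what to do with the results.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on one page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def classify_links(page_url, html, wanted=(".foo", ".bar")):
    """Split the links on one page into (targets, pages_to_crawl).

    targets: absolute URLs whose path ends in a wanted extension.
             They are only reported, never fetched (goals 2 to 4).
    pages_to_crawl: other links on the same host as page_url (goal 1).
    """
    collector = LinkCollector(page_url)
    collector.feed(html)
    host = urlparse(page_url).netloc
    targets, pages = [], []
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.path.lower().endswith(wanted):
            targets.append(link)
        elif parsed.netloc == host:
            pages.append(link)
    return targets, pages


def range_request(url, first_byte, last_byte):
    """Build, but do not send, a request for part of a file (goal 6).

    Assumption: the server honours the HTTP Range header; many do, not all.
    """
    return Request(url, headers={"Range": f"bytes={first_byte}-{last_byte}"})
```

A crawler built on this would fetch each entry in pages_to_crawl in turn, print the targets to stdout or a file, and pass range_request(...) to urllib.request.urlopen only for the files it wants a slice of.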