I have a fairly simple task: crawling a website for certain types of links.
However, wget seems unable to do this efficiently (and honestly segfaults 50% of
the time on the --spider line). The command I worked up looked something like:

    wget -r -A htm,html http://www.domain.com --output-file=baz
    wget -r --spider http://www.domain.com --output-file=baz && [complicated perl script to grep baz for the urls I wanted]

I've looked at the pavuk options for several hours now, and perhaps I'm missing
something that someone with more experience can point out.
I'll take the goals one step at a time, as it is a very simple concept, with a few
preferred features that seem like they might be possible:
1. crawl a single website (no external domains) for all links to .foo or .bar
files
2. print these links to a file, or stdout
3. keep these links external
4. don't download the files, as they may be very large.
5. if the indexes or anything else need to be saved to disk (or if that improves
subsequent crawls of the same site) then that's ok.
6. download a specified range of each .foo .bar file (but always download all of
the indexes)
Is any or all of this possible? They're pretty much in order of importance.
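To make goals 1 through 4 and goal 6 concrete, here is a minimal Python sketch of the behaviour I'm after. It is not a pavuk or wget recipe, only an illustration using the standard library. The names split_links, crawl and range_request are made up for this example, .foo/.bar are the placeholder extensions from the list, and I'm assuming that "keep these links external" means recording the original absolute URLs without rewriting them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

TARGET_EXTS = (".foo", ".bar")  # placeholder extensions from the goals above


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(base_url, html):
    """Return (same-host pages to crawl, target file URLs) for one page."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    pages, targets = [], []
    for href in parser.hrefs:
        url = urljoin(base_url, href).split("#")[0]  # absolute, no fragment
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue
        if parts.path.lower().endswith(TARGET_EXTS):
            targets.append(url)   # goals 3 and 4: record the URL, never fetch it
        elif parts.netloc == host:
            pages.append(url)     # goal 1: only follow pages on the same host
    return pages, targets


def crawl(start_url, max_pages=100):
    """Breadth-first crawl of one host; returns the target URLs found."""
    seen, queue, found, fetched = {start_url}, [start_url], {}, 0
    while queue and fetched < max_pages:
        url = queue.pop(0)
        fetched += 1
        try:
            with urlopen(url, timeout=10) as resp:
                if "html" not in resp.headers.get("Content-Type", ""):
                    continue  # not an index page, nothing to parse
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable page: skip it and keep crawling
        pages, targets = split_links(url, html)
        for target in targets:
            found[target] = True  # dict keeps first-seen order, drops repeats
        for page in pages:
            if page not in seen:
                seen.add(page)
                queue.append(page)
    return list(found)


def range_request(url, first, last):
    """Goal 6: build a request for bytes first..last (inclusive) of url."""
    return Request(url, headers={"Range": f"bytes={first}-{last}"})
```

Printing each entry of crawl("http://www.domain.com/") would cover goal 2, and passing range_request(url, 0, 1023) to urlopen would ask for only the first 1024 bytes of a file, assuming the server honours Range headers. What I'm hoping for is a way to get the same result from pavuk itself.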
Thanks in advance.