pavuk · Pavuk Webgrabber Mailing List
Message #732 of 988
RE: pavuk command line option questions

I believe that at least the first 4 or 5 items are easy to do. I'll have to experiment a bit on the 6th.
 
I'm out of town on business just now, but as soon as I get back, I'll post a sample command line that should do most of what you want.
 
Marty
-----Original Message-----
From: Greg Hazel [mailto:gah@...]
Sent: Sunday, October 19, 2003 2:48 PM
To: pavuk@yahoogroups.com
Subject: pavuk command line option questions

I have a fairly simple task: crawling a website for certain types of links.
However, wget seems unable to do this efficiently (and, honestly, it segfaults
50% of the time on the --spider line). The command I worked up looked something
like:

wget -r -A htm,html http://www.domain.com --output-file=baz; wget -r --spider http://www.domain.com --output-file=baz && [complicated perl script to grep baz for the urls I wanted]

I've looked at the pavuk options for several hours now, and perhaps I'm missing
something that someone with more experience can answer:

I'll take the goals one step at a time, as it is a very simple concept, with a
few preferred features that seem like they might be possible:
1. crawl a single website (no external domains) for all links to .foo or .bar
files
2. print these links to a file, or stdout
3. keep these links external
4. don't download the files, as they may be very large.
5. if the indexes or anything else need to be saved to disk (or if that improves
subsequent crawls of the same site) then that's ok.
6. download a specified range of each .foo .bar file (but always download all of
the indexes)

Is any or all of this possible? They're pretty much in order of importance.

Thanks in advance.
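
Goals 1 through 4 above, plus the byte range in goal 6, can be illustrated with a small Python sketch. This is not a pavuk or wget command line; it is a generic, standard-library illustration of the filtering logic, and every name in it (LinkCollector, classify_links, range_header) is hypothetical. It assumes that "no external domains" means only pages on the starting host are followed, and that a link to a .foo or .bar file is recorded wherever it points.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html, wanted_exts=(".foo", ".bar")):
    """Split one page's links into (wanted file URLs, same-host pages to crawl)."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    wanted, to_crawl = [], []
    for href in collector.hrefs:
        url = urljoin(page_url, href)      # resolve relative links
        parsed = urlparse(url)
        if parsed.path.lower().endswith(wanted_exts):
            wanted.append(url)             # record the link, never fetch the file
        elif parsed.netloc == site:
            to_crawl.append(url)           # follow only pages on the same host
    return wanted, to_crawl


def range_header(first_byte, last_byte):
    """HTTP Range header requesting bytes first_byte..last_byte, inclusive."""
    return {"Range": f"bytes={first_byte}-{last_byte}"}
```

A crawler built on this would fetch each URL in to_crawl, call classify_links on the result, and print the wanted list to a file or stdout. For goal 6 it would send range_header(...) with the request for each wanted file, which only helps if the server honors range requests.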






Tue Oct 21, 2003 3:57 am

marty_fouts
Messages in this thread:
Greg Hazel (gah@...), Oct 20, 2003 8:07 am
Martin Fouts (marty_fouts), Nov 19, 2003 8:19 am
