archive-crawler
Messages 2999 - 3029 of 6147
2999
I'm working on a crawl where I only want pages that are subdirs of: http://www.someserver.com/dir1/dir2/dir3/dir4/ and http://www.someserver.com/dir5/dir6/ ...
Eric
mar1ow2003
Jul 3, 2006
2:56 am
3000
I need to perform a somewhat unusual crawl in which I have many seeds (hundreds of thousands), all from the same host, and only want to extract pages under a...
mar1ow2003
Jul 3, 2006
4:01 am
3001
... The link-to-crawl looks like "http://foo.org/testing/person%27s-stuff" in the hosting page? This should just pass through Heritrix untouched and result in...
Michael Stack
stackarchiveorg
Jul 3, 2006
10:09 pm
3002
... It was using numeric character entities, so the quote was really &#039; in the HTML page. Will....
Will Sargent
will_sargent
Jul 3, 2006
10:38 pm
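
A minimal sketch of the entity issue discussed in messages 3001 and 3002, using only the Python standard library. It is not Heritrix's ExtractorHTML code, and the function name `normalize_href` is hypothetical. It shows how an href containing the numeric character entity `&#039;` can be decoded to a literal quote and then percent-encoded to `%27`:

```python
import html
from urllib.parse import quote


def normalize_href(raw_href: str) -> str:
    """Decode HTML character references in an href, then percent-encode it.

    Hypothetical helper for illustration only; not part of Heritrix.
    """
    # "&#039;" (and named entities such as "&amp;") become literal characters.
    decoded = html.unescape(raw_href)
    # Leave common URL delimiters and existing %-escapes alone,
    # and escape other characters such as the single quote.
    return quote(decoded, safe="/:?=&#%")


print(normalize_href("http://foo.org/testing/person&#039;s-stuff"))
```

Keeping `%` in the safe set means a link that is already escaped is not escaped a second time.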
3003
... How many domains? How much delay? How many toe threads? Study the frontier and thread reports over time to figure out where the crawler is spending time. ......
Michael Stack
stackarchiveorg
Jul 3, 2006
10:44 pm
3004
... Unfortunately, the only documentation in this case is the code itself. The code is hard to follow since it turns on some rather involved regexes. ... You...
Michael Stack
stackarchiveorg
Jul 3, 2006
11:24 pm
3005
Marco sent me his order files offlist and looking them over, I was reminded of this previously-discussed issue, where the syntax for SURT-prefix-files changed...
Gordon Mohr
gojomo
Jul 3, 2006
11:38 pm
3006
... Start with the developer's manual. It's a little dated but tells a good story -- with pictures! -- about how processors work. (Extractors are a ...
Michael Stack
stackarchiveorg
Jul 3, 2006
11:41 pm
3007
... In general 'embeds' (things necessary to render a page) are fetched ASAP after their containing page, before other already-waiting pages. But, this would...
Gordon Mohr
gojomo
Jul 4, 2006
12:21 am
3008
... Eric, I think the solution to your problem is what Gordon Mohr suggested here: http://groups.yahoo.com/group/archive-crawler/message/2972 Regards, Frank...
Frank McCown
mccownf
Jul 5, 2006
2:59 pm
3009
Pardon me. I did not see your subsequent posting to the list where you talk about entity encodings and ExtractorHTML. So, yeah, the ExtractorHTML should...
Michael Stack
stackarchiveorg
Jul 5, 2006
6:13 pm
3010
Hi, ... force-added ... This is good enough for me. But I found that under the $HERITRIX_HOME/jobs/status directory the size of the jdb files is increasing so rapidly....
callforshadab
Jul 6, 2006
5:16 am
3011
Hi, I need a way to track which seed (source) a URI came from. This information is written to the crawl.log if I set the attribute 'source-tag-seeds'...
barabasy69
Jul 6, 2006
7:50 pm
3012
In examining my crawl logs, I find that Heritrix is trying to download robots.txt from every directory I access on the server, i.e. ...
Eric
mar1ow2003
Jul 6, 2006
10:11 pm
3013
... The robots.txt standard only provides for root/host-level robots.txt files, so that's the only URI automatically checked by Heritrix. I suspect the pages...
Gordon Mohr
gojomo
Jul 6, 2006
10:17 pm
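
To illustrate the point in message 3013 that robots.txt is defined only at the root/host level, here is a small sketch in Python (standard library only). The helper name `robots_txt_url` is hypothetical and this is not how Heritrix itself computes the URI. It maps any page URL to the single host-level robots.txt location:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the host-level robots.txt URL for any URL on that host.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep scheme and host[:port]; drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.someserver.com/dir1/dir2/page.html?x=1"))
```

Under this rule, a request for something like /dir1/dir2/robots.txt would not come from a robots check; it would have to come from a discovered or guessed link.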
3014
Pardon me if this is a repeat email; I wasn't sure my previous one posted. I am having issues with making this thing run on Windows. I have read the FAQ, made the...
molzbh
Jul 7, 2006
2:33 am
3015
You are correct, these errant robots.txt URLs are coming from speculative embeds (from JavaScript). I was fooled because I had a crawl finish with all of its...
Eric
mar1ow2003
Jul 7, 2006
3:56 am
3016
Hello, is there a way to automatically repair broken links in downloaded files? Especially in a situation where we download only text files (html, js), all the...
martin.benuska
Jul 7, 2006
8:02 am
3018
... I do not know of such a tool. Would be a nice tool to have though. For example, it could be used to make the DVD of archived content that a poster from a...
Michael Stack
stackarchiveorg
Jul 7, 2006
4:30 pm
3019
Try the prescription from this posting: http://groups.yahoo.com/group/archive-crawler/message/2816 (Here is the original report: ...
Michael Stack
stackarchiveorg
Jul 7, 2006
5:26 pm
3020
... The jdb logs are fundamental to the crawler. They contain all of the crawler state. You might be able to tune the background bdbje cleaner thread so it does...
Michael Stack
stackarchiveorg
Jul 7, 2006
5:49 pm
3021
... This class does not make for easy reading. ... Looking at the code, this looks hard. The content that gets written to ARCs is 'recorded' by wrapping the apache...
Michael Stack
stackarchiveorg
Jul 7, 2006
5:56 pm
3022
Yeah, I can see now that it is not such an easy task, since the Apache Commons HttpClient has no way to inject a new header into the response. And Heritrix just wrap...
barabasy69
Jul 7, 2006
10:15 pm
3023
barabasy69 wrote: ... Unfortunately, the set of columns stored on the metadata line is immutable (See here for more on the ARC format: ...
Michael Stack
stackarchiveorg
Jul 7, 2006
11:43 pm
3024
Thanks! Works like a charm....
molzbh
Jul 8, 2006
1:22 am
3025
It seems like there isn't a branch for the 1.8 release -- or am I missing something? The highest branch is heritrix-1_6, and the highest version is ...
Yousef Ourabi
yousef_ourabi
Jul 9, 2006
1:09 am
3026
Hello: I sent a somewhat similar email to this list yesterday that seems to have been blocked due to the attached patches? Ignore that one if it ends up making...
Yousef Ourabi
yousef_ourabi
Jul 9, 2006
9:27 pm
3027
Hello, sorry for the newbie question, but is it possible to configure a filter to accept only sites with a specific encoding, like ISO-8859-1, etc.? Thank you, Martin...
martin.benuska
Jul 10, 2006
2:28 pm
3028
... Not really. Encoding, if specified, is done on a page-by-page basis in the HTTP response header and/or in the HEAD of the HTML page. You could set a...
Michael Stack
stackarchiveorg
Jul 10, 2006
4:15 pm
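
As a rough illustration of the point in message 3028 that encoding is declared per page, this Python sketch (standard library only) reads the charset from an HTTP `Content-Type` response header value. The function name `charset_from_content_type` is hypothetical. It is not a Heritrix filter and it does not inspect a META tag in the HTML HEAD:

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value.

    Hypothetical helper for illustration only. Returns the charset in
    lower case, or None when the header does not declare one.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

Because the header is only known after a page is fetched, a check like this could only be applied per response, not per site in advance.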
3029
... For 1517693, (and admittedly this is a hack), I used org.htmlparser.util.Translate. You may be able to use the code there as a reference. ...
Will Sargent
will_sargent
Jul 10, 2006
4:42 pm