I'm working on a crawl where I only want pages that are subdirs of: http://www.someserver.com/dir1/dir2/dir3/dir4/ and http://www.someserver.com/dir5/dir6/ ...
I need to perform a somewhat unusual crawl in which I have many seeds (hundreds of thousands) all from the same host, and only want to extract pages under a...
... The link-to-crawl looks like "http://foo.org/testing/person%27s-stuff" in the hosting page? This should just pass through Heritrix untouched and result in...
... How many domains? How much delay? How many toethreads? Study the frontier and thread reports over time to figure where the crawler is spending time. ......
... Unfortunately, the only documentation in this case is the code itself. The code is hard to follow since it turns on some rather involved regexes. ... You...
Marco sent me his order files offlist and looking them over, I was reminded of this previously-discussed issue, where the syntax for SURT-prefix-files changed...
... In general 'embeds' (things necessary to render a page) are fetched ASAP after their containing page, before other already-waiting pages. But, this would...
... Eric, I think the solution to your problem is what Gordon Mohr suggested here: http://groups.yahoo.com/group/archive-crawler/message/2972 Regards, Frank...
Pardon me. I did not see your subsequent posting to the list where you talk about entity encodings and ExtractorHTML. So, yeah, the ExtractorHTML should...
Hi, ... force-added ... This is good enough for me. But I found that under the $HERITRIX_HOME/jobs/status directory the size of the jdb files is increasing so rapidly....
Hi, I need a way to track which seed (source) a URI came from. This information is written down in the crawl.log if I set the attribute 'source-tag-seeds'...
... The robots.txt standard only provides for root/host-level robots.txt files, so that's the only URI automatically checked by Heritrix. I suspect the pages...
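To illustrate the point above that only the root/host-level robots.txt is consulted, here is a minimal Python sketch that derives that one URI from any page URL. The function name `robots_txt_url` is hypothetical and is not part of Heritrix; it only shows that the path of the page plays no role.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the host-level robots.txt URL for page_url.

    Hypothetical helper for illustration only: it keeps the scheme and
    host (with port, if any) and discards the path, query and fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # A page several directories deep still maps to the root robots.txt.
    print(robots_txt_url("http://www.someserver.com/dir1/dir2/page.html"))
```

A robots.txt placed in a subdirectory, such as /dir1/robots.txt, would never be produced by this mapping, which matches the behaviour described in the message.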
Pardon me if this is a repeat email, I wasn't sure my previous one posted. I am having issues making this thing run on Windows. I have read the FAQ, made the...
You are correct, these errant robots.txt URLs are coming from speculative embeds (from JavaScript). I was fooled because I had a crawl finish with all of its...
Hello, is there a way to automatically repair broken links in downloaded files? Especially in the situation where we download only text files (html, js), all the...
... I do not know of such a tool. Would be a nice tool to have though. For example, it could be used to make the DVD of archived content that a poster from a...
... The jdb logs are fundamental to the crawler. They contain all of the crawler state. You might be able to tune the background bdbje cleaner thread so it does...
... This class does not make for easy reading. ... Looking at code, this looks hard. The content that gets written to ARCs is 'recorded' by wrapping the apache...
Ya, I can see now that it is not such an easy task, since the apache commons httpclient has no way to inject a new header in the response. And heritrix just wrap...
It seems like there isn't a branch for the 1.8 release -- or am I missing something? The highest branch is heritrix-1_6, and the highest version is ...
Hello: I sent a somewhat similar email to this list yesterday that seems to have been blocked due to the attached patches? Ignore that one if it ends up making...
Hello, sorry for the newbie question, but is it possible to configure a filter to accept only sites with a specific encoding, like ISO-8859-1 etc.? Thank you, Martin...
... Not really. Encoding, if specified, is done on a page by page basis in the HTTP response header and/or in HEAD of the HTML page. You could set a...
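As a rough illustration of the two places mentioned above where a page can declare its encoding, the following Python sketch checks the HTTP Content-Type header first and then a meta tag near the top of the HTML. The name `declared_charset` and the 4096-byte scan window are assumptions made for this example; this is not a Heritrix filter or API.

```python
import re
from typing import Optional

# charset parameter in a Content-Type header, e.g. "text/html; charset=ISO-8859-1"
_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
# charset declared inside a <meta ...> tag in the HTML HEAD
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Return the lower-cased declared charset of one page, or None.

    Hypothetical helper: the header wins over the meta tag, and only the
    first 4096 bytes of the body are scanned (an arbitrary choice).
    """
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    match = _META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


if __name__ == "__main__":
    print(declared_charset("text/html; charset=ISO-8859-1", b""))
```

Because the declaration is per page and may be absent entirely, a check like this can only be applied after a page has been fetched, not to a site as a whole.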