Sorry for asking a question that isn't technical. Accessing the Yahoo group with a web browser is a little slow here, and the RSS reader can only read...
I've written a frontier module with persistent state. Both the current frontier state (queued URIs, statistics, etc.) and the set of successfully fetched URIs...
... Take a look at the DomainSensitiveFrontier: http://crawler.archive.org/xref/org/archive/crawler/frontier/DomainSensitiveFrontier.html. Here is from its...
Am I understanding correctly that the way to use this is to set up a regex to match on content types you want to exclude, not to match the ones you want to...
... Depends on where you're trying to set the filter. Here's the user manual paragraph on the ContentTypeRegExpFilter if content-type filtering is your thing: ...
Hello everyone. As some of you know, I've been working on a new Heritrix module (or rather modules) that would allow for iterative crawling, which adjusts its...
I had found the user manual pages, but not the FAQ question about it. So at the midfetch stage I want to set it to return true for the content-types I want to...
I think the URL for the FAQ just got a little messed up, try: http://crawler.archive.org/faq.html#midfetch The Q is: " I only want to download text/html and...
Here's how I do this for my latest crawls: 1. Add the ContentTypeRegExpFilter as a midfetch-filter and write-processor Archiver filter. 2. In the settings for...
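The "match what to keep" versus "match what to exclude" question raised in this thread can be illustrated with plain regular expressions. This is only a sketch of the two regex styles; it does not reproduce how Heritrix's filter interprets a match, and the pattern strings and function names are hypothetical.

```python
import re

# Hypothetical pattern that matches the content types we want to KEEP.
KEEP = re.compile(r"^text/html\b.*", re.IGNORECASE)

# Hypothetical inverted pattern: matches every content type that is NOT
# text/html, using a negative lookahead anchored at the start.
EXCLUDE = re.compile(r"^(?!text/html\b).*", re.IGNORECASE)


def keep_by_positive(content_type: str) -> bool:
    """Keep the document when the 'keep' pattern matches."""
    return KEEP.match(content_type) is not None


def keep_by_negative(content_type: str) -> bool:
    """Keep the document when the 'exclude' pattern does NOT match."""
    return EXCLUDE.match(content_type) is None


if __name__ == "__main__":
    for ct in ("text/html", "text/html; charset=UTF-8", "image/gif"):
        print(ct, keep_by_positive(ct), keep_by_negative(ct))
```

Both functions give the same answer for any input; which form a given filter needs depends on whether a match means "accept" or "reject" at the place where the filter is installed.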
Just tried this out. It seems that the crawl never "finishes". Tried it on a single URL, limiting it to 10 docs for that site. Once it reaches 10, it starts...
Hi, I am trying to use this crawler to build a search engine. Has anyone had any experience with this, or encountered any problems? Please share your experiences. What...
I am just curious if there is a way to multithread Heritrix when crawling a single site. Also there was a post about embedding heritrix without having to call ...
... Not reliably (there is the 'valence' feature on Frontier, but it's problematic even after help from members of this list). ... There's been some progress....
... We've done a few experiments trying to make ARC files, the default product of a Heritrix crawl, searchable. Mostly this has consisted of trying to get ARC...
... The way that the DomainSensitiveFrontier works is that when it hits the max-docs-for-this-domain threshold, it marks all remaining queued URLs as ...
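The idea of a per-domain document cap can be sketched as a simple counter. This is not the DomainSensitiveFrontier implementation: the class and method names are hypothetical, the host name stands in for the "domain", and the sketch simply refuses new URLs once the cap is reached rather than marking queued ones.

```python
from collections import Counter
from urllib.parse import urlsplit


class PerHostBudget:
    """Hypothetical helper: caps the number of fetched documents per host."""

    def __init__(self, max_docs: int) -> None:
        self.max_docs = max_docs
        self.fetched = Counter()  # host -> number of documents fetched

    @staticmethod
    def host_of(url: str) -> str:
        # Assumption: the lowercased host name approximates the "domain".
        return (urlsplit(url).hostname or "").lower()

    def record_fetch(self, url: str) -> None:
        """Count one successfully fetched document against its host."""
        self.fetched[self.host_of(url)] += 1

    def may_schedule(self, url: str) -> bool:
        """Return False once the host has used up its budget."""
        return self.fetched[self.host_of(url)] < self.max_docs


if __name__ == "__main__":
    budget = PerHostBudget(max_docs=2)
    budget.record_fetch("http://example.com/a")
    budget.record_fetch("http://example.com/b")
    print(budget.may_schedule("http://example.com/c"))
    print(budget.may_schedule("http://other.example.org/"))
```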
This is kind of a general crawl question, not specific to heritrix, but I figured this was as good a place as any to ask it. Basically, when you're running a...
Hey, I did reply to this shortly before I left for the weekend, but it seems that the reply didn't make it (hope this one does). And yeah, it doesn't work...
In general, yes. I've been managing a crawl of the entire .is TLD, and while most domains require no special attention, about 3-5% have various crawler traps....
Thanks, Tom. FYI, you have "au" as one of your extensions, which causes problems if you are crawling any sites in Australia (.au domain). Caused me some ...
... I'm surprised this is being applied against the domains; the filter should only be applied against the end of a fully qualified URL, so even a TLD should...
It surprised me too, but I was feeding it a .au domain, and it wouldn't crawl anything until I removed au from the exclusion list. Here's the crawl log entry: ...
Tom is right that as part of any scheduleable HTTP (or other recognizably hierarchical) URI, there will be a '/' after the hostname portion, preventing a tail...
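The point about the '/' after the hostname can be checked with a small tail-match test. The extension list and pattern below are hypothetical stand-ins, not Tom's actual regex, and the sketch does not claim to explain the crawl log entry quoted in the thread; it only shows when a tail match on "au" can and cannot hit a .au hostname.

```python
import re

# Hypothetical exclusion pattern: reject URIs whose tail is one of these
# file extensions ("au" here meaning the audio format).
EXTENSION_TAIL = re.compile(r".*\.(au|gif|jpg)$", re.IGNORECASE)


def excluded(uri: str) -> bool:
    """True when the URI ends in one of the excluded extensions."""
    return EXTENSION_TAIL.match(uri) is not None


if __name__ == "__main__":
    for uri in (
        "http://www.example.com.au",           # bare host, no trailing '/'
        "http://www.example.com.au/",          # host followed by '/'
        "http://www.example.com.au/sound.au",  # genuine .au file
    ):
        print(uri, excluded(uri))
```

With the trailing '/' present the hostname's ".au" is no longer at the end of the string, so only the real ".au" file is excluded; a string that stops at the hostname is matched by the same pattern.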
Hi everyone, I just updated from CVS HEAD and ran Heritrix on Windows 2003. When I click 'New job based on it' to create a job, Heritrix tells me: An error...
ansi
mymaillist@...
Dec 8, 2004 8:01 am
1260
I take it this was working for you before? The only relevant change I can see has been made lately is to ...
This is the first time I have run Heritrix 1.3 on Windows. The 1.6 version of JobConfigureUtils.java doesn't work either. After adding Profiles/seeds.txt and...
ansi
mymaillist@...
Dec 8, 2004 9:47 am
1262
... Yeah, I think so (the path that's being complained about is a Windows path with '\' separators; CLASSPATH paths use '/' separators). I added path...
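The separator mismatch described here can be illustrated with a small conversion helper. This is a generic sketch, not the change that was made to Heritrix; the function name and the sample paths are hypothetical.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Rewrite a Windows-style path so it uses '/' separators."""
    # PureWindowsPath parses '\' separators on any platform;
    # as_posix() renders the same path with '/' separators.
    return PureWindowsPath(path).as_posix()


if __name__ == "__main__":
    print(to_forward_slashes(r"profiles\default\order.xml"))
    print(to_forward_slashes(r"C:\heritrix\conf\profiles"))
```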
FYI, the crawler project wiki, at... http://crawler.archive.org/cgi-bin/wiki.pl ...has been upgraded to the 1.0 version of the UseMod wiki software. (It was...