... Hi Gordon, In fact, what you said is happening. Only the lower level of the hierarchy (that corresponds to http://www.zzz.com.br) contains any ...
... This should work, if there are no stray characters in any of the override segments and the filter regular expressions and settings are correct. What kind of...
... Ahh...interesting idea Gordon. I like it. I'm still not sure how large of a problem this will be for us, but I'll pursue something along these lines if...
... Hi Gordon, Sorry for only answering now. Last week was so busy. I looked for stray characters in all of the override segments and I didn't find any....
... Again, this really should work, so if you have a compact test case you can forward to demonstrate the problem, we'll work on it as a bug. It shouldn't be...
Hi, I hope a generous person can point me in the right direction. I have already done a number of crawls using Heritrix. I want to know if 6KB/s is an acceptable...
... 6KB/s seems low. How many threads are you running? Are they all occupied all the time? Has the crawl just started or is the 6KB/s a measure taken after...
All set -- creating a separate surt prefix file worked like a charm! -Adam ... org.archive.crawler.scope.SeedFileIterator.transform(SeedFileIterator.java:90) ...
... The warning is harmless, generated when scanning the seeds for plain URLs. Specifying the SURT inline with the "+" notation generates the warning but also...
Thanks Gordon- I think I must still be missing something, though. I'm currently using the SURT: http://(com,example,www,)/ This appears in my SURT dump and...
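For readers unfamiliar with the SURT notation used in this thread, here is a minimal sketch of how a URL maps to a form like http://(com,example,www,)/ and how a prefix comparison then decides scope. It is a simplification and not Heritrix's own implementation: the function names to_surt and in_scope are hypothetical, and ports, userinfo, and URL canonicalization are ignored.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT form: reverse the host labels and wrap them in parentheses.

    Assumption: only scheme, host, and path are handled.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # "www.example.com" -> "com,example,www," (note the trailing comma)
    reversed_host = ",".join(reversed(host.split("."))) + ","
    path = parts.path or "/"
    return f"{parts.scheme}://({reversed_host}){path}"


def in_scope(url: str, surt_prefixes: list[str]) -> bool:
    """A URL is treated as in scope if its SURT form starts with any listed prefix."""
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)


prefixes = ["http://(com,example,www,)/"]
print(to_surt("http://www.example.com/"))  # http://(com,example,www,)/
print(in_scope("http://www.example.com/a/b.html", prefixes))
print(in_scope("http://images.example.com/x.gif", prefixes))
```

Because the trailing comma closes the host, the prefix above matches only www.example.com; a shorter prefix such as http://(com,example, would also match its subdomains under this sketch.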
... It could be transcluded (if it were referred to by certain short chains of hops from other URLs already in-scope), but removing the acceptIfTranscluded...
Working great now Gordon. I was confused about how the decide rules worked, thinking a reject from any rule would negate that URL. I only realized my...
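The misunderstanding described here, that a reject from any rule negates a URL, can be contrasted with a sequence in which each rule may ACCEPT, REJECT, or PASS and the last rule to express an opinion wins. The sketch below illustrates that evaluation order only; it assumes this is how the decide-rule sequence behaves, and the rule functions and the decide helper are hypothetical, not Heritrix classes.

```python
ACCEPT, REJECT, PASS = "ACCEPT", "REJECT", "PASS"


def decide(url, rules, default=PASS):
    """Apply every rule in order; the last non-PASS outcome is the final decision."""
    decision = default
    for rule in rules:
        outcome = rule(url)
        if outcome != PASS:
            decision = outcome  # a later rule may overturn an earlier one
    return decision


# Hypothetical rules, for illustration only.
def reject_everything(url):
    return REJECT


def accept_example_host(url):
    return ACCEPT if "://www.example.com/" in url else PASS


def reject_gif(url):
    return REJECT if url.endswith(".gif") else PASS


rules = [reject_everything, accept_example_host, reject_gif]
print(decide("http://www.example.com/index.html", rules))
print(decide("http://www.example.com/logo.gif", rules))
print(decide("http://other.example.org/", rules))
```

Under this model an early reject is only a starting point: the first URL is rejected by the first rule and then accepted by the second, so rule order matters as much as rule content.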
Thank you very much Stack. Usually, I only do domain crawls of one to three seeds at a time. I guess that is the real reason for the low bandwidth. ... know ...
... Programmatically, or as a crawl operator seeking a summary report? If the latter, there are not-yet-documented query-string parameters that may be added to...
Right now Heritrix defines ${HOSTNAME} for path substitutions, but it would be useful (to me, at least) to have others. For example, if I have a separate...
... Makes sense Tom. If you want to send over a patch, that'd be appreciated though I'd say doing it right might be a bit of work. As it is currently, the...
Look for the call to clearHeld() in WorkQueueFrontier: that's the point where the queue is no longer actively 'held' by the other queues-of-queues, because...
Dear All, We are taking Digital Preservation class this semester. Part of the project is to apply Heritrix to collect documents. Our project is to preserve...
Thanks a bunch. Some sort of notification mechanism (or protected methods from where we can send notifications) would be of great help. We were trying to...
There are two features I would like to see in Heritrix, and before I implement them, I would like to ask if anyone else has, and if so, if they would be...
... Check out this feature: http://crawler.archive.org/articles/user_manual.html#credentials. Looks like the edventure site takes a HTTP POST of login...
... I saw the JMX interface, but I didn't realize the crawler could be told to 'hold' at start and stop with 0 seeds. I assume that's the frontier's...
... Interesting. Is the size on disk the concern, or the performance? If the latter, I don't know how much of a benefit marking the whole site as done would...