Seems like the copy on crawler.archive.org is corrupt. This is actually a paper that was presented at the 2004 International Web Archiving Workshop and is...
A JDBCWriter I once used to write HTML content into a MySQL database. ... package org.archive.crawler.writer; import...
ansi
mymaillist@...
Sep 6, 2007 8:28 pm
4539
Hi, The last part of my DecidingScope profiles includes a section to exclude any URIs that match any regular expressions in a list of 0 or more URIs like this...
Andrea Goethals
andrea_goethals@...
Sep 11, 2007 9:36 pm
4540
I wanted to extend QueueAssignmentPolicy, so I created my own, NicknameQueueAssignmentPolicy: /* NicknameQueueAssignmentPolicy * * $Id:...
I've figured out some of this. As far as escaping any part of the URI to make it into a Java regular expression, it seems to work whether or not you escape...
Andrea Goethals
andrea_goethals@...
Sep 12, 2007 9:44 pm
4542
I have a problem: if I make a new QueueAssignmentPolicy, how can I use it in Heritrix? How do I change the AbstractFrontier? My AbstractFrontier looks like this: ...
Keeping track of the several levels of escaping can be challenging. My main suggestion would be that even if you are composing the order.xml directly for your...
Hi all, I'm new to Heritrix, so all of these questions may have simple answers. I want to crawl WAP sites, for example http://wap.gezeglen.com When...
Thanks for the response. I am using Heritrix 1.12.1. The problem I was seeing with the & is that the order.xml can't handle having an unescaped (& instead of...
Andrea Goethals
andrea_goethals@...
Sep 17, 2007 5:53 pm
4546
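The two escaping layers discussed in this thread (regex escaping first, then XML escaping for order.xml) can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are made up for the example and are not part of Heritrix, the URI is invented, and `Pattern.quote` is just one way to get a literal-match regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix: illustrates the two escaping
// layers from this thread (Java regex first, then XML text content).
class UriRegexEscaping {

    // Layer 1: turn a literal URI into a regex that matches only that URI.
    static String literalRegex(String uri) {
        return Pattern.quote(uri);
    }

    // Layer 2: escape the characters XML reserves in text content.
    // '&' is replaced first so the other entities are not double-escaped.
    static String xmlEscape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String uri = "http://example.com/page?a=1&b=2"; // made-up URI
        String regex = literalRegex(uri);

        System.out.println(regex);                       // regex as Java sees it
        System.out.println(xmlEscape(regex));            // form safe to paste into XML
        System.out.println(Pattern.matches(regex, uri)); // sanity check
    }
}
```

Editing the regex through a UI that does the XML escaping for you avoids the second layer entirely; the sketch is mainly useful when composing the XML by hand.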
Hello, I have a problem: I failed to build Heritrix with Maven. Heritrix (selftest) runs successfully in Eclipse, but Maven fails to build the project. There...
Carolyn
chq_qing@...
Sep 18, 2007 3:25 am
4547
Check the tests that failed. Could they be failing because you are running on Windows -- Heritrix is not 'officially' supported on Windows -- or perhaps...
Thanks for the suggestion. Yes, I'm using Heritrix 1.12.1. I just start Eclipse and run Heritrix as a Java application. When I reset the computer, I...
Yes, it's under branches/pjack_settings. It requires maven2 to build. I would characterize the code as stable but untested. We are working a few final...
Nick, From your later message, I assume you succeeded in making your NicknameQueueAssignmentPolicy appear in the web UI. (FYI, it is not necessary to edit...
Thanks Michael Stack and Olaf Freyer. With your help, the trouble is solved: I fixed two failed test cases (org.archive.crawler.extractor.ExtractorHTMLTest and...
Carolyn
chq_qing@...
Sep 19, 2007 3:51 am
4557
Mr. Mohr, from your response, NicknameQueueAssignmentPolicy will be problematic, and I understand why the download speed is initially fast and then very slow. Your...
Hiya all, When trying to crawl http://ibeatrice.blogspot.com, Heritrix 1.8 (under WCT) only collects the front page and prints the following stack trace to...
Hi Jackson - I tried reproducing in 1.8.0 with a usual scope/processors setup against the given URI, and could not: the page is extracted without error and ...
Hiya Gordon, It turns out that I still had my investigative build of Heritrix on that machine, probing the ExtractorDOC bug I found a while ago, and it was my ...
I need to do a regular expression search and replace on all uris that the crawler finds, before they are processed. I am having trouble deciding where the...
The Heritrix Development team will be putting out a preview release soon for Heritrix 2.x and we'd like to enlist a broad number of testers to experiment with...
Dear Kris: I am very interested in Heritrix, and I hope I can join the test team. I have followed your project for several months, and have read the...
Carolyn
chq_qing@...
Sep 24, 2007 2:44 am
4564
Hello, I've got a question: how can I log every link that hasn't been downloaded in a crawl? The reason why it was skipped is important too. I'm downloading only sites that end...
Hi Adam, You can just uncomment org.archive.crawler.postprocessor.LinksScoper.level = INFO in the heritrix.properties, and set LinksScoper's override-logger to...
Thanks a lot. It helps very much ;] Now I'm logging into separate file and I've set some filter (I'm experimenting with index.html/htm/php/asp) But still I...