I've been able to build and run the crawler. However, now that I've made some changes to the code, how do I build my new version? The 'maven dist' doesn't...
... Please paste in the error you're seeing Adam. 'maven dist' or 'maven jar' should just work including any code changes you've made to the base src into the...
thanks for the quick reply. i made a simple change to the ARCReader class and then did 'maven dist:build' which seems to take forever. here's the output....
... The 'dist' target does a bunch of packaging (builds two webapps, generates documentation, builds src and bin packages). It's probably not what you want. ...
I'm trying to run the precompiled heritrix-1.0.4.zip on (gasp!) Windows XP and am running into problems right from the get-go. ... set HERITRIX_HOME=C:\heritrix ...
pls try this script ... @rem ************************************************************************* @rem This script is used to start Heritrix. @rem @rem...
ansi
mymaillist@...
Nov 3, 2004 4:47 pm
1142
I have a couple of large crawls that I want to start, but will hold off for 1.2 to be labeled before doing them if we're close. How close is HEAD to what will...
... It's looking like Monday or Tuesday. We've run our base test plan and all seems fine and dandy but a shadow 1.2 crawl of a 1.0.5 crawl -- the HEAD of the...
It seems to me that the filedesc:// URL-record in the generated ARC-files has an error. There are 2 newlines after the content, which causes the length of the ...
In ARCWriter#generateARCFileMetaData it does this after writing the metadata: // Write out a couple of LINE_SEPARATORs to end this record. metabaos.write(("" +...
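To make the length question in this exchange concrete, here is a minimal Python sketch, not Heritrix code. It assumes a hypothetical record layout (a header line holding only the declared body length) and a one-byte line separator. The names `write_record` and `read_record` are invented for illustration. It shows that when a writer appends two separators after the body, a reader that trusts the declared length is left with two unread bytes before the next record. Whether the real ARC writer counts those bytes in the declared length is not settled by the excerpt above.

```python
import io

LINE_SEPARATOR = b"\n"  # assumption: a single-byte separator


def write_record(out, body, trailing=2):
    """Write a hypothetical record: declared length, body, trailing separators."""
    out.write(b"%d\n" % len(body))  # header declares the body length only
    out.write(body)
    out.write(LINE_SEPARATOR * trailing)  # mirrors "a couple of LINE_SEPARATORs"


def read_record(stream):
    """Read one record body using only the declared length."""
    declared = int(stream.readline())
    return stream.read(declared)


buf = io.BytesIO()
write_record(buf, b"<meta>hypothetical metadata</meta>")
buf.seek(0)
body = read_record(buf)
leftover = buf.read()  # bytes the reader still has to skip
```

If the declared length were meant to include the separators, the header would need to say `len(body) + 2` under these assumptions.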
I've read message 841 and the article at http://www.dreamersrealm.net/tree/blog/2004/08/19. A hybrid method is proposed there to limit crawls to HTML. I...
... The whole saga can be found at http://www.dreamersrealm.net/tree/blog/heritrix/ which has some further notes not included in the 19 August post. ... ...
It's looking like the release won't happen till Friday at the earliest. We're going to let some comparison test crawls that we have running here go to completion so...
I also notice in the FAQ at the homepage of Heritrix, the answer for the common problem 5, "..., or, if you want to instead look at document mimetypes, you can...
... Here's a note on midfetch filter from user manual: "Its also possible to add in filters that are checked after the download of the HTTP response headers...
Where can I find the *midfetch-filters* filter? I'm using version 1.0.0; should I download a new version? ... possible to ... response ... filters to ... (Aborted ...
Hi ozimmels, You can specify a cookie file in 'settings' tab -- HTTP Fetcher: load-cookies-from-file. File needs to be in Netscape format. If I am not mistaken...
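For readers unsure what a Netscape-format cookie file looks like, the sketch below writes one with Python and reads it back using the standard library's `http.cookiejar.MozillaCookieJar`, which parses that format. The domain, cookie name, value, and expiry are made-up placeholders, and `write_cookie_file` and `read_cookie_file` are hypothetical helper names. Whether Heritrix accepts every detail of a file produced this way is an assumption, not something confirmed in the thread.

```python
import http.cookiejar


def write_cookie_file(path, domain, name, value, expires=4102444800):
    """Write a one-cookie file in Netscape cookies.txt layout.

    Fields are tab-separated: domain, include-subdomains flag, path,
    secure flag, expiry (Unix seconds), name, value.
    """
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    fields = [domain, include_subdomains, "/", "FALSE", str(expires), name, value]
    with open(path, "w", encoding="ascii") as f:
        f.write("# Netscape HTTP Cookie File\n")  # header line expected by parsers
        f.write("\t".join(fields) + "\n")


def read_cookie_file(path):
    """Load the file with the standard-library parser as a sanity check."""
    jar = http.cookiejar.MozillaCookieJar()
    jar.load(path)
    return {cookie.name: cookie.value for cookie in jar}
```

A file written this way could then be named in the `load-cookies-from-file` setting mentioned above.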
You're stuck in the HTTP fetcher. We've seen issues fetching https in old versions but haven't seen it happening in 1.2.0 as yet (Below do not seem to be...
Here's a trace, though I don't think it did anything useful. The -SIGQUIT didn't work. Attaching to process ID 10533, please wait... Debugger attached...
... Remember that you have to sort the CDX file -- ExtractCDX doesn't do that as the file can easily become too big to have in memory, but e.g. Unix sort()...
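Where a Unix sort is not available, the same idea of sorting a file too big for memory can be sketched in Python as an external merge sort: sort fixed-size chunks, spill each to a temporary file, then merge the chunks lazily. This is an illustrative sketch only. `sort_large_file` and its `max_lines` parameter are hypothetical names, and lines are compared as raw bytes, which may or may not match the ordering a given CDX consumer expects.

```python
import heapq
import itertools
import os
import tempfile


def sort_large_file(src, dst, max_lines=100_000):
    """Sort the lines of src into dst, holding at most max_lines in memory."""
    chunk_paths = []
    try:
        with open(src, "rb") as infile:
            while True:
                # Read the next chunk of at most max_lines lines.
                chunk = list(itertools.islice(infile, max_lines))
                if not chunk:
                    break
                # Make sure the final line of the file is newline-terminated.
                if not chunk[-1].endswith(b"\n"):
                    chunk[-1] += b"\n"
                chunk.sort()  # byte-wise ordering
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "wb") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)
        # Merge the sorted chunks; heapq.merge reads them line by line.
        files = [open(p, "rb") for p in chunk_paths]
        try:
            with open(dst, "wb") as out:
                out.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

For example, `sort_large_file("arc.cdx", "sortedArc.cdx")` would write a sorted copy without loading the whole file at once.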
first - thanks for the help. some things are not clear to me yet. this is what i know 1. heritrix creates a : IAH-20041115081418-00001-zen.arc.gz 2. gunzip -d...
... So you've renamed sortedArc.cdx arc.cdx afterwards? Otherwise, you'll need to point at sortedArc.cdx when starting the proxyviewer below. ... It will try...
first - thanks for the help. some things are not clear to me yet. this is what i know 1. heritrix creates a : IAH-20041115081418-00001-zen.arc.gz You might want...