... The error might only occur if the '%' is at the end of the query string (see stacktrace). ... I would consider passing invalid URIs to the queue as a bug. We...
Christian Kohlschuetter
ck-heritrix@...
Jun 1, 2005 8:38 am
1901
I'm trying to crawl http://www.orcun.de which as far as I can tell is completely flash-based. I can't seem to get Heritrix to follow through the links in the...
We are in the process of upgrading to Heritrix 1.4 and we've noticed that when the crawl completes and the logging status says complete, the app never...
... After finishing a crawl, Heritrix waits around unless you explicitly ask it to shutdown via 'Shut down Heritrix software' link on main index page. Are you...
Dear all, We have made available a Heritrix processor that interfaces with rainbow, the most widely known, and perhaps most widely used, text classification...
The whole idea of having TextUtils.getMatcher() was to avoid Matcher instantiation, so we are basically talking about cache efficiency and performance. You are...
Christian Kohlschuetter
ck-heritrix@...
Jun 3, 2005 9:55 am
1907
We seem to have found the bug that caused this problem. We run Heritrix in an integrated software solution, so it's run via API with a command line like...
... Thank you both for the excellent contribution. The doc. is really great too: i.e. Overview.pdf (I like the suggestion of the classifier being used to...
... Thanks for this bug report! I have filed an issue along with a patch ([ 1214478 ] ThreadLocalHttpConnectionManager starts a non-daemon Thread) Cheers, -- ...
Can I use a tool such as pkunpack or BitZipper to extract data from the ARC file? I tried to get the ARC files from the Linux machine to an XP machine, and...
Is there an xsd file somewhere in the distribution for the order.xml file? I looked around the directories, but haven't located one. I noticed the order.xml...
... Unfortunately, there are other compressed aggregate file types out there that use the ARC extension, dating back all the way to DOS and early Mac, which can...
Hi, Christian. Good ideas, but with your suggested tactic, code that isn't careful could still have unpredictable results. Perhaps this is a contrived example,...
... Thank you very much for the replies. In this case, I think that if I wish to view each harvested item, I should write a program that first uses the arcreader command to...
Hi Gordon, yes, this example would not be valid for the getMatcher contract I suggested. I doubt that getMatcher should even support this, as it would render...
Christian Kohlschuetter
ck-heritrix@...
Jun 7, 2005 10:25 am
1920
Sorry, two typos (copy-and-paste...) ... throw new IllegalArgumentException("Pattern must not be null"); ... m = pattern.matcher(input); -- Christian...
Christian Kohlschuetter
ck-heritrix@...
Jun 7, 2005 11:19 am
1921
Additional notes: 1. TextUtils.returnMatcher(Matcher) should always be called in a "finally" clause to ensure that the Matcher comes back to the lending....
Christian Kohlschuetter
ck-heritrix@...
Jun 7, 2005 1:42 pm
1922
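A minimal sketch of the borrow-and-return pattern described in the notes above, assuming a per-thread cache keyed by the pattern string. The class MatcherPoolSketch, its cache layout, and the endsWithPercent helper are hypothetical names for illustration only; this is not Heritrix's TextUtils implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a Matcher lending cache; not Heritrix code.
final class MatcherPoolSketch {
    // One cache per thread, so a borrowed Matcher is never shared across threads.
    private static final ThreadLocal<Map<String, Matcher>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    private MatcherPoolSketch() {}

    /** Borrows a Matcher for the regex, reusing an idle one when available. */
    static Matcher getMatcher(String regex, CharSequence input) {
        if (regex == null) {
            throw new IllegalArgumentException("Pattern must not be null");
        }
        // remove() takes the Matcher out of the cache while it is on loan.
        Matcher m = CACHE.get().remove(regex);
        if (m == null) {
            return Pattern.compile(regex).matcher(input);
        }
        return m.reset(input);
    }

    /** Hands a borrowed Matcher back so a later getMatcher call can reuse it. */
    static void returnMatcher(Matcher m) {
        if (m == null) {
            return;
        }
        m.reset(""); // drop the reference to the caller's input
        CACHE.get().put(m.pattern().pattern(), m);
    }

    /** Number of idle Matchers cached for the current thread. */
    static int idleCount() {
        return CACHE.get().size();
    }

    /** Example caller: the return happens in finally, even if find() throws. */
    static boolean endsWithPercent(String query) {
        Matcher m = getMatcher("%$", query);
        try {
            return m.find();
        } finally {
            returnMatcher(m);
        }
    }

    public static void main(String[] args) {
        System.out.println(endsWithPercent("a=b%"));
        System.out.println(endsWithPercent("a=b"));
    }
}
```

Because the Matcher is removed from the cache while on loan, a caller that skips the return only loses the reuse; it cannot corrupt another borrower's state. The finally clause keeps the cache warm on the error path as well.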
... Or subclass ARCReader and as it's running through the records, have it write tmp.pdf, rather than just offsets, its default behavior. Or use the...
I've been looking into using the command line JMX client (very slick, btw) and am struck with the question about the difference between adding a URL to the...
Matt Ittigson
cydatamatt@...
Jun 8, 2005 3:21 pm
1924
What I want to do is use a template order.xml and settings dir, with our default configuration, settings, etc. as the basis for creating new crawls (but not...
Personally I'd just use 'sed' to replace certain keywords in your templates with the values you want. Doesn't seem worth making anything more complex,...
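For anyone who would rather do the keyword replacement from Java than from 'sed', here is a minimal sketch of the same idea. The @KEY@ placeholder convention, the class name OrderTemplateSketch, and the XML fragment in main are assumptions made up for illustration; they are not part of the Heritrix order.xml format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: fills @KEY@ placeholders in a template string.
final class OrderTemplateSketch {
    private OrderTemplateSketch() {}

    /** Replaces every @KEY@ in the template with the XML-escaped value for KEY. */
    static String fill(String template, Map<String, String> values) {
        String out = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            out = out.replace("@" + e.getKey() + "@", escapeXml(e.getValue()));
        }
        return out;
    }

    /** Escapes the characters that would otherwise break element content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Illustrative fragment only; a real template would be read from a file.
        String template = "<name>@CRAWL_NAME@</name>";
        Map<String, String> values = new LinkedHashMap<>();
        values.put("CRAWL_NAME", "news & blogs");
        System.out.println(fill(template, values));
    }
}
```

Escaping the values is the one thing the plain 'sed' approach does not give you for free: a seed URL containing '&' would otherwise produce an order file that no longer parses.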
Thanks, Tom. I'll take a look at doing it that way too. This seems to be my week for premature posting to groups. I seem to have figured out what I was doing...
What did you use to validate it? I get this error: "org.apache.xmlbeans.XmlException: error: Unexpected element: CDATA" when I try to parse an order.xml file...
... FYI, in $HERITRIX_HOME/src/resources/arcMetaheaderBody.xsl is a stylesheet that reads an order file and writes a subset for inclusion at the head of every...