In this case, it's not the crawler that's sending any email, but the server-side script it's hitting, which is specified in the FORM ACTION attribute. For...
Last week we upgraded the heart of Heritrix, the HttpClient lib, from release 2.0 to 3.0. It's 3.0 alpha2 but, by all accounts, it's stable, and that's been our...
stack
stack@...
Oct 4, 2004 5:36 pm
1066
... Yes, I've been running my test crawl (the one that died and whose log I sent you) for the last 10 hours using an update against HEAD I made around 07h30...
... I ran a Heritrix instance using BdbFrontier yesterday. It reported 'Downloaded 1163658 documents in 23 h., 15 min. and 20 sec'. No KeyQueue naming errors now. ...
ansi
mymaillist@...
Oct 6, 2004 4:18 am
1068
... Good stuff. We'll try not to break it again (smile). ... Thanks for the alerts, Ansi. Looks like your instance is using 'SingleHttpConnectionManager'...
stack
stack@...
Oct 6, 2004 5:22 pm
1069
I have an ARC file (generated from an ongoing crawl using a recent HEAD snapshot) that causes the arcreader to die: Exception in thread "main"...
Is there a specification for the SURT beyond the paragraph in the User's Guide? Could you add real examples to the manual? Thanks. -tree -- Tom Emerson...
... Can I have your ARC, Tom? It might give a clue on how the records were constructed. Thanks, St.Ack P.S. You got note on questions for the 'archive pass'...
stack
stack@...
Oct 8, 2004 9:00 pm
1072
... Gordon was talking up SURT at LoC this week (Library of Congress). I'll get him to add his notes into manual (You've seen the issue? It has some good...
stack
stack@...
Oct 8, 2004 9:07 pm
1073
I get this Exception, too :( Ansi...
ansi
mymaillist@...
Oct 9, 2004 1:00 am
1074
Is there a Maven goal I can use to build just enough to test Heritrix, including the Web GUI? By test I mean configure and submit a job, and run a short...
... 'maven jar' will build the jar only. It does the unit tests, which you probably want, but not all of the other doc. generation. The jar gets created...
stack
stack@...
Oct 11, 2004 6:24 pm
1076
I am interested in using the Heritrix extractors to pull links from HTML documents. The problem is that in addition to the links, I need to know the position...
... It's not currently supported. To know the position of each link in a page, you'll need to doctor each of the extractors you're interested in to log the link...
stack
stack@...
Oct 12, 2004 6:43 pm
1078
The link on the main heritrix page (http://crawler.archive.org/articles/user_manual.html) just brings up a blank page. I was hoping to find some documentation...
Something probably went wrong with the auto generation during the most recent build. I'm sure Michael will fix it once he gets in. In the meantime you can use...
... Should be fixed by the time ye get this mail (Bad src xml). Thanks for pointing it out. ... Yeah, it's a new feature as Kris says. Here's the little note...
stack
stack@...
Oct 13, 2004 4:45 pm
1081
robeger writes: [...] ... What in particular would you like to know? I wrote the filter, so ask away. ;-) -tree -- Tom Emerson...
I was looking at your notes on http://www.dreamersrealm.net/tree/blog/2004/08/19/#html_only about it. Sounds like what I want to do - just grab text content....
Hi all, Are the crawl.log fields described somewhere? I figured it out myself and wrote my own doc by reading the code after not finding anything in the...
... The user manual has a coarse description. See '8.2.1. crawl.log' in http://crawler.archive.org/articles/user_manual.html. It could be tightened up. Send...
stack
stack@...
Oct 13, 2004 8:01 pm
1085
... The above sounds like a decent tactic. Leaving off the pre-fetch filter would mean that you'd do content-type checks only. Might be more suited to your...
stack
stack@...
Oct 13, 2004 8:17 pm
1086
... Fixed in HEAD: https://sourceforge.net/tracker/index.php?func=detail&aid=1045736&group_id=73833&atid=539099. St.Ack...
stack
stack@...
Oct 14, 2004 1:37 am
1087
... Of course when I went looking the User Manual wasn't available online yet. What I ended up with is pretty much what's there. I would find it more readable...
... [...] ... I don't think you need the mid-fetch filter, but I may be missing something. ... Yes, one regexp will give you better performance. The one stack...
... General plan is to build a meaty glossary and then mess with xinclude to duplicate the meaty snippets throughout the docs (Haven't gotten to the xinclude...
stack
stack@...
Oct 14, 2004 4:16 pm
1090
I'm trying to wrap my head around the following observations about the seeds in a crawl I did. - The original seed list has 280 URLs. - The seed list after the...
Hi Tom, First of all, you should definitely care about this kind of discrepancy. It is very important that all reports are accurate and that they make sense....
Hi, I got someone to install Heritrix on a machine for me. I just gave them the user manual link. Following the instructions in there they simply installed a...
Williamson, Mark
Mark.Williamson@...
Oct 16, 2004 7:28 am
1093
... That's interesting, Mark. It works with full SDK? Make a bug and I'll fix the manual. This is 1.0.4? Is it running the selftest when this happens because...