archive-crawler
Messages 2486 - 2515 of 6147
2486
Hello Joe/Community, I believe you can make use of the JMX call exposed by Heritrix, "interrupt(String threadName)", to kill a runaway thread. The call...
Vishwesh Thakur
vishwesh_thakur
Jan 2, 2006
6:06 am
2487
In such a situation, it's good to run a 'Threads Report' to see what the laggard thread is doing; it might be legitimately in the middle of a long download. ...
Gordon Mohr (archive....
gojomo
Jan 2, 2006
5:20 pm
2488
... Yes. The second crawl will re-fetch all pages though they may have been just-fetched by a previous crawl. By default, a Heritrix crawl has no knowledge of...
stack
stackarchiveorg
Jan 2, 2006
9:08 pm
2489
I finally logged in to the web console and looked at the thread report. It looks like the thread is stuck at one regular expression call (java.util.regex) and...
joehung302
Jan 4, 2006
12:27 am
2490
What is the difference? I have a proof crawler set up with a 1500MB heap. It can never run the Bloom filter for more than 1 day (maybe half a day). I've tried several ...
joehung302
Jan 4, 2006
12:41 am
2491
... If it was busy in regex extraction, rather than a network fetch, there's probably a reproducible problem with our regex expressions, where they perform...
Gordon Mohr (archive....
gojomo
Jan 4, 2006
1:06 am
2492
Hello, I want to restrict the number of fetched documents per webserver. I am using the QuotaEnforcer prefetch processor. But I am unsure which directive should...
Thimo Eichstädt
abc@...
Jan 4, 2006
2:24 am
2493
Hi all, Does anyone know how I can retrieve the original seed URI from the current CrawlURI (which is not always the base URI, as they got somehow redirected...
Samuel
samendonca
Jan 4, 2006
12:56 pm
2494
... pass ... from). ... directory ... I did that and I was able to create a job and it does look like a recovered job. However I'm not seeing any directory...
joehung302
Jan 4, 2006
6:16 pm
2495
I was not able to recover a job from a previous checkpoint. Here is the error message. Any suggestions? cheers, -joe ...
joehung302
Jan 4, 2006
7:28 pm
2496
... Was the 'previous checkpoint' not a 'fast checkpoint'? How did you instantiate the recovery? Via the UI? Does the listing of bdbje logs under the...
stack
stackarchiveorg
Jan 4, 2006
7:39 pm
2497
... It's not possible currently. We've been talking about adding this feature for a while (There's an RFE here: ...
stack
stackarchiveorg
Jan 4, 2006
7:50 pm
2498
... The recovered job resumes the crawl? Did you pass an order file to JMX addJob, or a jar of order + seeds + overrides, etc.? If the latter, it will make a new...
stack
stackarchiveorg
Jan 4, 2006
7:53 pm
2499
... One makes counts on a 'CrawlHost' [http://crawler.archive.org/xref/org/archive/crawler/datamodel/CrawlHost.html] basis. The other on a 'CrawlServer' ...
stack
stackarchiveorg
Jan 4, 2006
8:03 pm
2500
... Why? OOME? Can you give crawler bigger heap? Bloom filter by default uses ~500Megs [See note here: ...
stack
stackarchiveorg
Jan 4, 2006
8:35 pm
2501
... the ... dir). ... directory ... I was doing the former. Seems like a bad choice... Assuming I don't have any overrides, can I just pass it a jar file ...
joehung302
Jan 4, 2006
9:15 pm
2502
... is ... you ... I have <boolean name="checkpoint-copy-bdbje-logs">true</boolean> so I guess the fast checkpointing is off. I did the instantiation from the...
joehung302
Jan 4, 2006
9:31 pm
2503
... Yes. ... Others here that make use of it pass the empty string (Need to add an override that is absent the seeds parameter). Keep asking questions, St.Ack...
stack
stackarchiveorg
Jan 4, 2006
10:17 pm
2504
... Is it possible that the job successfully started from this checkpoint wrote back atop the checkpoint (Sounds like it's possible going by your mails of...
stack
stackarchiveorg
Jan 4, 2006
10:26 pm
2505
Here is more background information on the problem: There are 2 crawl jobs involved, each with 3 checkpoint directories. The first crawl job I started about 5 days...
joehung302
Jan 4, 2006
10:42 pm
2506
Hello people. :) I hope you can help me in making Heritrix work. I was able to start the Heritrix UI and run crawling jobs. However, all my jobs hang at 33%...
alxartes
Jan 5, 2006
4:03 am
2507
... I would add to Stack's examples of things combined in the 'host' count: other HTTP servers on nonstandard ports, such as "http://example.com:8080". All...
Gordon Mohr (archive....
gojomo
Jan 6, 2006
12:04 am
2508
... problems, ... the ... Here is the hung thread along with the problem URL: [ToeThread #24: http://www.just-for-golf.com/golf-vacation.html CrawlURI...
joehung302
Jan 6, 2006
7:20 am
2509
... Quite possible. I found this exception in heritrix_out.log ================================ 01/06/2006 06:28:39 +0000 SEVERE ...
joehung302
Jan 6, 2006
8:12 am
2510
When running under Ubuntu, my latest Mozilla cannot access the WUI. Why? What can I do? I run Heritrix from Eclipse 3.1 and the Heritrix version is 1.4.0-src....
jilin05
Jan 8, 2006
7:00 am
2511
... Did the problem occur only after upgrading Mozilla? What URL are you using to access the admin web UI? What error does Mozilla give? Is the Heritrix...
Gordon Mohr
gojomo
Jan 8, 2006
7:07 pm
2512
Hi members, I am now using Mozilla/5.0. The error that Mozilla gives is "the operation timed out when attempting to contact wanghui". The URL is...
jilin05
Jan 9, 2006
3:03 am
2513
I have a job that is getting very large; the partition that Heritrix lives on will fill up soon. I have paused the job. I would like to shut down Heritrix,...
sbsbofh
Jan 9, 2006
4:31 pm
2514
... They do not. You need to checkpoint and then in the new location, do a recovery from the checkpoint made at end-of-crawl. See section 9.4, "Checkpointing"...
stack
stackarchiveorg
Jan 9, 2006
5:49 pm
2515
If you've never had success running Heritrix on this machine, the browser is probably a non-issue. The error message suggests nothing is listening on the...
Gordon Mohr (archive....
gojomo
Jan 9, 2006
6:43 pm