archive-crawler
Messages 2825 - 2855 of 6147
2825
... The link posted by libsoft in an earlier response is a good place to start. Change the filters so they look for text/xml and application/xml rather than...
Michael Stack
stackarchiveorg
May 1, 2006
7:33 pm
2826
Hi, I have a question about QuotaEnforcer. Is there any way to log sites that have more pages than the limit set in host-max-fetch-successes? (when the site...
Adam Broke
minonanux
May 2, 2006
11:04 am
2827
... You'd have to add it. Would be easy enough to do though. See http://crawler.archive.org/xref/org/archive/crawler/prefetch/QuotaEnforcer.html#237. Add in...
Michael Stack
stackarchiveorg
May 2, 2006
3:58 pm
2828
... Any URIs blocked by quota will appear in the crawl.log with the -5003 (S_BLOCKED_BY_QUOTA) code, so that might be enough. The ultimate vision for this...
Gordon Mohr
gojomo
May 2, 2006
7:20 pm
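A rough sketch of the crawl.log approach mentioned in message 2828: scan the log for lines carrying the -5003 (S_BLOCKED_BY_QUOTA) code and tally them per host. It assumes the status code is the second whitespace-separated field and the URI the fourth; check that against your own crawl.log before relying on it. The function name and the sample lines are made up for illustration and do not come from Heritrix.

```python
from collections import Counter
from urllib.parse import urlsplit

S_BLOCKED_BY_QUOTA = "-5003"  # status code named in message 2828


def hosts_blocked_by_quota(lines):
    """Count quota-blocked URIs per host from crawl.log lines.

    Assumed field layout (verify against your log):
    timestamp, status code, size, URI, ...
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[1] != S_BLOCKED_BY_QUOTA:
            continue  # too short, or not a quota-blocked line
        host = urlsplit(fields[3]).hostname
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample lines, invented for illustration only.
sample = [
    "2006-05-02T19:20:00.000Z -5003 - http://example.com/a L http://example.com/ -",
    "2006-05-02T19:20:01.000Z 200 1234 http://example.org/ - - text/html",
    "2006-05-02T19:20:02.000Z -5003 - http://example.com/b L http://example.com/ -",
]

for host, n in hosts_blocked_by_quota(sample).most_common():
    print(host, n)
```

To run it against a real log, pass an open file instead of the sample list, for example `hosts_blocked_by_quota(open("crawl.log"))`.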
2829
Hi there, I'm using Heritrix to crawl about 130 domains. Some domains need special configurations, and I'm doing this with Refinement (one for...
andlanna
May 2, 2006
10:05 pm
2830
I'm not familiar with this problem; is there any more information in the Alert -- like a stack trace? - Gordon @ IA...
Gordon Mohr
gojomo
May 2, 2006
11:20 pm
2831
No, there are no exceptions or stack trace with the alert. Only this... André Lanna ... the ... crawled....
andlanna
May 3, 2006
12:55 am
2832
Hmm, OK, thanks. There may be more information in heritrix_out.log. Can you look through that file for any SEVERE lines that correspond to the alerts, and...
Gordon Mohr
gojomo
May 3, 2006
1:26 am
2833
This is a problem I had and reported (offlist) earlier in the year. The bug report: ...
kris@...
kristsi25
May 3, 2006
8:39 am
2834
The Internet Archive, home of the Heritrix web crawler, Wayback archive browser, and NutchWAX archive search engine projects, has current opportunities for...
Gordon Mohr
gojomo
May 3, 2006
9:51 pm
2835
Hi all: Can anyone confirm that (in theory) the Heritrix crawl.log file can be entirely reproduced from the ARC files generated by a successful crawl? And to...
Gordon Paynter
Gordon.Paynter@...
May 4, 2006
12:58 am
2836
... You could come close, but not reproduce all info in these logs from ARCs. For example, taking a representative successful crawl.log line: ...
Gordon Mohr
gojomo
May 4, 2006
1:40 am
2837
Hello, I know the OutOfMemory problem has been mentioned here many times, but I could not find any topic that solves my problem. I am working on crawling whole...
Adam Brokes
goblin_cz
May 6, 2006
8:40 am
2839
... Some suggestions regarding the OOME: - If you are using the ExtractorSWF, use the latest release (1.8.0, officially out just today) or disable the...
Gordon Mohr
gojomo
May 8, 2006
7:30 pm
2840
Hi everyone, I've dug some more through the Heritrix sources to track down the problem but haven't found a solution yet. This problem is starting to get really...
pandae667
May 9, 2006
9:39 am
2841
Hi, thanks a lot for your advice. Today I started Heritrix 1.8.0 with these options: JAVA_OPTS= -Xmx1800m on Java(TM) 2 Runtime Environment, Standard...
goblin_cz
May 9, 2006
1:19 pm
2842
And when I decrease the size of the seeds list to 50000, the crawl starts normally, but then I receive this alert many times: Serious error occured trying to...
goblin_cz
May 9, 2006
1:28 pm
2843
Olaf, I apologize for not responding to your initial report sooner. Of course, this should never happen: the 'queued' tally should match the total number of...
Gordon Mohr
gojomo
May 9, 2006
7:37 pm
2844
These "java.lang.OutOfMemoryError: unable to create new native thread" errors are exactly the sort of OOMEs I mentioned that are *not* due to a shortage of...
Gordon Mohr
gojomo
May 9, 2006
7:50 pm
2845
Hi Gordon, ... Discovered: 20652 Queued: 12 Finished: 8249 Successfully: 8132 Failed: 28 Disregarded: 89 ... Already included size:...
pandae667
May 10, 2006
9:08 am
2846
Hi, In the HeritrixProtocolSocketFactory class (Heritrix 1.8), the following method: public Socket createSocket(String host, int port) throws IOException,...
tizo_trico
May 10, 2006
10:22 am
2847
Hi everyone, I have yet another question: Is it possible to specify the location of the profiles folder (either via a heritrix.properties setting or via...
pandae667
May 10, 2006
10:34 am
2848
... I'll change it. Did it burn you in some way? Any progress on hcc? Thanks Tizo. St.Ack...
Michael Stack
stackarchiveorg
May 10, 2006
3:46 pm
2849
... No. We look for profiles in the conf directory. The conf directory is expected to be at HERITRIX_HOME/conf. What if we added a system property (and environment...
Michael Stack
stackarchiveorg
May 10, 2006
3:53 pm
2850
without stopping the crawler? Currently we have a filter nocrawl-sites.surt that looks like this +http://(com,companycomplains, +http://(com,badcompany, ... ...
joehung302
May 10, 2006
5:14 pm
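For readers unfamiliar with the SURT prefixes quoted in message 2850, here is a minimal sketch of how a host name maps to that form: the labels are reversed and comma-separated, with a trailing comma so the prefix also covers subdomains. The helper name is hypothetical, and this covers only the plain-host case; Heritrix's own SURT handling deals with more than this (paths, ports, and so on). The leading "+" seen in the message is not produced here.

```python
def host_to_surt_prefix(host, scheme="http"):
    """Build a SURT-style prefix for a bare host name.

    Illustrative only: reverses the dot-separated labels and joins
    them with commas, leaving a trailing comma.
    """
    labels = [label for label in host.lower().strip(".").split(".") if label]
    return f"{scheme}://(" + ",".join(reversed(labels)) + ","


print(host_to_surt_prefix("badcompany.com"))  # -> http://(com,badcompany,
```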
2851
... Hi St. Ack, that would be really helpful. That way I could just unpack a new Heritrix version (for as long as configurations don't change significantly)...
pandae667
May 10, 2006
5:46 pm
2852
... What happens if you pause the crawler, remove the current filter, add a new filter of a different name but with the same list plus the new element? It's still a...
Michael Stack
stackarchiveorg
May 10, 2006
5:46 pm
2853
... I've added an RFE: http://sourceforge.net/tracker/index.php?func=detail&aid=1485819&group_id=73833&atid=539102. Shouldn't be hard to do. Most other disk...
Michael Stack
stackarchiveorg
May 10, 2006
6:07 pm
2854
If you are using DecidingScope you can have a SurtPrefixedDecideRule rule that might look like this: <newObject name="rejectIfSurtPrefixed"...
Igor Ranitovic
iranitovic
May 10, 2006
6:38 pm
2855
St.Ack, We have the HCC tests on standby, because we have to show some people that Heritrix can run multiple parallel jobs on the same machine. As a colleague...
tizo_trico
May 11, 2006
3:35 am