archive-crawler
Messages 3770 - 3799 of 6142
3770
I am using heritrix-1.10.1 on Windows XP to crawl a local site. It works well, but when I check it in the WUI it shows 99% completed and never finishes. Then I...
vjsjolly
Feb 2, 2007
12:59 pm
3771
Read about the ARC file format here, http://crawler.archive.org/articles/developer_manual/arcs.html, in the Developer's Manual. It's not a zip file. Use gzip...
Michael Stack
stackarchiveorg
Feb 2, 2007
4:05 pm
3772
Hello all, During the course of crawling, ExtractorHTML will construct outlinks using any form action urls that it encounters. Looking through the source...
blah_1977
Feb 3, 2007
2:57 am
3773
On a related note, if during the course of discovering form action urls with long parameter strings, the discovered outlink has length > 2083 (i.e., the max....
blah_1977
Feb 3, 2007
4:23 am
3774
I just had a crawl using the latest Heritrix run out of disk space, and it is now in an unusable state. This surprises me, as I've had crawls with 1.8 run...
Eric
ej@...
Feb 3, 2007
6:22 pm
3775
Once a crawl hits an out-of-disk condition, it may be in an unresumable and uncheckpointable state -- cleanly recovering from everywhere this might happen...
Gordon Mohr
gojomo
Feb 3, 2007
8:08 pm
3776
Ah, I just had to wait for all 10 timeouts to happen. Then I could pause and checkpoint just fine and now it's back to running as normal. However, I do have...
Eric
ej@...
Feb 3, 2007
8:11 pm
3777
That's a good idea. Don't run out of disk space. (Said the Seagate stockholder. Ok, not anymore - I sold it during the Veritas stock flip.) Point still holds....
John Lekashman
lekash
Feb 3, 2007
11:06 pm
3778
I have a crawler stuck on "Job Status: Could". I've yet to see this. Can't seem to checkpoint. Will have to hard kill....
Oliver
oliverc.rm
Feb 5, 2007
7:23 pm
3779
... Never seen that. If you see anything interesting from your alerts, heritrix_out.log, or other diagnostics (like SIGQUIT or 'jstack' thread dump), please...
Gordon Mohr
gojomo
Feb 5, 2007
8:19 pm
3780
0 alerts for this job. Job never started, it seems. A failure while loading order.xml put the crawler into a "Could" state. Will rerun this order.xml and see what...
Oliver
oliverc.rm
Feb 5, 2007
9:11 pm
3781
OK, found the problem. Somehow, when the order.xml was sent, all I got for the name field was "<name/>". It was lacking the first part, "<name>". I'm going to guess...
Oliver
oliverc.rm
Feb 5, 2007
9:48 pm
3782
Hello everyone, The Crawl-by-Example plugin for Heritrix, developed as part of a Google Summer of Code project under the guidance of Gordon Mohr, is now released for ...
Michael
mike_be
Feb 6, 2007
9:30 am
3783
Hi all, I started working on some utilities to use the archives on Windows, and decided to share my work with you. You'll find more info on...
rnebor
Feb 6, 2007
6:54 pm
3784
... Heritrix only does HTTP POST when configured to supply login credentials. See the 6.2.3. Credentials section on this page ...
Michael Stack
stack@...
Feb 6, 2007
7:30 pm
3785
Thank you. Nice diagram. What kind of changes did you make to libarc to make it run on Windows? Were they just porting changes or were there any patches that...
Michael Stack
stackarchiveorg
Feb 6, 2007
11:13 pm
3786
Thanks. Yes, mostly porting changes were made to the original lib. I'll have to make sure that everything is correctly isolated for the next release, as so far...
rnebor
Feb 6, 2007
11:55 pm
3787
I tried running Heritrix 1.10.2 on a server with a firewall and encountered errors. Both the GUIPORT and the JMXPORT are already...
alxartes
Feb 7, 2007
11:45 am
3788
Hi all, I'd be more than happy to integrate Regis's changes into libarc. When I get a spare minute, I'll take a look at them and update things appropriately....
Tom Emerson
TEmerson@...
Feb 7, 2007
2:02 pm
3789
Hey Alexis: You cannot reach the GUI through the firewall? I'd think that this at least should work. In standalone mode, Heritrix registers itself with the...
Michael Stack
stack@...
Feb 7, 2007
5:21 pm
3790
Hello all, I am trying to set up a cluster. There are several Heritrix-1.11.0 instances successfully running and registered at ... # java -cp hcc-0.2.0.jar...
p0ruchik
Feb 8, 2007
2:41 pm
3791
Do you have archive-commons.jar in your CLASSPATH? (You can get one here, http://builds.archive.org:8080/cruisecontrol/buildresults/HEAD-heritrix, under the...
Michael Stack
stack@...
Feb 8, 2007
3:53 pm
3792
... Feb 8, 2007 8:35:21 PM org.archive.hcc.ClusterControllerBean init INFO: maxPerContainer setting: 5 javax.naming.NoInitialContextException: Need to specify...
p0ruchik
Feb 8, 2007
4:44 pm
3793
You need a jndi.properties on your CLASSPATH. See the jndi.properties in Heritrix. At the head of the file is a comment describing a setup using JBOSS, for example. ...
Michael Stack
stack@...
Feb 8, 2007
4:58 pm
3794
You can find it here: http://www.zvents.com/labs/hdfs_writer_processor I've iterated on this a bit and have used it for a 5 million document crawl with no...
nuggetwheat
Feb 8, 2007
7:17 pm
3795
... Thank you very much! Connection is created. But the next exception is ... Feb 9, 2007 1:04:49 AM org.archive.hcc.ClusterControllerBean init INFO:...
p0ruchik
Feb 8, 2007
9:33 pm
3796
Looking at the code, http://crawler.archive.org/hcc/xref/org/archive/hcc/ClusterControllerBean.html#1600, it looks like the setup of the proxy failed (line...
Michael Stack
stackarchiveorg
Feb 8, 2007
10:24 pm
3797
OK, I see the Credential-related classes. Now in the CredentialStore class there is a create() method. Can I use this method to programmatically create a...
blah_1977
Feb 9, 2007
1:50 am
3798
Does Heritrix visit a page multiple times during a crawl? If so, under what conditions? If a page X is linked to from 3 other pages, for example, does Heritrix ...
nt_bdr
Feb 9, 2007
1:52 am
3799
... to > connect to (And listening for JMX connections on the designated port?) ... #java -jar ./bin/cmdline-jmxclient-0.10.5.jar - crw2:8849 ... ...
p0ruchik
Feb 9, 2007
10:22 am
Copyright © 2009 Yahoo! Inc. All rights reserved.