archive-crawler
Messages

Messages 5910 - 5939 of 6142
5910
Hi, I'm using Heritrix 2.0.2 with a few custom processor classes which I use to extract relevant data from the HTML (by replacing ARCWriterProcessor with one...
Enrico Detoma
enrico.detoma@...
Jul 1, 2009
7:35 am
5911
cross-posted from [Archive-access-discuss] On Tue, 30 Jun 2009 13:47:51 -0400 Zhenzhen Xue <zjuzhenzhen@...> ... interesting. is this really necessary? ...
steve@...
stearcorg
Jul 1, 2009
6:01 pm
5912
Hi, is there any way of extracting the contents of the homepage, given a URI under that domain/host, at runtime in Heritrix? ex: URI is,...
ramab1988
Jul 2, 2009
8:35 am
5913
Hi, we have found what the problem with speed was: we had set up the MirrorWriterProcessor, which caused a lot of disk operations, which slowed the disk down and used a lot...
nukleonrus
Jul 4, 2009
7:23 am
5914
Hi, I am trying to read a compressed archive with C# and not having any luck. I tried to use the GZipStream, but it stops after reading in the first 10 lines...
nfoscarini
Jul 4, 2009
9:35 pm
5915
Hi people! I need to write a crawler that could find all the RSS links on a domain and store them in a database. However, I wouldn't like to crawl whole domains...
progre55
Jul 6, 2009
7:55 am
5916
Hello nfoscarini, Each record in a compressed [w]arc is individually gzipped. A series of gzipped chunks concatenated together is itself a valid gzipped file....
Noah Levitt
nlevitt0
Jul 6, 2009
10:08 pm
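
The point made in message 5916 can be sketched in a few lines. This is only an illustration in Python, not code from the thread: the helper name `iter_gzip_members` and the sample record contents are made up, and the "archive" is a tiny in-memory stand-in rather than a real [w]arc file. It shows that a reader which stops at the end of the first gzip member sees only the first record, while a reader that continues with the leftover bytes recovers every record.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in `data`, in order."""
    while data:
        # 31 = 16 + 15: tell zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(31)
        payload = decomp.decompress(data) + decomp.flush()
        yield payload
        # Bytes left after this member's trailer begin the next member.
        data = decomp.unused_data


# Hypothetical stand-in for a compressed archive: each record is gzipped
# on its own and the results are concatenated.
records = [b"record one\n", b"record two\n", b"record three\n"]
archive = b"".join(gzip.compress(r) for r in records)

# Reading member by member recovers the individual records.
members = list(iter_gzip_members(archive))

# Reading the concatenation as one gzip stream yields all records joined.
whole = gzip.decompress(archive)
```

A decompressor that returns after the first member would produce only `records[0]` here, which may be what a reader that "stops after the first few lines" is running into; whether that applies to a particular library has to be checked against that library's documentation.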
5917
Hi, any link / suggestion on how to embed/reuse the Heritrix engine in another Java app? I just want to call the crawler with an initial seed URL, and get HTML...
plinio.conti
Jul 7, 2009
10:07 pm
5918
Hi, I've been running a selective crawl for two days with 300 toe threads. After this time Heritrix unexpectedly crashed with this message in heritrix_out.log # #...
goblin_cz
Jul 14, 2009
9:15 am
5919
Hi people! I'm running a broad crawl, and after about 40-50 minutes getting a TMOE exception, followed by "ROS/RIS Already open for ToeThread ##" exceptions. ...
progre55
Jul 14, 2009
2:44 pm
5920
Hello, We are trying to set up dedupe in our crawls right now but we are having some issues launching the Recrawl jobs. We have no problems adding the...
Ignacio Garcia
igc.csmail@...
Jul 14, 2009
4:23 pm
5921
Hi, how can I crawl only those URLs which satisfy a regular expression? For example, I have one domain, "www.example.com". I want to crawl only those URLs which contain...
Nizam
bhavin_mca2000
Jul 14, 2009
4:23 pm
5922
Hello, I think you want to try the regular MatchesRegExpDecideRule instead of the MatchesFilePatternDecideRule which I believe looks at just the suffix of the...
Ko, Lauren
laurendko
Jul 14, 2009
4:55 pm
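
The distinction drawn in message 5922, between a rule that tests the whole URI against a regular expression and one that looks only at the suffix, can be illustrated generically. The sketch below is plain Python and does not reproduce the behavior of Heritrix's MatchesRegExpDecideRule or MatchesFilePatternDecideRule; the function names, the `news` keyword, and the sample URIs are all hypothetical.

```python
import re

# Hypothetical pattern: any URI on www.example.com whose path contains "news".
PATTERN = r"https?://www\.example\.com/.*news.*"


def matches_whole_uri(pattern: str, uri: str) -> bool:
    """True only if the entire URI matches the pattern."""
    return re.fullmatch(pattern, uri) is not None


def matches_suffix(suffix_pattern: str, uri: str) -> bool:
    """True if the URI merely ends with something matching the pattern."""
    return re.search(r"(?:" + suffix_pattern + r")$", uri) is not None


uris = [
    "http://www.example.com/news/item1.html",
    "http://www.example.com/about.html",
    "http://www.example.com/news/",
]

# Whole-URI matching selects by content anywhere in the URI.
accepted_by_regex = [u for u in uris if matches_whole_uri(PATTERN, u)]

# Suffix matching can only select by how the URI ends, e.g. a file extension.
accepted_by_suffix = [u for u in uris if matches_suffix(r"\.html", u)]
```

With a whole-URI test, the pattern has to account for everything before and after the part of interest (hence the leading and trailing `.*`), whereas a suffix test cannot express "contains" conditions at all.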
5923
At Tue, 14 Jul 2009 09:14:24 -0000, ... See: <http://webarchive.jira.com/browse/HER-1126> Your java version is somewhat out of date, it seems that upgrading to...
Erik Hetzner
e_hetzner
Jul 14, 2009
7:32 pm
5924
... This looks like the same issue as discussed in this previous message: http://tech.groups.yahoo.com/group/archive-crawler/message/5859 Increasing the...
Gordon Mohr
gojomo
Jul 14, 2009
8:47 pm
5925
That's generally the right approach. Can you find more of the exception stack, perhaps in heritrix_out.log? (The 'IllegalStateException' and anything that...
Gordon Mohr
gojomo
Jul 14, 2009
9:11 pm
5926
... You can see an example of a ReplayInputStream ('RIS') acquired and then properly closed in ARCWriterProcessor.innerProcessResult(). It's acquired via... ...
Gordon Mohr
gojomo
Jul 14, 2009
9:26 pm
5927
Hello Gordon, Here's the complete error log from heritrix_out.log: 2009-07-13 20:33:02.524::WARN: /crawler_area/do_launch.jsp: ...
Ignacio Garcia
igc.csmail@...
Jul 15, 2009
12:37 pm
5928
Hi Gordon! Thanks a lot for the attention and suggestions! Well, I have checked the whole code, and none of the custom processors (I have only 2 of them) use...
progre55
Jul 15, 2009
3:07 pm
5929
Gordon, you are the best! :) Today I have seen updates on svn, especially those for closing ReplayInputStream and the closeQuietly() utility. After an update,...
progre55
Jul 16, 2009
8:32 am
5930
You're welcome; I wasn't sure the changes would make a difference, good to hear they have. (It's probably only the change in RecordingOutputStream that's...
Gordon Mohr
gojomo
Jul 16, 2009
7:42 pm
5931
Hi, I have been trying to build the 1.14.3 source code taken from the SourceForge page for the past 2 days without any luck. Something or the other comes up with...
precious_vastu
Jul 17, 2009
8:33 am
5932
I've never used your crawler, but I used to write my own simple solutions in Perl for crawling and scraping pages. I would like to ask you for a comment on the...
dzieciou
Jul 17, 2009
3:41 pm
5933
At Fri, 17 Jul 2009 04:23:16 -0000, ... Hi Utsav - Your message highlights one of the (408) warnings that the build produced, but not an error. Do you know...
Erik Hetzner
e_hetzner
Jul 17, 2009
5:03 pm
5934
Hi Erik, The error is in the import sun.net.www.protocol.fileUrlConnection. It says the sun.net cannot be resolved. So I commented out the portion using ...
Utsav Saraf
precious_vastu
Jul 18, 2009
2:40 am
5935
At Sat, 18 Jul 2009 10:40:10 +0800, ... Hi Utsav - Heritrix 1.14.3 uses Maven 1.0.2 so be sure that you are using that version of Maven. The sun.net code is...
Erik Hetzner
e_hetzner
Jul 18, 2009
4:11 am
5936
Hi Erik, Firstly, thanks for taking time out to look into this. I have removed Maven 1.1 and installed 1.0.2 from the archive. I am using Ubuntu and JDK 1.5 ...
Utsav Saraf
precious_vastu
Jul 18, 2009
4:15 am
5937
At Sat, 18 Jul 2009 12:15:47 +0800, ... Hi Utsav - I am happy to help. There is some information here ...
Erik Hetzner
e_hetzner
Jul 18, 2009
4:40 am
5938
The use case you describe is not web crawling, it is just downloading a set of files. While Heritrix could be (using some custom bean shell scripts) configured...
Kristinn Sigurdsson
kristsi25
Jul 20, 2009
5:22 pm
5939
Thanks a lot Erik, that worked. Also, I reinstalled Java, and the advice at the link you pointed to only worked after that. Now I want to add JavaScript...
precious_vastu
Jul 21, 2009
3:17 am