archive-crawler
Messages 4537 - 4566 of 6140
4537
Seems like the copy on crawler.archive.org is corrupt. This is actually a paper that was presented at the 2004 International Web Archiving Workshop and is...
Kristinn Sigurðsson
kristsi25
Sep 3, 2007
11:38 am
4538
A JDBCWriter I once used to write html content into mysql database. ... package org.archive.crawler.writer; import...
ansi
mymaillist@...
Sep 6, 2007
8:28 pm
4539
Hi, The last part of my DecidingScope profiles includes a section to exclude any URIs that match any regular expressions in a list of 0 or more URIs like this...
Andrea Goethals
andrea_goethals@...
Sep 11, 2007
9:36 pm
4540
I wanted to extend QueueAssignmentPolicy, so I created my own, NicknameQueueAssignmentPolicy: /* NicknameQueueAssignmentPolicy * * $Id:...
nickzwk
Sep 12, 2007
7:23 am
4541
I've figured out some of this. As far as escaping any part of the URI to make it into a Java regular expression, it seems to work if you do or don't escape...
Andrea Goethals
andrea_goethals@...
Sep 12, 2007
9:44 pm
4542
I have a problem: if I make a new QueueAssignmentPolicy, how can I use it in Heritrix? How do I change the AbstractFrontier? My AbstractFrontier is of this kind: ...
nickzwk
Sep 13, 2007
2:05 am
4543
Keeping track of the several levels of escaping can be challenging. My main suggestion would be that even if you are composing the order.xml directly for your...
Gordon Mohr
gojomo
Sep 14, 2007
6:51 pm
4544
Hi all, I'm a newbie to Heritrix, so all the questions may have some simple answers. I want to crawl WAP sites. As an example: http://wap.gezeglen.com When...
mavci
Sep 16, 2007
2:11 am
4545
Thanks for the response. I am using heritrix 1.12.1. The problem I was seeing with the & is that the order.xml can't handle having an unescaped (& instead of...
Andrea Goethals
andrea_goethals@...
Sep 17, 2007
5:53 pm
4546
Hello, I have a problem: I failed to build Heritrix with Maven. Heritrix (selftest) runs successfully in Eclipse, but Maven failed to build the project. There...
Carolyn
chq_qing@...
Sep 18, 2007
3:25 am
4547
Check the tests that failed. Could they be failing because you are running on windows -- Heritrix is not 'officially' supported on windows -- or perhaps...
Michael Stack
stackarchiveorg
Sep 18, 2007
3:35 am
4548
Thanks for the suggestion. Yes, I'm using Heritrix 1.12.1. I just run Eclipse and run Heritrix as a Java application. When I reset the computer, I...
nick zhang
nickzwk
Sep 18, 2007
6:07 am
4549
Hi Carolyn, one of the issues seems to be the one reported over here: http://webteam.archive.org/jira/browse/HER-1221 Regards Olaf Freyer...
Aaron
pandae667
Sep 18, 2007
7:07 am
4550
Anyone have the source code for the command line jmx client (org.archive.jmx.Client)? I can't find it on sourceforge ...
acidbluebriggs
Sep 18, 2007
5:26 pm
4551
https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/cmdline-jmxclient/ St.Ack...
Michael Stack
stackarchiveorg
Sep 18, 2007
5:34 pm
4552
It's in our svn repository at https://archive-crawler.svn.sourceforge.net/svnroot/trunk/cmdline-jmxclient. -Paul...
Paul Jack
poetbeware
Sep 18, 2007
6:14 pm
4553
Yes, it's under branches/pjack_settings. It requires maven2 to build. I would characterize the code as stable but untested. We are working a few final...
Paul Jack
poetbeware
Sep 18, 2007
6:16 pm
4554
Cool thanks a lot! ... https://archive-crawler.svn.sourceforge.net/svnroot/trunk/cmdline-jmxclient. ... ...
acidbluebriggs
Sep 18, 2007
7:51 pm
4555
Nick, From your later message, I assume you succeeded in making your NicknameQueueAssignmentPolicy appear in the web UI. (FYI, it is not necessary to edit...
Gordon Mohr
gojomo
Sep 18, 2007
10:17 pm
4556
Thanks Michael Stack and Olaf Freyer. With your help, the trouble is solved: I fixed two failed test cases (org.archive.crawler.extractor.ExtractorHTMLTest and...
Carolyn
chq_qing@...
Sep 19, 2007
3:51 am
4557
Mr. Mohr, From your response, NicknameQueueAssignmentPolicy will be problematic, and I understand why the download speed is initially fast and then very slow. Your...
nick zhang
nickzwk
Sep 19, 2007
4:13 am
4558
Hiya all, When trying to crawl http://ibeatrice.blogspot.com, Heritrix 1.8 (under WCT) only collects the front page and prints the following stack trace to...
jacksonpope
Sep 19, 2007
12:43 pm
4559
Hi Jackson - I tried reproducing in 1.8.0 with a usual scope/processors setup against the given URI, and could not: the page is extracted without error and ...
Gordon Mohr
gojomo
Sep 19, 2007
7:33 pm
4560
Hiya Gordon, It turns out that I still had my investigative build of Heritrix on that machine, probing the ExtractorDOC bug I found a while ago, and it was my ...
Pope, Jackson
jacksonpope
Sep 20, 2007
7:56 am
4561
I need to do a regular expression search and replace on all uris that the crawler finds, before they are processed. I am having trouble deciding where the...
cybersammy
Sep 21, 2007
6:02 pm
4562
The Heritrix Development team will be putting out a preview release soon for Heritrix 2.x and we'd like to enlist a broad number of testers to experiment with...
Kris Carpenter
kris_carpent...
Sep 21, 2007
11:02 pm
4563
Dear Kris: I am very interested in Heritrix, and I hope I can join the test team. I have followed your project for several months, and have read the...
Carolyn
chq_qing@...
Sep 24, 2007
2:44 am
4564
Hello, I've got a question: how do I log every link that hasn't been downloaded in a crawl? The reason why is important too. I'm downloading only sites that end...
goblin_cz
Sep 24, 2007
8:43 pm
4565
Hi Adam, You can just uncomment org.archive.crawler.postprocessor.LinksScoper.level = INFO in the heritrix.properties, and set LinksScoper's override-logger to...
Igor Ranitovic
iranitovic
Sep 26, 2007
11:30 am
4566
Thanks a lot. It helps very much ;] Now I'm logging into separate file and I've set some filter (I'm experimenting with index.html/htm/php/asp) But still I...
goblin_cz
Sep 26, 2007
1:53 pm