archive-crawler
Messages

Messages 2612 - 2641 of 6140
2612
... Hi Gordon, In fact, what you said is happening. Only the lower level of the hierarchy (that corresponds to http://www.zzz.com.br) contains any ...
andlanna
Feb 1, 2006
1:30 pm
2613
... This should work, if there are no stray characters in any of the override segments and the filter regular expressions and settings are correct. What kind of...
Gordon Mohr (archive....
gojomo
Feb 1, 2006
11:16 pm
2614
... Ahh...interesting idea Gordon. I like it. I'm still not sure how large of a problem this will be for us, but I'll pursue something along these lines if...
Adam Fisk
afisk3
Feb 2, 2006
9:13 pm
2615
... Hi Gordon, Sorry for answering just now. Last week was so busy. I was looking for a stray character on all of the override segments and I didn't find any....
andlanna
Feb 6, 2006
12:14 pm
2616
... Again, this really should work, so if you have a compact test case you can forward to demonstrate the problem, we'll work on it as a bug. It shouldn't be...
Gordon Mohr (archive....
gojomo
Feb 6, 2006
9:54 pm
2617
Hi, I hope a generous person can point me in the right direction. I have done a number of crawls already using Heritrix. I want to know if 6KB/s is an acceptable...
alxartes
Feb 7, 2006
11:37 am
2618
I'm trying to use SurtPrefixedDecideRule to run a relatively simple crawl. Here's the XML from my order file: <newObject ...
Adam Fisk
afisk3
Feb 7, 2006
4:56 pm
2619
... 6KB/s seems low. How many threads are you running? Are they all occupied all the time? Has the crawl just started or is the 6KB/s a measure taken after...
stack
stackarchiveorg
Feb 7, 2006
5:44 pm
2620
All set -- creating a separate surt prefix file worked like a charm! -Adam ... org.archive.crawler.scope.SeedFileIterator.transform(SeedFileIterator.java:90) ...
Adam Fisk
afisk3
Feb 7, 2006
7:46 pm
2621
... The warning is harmless, generated when scanning the seeds for plain URLs. Specifying the SURT inline with the "+" notation generates the warning but also...
Gordon Mohr (archive....
gojomo
Feb 7, 2006
9:29 pm
2622
Thanks Gordon- I think I must still be missing something, though. I'm currently using the SURT: http://(com,example,www,)/ This appears in my SURT dump and...
Adam Fisk
afisk3
Feb 7, 2006
11:07 pm
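Editor's note on the SURT form quoted in message 2622: a SURT reverses the host labels so that URLs from the same domain sort and prefix-match together. The sketch below is a simplified illustration in Python, not Heritrix code. The function names to_surt_prefix and in_surt_scope are hypothetical, and the sketch ignores ports, userinfo, and query strings, which a real implementation would need to handle.

```python
from urllib.parse import urlsplit


def to_surt_prefix(url: str) -> str:
    """Rewrite a URL into a simplified SURT form (hypothetical helper).

    Host labels are reversed and comma-terminated, e.g.
    www.example.com becomes (com,example,www,).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the dot-separated labels and terminate the list with a comma.
    reversed_host = ",".join(reversed(host.split("."))) + ","
    path = parts.path or "/"
    return f"{parts.scheme}://({reversed_host}){path}"


def in_surt_scope(url: str, surt_prefix: str) -> bool:
    """Return True when the URL's SURT form starts with the given prefix."""
    return to_surt_prefix(url).startswith(surt_prefix)


if __name__ == "__main__":
    prefix = to_surt_prefix("http://www.example.com/")
    print(prefix)
    print(in_surt_scope("http://www.example.com/docs/a.html", prefix))
    print(in_surt_scope("http://other.example.com/", prefix))
```

Under these assumptions, a prefix such as http://(com,example,www,)/ covers every path on www.example.com but not sibling hosts like other.example.com.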
2623
... It could be transcluded (if it were referred to by certain short chains of hops from other URLs already in-scope), but removing the acceptIfTranscluded...
Gordon Mohr (archive....
gojomo
Feb 7, 2006
11:53 pm
2624
Is there a way to figure out the domains (or WorkQueues) which are completed (or rather, exhausted)? prasenjit...
Prasenjit Mukherjee
prasen_aol
Feb 8, 2006
5:56 am
2625
Working great now Gordon. I was confused about how the decide rules worked, thinking a reject from any rule would negate that URL. I only realized my...
Adam Fisk
afisk3
Feb 8, 2006
3:43 pm
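Editor's note on the confusion described in message 2625: as far as the editor understands Heritrix's decide-rule sequences, each rule returns ACCEPT, REJECT, or PASS, and the last rule to return something other than PASS determines the outcome, so an early REJECT can be overturned by a later ACCEPT. The sketch below illustrates that evaluation order in Python. It is not Heritrix code, and the rule functions and the decide helper are hypothetical names chosen for this example.

```python
from enum import Enum
from typing import Callable, Iterable


class Decision(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    PASS = "PASS"


# A rule maps a URL to a decision; PASS means "no opinion".
Rule = Callable[[str], Decision]


def decide(url: str, rules: Iterable[Rule]) -> Decision:
    """Apply every rule in order; the last non-PASS decision wins."""
    result = Decision.PASS
    for rule in rules:
        decision = rule(url)
        if decision is not Decision.PASS:
            result = decision
    return result


def reject_everything(url: str) -> Decision:
    """Hypothetical baseline rule: start from REJECT for every URL."""
    return Decision.REJECT


def accept_example_host(url: str) -> Decision:
    """Hypothetical rule: ACCEPT URLs on www.example.com, else no opinion."""
    if url.startswith("http://www.example.com/"):
        return Decision.ACCEPT
    return Decision.PASS


if __name__ == "__main__":
    rules = [reject_everything, accept_example_host]
    print(decide("http://www.example.com/page", rules))
    print(decide("http://other.org/", rules))
```

In this model a single REJECT does not veto a URL. Rule order matters, and a reject only sticks if no later rule accepts.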
2626
Thank you very much Stack. Usually, I only do domain crawl of one-three seeds at a time. I guess that is the real reason for the low bandwidth speed. ... know ...
alxartes
Feb 9, 2006
8:45 am
2627
... Programmatically, or as a crawl operator seeking a summary report? If the latter, there are not-yet-documented query-string parameters that may be added to...
Gordon Mohr (archive....
gojomo
Feb 9, 2006
10:55 pm
2628
Not via UI. I was trying to find out the code section where we could possibly determine that a particular WorkQueue is exhausted. I looked into...
Prasenjit Mukherjee
prasen_aol
Feb 10, 2006
7:16 am
2629
Right now Heritrix defines ${HOSTNAME} for path substitutions, but it would be useful (to me, at least) to have others. For example, if I have a separate...
Tom Emerson
tree02139
Feb 10, 2006
7:19 pm
2630
... Makes sense Tom. If you want to send over a patch, that'd be appreciated though I'd say doing it right might be a bit of work. As it is currently, the...
stack
stackarchiveorg
Feb 10, 2006
9:52 pm
2631
Look for the call to clearHeld() in WorkQueueFrontier: that's the point where the queue is no longer actively 'held' by the other queues-of-queues, because...
Gordon Mohr (archive....
gojomo
Feb 11, 2006
10:33 pm
2632
Dear All, We are taking a Digital Preservation class this semester. Part of the project is to apply Heritrix to collect documents. Our project is to preserve...
azucarmarron23
Feb 13, 2006
1:34 am
2633
This message is about Human beings, Democracy, UNHCR, Refugees, The Iraqis, Islam, Kurds, Human rights, Respect, Money, Donations, Angelina Jolie, Pavarotti,...
almostatmygoalnowwh9@...
Feb 13, 2006
6:43 am
2634
Thanks a bunch. Some sort of notification mechanism (or protected methods from where we can send notifications) would be of great help. We were trying to...
Prasenjit Mukherjee
prasen_aol
Feb 13, 2006
8:02 am
2635
There are two features I would like to see in Heritrix, and before I implement them, I would like to ask if anyone else has, and if so, if they would be...
kjerken
Feb 13, 2006
4:05 pm
2636
... Check out this feature: http://crawler.archive.org/articles/user_manual.html#credentials. Looks like the edventure site takes an HTTP POST of login...
stack
stackarchiveorg
Feb 13, 2006
4:11 pm
2637
... We've not seen hide nor hair round these parts. ... Have you considered using JMX to do this? The API allows remote submission of seeds. See ...
stack
stackarchiveorg
Feb 13, 2006
4:38 pm
2638
When I look at the Modules page of a new crawl job, I see links to ... Select Pre Processors Processors that should run before any fetching ...
kjerken
Feb 13, 2006
5:26 pm
2639
... I saw the JMX interface, but I didn't realize the crawler could be told to 'hold' at start and stop with 0 seeds. I assume that's the frontier's...
kjerken
Feb 13, 2006
5:35 pm
2640
... That's right. St.Ack...
stack
stackarchiveorg
Feb 13, 2006
6:04 pm
2641
... Interesting. Is the size on disk the concern, or the performance? If the latter, I don't know how much of a benefit marking the whole site as done would...
Gordon Mohr (archive....
gojomo
Feb 13, 2006
7:15 pm