archive-crawler
Messages 2060 - 2089 of 6142
2060
Why are URLs like http://www.nettidende.dk/txt/Statstidende/search_showdoc.phtml?agid=10&date=30.07.2005&rsid=112290089225590&sl=2&adid=23548288 considered...
Bjarne Andersen
bjarne_dk2000
Aug 1, 2005
12:57 pm
2061
Sorry, my mistake! Those URLs were the referring ones; the invalid ones were all 'invalid:' in the crawl.log. Best, Bjarne Andersen ...
Bjarne Andersen
bjarne_dk2000
Aug 1, 2005
1:24 pm
2062
Using Head: 1.5.0-200507251046 ToeThreadReport (slightly edited to just show times): Check Thread #5 and #6, active for + 6m and + 3m. These times don't seem...
ryanatl
Aug 1, 2005
3:50 pm
2063
Followup: This seems to be progressive. I launched a new crawl (same order.xml, different seeds), and while I'm seeing LinkScoper processors running for 7-10...
Ryan Gran
ryanatl
Aug 1, 2005
4:23 pm
2064
Dear All, I've been looking into the User Manual, but I was not able to find out how to do the following: I would like to do a crawl in which I do not download...
Marco Baroni
kumaraja2000
Aug 1, 2005
5:42 pm
2065
Marco, You can use the DomainSensitiveFrontier and set the "max-docs" parameter in the Frontier settings to the maximum number from each website. Rob. ... -- ...
Rob Eger
robeger
Aug 1, 2005
6:07 pm
2066
Another way to do this is to use per-host overrides (Settings tab) and the budgeting feature with the default frontier (BdbFrontier). Take a look at...
Igor Ranitovic
iranitovic
Aug 1, 2005
6:38 pm
2067
Two things that first come to mind: (1) Do the URIs they're working on have a gigantic number of outlinks? [*] (2) What Scope are you using? The legacy scopes...
Gordon Mohr
gojomo
Aug 1, 2005
7:02 pm
2068
... [*] Not a potential cause of the slowdown, but related: A new feature in CVS HEAD sets a cap on the number of links that can be found in any page, as a...
Gordon Mohr
gojomo
Aug 1, 2005
7:12 pm
2069
... Hey Bjarne: The 'invalid:' scheme is added to URLs when the deserialization of CrawlURIs fails. It shouldn't ever happen but we see it from time to time...
stack
stackarchiveorg
Aug 1, 2005
10:04 pm
2070
I crawled with heritrix-1.5.0-200506151248, so that's without the fix, because that's the version we currently use in our production system - to...
Bjarne Andersen
bjarne_dk2000
Aug 2, 2005
5:31 am
2071
Our experience with the DomainSensitiveFrontier is that it's not suited for large seed lists (e.g. 10,000 seeds, which is our default crawl size) - things slow...
Bjarne Andersen
bjarne_dk2000
Aug 2, 2005
5:41 am
2072
Hi members, I came across a problem running Heritrix on Windows XP: it threw an exception in the terminal, "java.io.IOException: Cannot find subdir: conf". I just...
Subramanya C R
subramanyacr
Aug 2, 2005
8:10 am
2073
Thanks a lot to all those who provided advice! ... We also start from about 10,000 seeds, and thus we will first experiment with the...
Marco Baroni
kumaraja2000
Aug 2, 2005
8:44 am
2074
Would it be possible to put in new QueueAssignmentPolicies without having to alter: private final static String [] AVAILABLE_QUEUE_ASSIGNMENT_POLICIES in...
Bjarne Andersen
bjarne_dk2000
Aug 2, 2005
1:18 pm
2075
Thank you for the info below. I've added a pointer to your list note to our FAQs. Yours, St.Ack...
stack
stackarchiveorg
Aug 2, 2005
4:59 pm
2076
... Yes. How about adding a line in heritrix.properties that looked like this: org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy =...
stackarchiveorg
Aug 2, 2005
5:25 pm
2077
heritrix.properties sounds good to me. Should that file just be on the classpath somewhere? Best, Bjarne Andersen ... -- Bjarne Andersen, IT developer ...
Bjarne Andersen
bjarne_dk2000
Aug 3, 2005
6:06 am
2078
Hi, I am new to this and I want to write code to, let's say, crawl a couple of web sites. Can anyone suggest where I can get some sample code? I went through the...
itsmylifem
Aug 3, 2005
10:17 am
2079
To simply crawl a couple of websites you do not need to write code at all. Simply follow the first few steps in the user manual: ...
Bjarne Andersen
bjarne_dk2000
Aug 3, 2005
10:43 am
2080
Thanks for sharing this. I gave up a while back because I was having so many problems, but I want to try again soon. If you have any other stuff to share in ...
Kent Gibson
kentgibson
Aug 3, 2005
9:13 pm
2081
... Let me have a go at this in the next day or so. ... There is one built into the jar but if we find one on filesystem, then we use this one instead. St.Ack...
stack
stackarchiveorg
Aug 4, 2005
1:47 am
2082
Yeah, that is OK, but I want to do this through Java code. Any idea how I can do this? Regards, Madhur ... at all. ... if ... set ... your seeds ... couple of ......
itsmylifem
Aug 4, 2005
3:20 am
2083
... Not really, I examined a dozen or so URLs in the list. A few pages had a lot of links (maybe 1200-1500 at most), but most were normal pages. ... Since I...
Ryan Gran
ryanatl
Aug 4, 2005
1:01 pm
2084
... See http://crawler.archive.org/faq.html#embedding for a start. Thereafter, study Heritrix.java class for how it adds jobs. Or there is a JMX interface to...
stack
stackarchiveorg
Aug 4, 2005
3:44 pm
2085
Hi, One more addition to the "Windows XP" problem: "heritrix-1.4.0.zip" doesn't contain the default profile. I tried it on Linux (with the same zip file), and it works...
Subramanya C R
subramanyacr
Aug 5, 2005
6:08 am
2086
I took OS/X and Java 1.4.2 out of the picture, by starting up a crawl on a RedHat Linux machine with Java 1.5. I noticed Java 1.5 provides toe thread stack...
ryanatl
Aug 8, 2005
3:58 pm
2087
Hello fellow crawlers, At Luxembourg's national library we're taking our first steps in crawling. I'm looking at this myself at the moment, I've managed to...
Charles Foetz
charel95
Aug 8, 2005
4:33 pm
2088
... This sounds like this issue: http://crawler.archive.org/faq.html#windowsstart. Do you have your own wrapper script starting up Heritrix or are you using...
stack
stackarchiveorg
Aug 8, 2005
6:25 pm
2089
... Yes. It's kinda sweet. Only available on 1.5.x JVM though (Feature courtesy of Christian Kohlschütter). ... expungeStaleEntries is called every time we...
stack
stackarchiveorg
Aug 8, 2005
6:44 pm