archive-crawler
Messages
Messages 5593 - 5622 of 6142
5593
Hi, the Czech crawl again :). I started with the default profile, set some specific rules (100MB limit etc.), and ran the crawl again. You can find the...
goblin_cz
Dec 3, 2008
10:49 pm
5594
Hello Adam, thanks for the report. I can't claim to know anything about the more serious second issue. But the first issue appears to be ...
Noah Levitt
nlevitt0
Dec 3, 2008
11:12 pm
5595
Re: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE As Noah notes, this is a known issue which should be fixed in a future 2.x release. In...
Gordon Mohr
gojomo
Dec 5, 2008
12:30 am
5596
Hi to all, Is there a way to fetch YouTube videos with Heritrix (1.14.x)? We have been trying to download selected YouTube videos in our thematic crawls but...
Tomas Ukkonen
tomas.ukkonen@...
Dec 8, 2008
6:50 pm
5597
Hi, we have a custom profile to crawl a small subset of pages from each site in a seed file, and we wanted to limit our crawler such that it only download...
alihoaliho
Dec 9, 2008
3:12 pm
5598
As also mentioned in the referenced message, your "htmlContentTypeFilter" makes no sense in a scope -- there's no content-type yet to compare. In the case of...
Gordon Mohr
gojomo
Dec 9, 2008
6:44 pm
5599
I found a thread here that describes how you can use the JMX Client to have an url retried (How to add a URL into the retry list?) excerpt describing JMX...
mjjjhjemj
Dec 9, 2008
11:25 pm
5600
Hi all, referring to the post with subject "Broad-scope 10M seeds Xmx6G 64-Bit JVM: OOME: GC overhead limit exceeded" and the statement that the OOME exception...
Juergen Umbrich
juergen@...
Dec 10, 2008
12:39 am
5601
If my guesswork on the previous post was correct, it was the requests to display seed reports (via the web UI) that created the problem -- not the mere...
Gordon Mohr
gojomo
Dec 10, 2008
5:10 am
5602
Hi, I am using version 2.0.2. I am viewing crawl.log by RegExp. I enter this: http://.*¥.html but the text field value was changed to: http://.*?.html Can I use...
takeru sasaki
sasaki.takeru@...
Dec 11, 2008
3:56 pm
5603
This is probably a web UI encoding issue we could fix, BUT... I don't think any '¥' characters, exactly as such, will be found in the crawl.log. Instead, it...
Gordon Mohr
gojomo
Dec 11, 2008
7:56 pm
5604
Thank you for your help. I want to escape "." (dot), not "¥" (backslash), and other regex metacharacters, such as ".()[]?". I will debug and build heritrix...
takeru sasaki
sasaki.takeru@...
Dec 12, 2008
2:05 am
5605
I'm sorry I didn't understand. Heritrix uses the Java regex syntax, as described at: http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html So, a...
Gordon Mohr
gojomo
Dec 12, 2008
2:46 am
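The preview above is truncated, so the following is only a minimal sketch of the point it makes about Java regex syntax, not a reproduction of the reply. It shows how an unescaped dot differs from an escaped one in a pattern like the one discussed in this thread. The class name `RegexEscapeDemo`, its helper methods, and the `example.com` URIs are hypothetical; only `java.util.regex.Pattern` and its `compile`, `matcher`, and `quote` methods are standard Java API.

```java
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    // Unescaped dot: '.' matches any single character before "html".
    static final Pattern LOOSE = Pattern.compile("http://.*.html");

    // Escaped dot: "\\." in Java source is the regex \. (a literal dot).
    // When typing directly into a form field, a single backslash is used.
    static final Pattern STRICT = Pattern.compile("http://.*\\.html");

    static boolean matchesLoose(String uri) {
        return LOOSE.matcher(uri).matches();
    }

    static boolean matchesStrict(String uri) {
        return STRICT.matcher(uri).matches();
    }

    // Pattern.quote escapes every metacharacter in a literal string at once.
    static boolean matchesLiteral(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).matches();
    }

    public static void main(String[] args) {
        String withDot = "http://example.com/index.html";    // hypothetical URI
        String withoutDot = "http://example.com/index-html"; // hypothetical URI

        System.out.println(matchesLoose(withDot));
        System.out.println(matchesLoose(withoutDot));
        System.out.println(matchesStrict(withDot));
        System.out.println(matchesStrict(withoutDot));
        System.out.println(matchesLiteral(".()[]?", ".()[]?"));
    }
}
```

The loose pattern accepts both URIs because its second dot stands for any character, while the strict pattern accepts only the URI that really contains ".html".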
5606
Thank you. You are right!! I am using Mac OS X 10.4 and Firefox 3 in Japanese. I enter backslash "\" into the editor (CarbonEmacs), copy it, and paste it in Firefox....
takeru sasaki
sasaki.takeru@...
Dec 12, 2008
3:20 am
5607
Hi Tomas, Here is a script that can be used to download videos with Heritrix. Maybe this will help you? It is now on the Heritrix wiki: ...
adam.taylor78
Dec 16, 2008
5:16 pm
5608
Hi! I have a problem with the NotMatchesListRegExpDecideRule. My aim is to crawl the following sites: http://tennis.fr/outils http://tennis.fr/breves ...
jpeimecke
Dec 17, 2008
6:07 pm
5609
Your rule can only REJECT not-matching URIs; whether any URIs are ACCEPTed depends on the other rules. What are your other rules? - Gordon @ IA...
Gordon Mohr
gojomo
Dec 17, 2008
7:39 pm
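The reply above can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not Heritrix code: it assumes that rules in a scope are evaluated in order and that the last decision other than PASS wins. The class `DecideSketch`, the `Decision` enum, and the helpers `rejectUnlessMatches` and `acceptAll` are all hypothetical names introduced for illustration; the regex is likewise an assumed stand-in for the poster's list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

public class DecideSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    // Assumed semantics: apply each rule in order; the last non-PASS answer wins.
    static Decision decide(List<Function<String, Decision>> rules, String uri) {
        Decision result = Decision.PASS;
        for (Function<String, Decision> rule : rules) {
            Decision d = rule.apply(uri);
            if (d != Decision.PASS) {
                result = d;
            }
        }
        return result;
    }

    // Hypothetical stand-in for a "not matches" rule: it can only REJECT.
    // URIs that do match are passed through without being accepted.
    static Function<String, Decision> rejectUnlessMatches(String regex) {
        Pattern p = Pattern.compile(regex);
        return uri -> p.matcher(uri).matches() ? Decision.PASS : Decision.REJECT;
    }

    // Hypothetical earlier rule that accepts everything.
    static Function<String, Decision> acceptAll() {
        return uri -> Decision.ACCEPT;
    }

    public static void main(String[] args) {
        String wanted = "http://tennis.fr/outils";
        String unwanted = "http://tennis.fr/other"; // hypothetical URI
        String regex = "http://tennis\\.fr/(outils|breves).*"; // assumed pattern

        List<Function<String, Decision>> alone =
                Arrays.asList(rejectUnlessMatches(regex));
        List<Function<String, Decision>> combined =
                Arrays.asList(acceptAll(), rejectUnlessMatches(regex));

        System.out.println(decide(alone, wanted));      // never ACCEPTed on its own
        System.out.println(decide(combined, wanted));
        System.out.println(decide(combined, unwanted));
    }
}
```

In this model the reject-only rule never produces ACCEPT by itself, which is why the outcome depends on what the other rules in the list decide.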
5610
Thanks for your answer. Here are my DecideRules: global root:scope:rules list org.archive.modules.deciderules.DecideRule global root:scope:rules:0 object ...
jpeimecke
Dec 18, 2008
9:04 am
5611
Thanks. Your rule #8, NotOnDomainsDecideRule, may be unnecessary. Based on the earlier rules, only on-domain and inline-linked URIs will have been ACCEPTed...
Gordon Mohr
gojomo
Dec 18, 2008
9:32 am
5612
It works fine. Thank you very much. I have seen that if I add a DecideRule, I must run a first crawl in order for the settings to appear. Is that normal?...
jpeimecke
Dec 18, 2008
11:01 am
5613
... I'm sorry, I don't understand the question. I would say that when building a custom scope with new DecideRules, it is good to work incrementally, only...
Gordon Mohr
gojomo
Dec 18, 2008
6:48 pm
5614
Hello! I am trying Wayback. http://archive-access.sourceforge.net/projects/wayback/ I have a question. My Wayback instance has many HTML pages. If image-file...
takeru sasaki
sasaki.takeru@...
Dec 19, 2008
5:04 am
5615
Hi All, I am a newbie to Heritrix. I checked out the latest stable version of Heritrix 2 and tried to debug the crawl process step by step... My Attempts &...
ckannanck
Dec 27, 2008
1:30 am
5616
... For a beginner, the 1.14.x code could be a better place to start -- the documentation is better, and there are still significant UI/configuration changes...
Gordon Mohr
gojomo
Dec 27, 2008
5:53 am
5617
Thanks a lot!!! That helps. I am looking for continuous crawling functionality, which I believe is being developed on 2.x (correct me if I am wrong) ...I...
ckannanck
Dec 28, 2008
8:38 am
5618
Hi, I am new to Heritrix and am trying to run a sample job, but I am facing a problem. I have configured "max-toe-thread" to 100 but it still never starts all threads....
pandya.bhavin@...
pandya.bhavi...
Dec 29, 2008
9:26 am
5619
... The plan is for continuous and adaptive crawling to be implemented by the 2.4 release, in early 2009. That means within operator-set limits and...
Gordon Mohr
gojomo
Dec 30, 2008
7:01 am
5620
All threads will only be used if there are many separate sites to crawl. Heritrix will only fetch a single URI from a site at a time, and will pause between...
Gordon Mohr
gojomo
Dec 30, 2008
7:07 am
5621
Hi, Thanks for the reply... I tried with a large set of websites and it's working as expected. Thanks. Bhavin ... From: Gordon Mohr <gojomo@...> ...
Bhavin Pandya
pandya.bhavi...
Dec 30, 2008
8:50 am
5622
Hi, I have started using Heritrix on a single machine. I was wondering what the different ways are to achieve distributed crawling using Heritrix. I can...
pandya.bhavin@...
pandya.bhavi...
Dec 30, 2008
12:38 pm