archive-crawler
Messages
Messages 5626 - 5655 of 6142
5626
Hello all, I hope I am not missing something obvious. I am using Heritrix 2.0.2 and have successfully installed/run simple crawls. Now, I only want to save...
peterlikarish
Jan 5, 2009
5:39 am
5627
I recommend using 1.14.x unless you specifically need 2.0 features. The documentation (both in the official user manual and various notes/threads) is better;...
Gordon Mohr
gojomo
Jan 5, 2009
9:52 pm
5628
Hi - We are beginning to look at implementing duplicate reduction in our crawling. I am trying to get my head around the various features available for...
Erik Hetzner
e_hetzner
Jan 8, 2009
12:26 am
5629
Can someone tell me how I can append jobs (orders) and then execute those orders via the command line (using Heritrix params)? I can successfully run 1 (one) order...
meta_tag_http
Jan 13, 2009
12:24 am
5630
Folks, I'm having trouble building heritrix from 1.14.1 and 2 sources. Part of the problem is my unfamiliarity with maven, jelly, qdox, etc. The problems are...
stevenkhanderson
stevenkhande...
Jan 13, 2009
5:03 pm
5631
I don't know of a formal comparison, or of any other systems in use with Heritrix-centric workflows. The WARC writing of recent Heritrix releases already...
Gordon Mohr
gojomo
Jan 13, 2009
11:48 pm
5632
Hello Steve, When you say "heritrix-1.14.1 and 2", I'm not sure if you mean heritrix-1.14.2 or heritrix 2.0, but anyway the latest version on the 1.x line is...
Noah Levitt
nlevitt0
Jan 15, 2009
3:40 am
5633
Noah, Thanks -- that worked. I think I had the wrong qdox-current.jar before. Since your message, I tried from scratch, and apparently the old maven doesn't...
stevenkhanderson
stevenkhande...
Jan 15, 2009
7:07 pm
5634
Steve, that's an interesting idea. I filed http://webteam.archive.org/jira/browse/HER-1591. Feel free to add yourself as a watcher on that issue if you like. A...
Noah Levitt
nlevitt0
Jan 15, 2009
7:46 pm
5635
I have a couple of questions about Heritrix. Does it support POST? For example, I want to crawl sites that have dropdown boxes and submit buttons. I'd like to...
Greg Allen
sitecrawl
Jan 16, 2009
5:01 pm
5636
I would like to configure Heritrix to not even look for a robots.txt. I have permission for the sites I am crawling to ignore the robots.txt which I have done,...
mjjjhjemj
Jan 16, 2009
5:04 pm
5637
Hello Greg, Heritrix doesn't support POST, nor does it support the kind of link extraction you describe. The core reason not to support POST is that...
Noah Levitt
nlevitt0
Jan 16, 2009
11:32 pm
5638
Hi list, I'm running heritrix-2.0.2. I'm having trouble restarting heritrix. I stopped it with the kill command (with the default TERM signal). Now when I want...
sylvainaparis
Jan 20, 2009
2:55 pm
5639
The problem isn't with your shutdown; rather there's a bug preventing settings of NotMatchesListRegExpDecideRule from being recognized. (I presume you added...
Gordon Mohr
gojomo
Jan 20, 2009
10:16 pm
5640
I am a newbie to Heritrix... Someone set it up for me a while ago, and I just found this problem a couple of days ago and couldn't find a solution. Heritrix...
roy0325
Jan 21, 2009
4:38 pm
5641
... Could someone please help me on this? Thanks, Mike...
mjjjhjemj
Jan 21, 2009
7:36 pm
5642
... There is no version 2.1. From the screenshot you'd posted on a JIRA issue, it looks like one of the 1.X versions... perhaps an outdated version, because...
Gordon Mohr
gojomo
Jan 21, 2009
7:42 pm
5643
Hi, Thanks for your answer. ... In fact the option is present in the web UI after a first crawl ... out -- OK. Are the ARC file formats identical between 1.14...
syl20a
Jan 21, 2009
8:57 pm
5644
... Hmm, did you initially add the rule via the web UI? ... Yes, both the ARC and WARC formats should be identical between 1.14.2 and 2.0.2. - Gordon @ IA...
Gordon Mohr
gojomo
Jan 21, 2009
9:05 pm
5645
At Tue, 13 Jan 2009 15:48:09 -0800, ... Hi Gordon - Many thanks for your response. I have been making use of the PersistLogProcessors and will be trying out...
Erik Hetzner
e_hetzner
Jan 21, 2009
10:06 pm
5646
... I have to launch a first crawl with a sheet without this notmatchregexrule. Then when I go back in the sheet editor in the WebUI, the option is available. ...
syl20a
Jan 21, 2009
10:57 pm
5647
Because the Heritrix WebUI runs over HTTP instead of HTTPS, our institution is requiring us to use X server apps (Xming) on our PCs for access, which in turn has a...
grahlaura
Jan 23, 2009
2:45 pm
5648
... It's probably easiest to wrap it with an https proxy, then hopefully the code for the web UI won't need to change. I once did this for another app using...
Brendan O'Connor
brenocon@...
Jan 23, 2009
7:29 pm
5649
This was very helpful, thanks. For future record, I ended up adding MatchesFilePatternDecideRules using the use-preset-pattern for audio, video and images and...
peterlikarish
Jan 23, 2009
8:23 pm
5650
HTTPS support is coming in a future heritrix-2 release; an initial implementation exists in our heritrix2 source tree. The related issue is: ...
Gordon Mohr
gojomo
Jan 23, 2009
11:30 pm
5651
There's currently no way to tell Heritrix not to request a robots.txt, and changing that would probably require custom coding. We are unlikely to make the...
Gordon Mohr
gojomo
Jan 24, 2009
9:24 pm
5652
Hello, I have a question. How much memory is needed for one crawl job? My crawl stopped with an OutOfMemoryError, and the Web UI is not working well. I created 4-5 jobs based on...
takeru sasaki
sasaki.takeru@...
Jan 27, 2009
10:40 am
5653
... The memory requirements depend on your crawl parameters, especially the number of ToeThreads configured. You should be safe with the default configuration...
Gordon Mohr
gojomo
Jan 28, 2009
5:36 am
5654
Thank you for your help. The default setting is -Xmx256m, which I understand is meant for a single crawl. I will try 256 * (number of simultaneous crawls) MB of memory. Thank you very much! And...
takeru sasaki
sasaki.takeru@...
Jan 28, 2009
6:12 am
5655
... That may not help, unless you also change the BDB cache-percent setting. Each crawl's database environment will grow, as long as the crawl is still...
Gordon Mohr
gojomo
Jan 28, 2009
7:01 am
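
For readers following the memory thread (5652 to 5655), here is a minimal sketch of the arithmetic proposed in message 5654: multiply the default 256 MB heap by the number of crawls running at the same time. The function names are hypothetical and only illustrate the calculation. Note that message 5655 cautions this may not help unless the BDB cache-percent setting is also changed, so treat the result as a naive starting point, not a recommendation.

```python
def naive_heap_mb(concurrent_crawls: int, per_crawl_mb: int = 256) -> int:
    """Naive heap estimate from message 5654: per-crawl heap times crawl count.

    The 256 MB default comes from the -Xmx256m setting quoted in the thread.
    This ignores the BDB cache-percent caveat raised in message 5655.
    """
    if concurrent_crawls < 1:
        raise ValueError("concurrent_crawls must be at least 1")
    return concurrent_crawls * per_crawl_mb


def jvm_heap_flag(concurrent_crawls: int, per_crawl_mb: int = 256) -> str:
    """Format the estimate as a standard JVM maximum-heap option (-Xmx<N>m)."""
    return f"-Xmx{naive_heap_mb(concurrent_crawls, per_crawl_mb)}m"


if __name__ == "__main__":
    # The poster in 5652 mentions 4-5 jobs; show both cases.
    for crawls in (4, 5):
        print(crawls, "crawls:", jvm_heap_flag(crawls))
```

How the resulting flag is passed to the crawler's JVM depends on the local launch setup, which the thread excerpts do not show.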