archive-crawler
Messages
Messages 1711 - 1740 of 6140   Oldest  |  < Older  |  Newer >  |  Newest
1711
Respected sir, Our problem is this. We need to create one TXT file that contains only the crawled URIs. There are lots of things in the crawl.log file, likewise ...
chirag chauhan
chirag_299
Apr 4, 2005
5:59 am
1712
Note: forwarded message attached. Respected sir, Our problem is this. We need to create one TXT file...
chirag chauhan
chirag_299
Apr 4, 2005
9:06 am
1713
... There's a bunch of ways to do this. One way would be to register a new logger -- study how loggers are registered here ...
stackarchiveorg
Apr 4, 2005
6:12 pm
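For the question in messages 1711-1713, an alternative to registering a new logger is to post-process crawl.log after the crawl. The sketch below is not from the thread. It assumes each log line is whitespace-separated and that the URI sits in the fourth column; check this against your own crawl.log before relying on it. The function names and the `uri_field` parameter are hypothetical.

```python
def extract_uris(log_lines, uri_field=3):
    """Yield the URI column from crawl.log-style lines.

    Assumption: fields are whitespace-separated and the URI is at
    zero-based index `uri_field` (the fourth column by default).
    Lines with too few fields are skipped.
    """
    for line in log_lines:
        fields = line.split()
        if len(fields) > uri_field:
            yield fields[uri_field]


def write_uri_file(log_path, out_path):
    """Copy only the URI column of `log_path` into `out_path`, one per line.

    Returns the number of URIs written.
    """
    count = 0
    with open(log_path, encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for uri in extract_uris(src):
            dst.write(uri + "\n")
            count += 1
    return count
```

Calling `write_uri_file("crawl.log", "uris.txt")` would produce the plain list of URIs; filtering by status code could be added by inspecting another column in the same loop.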
1714
Hi Guys, I want to use Heritrix to crawl google search results and harvest the URLs, these URLs are then going to be put into a directory as in a listing. So...
David Cullen
mrdavidcullen@...
Apr 11, 2005
3:11 pm
1715
... Look at the crawl.log. This will tell you if you are crawling or not. Are you only fetching the Google robots.txt and then it goes no further? Google bars...
stack
stackarchiveorg
Apr 11, 2005
3:45 pm
1716
Hi Folks, I am crawling a site and the crawl job has been running for 54 hrs and is at 99%. Normally it runs about 24 hrs, as in the last couple of crawls. I am pretty...
Mr. J
bighead007us
Apr 11, 2005
8:15 pm
1717
I just updated my sources from HEAD and get the following failure in the unit tests: [junit] Running org.archive.crawler.frontier.AdaptiveRevisitHostQueueTest ...
Tom Emerson
tree02139
Apr 11, 2005
8:23 pm
1718
... I suggest first pausing the crawl, then using the Frontier report or 'view or edit frontier URIs' option (which only appears in a paused crawl) to verify...
Gordon Mohr (Internet...
gojomo
Apr 11, 2005
8:36 pm
1719
... Thanks, works like a charm....
Mr. J
bighead007us
Apr 11, 2005
8:56 pm
1720
... I just did a fresh checkout and all built fine (Did you do a 'cvs update -Pd' to get all newly added items?). St.Ack...
stack
stackarchiveorg
Apr 11, 2005
9:56 pm
1721
Hi all, I received a couple of NPEs in the new UbiCrawler-integrated code. The crawl continues, apparently just fine. But I thought I'd pass one along: Title:...
Tom Emerson
tree02139
Apr 12, 2005
3:58 pm
1722
Hi, I've found that for some reason or another some webmasters want to allow robots to access scripts, but not with certain parameters, i.e., in...
ogrenholm
Apr 13, 2005
12:22 pm
1723
... Looks like an improvement to me. I'll apply it. Thanks, good catch! More comment: The various robots.txt specifications all use the term 'path' to describe...
Gordon Mohr (Internet...
gojomo
Apr 13, 2005
7:17 pm
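The snippets for messages 1722-1723 are truncated, so the exact patch is not visible here. As a rough illustration of the idea they appear to discuss (a Disallow rule that blocks a script only for certain parameters), the sketch below applies prefix matching to the path plus the query string rather than to the path alone. This is an assumption about the technique and is not Heritrix's implementation; the function names are hypothetical.

```python
from urllib.parse import urlsplit


def robots_target(url):
    """Return the part of the URL compared against Disallow rules.

    Assumption: the query string is included, so rules such as
    '/script.cgi?action=edit' can match.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")


def is_disallowed(url, disallow_prefixes):
    """True if any non-empty Disallow prefix matches the URL's path and query.

    An empty Disallow value conventionally means 'allow everything',
    so empty prefixes are ignored.
    """
    target = robots_target(url)
    return any(p and target.startswith(p) for p in disallow_prefixes)
```

With the rule `/script.cgi?action=edit`, a URL ending in `/script.cgi?action=edit` is blocked while `/script.cgi?action=view` is still allowed; matching on the path alone could not tell the two apart.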
1724
Hi, I ran Heritrix 1.2.0 from the command line on a Debian machine to do a broadscope crawl on a set of 200 seed URLs. I used 15 threads and allocated 1280 Megs of...
Saurabh Pathak
sau_pathak
Apr 14, 2005
7:24 pm
1725
Broad-scope memory issues are a known problem. First off, see http://crawler.archive.org/faq.html#oome_broadcrawl Secondly, I'd grab a 1.3 snapshot and try...
Tom Emerson
tree02139
Apr 14, 2005
7:34 pm
1726
What is the reason that the number of writers in the ARC writer pool cannot be increased mid-crawl? I increased the number of toe-threads from 50 to 90, and all 90...
Tom Emerson
tree02139
Apr 14, 2005
8:03 pm
1727
... There's an old RFE asking for this feature: http://sourceforge.net/tracker/index.php?func=detail&aid=919727&group_id=73833&atid=539102. It's of priority 3....
stack
stackarchiveorg
Apr 14, 2005
8:23 pm
1728
... This is interesting reading, thanks. Could I ask you to elaborate why increasing the number of writers seemed to lower throughput? -- Tom Emerson...
Tom Emerson
tree02139
Apr 14, 2005
8:31 pm
1729
Hi Tom, That is pretty good! When you get a chance could you please check how many queues (hosts) are part of the crawl? Thanks. i....
Igor Ranitovic
iranitovic
Apr 14, 2005
9:15 pm
1730
... Embarrassingly, we didn't spend enough time looking. More writers would mean more contention for the disk head possibly holding threads in the write step...
stack
stackarchiveorg
Apr 14, 2005
9:41 pm
1731
... That would make sense. Now that you can specify multiple ARC directories, you could spread writers across multiple drives. One thing that I haven't done in...
Tom Emerson
tree02139
Apr 15, 2005
12:22 am
1732
... Current status: Run time: 175h., 7 min. and 25 sec. Processed docs/sec: 9.57 (6.74) KB/sec: 175 (107) Active Thread Count: 150 of 150 Total data received:...
Tom Emerson
tree02139
Apr 15, 2005
12:31 am
1733
... Paste it into a message Tom and send it over. ConcurrentModificationExceptions trying to view Frontier report are a known issue but you shouldn't be...
stack
stackarchiveorg
Apr 15, 2005
12:42 am
1734
... [...] That's the one. The crawl continues, so no worries. -- Tom Emerson Basis Technology Corp. Software Architect...
Tom Emerson
tree02139
Apr 15, 2005
12:47 am
1735
Thanks for your reply, Tom. I tried running a broadscope crawl using the current 1.3 snapshot of the CVS head. I also used BdbFrontier with 50 threads and...
Saurabh Pathak
sau_pathak
Apr 15, 2005
4:50 am
1736
Once an OutOfMemoryError is hit, a crawl attempts to pause, but will often be unable to proceed in any meaningful way, depending on what operation was ...
Gordon Mohr (@Interne...
gojomo
Apr 15, 2005
3:59 pm
1737
Also, was the crawl job that OOME'd started after another job had run w/o a restart of Heritrix between the running of the jobs? Is this a case of '[ 1123230...
stack
stackarchiveorg
Apr 15, 2005
4:11 pm
1738
Thanks, Michael and Gordon, for your input. ... I am running my crawls from the command line, so I don't think the above is the case. Once the crawl stops due to OOME I...
Saurabh Pathak
sau_pathak
Apr 18, 2005
1:27 pm
1739
... Are you overriding the default heritrix.properties file? In particular, this line: 'org.archive.crawler.datamodel.BigMapFactory.class = ...
stack
stackarchiveorg
Apr 18, 2005
4:48 pm
1740
Thanks a lot for your reply, Michael. ... particular, ... I am not overriding the heritrix.properties file apart from specifying a higher value of heap in...
Saurabh Pathak
sau_pathak
Apr 18, 2005
7:37 pm