archive-crawler
Messages 1315 - 1344 of 6140
1315
I was just running a crawl of ~500 URLs and noticed that I was getting a ton of these (almost 600 when I stopped the crawl). I'm using the max doc per site...
robeger
Jan 3, 2005
5:32 pm
1316
... There is a known issue with CMEs. See http://crawler.archive.org/articles/releasenotes.html#1_0_0_limitations. Is this it? St.Ack...
stack
stackarchiveorg
Jan 3, 2005
6:23 pm
1317
So is it the modifications that the Frontier makes when it hits the max docs per site limit that are causing the exceptions? I'm not doing any edits...
robeger
Jan 3, 2005
6:45 pm
1318
... The problem seems to be the insertion of an OrFilter to exclude hosts that have gone over their download quota; the insertion is happening while elsewhere...
stack
stackarchiveorg
Jan 3, 2005
7:56 pm
1319
Hello all, I used Heritrix to crawl some hosts and my crawl job has finished. I would like to get the link structure of the pages that were crawled. Can you tell me...
Niti Witthayawiroj
niti_wit
Jan 4, 2005
1:59 pm
1320
... We don't currently have publicly available tools that will allow you to extract document links post-crawl. We have to add them. We currently use...
stack
stackarchiveorg
Jan 5, 2005
2:35 am
1321
From crawl.log you can get this information with a bit of perl and unix commandline. For example: cat crawl.log | tr -s " " | cut -f4,6 -d " " | sort -k2,2 |...
Igor Ranitovic
iranitovic
Jan 5, 2005
3:25 am
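The crawl.log approach in message 1321 can also be sketched in Python. This is only an illustration, not the pipeline from the original message (which is truncated above). It assumes, as the `cut -f4,6` in that message suggests, that the fourth whitespace-separated field of each crawl.log line is the fetched URI and the sixth is the URI it was discovered from. It also assumes that seeds carry `-` in the referrer field. The sample lines and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical crawl.log-style lines, made up for illustration only.
SAMPLE = [
    "20050105032500123   200   1024 http://example.com/ - - text/html #001",
    "20050105032501456   200   2048 http://example.com/a.html L http://example.com/ text/html #002",
    "20050105032502789   200    512 http://example.com/b.html L http://example.com/ text/html #003",
    "20050105032503012   404      0 http://example.com/c.html LL http://example.com/a.html text/html #001",
]


def extract_links(log_lines):
    """Yield (referrer, uri) pairs, one per usable log line."""
    for line in log_lines:
        # split() with no argument squeezes runs of spaces, like `tr -s " "`.
        fields = line.split()
        if len(fields) < 6:
            continue  # skip short or malformed lines
        uri, referrer = fields[3], fields[5]
        if referrer == "-":
            continue  # assumption: "-" marks a seed with no referrer
        yield referrer, uri


def outlinks_by_page(log_lines):
    """Group the logged URIs under the page that referred to them."""
    graph = defaultdict(set)
    for referrer, uri in extract_links(log_lines):
        graph[referrer].add(uri)
    return {page: sorted(targets) for page, targets in graph.items()}


if __name__ == "__main__":
    for page, targets in outlinks_by_page(SAMPLE).items():
        print(page, "->", ", ".join(targets))
```

Like the command line, this only recovers links one hop away, and only for URIs that actually appear in the log.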
1322
Hi, Thank you so much for your answer. I used the command line that you gave me, but I think its output does not match my requirement. I...
Niti Witthayawiroj
niti_wit
Jan 5, 2005
1:22 pm
1323
... Thank you! I have successfully started Heritrix from my app, but there is still something I don't understand: With the web interface, everything works fine....
pm5400845
Jan 5, 2005
4:45 pm
1324
... The same order file is used in standalone Heritrix and works? Same platform? Is this a Windows box (Perhaps the following is related: ...
stack
stackarchiveorg
Jan 5, 2005
5:05 pm
1325
Hi Niti, Yes, the command line will output links only one hop away. Heritrix does not use PageRank, so we don't have that info as part of Heritrix...
Igor Ranitovic
iranitovic
Jan 5, 2005
11:41 pm
1326
... Thank you for your answer! Yes, I am running Heritrix on a Windows 2000 box. I have moved Heritrix's JAR to the top of my classpath, and now it works. But...
Philippe MOULIN
pm5400845
Jan 6, 2005
4:48 pm
1327
... On the same machine, does the standalone Heritrix work without needing to specify seeds as IPs? Otherwise, I'd say there's an issue with DNS on your Windows...
stack
stackarchiveorg
Jan 6, 2005
6:27 pm
1328
So I've used Heritrix to download a Web page, and I have the page content as a CharSequence (i.e., ReplayCharSequence). The CharSequence interface in Java says...
bergmark_d
Jan 6, 2005
10:02 pm
1329
Hi, Thanks so much for your suggestion. Now I can get the link structure information from Heritrix. My supervisor (Chistian) helped me to write a new Java...
Niti Witthayawiroj
niti_wit
Jan 7, 2005
12:27 am
1330
Very sweet. Thanks for the below. St.Ack...
stack
stackarchiveorg
Jan 7, 2005
2:04 am
1331
This looks like a bug in the implementation of the ReplayCharSequence interface (both of them). The reason this slipped by is probably that every class has...
Kristinn Sigurdsson
kristsi25
Jan 7, 2005
10:11 am
1332
I have been working on a programming project that involves Heritrix and have run into some unexpected issues: 1. I attempted to create a new processor module...
Xavid
chi2vid
Jan 7, 2005
7:43 pm
1333
Sorry to post again, but I just found something else about my processor problem. While the version of the profile I created that the web interface displays...
Xavid
chi2vid
Jan 7, 2005
7:58 pm
1334
Thanks for reporting the bug (And thanks Kris for the patch). I've committed the fix to HEAD and added unit tests. Yours, St.Ack...
stack
stackarchiveorg
Jan 7, 2005
10:56 pm
1335
Hi Niti, This is cool. If I am not wrong, you needed different link information than just from-to URLs, one hop away, right? i....
Igor Ranitovic
iranitovic
Jan 8, 2005
1:21 am
1336
... This might be a bug in how our profiles work. What happens if you create a new job based off a job that successfully included your processor? Does it...
stack
stackarchiveorg
Jan 8, 2005
9:33 pm
1337
... Can you give a recipe of what you did? I just tried adding new profiles via the UI, changing the processor list content, and then creating new jobs and it...
stack
stackarchiveorg
Jan 8, 2005
9:47 pm
1338
... Are you running Heritrix on a Windows box? Try adding default/order.xml and default/seeds.txt to the conf/profiles dir, then create a new job. ... -- ... This mail is...
ansi
mymaillist@...
Jan 9, 2005
2:26 am
1339
Sounds like a problem with serialization of the profile. When you add a processor to a job or profile via the web UI, the in-memory objects are modified and...
Kristinn Sigurdsson
kristsi25
Jan 10, 2005
9:10 am
1340
Dear expert crawlers, I keep receiving warnings like the following in the heritrix_out.log file: java.io.IOException: Too many open files Moreover, after half...
Marco Baroni
kumaraja2000
Jan 10, 2005
12:32 pm
1341
... What scope are you using? Which platform? How many seeds? ... Check the thread report and frontier reports to see where things are holding up. If you have...
Tom Emerson
tree02139
Jan 10, 2005
12:43 pm
1342
... Sorry -- I should have been more specific. I'm on linux/debian/sarge, and I use heritrix 1.2.0. I have the same problem with both host and domain scopes...
Marco Baroni
kumaraja2000
Jan 10, 2005
9:09 pm
1343
... 32k should be more than sufficient. In heritrix_out.log, before the crawler starts, it prints out its ulimit settings. Does it say that you are entitled...
Michael Stack
stackarchiveorg
Jan 10, 2005
9:09 pm
1344
... Well, actually now I see that it doesn't: open files (-n) 1024 which is funny because I did add a higher limit to ...
Marco Baroni
kumaraja2000
Jan 10, 2005
9:21 pm
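The open-files check discussed in messages 1343 and 1344 can be reproduced from any process started in the same shell. The sketch below is a minimal illustration using Python's standard `resource` module on a Unix-like system. It reports the limit inherited by the Python process itself, which may differ from what the Heritrix JVM actually gets if the crawler is launched another way. The 32768 threshold is an assumption taken from the "32k" mentioned above, and the function names are hypothetical.

```python
import resource

# Assumption: "32k" from the thread, read here as 32768 descriptors.
REQUIRED_OPEN_FILES = 32768


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(required, soft):
    """True if the soft limit is unlimited or at least `required`."""
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("open files: soft=%s hard=%s" % (soft, hard))
    if not limit_is_sufficient(REQUIRED_OPEN_FILES, soft):
        print("soft limit is below %d; the higher limit may not have "
              "been applied to this session" % REQUIRED_OPEN_FILES)
```

With the value reported in message 1344 (1024), `limit_is_sufficient(32768, 1024)` is false, which is consistent with the "Too many open files" warnings.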