I was just running a crawl of ~500 URLs and noticed that I was getting a ton of these (almost 600 when I stopped the crawl). I'm using the max doc per site...
So is it the modifications that the Frontier is doing when it hits the max docs per site limit that are causing the exceptions? I'm not doing any edits...
... The problem seems to be the insertion of an OrFilter to exclude hosts that have gone over their download quota; the insertion is happening while elsewhere...
Hello all, I used Heritrix to crawl some hosts and my crawl job is finished. I would like to get the link structure of the pages that were crawled. Can you tell me...
... We don't currently have publicly available tools that will allow you to extract document links post crawl. We have to add them. We currently use...
From crawl.log you can get this information with a bit of perl and unix commandline. For example: cat crawl.log | tr -s " " | cut -f4,6 -d " " | sort -k2,2 |...
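For readers who would rather not chain shell tools, the same extraction can be sketched in Python. This is a minimal sketch, not the poster's script: it assumes, as the `cut -f4,6` in the command above implies, that the fourth whitespace-separated field of each crawl.log line is the fetched URI and the sixth is the URI it was discovered from, and that lines with no referrer carry `-` in that position. The function names `parse_links` and `outlinks_by_page` are hypothetical.

```python
from collections import defaultdict


def parse_links(lines):
    """Return (referrer, uri) pairs from crawl.log-style lines.

    Assumption: field 4 is the fetched URI and field 6 is the referring
    URI, matching the `cut -f4,6` in the shell pipeline.
    """
    pairs = []
    for line in lines:
        # str.split() with no argument collapses runs of whitespace,
        # which plays the role of `tr -s " "` in the pipeline.
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        uri, via = fields[3], fields[5]
        if via == "-":
            continue  # assumption: "-" means no referrer (e.g. a seed)
        pairs.append((via, uri))
    return pairs


def outlinks_by_page(pairs):
    """Group fetched URIs under the page that referred to them."""
    graph = defaultdict(set)
    for via, uri in pairs:
        graph[via].add(uri)
    return graph


if __name__ == "__main__":
    # Reads crawl.log from the current directory (path is an assumption).
    with open("crawl.log", encoding="utf-8", errors="replace") as fh:
        graph = outlinks_by_page(parse_links(fh))
    for page in sorted(graph):
        for target in sorted(graph[page]):
            print(page, target)
```

Like the shell version, this only recovers links to URIs that were actually fetched and logged, each attributed to the single page recorded as its referrer.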
Hi, Thank you so much for your answer. I used the command line that you gave me, but I think its output does not match my requirement. I...
... Thank you! I have successfully started Heritrix from my app, but there is still something I don't understand: With the web interface, everything works fine....
Hi Niti, Yes, the command line will output links only one hop away. Heritrix does not use PageRank, so we don't have that info as part of Heritrix...
... Thank you for your answer! Yes, I am running Heritrix on a Windows 2000 box. I have moved Heritrix's JAR to the top of my classpath, and now it works. But...
... On the same machine, the standalone Heritrix works without needing to specify seeds as IPs? Otherwise, I'd say there's an issue with DNS on your Windows...
So I've used Heritrix to download a Web page, and I have the page content as a CharSequence (i.e., ReplayCharSequence). The CharSequence interface in Java says...
Hi, Thanks so much for your suggestion. Now I can get the link structure information from Heritrix. My supervisor (Chistian) helped me to write a new Java...
This looks like a bug in the implementation of the ReplayCharSequence interface (both of them). The reason this slipped by is probably that every class has...
I have been working on a programming project that involves Heritrix and have run into some unexpected issues: 1. I attempted to create a new processor module...
Sorry to post again, but I just found something else about my processor problem. While the version of the profile I created that the web interface displays...
... This might be a bug in how our profiles work. What happens if you create a new job based off a job that successfully included your processor? Does it...
... Can you give a recipe of what you did? I just tried adding new profiles via the UI, changing the processor list content, and then creating new jobs and it...
... Are you running Heritrix on a Windows box? Try adding default/order.xml and default/seeds.txt to the conf/profiles dir, then create a new job. ... -- ... This mail is...
ansi
mymaillist@...
Jan 9, 2005 2:26 am
1339
Sounds like a problem with serialization of the profile. When you add a processor to a job or profile via the web ui, the in memory objects are modified and...
Dear expert crawlers, I keep receiving warnings like the following in the heritrix_out.log file: "java.io.IOException: Too many open files". Moreover, after half...
... What scope are you using? Which platform? How many seeds? ... Check the thread report and frontier reports to see where things are holding up. If you have...
... Sorry -- I should have been more specific. I'm on linux/debian/sarge, and I use heritrix 1.2.0. I have the same problem with both host and domain scopes...
... 32k should be more than sufficient. In heritrix_out.log, before the crawler starts, it prints out its ulimit settings. Does it say that you are entitled...
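As a cross-check on the limits printed in heritrix_out.log, the open-file limit of the launching environment can also be read programmatically. The following is a minimal sketch using Python's standard `resource` module (Unix only); it reports the limits inherited by the Python process itself, so it only approximates what the crawler sees if it is run from the same shell and user that start Heritrix. The function name `open_file_limits` is hypothetical.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors.

    These are the limits of the current process, inherited from the
    shell that launched it; resource.RLIM_INFINITY means "unlimited".
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("open files: soft limit =", soft, "hard limit =", hard)
```

If the soft limit reported here is much lower than the 32k mentioned above, the raised limit is probably not being applied to the shell that starts the crawler.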