Hi, I have been running Heritrix for 3 days on a large set of seeds (420,000). It ended normally but downloaded only 99 GB and only about 4 million...
... Yes -- it means that the predicted false-positive rate inherent to a Bloom filter won't go over 1 in 4 million (1 in 2^22) up through 125 million inserts. ...
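To make the arithmetic in the reply above concrete, here is a minimal sketch using the textbook Bloom filter estimate p = (1 - e^(-kn/m))^k. It assumes 22 hash functions and a bit array sized so that 22 is the optimal hash count at 125 million inserts; these sizing choices and all names below are illustrative assumptions, not settings confirmed in the thread.

```python
import math

def optimal_bits(n_inserts, hash_count):
    """Bit-array size for which hash_count is optimal at n_inserts: m = k*n/ln(2)."""
    return math.ceil(hash_count * n_inserts / math.log(2))

def false_positive_rate(n_inserts, m_bits, hash_count):
    """Textbook estimate of the false-positive rate after n_inserts."""
    return (1.0 - math.exp(-hash_count * n_inserts / m_bits)) ** hash_count

N = 125_000_000   # inserts mentioned in the reply
K = 22            # assumed hash count, giving 2**-22 at capacity
M = optimal_bits(N, K)

rate_at_capacity = false_positive_rate(N, M, K)      # about 1 in 2**22
rate_at_double = false_positive_rate(2 * N, M, K)    # well past capacity

print(f"bits: {M}, rate at capacity: {rate_at_capacity:.3e}")
print(f"rate at twice capacity: {rate_at_double:.3e}")
```

Under these assumptions the rate stays at or below 1 in 2^22 up to the stated insert count and rises quickly beyond it, which is why the figure is quoted as holding only "up through" 125 million inserts.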
... Normally, this would mean that the hostname DNS lookup for those URLs failed. With no successful DNS lookup, the URL cannot be fetched. Are these same URLs...
... rate ... Actually that might be good enough. My current idea is to have all 8 crawlers download 1B pages in total. Assume ideal page distribution...
... Yes -- a URL is tested against (and inserted in) the already-included set just before it is queued/scheduled, not when it is downloaded. - Gordon @ IA...
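The ordering described in the reply above can be sketched as follows. This is a simplified illustration, not Heritrix code: the `Frontier` class and its method names are hypothetical, and a plain in-memory set stands in for the already-included structure.

```python
from collections import deque

class Frontier:
    """Toy frontier: the duplicate check happens at scheduling time."""

    def __init__(self):
        self.already_included = set()   # hypothetical stand-in for the already-included set
        self.queue = deque()            # URLs waiting to be fetched

    def schedule(self, url):
        """Test and insert the URL just before queueing; return True if queued."""
        if url in self.already_included:
            return False                # seen before, never queued again
        self.already_included.add(url)
        self.queue.append(url)
        return True

    def next_to_fetch(self):
        """Hand out the next URL; no duplicate check is needed at download time."""
        return self.queue.popleft() if self.queue else None

frontier = Frontier()
first = frontier.schedule("http://example.com/a")
second = frontier.schedule("http://example.com/a")
print(first, second, len(frontier.queue))
```

One consequence of this design is that a URL counts as "included" as soon as it is scheduled, even if it has not yet been downloaded.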
Hi, I have an application that runs multiple instances of Heritrix in a single JVM. The application creates a new Heritrix instance to run each harvest and...
In Heritrix 1.9, after a crawl job has been paused, the user can click "View or Edit Frontier URIs" and be taken to a screen where they can add, view, or...
... Seeds often have special treatment, for example by changing the crawl's scope -- so you might want to add URIs that are not treated specially. Note that if...
Heritrix is running on Solaris, but my browser is running on Windows where my file is located. A file upload (Browse) button would be useful in this...
... Yes, they are. ... settings ... I will explore that. ... begin ... would be ... I am using version 1.8. I add a SURT prefix (+http://(cz,) to my seeds.txt...
Sorry, I forgot... How many toe threads (max-toe-threads) can a broad crawl use with 4 GB RAM and a 900 MHz Pentium III? And what exactly does seeds.ignored mean -...
... Yes (if I understand you correctly). The second time it runs, it has no knowledge of the first run and will happily travel the same path as the previous...
... Hard to say. Start with default and work your way up. See how your throughput changes. Watch net and disk i/o and your CPU consumption. Try and balance...
Hi. I have run into a problem: when I crawl a designated web site, the crawl stops making progress whenever it reaches 99%. It then takes 7 or 8 hours. ...
Hi all, I am new to Heritrix. The manual says that Heritrix supports only ISO format. Has anybody worked on making Heritrix handle the UTF-8 charset?...
Hi, I ran a broad crawl (with the deciding scope) and everything was OK. But today we had a power failure and had to reboot the server. I made...
Does the parameter "total-bandwidth-usage-KB-sec" count (1) downloaded data or (2) saved data (written to ARC files)? The reason I ask this question...
Hi there, is there a way to obtain the UID (the getUID() method of a CrawlJob instance) from a processor and from the scope? If yes, how can that be done? If not,...
I have been trying unsuccessfully for the past few weeks to do a CVS checkout of Heritrix, based on the instructions here: http://crawler.archive.org/cvs-usage.html...
Not without acrobatics (if running a single Heritrix instance: Heritrix.getSingleInstance().getJobHandler().getCurrentJob().getUID();). I'd be interested to...
Please supply the Heritrix version and describe how you set up the 'recovery'. Were you doing 'fast' checkpointing or letting the crawler manage the bdbje log files for...