You might want to take a look at the Automated Revisiting Module being developed at the moment by kris@... It does implement a new Frontier including a...
Yes, the AR module (currently available as a branch of the Heritrix project, http://crawltools.archive.org:8080/cruisecontrol/buildresults/BRANCH-heritri ...
Kris, I was aware of your AR module and should have asked a couple questions about it in that earlier post. The algorithm I suggested could be written as a...
Hey John, Part of the reason for a separate frontier is parallel development. When I started working on it there was no BDBFrontier and there Some other...
... Would be great if you could confirm that you are indeed getting better results. ... Can you cite the section in RFC 2616 where it says this, please, Dave? (I ...
I have a stored profile that I use for all of my crawls that contains the elaborate (well, that may be an overstatement, but I do tweak a lot of knobs and add...
... see private reply to your archive.org address ... see sections 15.1.2 and 15.1.3 I'm basing my comment on the following from 15.1.3 Clients SHOULD NOT...
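A minimal sketch of the client rule that RFC 2616 section 15.1.3 states about the Referer header, assuming the truncated quotation above is that sentence: a client should not send Referer on a non-secure request when the referring page was fetched over a secure protocol. The function name `referer_for` is hypothetical and is not part of Heritrix.

```python
from urllib.parse import urlsplit


def referer_for(referring_url, target_url):
    """Return the Referer value to send, or None to omit the header.

    Hypothetical helper: drops the Referer when the referring page came
    over https and the new request does not (RFC 2616, section 15.1.3).
    """
    source_scheme = urlsplit(referring_url).scheme.lower()
    target_scheme = urlsplit(target_url).scheme.lower()
    if source_scheme == "https" and target_scheme != "https":
        return None
    return referring_url
```

For example, `referer_for("https://a.example/login", "http://b.example/")` returns `None`, while a plain http-to-http hop keeps the referring URL.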
... Smile. I'll try it for you, Tom, if you want to send me your profile -- because it should just work (I'd like to see what issues we run into). St.Ack...
Executing the following command: /home/apps/heritrix/bin/heritrix -n /home/apps/heritrix/conf/profiles/testProfile/order.xml With the following order.xml...
It appears that Heritrix does not use the system hosts file (/etc/hosts). Is this correct? -- Rich Collins Director of Information Technology InjuryBoard.com ...
... I tried it. There was one issue where an old default didn't make the transition nicely so I added code to handle this in HEAD. Otherwise, there is one...
... Heritrix uses the dnsjava library to do its lookups. See 'Limitations' section in http://www.xbill.org/dnsjava/README for description of how it goes about...
I've got a midfetch filter that looks at the content-length and last-modified headers and pretends that a disk directory structure (as could be produced by...
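The kind of check described above could look roughly like the sketch below. It is written in Python for brevity rather than as a Heritrix filter, and it assumes (the message is truncated, so this is a guess) that a document counts as unchanged when a local mirror file has the same size as Content-Length and is at least as new as Last-Modified. The name `looks_unchanged` and the plain dict of headers are hypothetical.

```python
import os
from email.utils import parsedate_to_datetime


def looks_unchanged(headers, local_path):
    """Return True when the local copy appears to match the response headers.

    headers: dict with optional "Content-Length" and "Last-Modified" values.
    local_path: where a mirrored copy of the document would live on disk.
    """
    if not os.path.isfile(local_path):
        return False
    length = headers.get("Content-Length")
    modified = headers.get("Last-Modified")
    if length is None or modified is None:
        # Without both headers there is nothing to compare, so fetch.
        return False
    try:
        remote_size = int(length)
        remote_time = parsedate_to_datetime(modified).timestamp()
    except (TypeError, ValueError):
        # Unparseable header values: fall back to fetching.
        return False
    stat = os.stat(local_path)
    return stat.st_size == remote_size and stat.st_mtime >= remote_time
```

A midfetch decision would then abort the download when this returns True and let it continue otherwise.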
Every time, I use a file-compare program like Beyond Compare to compare the old order.xml with the new one, and apply the differences manually :( Ansi...
ansi
mymaillist@...
Feb 2, 2005 12:43 am
1467
Hi Kris, Help me get up to speed with what you're thinking here. I'm obviously totally new here, so take my questions as interest, not argument. ... A duplicate...
... Yeah. Needs some work. ... I think your difficulty is not seeing the three little configuration options surts-source-file, seeds-as-surt-prefixes, and...
... [...] Good lord, I completely missed these. And here I was thinking that I had scanned all of the entries on the page. Even not moving them in the WUI but...
OK, so I'm still a bit hosed because my surt prefix (meant to mimic path scope) prevents the site's robots.txt file from being read, and then I get a ream of...
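To illustrate why a path-like SURT prefix leaves robots.txt out of scope, here is a simplified sketch in Python. It assumes the usual SURT shape of scheme, reversed comma-separated host in parentheses, then path; it ignores ports, userinfo, and the canonicalization Heritrix applies, and the function names and example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit


def to_surt(url):
    """Convert a URL to a simplified SURT-style string.

    "http://www.example.com/docs/a.html" becomes
    "http://(com,example,www,)/docs/a.html". Ports are dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ",".join(reversed(host.split("."))) + ","
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{parts.scheme}://({reversed_host}){path}"


def in_scope(url, surt_prefixes):
    """Return True when the URL's SURT form starts with any given prefix."""
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)
```

With the single prefix `http://(com,example,www,)/docs/`, a page under `/docs/` is in scope, but `http://www.example.com/robots.txt` is not, because its SURT form does not begin with the path part of the prefix.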
... You might consider broadening the key to accommodate timestamp or you might put your timestamp in place of the key tail ordinal of 64 bits (You may want to...
... Sorry about that Tom. I added a note under 'surtprefixscope' that says: 'When you use this scope, it adds 3 hard-to-find-in-the-UI attributes -- ...
Tom Emerson wrote: ... A commit I made earlier today was supposed to avoid your seeing this exception. Did you update recently? If not, try setting your ...
... I updated right after your note saying you fixed this, but the change must not have percolated to the anonymous server. Once I made the 'max-length-bytes'...
... Indeed, a path scope crawl using the same seed has crawled over 1600 documents so far and is only "23%" complete. Shouldn't the surt pattern provide the...
... To a point, yes, but a repeating Frontier may be interested in rediscovered URIs. I.e. if a new or changed page embeds another document, we may want to ...
... Would it be difficult to have the WUI check that you enter a valid user agent and from string? Or is the check too complex to put in a place like that?...
I believe that the WUI prints a red star next to the setting if it is invalid. Maybe it should be more forceful, but at the time the functionality was added, it...
Hi, Thanks for the reply. I just realized that I have to specify the proxy host and port. May I know where I should specify it? Inn Fang ... get ... Source) ...