archive-crawler
Messages

Messages 1900 - 1929 of 6142
1900
... The error might only occur if the '%' is at the end of the query string (see stacktrace). ... I would consider passing invalid URIs to the queue as a bug. We...
Christian Kohlschuetter
ck-heritrix@...
Jun 1, 2005
8:38 am
1901
I'm trying to crawl http://www.orcun.de which as far as I can tell is completely Flash-based. I can't seem to get Heritrix to follow through the links in the...
robeger
Jun 1, 2005
7:36 pm
1902
Never mind... A lesson in making sure you check everything before posting to a group. There was a robots meta-tag set to nofollow on the page. Rob....
robeger
Jun 1, 2005
8:22 pm
1903
We are in the process of upgrading to Heritrix 1.4 and we've noticed that when the crawl completes and the logging status says complete, the app never...
jirleech
Jun 1, 2005
8:45 pm
1904
... After finishing a crawl, Heritrix waits around unless you explicitly ask it to shutdown via 'Shut down Heritrix software' link on main index page. Are you...
stack
stackarchiveorg
Jun 2, 2005
4:18 pm
1905
Dear all, We have made available a Heritrix processor that interfaces with rainbow, the most widely known, and perhaps most widely used, text classification...
bergmark_d
Jun 3, 2005
12:47 am
1906
The whole idea of having TextUtils.getMatcher() was to avoid Matcher instantiation, so we are basically talking about cache efficiency and performance. You are...
Christian Kohlschuetter
ck-heritrix@...
Jun 3, 2005
9:55 am
1907
We seem to have found the bug that caused this problem. We run Heritrix in an integrated software solution, so it's run via the API with a command line like...
jirleech
Jun 3, 2005
5:47 pm
1908
... Thank you both for the excellent contribution. The doc. is really great too: i.e. Overview.pdf (I like the suggestion of the classifier being used to...
stack
stackarchiveorg
Jun 3, 2005
6:13 pm
1909
... Thanks for this bug report! I have filed an issue along with a patch ([ 1214478 ] ThreadLocalHttpConnectionManager starts a non-daemon Thread) Cheers, -- ...
Christian Kohlschuetter
ck-heritrix@...
Jun 3, 2005
7:39 pm
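The issue title above refers to a non-daemon thread. As a general JVM rule, the process exits only once every non-daemon thread has finished, so a background housekeeping thread that is not marked as a daemon can keep an application alive after its real work is done. The following is a minimal sketch of that distinction; `CleanupThreads` and `newCleanupThread` are hypothetical names for illustration and are not taken from Heritrix or HttpClient.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not Heritrix or HttpClient code.
class CleanupThreads {

    // Builds a background thread that loops until it is interrupted.
    // A daemon thread does not, on its own, keep the JVM alive.
    static Thread newCleanupThread(boolean daemon) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    TimeUnit.SECONDS.sleep(1); // stand-in for periodic cleanup work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and leave the loop
            }
        }, "cleanup");
        t.setDaemon(daemon); // must be set before start()
        return t;
    }

    public static void main(String[] args) {
        Thread cleanup = newCleanupThread(true);
        cleanup.start();
        System.out.println("daemon=" + cleanup.isDaemon());
        // main returns here; because the thread is a daemon, the JVM can exit.
    }
}
```

Passing `false` instead would leave the looping thread as a non-daemon thread, and the JVM would then stay up until that thread is interrupted or the process is shut down explicitly.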
1910
... Applied. (Thanks to both of you). St.Ack...
stack
stackarchiveorg
Jun 3, 2005
7:46 pm
1911
Can I use a tool such as pkunpack or BitZipper to extract data from the ARC file? I tried to get the ARC files from the Linux machine to an XP machine, and...
Inn Fang
innfang
Jun 4, 2005
6:46 am
1912
Is there an xsd file somewhere in the distribution for the order.xml file? I looked around the directories, but haven't located one. I noticed the order.xml...
robeger
Jun 6, 2005
2:21 pm
1913
Also, how often does this change?...
robeger
Jun 6, 2005
2:22 pm
1914
... It's available wherever the admin webapp is served at '/heritrix_settings.xsd' (e.g. http://localhost:8080/heritrix_settings.xsd). See ...
stack
stackarchiveorg
Jun 6, 2005
3:05 pm
1915
... Try gzip. ... (Probably) Check out 'Internet Archive ARC files' in the developer manual to learn about the ARC format: ...
stack
stackarchiveorg
Jun 6, 2005
3:36 pm
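As a rough illustration of the "try gzip" suggestion, the JDK's `java.util.zip.GZIPInputStream` can decompress gzip data without any external tool. The sketch below assumes the input is ordinary gzip data; the `GunzipSketch` class and its method names are hypothetical, and it only decompresses bytes, it does not parse ARC record headers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only; it knows nothing about ARC records.
class GunzipSketch {

    // Reads a gzip stream to the end and returns the decompressed bytes.
    static byte[] gunzip(InputStream in) {
        try (GZIPInputStream gz = new GZIPInputStream(in);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses bytes with gzip; used here only to build sample input.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sample = "sample record body".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = gunzip(new ByteArrayInputStream(gzip(sample)));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```

To read a file on disk, the same `gunzip` method could be handed a `java.io.FileInputStream` in place of the in-memory stream.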
1916
... Unfortunately, there are other compressed aggregate file types out there that use the ARC extension, dating back all the way to DOS and early Mac, which can...
Gordon Mohr (@Interne...
gojomo
Jun 6, 2005
6:38 pm
1917
Hi, Christian. Good ideas, but with your suggested tactic, code that isn't careful could still have unpredictable results. Perhaps this is a contrived example,...
Gordon Mohr
gojomo
Jun 7, 2005
1:35 am
1918
... Thank you very much for the replies. In this case, I think that if I wish to view each harvested item, I should write a program that first uses the arcreader command to...
Inn Fang
innfang
Jun 7, 2005
7:21 am
1919
Hi Gordon, yes, this example would not be valid for the getMatcher contract I suggested. I doubt that getMatcher should even support this, as it would render...
Christian Kohlschuetter
ck-heritrix@...
Jun 7, 2005
10:25 am
1920
Sorry, two typos (copy-and-paste...) ... throw new IllegalArgumentException("Pattern must not be null"); ... m = pattern.matcher(input); -- Christian...
Christian Kohlschuetter
ck-heritrix@...
Jun 7, 2005
11:19 am
1921
Additional notes: 1. TextUtils.returnMatcher(Matcher) should always be called in a "finally" clause to ensure that the Matcher comes back to the lending....
Christian Kohlschuetter
ck-heritrix@...
Jun 7, 2005
1:42 pm
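The first note above, returning a borrowed Matcher in a "finally" clause, can be sketched as follows. `MatcherLender` is a hypothetical stand-in written for this example and is not Heritrix's `TextUtils`; the per-thread pool and the `pooledCount` helper are assumptions made only so the lend-and-return behaviour is visible.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a matcher-lending utility; NOT Heritrix's TextUtils.
class MatcherLender {

    // One small pool per thread, so a lent Matcher is never shared across threads.
    private static final ThreadLocal<Deque<Matcher>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Lends a Matcher for the given pattern and input, reusing a pooled one if present.
    static Matcher getMatcher(Pattern pattern, CharSequence input) {
        if (pattern == null) {
            throw new IllegalArgumentException("Pattern must not be null");
        }
        Matcher m = POOL.get().poll();
        if (m == null) {
            return pattern.matcher(input);
        }
        return m.usePattern(pattern).reset(input);
    }

    // Takes a Matcher back; it drops its reference to the caller's input first.
    static void returnMatcher(Matcher m) {
        if (m != null) {
            m.reset("");
            POOL.get().push(m);
        }
    }

    // Number of Matchers currently pooled for the calling thread.
    static int pooledCount() {
        return POOL.get().size();
    }

    // Caller-side pattern: the return happens in finally, even if matching throws.
    static boolean matches(Pattern pattern, CharSequence input) {
        Matcher m = getMatcher(pattern, input);
        try {
            return m.matches();
        } finally {
            returnMatcher(m);
        }
    }
}
```

With this shape, a caller that throws between borrowing and returning still hands the Matcher back, which is the property the note asks for.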
1922
... Or subclass ARCReader and as it's running through the records, have it write tmp.pdf, rather than just offsets, its default behavior. Or use the...
stack
stackarchiveorg
Jun 7, 2005
5:06 pm
1923
I've been looking into using the command line JMX client (very slick, btw) and am struck with the question about the difference between adding a URL to the...
Matt Ittigson
cydatamatt@...
Jun 8, 2005
3:21 pm
1924
What I want to do is use a template order.xml and settings dir, with our default configuration, settings, etc. as the basis for creating new crawls (but not...
robeger
Jun 8, 2005
9:06 pm
1925
Personally I'd just use 'sed' to replace certain keywords in your templates with the values you want. Doesn't seem worth making anything more complex,...
Tom Emerson
tree02139
Jun 8, 2005
9:14 pm
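The keyword-replacement idea suggested above is not tied to sed. Here is a minimal sketch of the same approach in Java, assuming placeholders of the form `@KEY@`; the placeholder syntax, the `OrderTemplate` class, and the sample template text are all made up for illustration and do not reflect the real order.xml layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class OrderTemplate {

    // Replaces every literal @KEY@ placeholder with its value.
    // String.replace does plain text substitution, so no regex escaping is needed.
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("@" + e.getKey() + "@", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up template text, not a real order.xml fragment.
        String template = "job=@JOB_NAME@ seed=@SEED@";
        Map<String, String> values = new LinkedHashMap<>();
        values.put("JOB_NAME", "weekly-crawl");
        values.put("SEED", "http://example.com/");
        System.out.println(fill(template, values));
    }
}
```

Placeholders with no entry in the map are left untouched, which makes a forgotten value easy to spot in the generated file.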
1926
Thanks, Tom. I'll take a look at doing it that way too. This seems to be my week for premature posting to groups. I seem to have figured out what I was doing...
Rob Eger
robeger
Jun 8, 2005
9:18 pm
1927
What did you use to validate it? I get this error: "org.apache.xmlbeans.XmlException: error: Unexpected element: CDATA" when I try to parse an order.xml file...
Rob Eger
robeger
Jun 8, 2005
10:17 pm
1928
... FYI, in $HERITRIX_HOME/src/resources/arcMetaheaderBody.xsl is a stylesheet that reads an order file and writes a subset for inclusion at the head of every...
stack
stackarchiveorg
Jun 8, 2005
10:20 pm
1929
... Try it now (Below is patch just committed). My test must have been against an odd order file (else I was hallucinating). St.Ack Index:...
stack
stackarchiveorg
Jun 8, 2005
10:44 pm