archive-crawler
Messages

Messages 1233 - 1263 of 6140   Oldest  |  < Older  |  Newer >  |  Newest
1233
Sorry for asking a question not related to techniques. Accessing the Yahoo group with a web browser is a little slow here, and the RSS reader can only read...
bjhong02
Dec 1, 2004
1:58 pm
1234
I've written a frontier module with persistent state. Both the current frontier state (queued URIs, statistics, etc.) and the set of successfully fetched URIs...
tztwh
Dec 1, 2004
6:51 pm
1235
I know you can put a max number of documents for a crawl, but can you set a max number of documents per site? Thanks, Rob Eger Aptas, Inc....
robeger
Dec 1, 2004
10:35 pm
1236
... Take a look at the DomainSensitiveFrontier: http://crawler.archive.org/xref/org/archive/crawler/frontier/DomainSensitiveFrontier.html. Here is from its...
stack
stackarchiveorg
Dec 1, 2004
10:57 pm
1237
Am I understanding correctly that the way to use this is to set up a regex to match on content types you want to exclude, not to match the ones you want to...
robeger
Dec 1, 2004
11:08 pm
1238
... Depends on where you're trying to set the filter. Here's the user manual paragraph on the ContentTypeRegExpFilter if content-type filtering is your thing: ...
stack
stackarchiveorg
Dec 1, 2004
11:22 pm
1239
The below is sweet. St.Ack...
stack
stackarchiveorg
Dec 1, 2004
11:37 pm
1240
Hello everyone. As some of you know, I've been working on a new Heritrix module (or rather modules) that would allow for iterative crawling, which adjusts its...
Kristinn Sigurdsson
kristsi25
Dec 2, 2004
10:25 am
1241
I had found the user manual pages, but not the FAQ question about it. So at the midfetch stage I want to set it to return true for the content-types I want to...
robeger
Dec 2, 2004
3:19 pm
1242
I think the URL for the FAQ just got a little messed up, try: http://crawler.archive.org/faq.html#midfetch The Q is: "I only want to download text/html and...
Kristinn Sigurdsson
kristsi25
Dec 2, 2004
3:30 pm
1243
Here's how I do this for my latest crawls: 1. Add the ContentTypeRegExpFilter as a midfetch-filter and write-processor Archiver filter. 2. In the settings for...
Tom Emerson
tree02139
Dec 2, 2004
3:45 pm
1244
Just tried this out. It seems that the crawl never "finishes". Tried it on a single URL, limiting it to 10 docs for that site. Once it reaches 10, it starts...
robeger
Dec 2, 2004
6:26 pm
1245
Hi, I am trying to use this crawler to build a search engine. Has anyone had any experience with this, or encountered any problems? Please share your experiences. What...
Sivaji Adurthy
adurthy
Dec 2, 2004
7:35 pm
1246
I am just curious if there is a way to multithread Heritrix when crawling a single site. Also, there was a post about embedding Heritrix without having to call ...
jirleech
Dec 2, 2004
7:48 pm
1247
... Not reliably (there is the 'valence' feature on Frontier, but it's problematic even after help from members of this list). ... There's been some progress....
stack
stackarchiveorg
Dec 2, 2004
8:50 pm
1248
... We've done a few experiments trying to make ARC files, the default product of a Heritrix crawl, searchable. Mostly this has consisted of trying to get ARC...
stack
stackarchiveorg
Dec 2, 2004
10:40 pm
1249
... The way that the DomainSensitiveFrontier works is that when it hits the max-docs-for-this-domain threshold, it marks all remaining queued URLs as ...
stack
stackarchiveorg
Dec 3, 2004
8:18 pm
1250
This is kind of a general crawl question, not specific to Heritrix, but I figured this was as good a place as any to ask it. Basically, when you're running a...
robeger
Dec 3, 2004
9:31 pm
1251
Hey, I did reply to this shortly before I left for the weekend, but it seems that the reply didn't make it (hope this one does). And yeah, it doesn't work...
ogrenholm
Dec 5, 2004
11:35 am
1253
In general, yes. I've been managing a crawl of the entire .is TLD, and while most domains require no special attention, about 3-5% have various crawler traps....
Kristinn Sigurdsson
kristsi25
Dec 6, 2004
8:24 am
1254
Thanks, Tom. FYI, you have "au" as one of your extensions, which causes problems if you are crawling any sites in Australia (.au domain). Caused me some ...
robeger
Dec 6, 2004
4:57 pm
1255
... I'm surprised this is being applied against the domains --- the filter should only be applied against the end of a fully qualified URL, so even a TLD should...
Tom Emerson
tree02139
Dec 6, 2004
5:50 pm
1256
It surprised me too, but I was feeding it a .au domain, and it wouldn't crawl anything until I removed au from the exclusion list. Here's the crawl log entry: ...
robeger
Dec 6, 2004
8:17 pm
1257
... Weird. Well, now we know. :-) -tree -- Tom Emerson Basis Technology Corp. Software Architect...
Tom Emerson
tree02139
Dec 6, 2004
8:26 pm
1258
Tom is right that as part of any scheduleable HTTP (or other recognizably hierarchical) URI, there will be a '/' after the hostname portion, preventing a tail...
Gordon Mohr (Internet...
gojomo
Dec 7, 2004
2:58 am
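The "au" exchange in messages 1254 to 1258 can be illustrated with a small Java sketch. The exclusion pattern below is hypothetical (the regex actually used on the list is not shown in these snippets), and the class and method names are made up for illustration. It only demonstrates the general point that a tail-anchored extension match can hit a bare `.au` hostname when no '/' follows it, and stops matching once the '/' is present.

```java
import java.util.regex.Pattern;

class ExtensionTailMatch {
    // Hypothetical extension-exclusion regex; an assumption, not the pattern posted to the list.
    static final Pattern EXCLUDED =
            Pattern.compile(".*\\.(au|wav|mp3)$", Pattern.CASE_INSENSITIVE);

    // Returns true when the whole URI string ends in one of the excluded extensions.
    static boolean isExcluded(String uri) {
        return EXCLUDED.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://www.example.com.au",            // no trailing '/': the TLD looks like an extension
            "http://www.example.com.au/",           // '/' after the hostname blocks the tail match
            "http://www.example.com.au/index.html", // ordinary page on a .au host
            "http://www.example.com.au/clip.au"     // a real .au audio file
        };
        for (String uri : uris) {
            System.out.println(uri + " -> excluded=" + isExcluded(uri));
        }
    }
}
```

If a filter of this kind is ever applied to a string without the '/' after the hostname (a bare seed, for example), removing the colliding extension or requiring a path separator before the file name in the regex are two possible workarounds.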
1259
Hi everyone, I updated CVS from HEAD just now and ran Heritrix on Windows 2003. When I click 'New job based on it' to create a job, Heritrix tells me: An error...
ansi
mymaillist@...
Dec 8, 2004
8:01 am
1260
I take it this was working for you before? The only relevant change I can see that has been made lately is to ...
Kristinn Sigurdsson
kristsi25
Dec 8, 2004
9:03 am
1261
This is the first time I have run Heritrix 1.3 on Windows. The 1.6 version of JobConfigureUtils.java doesn't work either. After adding Profiles/seeds.txt and...
ansi
mymaillist@...
Dec 8, 2004
9:47 am
1262
... Yeah. I think so (the path that's being complained about is a Windows path with '\' separators. CLASSPATH paths use '/' separators). I added path...
stack
stackarchiveorg
Dec 8, 2004
7:27 pm
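The separator mismatch described in message 1262 can be sketched in a few lines of Java. This is only an illustration of the general issue, not the change that was actually committed to Heritrix (the snippet is cut off before describing it); the class name, method name, and sample path are hypothetical.

```java
class ResourcePaths {
    // Classpath resource names are '/'-separated on every platform,
    // so a Windows-style path has to be normalized before a lookup.
    static String toResourcePath(String path) {
        return path.replace('\\', '/');
    }

    public static void main(String[] args) {
        String windowsStyle = "profiles\\default\\seeds.txt"; // hypothetical path
        String resource = toResourcePath(windowsStyle);
        System.out.println(resource);
        // A lookup such as ClassLoader.getResource(resource) expects this '/' form.
    }
}
```

Paths that already use '/' pass through unchanged, so the same call is safe on both Windows and Unix-like systems.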
1263
FYI, the crawler project wiki, at... http://crawler.archive.org/cgi-bin/wiki.pl ...has been upgraded to the 1.0 version of the UseMod wiki software. (It was...
Gordon Mohr (Internet...
gojomo
Dec 8, 2004
10:41 pm