bixo-dev · Bixo Web Mining Toolkit

Group Information

  • Members: 114
  • Category: Open Source
  • Founded: May 17, 2009
  • Language: English
Messages

Messages 1295 - 1325 of 1325
1295 Chris Schneider
schmed2000
Jan 10, 2013
3:02 pm
Hi Nilay, ... I have tried to answer your latest set of questions below. I have now exhausted the time I have available to help you, at least until you are...
1296 Rehan Malek
rehan_malek75
Jan 11, 2013
8:49 am
Hi Vivek :) Thank you for your quick response. I have just gone through the Cascading documents, but what should be done to get only the URLs for all fetched pages? And I...
1297 Vivek Magotra
vmagotra
Jan 13, 2013
7:44 pm
Hi Rehan, On Jan 11, 2013, at 5:49 PM, Rehan Malek <rehan_malek75@...> wrote: [snip] ... The status pipe (FetchPipe.getStatusTailPipe()) has the status...
1298 Rehan Malek
rehan_malek75
Jan 17, 2013
1:11 pm
Thank you, Vivek! I am still unable to get all the URLs associated with fetched pages. Could you please provide a Cascading workflow for getting all URLs? ...
1299 rehan_malek75 Jan 18, 2013
8:05 am
Hi all, How can I modify DemoCrawlWorkflow to get the URLs of all fetched pages? Please explain it in detail...
1300 Vivek Magotra
vmagotra
Jan 19, 2013
2:11 am
Hi Rehan, ... To get all the URLs of the fetched pages for the current loop, here's what I would do: In the createFlow() method, after you get the statusPipe,...
1301 rehan_malek75 Jan 21, 2013
8:33 am
Thanks for the response. I am working on this...
1302 rehan_malek75 Jan 21, 2013
8:33 am
Hi all, I am currently facing a problem with the status sub-folder inside the output directory. I am unable to view the status sub-folder. As such by default its...
1303 Pat Ferrel
sillyaliases...
Jan 24, 2013
4:01 am
I think Vivek added all of the NotSoSimpleCrawlTool to Bixo's DemoCrawlTool. It produces the same hadoop sequence file in each loop dir. I wrote another...
1304 Ken Krugler
kkrugler
Feb 3, 2013
3:05 am
Hi all, Just a heads-up that Lewis McGibbney has just released 0.2 of the crawler-commons library. The next release of Bixo will use this jar, since it...
1305 markatasu Feb 21, 2013
5:53 am
Hi Everyone, I'm working with an early-stage well funded stealth mode start-up in the big data analytics space – creating a unified platform that collects,...
1306 jeffjeffrsn Apr 2, 2013
12:58 pm
Hi, In the DemoCrawlTool I added a new Pipe to the tail of the parsePipe. In it I use the parsed content and the URL. Now I also need the original...
1307 Chris Schneider
schmed2000
Apr 2, 2013
2:12 pm
Hi Jeff, I am not sure what you meant when you wrote "added a new Pipe to the tail of the parsePipe". If you did add a tail pipe containing only...
1308 jeffjeffrsn Apr 2, 2013
5:02 pm
Hi Chris, Thanks for the answer. Now I'm subclassing the baseparser. Thanks, - Jeff...
1309 jeffjeffrsn Apr 2, 2013
5:22 pm
Hi Everyone, I noticed that the demo crawler stays at one domain. ... I've got the domain example.com. At this domain there are outlinks to test.example.com,...
1310 Ken Krugler
kkrugler
Apr 4, 2013
11:36 pm
Hi Vivek, I was looking at the DemoCrawlWorkflow source, and noticed this snippet: Pipe urlFromOutlinksPipe = new Pipe("url from outlinks",...
1311 Ken Krugler
kkrugler
Apr 4, 2013
11:41 pm
Hi Jeff, ... By default if you provide a -domain parameter, then URL filtering is set up such that only URLs for that domain are accepted (all other URLs are...
1312 Pat Ferrel
sillyaliases...
May 9, 2013
5:31 pm
It's been a while since I did a new build of Bixo. For some reason, though I haven't changed the code, I'm getting all sorts of test errors. I was getting an...
1313 Pat Ferrel
reallyreally...
May 9, 2013
6:00 pm
Hmm, if I comment out the test, it completes without errors. Maybe OpenDNS is the problem? On May 9, 2013, at 10:31 AM, Pat Ferrel <pat.ferrel@...> wrote: ...
1314 Pat Ferrel
reallyreally...
May 17, 2013
2:00 pm
Hi guys, I'm back to crawling Pinterest to update my experimental recommender. I created a merged miner/crawler, which was working fine if slowly. I added an...
1315 Ken Krugler
kkrugler
May 17, 2013
10:00 pm
Hi Pat, ... It will try to fetch every URL, but it will only make one HttpClient request for each URL. HttpClient will retry multiple times, and if the server...
1316 Pat Ferrel
reallyreally...
May 21, 2013
11:58 pm
Thanks. With the below settings I also changed the retry to 0 and the log level to trace. It looks like perfectly good URLs are not getting fetched. Thinking I...
1317 Ken Krugler
kkrugler
May 22, 2013
1:53 pm
Hi Pat, ... In Bixo there's a FetchAndParseTool that you can use to fetch individual URLs. I'd try that, as another way to test. Some random ideas… ...
1318 Pat Ferrel
sillyaliases...
May 22, 2013
3:06 pm
Again, thanks. You are the only one I know with crawler-fu skills. I'll take a look at the headers in the simple fetcher. I suspect that Pinterest is not...
1319 Ken Krugler
kkrugler
May 22, 2013
4:35 pm
... One other thought - we do "batch" fetching of URLs, using keep-alive to optimize the connection that we create with the server. Pinterest might not like...
1320 Pat Ferrel
sillyaliases...
May 22, 2013
4:55 pm
The miner is getting URLs that lead to 404s. So Pinterest is allowing removal of pages but leaving the links to those pages around. If leaving them in the crawldb marked...
1321 Ken Krugler
kkrugler
May 22, 2013
5:10 pm
... A 404 should result in UrlStatus.HTTP_NOT_FOUND. What you do with those entries in the crawlDB is up to your processing code. In the DemoCrawlTool, the...
1323 dennis.buroh Jun 14, 2013
2:14 pm
Hello, A short question: How does the FixedScoreGenerator class work, or where can I find the code for the class? Thank you! Best regards, Dennis...
1324 Ken Krugler
kkrugler
Jun 14, 2013
9:15 pm
Hi Dennis, ... I'm not sure what exactly you're asking, but the code is in bixo.operations.FixedScoreGenerator, and looks like: public class...
1325 dennis.buroh Jun 15, 2013
4:16 pm
Thank you very much. That's exactly what I wanted....