Hi Otis, ... Well, it's a toolkit - so hooking it up for a wide/big crawl means writing some code, but there's nothing in the Bixo architecture (after a few...
Hello again, In the last few weeks, I have been thoroughly reading the documentation of Hadoop, Cascading and Bixo. I've also set up a small cluster in Nutch...
I saw this post on the Nutch list, and thought that we should verify our handling of URLs (or rather, what Tika does) is correct in regards to resolving...
I am highly impressed with this product. I downloaded and installed it on my system. I am able to run the Simple Crawler as documented in the Getting...
Mike Bowles, PhD and Patricia Hoffman, PhD are teaching a Machine Learning Class. The class will begin at the level of elementary probability and statistics...
Hi, I am creating a parser as follows: HashSet<String> tagNames = new HashSet<String>(); tagNames.add("a"); tagNames.add("img"); HashSet<String> attributes =...
Hi, I made a modification to SimpleCrawlTool and am able to extract the URLs of the images in the 1st iteration. Now I want to download the images; currently I...
Hello, at the moment I am studying Bixo. I read the slides and played with the examples and the code, but it's very complicated to understand the code without a...
Hi, Earlier I successfully created my project using the bixo-core-1.0-SNAPSHOT.jar present in the distribution. Now I am trying to move to Maven. So in...
Hi, I ran the SimpleCrawlTool with parameters set as SimpleCrawlToolOptions options = new SimpleCrawlToolOptions(); options.setAgentName("tester"); ...
Hi, I am new to Bixo, Cascading and Hadoop. I was able to run the example. I could see that various folders are created, and when I run SimpleStatusTool I get...
Hi, I run SimpleCrawlTool from Eclipse and I get this message dozens of times during one run: ERROR examples.CreateUrlDatumFromStatusFunction:83 - Unknown...
Hi, I'm a Bixo newbie and have one important question. I'd like to use Bixo as a tool for constantly monitoring a range of domains and extracting some data from...
When I try to compile Bixo, I get the following message from ant: [artifact:dependencies] Diagnosis: [artifact:dependencies] [artifact:dependencies] Unable to...
Hi! I'm trying to install Bixo but I get a failing test on: [junit] -> at bixo.operations.ProcessRobotsTask.run(ProcessRobotsTask.java:135) Is something that...
I'm trying to load OpenBixo into Eclipse basically by doing an "ant eclipse" and then in Eclipse "Import existing project". However it is not working for me -...
OK, I've gone ahead and pushed this change. Let me know if it works for you. To summarize - you should now be able to set the list of supported link tags in the...
I see that from the docs we save fetched pages to some kind of permanent store. (I am assuming it would be some kind of Hadoop based NoSQL database but don't...
Are there any more examples like SimpleCrawlTool? I've looked through the code in bixo.tools but ideally I'd like something nearer to Nutch to start from....
Hello, I'm going to implement a domain-specific crawler and am studying Bixo for this task. My problem is that web sites are few in number but they are very...
Hi everyone, I am new to bixo, and trying to use this crawler to retrieve attributes from tags other than <a href="...">. I am using the SimpleParser class, by...
Hi all, I am trying to work with the SimpleParser class to make it extract attribute values of different tags. Here is a design issue that I encountered: ...
I noticed that public static final java.lang.String CONTENT_FIELD; was changed to a private field in the latest bixo release. I use FetchedDatum.CONTENT_FIELD...
Hi All, Not sure what I'm doing wrong here. Eclipse Run Configuration: -agentname Tester -domain apple.com -numloops 2 -outputdir c:\temp\bixo Output: 10/11/02...
Hi all, A quick note about the 0.5.1 release. We've fixed the dist build and the bin/bixo script so that you should now be able to run the tool from the...
Hi all, I just did a new build of Bixo, tagged as version 0.4.8 in GitHub. Various files of interest that are available for use: - The maven artifact is in the...