pavuk · Pavuk Webgrabber Mailing List
Message #858 of 988
Re: [pavuk] changes and fixes for pavuk (related to last weeks CVS content)

Progress? Yes. We've got a working EXEcutable of sorts.

Results so far:

The Win32 port now includes multithreading, using pthreads-Win32 (great
stuff!), and also includes regex processing using TRE
( http://laurikari.net/tre/ ), though this feature has NOT been tested yet!

Today has been spent almost entirely hunting down those pesky
multithreading and memleak bugs, but I got them all... well, at least
enough of them to get pavuk to grab an entire MediaWiki-based site,
which features compressed transmission and a few other wicked things.
Besides that, I also tested pavuk on my own sites and found that the 'gzip'
decompression was too aggressive: my Apache will send a 'text/gzip'
mime type for any files available on the site in .gz form. Those files
should _not_ be decompressed, as they were intended to be compressed from
the start, so pavuk now checks whether the URL ends in a 'gz', 'tgz' or 'z'
(and 'Z') extension, in which case the auto-decompress is not applied.
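
To illustrate the extension check described above, here is a minimal sketch in C. It is not the code that went into pavuk: the function name `url_looks_precompressed` is hypothetical, and ignoring a trailing query string or fragment is an assumption that the message does not mention. The comparison is exact for the four spellings listed ('gz', 'tgz', 'z', 'Z').

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not pavuk's real function): returns 1 when the last
 * path segment of the URL has one of the extensions named in the message,
 * meaning the file was stored compressed on purpose and should be saved
 * as-is; returns 0 otherwise.
 */
static int url_looks_precompressed(const char *url)
{
    static const char *const exts[] = { "gz", "tgz", "z", "Z" };
    const char *end;
    const char *ext = NULL;
    const char *p;
    size_t i;

    assert(url != NULL);

    /* Assumption: anything from '?' or '#' onwards is not part of the name. */
    end = url + strcspn(url, "?#");

    /* Walk backwards through the last path segment looking for a '.'. */
    for (p = end; p > url; p--) {
        if (p[-1] == '/')
            break;              /* reached the segment start: no extension */
        if (p[-1] == '.') {
            ext = p;            /* first character after the dot */
            break;
        }
    }
    if (ext == NULL)
        return 0;

    /* Exact, case-sensitive match against the listed extensions. */
    for (i = 0; i < sizeof exts / sizeof exts[0]; i++) {
        size_t n = strlen(exts[i]);
        if ((size_t)(end - ext) == n && memcmp(ext, exts[i], n) == 0)
            return 1;
    }
    return 0;
}
```

A caller would skip the automatic gunzip step whenever this returns 1, even if the server labels the response as gzip content.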

I think this stuff is nearing completion (though the really hard part
remains ;-) = writing documentation).

I'll have to check with Dirk what he thinks about the way to go ahead,
so for now I'll provide a URL where you can grab the complete source
tree, including MSVC2005 project files and libraries (in source code
form): libtre, zlib and openssl (I use a special build of that for all
my projects).

In the archive, a 'debug build' pavuk.exe can be found in the
pavuk/pavuk_Debug/ directory. Note that it will only run in environments
where the MSVC 8.0 runtime libraries have been installed. (And then
there's that Microsoft @#$% about the embedded manifest and all. :-( )
Anyway, the build has all the libraries statically linked in (that's why
it's rather large) except the MSVC 8.0 runtime.

FYI: 'chunky' has not yet been removed entirely, though the 'chunky/DoS'
sections can be in/excluded at build time by configuring the proper
#define in config.h now.
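
As a rough sketch of what such a build-time switch can look like in C: the macro name `EXAMPLE_ENABLE_CHUNKY` below is made up for illustration, and the actual #define in pavuk's config.h may be named and used differently.

```c
/*
 * Hypothetical switch name; this would normally be set in config.h or on
 * the compiler command line. It defaults to "excluded" here.
 */
#ifndef EXAMPLE_ENABLE_CHUNKY
#define EXAMPLE_ENABLE_CHUNKY 0
#endif

/* Reports whether the optional sections were compiled into this build. */
static int chunky_sections_enabled(void)
{
#if EXAMPLE_ENABLE_CHUNKY
    return 1;   /* optional code paths are part of the binary */
#else
    return 0;   /* optional code paths were left out by the preprocessor */
#endif
}
```

Because the selection happens in the preprocessor, the excluded code is never compiled at all, which keeps it out of the binary without deleting it from the source tree.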


URLs to grab the current state of the art:

Changelog (17K) - additions at the bottom:
http://www.hobbelt.com/pavuk-temp/pavuk-20070507-a.ChangeLog.i_a.txt

Full source code + debug EXE:
http://www.hobbelt.com/pavuk-temp/pavuk-20070507-a.7z


Take care,

Ger Hobbelt


TODO:

- BUG FIXING: I still wonder why halfway down the grab there was this
completely mangled URL/path which dumped a screen full of the same
high-ASCII character. Smells bad. Smells like memory corruption if you
ask me. Hm...
- CLEANUP: take out the DBGdecl/DBGpass/DBGvars tweak (see config.h),
which was added to help me track mem leaks in MSVC debug builds.
- TESTS: create some proper test cases where pavuk is going to grab a
piece of a site. pavuk\tests\ is a local start.
- DOCUMENTATION: update documentation (manpage, ...?) & clean up the
original [b]log html pages which I wrote when I was on a mission in
2005/2006.
- CODE MERGE: decide on a way to take out 'chunky/DoS' in such a way
that I can merge it back into pavuk when the need arises - I sometimes
use this tool to test the websites I develop for clients too.
- ...






Mon May 7, 2007 1:58 am

i_a42

