Progress? Yes. We've got a working EXEcutable of sorts.
Results so far:
The win32 port now includes multithreading, using pthreads-Win32 (great
stuff!), and also includes regex processing (though this feature has NOT
been tested yet!) using TRE ( http://laurikari.net/tre/ ).
Today has been spent almost entirely hunting down those pesky
multithreading and memleak bugs, but I got them all... well, at least
enough of them to get pavuk to grab an entire MediaWiki-based site,
which features compressed transmission and a few other wicked things.
Besides that, I also tested pavuk on my own sites and found the 'gzip'
decompression was too aggressive: my apache will send a 'text/gzip'
mime type for any files available on the site in .gz form. Those files
should _not_ be decompressed, as they were meant to be compressed from
the start, so pavuk now checks whether the URL ends in a 'gz', 'tgz' or
'z' (or 'Z') extension, in which case the auto-decompress is not applied.
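
For illustration, the extension check described above could look roughly
like the sketch below. This is not the actual pavuk code: the function name
url_is_precompressed() is made up, and two details are assumptions of the
sketch (the comparison is case-insensitive for all three extensions, and a
query string or fragment after the path is ignored).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT pavuk's real function. Returns 1 when the URL
 * path ends in an extension that marks an already-compressed file, so
 * the caller should skip auto-decompression; returns 0 otherwise. */
int url_is_precompressed(const char *url)
{
    static const char *const exts[] = { "gz", "tgz", "z" };
    size_t len = strcspn(url, "?#");  /* assumption: ignore query/fragment */
    size_t i = len;
    size_t k;

    /* walk back to the last '.' in the final path segment */
    while (i > 0 && url[i - 1] != '.' && url[i - 1] != '/')
        i--;
    if (i == 0 || url[i - 1] != '.')
        return 0;                     /* no extension at all */

    for (k = 0; k < sizeof exts / sizeof exts[0]; k++) {
        size_t n = strlen(exts[k]);
        size_t j;

        if (n != len - i)
            continue;                 /* extension length differs */
        for (j = 0; j < n; j++)
            if (tolower((unsigned char)url[i + j]) != exts[k][j])
                break;
        if (j == n)
            return 1;                 /* assumption: case-insensitive match */
    }
    return 0;
}
```

A caller would test the URL first and only run the gzip decoder when the
function returns 0, so that a .tar.gz download stays compressed on disk.
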
I think this stuff is nearing completion (though the really hard part
remains: writing documentation ;-) ).
I'll have to check with Dirk what he thinks about the way to go ahead,
so for now I'll provide a URL where you can grab the complete source
tree, including MSVC2005 project files and libraries (in source code
form): libtre, zlib and openssl (I use a special build of that for all
my projects).
In the archive, a 'debug build' pavuk.exe can be found in the
pavuk/pavuk_Debug/ directory. Note that it will only run in environments
where the MSVC 8.0 runtime libraries have been installed. (And then
there's that Microsoft @#$% about the embedded manifest and all. :-( )
Anyway, the build has all the libraries statically linked in (that's why
it's rather large) except the MSVC 8.0 runtime.
FYI: 'chunky' has not yet been removed entirely, though the 'chunky/DoS'
sections can be in/excluded at build time by configuring the proper
#define in config.h now.
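
As a rough illustration of such a build-time switch, a minimal sketch
follows. The macro name ENABLE_CHUNKY_DOS and the function
chunky_dos_compiled_in() are hypothetical; check config.h in the archive
for the real #define.

```c
/* config.h fragment: ENABLE_CHUNKY_DOS is a hypothetical name used only
 * for this sketch. Set it to 1 to compile the sections in. */
#define ENABLE_CHUNKY_DOS 0

/* Reports whether the optional sections were compiled into this build. */
int chunky_dos_compiled_in(void)
{
#if ENABLE_CHUNKY_DOS
    return 1;   /* optional code is part of the build */
#else
    return 0;   /* optional code was excluded by the preprocessor */
#endif
}
```

With the macro set to 0 the guarded code never reaches the compiler, so it
costs nothing in the resulting binary.
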
URLs to grab the current state of the art:
Changelog (17K) - additions at the bottom:
http://www.hobbelt.com/pavuk-temp/pavuk-20070507-a.ChangeLog.i_a.txt
full sourcecode + debug EXE:
http://www.hobbelt.com/pavuk-temp/pavuk-20070507-a.7z
Take care,
Ger Hobbelt
TODO:
- BUG FIXING: I still wonder why halfway down the grab there was this
completely mangled URL/path which dumped a screen full of the same
high-ASCII character. Smells bad. Smells like memory corruption if you
ask me. Hm...
- CLEANUP: take out the DBGdecl/DBGpass/DBGvars tweak (see config.h),
which I added to help me track mem leaks in MSVC debug builds.
- TESTS: create some proper test cases where pavuk is going to grab a
piece of a site. pavuk\tests\ is a local start.
- DOCUMENTATION: update documentation (manpage, ...?) & clean up the
original [b]log html pages which I wrote when I was on a mission in
2005/2006.
- CODE MERGE: decide how to take out 'chunky/DoS' in such a way that I
can merge it back into pavuk when the need arises - I sometimes use
this tool to test the websites I develop for clients too.
- ...