Thanks for your answer. The regex you suggested still didn't quite work, so I put some debugging output in org.archive.crawler.url.canonicalize.RegexRule.canonicalize() and figured out what was going on. There were three problems:
1) you need to enter the regex into the settings page on the admin GUI without quotes. In retrospect this seems obvious, but I suggest you change section 6.2.1.1. of the user manual to say that you should put the expression listed in quotes on the manual page into the admin GUI without quotes.
2) the $ and ? escaping needs to be done with a single backslash. (I figured this out by printing the regex that was being passed to TextUtils.getMatcher(): when I entered two backslashes, two backslashes were passed to TextUtils.getMatcher(), but you need one backslash at that level.) I think maybe you were thinking of how you'd format this string if it were written inside some Java code, rather than entered via the admin GUI.
3) I needed two different overrides to make things work for this case: one for URLs that included a sessionId without a catid, and one for URLs that had a sessionId with a catid. I needed this because URLs without a catid would either not match the regex (if I required one or more occurrences of the catid part of the pattern) or, if I allowed zero or more occurrences, would match but then cause a "null" to be put into the canonicalized URL where the ${2} was in my replacement pattern. The two patterns I ended up using were:
^(.+)(?:;\$sessionid\$[A-Za-z0-9]{32})$
with the format string ${1}
to canonicalize URLs like http://joann.com/search/search_results.jhtml;$sessionid$kd0pc4qaakvxip4sy5lrhor50ld3uepo
and
^(.+)(?:;\$sessionid\$[A-Za-z0-9]{32})(\?.*)$
with the format string ${1}${2}
to canonicalize URLs like http://joann.com/catalog.jhtml;$sessionid$kd0pc4qaakvxip4sy5lrhor50ld3uepo?catid=5
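For anyone who wants to check these two rules outside the crawler, here is a minimal sketch using plain java.util.regex. It is not Heritrix's RegexRule code: the class and method names are made up for illustration, and the ${1}${2} format strings are replaced by direct calls to Matcher.group(). Note that the backslashes are doubled only because the patterns sit inside Java string literals; in the admin GUI they are single. The third pattern is my guess at the "zero or more" variant described above and is there only to reproduce the "null" problem.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Heritrix.
public class SessionIdCanonicalizer {
    // Java source doubles each backslash; the admin GUI takes \$ and \? as-is.
    static final Pattern NO_QUERY =
        Pattern.compile("^(.+)(?:;\\$sessionid\\$[A-Za-z0-9]{32})$");
    static final Pattern WITH_QUERY =
        Pattern.compile("^(.+)(?:;\\$sessionid\\$[A-Za-z0-9]{32})(\\?.*)$");
    // Assumed single-pattern attempt: the query group may match zero times.
    static final Pattern OPTIONAL_QUERY =
        Pattern.compile("^(.+)(?:;\\$sessionid\\$[A-Za-z0-9]{32})(\\?.*)*$");

    /** Applies the two rules in turn; returns the URL unchanged if neither matches. */
    static String canonicalize(String url) {
        Matcher m = NO_QUERY.matcher(url);
        if (m.matches()) {
            return m.group(1);               // equivalent of format string ${1}
        }
        m = WITH_QUERY.matcher(url);
        if (m.matches()) {
            return m.group(1) + m.group(2);  // equivalent of ${1}${2}
        }
        return url;
    }

    /** Reproduces the "null" problem with the optional-group pattern. */
    static String canonicalizeWithOptionalGroup(String url) {
        Matcher m = OPTIONAL_QUERY.matcher(url);
        if (!m.matches()) {
            return url;
        }
        StringBuilder buffer = new StringBuilder();
        buffer.append(m.group(1));
        // group(2) is null when the query part is absent, and
        // StringBuilder.append(null) writes the four characters "null".
        buffer.append(m.group(2));
        return buffer.toString();
    }

    public static void main(String[] args) {
        String sid = ";$sessionid$kd0pc4qaakvxip4sy5lrhor50ld3uepo";
        System.out.println(canonicalize("http://joann.com/search/search_results.jhtml" + sid));
        System.out.println(canonicalize("http://joann.com/catalog.jhtml" + sid + "?catid=5"));
        System.out.println(canonicalizeWithOptionalGroup(
            "http://joann.com/search/search_results.jhtml" + sid));
    }
}
```

The two rules cannot both match the same URL, since the first requires the session id to end the string and the second requires a "?" right after it, so the order in which they are tried does not matter here.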
One other thing I noticed while debugging this: at one point I was confused about whether to use ${2} or ${3} when I had a capturing group followed by a non-capturing group followed by another capturing group. When I used ${3}, the crawler silently started skipping URLs, which was happening because the call to org.archive.crawler.url.canonicalize.RegexRule.format() was getting an IndexOutOfBoundsException inside java.util.regex.Matcher.group(). You might want to add a try/catch around the call:
buffer.append(matcher.group(groupIndex));
and then log something about this to the log file. I know this was a config error on my part, but having it fail silently made it harder to figure out what was wrong.
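Something along these lines is what I have in mind. This is only a sketch under my own assumptions: the class name, the appendGroup() helper and the use of java.util.logging are all hypothetical, and the real RegexRule.format() may be structured quite differently. It also logs the case where the group exists but did not take part in the match, which is what produced the "null" text mentioned earlier.

```java
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the actual Heritrix code.
public class SafeGroupAppend {
    private static final Logger LOGGER =
        Logger.getLogger(SafeGroupAppend.class.getName());

    /**
     * Appends the given group to the buffer. Returns false, after logging,
     * if the group index does not exist in the pattern or the group did
     * not participate in the match.
     */
    static boolean appendGroup(StringBuilder buffer, Matcher matcher, int groupIndex) {
        try {
            String value = matcher.group(groupIndex);
            if (value == null) {
                LOGGER.warning("Group " + groupIndex + " did not participate in the match for "
                    + matcher.pattern().pattern());
                return false;
            }
            buffer.append(value);
            return true;
        } catch (IndexOutOfBoundsException e) {
            // Non-capturing groups (?:...) are not counted, so ${3} is out of
            // range for a pattern with only two capturing groups.
            LOGGER.log(Level.WARNING, "Format string refers to group " + groupIndex
                + " but the pattern has only " + matcher.groupCount()
                + " capturing group(s): " + matcher.pattern().pattern(), e);
            return false;
        }
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^(.+)(?:;\\$sessionid\\$[A-Za-z0-9]{32})(\\?.*)$");
        Matcher m = p.matcher(
            "http://joann.com/catalog.jhtml;$sessionid$kd0pc4qaakvxip4sy5lrhor50ld3uepo?catid=5");
        if (m.matches()) {
            StringBuilder buffer = new StringBuilder();
            appendGroup(buffer, m, 1);
            appendGroup(buffer, m, 3); // logs a warning instead of failing silently
            System.out.println(buffer);
        }
    }
}
```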
Thanks
- Mike
At 08:16 PM 2/22/2005, stackarchiveorg wrote:
--- In archive-crawler@yahoogroups.com, "stackarchiveorg" <stack@a...>
wrote:
> Try this for the matching regex --
> "^(.+)(?:\\$sessionid\\$[A-Z0-9]{32})(\\?.*)+$" -- and this for the
> formatting expression: "$1$1".
>
Upon review, you probably want to include the ';' in the non-capturing
group and the formatting expression should be '$1$2', not '$1$1'.
St.Ack