URL Canonicalization regex to strip sessionIds
Message #1611 of 6178
Re: [archive-crawler] Re: URL Canonicalization regex to strip sessionIds

Hi,

thanks for your answer.  The regex you suggested still didn't quite work, so I put some debugging output in org.archive.crawler.url.canonicalize.RegexRule.canonicalize() and figured out what was going on.  There were three problems:

1) You need to enter the regex into the settings page of the admin GUI without quotes.  In retrospect this seems obvious, but I suggest you change section 6.2.1.1 of the user manual to say that the expressions shown in quotes on the manual page should be entered into the admin GUI without the quotes.

2) The $ and ? escaping needs to be done with a single backslash.  (I figured this out by printing the regex that was being passed to TextUtils.getMatcher(): when I entered two backslashes, two backslashes were passed to TextUtils.getMatcher(), but you need one backslash at that level.)  I think maybe you were thinking of how you'd format this string if it were written inside Java code, rather than entered via the admin GUI.

3) I needed two different overrides to make things work for this case: one for URLs that include a sessionId without a catid, and one for URLs that include a sessionId with a catid.  I needed this because URLs without a catid would either not match the regex (if I required 1 or more of the catid part of the pattern), or, if I allowed 0 or more of the catid part of the pattern, they would match but then cause a "null" to be put into the canonicalized URL where the ${2} was in my replacement pattern.  The two patterns I ended up using were:
    ^(.+)(?:;\$sessionid\$[A-Za-z0-9]{32})$
    with the format string ${1}
    to canonicalize URLs like http://joann.com/search/search_results.jhtml;$sessionid$kd0pc4qaakvxip4sy5lrhor50ld3uepo

and
    ^(.+)(?:;\$sessionid\$[A-Za-z0-9]{32})(\?.*)$
    with the format string ${1}${2}
    to canonicalize URLs like http://joann.com/catalog.jhtml;$sessionid$kd0pc4qaakvxip4sy5lrhor50ld3uepo?catid=5
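
For anyone who wants to check these patterns outside the crawler, here is a minimal sketch using plain java.util.regex. It is not Heritrix code: the class name, the strip() and canonicalize() helpers, and the two extra patterns (OPTIONAL_QUERY and DOUBLE_ESCAPED) are made up for illustration, and strip() only approximates the ${1} / ${1}${2} format strings by concatenating the capturing groups. Note that inside a Java string literal every backslash is doubled, which is the opposite of what the admin GUI wants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SessionIdStripSketch {
    // Same regex text as typed into the admin GUI (single backslashes);
    // the doubling below is only Java string-literal escaping.
    public static final Pattern NO_QUERY =
        Pattern.compile("^(.+)(?:;\\$sessionid\\$[A-Za-z0-9]{32})$");
    public static final Pattern WITH_QUERY =
        Pattern.compile("^(.+)(?:;\\$sessionid\\$[A-Za-z0-9]{32})(\\?.*)$");

    // Illustration only: a single pattern with an optional query group.
    public static final Pattern OPTIONAL_QUERY =
        Pattern.compile("^(.+)(?:;\\$sessionid\\$[A-Za-z0-9]{32})(\\?.*)?$");

    // Illustration only: the regex text with TWO backslashes before $ and ?.
    // Here "\\$" means "a literal backslash, then end of input".
    public static final Pattern DOUBLE_ESCAPED =
        Pattern.compile("^(.+)(?:;\\\\$sessionid\\\\$[A-Za-z0-9]{32})(\\\\?.*)$");

    // Hypothetical helper: on a full match, concatenate every capturing
    // group; otherwise return the URL unchanged.
    public static String strip(Pattern pattern, String url) {
        Matcher m = pattern.matcher(url);
        if (!m.matches()) {
            return url;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            sb.append(m.group(i)); // a group that did not participate is null
        }
        return sb.toString();
    }

    // Apply both rules in turn, as two separate overrides would.
    public static String canonicalize(String url) {
        return strip(WITH_QUERY, strip(NO_QUERY, url));
    }

    public static void main(String[] args) {
        String plain = "http://joann.com/search/search_results.jhtml"
            + ";$sessionid$kd0pc4qaakvxip4sy5lrhor50ld3uepo";
        String withCatid = "http://joann.com/catalog.jhtml"
            + ";$sessionid$kd0pc4qaakvxip4sy5lrhor50ld3uepo?catid=5";

        System.out.println(canonicalize(plain));
        System.out.println(canonicalize(withCatid));
        // Optional group: matches, but appending the null group adds "null".
        System.out.println(strip(OPTIONAL_QUERY, plain));
        // Double-escaped text never matches these URLs.
        System.out.println(DOUBLE_ESCAPED.matcher(withCatid).matches());
    }
}
```

The OPTIONAL_QUERY line shows the "null" problem from point 3: the pattern matches the URL without a catid, but group 2 is null and gets appended as the text "null". The DOUBLE_ESCAPED line shows point 2: with two backslashes at the regex level, the pattern asks for a literal backslash in the URL and so never matches.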

One other thing I noticed while debugging this: at one point I was confused about whether to use ${2} or ${3} when I had a capturing group followed by a non-capturing group followed by another capturing group.  When I used ${3} it silently started skipping URLs instead of crawling them, which was happening because the call to org.archive.crawler.url.canonicalize.RegexRule.format() was getting an IndexOutOfBoundsException inside java.util.regex.Matcher.group().  You might want to add a try/catch around the call:
    buffer.append(matcher.group(groupIndex));
and then log something about this to the log file.  I know this was a config error on my part, but having it fail silently made it harder to figure out what was wrong.
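
To make the suggestion concrete, something along these lines is what I have in mind. This is only a sketch, not the actual RegexRule source: the class, the appendGroup() helper, and the use of java.util.logging are assumptions for illustration, and skipping a null group is my own choice here rather than something I know the crawler does.

```java
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;

public class SafeGroupAppendSketch {
    private static final Logger LOGGER =
        Logger.getLogger(SafeGroupAppendSketch.class.getName());

    // Hypothetical stand-in for the append inside format(). Assumes the
    // matcher has already matched successfully. Returns false (and logs)
    // when the format string refers to a group the pattern does not have.
    public static boolean appendGroup(StringBuffer buffer, Matcher matcher,
            int groupIndex) {
        try {
            String value = matcher.group(groupIndex);
            if (value != null) {
                // Assumption: skip groups that did not participate in the
                // match instead of appending the text "null".
                buffer.append(value);
            }
            return true;
        } catch (IndexOutOfBoundsException e) {
            LOGGER.log(Level.WARNING,
                "Format refers to group " + groupIndex + " but pattern "
                + matcher.pattern().pattern() + " has only "
                + matcher.groupCount() + " capturing group(s)", e);
            return false;
        }
    }
}
```

Non-capturing groups written as (?:...) are not counted, so with a capturing, non-capturing, capturing sequence the valid indexes are 1 and 2, and asking for 3 is what throws.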

Thanks
 - Mike


At 08:16 PM 2/22/2005, stackarchiveorg wrote:

--- In archive-crawler@yahoogroups.com, "stackarchiveorg" <stack@a...>
wrote:
> Try this for the matching regex --
> "^(.+)(?:\\$sessionid\\$[A-Z0-9]{32})(\\?.*)+$" -- and this for the
> formatting expression: "$1$1".
>
Upon review, you probably want to include the ';' in the non-capturing
group and the formatting expression should be '$1$2', not '$1$1'.

St.Ack





Wed Feb 23, 2005 10:08 pm

mfschwartz

Messages in this thread:

mfschwartz, Feb 22, 2005 5:25 pm
hi, I encountered a site (http://joann.com) that generates sessionIds that don't get stripped by the URL Canonicalization rules that come in the default...

stackarchiveorg, Feb 23, 2005 2:39 am
... Try this for the matching regex -- "^(.+)(?:\\$sessionid\\$[A-Z0-9]{32})(\\?.*)+$" -- and this for the formatting expression: "$1$1". (You don't seem to...

stackarchiveorg, Feb 23, 2005 3:16 am
... Upon review, you probably want to include the ';' in the non-capturing group and the formatting expression should be '$1$2', not '$1$1'. St.Ack...

Mike Schwartz (mfschwartz), Feb 23, 2005 10:18 pm
Hi, thanks for your answer. The regex you suggested still didn't quite work, so I put some debugging output in ...

stack (stackarchiveorg), Feb 28, 2005 7:48 pm
... Done. ... Yes. I should have been more clear. ... Thanks for the above. I added logging of the IndexOutOfBoundsException as suggested and added an append...
