Re: [aseek-devel] UCode(), RemoveSpecials() and GetCode() bug - all version

From: Kir Kolyshkin (no email)
Date: Thu Aug 01 2002 - 09:54:20 EDT


Thank you very much for bug hunting and fixing, I have included your
name in THANKS.

I have committed the first and third patches into CVS (both HEAD and the
v_1_3 branch). As for the second patch, al@ said he will develop a better fix.

Bui Quang Minh wrote:
>
> Hi all,
>
> While I was trying to add Vietnamese support to aspseek, I found the
> following 3 bugs in its Unicode support.
>
> The first one is in the way the UCode() function decodes UTF-8: it
> rejects the UTF-8 sequences corresponding to 0x1EC0 (LATIN CAPITAL LETTER E
> WITH CIRCUMFLEX AND GRAVE) and 0x1EC4 (LATIN CAPITAL LETTER E WITH
> CIRCUMFLEX AND TILDE), which are two valid Vietnamese characters. The
> problem is that '&' is bitwise AND, which is different from '&&', logical
> AND. This problem affects all versions up to and including 1.2.10.
> The patch for 1.2.10 should be:
>
> --- aspseek-1.2.10-old/include/ucharset.h Sat Jun 8 22:52:41 2002
> +++ aspseek-1.2.10/include/ucharset.h Mon Jul 29 15:59:23 2002
> @@ -667,7 +667,7 @@
> {
> if (src[1] && (src[0] & 0xF))
> {
> - if (src[2] & (src[1] & 0x3F))
> + if (src[2] && (src[1] & 0x3F))
> {
> ucode = ((src[0] & 0xF) << 12) | ((src[1] & 0x3F) << 6) | (src[2] &
> 0x3F);
> src += 3;
>
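To see why exactly these two characters are hit, here is a minimal
stand-alone sketch. Decode3() is a hypothetical helper written for this
note; it only copies the shape of the conditions in the patch above and is
not the real UCode(). The byte values are the standard UTF-8 encodings of
U+1EC0, U+1EC1 and U+1EC4.

```cpp
#include <cassert>

// Hypothetical helper (not aspseek code): decodes one 3-byte UTF-8 sequence
// using the same conditions as the patch. Returns the code point, or 0 if
// the checks reject the sequence.
static unsigned Decode3(const unsigned char* src, bool patched)
{
    if (src[1] && (src[0] & 0xF))
    {
        // Unpatched: bitwise AND of byte 3 with the masked byte 2.
        // Patched: logical AND, so both values only have to be non-zero.
        bool ok = patched ? (src[2] && (src[1] & 0x3F))
                          : ((src[2] & (src[1] & 0x3F)) != 0);
        if (ok)
        {
            return ((src[0] & 0xFu) << 12) | ((src[1] & 0x3Fu) << 6)
                   | (src[2] & 0x3Fu);
        }
    }
    return 0;
}

// The three sequences share the lead bytes E1 BB; only the last byte differs.
static const unsigned char kU1EC0[] = { 0xE1, 0xBB, 0x80 };
static const unsigned char kU1EC1[] = { 0xE1, 0xBB, 0x81 };
static const unsigned char kU1EC4[] = { 0xE1, 0xBB, 0x84 };

static void CheckExamples()
{
    // 0xBB & 0x3F == 0x3B; 0x80 & 0x3B == 0 and 0x84 & 0x3B == 0,
    // so the unpatched bitwise test rejects both characters.
    assert(Decode3(kU1EC0, false) == 0);
    assert(Decode3(kU1EC4, false) == 0);
    // 0x81 & 0x3B == 0x01, so the neighbouring U+1EC1 slips through.
    assert(Decode3(kU1EC1, false) == 0x1EC1);
    // With the logical AND all three decode as expected.
    assert(Decode3(kU1EC0, true) == 0x1EC0);
    assert(Decode3(kU1EC1, true) == 0x1EC1);
    assert(Decode3(kU1EC4, true) == 0x1EC4);
}
```

Calling CheckExamples() from a test program runs all the checks. Any
character whose third byte shares no set bit with the low six bits of its
second byte would be rejected the same way by the unpatched test.
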
> The second bug is in RemoveSpecials() in excerpts.cpp. When a word starts
> with an alphabetic character, for example, everything is OK. But if it
> starts with &#272;, for example, the excerpt and cache output will be
> abnormal. This happens because RemoveSpecials() fails to build the offmap.
> So I suggest the following patch for 1.2.10; the bug also occurs in all
> previous versions.
>
> --- aspseek-1.2.10-old/src/excerpts.cpp Sat May 11 22:27:27 2002
> +++ aspseek-1.2.10/src/excerpts.cpp Mon Jul 29 15:59:23 2002
> @@ -60,8 +60,21 @@
> {
> if (j)
> {
> + fl = 1;
> *e=' '; e++; j = 0;
> }
> +
> + if (fl)
> + {
> + fl = 0;
> + if (offmap && (last_dif != (e - dst) -i) )
> + {
> + // Insert or remove occured
> + (*offmap)[dst - e] = i;
> + last_dif = (e - dst) - i;
> + }
> + }
> +
> memcpy(e, s + i, (ps - s) - i);
> e += (ps - s) - i;
> *e = 0;
>
> And finally, the last bug is in the GetCode() function. The patch for it
> should be:
>
> --- aspseek-1.2.10-old/src/ucharset.cpp Tue Jun 18 17:57:23 2002
> +++ aspseek-1.2.10/src/ucharset.cpp Mon Jul 29 15:59:24 2002
> @@ -289,7 +289,7 @@
> {
> s++;
> WORD code = SgmlToChar((const char*&)s, 0);
> - return IsLetter(code) ? code : 0;
> + return IsLetter(code) ? TolowerU(code) : 0;
> }
> return charset->UCodeLower((const BYTE*&) s);
> };
>
> With this bug, I think that the unpatched function will not handle
> uppercase-to-lowercase conversion properly for &# entities. For example, in
> Vietnamese the lowercase of &#272; (D WITH STROKE) should be &#273;.
> One last word: the patch for the 3rd bug could be optimized further, since
> it performs the same check twice, once in IsLetter and once in TolowerU.
> The solution is to combine both, but that takes a few more lines of code.
>
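A rough sketch of what such a combined check might look like. Everything
here is an assumption made for illustration: LowerIfLetter(), LetterInfo and
the four-entry table are hypothetical, and the WORD typedef merely stands in
for whatever aspseek defines. The real IsLetter and TolowerU may be
organised quite differently.

```cpp
#include <cassert>
#include <cstdint>

typedef uint16_t WORD; // assumption: a 16-bit code unit, as the patch suggests

// Hypothetical table entry: a letter and its lowercase form.
struct LetterInfo
{
    WORD code;
    WORD lower;
};

// Tiny illustrative table; a real one would cover every supported letter.
static const LetterInfo kLetters[] = {
    { 0x0041, 0x0061 }, // LATIN CAPITAL LETTER A -> a
    { 0x0061, 0x0061 }, // LATIN SMALL LETTER A (already lowercase)
    { 0x0110, 0x0111 }, // &#272; D WITH STROKE -> &#273;
    { 0x0111, 0x0111 }, // &#273; (already lowercase)
};

// One lookup answers both questions: returns the lowercase form if code is
// a known letter, or 0 if it is not a letter at all.
static WORD LowerIfLetter(WORD code)
{
    for (const LetterInfo& li : kLetters)
    {
        if (li.code == code)
            return li.lower;
    }
    return 0;
}
```

With a helper of this shape, the patched line could become a single call,
for example `return LowerIfLetter(code);`, instead of testing the code in
IsLetter and then looking it up again in TolowerU.
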
> Any comment is welcome.
>
> Bui Quang Minh

-- ICQ UIN 7551596 Phone +7 903 6722750 --
   Guinness a Day Keeps a Doctor Away (people's wisdom)