Re: [aseek-users] Thought you all might want to know

From: Karen Barnes (no email)
Date: Fri Sep 27 2002 - 16:23:18 EDT


Hello Gerrit,

Thanks for the reply. I have configured aspseek.conf:

DeleteBad yes

but that doesn't seem to do anything in this case. After running index for
several hours it does report:

Deleting 'deleted' records from urlword[s] ... done (8916 records deleted)

which I believe is what "DeleteBad yes" does. I think that if a URL was
indexed previously and is in the "urlword[s]" database tables, those rows
are deleted during this run of index, but ONLY if the URL was due for a
reindex; they are not deleted from the urlword table, which is a different
table than urlword[s]. The 404's and all other URLs marked by status are
in the urlword table. See the database table structure here:

http://www.aspseek.org/man/aspseek-sql.5.html

This is why I am concerned about the proper deletion of records based on
their status. So, for example, if we didn't have "DeleteBad yes" and we
did a:

./index -C -s 404

then I'll bet these records wouldn't be deleted from the urlword[s]
table[s], which I believe will ultimately corrupt things. When you have
several million URLs you may not notice a problem until, of course, it is
too late.
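
If anyone wants to see this for themselves, here is a rough sketch of the
kind of before-and-after check I mean (I'm assuming a database named
"aspseek" and the status column in urlword as described in the man page
above; adjust the names to your setup):

# count the 404 records before the delete
mysql aspseek -e "SELECT COUNT(*) FROM urlword WHERE status=404"

# delete by status
./index -C -s 404

# this count should now be 0, but it says nothing about
# leftover rows in the urlword[s] tables
mysql aspseek -e "SELECT COUNT(*) FROM urlword WHERE status=404"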

Even using a non-production set of data (which mine now is anyway), it
would be hard if not impossible to know whether everything was deleted
properly. Maybe using something like this:

./index -C -s 404

might indeed delete URLs from the urlword table and even update the deltas
so searching is not corrupted, but old data from those indexed URLs may
still be hanging around in the urlword[s] tables. Without an explanation of
exactly how this all works it is impossible to know for sure.
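
The closest I've come to checking the urlword[s] side myself is something
like this rough sketch (I'm assuming the split tables all start with
"urlwords" and share a url_id column with urlword, which is how I read the
aspseek-sql man page, so correct me if I have that wrong):

# list the split word tables
mysql aspseek -e "SHOW TABLES LIKE 'urlwords%'"

# for one of them (urlwords00 here is just an example name),
# look for rows whose url_id no longer exists in urlword,
# i.e. orphans left behind by the delete
mysql aspseek -e "SELECT COUNT(*) FROM urlwords00 w
  LEFT JOIN urlword u ON u.url_id = w.url_id
  WHERE u.url_id IS NULL"

But even a count of 0 there wouldn't prove the deltas are right, which is
why I'd still like to hear it from Kir.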

John's question was a very good one, I believe, because it addresses the
actual steps required to properly delete records based on their status. If
Kir were to answer it, that would put this subject to rest once and for
all.

I'm sure this is a question Kir knows well and can answer with a couple of
keystrokes in an email post. I also think that the more we discuss this,
the clearer the need for an answer becomes. I understand you are not
worried about 404's, but let me tell you about another problem I ran into
just the other day regarding this same situation.

I have a list of URLs that I have gathered over the years. This list
contains roughly 4 million records. The other day I was indexing a couple
of hundred thousand of them and found a site whose DNS records are still
active, but the domain's owner has moved and removed everything from the
server. In fact, there is no server to go to. This domain had over 4,000
pages of valuable information, which the owner moved to a new domain. Now
the problem...

When indexer went out to fetch each document, it first asks DNS where to
find the domain (domain-to-IP resolution). DNS replies with something like
"192.0.2.12", so index then makes a socket call to this IP and sends the
hostname and a GET request for the URL. The problem is that this IP does
not serve any pages for this domain. Now indexer hangs like crazy, waiting
90 seconds on each page request. So if you have 4,000 pages and index tries
to fetch each one of them for 90 seconds, that is 4,000 x 90 = 360,000
seconds, so your indexer will run roughly 100 HOURS for these measly 4,000
URLs! That's exactly what was happening to me.
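
What would have saved me here is weeding the dead hosts out of my list
before indexing. Just a sketch of the idea (hosts.txt is a hypothetical
file with one hostname per line; nc's -z and -w flags do a plain TCP
connect test with a short timeout):

# try a 5-second TCP connect to port 80 for each host;
# anything that fails goes to dead-hosts.txt for review
while read host; do
    nc -z -w 5 "$host" 80 || echo "$host" >> dead-hosts.txt
done < hosts.txt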

Lucky for me I didn't do this when starting index:

./index -N 50 &

or I would never have known it was hung up on this one domain. Kir's
answer was not a solution because I was already doing what he said to do:

[Kir]

The best solution is to run many threads (say, -N 50 is not that bad). If
one site will be unavailable, 1 thread will try to reach it, and other 49
threads will continue running fine, so you will have 2% indexing speed
decrease, which seems to be OK to me.

but the mathematics here is simple. If I'm indexing 100,000 URLs and those
URLs are all good, it will take roughly 4 hours. The problem is that these
4,000 bad ones will take 100 hours, as explained above. So even if index
runs with 50 threads and 49 are fetching good URLs while only 1 is fetching
the bad ones, eventually only the bad ones are left because all the good
ones are done indexing. No matter how you look at it, index will still take
100 hours to try to index these 4,000 URLs.
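
Just to spell out that arithmetic (plain shell, nothing aspseek-specific):

# 4,000 URLs at 90 seconds each, converted to hours
echo $(( 4000 * 90 / 3600 ))    # prints 100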

Well, I hope my clarifying things has made you think about how you delete
records, and I also hope Kir will put this mail thread to rest by giving us
his thoughts and solutions on these questions.

Of course, one of John's biggest concerns (and mine) in his post was the
fact that MySQL bombs out when using this:

set-variable = max_allowed_packet = 1m

and then changing this to:

set-variable = max_allowed_packet = 10m

index is then able to finish properly. We were both concerned about the
actual limits and why it requires so much RAM for a measly 2 million URLs.
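
For anyone else who hits this, you can ask the server what it is actually
using before and after the change (SHOW VARIABLES is standard MySQL; the
set-variable line goes in the [mysqld] section of my.cnf, and mysqld has
to be restarted for it to take effect):

# current limit, reported in bytes
mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet'"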

Oh well, Gerrit, you have a good weekend.

Bye for now,
Karen

>From: Gerrit Hannaert <>
>Reply-To:
>To:
>Subject: Re: [aseek-users] Thought you all might want to know
>Date: Fri, 27 Sep 2002 21:06:53 +0200
>
>Hi Karen,
>All I can say is, I have done removal of URLs using -C and limiting
>options and it seems to work for me. No, I haven't done any serious
>testing on this - because deleting 404s is really not that important to
>me. You could try -X1 and/or -X2 afterwards if you want to be sure
>everything is consistent, but again, no I can't assure that will be the
>case because I don't know.
>
>If you're scared to do it on your production db, then set up a test
>system and try it. If it doesn't work as expected, then post to the list.
>I surely didn't have the intention of making anyone look bad, I just
>thought you might get results more quickly if you tried it out yourself.
>I usually get results more quickly that way, anyway.
>
>Actually, perhaps the option you are looking for is "DeleteBad yes | no"
>in aspseek.conf. (man aspseek.conf). Yes, that would probably be the
>best thing to do.
>
>Cheers,
>
> >
> > According to you I can do the following without one bit of fear
> > something will break:
> >
> > ./index -C -s 1
> > ./index -C -s 202
> > ./index -C -s 204
> > ./index -C -s 205
> > ./index -C -s 300
> > ...
