Board index » delphi » Looking for an algorithm...

Looking for an algorithm...

On Fri, 17 Jan 1997 23:07:18 -0600, "Alan M. Evans"

Quote
<A.Michael.Ev...@airmail.net> wrote:
>People Unclear on the Concept All-Time Grand Prize Winner:

>ro...@ix.netcom.com wrote:

>> P.S. this is for a C++ program.

And off topic too...

Glenn Grotzinger
Web Page: http://www.geocities.com/Paris/3537
Writer of the Excellent Training Manual known as the TP Tutorial.
To email, if you hit the reply button, delete the {remove_this}
out of the replied message.  Just some protection from SPAM. :(

 

Re:Looking for an algorithm...


People Unclear on the Concept All-Time Grand Prize Winner:

Quote
ro...@ix.netcom.com wrote:
> P.S. this is for a C++ program.

Re:Looking for an algorithm...


Here is the scenario:
   A huge amount of data is being transferred  (in the form of string
records). This data must be recorded in special files, with the
condition that all records must be unique  (the records transferred
are not unique records).
  What I am looking for, is a way to check that a given record is
unique before recording (this should happen run time). One bad
solution, is before recording the string, go through the file
comparing it to all the others. This is unacceptable, because it would
take too long.
  A better solution is as follows:
    Allocate a large chunk of memory - let's call it STORAGE. (We will
use this STORAGE as a set of bits.) As soon as a new record is
transferred, by some function a number (KEY) will be generated for
this string. This KEY is going to be the bit number in our STORAGE. We
will go to that bit in STORAGE and see if it is 0. If so, that means
that our string is unique (and we can easily record it). If the bit is
1, that would mean that this string was transferred before, or another
string generated the same KEY.

What I am looking for is this function that would take in a string
(about 20 characters) and generate a number between 0 and 4 mill.
Obviously, with the hope that it would be very unlikely for two
different strings to produce the same KEY.

Any suggestions?

Thank you very much in advance.

P.S. this is for a C++ program.

Re:Looking for an algorithm...


Quote
On Sat, 18 Jan 1997 05:06:32 GMT, ro...@ix.netcom.com wrote:
>  A better solution is as follows:
>    Allocate a large chunk of memory - let's call it STORAGE. (We will
>use this STORAGE as a set of bits.) As soon as a new record is
>transferred, by some function a number (KEY) will be generated for
>this string. This KEY is going to be the bit number in our STORAGE. We
>will go to that bit in STORAGE and see if it is 0. If so, that means
>that our string is unique (and we can easily record it). If the bit is
>1, that would mean that this string was transferred before, or another
>string generated the same KEY.
>Any suggestions?

This doesn't sound very promissing as well; you will still need to look through
the whole list of keys and even then you're not completely sure. What you should
do in cases like this is to create a double or multi linked list of the records
and sort them, just like any other database program does. When the list is
sorted your problem become trival. You might think that sorting would take a lot
of time, but that is certainly not always true, there are many fast ways of
sorting a double linked list (books!) but my favourite way is to sort it when I
am creating it.

If sorting is out of the question, for some really odd reason, then you have a
problem, however creating a KEY for strings should not be to hard, simply add or
multiply all charachers and store the lowest 32 bit of the summation or
multiplication in the KEY. I'm not sure any other algorithm would perform very
much better in the end since you are loosing information anyway. No a very
promising method, don't do it!

Quote
>P.S. this is for a C++ program.

C++ programs don't belong in the turbo pascal newsgroup.

I hope that this will help you a bit,

Peter de Jong,
wpdej...@worldonline.nl

Re:Looking for an algorithm...


Quote
ro...@ix.netcom.com wrote:

> Here is the scenario:

...

Quote
>     Allocate a large chunk of memory - let's call it STORAGE. (We will
> use this STORAGE as a set of bits.) As soon as a new record is
> transferred, by some function a number (KEY) will be generated for
> this string. This KEY is going to be the bit number in our STORAGE. We
> will go to that bit in STORAGE and see if it is 0. If so, that means
> that our string is unique (and we can easily record it). If the bit is
> 1, that would mean that this string was transferred before, or another
> string generated the same KEY.

> What I am looking for is this function that would take in a string
> (about 20 characters) and generate a number between 0 and 4 mill.
> Obviously, with the hope that it would be very unlikely for two
> different strings to produce the same KEY.

What you're describing sounds similar to a hashing algorithm.  
However, you are requiring a hashing function that will completely
avoid collisions (i.e., no two distinct strings produce the
same key).  I don't think this is possible given your problem
description.  

Your strings are 20 characters in length and I'll assume that
you allow the standard English letters A..Z for each character
(ignoring case).  Since each character can assume any of 26
different values there are precisely 26^20 (~1.99 x 10^28)
possible string combinations.  This is a staggeringly huge
number!  If you use one bit for every string, you'll need
~1.9 x 10^22 Megabytes (I don't think anybody makes a
motherboard that holds that many SIMMs).

Now, you are asking for a function that will map 26^20 distinct
strings onto the range 0..4 million.  The Pigeon-Hole Principle
tells us that this cannot be done without collisions.  In fact,
any such mapping will produce an enormous number of collisions.

In practice, you probably will not encounter all 26^20 distinct
strings.  Therefore, you could get by with a smaller amount of
storage.  Even still, any algorithm that maps a given domain
onto a smaller range, must provide a means to handle collisions.

Why are you using this type of approach?  Why not use a simple
database system, like Paradox, dBase, or MS Access, and create
a table the uses these 20 character strings as a primary key.
Then simply insert the records into this keyed table.  The key
violations will prevent duplicates.

If for some reason (like this is really a homework assignment),
you must use the approach your were suggesting, then you'd better
do a bit of research into the subject of hashing.  Check out
Donald Knuth's book on Searching and Sorting.

Quote
> Any suggestions?

> Thank you very much in advance.

> P.S. this is for a C++ program.

It doesn't matter what language you are using for this problem.  The
algorithm is completely language independent.

Hope this helps,
Craig

--
-----------------------------------------------------------------------------
Craig Jackson            
lnusgmr.vzk...@eds.com
Electronic Data Systems

Other Threads