
Re: Fastest TStringList compatible Hashtable?


2007-10-09 10:50:12 PM
Rolf Lampa [RIL] writes:
Quote
>Doesn't seem to be the right one either.

It's far more appropriate than non-tech. ;-)
--
Nick Hodges
Delphi Product Manager - CodeGear
blogs.codegear.com/nickhodges
 
 

Re: Fastest TStringList compatible Hashtable?

Rudy Velthuis [TeamB] wrote:
Quote
Rolf Lampa [RIL] writes:

>Just subscribed to ... delphi.general, got no posts, dropped it again.

borland.public.delphi.language.delphi.general is surely not empty.
I eventually found the right one and, surprise, it wasn't empty! =)
Regards,
// Rolf Lampa
 

Re: Fastest TStringList compatible Hashtable?

John Herbster wrote:
Quote
"Rolf Lampa [RIL]" <XXXX@XXXXX.COM>wrote

>>>I don't need the string data in the hashlist which it indexes,
>>>I only need the indexes, and the object property.

>Rolf,
>
>Does this problem have unique object values for each string?

Yes. And the strings are unique too. It is actually a title list for a
MediaWiki database, where all titles must be unique (CaseSensitive).
For an example of how, exactly, it will be used, see the code far below *.
Quote

>>A hash is not unique to each string. When two strings
>>produce the same hash, that is called a "collision",
>>and the lookup code then performs further lookups to
>>find the correct string.

>>Therefore, hash table lookup requires access to the
>>original strings.

>Ouch... :(

Some programmers believe (some successfully) that unique
128-bit (16-byte) digital digests can be made for each string
of any collection of unique strings. So if your strings are much
longer, then you might be able to substitute a check against
a saved digital digest rather than a check against the saved
string.
The average title seems to occupy about 18.8 bytes (UTF-8).
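The check itself would be cheap in Delphi: keep a 16-byte digest per
title and compare digests with CompareMem. A minimal sketch follows;
the Digest128 stand-in (four seeded FNV-1a hashes) is hypothetical,
and a real implementation would use MD5 or another 128-bit digest.

{$OVERFLOWCHECKS OFF}
uses
  SysUtils;

type
  TDigest128 = array[0..3] of LongWord; // 128 bits = 16 bytes

// Hypothetical stand-in digest: four FNV-1a hashes, different seeds.
function Digest128(const S: UTF8String): TDigest128;
const
  Seeds: array[0..3] of LongWord =
    ($811C9DC5, $01000193, $DEADBEEF, $0BADF00D);
var
  i, k: Integer;
  h: LongWord;
begin
  for k := 0 to 3 do
  begin
    h := Seeds[k];
    for i := 1 to Length(S) do
      h := (h xor Byte(S[i])) * 16777619; // FNV-1a step
    Result[k] := h;
  end;
end;

// Compare 16 fixed bytes instead of a full variable-length title.
function SameDigest(const A, B: TDigest128): Boolean;
begin
  Result := CompareMem(@A, @B, SizeOf(TDigest128));
end;

At about 18.8 bytes per title, though, a 16-byte digest saves little
unless the per-string heap overhead is dropped as well.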
Hm, I wonder how far one could get by storing the strings in an
a/b/c/d/e/f structure, to some depth, and storing only the remnant
string with its leading part cut off (see the sketch below).
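A rough sketch of that idea, assuming a fixed two-character split and
plain sorted TStringLists for the buckets; the class and constant
names are hypothetical, and a real trie would go deeper than one
fixed prefix level.

uses
  Classes;

const
  PrefixLen = 2; // depth of the a/b/c... split (assumption)

type
  TPrefixIndex = class
  private
    FBuckets: TStringList; // prefix -> TStringList of remnant strings
  public
    constructor Create;
    destructor Destroy; override;
    procedure Add(const aTitle: string; aObj: TObject);
    function Find(const aTitle: string): TObject;
  end;

constructor TPrefixIndex.Create;
begin
  inherited Create;
  FBuckets := TStringList.Create;
  FBuckets.Sorted := True;        // binary search over the prefixes
  FBuckets.CaseSensitive := True;
end;

destructor TPrefixIndex.Destroy;
var
  i: Integer;
begin
  for i := 0 to FBuckets.Count - 1 do
    FBuckets.Objects[i].Free;     // free the per-prefix bucket lists
  FBuckets.Free;
  inherited Destroy;
end;

procedure TPrefixIndex.Add(const aTitle: string; aObj: TObject);
var
  i: Integer;
  Bucket: TStringList;
begin
  if not FBuckets.Find(Copy(aTitle, 1, PrefixLen), i) then
  begin
    Bucket := TStringList.Create;
    Bucket.Sorted := True;
    Bucket.CaseSensitive := True;
    i := FBuckets.AddObject(Copy(aTitle, 1, PrefixLen), Bucket);
  end;
  // Only the remnant after the shared prefix is stored.
  TStringList(FBuckets.Objects[i]).AddObject(
    Copy(aTitle, PrefixLen + 1, MaxInt), aObj);
end;

function TPrefixIndex.Find(const aTitle: string): TObject;
var
  bi, i: Integer;
  Bucket: TStringList;
begin
  Result := nil;
  if not FBuckets.Find(Copy(aTitle, 1, PrefixLen), bi) then
    Exit;
  Bucket := TStringList(FBuckets.Objects[bi]);
  if Bucket.Find(Copy(aTitle, PrefixLen + 1, MaxInt), i) then
    Result := Bucket.Objects[i];
end;

Note that keeping the bucket lists sorted has its own insertion cost
over millions of adds, which is exactly the trade-off discussed later
in this thread.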
Regards,
// Rolf Lampa
{ --------------------------------------------------------
  Code example demonstrating the use of the hashed index:
  -------------------------------------------------------- }
function XmlPageFromDiskByTitle(const aTitle: String): String;
// Returns a page in xml format, directly from disk.
var
  Index: Integer;
  PageObj: TMediawikiPage;
  Locator: TMediawikiPageLocator;
begin
  Result := '';
  Index := DiskTitles.IndexOf(aTitle); // DiskTitles: THashedStringList
  if Index > -1 then
  begin
    // The locator holds the (stream) FromPos & ToPos
    Locator := TMediawikiPageLocator( DiskTitles.Objects[Index] );
    // Creates the page object and loads the xml data from disk
    PageObj := Locator.EnsuredObject;
    if PageObj <> nil then
      Result := PageObj.AsXml;
  end;
end;
 

Re: Fastest TStringList compatible Hashtable?

Matthew Jones wrote:
Quote
I used EZDSL for this sort of thing. ... at
www.boyet.com/FixedArticles/EZDSL.html
Looks interesting. I downloaded it and I will probably try it out.
Regards,
// Rolf Lampa
 

Re: Fastest TStringList compatible Hashtable?

Nick Hodges (CodeGear) wrote:
Quote
Rolf Lampa [RIL] writes:

>Doesn't seem to be the right one either.

It's far more appropriate than non-tech. ;-)
OK, OK, I got it.
But I got help, and that is what really counts. :b
Regards,
// Rolf Lampa
 

Re: Fastest TStringList compatible Hashtable?

Quote
Hm, I wonder how far one could get by storing the strings in an
a/b/c/d/e/f structure, to some depth, and storing only the remnant
string with its leading part cut off.
You're building a wiki-monster or what :P Just want to add two
options: a btree would be good (a variant on the a/b/c..,
uncomfortable to implement IMO), or a TClientDataSet (not sure if
it's good in timings), without the usual index hassles (fast to load
and save with the standard functions, and you can include the hash
key). A sketch of the TClientDataSet route follows below.
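A minimal sketch of the TClientDataSet option, using the Delphi 7-era
Db/DBClient units; the field names and sample values are hypothetical.

uses
  SysUtils, Db, DBClient;

procedure ClientDataSetIndexDemo;
var
  CDS: TClientDataSet;
begin
  CDS := TClientDataSet.Create(nil);
  try
    CDS.FieldDefs.Add('Title', ftString, 255);
    CDS.FieldDefs.Add('Idx', ftInteger);
    CDS.CreateDataSet;
    CDS.IndexFieldNames := 'Title';      // live index, no manual sorting
    CDS.AppendRecord(['Anarchism', 0]);  // add title/index pairs
    if CDS.FindKey(['Anarchism']) then   // indexed lookup by title
      WriteLn('Found, Idx = ', CDS.FieldByName('Idx').AsInteger);
    CDS.SaveToFile('titles.cds');        // the standard save function
  finally
    CDS.Free;
  end;
end;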
 

Re: Fastest TStringList compatible Hashtable?

Marius wrote:
Quote
>Hm, I wonder how far one could get by storing the strings in an
>a/b/c/d/e/f structure, to some depth, and storing only the remnant
>string with its leading part cut off.

You're building a wiki-monster or what :P
I'm making a tool for processing MediaWiki xml dumps.
As a test example, scanning the English Wikipedia dump (as of
2007-10-09) and creating a disk index for it, I get the following
results on a single-CPU, 1.8 GHz, 2 GB Dell laptop (from the log):
* Done initializing 5654236 locators (time: 00:00:16).
* DiskIndex has 5654236 *titles*, occupying 131593 Kb of memory.
* PageLocators occupy 112557 Kb
(en Wikipedia exposed! =)
Quote
just want to add two options, a
btree would be good (a variant on the a/b/c.., uncomfortable to
implement IMO)
Any hints on a generic implementation for an a/b/c stringlist
structure (where TStringList could be replaced by a descendant)?
I could build my own structure, but since it seems to be a fairly
generic concept, someone has probably already done it, which would be
even faster than doing it myself (believe it or not). :)
Regards,
// Rolf Lampa
 

Re: Fastest TStringList compatible Hashtable?

Rolf Lampa [RIL] wrote:
Quote
A test example, scanning the English Wikipedia dump (as of 2007-10-09),
Oops, that should be as of 2007-09-08.
// Rolf Lampa
 

Re: Fastest TStringList compatible Hashtable?

Why not have a look at www.yunqa.de/delphi/doku.php
Look at the DIContainers section
Hashes, Trees, Vectors, Dictionaries, etc.
Regards
Paul
On Tue, 09 Oct 2007 22:21:42 +0200, "Rolf Lampa [RIL]"
<XXXX@XXXXX.COM> writes:
Quote
John Herbster wrote:
>"Rolf Lampa [RIL]" <XXXX@XXXXX.COM>wrote
>
>>>>I don't need the string data in the hashlist which it indexes,
>>>>I only need the indexes, and the object property.
>
>Rolf,
>
>Does this problem have unique object values for each string?

Yes. And the strings are unique too. It is actually a title list for a
MediaWiki database, where all titles must be unique (CaseSensitive).

 

Re: Fastest TStringList compatible Hashtable?

Paul Sjoerdsma wrote:
Quote
Why not have a look at www.yunqa.de/delphi/doku.php
Look at the DIContainers section

Hashes, Trees, Vectors, Dictionaries, etc.
Looks interesting. Which one would you recommend I try first, for my
needs? (I only need to retrieve the index, by title, and the object
property, and to use as little memory as possible.)
I also use regexes extensively, but due to their relatively bad
performance (compared to direct Delphi code) I often have to write my
own code for the slower tasks.
Have you tried DIRegEx, or compared performance (and/or functionality)
with PerlRegEx.pas (based on www.pcre.org)?
Regards,
// Rolf Lampa
 

Re: Fastest TStringList compatible Hashtable?

This one went over my head. =)
Regards,
// Rolf Lampa
 

Re: Fastest TStringList compatible Hashtable?

Rolf Lampa [RIL] writes:
Quote
>just want to add two options, a btree would be good (a variant on the
>a/b/c.., uncomfortable to implement IMO)

Any hints on a generic implementation for an a/b/c stringlist structure
(where TStringList could be replaced by a descendant)?

I could build my own structure, but since it seems to be a fairly
generic concept, someone has probably already done it, which would be
even faster than doing it myself (believe it or not). :)
It's a bit different from the TStringList approach. It's a collection
of index pages; each page has some records in it, with some fixed-size
attributes like key-offset, page number and a record #id.
Searching is easy: you start in the main page; if matching, you're
done; if smaller, continue in the subpage; otherwise take the next
record (see the sketch below).
I don't know if it's any better than the less complicated TStringList.
There's always the point, on lesser machines, where the wiki pages
won't fit into memory and you need other algorithms to search.
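A minimal sketch of the page layout and search just described, with
simplified in-memory pages; the record and field names are
hypothetical, and a real btree would keep fixed-size pages on disk
and binary-search within each page.

type
  TBTreeRec = record
    Key: string;        // the key (or a key-offset into a string heap)
    SubPage: Integer;   // page holding keys smaller than Key (-1 = none)
    RecId: Integer;     // the record #id payload
  end;

  TBTreePage = record
    Recs: array of TBTreeRec;
    RightPage: Integer; // page for keys greater than all Recs (-1 = none)
  end;

var
  Pages: array of TBTreePage; // Pages[0] is the main (root) page

// Start in the main page: on a match you're done; if the sought key
// is smaller, continue in that record's subpage; otherwise take the
// next record, falling through to RightPage past the last one.
function BTreeFind(const aKey: string; out RecId: Integer): Boolean;
var
  p, i: Integer;
begin
  Result := False;
  RecId := -1;
  p := 0;
  while p <> -1 do
  begin
    i := 0;
    while (i < Length(Pages[p].Recs)) and (aKey > Pages[p].Recs[i].Key) do
      Inc(i);
    if (i < Length(Pages[p].Recs)) and (aKey = Pages[p].Recs[i].Key) then
    begin
      RecId := Pages[p].Recs[i].RecId;
      Result := True;
      Exit;
    end;
    if i < Length(Pages[p].Recs) then
      p := Pages[p].Recs[i].SubPage   // aKey is smaller: descend
    else
      p := Pages[p].RightPage;        // larger than every key on the page
  end;
end;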
 

Re: Fastest TStringList compatible Hashtable?

Why not use the Hash?
As standard, you can use the TDIObjectHash with an AnsiStringKeyHandler
(provided out of the box by the library).
After that you can easily look up the objects by using the title as a
key.
If necessary you can specify a StringKeyHandler of your own.
Regards
Paul
On Wed, 10 Oct 2007 18:57:29 +0200, "Rolf Lampa [RIL]"
<XXXX@XXXXX.COM> writes:
Quote
Paul Sjoerdsma wrote:
>Why not have a look at www.yunqa.de/delphi/doku.php
>Look at the DIContainers section
>
>Hashes, Trees, Vectors, Dictionaries, etc.


Looks interesting. Which one would you recommend I try first, for my
needs? (I only need to retrieve the index, by title, and the object
property, and to use as little memory as possible.)

I also use regexes extensively, but due to their relatively bad
performance (compared to direct Delphi code) I often have to write my
own code for the slower tasks.

Have you tried DIRegEx, or compared performance (and/or functionality)
with PerlRegEx.pas (based on www.pcre.org)?

Regards,

// Rolf Lampa
 

Re: Fastest TStringList compatible Hashtable?

"Rolf Lampa [RIL]" <XXXX@XXXXX.COM>writes
Quote
This one went over my head. =)
Then, there's probably no point in telling you about a quick sort/search
algorithm for TStringList.
--
Mark Jacobs
DK Computing
www.dkcomputing.co.uk
 

Re: Fastest TStringList compatible Hashtable?

Mark Jacobs skrev:
Quote
"Rolf Lampa [RIL]" <XXXX@XXXXX.COM>writes
news:XXXX@XXXXX.COM...
>This one went over my head. =)

Then, there's probably no point in telling you about a quick sort/search
algorithm for TStringList.
I've spent quite some time trying to store only the hash keys, as
strings, instead of the titles, in the list. And since there are
duplicate hash keys, 65% in this case (I counted them on the fly), I
tried to save those entries, in the form of full titles, in a
parallel hashed list.
The problem was about saving memory. That is, saving space while not
reducing the speed. I already have good speed, using a dedicated
hashlist: it takes me only 17 seconds to build the list with 5+
million strings in it, and then I have a very, very fast hashed list.
But it takes "ages", or more precisely about ten minutes, to build the
list if sorting is involved, even with some kind of "quick" sorting.
However, back to the problem again. Testing for duplicates while
adding items to the list (more than 5 million strings, btw) slows down
the loading process, because the binary search part, the local part in
the buckets which uses the "half method" described above (also called
"binary search") on the duplicates, needs to be updated for each new
entry in the list. Binary search requires sorted data, which makes the
whole thing slow.
So, in order to save space I lose speed if any sorting is involved.
This means that I might do slightly better if you don't go into detail
describing the algorithm behind the quick sort/search algorithm, nor
about the hash buckets either, for that matter.
It is the sorting which IS the problem, not the solution. It kills the
whole thing. Sorting takes time, and the idea was to maintain the good
speed I already have while reducing the space needed.
Both Peter and I have tried different approaches to solving this
problem, and any approach which requires sorting while building the
list is unthinkable. But we still have many untested ideas.
So you're perfectly right: there's absolutely no point in telling me
anything about the sorting stuff.
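The build-time cost being described shows up with a plain TStringList
as well: with Sorted := True every Add does a binary search plus a
memory move to insert in place, while loading unsorted and sorting
once afterwards pays only a single quicksort. A minimal sketch (the
procedure names are hypothetical):

uses
  Classes;

// Sorted while building: binary search + insertion move on every Add,
// but duplicates can be detected as the list is loaded.
procedure BuildSortedWhileAdding(Source, Dest: TStringList);
var
  i: Integer;
begin
  Dest.Sorted := True;
  for i := 0 to Source.Count - 1 do
    Dest.Add(Source[i]);
end;

// Load first, sort once: far faster to build, but duplicates are not
// known until after the final Sort.
procedure BuildThenSort(Source, Dest: TStringList);
begin
  Dest.Sorted := False;
  Dest.Assign(Source);
  Dest.Sort;
end;

Neither variant helps when duplicates must be caught during the load
itself, which is exactly the constraint described above.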
Regards,
// Rolf Lampa