Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider pros and cons of alternative data structure for pool #6

Open
xiaodaigh opened this issue Jan 22, 2018 · 6 comments
Open

Consider pros and cons of alternative data structure for pool #6

xiaodaigh opened this issue Jan 22, 2018 · 6 comments

Comments

@xiaodaigh
Copy link

xiaodaigh commented Jan 22, 2018

Currently pool is a WeakRefDict. Wondering if it can be made into something else to allow for more operations to be more efficient e.g. sorting/grouping. A B-tree perhaps?

@oxinabox
Copy link
Member

You don't perform operations on the bool.
It's not user-facing, and it is not safe for users to touch.

It needs exactly one operation: intern!, which isn't really related to sorting or grouping.
I guess that operation could be called getkey!.
It needs to check for the set containing an element equal the input, and if so return that element, if not add the input, and return it.

The pros and cons of alternative data structures are not around making any other operations efficient.
So here they are:

Custom Hashfunction hashset/dict

Pro:

Could be faster than current implementation

Con:

Work to implement basically from ground up

TreeSet/dict

Pro

Comparisons could be faster than current hash function

Con

Gets slower the more elements it contains. Technically so does Hashmap, but this is worse.
Slowness gets worse if strings share common prefixes.

Trie

All pros and cons of TreeSet/dict,
but additionally

Pro

Uses less memory in storage

Con

Big Con:
I can't workout away to actually pull this off that actually results in being able to return strong references to its contents, since it gains those memory savings by discarding the original strings and only storing parts.
I guess maybe if we had a RopeString class still, but then the strings themselves would be slow ...

There are a few other finite-automata-ish string storing structures with the same kinda problems as tries but perhaps more so.
Since a set of strings can be reduced down into definite a finite-automata that only accepts strings from that set.
But I think that line of reasoning is a deadend of slowness

@xiaodaigh
Copy link
Author

Yeah, but if the pool is stored in a data structure such that it's easy to obtain each element's rank within the pool then one can sort an InternedStrings vector quite fast using counting sorting. Suppose there are only 3 possible elements "a", "b", "c" and so their rank is 1,2,3. We can scan through the vector and keep a count of the rank, and then use the counts to sort the vector using the same algorithm as counting sort. I don't think this is currently possible.

Nice properties for the pool data structure are easy to look up, easy to insert, easy to delete, and easy to find rank

@oxinabox
Copy link
Member

Rather than talking about properties of a pool do you want to take a step outwards and talk about properties of the InternedString

Right now, the property of the interned string is:

  • very fast equality checking
  • use less memory

You are suggesting adding:

  • very fast < / > checking

Right?

@xiaodaigh
Copy link
Author

Yeah

@oxinabox
Copy link
Member

oxinabox commented May 5, 2018

Due to the removal of the InternedString type this is not directly possible anymore,
While there is still some value in the consideration,
and perhaps we could definate some kinda of order(a,b)= getind(intern(a)) < getind(intern(b)) type deal.
It is looking less likely.

I'm so dubious about the using of Tries for performance or for memory decreasing,
for most use cases.
But like maybe it could be a thing.
It might actually be faster than hashing I guess.

@ScottPJones
Copy link
Member

I used to use B+ trees for interning (this was for huge tables, millions of entries, the B+ trees were buffered in shared memory, with TB of disk).
It worked very well for us (and this was for NLP work), a nice thing was being able to get ordered output fast (at least, by Unicode codepoints), made finding prefixes trivial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants