hashmap: v2.0 initial commit (new API + algorithm improvements)
Hashmap 2.0 Highlights:
* New generic and type-safe API. We no longer need to use a macro to generate type-safe wrapper functions.
* Improved linear probing algorithm. The previous algorithm could fail on insert, rehash, or remove if a particularly poor hash function was provided. The new algorithm can never fail, even with a worst-case hash function. This adds user confidence and reduces failure modes.
* Added a supplemental hash function. Linear probing is especially sensitive to clustering due to poor hash functions. Since the hash function is user-supplied, adding a supplemental hash function provides more consistent performance.
* The hashmap statistics API is now always available, with no additional overhead on ordinary hashmap operations.
* Allocation on init is now lazy: no heap memory is reserved until the first item is added.
* Default hashmap size is reduced to 128 elements.
* A hashmap_reserve() function was added to pre-allocate the hashmap.
* hashmap_foreach macros have been added to hide the complexities of iterator usage and streamline iteration.
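
The probing and supplemental-hash points above can be illustrated with a minimal sketch. It is not the library's implementation: the fixed-size `struct table`, the helper names (`table_find`, `table_put`, `table_get`, `bad_hash`, `mix32`) and the choice of the MurmurHash3 32-bit finalizer as the mixer are all assumptions made for illustration. The idea shown is that the user hash is passed through a bit mixer before indexing, and that the probe loop visits every slot once, so an insert only fails when the table is genuinely full, even if the user hash returns the same value for every key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Power of two, so (hash & (TABLE_SIZE - 1)) is always a valid index. */
#define TABLE_SIZE 8u

struct slot {
    bool used;
    uint32_t key;
    int value;
};

struct table {
    struct slot slots[TABLE_SIZE];
    size_t count;
};

/* Hypothetical worst-case user hash: every key maps to the same value. */
static uint32_t bad_hash(uint32_t key)
{
    (void)key;
    return 0;
}

/*
 * Supplemental hash (assumption: the MurmurHash3 32-bit finalizer, chosen
 * only as a well-known bit mixer). It spreads weak user hashes across the
 * low bits that select the home slot.
 */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/*
 * Linear probe starting at the home slot. Returns the slot holding `key`,
 * or the first free slot on the probe path, or NULL if the table is full
 * and the key is absent. (No removal in this sketch, so no tombstones.)
 */
static struct slot *table_find(struct table *t, uint32_t (*hash)(uint32_t),
                               uint32_t key)
{
    assert(t != NULL && hash != NULL);
    size_t home = mix32(hash(key)) & (TABLE_SIZE - 1);
    for (size_t i = 0; i < TABLE_SIZE; ++i) {
        struct slot *s = &t->slots[(home + i) & (TABLE_SIZE - 1)];
        if (!s->used || s->key == key) {
            return s;
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 if the key already exists or the table is full. */
static int table_put(struct table *t, uint32_t (*hash)(uint32_t),
                     uint32_t key, int value)
{
    struct slot *s = table_find(t, hash, key);
    if (s == NULL || s->used) {
        return -1;
    }
    s->used = true;
    s->key = key;
    s->value = value;
    t->count++;
    return 0;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
static const int *table_get(struct table *t, uint32_t (*hash)(uint32_t),
                            uint32_t key)
{
    struct slot *s = table_find(t, hash, key);
    return (s != NULL && s->used) ? &s->value : NULL;
}
```

With `bad_hash` every key shares one home slot, so lookups degrade to a linear scan, but all `TABLE_SIZE` inserts still succeed. A real table would also rehash to keep the load factor bounded, which this sketch leaves out.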
dleeds-cpi committed Jun 16, 2020
1 parent 14ea5d7 commit ff031ab
Showing 8 changed files with 1,036 additions and 775 deletions.
5 changes: 5 additions & 0 deletions CMakeLists.txt
@@ -88,3 +88,8 @@ export(EXPORT hashmap-targets

 # Register package in user's package registry
 export(PACKAGE HashMap)
+
+##############################################
+# Build unit test
+enable_testing()
+add_subdirectory(test)
128 changes: 72 additions & 56 deletions README.md
@@ -1,90 +1,106 @@
 # hashmap
-Flexible hashmap implementation in C using open addressing and linear probing for collision resolution.
+Templated type-safe hashmap implementation in C using open addressing and linear probing for collision resolution.

 ### Summary
-This project came into existence because there are a notable lack of flexible and easy to use data structures available in C. Sure, higher level languages have built-in libraries, but plenty of embedded projects or higher level libraries start with core C code. It was undesirable to add a bulky library like Glib as a dependency to my projects, or grapple with a restrictive license agreement. Searching for "C hashmap" yielded results with questionable algorithms and code quality, projects with difficult or inflexible interfaces, or projects with less desirable licenses. I decided it was time to create my own.
+This project came into existence because there are a notable lack of flexible and easy to use data structures available in C. C data structures with efficient, type-safe interfaces are virtually non-existent. Sure, higher level languages have built-in libraries and templated classes, but plenty of embedded projects or higher level libraries are implemented in C. It was undesirable to add a bulky library like Glib as a dependency to my projects, or grapple with a restrictive license agreement. Searching for "C hashmap" yielded results with questionable algorithms and code quality, projects with difficult or inflexible interfaces, or projects with less desirable licenses. I decided it was time to create my own.


 ### Goals
 * **To scale gracefully to the full capacity of the numeric primitives in use.** E.g. on a 32-bit machine, you should be able to load a billion+ entries without hitting any bugs relating to integer overflows. Lookups on a hashtable with a billion entries should be performed in close to constant time, no different than lookups in a hashtable with 20 entries. Automatic rehashing occurs and maintains a load factor of 0.75 or less.
-* **To provide a clean and easy-to-use interface.** C data structures often struggle to strike a balance between flexibility and ease of use. To this end, I provided a generic interface using void pointers for keys and data, and macros to generate type-specific wrapper functions, if desired.
-* **To enable easy iteration and safe entry removal during iteration.** Applications often need these features, and the data structure should not hold them back. Both an iterator interface and a foreach function was provided to satisfy various use-cases. This hashmap also uses an open addressing scheme, which has superior iteration performance to a similar hashmap implemented using separate chaining (buckets with linked lists). This is because fewer instructions are needed per iteration, and array traversal has superior cache performance than linked list traversal.
+* **To provide a clean and easy-to-use interface.** C data structures often struggle to strike a balance between flexibility and ease of use. To this end, I wrapped a generic C backend implementation with light-weight pre-processor macros to create a templated type-safe interface. All required type information is encoded in the hashmap declaration using the`HASHMAP()` macro. Unlike with header-only macro libraries, there is no code duplication or performance disadvantage over a traditional library with a non-type-safe `void *` interface.
+* **To enable easy iteration and safe entry removal during iteration.** Applications often need these features, and the data structure should not hold them back. Easy to use `hashmap_foreach()` macros and a more flexible iterator interface are provided. This hashmap also uses an open addressing scheme, which has superior iteration performance to a similar hashmap implemented using separate chaining (buckets with linked lists). This is because fewer instructions are needed per iteration, and array traversal has superior cache performance than linked list traversal.
 * **To use a very unrestrictive software license.** Using no license was an option, but I wanted to allow the code to be tracked, simply for my own edification. I chose the MIT license because it is the most common open source license in use, and it grants full rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell the code. Basically, take this code and do what you want with it. Just be nice and leave the license comment and my name at top of the file. Feel free to add your name if you are modifying and redistributing.

 ### Code Example
 ```C
 #include <stdlib.h>
+#include <stdbool.h>
 #include <stdio.h>
+#include <string.h>
+#include <errno.h>

 #include <hashmap.h>

 /* Some sample data structure with a string key */
 struct blob {
-    char key[32];
-    size_t data_len;
-    unsigned char data[1024];
+    char key[32];
+    size_t data_len;
+    unsigned char data[1024];
 };

-/* Declare type-specific blob_hashmap_* functions with this handy macro */
-HASHMAP_FUNCS_CREATE(blob, const char, struct blob)
-
+/*
+ * Contrived function to allocate blob structures and populate
+ * them with randomized data.
+ *
+ * Returns NULL when there are no more blobs to load.
+ */
 struct blob *blob_load(void)
 {
-    struct blob *b;
-    /*
-     * Hypothetical function that allocates and loads blob structures
-     * from somewhere. Returns NULL when there are no more blobs to load.
-     */
-    return b;
-}
+    static size_t count = 0;
+    struct blob *b;
+
+    if (++count > 100) {
+        return NULL;
+    }
+
+    if ((b = malloc(sizeof(*b))) == NULL) {
+        return NULL;
+    }
+    snprintf(b->key, sizeof(b->key), "%02lx", random() % 100);
+    b->data_len = random() % 10;
+    memset(b->data, random(), b->data_len);

-/* Hashmap structure */
-struct hashmap map;
+    return b;
+}

 int main(int argc, char **argv)
 {
-    struct blob *b;
-    struct hashmap_iter *iter;
+    /* Declare type-specific hashmap structure */
+    HASHMAP(char, struct blob) map;
+    const char *key;
+    struct blob *b;
+    void *temp;
+    int r;
+
+    /* Initialize with default string key hash function and comparator */
+    hashmap_init(&map, hashmap_hash_string, strcmp);
+
+    /* Load some sample data into the map and discard duplicates */
+    while ((b = blob_load()) != NULL) {
+        r = hashmap_put(&map, b->key, b);
+        if (r < 0) {
+            /* Expect -EEXIST return value for duplicates */
+            printf("putting blob[%s] failed: %s\n", b->key, strerror(-r));
+            free(b);
+        }
+    }

-    /* Initialize with default string key functions and init size */
-    hashmap_init(&map, hashmap_hash_string, hashmap_compare_string, 0);
+    /* Lookup a blob with key "AbCdEf" */
+    b = hashmap_get(&map, "AbCdEf");
+    if (b) {
+        printf("Found blob[%s]\n", b->key);
+    }

-    /* Load some sample data into the map and discard duplicates */
-    while ((b = blob_load()) != NULL) {
-        if (blob_hashmap_put(&map, b->key, b) != b) {
-            printf("discarding blob with duplicate key: %s\n", b->key);
-            free(b);
+    /* Iterate through all blobs and print each one */
+    hashmap_foreach(key, b, &map) {
+        printf("blob[%s]: data_len %zu bytes\n", key, b->data_len);
     }
-    }
-
-    /* Lookup a blob with key "AbCdEf" */
-    b = blob_hashmap_get(&map, "AbCdEf");
-    if (b) {
-        printf("Found blob[%s]\n", b->key);
-    }
-
-    /* Iterate through all blobs and print each one */
-    for (iter = hashmap_iter(&map); iter; iter = hashmap_iter_next(&map, iter)) {
-        printf("blob[%s]: data_len %zu bytes\n", blob_hashmap_iter_get_key(iter),
-            blob_hashmap_iter_get_data(iter)->data_len);
-    }
-
-    /* Remove all blobs with no data */
-    iter = hashmap_iter(&map);
-    while (iter) {
-        b = blob_hashmap_iter_get_data(iter);
-        if (b->data_len == 0) {
-            iter = hashmap_iter_remove(&map, iter);
-            free(b);
-        } else {
-            iter = hashmap_iter_next(&map, iter);
+
+    /* Remove all blobs with no data (using remove-safe foreach macro) */
+    hashmap_foreach_data_safe(b, &map, temp) {
+        if (b->data_len == 0) {
+            printf("Discarding blob[%s] with no data\n", b->key);
+            hashmap_remove(&map, b->key);
+            free(b);
+        }
     }
-    }

-    /* Free all allocated resources associated with map and reset its state */
-    hashmap_destroy(&map);
+    /* Cleanup time: free all the blobs, and destruct the hashmap */
+    hashmap_foreach_data(b, &map) {
+        free(b);
+    }
+    hashmap_cleanup(&map);

-    return 0;
+    return 0;
 }

 ```
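
The README's claim that a generic backend can be wrapped with light-weight macros to get a type-safe interface can be sketched as follows. This is a hedged illustration of the technique only, not the library's code: every `demo_*` and `DEMO_*` name is hypothetical, the backend is a trivial linear-search store standing in for the real hash table, and the sketch stays within standard C instead of assuming compiler extensions. The key and data types are recorded as typed scratch pointers in the map declaration, and the wrapper macros route arguments through those pointers so that a mismatched type is diagnosed at compile time while only one copy of the `void *` backend exists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_CAPACITY 16

/* Generic backend: knows nothing about key or data types. */
struct demo_base {
    const void *keys[DEMO_CAPACITY];
    void *data[DEMO_CAPACITY];
    size_t count;
    int (*compare)(const void *, const void *);
};

static void demo_base_init(struct demo_base *b,
                           int (*compare)(const void *, const void *))
{
    assert(b != NULL && compare != NULL);
    memset(b, 0, sizeof(*b));
    b->compare = compare;
}

/* Returns 0 on success, -1 if the key exists or the store is full. */
static int demo_base_put(struct demo_base *b, const void *key, void *data)
{
    for (size_t i = 0; i < b->count; ++i) {
        if (b->compare(b->keys[i], key) == 0) {
            return -1;
        }
    }
    if (b->count == DEMO_CAPACITY) {
        return -1;
    }
    b->keys[b->count] = key;
    b->data[b->count] = data;
    b->count++;
    return 0;
}

/* Returns the stored data pointer, or NULL if the key is absent. */
static void *demo_base_get(const struct demo_base *b, const void *key)
{
    for (size_t i = 0; i < b->count; ++i) {
        if (b->compare(b->keys[i], key) == 0) {
            return b->data[i];
        }
    }
    return NULL;
}

/* Comparator with the backend's void * signature. */
static int demo_compare_str(const void *a, const void *b)
{
    return strcmp(a, b);
}

/* Sample payload type used in the usage note below. */
struct demo_point {
    int x;
    int y;
};

/* Declaration macro: the typed scratch pointers carry the type information. */
#define DEMO_MAP(key_type, data_type) \
    struct { struct demo_base base; const key_type *t_key; data_type *t_data; }

#define demo_init(m, cmp) demo_base_init(&(m)->base, (cmp))

/* Assigning through t_key/t_data makes the compiler check argument types. */
#define demo_put(m, key, data) \
    ((m)->t_key = (key), (m)->t_data = (data), \
     demo_base_put(&(m)->base, (m)->t_key, (m)->t_data))

/* The expression has type data_type *, so no cast is needed at the call site. */
#define demo_get(m, key) \
    ((m)->t_key = (key), \
     (m)->t_data = demo_base_get(&(m)->base, (m)->t_key))
```

Declaring `DEMO_MAP(char, struct demo_point) map;` and calling `demo_get(&map, "b")->y` compiles without a cast, whereas passing an `int *` as the data argument of `demo_put` draws an incompatible-pointer diagnostic. The macros evaluate `m` more than once, so the map argument should be free of side effects.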