Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MutableMap and similar collections are not guaranteeing the type of the collection #502

Open
devnix opened this issue Jan 13, 2025 · 12 comments
Assignees
Labels
Priority: Medium This issue may be useful, and needs some attention. Status: Available No one has claimed responsibility for resolving this issue. Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@devnix
Copy link
Contributor

devnix commented Jan 13, 2025

Describe the bug
Suppose you define a map that only allows string as key and set a string that PHP would cast to an integer as an array key (any integer-ish string that doesn't start with zero). In that case, it will return an integer instead of the promised type by the static analysis.

To Reproduce

/** @var MutableMap<string, string> $map */
$map = new MutableMap([]);

$map->add('foo', 'foo');
$map->add('123', '123');

/** @psalm-trace $values */
$values = $map->keys()->toArray(); // Trace: $values: list<string>
var_dump($values); // array(2) {[0]=>  string(3) "foo"  [1]=>  int(123)}

Expected behavior
I would expect this output:

var_dump($values); // array(2) {[0]=>  string(3) "foo"  [1]=>  string(3) "123"}

I wonder if you think that this should be implemented differently to avoid such bugs as I do. I prefer a slightly slower/less memory-efficient but correct, runtime-safe implementation.

@devnix devnix added the Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Jan 13, 2025
@azjezz
Copy link
Owner

azjezz commented Jan 13, 2025

Fixing this would mean a huge performance penalty, and a BC break that requires a major release.

I personally do not use collections often, so i'm not really sure.

@veewee wdyt?

@azjezz
Copy link
Owner

azjezz commented Jan 13, 2025

cc @tkquizlet

@azjezz azjezz added Priority: Medium This issue may be useful, and needs some attention. Status: Available No one has claimed responsibility for resolving this issue. labels Jan 13, 2025
@devnix
Copy link
Contributor Author

devnix commented Jan 13, 2025

From a static analysis POV is not a BC but a bugfix. It even breaks when you do $map->at('123').

I can try a couple of test implementations and benchmark them 😊

@veewee
Copy link
Collaborator

veewee commented Jan 13, 2025

I must agree that this behaviour has bitten me in production as well on regular arrays and if we can fix it, that would be nice.

However I do agree that this is a BC break because people might get unexpected exceptions on code that worked just fine previously in a minor version. Also the performance consideration might be a blocker for people heavily using the maps: we should avoid that this basic building block becomes slow and therefore unusable.

Is there some smart way we could enforce this without too much of an overhead of actually parsing the types?
I was thinking in lines of storing the keys as a set and using the index as the value pointer.
But then I realized this problem also exists in sets ... :)

@devnix
Copy link
Contributor Author

devnix commented Jan 13, 2025

I was not thinking about parsing types at all, but to maintain two inner lists, one for the key and another for the value. I think it should not be such a traumatic overhead

@veewee
Copy link
Collaborator

veewee commented Jan 13, 2025

That would mean you have to iterate internally to find the value for a key, which is not very performant and not what you'dd expect from a map.

@devnix
Copy link
Contributor Author

devnix commented Jan 13, 2025

The alternative would be to use a non-cryptographic hashing algorithm and do the hash table in userland.

I'm very aware that it will cause performance degradation. Still, if performance to that point were my issue, I would probably not rely on this library but on the DS extension or something similar.

@devnix
Copy link
Contributor Author

devnix commented Jan 15, 2025

I was thinking about iterating the collection with a foreach, as the key exposed would be casted anyway, but it seems like it wouldn't through generators: https://3v4l.org/ucV2I

The problem here is that calling the toArray() method would screw a duplicated array position anyway, and also it would be very handy to build the collection from an iterable.

Fixing some of this things would be breaking changes, but there are some issues that could be fixed while being compatible!

@veewee
Copy link
Collaborator

veewee commented Jan 15, 2025

I was thinking: we could always prefix the key of the internal array with a specific string and store the actual (key, value) as a tuple instead.

That would fix the issue and could limit the performance impact as well.

However. I don't think we can really fix toArray because the way how the keys are converted is broken in php itself.

@azjezz
Copy link
Owner

azjezz commented Jan 15, 2025

This is quite easy to implement as mentioned via a hash, example:

$keys: array<string, K>;
$values: array<string, V>;

set:
   $hash = compute_hash($key);
   $this->keys[$hash] = $key;
   $this->values[$hash] = $value;

get:
   $hash = compute_hash($key);
   return $this->values[$hash];

contains:
   $hash = compute_hash($key);
   return isset($this->values[$hash]);

remove:
   $hash = compute_hash($key);
   unset($this->keys[$hash], $this->values[$hash]);

to_iterator:
   foreach($this->keys as $hash => $key) {
     yield $key => $this->values[$hash];
   }

to_array:
   iterator_to_array($this->toIterator());

compute_hash could be either a quick hashing algorithm, or simply sprintf("psl_%s_%s", $k, get_debug_type($k));.

we could also have the hasher be configureable, so one could do: Map::withHasher($func);.

however, this is going to introduce a "big" performance overhead.

another option: implement this in C or Rust and use FFI 😅 ( this is a semi-joke, because i do want to use FFI in Psl, but making it a required extension would be bad for a lot of people. )

@devnix
Copy link
Contributor Author

devnix commented Jan 15, 2025

I had the same hashing idea but I fear the performance too, and the tuple is a great idea indeed!

~~There's a downside: if you declare a MutableMap<int|string, mixed> you should be able to do this without overwriting:

$vector->add(1, "foo");
$vector->add("1", "bar");

I was thinking: how slow can be to serialize an int or a string, and use the serialized string as the key?~~

Forget it, I was heading out of the hospital and I missed out some parts of the post.

@devnix
Copy link
Contributor Author

devnix commented Jan 16, 2025

I can do a benchmark and a test implementation 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: Medium This issue may be useful, and needs some attention. Status: Available No one has claimed responsibility for resolving this issue. Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
None yet
Development

No branches or pull requests

3 participants