Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sim: Faster type coercion in toID #10619

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

larry-the-table-guy
Copy link
Contributor

Derivative of #10606

Stats

4*1024*1024 calls per row.

# uniq ... is a running total.
All others are per batch.

Stats during npm run full-test

id === '' id === text not new text # uniq text # uniq id
0.05% 64.66% 94.40% 8758 4789
0.01% 60.48% 97.89% 8834 4847
0.00% 40.75% 100.00% 9011 5009
0.58% 42.14% 100.00% 9069 5058
0.14% 47.23% 100.00% 9069 5058
0.00% 50.29% 100.00% 9069 5058
0.00% 39.74% 100.00% 9069 5058
0.00% 43.55% 100.00% 9070 5058
0.00% 35.38% 100.00% 9070 5058
0.00% 15.77% 100.00% 9070 5058
0.00% 52.12% 100.00% 9070 5058
0.00% 52.12% 100.00% 9070 5058
0.00% 52.07% 100.00% 9070 5058
0.00% 45.55% 100.00% 9070 5058
0.00% 30.35% 100.00% 9071 5058
0.00% 32.78% 100.00% 9072 5058
0.00% 55.63% 100.00% 9072 5058
0.00% 42.64% 100.00% 9072 5058
9.35% 40.16% 100.00% 9077 5061
1.14% 59.77% 100.00% 9087 5070
0.00% 47.81% 100.00% 9087 5070
0.00% 47.68% 100.00% 9087 5070
0.00% 48.33% 100.00% 9087 5070
0.00% 48.32% 100.00% 9087 5070
0.00% 48.25% 100.00% 9087 5070
0.00% 49.06% 100.00% 9087 5070
0.00% 49.30% 100.00% 9087 5070

About 100 million total calls.
Pretty high proportion of id === text, which means IDs are being passed to toID a lot.

profiling code
let empty_count = 0;
let id_same_as_text_count = 0;
const uniq_text = new Set();
let dup_text_count = 0;
const uniq_id = new Set();
let call_count = 0;
/** snip */
export function toID(text: any): ID {
        call_count++;
        if (typeof text !== 'string') {
                if (text) text = text.id || text.userid || text.roomid || text;
                if (typeof text === 'number') text = '' + text;
                else if (typeof text !== 'string') return '';
        }
        const id = text.toLowerCase().replace(/[^a-z0-9]+/g, '') as ID;

        if (text === '') ++empty_count;
        if (id === text) ++id_same_as_text_count;
        if (uniq_text.has(text)) ++dup_text_count;
        uniq_text.add(text);
        uniq_id.add(id);
        if (call_count === (1024 * 1024 * 4)) {
                console.log(); // the extra newline and letters are for grepping
                console.log("a", empty_count / call_count, id_same_as_text_count / call_count);
                console.log("b", dup_text_count / call_count, uniq_text.size, uniq_id.size);
                id_same_as_text_count = dup_text_count = call_count = empty_count = 0;
        }

        return id;
}

@larry-the-table-guy larry-the-table-guy marked this pull request as draft October 14, 2024 20:52
@larry-the-table-guy
Copy link
Contributor Author

larry-the-table-guy commented Oct 14, 2024

Ok, technically, this breaks for boolean and others. But there's no good reason to be passing those in the first place imo. Simple fix is to do '' + text more broadly, but I'll just put this on the backburner.

Edit: aaand now I remember the original code doesn't handle bool anyway. Sigh. any is not a good type for a parameter...

@Slayer95
Copy link
Contributor

Slayer95 commented Oct 14, 2024

The reason why any is the parameter type, is that toID is often used as a handy sanitizer. So, if an untrusted peer sends, say, a JSON object where a contained value is expected to be a non-empty string, then by using toId the whole validation can be simplified a lot. Supporting numbers is a nice extension to round up the concept of identifier.

So, the fact that toID is often called with arbitrary data as its parameter brings about my comment yesterday regarding the huge complexity that changing callsites not to pass IDs to toID would entail.

@larry-the-table-guy
Copy link
Contributor Author

I understand that's the intent but based on #10549 (comment)
The vast majority of the callsites already know they have a string.
And some know they have an ID, yet call methods like DexSpecies.get.

The function's contract is just unnecessarily weak.
IMO type coercion should have been separate from the lower+replace logic.
But it's not worth changing now.

@Slayer95
Copy link
Contributor

Slayer95 commented Oct 14, 2024

The vast majority of the callsites already know they have a string.

That's true.

And some know they have an ID, yet call methods like DexSpecies.get
But it's not worth changing now.

Yea. Changing that means that each of these functions would need a strict and a lax variant, thus doubling mental overhead from function names. And god forbid refactoring bugs, and having to think about Hidden Power *.

@larry-the-table-guy
Copy link
Contributor Author

larry-the-table-guy commented Oct 14, 2024

mental overhead

To me, the more pressing source of that is how many states the system can be in.
More possible types -> more states.
It's also why I think the lazy loading in Dex is meh. It adds another state to consider. I mean, at this very moment, there's a bug in DexStats - it reads dex.gen before that's been written.

I think a simple system comes from predictable data flow. Lazy loading and any types just don't help with that, IMO.
/endrant

Anyway, the fact that I can't meaningfully profile this ATM indicates I should be working on other problems.

@Zarel
Copy link
Member

Zarel commented Oct 29, 2024

For the record, I do admit it was probably a mistake to use this function in hot code. I don't mind a separate nameToID function for known strings. toID probably shouldn't be used in inner loops at all, although that might be a lost cause at this point. Blame... uh, the lack of any way to enforce type safety when I first wrote Showdown, I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants