Feature Request: Similarity scores #33

2meito · 2023-12-09T23:37:11Z

Instead of only using strict equalities, percentage similarities would be more inclusive in potential matches.

cooljeanius · 2024-04-11T03:13:33Z

Along these lines: showing separate checkboxes (or tags?) for the separate criteria that each account meets would be helpful, too. So, for example, instead of just showing one of the following, show each of them simultaneously:

Included handle name in description
Same handle name
Same display name
Same profile picture (as described in issue idea: compare avatars #13)
Any others I'm forgetting
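
To make the idea concrete, here is a rough sketch of returning every criterion an account meets rather than stopping at the first hit. The `XAccount` and `BskyAccount` shapes, the criterion names, and `matchedCriteria` are all made up for illustration and are not the extension's real types.

```typescript
// Hypothetical shapes: these field names are illustrative only.
type XAccount = { handle: string; displayName: string };
type BskyAccount = { handle: string; displayName?: string; description?: string };

type MatchCriterion = "HANDLE_IN_DESCRIPTION" | "SAME_HANDLE" | "SAME_DISPLAY_NAME";

// Collect every criterion that holds instead of returning on the first hit.
function matchedCriteria(x: XAccount, bsky: BskyAccount): MatchCriterion[] {
  const criteria: MatchCriterion[] = [];
  const xHandle = x.handle.toLowerCase().replace(/^@/, "");
  // Compare against the first label of the Bluesky handle,
  // e.g. "alice" in "alice.bsky.social".
  const bskyName = bsky.handle.toLowerCase().replace(/^@/, "").split(".")[0];
  const xDisplay = x.displayName.toLowerCase();

  if ((bsky.description ?? "").toLowerCase().includes(`@${xHandle}`)) {
    criteria.push("HANDLE_IN_DESCRIPTION");
  }
  if (xHandle === bskyName) {
    criteria.push("SAME_HANDLE");
  }
  // Two empty display names should not count as a match.
  if (xDisplay !== "" && xDisplay === bsky.displayName?.toLowerCase()) {
    criteria.push("SAME_DISPLAY_NAME");
  }
  return criteria;
}
```

The UI could then render one tag or checkbox per entry in the returned array.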

alanfl · 2024-10-18T10:20:30Z

Maybe also as a side-feature, allow prioritization or weighting of matching criteria -- e.g., identical handle names can be set as "Strong similarity" and "Handle name in description" set as "Minor similarity".

The rendering might get crowded, but you could then display top-N matches below each profile on the "Following" page, rather than just having a single Bluesky account below the Twitter account.
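
A minimal sketch of how weighting and top-N ranking might fit together, assuming each candidate already carries the list of criteria it meets. The criterion names, the weight values, and the `Candidate` shape are placeholders, not anything from the current codebase.

```typescript
// Hypothetical criteria names and weights; the values are placeholders to tune.
type Criterion = "SAME_HANDLE" | "SAME_DISPLAY_NAME" | "HANDLE_IN_DESCRIPTION";

const CRITERION_WEIGHTS: Record<Criterion, number> = {
  SAME_HANDLE: 1.0, // "Strong similarity"
  SAME_DISPLAY_NAME: 0.6,
  HANDLE_IN_DESCRIPTION: 0.3, // "Minor similarity"
};

type Candidate = { handle: string; criteria: Criterion[] };

// Sum the weights of all criteria a candidate meets.
function weightedScore(criteria: Criterion[]): number {
  return criteria.reduce((sum, c) => sum + CRITERION_WEIGHTS[c], 0);
}

// Return the n highest-scoring candidates, dropping those that match nothing.
// filter() copies the array, so the caller's list is not reordered.
function topMatches(candidates: Candidate[], n: number): Candidate[] {
  return candidates
    .filter((c) => c.criteria.length > 0)
    .sort((a, b) => weightedScore(b.criteria) - weightedScore(a.criteria))
    .slice(0, n);
}
```

With these weights, a candidate that matches on display name and description (0.9) still ranks below one with an identical handle (1.0), which is the kind of ordering a user-adjustable setting would control.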

cooljeanius · 2024-10-18T21:43:54Z

> Maybe also as a side-feature, allow prioritization or weighting of matching criteria -- e.g., identical handle names can be set as "Strong similarity" and "Handle name in description" set as "Minor similarity".
>
> The rendering might get crowded, but you could then display top-N matches below each profile on the "Following" page, rather than just having a single Bluesky account below the Twitter account.

While this sounds like it could be interesting and useful, I worry that the added complexity that it would bring would be prohibitive to getting it to actually work...

PropertyOfMyCat · 2024-11-13T17:36:15Z

If you wanted to do fuzzy matching between display/handle names with similarity scoring, Levenshtein edit distance would be useful. There are implementations in most languages.

For TypeScript: levenshtein-edit-distance

(I'm not a TypeScript programmer, so sadly I can't help with this)

PropertyOfMyCat · 2024-11-18T13:09:26Z

I believe the change needed to support similarity scores, and Levenshtein edit distance specifically, belongs here:
src/lib/bskyHelpers.ts
It would replace the equality checks, for example:

  if (
    lowerCaseNames.accountName === bskyHandle ||
    lowerCaseNames.accountNameRemoveUnderscore === bskyHandle ||
    lowerCaseNames.accountNameReplaceUnderscore === bskyHandle
  ) {
    return {
      isSimilar: true,
      type: BSKY_USER_MATCH_TYPE.HANDLE,
    };
  }

Replace the equality tests with, for example:

const score = levenshteinEditDistance(lowerCaseNames.accountName, bskyHandle)

score === 0 means a perfect match (that is, the current behaviour). Values greater than zero indicate a fuzzy match, with larger values meaning less similar strings. The maximum accepted distance could be tunable. One could also compare every candidate, keep the one with the lowest edit distance, and return that best-matching profile together with its similarity score.
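
Since the issue asks for percentage similarities, here is a rough sketch of turning an edit distance into a 0 to 100 score. The `editDistance` function is a plain textbook implementation standing in for the library call so the snippet runs on its own. `similarityPercent` and the choice to normalise by the longer string's length are my assumptions; other normalisations are possible.

```typescript
// Textbook dynamic-programming Levenshtein distance, used here as a
// stand-in for the library call so the sketch is self-contained.
function editDistance(a: string, b: string): number {
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1, // deletion
          curr[j - 1] + 1, // insertion
          prev[j - 1] + cost, // substitution (or match)
        ),
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Map a distance to a percentage: 100 means identical strings.
// Normalising by the longer string's length is an assumption of this sketch.
function similarityPercent(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 100; // two empty strings are identical
  return Math.round((1 - editDistance(a, b) / longest) * 100);
}
```

Under this definition, "sitting" vs "kitten" (distance 3, longest length 7) scores 57, and a threshold such as "at least 80%" scales with handle length, unlike a fixed edit distance.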

More from the usage instructions:

import {levenshteinEditDistance} from 'levenshtein-edit-distance'

levenshteinEditDistance('levenshtein', 'levenshtein') // => 0
levenshteinEditDistance('sitting', 'kitten') // => 3
levenshteinEditDistance('gumbo', 'gambol') // => 2
levenshteinEditDistance('saturday', 'sunday') // => 3

// Insensitive to order:
levenshteinEditDistance('aarrgh', 'aargh') === levenshteinEditDistance('aargh', 'aarrgh') // => true

// Sensitive to ASCII casing by default:
levenshteinEditDistance('DwAyNE', 'DUANE') !== levenshteinEditDistance('dwayne', 'DuAnE') // => true
// Insensitive:
levenshteinEditDistance('DwAyNE', 'DUANE', true) === levenshteinEditDistance('dwayne', 'DuAnE', true) // => true

[Disclaimer: I'm not a TypeScript programmer!]

PropertyOfMyCat · 2024-11-18T16:35:58Z

This could be useful for comparing avatars: pixelmatch ("The smallest, simplest and fastest JavaScript pixel-level image comparison library"). I'm guessing that differences in image compression between Twitter and Bluesky on upload might introduce image artifacts that result in differences that otherwise would not occur, therefore it might be useful to greyscale and/or reduce the resolution of both images to the same for comparison. This would be a nice feature to have, but honestly this should be decoupled from the text matching part of this issue, which ought to be a lot easier to implement.
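
A rough sketch of the greyscale-and-downscale step described above, written against raw RGBA buffers and without calling pixelmatch itself. `RgbaImage`, `greyThumbnail`, `avatarDifference`, and the default thumbnail size of 8 are all hypothetical; decoding the avatar files into pixel buffers is assumed to happen elsewhere.

```typescript
// RGBA pixel buffer, 4 bytes per pixel in row-major order
// (the layout of canvas ImageData).
type RgbaImage = { data: Uint8ClampedArray; width: number; height: number };

// Convert to greyscale and shrink to size x size by averaging each block of source pixels.
function greyThumbnail(img: RgbaImage, size: number): number[] {
  const out: number[] = [];
  for (let ty = 0; ty < size; ty++) {
    const y0 = Math.floor((ty * img.height) / size);
    const y1 = Math.max(y0 + 1, Math.floor(((ty + 1) * img.height) / size));
    for (let tx = 0; tx < size; tx++) {
      const x0 = Math.floor((tx * img.width) / size);
      const x1 = Math.max(x0 + 1, Math.floor(((tx + 1) * img.width) / size));
      let sum = 0;
      for (let y = y0; y < y1; y++) {
        for (let x = x0; x < x1; x++) {
          const i = (y * img.width + x) * 4;
          // Rec. 601 luma weights; the alpha channel is ignored.
          sum += 0.299 * img.data[i] + 0.587 * img.data[i + 1] + 0.114 * img.data[i + 2];
        }
      }
      out.push(sum / ((y1 - y0) * (x1 - x0)));
    }
  }
  return out;
}

// Mean absolute greyscale difference, scaled to 0 (identical) .. 1 (black vs white).
function avatarDifference(a: RgbaImage, b: RgbaImage, size = 8): number {
  const ta = greyThumbnail(a, size);
  const tb = greyThumbnail(b, size);
  let total = 0;
  for (let i = 0; i < ta.length; i++) total += Math.abs(ta[i] - tb[i]);
  return total / (ta.length * 255);
}
```

Because both images are reduced to the same small grid first, the two avatars do not need to share a resolution, and small compression artifacts tend to average out. What difference counts as "same picture" would still need tuning on real avatars.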

cooljeanius · 2024-11-18T16:44:20Z

> This could be useful for comparing avatars: pixelmatch ("The smallest, simplest and fastest JavaScript pixel-level image comparison library"). I'm guessing that differences in image compression between Twitter and Bluesky on upload might introduce image artifacts that result in differences that otherwise would not occur, therefore it might be useful to greyscale and/or reduce the resolution of both images to the same for comparison. This would be a nice feature to have, but honestly this should be decoupled from the text matching part of this issue, which ought to be a lot easier to implement.

That should go in #13

PropertyOfMyCat · 2024-11-23T13:35:05Z

Here's a rudimentary PoC that I've tested and that works; this is the first time I've coded in this language, so don't shoot me if I got something wrong! It found more of my friends, but there were also more false positives, as expected. It returns a match, but not the BEST match; some kind of loop that finds the match with the lowest editDist is still needed. I'm a git newbie and don't know how to open a PR; sorry!

import type { ProfileView } from "@atproto/api/dist/client/types/app/bsky/actor/defs";
import { BSKY_USER_MATCH_TYPE } from "./constants";
import {levenshteinEditDistance} from 'levenshtein-edit-distance'

type xUserInfo = {
  bskyHandleInDescription: string;
  accountName: string;
  accountNameRemoveUnderscore: string;
  accountNameReplaceUnderscore: string;
  displayName: string;
};

export const isSimilarUser = (
  xUserInfo: xUserInfo,
  bskyProfile: ProfileView | undefined,
): {
  isSimilar: boolean;
  type: (typeof BSKY_USER_MATCH_TYPE)[keyof typeof BSKY_USER_MATCH_TYPE];
} => {
  if (!bskyProfile) {
    return {
      isSimilar: false,
      type: BSKY_USER_MATCH_TYPE.NONE,
    };
  }

  // this is to handle the case where the user has a bsky handle in their description
  if (xUserInfo.bskyHandleInDescription) {
    const formattedBskyHandle = bskyProfile.handle.replace("@", "");
    const formattedBskyHandleInDescription =
      xUserInfo.bskyHandleInDescription.replace("@", "");
    if (
      formattedBskyHandle === formattedBskyHandleInDescription ||
      formattedBskyHandle.includes(formattedBskyHandleInDescription)
    ) {
      return {
        isSimilar: true,
        type: BSKY_USER_MATCH_TYPE.HANDLE,
      };
    }
  }

  const lowerCaseNames = Object.entries(xUserInfo).reduce<xUserInfo>(
    (acc, [key, value]) => {
      if (!value) {
        return acc;
      }
      // Object.entries() types keys as string, so narrow back to the known fields
      acc[key as keyof xUserInfo] = value.toLowerCase();
      return acc;
    },
    {} as xUserInfo,
  );

  const bskyHandle = bskyProfile.handle
    .toLocaleLowerCase()
    .replace("@", "")
    .split(".")[0];

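  // Maximum Levenshtein distance still accepted as a match (0 = exact match only).
  // A fixed threshold is generous for short names, hence the extra false positives.
  const editDist = 2;

  // Fuzzy handle match: accept any handle variant within editDist edits,
  // skipping variants that are missing from xUserInfo
  if (
    [
      lowerCaseNames.accountName,
      lowerCaseNames.accountNameRemoveUnderscore,
      lowerCaseNames.accountNameReplaceUnderscore,
    ].some(
      (name) => !!name && levenshteinEditDistance(name, bskyHandle) <= editDist,
    )
  ) {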
    return {
      isSimilar: true,
      type: BSKY_USER_MATCH_TYPE.HANDLE,
    };
  }

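  // Fuzzy display-name match; skipped when either side has no display name,
  // because levenshteinEditDistance expects two strings
  const bskyDisplayName = bskyProfile.displayName?.toLocaleLowerCase();
  if (
    lowerCaseNames.displayName &&
    bskyDisplayName &&
    levenshteinEditDistance(lowerCaseNames.displayName, bskyDisplayName) <= editDist
  ) {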
    return {
      isSimilar: true,
      type: BSKY_USER_MATCH_TYPE.DISPLAY_NAME,
    };
  }

  if (
    bskyProfile.description
      ?.toLocaleLowerCase()
      .includes(`@${lowerCaseNames.accountName}`) &&
    !["pfp ", "pfp: ", "pfp by "].some((t) =>
      bskyProfile.description
        .toLocaleLowerCase()
        .includes(`${t}@${lowerCaseNames.accountName}`),
    )
  ) {
    return {
      isSimilar: true,
      type: BSKY_USER_MATCH_TYPE.DESCRIPTION,
    };
  }

  return {
    isSimilar: false,
    type: BSKY_USER_MATCH_TYPE.NONE,
  };
};

PropertyOfMyCat · 2024-11-24T19:38:55Z

I don't know this programming language well enough to implement this myself, but I'm guessing the logic should be something like this:

in [searchBskyUsers.ts](https://github.com/kawamataryo/sky-follower-bridge/blob/main/src/lib/searchBskyUsers.ts):

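initialise maxEditDist = <some small integer, well below the 15-character limit for Twitter handles>
loop over the Twitter followings list
  initialise editDist = 0
  loop while editDist <= maxEditDist
    loop over candidate Bluesky profiles
      if distance(Twitter handle, Bluesky handle) <= editDist: return this profile, with editDist as the similarity score
    no match at this distance: editDist += 1
  no match within maxEditDist: report no match for this following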

I think this should find the Bluesky user with the smallest edit distance (= best match) for each Twitter following.
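
For what it's worth, the same result can probably be had in a single pass over the candidates by tracking the smallest distance seen, instead of raising editDist step by step. A rough TypeScript sketch follows; `BskyCandidate`, `Distance`, and `findBestMatch` are names I made up, and the distance function is passed in so that `levenshteinEditDistance` from the package (or any other metric) can be plugged in.

```typescript
// Hypothetical minimal shape for a candidate; the real ProfileView has more fields.
type BskyCandidate = { handle: string };
type Distance = (a: string, b: string) => number;
type BestMatch = { candidate: BskyCandidate; editDist: number } | null;

// Scan all candidates once and keep the one with the smallest distance.
// Returns null when no candidate is within maxEditDist.
function findBestMatch(
  xHandle: string,
  candidates: BskyCandidate[],
  distance: Distance,
  maxEditDist: number,
): BestMatch {
  const target = xHandle.toLowerCase().replace(/^@/, "");
  let best: BestMatch = null;
  for (const candidate of candidates) {
    // Compare against the first label of the handle, e.g. "alice" in "alice.bsky.social".
    const bskyName = candidate.handle.toLowerCase().replace(/^@/, "").split(".")[0];
    const editDist = distance(target, bskyName);
    if (editDist > maxEditDist) continue;
    if (best === null || editDist < best.editDist) {
      best = { candidate, editDist };
    }
    if (editDist === 0) break; // cannot do better than an exact match
  }
  return best;
}
```

This would be called once per Twitter following, e.g. `findBestMatch(accountName, profiles, levenshteinEditDistance, 2)`, and the returned `editDist` doubles as the similarity score. Ties go to whichever candidate appears first in the list.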

kawamataryo added the enhancement New feature or request label Dec 11, 2023

cooljeanius mentioned this issue Nov 11, 2024

Drop down menu for similar accounts #69

Closed

cooljeanius mentioned this issue Nov 18, 2024

idea: compare avatars #13

Open
