Feature Request: Similarity scores #33

Open
2meito opened this issue Dec 9, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@2meito

2meito commented Dec 9, 2023

Instead of relying only on strict equality checks, percentage similarity scores would surface more potential matches.

@kawamataryo added the enhancement (New feature or request) label on Dec 11, 2023
@cooljeanius

Along these lines: showing separate checkboxes (or tags?) for the separate criteria that each account meets would be helpful, too. So, for example, instead of just showing one of the following, show each of them simultaneously:

  • Included handle name in description
  • Same handle name
  • Same display name
  • Same profile picture (as described in issue idea: compare avatars #13)
  • Any others I'm forgetting
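As a rough illustration of reporting every criterion at once, the check could return a list of matched criteria rather than a single type. This is only a sketch: `XUser`, `BskyUser`, `Criterion` and `matchedCriteria` are hypothetical names, not the extension's actual types, and the avatar criterion is left out because it needs image data.

```typescript
// Hypothetical, simplified shapes; the real extension uses its own types.
interface XUser {
  handle: string;
  displayName: string;
}

interface BskyUser {
  handle: string; // e.g. "alice.bsky.social"
  displayName?: string;
  description?: string;
}

type Criterion = "SAME_HANDLE" | "SAME_DISPLAY_NAME" | "HANDLE_IN_DESCRIPTION";

// Returns every criterion the pair satisfies instead of stopping at the first.
function matchedCriteria(x: XUser, bsky: BskyUser): Criterion[] {
  const xHandle = x.handle.toLowerCase().replace("@", "");
  // Compare against the first label of the Bluesky handle only.
  const bskyName = bsky.handle.toLowerCase().replace("@", "").split(".")[0];
  const matched: Criterion[] = [];

  if (xHandle !== "" && xHandle === bskyName) {
    matched.push("SAME_HANDLE");
  }

  const xDisplay = x.displayName.trim().toLowerCase();
  const bskyDisplay = bsky.displayName?.trim().toLowerCase() ?? "";
  if (xDisplay !== "" && xDisplay === bskyDisplay) {
    matched.push("SAME_DISPLAY_NAME");
  }

  const description = (bsky.description ?? "").toLowerCase();
  if (xHandle !== "" && description.includes(`@${xHandle}`)) {
    matched.push("HANDLE_IN_DESCRIPTION");
  }

  return matched;
}
```

The UI could then render one checkbox or tag per entry in the returned array.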

@alanfl

alanfl commented Oct 18, 2024

Maybe also as a side-feature, allow prioritization or weighting of matching criteria -- e.g., identical handle names can be set as "Strong similarity" and "Handle name in description" set as "Minor similarity".

The rendering might get crowded, but you could then display top-N matches below each profile on the "Following" page, rather than just having a single Bluesky account below the Twitter account.
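A minimal sketch of the weighting and top-N idea, assuming each candidate already carries the list of criteria it met. The criterion names, the weight values and `topMatches` are made up for illustration; real weights would presumably be user-configurable.

```typescript
type Criterion = "handle" | "displayName" | "description";

// Hypothetical default weights: a higher number means a stronger signal.
const DEFAULT_WEIGHTS: Record<Criterion, number> = {
  handle: 3, // "Strong similarity"
  displayName: 2,
  description: 1, // "Minor similarity"
};

interface Candidate {
  bskyHandle: string;
  matched: Criterion[];
}

interface ScoredCandidate extends Candidate {
  score: number;
}

// Scores each candidate by summing the weights of its matched criteria,
// drops candidates that matched nothing, and keeps the n best.
function topMatches(
  candidates: Candidate[],
  n: number,
  weights: Record<Criterion, number> = DEFAULT_WEIGHTS,
): ScoredCandidate[] {
  return candidates
    .map((c) => ({
      ...c,
      score: c.matched.reduce((sum, m) => sum + weights[m], 0),
    }))
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, n);
}
```

Calling `topMatches(candidates, 3)` for each Twitter account would give the short ranked list to show below it on the "Following" page.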

@cooljeanius

> Maybe also as a side-feature, allow prioritization or weighting of matching criteria -- e.g., identical handle names can be set as "Strong similarity" and "Handle name in description" set as "Minor similarity".
>
> The rendering might get crowded, but you could then display top-N matches below each profile on the "Following" page, rather than just having a single Bluesky account below the Twitter account.

While this sounds like it could be interesting and useful, I worry that the added complexity that it would bring would be prohibitive to getting it to actually work...

@PropertyOfMyCat

PropertyOfMyCat commented Nov 13, 2024

If you wanted to do fuzzy matching between display/handle names with similarity scoring, Levenshtein edit distance would be useful. There are implementations in most languages.

For TypeScript: levenshtein-edit-distance

(I'm not a TypeScript programmer, so sadly I can't help with this)

@PropertyOfMyCat

I believe the change needed to support similarity scores, and Levenshtein edit distance specifically, belongs here:
src/lib/bskyHelpers.ts
It would replace the equality checks, for example:

  if (
    lowerCaseNames.accountName === bskyHandle ||
    lowerCaseNames.accountNameRemoveUnderscore === bskyHandle ||
    lowerCaseNames.accountNameReplaceUnderscore === bskyHandle
  ) {
    return {
      isSimilar: true,
      type: BSKY_USER_MATCH_TYPE.HANDLE,
    };
  }

Replace the tests for equality with, for example:

const score = levenshteinEditDistance(lowerCaseNames.accountName, bskyHandle)

score === 0 implies perfect match (that is, the current behaviour). Values greater than zero imply fuzzy matching. This could be tuneable. One could keep searching until a match is found with the lowest edit distance and return the best matching profile and the similarity score(s) for a match.
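To tie this back to the percentage similarity asked for at the top of the issue, the edit distance can be normalised by the length of the longer string. The sketch below is self-contained, so it uses a small hand-written `editDistance` as a stand-in for the library call; `similarity` is a hypothetical helper, and dividing by the longer length is just one common normalisation.

```typescript
// Plain Levenshtein distance using a single rolling row of the DP table.
// Stand-in for levenshteinEditDistance so the example has no dependencies.
function editDistance(a: string, b: string): number {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // holds dp[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // holds dp[i-1][j]
      row[j] = Math.min(
        above + 1, // deletion
        row[j - 1] + 1, // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// Similarity in [0, 1]: 1 means identical, 0 means nothing in common.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) {
    return 1; // two empty strings are trivially identical
  }
  return 1 - editDistance(a, b) / longest;
}
```

A threshold such as `similarity(a, b) >= 0.8` would then be the tuneable part, and it scales with name length, unlike a fixed edit distance.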

More from the usage instructions:

import {levenshteinEditDistance} from 'levenshtein-edit-distance'

levenshteinEditDistance('levenshtein', 'levenshtein') // => 0
levenshteinEditDistance('sitting', 'kitten') // => 3
levenshteinEditDistance('gumbo', 'gambol') // => 2
levenshteinEditDistance('saturday', 'sunday') // => 3

// Insensitive to order:
levenshteinEditDistance('aarrgh', 'aargh') === levenshteinEditDistance('aargh', 'aarrgh') // => true

// Sensitive to ASCII casing by default:
levenshteinEditDistance('DwAyNE', 'DUANE') !== levenshteinEditDistance('dwayne', 'DuAnE') // => true
// Insensitive:
levenshteinEditDistance('DwAyNE', 'DUANE', true) === levenshteinEditDistance('dwayne', 'DuAnE', true) // => true

[Disclaimer: I'm not a TypeScript programmer!]

@PropertyOfMyCat

This could be useful for comparing avatars: pixelmatch ("The smallest, simplest and fastest JavaScript pixel-level image comparison library"). I'm guessing that differences in image compression between Twitter and Bluesky on upload might introduce image artifacts that result in differences that otherwise would not occur, therefore it might be useful to greyscale and/or reduce the resolution of both images to the same for comparison. This would be a nice feature to have, but honestly this should be decoupled from the text matching part of this issue, which ought to be a lot easier to implement.
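The greyscale-and-downscale idea can be sketched without any library. Note that this does not use pixelmatch; it is a hand-rolled stand-in that averages each avatar into a small greyscale grid and compares the grids. `RgbaImage`, `toGreyGrid` and `avatarDifference` are hypothetical names, the input is assumed to be raw RGBA bytes (as a canvas `getImageData` call would give), and alpha is ignored.

```typescript
// Assumed input shape: raw RGBA bytes, 4 per pixel, row by row.
interface RgbaImage {
  width: number;
  height: number;
  data: Uint8ClampedArray;
}

// Averages the image into a size x size grid of greyscale values (0..255).
function toGreyGrid(img: RgbaImage, size: number): number[] {
  const sums = new Array<number>(size * size).fill(0);
  const counts = new Array<number>(size * size).fill(0);
  for (let y = 0; y < img.height; y++) {
    for (let x = 0; x < img.width; x++) {
      const i = (y * img.width + x) * 4;
      // Standard luma weights for converting RGB to grey.
      const grey =
        0.299 * img.data[i] + 0.587 * img.data[i + 1] + 0.114 * img.data[i + 2];
      const cell =
        Math.floor((y * size) / img.height) * size +
        Math.floor((x * size) / img.width);
      sums[cell] += grey;
      counts[cell] += 1;
    }
  }
  return sums.map((sum, k) => (counts[k] > 0 ? sum / counts[k] : 0));
}

// Mean absolute difference between the two grids, scaled to 0..1
// (0 = same after downscaling, 1 = solid black versus solid white).
function avatarDifference(a: RgbaImage, b: RgbaImage, size = 8): number {
  const gridA = toGreyGrid(a, size);
  const gridB = toGreyGrid(b, size);
  let total = 0;
  for (let k = 0; k < gridA.length; k++) {
    total += Math.abs(gridA[k] - gridB[k]);
  }
  return total / (gridA.length * 255);
}
```

Because both images are reduced to the same grid, avatars stored at different resolutions can be compared directly, and small compression artifacts mostly average out.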

@cooljeanius

> This could be useful for comparing avatars: pixelmatch ("The smallest, simplest and fastest JavaScript pixel-level image comparison library"). I'm guessing that differences in image compression between Twitter and Bluesky on upload might introduce image artifacts that result in differences that otherwise would not occur, therefore it might be useful to greyscale and/or reduce the resolution of both images to the same for comparison. This would be a nice feature to have, but honestly this should be decoupled from the text matching part of this issue, which ought to be a lot easier to implement.

That should go in #13

@PropertyOfMyCat

Here's a rudimentary PoC that I've tested and that works; this is the first time I've coded in this language, so don't shoot me if I got something wrong! It found more of my friends, but there were also more false positives, as expected. It returns a match, but not the BEST match; some kind of loop that finds the match with the lowest editDist is still needed. I'm a git newbie and don't know how to open a PR; sorry!

import type { ProfileView } from "@atproto/api/dist/client/types/app/bsky/actor/defs";
import { BSKY_USER_MATCH_TYPE } from "./constants";
import {levenshteinEditDistance} from 'levenshtein-edit-distance'

type xUserInfo = {
  bskyHandleInDescription: string;
  accountName: string;
  accountNameRemoveUnderscore: string;
  accountNameReplaceUnderscore: string;
  displayName: string;
};

export const isSimilarUser = (
  xUserInfo: xUserInfo,
  bskyProfile: ProfileView | undefined,
): {
  isSimilar: boolean;
  type: (typeof BSKY_USER_MATCH_TYPE)[keyof typeof BSKY_USER_MATCH_TYPE];
} => {
  if (!bskyProfile) {
    return {
      isSimilar: false,
      type: BSKY_USER_MATCH_TYPE.NONE,
    };
  }

  // this is to handle the case where the user has a bsky handle in their description
  if (xUserInfo.bskyHandleInDescription) {
    const formattedBskyHandle = bskyProfile.handle.replace("@", "");
    const formattedBskyHandleInDescription =
      xUserInfo.bskyHandleInDescription.replace("@", "");
    if (
      formattedBskyHandle === formattedBskyHandleInDescription ||
      formattedBskyHandle.includes(formattedBskyHandleInDescription)
    ) {
      return {
        isSimilar: true,
        type: BSKY_USER_MATCH_TYPE.HANDLE,
      };
    }
  }

  const lowerCaseNames = Object.entries(xUserInfo).reduce<xUserInfo>(
    (acc, [key, value]) => {
      if (!value) {
        return acc;
      }
      acc[key] = value.toLowerCase();
      return acc;
    },
    {} as xUserInfo,
  );

  const bskyHandle = bskyProfile.handle
    .toLocaleLowerCase()
    .replace("@", "")
    .split(".")[0];

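  // Maximum edit distance still accepted as a match (0 would mean exact matches only).
  const editDist = 2;

  // Names skipped by the reduce above, or a missing Bluesky display name, are
  // undefined at runtime, so treat them as non-matching instead of passing
  // them to levenshteinEditDistance.
  const isWithinEditDist = (a: string | undefined, b: string | undefined) =>
    !!a && !!b && levenshteinEditDistance(a, b) <= editDist;

  if (
    isWithinEditDist(lowerCaseNames.accountName, bskyHandle) ||
    isWithinEditDist(lowerCaseNames.accountNameRemoveUnderscore, bskyHandle) ||
    isWithinEditDist(lowerCaseNames.accountNameReplaceUnderscore, bskyHandle)
  ) {
    return {
      isSimilar: true,
      type: BSKY_USER_MATCH_TYPE.HANDLE,
    };
  }

  if (
    isWithinEditDist(
      lowerCaseNames.displayName,
      bskyProfile.displayName?.toLocaleLowerCase(),
    )
  ) {
    return {
      isSimilar: true,
      type: BSKY_USER_MATCH_TYPE.DISPLAY_NAME,
    };
  }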

  if (
    bskyProfile.description
      ?.toLocaleLowerCase()
      .includes(`@${lowerCaseNames.accountName}`) &&
    !["pfp ", "pfp: ", "pfp by "].some((t) =>
      bskyProfile.description
        .toLocaleLowerCase()
        .includes(`${t}@${lowerCaseNames.accountName}`),
    )
  ) {
    return {
      isSimilar: true,
      type: BSKY_USER_MATCH_TYPE.DESCRIPTION,
    };
  }

  return {
    isSimilar: false,
    type: BSKY_USER_MATCH_TYPE.NONE,
  };
};

@PropertyOfMyCat

I don't know this programming language well enough to implement this myself, but I'm guessing the logic should be something like this:

in [searchBskyUsers.ts](https://github.com/kawamataryo/sky-follower-bridge/blob/main/src/lib/searchBskyUsers.ts):

  • initialise editDist = 0
  • initialise maxEditDist = <some small integer, well below the 15-character limit for Twitter handles>
  • loop over the Twitter followings list
  • for each following, loop over editDist (while (editDist <= maxEditDist))
  • loop over the candidate Bluesky profiles
  • match at the current editDist? If so, stop; if no candidate matches, editDist += 1
  • return editDist as the similarity score

I think this should find the bsky user with the smallest edit distance (=best match) for each twitter following.
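One possible shape for that search is sketched below. Instead of looping over editDist, it makes a single pass over the candidates and keeps the smallest distance seen, which should pick the same best match with fewer comparisons. `findBestMatch`, `BestMatch` and the `distance` parameter are hypothetical; in the extension, `levenshteinEditDistance` could be passed in as `distance`.

```typescript
// Any string distance works here, e.g. levenshteinEditDistance.
type Distance = (a: string, b: string) => number;

interface BestMatch {
  handle: string; // the full Bluesky handle that matched
  distance: number; // edit distance, usable as the similarity score
}

// Returns the candidate with the smallest distance to the Twitter handle,
// or undefined when no candidate is within maxEditDist.
function findBestMatch(
  xHandle: string,
  bskyHandles: string[],
  distance: Distance,
  maxEditDist = 2,
): BestMatch | undefined {
  const target = xHandle.toLowerCase().replace("@", "");
  let best: BestMatch | undefined;

  for (const handle of bskyHandles) {
    // Compare against the first label only, e.g. "alice" in "alice.bsky.social".
    const name = handle.toLowerCase().replace("@", "").split(".")[0];
    const d = distance(target, name);
    if (d <= maxEditDist && (best === undefined || d < best.distance)) {
      best = { handle, distance: d };
      if (d === 0) {
        break; // an exact match cannot be beaten
      }
    }
  }

  return best;
}
```

Run once per Twitter following, this returns both the best-matching profile and its score, and ties go to whichever candidate appears first in the list.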


5 participants