
Remove duplicates on large number of files #568

Open
im7mortal opened this issue Jan 17, 2024 · 10 comments
Labels
photos (Relates to the Ente Photos) · mobile (Platform is mobile) · web (Platform is web)

Comments

@im7mortal

I am trying to sync my Google zip file with Ente. I tried it 2 years ago (#ente-io/photos-web/issues/243) and had problems.
It turned out that it created a new file every time. Now I have 2 or 3 duplicates for at least 15,000 files.
[screenshot: ENTE_SELECTION]

  1. I just don't want to do it again. It's exhausting, and the interface doesn't help.
  2. Revising it manually is a huge pain. I tried, and after an hour the app went down because it exceeded memory.
  3. I can't trust Ente, since you compare duplicates by size, and I saw many duplicates involving photos that are important to me.

Here is how I see the possible solutions:

  1. Maybe limit duplicates to chunks, for example 1,000 duplicates at a time. Then it would be possible to remove one chunk and start the next.
  2. The memory crash should be debugged.
  3. Add a more sophisticated filter to the duplicate search: name, size, and I believe you should also have some kind of hash?

Some more history: I tried to migrate to Ente in 2021 but couldn't because of bugs. Since then my phone has been connected to Ente, so I can't just remove all files and start the Google zip import from scratch.

@im7mortal
Author

[screenshot: 20240116_214541]

Like here: the photo on the left is some trash screenshot, but it has the same size as an important photo.

@abhinavkgrd
Contributor

The dedupe logic is the same across the web and desktop. Moving the issue to photos-web repo

@abhinavkgrd abhinavkgrd transferred this issue from ente-io/photos-desktop Jan 18, 2024
@vishnukvmd
Member

@im7mortal hey, we've updated the implementation to use file-hashes instead of file-sizes + creation-times, since the former was more reliable.

You might still be seeing the older flow, since we might not have computed the hashes for items that were uploaded in the past. If you clear your library and re-upload, the experience will be as you expect it to be.

If you run into a crash, please share logs (Help > View logs) with [email protected], we'd love to take a look!
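[Editor's note] The hash-based approach described above can be sketched as follows. This is a minimal illustration in Python of grouping files by content hash rather than by size + creation time, not Ente's actual implementation (Ente's clients and choice of hash may differ):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large libraries don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_exact_duplicates(paths):
    """Group files whose contents are byte-identical.

    Two files with the same size (like the screenshot vs. the important
    photo above) land in different groups unless their bytes match.
    """
    groups = defaultdict(list)
    for p in paths:
        groups[file_hash(Path(p))].append(p)
    return [g for g in groups.values() if len(g) > 1]
```

Because only byte-identical files share a hash, this avoids the false positives that a size-only comparison produces.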

@im7mortal
Author

> If you clear your library and re-upload, the experience will be as you expect it to be.

As I mentioned, I also used ente.io on my phone for some time, and I have other photos there that I can't lose.

Is there an API to update the hash? I could run a long task on my laptop.

@vishnukvmd
Member

We don't have an API to update the hash at the moment :(

Could you try running the de-dupe on your mobile device? There we only process those items that have a hash available.

In the future we intend to provide a way to de-dupe "similar" images, that would solve for this use case, but we don't have a clear timeline for that at the moment.

@im7mortal
Author

I tried it overnight in the Android app. It showed me a spinner; I waited some time and then went to sleep. As I just mentioned in photos-app#1380, it wasn't clear whether Ente was doing anything or not.

When I got up, it showed that it had found only 2 duplicates, which I uploaded this week.

@vishnukvmd
Member

Sounds like there are only 2 photos that are exact duplicates (with a matching hash)?

@mnvr mnvr transferred this issue from ente-io/photos-web Mar 3, 2024
@mnvr mnvr added the photos, web, and mobile labels Mar 3, 2024
@im7mortal
Author

im7mortal commented Sep 30, 2024

Hi @vishnukvmd, @mnvr , @abhinavkgrd !

I haven't tried to clean up duplicates since we discussed it.

I saw there is an AI feature now that processes images locally. So I am wondering: why not also generate hashes for all images in the same run?

@mnvr
Member

mnvr commented Oct 17, 2024

Yes. No ETA, but we've sketched out some approaches for using some form of perceptual hashing or cosine-similarity-based deduping, and we hope to get around to working on it at some point soon.
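[Editor's note] A minimal sketch of the perceptual-hashing idea mentioned above: an average hash (aHash) over a tiny grayscale grid, compared via Hamming distance. Real implementations first downscale and desaturate the image (typically to 8×8), and Ente's eventual approach may differ; this only illustrates the technique:

```python
def average_hash(pixels):
    """aHash: set one bit per pixel of a small grayscale grid, 1 if the
    pixel is at or above the grid's mean brightness, else 0.

    `pixels` is a list of rows of grayscale values; a real implementation
    would produce this grid by downscaling and desaturating the image.
    """
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v >= mean else 0)
    return bits

def hamming_distance(h1, h2):
    """Count differing bits; a small distance means perceptually similar."""
    return bin(h1 ^ h2).count("1")
```

Unlike an exact content hash, near-identical images (re-encodes, slight crops) produce hashes within a small Hamming distance of each other, which is what lets "similar" rather than byte-identical duplicates be grouped.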

@im7mortal
Author

im7mortal commented Oct 20, 2024

@mnvr

What I understand from this comment: Ente creates a simple hash on first upload of an image. So I was advised to erase all photos and re-upload them to generate hashes for all the old photos.

I understand that, and I don't want to erase my photos and redo it manually, but now ⬇️⬇️⬇️

[screenshot: 111]

Why not generate the hash during indexing and update the metadata? It would be a quick, effective solution.

I mean, compared with perceptual hashing or cosine similarity, which are more sophisticated, plain hash generation is extremely straightforward and would not take long to implement.
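[Editor's note] The backfill pass this comment proposes could look roughly like the sketch below. `compute_hash` and `update_metadata` are hypothetical stand-ins, since (per the earlier replies in this thread) no such API exists yet; this only illustrates the shape of the suggestion:

```python
def backfill_hashes(items, compute_hash, update_metadata):
    """Hypothetical indexing-time backfill: for every library item that has
    no content hash yet, compute one and persist it via the metadata store.

    `items` is an iterable of dicts with a "hash" key; `compute_hash` and
    `update_metadata` stand in for whatever the client's indexer and
    metadata API would actually expose.
    """
    updated = 0
    for item in items:
        if item.get("hash") is None:
            item["hash"] = compute_hash(item)
            update_metadata(item)
            updated += 1
    return updated
```

After such a pass, every item would have a hash, so the hash-based dedupe flow would cover old uploads without erasing and re-uploading the library.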
