feat: build a separate CUDA-supporting versions of rclip for Linux & Windows #139

Open
yurijmikhalevich opened this issue Sep 20, 2024 · 16 comments
Labels
feat New feature or request hacktoberfest priority:low Low priority issues

Comments

@yurijmikhalevich
Owner

The request came from the comment to this video: https://www.youtube.com/watch?v=tAJHXOkHidw&lc=Ugzh-lfA1HLsy8V9rpl4AaABAg. Remember to close the loop by letting the requester know once this is done.

@yurijmikhalevich yurijmikhalevich added feat New feature or request hacktoberfest priority:low Low priority issues labels Sep 20, 2024
@FlowDownTheRiver

FlowDownTheRiver commented Dec 29, 2024

I downloaded the MSI for Windows and then realized it runs on the CPU. I wanted to install PyTorch into the internal (env) directory to enable CUDA. I saw lots of errors, then realized that the precompiled exe and its directory were built for Python 3.9, while I had Python 3.10. I installed Python 3.9 and tried to find a CUDA-enabled PyTorch build from the Python 3.9 days. I tried several versions, tried to fix Pillow and NumPy version errors, and after maybe 5 hours I gave up.

Is it possible for you to update the project to Python 3.10+, or is it possible for you to release a precompiled version for Windows which supports CUDA? I mean, CPU scanning is not so bad, but compared to CUDA it is extremely slow, as in many applications (I assume the same holds for your project).

Also big thanks for your great work!

@FlowDownTheRiver

FlowDownTheRiver commented Dec 29, 2024

Rclip_GUI_Final_v6.zip
I also want to contribute with a GUI especially for windows users. Feel free to include it in your project or change things. I used Gemini, ChatGPT, Claude, and Qwen to make it somewhat functional. It has a preview window that shows results, clickable thumbnails, thumbnail caching, auto-fit thumbnails, and thumbnail resizing. I wasn't able to make it copy the search results, so that I can copy what I find with rclip, but there is a button in the preview window for that (which needs fixing). I also added drive modes to check specific drives or all drives, but I realized rclip doesn't work on the main directory of the drives, so that also needs to be addressed. I wanted to add a progress bar with tqdm to the UI, shown while indexing or searching, but I failed at that as well. That would be nice to have.

However, it is a functional GUI. Feel free to include it, fix it, or add new features if you like. Please do not expect me to open a PR for it because I am not familiar with PRs. @yurijmikhalevich

@yurijmikhalevich
Owner Author

yurijmikhalevich commented Dec 29, 2024

Hi @FlowDownTheRiver! Thank you for using rclip and for sharing your thoughts!

Is it possible for you to update the project to python 3.10

Yes. The next Windows release version will be using Python 3.10.

is it possible for you to release a pre compiled version for windows which supports cuda?

This is in the plans, but I don't have a timeline yet. I'll let you know once this is available.

I also want to contribute with a GUI especially for windows users.

Thank you for sharing this! I'll check it out, but no promises to add it into the rclip.

I realized rclip doesn't work on the main directory of the drives

Do you mean that you can't run rclip on multiple Windows drives at the same time?

@FlowDownTheRiver

FlowDownTheRiver commented Dec 29, 2024

@yurijmikhalevich
Yeah, I tried the rclip command in Windows cmd at the very root of a drive, like
C:

D:

etc. It didn't work. I saw that it had problems with the System Volume Information folder, a hidden folder that Windows creates on each drive (Windows 7, Windows 10, and probably Windows 11 as well). So in some of my trials I tried to ignore system folders, temp folders, etc. Windows locks these folders, and rclip had trouble with them.

Also, I made the GUI compatible with the UTF-8 charset, so a very large variety of characters is supported in folder names and filenames. I tried a few tricks to make it scan the whole computer, but I failed and gave up.

BTW, I recently saw that you support the rclip "image_path" option for similarity search, at least by keyword matching. This isn't included in the GUI. I watched your video where you demonstrate it, and I have a question about it.

if we use image path in a folder under these conditions :

  1. let's say we have 60.000 images in a folder(or subfolders where we call rclip) and 15.000 of the images are indexed and 45.000 images are not yet indexed.

  2. if we call : rclip "image_path" -n , will it only index the "image_path" and search among the 15.000 images? Or does no indexing also ignore the "image_path"? If so, there needs to be a workaround, like copying the "image_path" to a unique temp folder on the system drive, indexing it there (a single image), and changing the command line to: rclip "unique_temp_folder\abc.jpg" -n. That would index only the single search image without indexing the folder where rclip is launched (45.000 images left to index).

This would speed up image searching in folders or destinations that are not completely indexed.

  3. if we call : rclip "image_path" . will it index the "image_path" and also index the rest of the images(45.000 of non indexed) and then compare the "image_path" vs 60.000(total images) ?

Anyway, I will look out for your update, especially CUDA support, which is essential. Thank you for taking the time to write a detailed answer.

"Do you mean that you can't run rclip on multiple Windows drives at the same time?"

In short, the answer is yes: it does not work on multiple drives or on the base folder of a single drive.

@FlowDownTheRiver

[Screenshot: 2024-12-29 19_00_40-C__Windows_System32_cmd exe]

It looks like this:

Microsoft Windows [Version 10.0.19045.3758]
(c) Microsoft Corporation. All rights reserved.

M:>rclip forest
checking images in the current directory for changes; use "--no-indexing" to skip this if no images were added, changed, or removed
0images [00:00, ?images/s]Exception in thread Thread-2:

Traceback (most recent call last):
Traceback (most recent call last):
File "threading.py", line 980, in _bootstrap_inner
File "main.py", line 271, in
File "threading.py", line 917, in run
File "main.py", line 244, in main
File "fs.py", line 15, in count_files
File "main.py", line 231, in init_rclip
File "fs.py", line 32, in walk
File "main.py", line 131, in ensure_index
PermissionError: [WinError 5] Access is denied: 'M:\System Volume Information'
File "fs.py", line 32, in walk
PermissionError: [WinError 5] Access is denied: 'M:\System Volume Information'
[PYI-25368:ERROR] Failed to execute script 'main' due to unhandled exception!

M:>rclip "forest"
checking images in the current directory for changes; use "--no-indexing" to skip this if no images were added, changed, or removed
0images [00:00, ?images/s]Exception in thread Thread-2:
0images [00:00, ?images/s]Traceback (most recent call last):

File "threading.py", line 980, in _bootstrap_inner
Traceback (most recent call last):
File "main.py", line 271, in
File "threading.py", line 917, in run
File "main.py", line 244, in main
File "fs.py", line 15, in count_files
File "main.py", line 231, in init_rclip
File "fs.py", line 32, in walk
File "main.py", line 131, in ensure_index
PermissionError: [WinError 5] Access is denied: 'M:\System Volume Information'
File "fs.py", line 32, in walk
PermissionError: [WinError 5] Access is denied: 'M:\System Volume Information'
[PYI-15304:ERROR] Failed to execute script 'main' due to unhandled exception!
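
The traceback shows the directory walk raising on the protected System Volume Information folder. A minimal sketch of a walk that skips unreadable directories instead of crashing is below. It assumes nothing about rclip's actual fs.py beyond what the traceback shows; the name `walk_images` and the extension list are hypothetical.

```python
import os
from typing import Iterator

# Hypothetical extension list, for illustration only.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def walk_images(root: str) -> Iterator[str]:
    """Yield image paths under root, skipping directories that cannot be read."""
    try:
        with os.scandir(root) as it:
            entries = list(it)
    except OSError:
        # Covers PermissionError on folders such as 'System Volume Information',
        # as well as directories that disappear mid-scan: skip and carry on.
        return
    for entry in entries:
        try:
            if entry.is_dir(follow_symlinks=False):
                yield from walk_images(entry.path)
            elif entry.name.lower().endswith(IMAGE_EXTENSIONS):
                yield entry.path
        except OSError:
            # A single unreadable entry should not abort the whole scan.
            continue
```

With a walk like this, running from the root of a drive would simply leave the locked system folders out of the index.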

@yurijmikhalevich
Owner Author

@FlowDownTheRiver,

I tried rclip command on windows cmd on the very base of a drive like
C:

D:

etc. it didn't work. I have seen that it had problems with systemvolumeinformation folder.

Got it! This context is extremely helpful. I have created a ticket to track this issue here. Thank you!

if we call : rclip "image_path" -n , will it only index the "image_path" and search among the 15.000 images?

This is correct. The reference image is always "indexed." You can even link it via an HTTP URL.

if we call : rclip "image_path" . will it index the "image_path" and also index the rest of the images(45.000 of non indexed) and then compare the "image_path" vs 60.000(total images) ?

This is also correct.

Let me know if you have any other questions or feedback! :)

@FlowDownTheRiver

FlowDownTheRiver commented Dec 29, 2024

I will try to test further. I just discovered the repo, even though you wrote it a long time ago, and you are still active on it.
One suggestion would be a new command-line argument, something like this:

Filename weighted search. When we run a search, it gives us a probability-like number for each result: 0.53, 0.20, etc.
By default, the CLIP search would carry 100% of the weight.
If we enable filename-aided (weighted) search with --filename 10 (10% weight on the filename, 90% weight on the CLIP probability), the filename can be matched against positive and negative queries with regular expressions, checking whether it includes the query or not. --filename 50 would put 50% on each.

This can do both CLIP and filename searching with some weights.

Here is another idea of filtering. --filter (type,filename,etc)
If we filter on the filename, this would first check whether the filename includes the query, and then run the CLIP search only among those images.

A bit of extra work for you, if you want to invest in it :)

Maybe one of these ideas is good and the other is not so useful, or both are great to have. I haven't evaluated the benefits, but these are some ideas...
@yurijmikhalevich

@yurijmikhalevich
Owner Author

@FlowDownTheRiver, thank you for sharing these! 🙌

Filename weighted search.

rclip supports weighted search

  • any query can be prefixed with a multiplier, e.g. "2:cat", "0.5:./cat-sleeps-on-a-chair.jpg";
    adding a multiplier is especially useful when combining image and text queries because image
    queries are usually weighted more than text ones

You can do:

rclip "2:cat" + "0.5:./cat-sleeps-on-a-chair.jpg"

To combine multiple (filename and not) queries with different weights.
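
For illustration, the multiplier prefix described above could be parsed roughly like this. This is a sketch, not rclip's actual parser; `parse_weighted_query` is a hypothetical name, and it assumes a weight is a plain non-negative number followed by a colon.

```python
import re
from typing import Tuple

# A number (optionally fractional) followed by ":" at the start of the query.
_WEIGHT_PREFIX = re.compile(r"^(\d+(?:\.\d+)?):(.+)$", re.DOTALL)


def parse_weighted_query(query: str) -> Tuple[float, str]:
    """Split '2:cat' into (2.0, 'cat'); queries without a prefix get weight 1.0."""
    match = _WEIGHT_PREFIX.match(query)
    if match is None:
        return 1.0, query
    return float(match.group(1)), match.group(2)


print(parse_weighted_query("2:cat"))                            # (2.0, 'cat')
print(parse_weighted_query("0.5:./cat-sleeps-on-a-chair.jpg"))  # (0.5, './cat-sleeps-on-a-chair.jpg')
print(parse_weighted_query("https://example.com/cat.jpg"))      # (1.0, 'https://example.com/cat.jpg')
```

Requiring the prefix to start with a digit keeps URLs and Windows drive letters from being mistaken for weights.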


Here is another idea of filtering. --filter (type,filename,etc)
Filename weighted search.

Ah. I see. You mean to search the actual file name? Right? I am not sure if this makes sense for rclip at the moment, but I have noted your suggestions and will give them some thought ✍️

@FlowDownTheRiver

FlowDownTheRiver commented Dec 30, 2024

I tested the weighted search, and it gives different results when I change the weights, so I assume it works, but only for query-based search. It doesn't include filename matching, so I couldn't get any related results for a filename search (in this case I wrote the filename, or part of it, in the query).
What I had in mind is something like this:
https://chatgpt.com/share/6772f9c7-9c68-8005-b5a7-81152826a356
I asked ChatGPT to write some examples for me to explain it better. Hope this makes sense... @yurijmikhalevich
In short: include the filename in the formula :)
The benefit is that we can get better results by telling the program precisely what to search for.
For example, we may have pictures named after a person and be looking for specific poses, or have a nature photos folder and want to search within that range. Overall, it gives better control.
Maybe "parent folder name" can be an option as well: we point the program to folders whose names match, and then search inside them...
We can run rclip 10 levels up, and in the recursive search, if a folder name matches, the results would be isolated to that specific range...

These are just ideas for better control over the expected results. I assume these features would be very well received by people... I know I asked for a lot, but I hope you consider these changes or improvements.

Long story short, a combination of algorithmic search + AI tagged search is the way to go ^^

@FlowDownTheRiver

FlowDownTheRiver commented Dec 30, 2024

You can do:

rclip "2:cat" + "0.5:./cat-sleeps-on-a-chair.jpg"

To combine multiple (filename and not) queries with different weights.

This is different. Your example here weights two queries: 1) a string query and 2) an image similarity search (which is also a query, since CLIP converts it), and it is very nice to have. What I meant was this: let's say your database is in CSV format. The filename would have its own column, CLIP tags would have their own column, parent directories would have their own column, etc. So if we run, for example:
rclip "red bird" --filename 0.6:"nature", it will give 60% weight to filename matching for names containing the word "nature", and it will also check
for "red bird". Filename matching can be added to formula or it can have its own formula. If the filename query is a single word, it can count as 1.0 (100% probability). For "nature","sea" with no weights defined by the user, each word can get 50%. If defined as 0.8:"nature",0.2:"sea", they get 80% + 20%. So filename matching can be evaluated on its own, like your query system. The master evaluation can then be combined with a formula covering classic queries (positives and negatives) and filename queries (positives and negatives).
An easier way to put it:
1) we get the query probability
2) we get the filename probability
3) we evaluate all of them and give the user the results.
4) optional -> the probability for the parent folder name of the images, via a --foldername argument. Why? We may not rename all the images, but in the real world people rename folders to know what is inside.
This is actually for filtering the database...
This will probably require a new database :) and a lot of work ^^
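
The steps above could be sketched as follows. This only illustrates the proposal and is not something rclip implements; `filename_score`, `combined_score`, and the 0-to-1 scoring are assumptions made for the example.

```python
import os
from typing import Dict


def filename_score(path: str, terms: Dict[str, float]) -> float:
    """Sum the weights of the terms that occur in the file name (case-insensitive)."""
    name = os.path.basename(path).lower()
    return sum(weight for term, weight in terms.items() if term.lower() in name)


def combined_score(clip_score: float, path: str, terms: Dict[str, float],
                   filename_weight: float) -> float:
    """Blend the CLIP score with the filename score; filename_weight is in [0, 1]."""
    return ((1.0 - filename_weight) * clip_score
            + filename_weight * filename_score(path, terms))


# 60% weight on the filename containing "nature", 40% on a CLIP score of 0.5.
score = combined_score(0.5, "photos/nature_red_bird.jpg", {"nature": 1.0}, 0.6)
```

Here the filename matches, so the blended score comes out at roughly 0.8; with `filename_weight` set to 0 the result is just the CLIP score.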

@yurijmikhalevich
Owner Author

yurijmikhalevich commented Dec 31, 2024

@FlowDownTheRiver, I see! Thank you for explaining the idea.

Filename matching can be added to formula or it can have its own formula.
This is actually for filtering the database...

What is the use case of this? Can you give an example, please? I am asking because, in my practice, image filenames are often meaningless.

Do we need this if searching by image content alone yields great results?

@FlowDownTheRiver

@yurijmikhalevich, thank you for considering this suggestion! Allow me to elaborate on the reasoning and potential benefits of incorporating filename matching and folder name filtering into the search algorithm.
Use Case and Examples:

While it’s true that filenames are sometimes meaningless, this is not universally the case. Many users, especially those who organize large datasets manually or semi-automatically, use filenames and folder structures as a way to categorize and locate files. Here are some specific examples:

  1. Photography Collections

    Scenario: A photographer has thousands of images stored in folders like Nature_2024, Events_Wedding_2023, or Portraits_Friends. Filenames often contain context, such as sunset_beach.jpg or wedding_ceremony.jpg.
    Challenge: Searching for a "red bird" might return great image content matches, but if the search engine also considers folder names (Nature_2024) or partial filename matches (redbird_in_forest.jpg), it would drastically improve accuracy by prioritizing matches from relevant collections.

  2. Scientific Research or Archival Work

    Scenario: A researcher has a dataset where filenames like exp2024_sampleA_temp30.png or exp2024_sampleB_temp40.png contain encoded metadata. The parent folder might represent broader categories, such as Experiment_Set1 or Control_Group.
    Challenge: Searching by image content alone won’t capture these organizational details, but leveraging filenames or folder names ensures that the search isolates relevant datasets, reducing noise.

  3. Personal Image Libraries

    Scenario: A user organizes family photos with filenames like 2024_vacation_beach.jpg or birthday_party_2023.jpg. Parent folders like Vacations or Birthdays further group related images.
    Challenge: Searching for "beach" in an AI-based image content system may yield irrelevant results (e.g., photos with blue skies but no beach). Filename and folder matching filters can pinpoint the intended results faster.

Benefits of the Proposed System:

Enhanced Precision with Minimal Effort:
    Leveraging filename and folder name metadata reduces the ambiguity of searches and improves precision. This is particularly valuable when users have specific contextual knowledge about their datasets.

Flexibility and Customization:
    Introducing weights (e.g., 0.6:filename, 0.4:content) allows users to tailor the search to their needs. Those who don’t rely on filenames can assign lower weights, while others can maximize its impact.

Real-World Relevance:
    Many users rely on filenames and folder structures to compensate for the limitations of their current tools. Integrating this into the algorithm acknowledges these habits, making the tool more user-friendly and widely adoptable.

Database Filtering and Scalability:
    Adding folder-based filtering or filename-based scoring doesn’t just enhance accuracy—it also acts as a pre-filter, reducing the computational load by narrowing down the search space before applying content-based AI algorithms.

Why Not Rely Solely on Image Content?

While content-based image searches are powerful, they can occasionally misinterpret context or return overly broad results. Filename and folder metadata act as a complementary layer of filtering that bridges the gap between AI's interpretation and human-organized datasets.

For example:

Searching for "red bird" might yield a parrot, cardinal, or even a robin. If the user knows they’re looking for images from a folder named Nature_2024 or a file named cardinal_spring.jpg, incorporating these metadata increases the likelihood of finding the correct image.

A Balanced Proposal:

Integrating this feature doesn’t need to replace or complicate current functionality. Instead:

Filename matching can be optional and weighted.
Folder name filtering can act as a pre-selection criterion for recursive searches.
A toggle system allows users to choose whether to incorporate filename and folder metadata into their queries.

This hybrid approach (algorithmic + metadata-aware) would cater to diverse user preferences, making the tool both intuitive and robust.

I hope this clarifies the use case and demonstrates the value of such a feature. Thank you for considering these ideas! 🙏

This response provides a clear rationale and examples to highlight why filename matching and folder-based filtering are valuable features.

Note : I asked ChatGPT about my ideas to explain the benefits :)

@yurijmikhalevich
Owner Author

yurijmikhalevich commented Jan 1, 2025

@FlowDownTheRiver, thank you for your response 🙌

Reading this made me think about whether exposing an --include argument opposite to the currently existing --exclude-dir will cover 99% of the described use cases. The user should be able to pass a glob pattern to the --include param that will filter images by filepath (including both filename and dirname) before actually performing the query search.

--include param could also be easier to use compared to the weighted filename matching.
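
A rough sketch of the idea, assuming --include takes a glob pattern matched against the whole file path. The flag does not exist yet, and `filter_by_include` is a hypothetical helper built on Python's `fnmatch` module.

```python
import fnmatch
from typing import Iterable, List, Optional


def filter_by_include(paths: Iterable[str], include: Optional[str]) -> List[str]:
    """Keep only the paths that match the glob; no pattern means keep everything."""
    if include is None:
        return list(paths)
    # fnmatchcase matches the whole string, and "*" also matches path separators,
    # so one pattern can cover both directory names and file names.
    return [p for p in paths if fnmatch.fnmatchcase(p, include)]


paths = ["Nature_2024/redbird_in_forest.jpg", "Events_Wedding_2023/ceremony.jpg"]
print(filter_by_include(paths, "Nature_*/*"))  # ['Nature_2024/redbird_in_forest.jpg']
```

The filter runs before any query scoring, so only the surviving paths would need to be compared against the query.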

What do you think?

Happy New Year! 🎉

@FlowDownTheRiver

FlowDownTheRiver commented Jan 1, 2025

@yurijmikhalevich Happy New Year 🎉
You may be right about it. It simplifies the task a lot, and it may save you from a lot of coding and logic work that could lead to some hair loss by the end of the process :)

Jokes aside, I think this may work.
The structure could look like this:
To ensure flexibility, it might be worth considering how --include could handle filename and folder name matches separately.
--include for general filepath patterns (both folder and filename).
--include-dir specifically for folder names.
--include-file specifically for filenames.

I am not sure how your current search algorithm works, but I think a bit of forgiveness in the search query would help, such as not requiring an exact match and allowing joker (wildcard) characters like the star.
Examples:
All files can be checked with (star).(star)
Partial file or folder names: natur(star).(star) can match "nature", "natural", etc. Alternatively, another parameter could control whether to check for exact matches, so the program can fill in the (star) automatically in the background to simplify the process for end users.
Simply put, the question is whether to require a 100% exact match or allow partial matching!
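
The exact-versus-partial question could be handled by wrapping the user's term in wildcards automatically. A sketch under that assumption is below; `to_glob`, `name_matches`, and the `exact` parameter are hypothetical.

```python
import fnmatch


def to_glob(term: str, exact: bool = False) -> str:
    """Turn a plain term into a glob; partial matching adds wildcards on both sides."""
    if exact or any(ch in term for ch in "*?["):
        # Respect exact mode and patterns the user already wrote by hand.
        return term
    return f"*{term}*"


def name_matches(name: str, term: str, exact: bool = False) -> bool:
    """Case-insensitive match of a file or folder name against the term."""
    return fnmatch.fnmatchcase(name.lower(), to_glob(term.lower(), exact))


print(name_matches("Natural_Parks", "natur"))        # True
print(name_matches("Natural_Parks", "natur", True))  # False
```

Defaulting to partial matching keeps the common case simple, while the exact switch covers users who want strict names.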

Another somewhat related and unrelated improvement can be :
Some options in Everything Search Engine :
[Screenshot: 2025-01-02 00_11_47-Window]

I have been using the voidtools Everything search engine for a very long time. It also builds a database, and both its database builder and its search are extremely fast. It can even show the results and update them live, very quickly, as you type each letter. I am not sure how it builds and searches the database that fast, but if you can find out, maybe your no-indexing search or the display of results can be done in seconds.

This may be outside the scope of your project, but it is also a suggestion or a research topic for making the search faster, if possible. If your project can match its database search speed, it could show instant results with a while loop calling it for each letter typed in a GUI. I don't know how this could be triggered from the command line (if it can't, the user can still search faster from a single line).

references :
https://www.voidtools.com/faq/#what_is_everything
https://www.voidtools.com/forum/viewtopic.php?t=9787#new

My final thoughts are : I am very positive about the file name and folder name with --include , --include-dir , --include-file.
I missed the simplicity and dived into weighting (good idea but complex : ) )

Note: I couldn't escape the star character on GitHub, and it doesn't show here, so I have written (star) here and there. star = the multiplication sign in Python :D

Also, please forgive my never-ending suggestions, but I am very excited about your project. In the past I have contributed to many projects (open source and closed source) on an "idea" basis, as I am not that skilled in coding. Most of those projects ended up much better thanks to skilled developers who were open to new ideas. You, and any other contributors out there, are the ones who will make the magic happen.

@yurijmikhalevich
Owner Author

@FlowDownTheRiver, thank you for sharing your thoughts! I appreciate the suggestions and the excitement! 🙌

All files can be checked with . (star.star)
partial file or folder names natur (naturstar.star) = can be "nature" ,"natural" etc.

This makes sense!

I am not sure how does it do the database so fast and search the database that fast, but if you can find it,maybe your no indexing search or showing the results can be done in seconds.

This can be done for rclip. To make this work, we will have to always run the "rclip daemon" in the background and let rclip talk to it.

The downside is increased RAM consumption due to the daemon (we will have to keep the model loaded in RAM constantly). It's a trade-off between RAM and speed.

Also, not having the daemon keeps rclip very simple.

Currently, I want to reap all the speed improvements I can without introducing the daemon. I have a few ideas, and then I will decide whether to introduce the daemon.

I am very positive about the file name and folder name with --include , --include-dir , --include-file.
I missed the simplicity and dived into weighting (good idea but complex : ) )

Thank you! I am excited to hear that! I am still trying to understand whether making rclip work with find instead is a better idea.

E.g., what if we can let the user do:
find ... | rclip "query" and then rclip will search over files found by find 💭

I guess the benefit of adding --include arguments to rclip is the simplicity of usage. Using it like this with find feels a bit cumbersome.
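
For illustration, reading the candidate files from standard input could look roughly like this. Nothing in this thread says rclip accepts paths on stdin today; `read_paths` is a hypothetical helper.

```python
import io
from typing import List, TextIO


def read_paths(stream: TextIO) -> List[str]:
    """Read one file path per line, as produced by `find`, skipping blank lines."""
    return [line.rstrip("\n") for line in stream if line.strip()]


# In a real CLI the stream would be sys.stdin; a StringIO stands in for it here.
piped = io.StringIO("./Nature_2024/a.jpg\n\n./Nature_2024/b.png\n")
print(read_paths(piped))  # ['./Nature_2024/a.jpg', './Nature_2024/b.png']
```

The returned list would then replace the usual directory walk as the set of files to search.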

Also please forgive my never ending suggestions but I am very excited about your project.

Thank you again! I greatly appreciate your suggestions, and I appreciate your excitement even more! 🙌

@FlowDownTheRiver

@yurijmikhalevich Thanks for the reply. I will just settle and wait now. Take as much time as you need, and work on features whenever you feel like it, no pressure... I really appreciate and respect how you value people's thoughts... Hoping everything goes well...

2 participants