diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml index 73f5e5f..b40bae6 100644 --- a/.github/workflows/python-publish.yml +++ b/.github/workflows/python-publish.yml @@ -15,6 +15,6 @@ jobs: fetch-depth: 0 - run: python3 -m pip install --upgrade build && python3 -m build - name: Publish package - uses: pypa/gh-action-pypi-publish@v1.8.10 + uses: pypa/gh-action-pypi-publish@release/v1 with: - password: ${{ secrets.PYPI_GITHUB_ORGAFOLD }} \ No newline at end of file + password: ${{ secrets.PYPI_GITHUB_ORGAFOLD }} diff --git a/.github/workflows/run-unittest.yml b/.github/workflows/run-unittest.yml index 706762d..1c8cb08 100644 --- a/.github/workflows/run-unittest.yml +++ b/.github/workflows/run-unittest.yml @@ -5,7 +5,7 @@ jobs: runs-on: ubuntu-latest strategy: matrix: - python-version: [3.11, 3.12, 3.13] + python-version: [3.11, 3.12] steps: - uses: actions/checkout@v2 - name: Set up Python ${{ matrix.python-version }} diff --git a/Makefile b/Makefile index 9fbe7a5..5524e42 100644 --- a/Makefile +++ b/Makefile @@ -3,4 +3,4 @@ export TAG := `grep version pyproject.toml | pz --search '"(\d+\.\d+\.\d+(?:rc\d release: git tag $(TAG) git push origin $(TAG) - #mkdocs gh-deploy + mkdocs gh-deploy diff --git a/README.md b/README.md deleted file mode 100644 index 12f48d1..0000000 --- a/README.md +++ /dev/null @@ -1,261 +0,0 @@ -

Deduplidog – Deduplicator that covers your back.

-

- -

- -[![Build Status](https://github.com/CZ-NIC/deduplidog/actions/workflows/run-unittest.yml/badge.svg)](https://github.com/CZ-NIC/deduplidog/actions) - - -- [About](#about) - * [What are the use cases?](#what-are-the-use-cases) - * [What is compared?](#what-is-compared) - * [Why not using standard sync tools like meld?](#why-not-using-standard-sync-tools-like-meld) - * [Doubts?](#doubts) -- [Launch](#launch) -- [Examples](#examples) - * [Duplicated files](#duplicated-files) - * [Names shuffled](#names-shuffled) -- [Documentation](#documentation) - * [Parameters](#parameters) - * [Utils](#utils) - -# About - -## What are the use cases? -* I have downloaded photos and videos from the cloud. Oh, both Google Photos and Youtube *shrink the files* and change their format. Moreover, they shorten the file names to 47 characters and capitalize the extensions. So how am I supposed to know if I have everything backed up offline when the copies are resized? -* My disk is cluttered with several backups and I'd like to be sure these are all just copies. -* I merge data from multiple sources. Some files in the backup might have *the former orignal file modification date* that I might wish to restore. - -## What is compared? - -* The file name. - -Works great when the files keep more or less the same name. (Photos downloaded from Google have its stem shortened to 47 chars but that is enough.) Might ignore case sensitivity. - -* The file date. - -You can impose the same file *mtime*, tolerate few hours (to correct timezone confusion) or ignore the date altogether. - -Note: we ignore smaller than a second differences. - -* The file size, the image hash or the video frame count. - -The file must have the same size. Or take advantage of the media magic under the hood which ignores the file size but compares the image or the video inside. It is great whenever you end up with some files converted to a different format. - -* The contents? - -You may use `checksum=True` to perform CRC32 check. However for byte-to-byte checking, when the file names might differ or you need to check there is no byte corruption, some other tool might be better way, i.e. [jdupes](https://www.jdupes.com/). - -## Why not using standard sync tools like [meld](https://meldmerge.org/)? -These imply the folders have the same structure. Deduplidog is tolerant towards files scattered around. - -## Doubts? - -The program does not write anything to the disk, unless `execute=True` is set. Feel free to launch it just to inspect the recommended actions. Or set `inspect=True` to output bash commands you may launch after thorough examining. - -# Launch - -Install with `pip install deduplidog`. - -It works as a standalone program with all the CLI, TUI and GUI interfaces. Just launch the `deduplidog` command. - -# Examples - -## Media magic confirmation - -Let's compare two folders. - -```bash -deduplidog --work-dir folder1 --original-dir folder2 --media-magic --rename --execute -``` - -By default, `--confirm-one-by-one` is True, causing every change to be manually confirmed before it takes effect. So even though `--execute` is there, no change happen without confirmation. The change that happen is the `--rename`, the file in the `--work-dir` will be prefixed with the `βœ“` character. The `--media-magic` mode considers an image a duplicate if it has the same name and a similar image hash, even if the files are of different sizes. - -![Confirmation](asset/warnings_confirmation_example.avif "Confirmation, including warnings") - -Note that the default button is 'No' as there are some warnings. First, the file in the folder we search for duplicates in is bigger than the one in the original folder. Second, it is also older, suggesting that it might be the actual original. - - -## Duplicated files -Let's take a closer look to a use-case. - -```bash -deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs" --ignore-date --rename -``` - -This command produced the following output: - -``` -Find files by size, ignoring: date, crc32 -Duplicates from the work dir at 'home' would be (if execute were True) renamed (prefixed with βœ“). -Number of originals: 38 -* /home/user/duplicates/foo.txt - /media/disk/origs/foo.txt - πŸ”¨home: renamable - πŸ“„media: DATE WARNING + a day πŸ›Ÿskipped on warning -Affectable: 37/38 -Affected size: 56.9 kB -Warnings: 1 -``` - -We found out all the files in the *duplicates* folder seem to be useless but one. It's date is earlier than the original one. The life buoy icon would prevent any action. To suppress this, let's turn on `set_both_to_older_date`. See with full log. - -```bash -deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs --ignore-date --rename --set-both-to-older-date --log-level=10 -``` - -``` -Find files by size, ignoring: date, crc32 -Duplicates from the work dir at 'home' would be (if execute were True) renamed (prefixed with βœ“). -Original file mtime date might be set backwards to the duplicate file. -Number of originals: 38 -* /home/user/duplicates/foo.txt - /media/disk/origs/foo.txt - πŸ”¨home: renamable - πŸ“„media: redatable 2022-04-28 16:58:56 -> 2020-04-26 16:58:00 -* /home/user/duplicates/bar.txt - /media/disk/origs/bar.txt - πŸ”¨home: renamable -* /home/user/duplicates/third.txt - /media/disk/origs/third.txt - πŸ”¨home: renamable - ... -Affectable: 38/38 -Affected size: 59.9 kB -``` - -You see, the log is at the most brief, yet transparent form. The files to be affected at the work folder are prepended with the πŸ”¨ icon whereas those affected at the original folder uses πŸ“„ icon. We might add `execute=True` parameter to perform the actions. Or use `inspect=True` to inspect. - -```bash -deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs --ignore-date --rename --set-both-to-older-date --inspect -``` - -The `inspect=True` just produces the commands we might subsequently use. - -```bash -touch -t 1524754680.0 /media/disk/origs/foo.txt -mv -n /home/user/duplicates/foo.txt /home/user/duplicates/βœ“foo.txt -mv -n /home/user/duplicates/bar.txt /home/user/duplicates/βœ“bar.txt -mv -n /home/user/duplicates/third.txt /home/user/duplicates/βœ“third.txt -``` - -## Names shuffled - -You face a directory that might contain some images twice. Let's analyze. We turn on `media_magic` so that we find the scaled down images. We `ignore_name` because the scaled images might have been renamed. We `skip_bigger` files as we examine the only folder and every file pair would be matched twice. That way, we declare the original image is the bigger one. And we set `log_level` verbosity so that we get a list of the affected files. - -``` -$ deduplidog --work-dir ~/shuffled/ --media-magic --ignore-name --skip-bigger --log-level=20 -Only files with media suffixes are taken into consideration. Nor the size nor the date is compared. Nor the name! -Duplicates from the work dir at 'shuffled' (only if smaller than the pair file) would be (if execute were True) left intact (because no action is selected, nothing will happen). - -Number of originals: 9 -Caching image hashes: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 9/9 [00:00<00:00, 16.63it/s] -Caching working files: 9it [00:00, 62497.91it/s] -* /home/user/shuffled/IMG_20230802_shrink.jpg - /home/user/shuffled/IMG_20230802.jpg -Affectable: 1/9 -Affected size: 636.4 kB -``` - -We see there si a single duplicated file whose name is `IMG_20230802_shrink.jpg`. - -# Documentation - -## Parameters - -Import the `Deduplidog` class and change its parameters. - -```python3 -from deduplidog import Deduplidog -``` - -Or change these parameter from CLI or TUI, by launching `deduplidog`. - -Find the duplicates. Normally, the file must have the same size, date and name. (Name might be just similar if parameters like strip_end_counter are set.) If `media_magic=True`, media files receive different rules: Neither the size nor the date are compared. See its help. - -| parameter | type | default | description | -|-----------|------|---------|-------------| -| work_dir | str \| Path | - | Folder of the files suspectible to be duplicates. | -| original_dir | str \| Path | - | Folder of the original files. Normally, these files will not be affected.
(However, they might get affected by `treat_bigger_as_original` or `set_both_to_older_date`). | -| **Actions** | -| execute | bool | False | If False, nothing happens, just a safe run is performed. | -| inspect | bool | False | Print bash commands that correspond to the actions that would have been executed if execute were True.
You can check and run them yourself. | -| rename | bool | False | If `execute=True`, prepend βœ“ to the duplicated work file name (or possibly to the original file name if treat_bigger_as_original).
Mutually exclusive with `replace_with_original` and `delete`. | -| delete | bool | False | If `execute=True`, delete theduplicated work file name (or possibly to the original file name if treat_bigger_as_original).
Mutually exclusive with replace_with_original and rename. | -| replace_with_original | bool | False | If `execute=True`, replace duplicated work file with the original (or possibly vice versa if treat_bigger_as_original).
Mutually exclusive with rename and delete. | -| set_both_to_older_date | bool | False | If `execute=True`, `media_magic=True` or (media_magic=False and `ignore_date=True`), both files are set to the older date. Ex: work file get's the original file's date or vice versa. | -| treat_bigger_as_original | bool | False | If `execute=True` and `rename=True` and `media_magic=True`, the original file might be affected (by renaming) if smaller than the work file. | -| skip_bigger | bool | False | If `media_magic=True`, all writing actions, such as `rename`, `replace_with_original`, `set_both_to_older_date` and `treat_bigger_as_original` are executed only if the affectable file is smaller (or the same size) than the other. | -| skip_empty | bool | False | Skip files with zero size. | -| neglect_warning | bool | False | By default, when a file with bigger size or older date should be affected, just warning is generated. Turn this to suppress it.| -| **Matching** | -| casefold | bool | False | Case insensitive file name comparing. | -| checksum | bool | False | If `media_magic=False` and `ignore_size=False`, files will be compared by CRC32 checksum.
(This mode is considerably slower.) | -| tolerate_hour | int \| tuple[int, int] \| bool | False | When comparing files in work_dir and `media_magic=False`, tolerate hour difference.
Sometimes when dealing with FS changes, files might got shifted few hours.
* bool β†’ -1 .. +1
* int β†’ -int .. +int
* tuple β†’ int1 .. int2
Ex: tolerate_hour=2 β†’ work_file.st_mtime -7200 ... + 7200 is compared to the original_file.st_mtime | -| ignore_name | bool | False | Files will not be compared by stem nor suffix. | -| ignore_date | bool | False | If `media_magic=False`, files will not be compared by date. | -| ignore_size | bool | False | If `media_magic=False`, files will not be compared by size. | -| space2char | bool \| str | False | When comparing files in work_dir, consider space as another char. Ex: "file 012.jpg" is compared as "file_012.jpg" | -| strip_end_counter | bool | False | When comparing files in work_dir, strip the counter. Ex: "00034(3).MTS" is compared as "00034.MTS" | -| strip_suffix | str | False | When comparing files in work_dir, strip the file name end matched by a regular. Ex: "001-edited.jpg" is compared as "001.jpg" | -| work_file_stem_shortened | int | None | Photos downloaded from Google have its stem shortened to 47 chars. For the comparing purpose, treat original folder file names shortened. | -| invert_selection | bool | False | Match only those files from work_dir that does not match the criterions. | -| **Media** | -| media_magic | bool | False | Nor the size or date is compared for files with media suffixes.
A video is considered a duplicate if it has the same name and a similar number of frames, even if it has a different extension.
An image is considered a duplicate if it has the same name and a similar image hash, even if the files are of different sizes.
(This mode is considerably slower.) | -| accepted_frame_delta | int | 1 | Used only when media_magic is True | -| accepted_img_hash_diff | int | 1 | Used only when media_magic is True | -| img_compare_date | bool | False | If True and `media_magic=True`, the work file date or the work file EXIF date must match the original file date (has to be no more than an hour around). | -| **Helper** | -| log_level | int | 30 (warning) | 10 debug .. 50 critical | -| output | bool | False | Stores the output log to a file in the current working directory. (Never overwrites an older file.) | - -## Utils - -The library might be invoked from a [Jupyter Notebook](https://jupyter.org/). - -```python3 -from deduplidog import Deduplidog -Deduplidog("/home/user/duplicates", "/media/disk/origs", ignore_date=True, rename=True).start() -``` - -In the `deduplidog.utils` packages, you'll find a several handsome tools to help you. You will find parameters by using your IDE hints. - -### `images` -*`urls: Iterable[str | Path]`* Display a ribbon of images. - -### `print_video_thumbs` -*`src: str | Path`* Displays thumbnails for a video. - -### `print_videos_thumbs` -*`dir_: Path`* To quickly understand the content of each video, output the duration and the first few frames. - -### `get_frame_count` -*`filename: str|Path`* Uses cv2 to determine the video frame count. Method is cached. - -### `search_for_media_wizzard` -*`cwd: str`* Repeatedly prompt and search for files with similar names somewhere in the specified path. Display all such files as images and video previews. - -### `are_contained` -*`work_dir: str, original_dir: str, sec_range: int = 60`* You got two dirs with files having different naming system (427.JPG vs DSC_1344) - which you suspect to contain the same set. The same files in the dirs seem to have the same timestamp. - The same timestamp means +/- sec_range (ex: 1 minute). - Loop all files from work_dir and display corresponding files having the same timestamp. - or warn that no original exists. - -### `remove_prefix_in_workdir` -*`work_dir: str`* Removes the prefix βœ“ recursively from all the files. The prefix might have been previously given by the deduplidog. - - -### `mark_symlink_by_target` -*`suspicious_directory: str | Path, starting_path: str`* If the file is a symlink, pointing to this path, rename it with an arrow. - -``` -:param suspicious_directory: Ex: /media/user/disk/Takeout/Photos/ -:param starting_path: Ex: /media/user/disk -``` - -### `mark_symlink_only_dirs` -*`dir_: str | Path`* If the directory is full of only symlinks or empty, rename it to an arrow. - -### `mtime_files_in_dir_according_to_json` -*`dir_: str | Path, json_dir: str | Path`* Google Photos returns JSON with the photo modification time. Sets the photos from the dir_ to the dates fetched from the directory with these JSONs. diff --git a/README.md b/README.md new file mode 120000 index 0000000..e892330 --- /dev/null +++ b/README.md @@ -0,0 +1 @@ +docs/index.md \ No newline at end of file diff --git a/deduplidog/__main__.py b/deduplidog/__main__.py index cf9674f..162668f 100644 --- a/deduplidog/__main__.py +++ b/deduplidog/__main__.py @@ -6,8 +6,6 @@ def main(): - # NOTE: I'd like to have the default in case work dir is not specified args=("--work-dir", str(Path.cwd())) - # Currently, args overthrows CLI arguments. with run(Deduplidog, interface=None) as m: try: while True: @@ -32,9 +30,9 @@ def main(): except KeyboardInterrupt: print("") sys.exit() - except Exception as e: - import ipdb - ipdb.post_mortem() # TODO + # except Exception as e: + # import ipdb + # ipdb.post_mortem() if __name__ == "__main__": diff --git a/deduplidog/deduplidog.py b/deduplidog/deduplidog.py index f55cd49..29bb85b 100644 --- a/deduplidog/deduplidog.py +++ b/deduplidog/deduplidog.py @@ -37,6 +37,8 @@ @dataclass class Action: + """ What is to be done with the duplicates. """ + execute: bool = False "If False, nothing happens, just a safe run is performed." @@ -63,6 +65,8 @@ class Action: @dataclass class Execution: + """ Parameters affecting the way the execution runs. """ + set_both_to_older_date: bool = False "If `execute=True`, `media_magic=True` or (media_magic=False and `ignore_date=True`), both files are set to the older date. Ex: work file get's the original file's date or vice versa." @@ -80,11 +84,15 @@ class Execution: "By default, when a file with bigger size or older date should be affected, just warning is generated. Turn this to suppress it." confirm_one_by_one: bool = True - """ Instead of executing changes all at once, confirm one by one. So that you may decide whether the media similarity detection works. """ + """ Instead of executing changes all at once, confirm one by one. + So that you may decide whether the media similarity detection works. + If a warning occurs, the default is 'no' to perform the action. """ @dataclass class Match: + """ The way the files are compared. """ + casefold: bool = False "Case insensitive file name comparing." checksum: bool = False @@ -123,6 +131,7 @@ class Match: @dataclass class Media: + """ Media files similarity detection. """ media_magic: bool = False """ Media files similarity detection. @@ -146,6 +155,8 @@ class Media: @dataclass class Helper: + """ Helper settings. """ + log_level: int = logging.WARNING "10 debug .. 50 critical" # NOTE check diff --git a/docs/Changelog.md b/docs/Changelog.md new file mode 100644 index 0000000..902b19f --- /dev/null +++ b/docs/Changelog.md @@ -0,0 +1,4 @@ +# Changelog + +## 0.7.0 (2025-01-09) +Working as expected. \ No newline at end of file diff --git a/docs/Overview.md b/docs/Overview.md new file mode 100644 index 0000000..c5dc3f2 --- /dev/null +++ b/docs/Overview.md @@ -0,0 +1,22 @@ +## Deduplidog overview + +Launch `deduplidog` and change these parameter from GUI/TUI, or set them via CLI. + +::: deduplidog.Deduplidog + options: + members: + - work_dir + - original_dir + show_aliases: false + show_source: false + +# Action +::: deduplidog.deduplidog.Action +# Execution +::: deduplidog.deduplidog.Execution +# Match +::: deduplidog.deduplidog.Match +# Media +::: deduplidog.deduplidog.Media +# Helper +::: deduplidog.deduplidog.Helper diff --git a/docs/Utils.md b/docs/Utils.md new file mode 100644 index 0000000..d67cc5f --- /dev/null +++ b/docs/Utils.md @@ -0,0 +1,50 @@ +## Utils + +The library might be invoked from a [Jupyter Notebook](https://jupyter.org/). + +```python3 +from deduplidog import Deduplidog +Deduplidog("/home/user/duplicates", "/media/disk/origs", ignore_date=True, rename=True).start() +``` + +In the `deduplidog.utils` packages, you'll find a several handsome tools to help you. You will find parameters by using your IDE hints. + +### `images` +*`urls: Iterable[str | Path]`* Display a ribbon of images. + +### `print_video_thumbs` +*`src: str | Path`* Displays thumbnails for a video. + +### `print_videos_thumbs` +*`dir_: Path`* To quickly understand the content of each video, output the duration and the first few frames. + +### `get_frame_count` +*`filename: str|Path`* Uses cv2 to determine the video frame count. Method is cached. + +### `search_for_media_wizzard` +*`cwd: str`* Repeatedly prompt and search for files with similar names somewhere in the specified path. Display all such files as images and video previews. + +### `are_contained` +*`work_dir: str, original_dir: str, sec_range: int = 60`* You got two dirs with files having different naming system (427.JPG vs DSC_1344) + which you suspect to contain the same set. The same files in the dirs seem to have the same timestamp. + The same timestamp means +/- sec_range (ex: 1 minute). + Loop all files from work_dir and display corresponding files having the same timestamp. + or warn that no original exists. + +### `remove_prefix_in_workdir` +*`work_dir: str`* Removes the prefix βœ“ recursively from all the files. The prefix might have been previously given by the deduplidog. + + +### `mark_symlink_by_target` +*`suspicious_directory: str | Path, starting_path: str`* If the file is a symlink, pointing to this path, rename it with an arrow. + +``` +:param suspicious_directory: Ex: /media/user/disk/Takeout/Photos/ +:param starting_path: Ex: /media/user/disk +``` + +### `mark_symlink_only_dirs` +*`dir_: str | Path`* If the directory is full of only symlinks or empty, rename it to an arrow. + +### `mtime_files_in_dir_according_to_json` +*`dir_: str | Path, json_dir: str | Path`* Google Photos returns JSON with the photo modification time. Sets the photos from the dir_ to the dates fetched from the directory with these JSONs. diff --git a/docs/asset b/docs/asset new file mode 120000 index 0000000..c5d74dc --- /dev/null +++ b/docs/asset @@ -0,0 +1 @@ +../asset \ No newline at end of file diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..b6feba7 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,162 @@ +

Deduplidog – Deduplicator that covers your back.

+

+ +

+ +[![Build Status](https://github.com/CZ-NIC/deduplidog/actions/workflows/run-unittest.yml/badge.svg)](https://github.com/CZ-NIC/deduplidog/actions) + + +- [About](#about) + * [What are the use cases?](#what-are-the-use-cases) + * [What is compared?](#what-is-compared) + * [Why not using standard sync tools like meld?](#why-not-using-standard-sync-tools-like-meld) + * [Doubts?](#doubts) +- [Launch](#launch) +- [Examples](#examples) + * [Duplicated files](#duplicated-files) + * [Names shuffled](#names-shuffled) +- [Documentation](#documentation) + +# About + +## What are the use cases? +* I have downloaded photos and videos from the cloud. Oh, both Google Photos and Youtube *shrink the files* and change their format. Moreover, they shorten the file names to 47 characters and capitalize the extensions. So how am I supposed to know if I have everything backed up offline when the copies are resized? +* My disk is cluttered with several backups and I'd like to be sure these are all just copies. +* I merge data from multiple sources. Some files in the backup might have *the former orignal file modification date* that I might wish to restore. + +## What is compared? + +* The file name. + +Works great when the files keep more or less the same name. (Photos downloaded from Google have its stem shortened to 47 chars but that is enough.) Might ignore case sensitivity. + +* The file date. + +You can impose the same file *mtime*, tolerate few hours (to correct timezone confusion) or ignore the date altogether. + +Note: we ignore smaller than a second differences. + +* The file size, the image hash or the video frame count. + +The file must have the same size. Or take advantage of the media magic under the hood which ignores the file size but compares the image or the video inside. It is great whenever you end up with some files converted to a different format. + +* The contents? + +You may use `checksum=True` to perform CRC32 check. However for byte-to-byte checking, when the file names might differ or you need to check there is no byte corruption, some other tool might be better way, i.e. [jdupes](https://www.jdupes.com/). + +## Why not using standard sync tools like [meld](https://meldmerge.org/)? +These imply the folders have the same structure. Deduplidog is tolerant towards files scattered around. + +## Doubts? + +The program does not write anything to the disk, unless `execute=True` is set. Feel free to launch it just to inspect the recommended actions. Or set `inspect=True` to output bash commands you may launch after thorough examining. + +# Launch + +Install with `pip install deduplidog`. + +It works as a standalone program with all the CLI, TUI and GUI interfaces. Just launch the `deduplidog` command. + +# Examples + +## Media magic confirmation + +Let's compare two folders. + +```bash +deduplidog --work-dir folder1 --original-dir folder2 --media-magic --rename --execute +``` + +By default, `--confirm-one-by-one` is True, causing every change to be manually confirmed before it takes effect. So even though `--execute` is there, no change happen without confirmation. The change that happen is the `--rename`, the file in the `--work-dir` will be prefixed with the `βœ“` character. The `--media-magic` mode considers an image a duplicate if it has the same name and a similar image hash, even if the files are of different sizes. + +![Confirmation](asset/warnings_confirmation_example.avif "Confirmation, including warnings") + +Note that the default button is 'No' as there are some warnings. First, the file in the folder we search for duplicates in is bigger than the one in the original folder. Second, it is also older, suggesting that it might be the actual original. + + +## Duplicated files +Let's take a closer look to a use-case. + +```bash +deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs" --ignore-date --rename +``` + +This command produced the following output: + +``` +Find files by size, ignoring: date, crc32 +Duplicates from the work dir at 'home' would be (if execute were True) renamed (prefixed with βœ“). +Number of originals: 38 +* /home/user/duplicates/foo.txt + /media/disk/origs/foo.txt + πŸ”¨home: renamable + πŸ“„media: DATE WARNING + a day πŸ›Ÿskipped on warning +Affectable: 37/38 +Affected size: 56.9 kB +Warnings: 1 +``` + +We found out all the files in the *duplicates* folder seem to be useless but one. It's date is earlier than the original one. The life buoy icon would prevent any action. To suppress this, let's turn on `set_both_to_older_date`. See with full log. + +```bash +deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs --ignore-date --rename --set-both-to-older-date --log-level=10 +``` + +``` +Find files by size, ignoring: date, crc32 +Duplicates from the work dir at 'home' would be (if execute were True) renamed (prefixed with βœ“). +Original file mtime date might be set backwards to the duplicate file. +Number of originals: 38 +* /home/user/duplicates/foo.txt + /media/disk/origs/foo.txt + πŸ”¨home: renamable + πŸ“„media: redatable 2022-04-28 16:58:56 -> 2020-04-26 16:58:00 +* /home/user/duplicates/bar.txt + /media/disk/origs/bar.txt + πŸ”¨home: renamable +* /home/user/duplicates/third.txt + /media/disk/origs/third.txt + πŸ”¨home: renamable + ... +Affectable: 38/38 +Affected size: 59.9 kB +``` + +You see, the log is at the most brief, yet transparent form. The files to be affected at the work folder are prepended with the πŸ”¨ icon whereas those affected at the original folder uses πŸ“„ icon. We might add `execute=True` parameter to perform the actions. Or use `inspect=True` to inspect. + +```bash +deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs --ignore-date --rename --set-both-to-older-date --inspect +``` + +The `inspect=True` just produces the commands we might subsequently use. + +```bash +touch -t 1524754680.0 /media/disk/origs/foo.txt +mv -n /home/user/duplicates/foo.txt /home/user/duplicates/βœ“foo.txt +mv -n /home/user/duplicates/bar.txt /home/user/duplicates/βœ“bar.txt +mv -n /home/user/duplicates/third.txt /home/user/duplicates/βœ“third.txt +``` + +## Names shuffled + +You face a directory that might contain some images twice. Let's analyze. We turn on `media_magic` so that we find the scaled down images. We `ignore_name` because the scaled images might have been renamed. We `skip_bigger` files as we examine the only folder and every file pair would be matched twice. That way, we declare the original image is the bigger one. And we set `log_level` verbosity so that we get a list of the affected files. + +``` +$ deduplidog --work-dir ~/shuffled/ --media-magic --ignore-name --skip-bigger --log-level=20 +Only files with media suffixes are taken into consideration. Nor the size nor the date is compared. Nor the name! +Duplicates from the work dir at 'shuffled' (only if smaller than the pair file) would be (if execute were True) left intact (because no action is selected, nothing will happen). + +Number of originals: 9 +Caching image hashes: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 9/9 [00:00<00:00, 16.63it/s] +Caching working files: 9it [00:00, 62497.91it/s] +* /home/user/shuffled/IMG_20230802_shrink.jpg + /home/user/shuffled/IMG_20230802.jpg +Affectable: 1/9 +Affected size: 636.4 kB +``` + +We see there si a single duplicated file whose name is `IMG_20230802_shrink.jpg`. + +# Documentation + +See the docs overview at [https://cz-nic.github.io/deduplidog/](https://cz-nic.github.io/deduplidog/Overview/). \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..c8b35f9 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,47 @@ +site_name: "Deduplidog" +repo_name: CZ-NIC/deduplidog +repo_url: https://github.com/CZ-NIC/deduplidog +docs_dir: docs + +theme: + name: "material" + features: navigation.expand + +plugins: + - search + - mermaid2 + - mkdocstrings: + handlers: + python: + options: + docstring_style: google + docstring_section_style: table + show_source: false + members_order: source + show_bases: false + show_labels: false + +markdown_extensions: + - pymdownx.highlight: + anchor_linenums: true + use_pygments: true # highlight code during build time, not in javascript + linenums: false # enable line numbering + linenums_style: pymdownx-inline # how lines are numbered + - pymdownx.inlinehilite + - pymdownx.snippets + - admonition + - pymdownx.details + - pymdownx.superfences: + custom_fences: # make exceptions to highlighting of code: + - name: mermaid + class: mermaid + format: !!python/name:mermaid2.fence_mermaid_custom + - pymdownx.keys # keyboard keys + - tables + - footnotes + +nav: + - Introduction: index.md + - Overview.md + - Utils.md + - Changelog.md