git annex find --not --in web #22

Open
joeyh opened this issue Jan 16, 2014 · 11 comments

Comments

@joeyh
Collaborator

joeyh commented Jan 16, 2014

Finds a few files. This would be a good regression test for this repo.

@RichiH
Owner

RichiH commented Jan 19, 2014

I am somewhat unclear on how this works... Does it read the remote webpages and tell me what else is available? How else does it get at that data? And as git-annex apparently knows new files exist: what is the canonical way to just add that data to the annex?

@joeyh
Collaborator Author

joeyh commented Jan 19, 2014

Richard Hartmann wrote:

> I am somewhat unclear on how this works... Does it read the remote webpages and
> tell me what else is available? How else does it get at that data? And as
> git-annex apparently knows new files exist: what is the canonical way to just
> add that data to the annex?

No, this is just finding files in the annex that do not have a url
recorded, so git annex get is not going to be able to get them when
someone clones this repository.
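
For anyone wanting to turn this into the regression test suggested at the top of this issue, a minimal sketch could look like the following. The function name `check_all_in_web` is made up for illustration; the only thing taken from this thread is the `git annex find --not --in web` command, and the sketch assumes it is run from inside a clone of the repository.

```shell
# Hypothetical helper (name is an assumption, not part of git-annex):
# fail if any annexed file has no URL recorded on the web remote.
check_all_in_web() {
    # `git annex find --not --in web` prints one matching path per line.
    missing=$(git annex find --not --in web)
    if [ -n "$missing" ]; then
        printf 'Files without a recorded URL:\n%s\n' "$missing" >&2
        return 1
    fi
    return 0
}
```

Calling `check_all_in_web` from a pre-merge hook or CI job would then flag any file that a fresh clone could not fetch with git annex get.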

see shy jo

@Millak
Contributor

Millak commented Jan 20, 2014

I ran this on my repository and got:
efraim@debian-netbook:~/conference_proceedings$ git annex find --not --in web
FOSDEM/2007/md5sum.txt
FOSDEM/2008/devrooms/debian/LICENCE
FOSDEM/2008/devrooms/debian/MIRROR_ON_YOUR_RISK___READ_THE_README
FOSDEM/2008/devrooms/debian/ogg_theora/720x576/MD5SUMS.txt
FOSDEM/2008/devrooms/opensuse/MD5SUMS
FOSDEM/2008/devrooms/opensuse/README
FOSDEM/2008/devrooms/xorg/README
These are all files that should be unannexed.

Then I checked out upstream/master and found 2 more files:
Linux_Conference_Australia/2014/Monday/lca2014_monday_keynote.mp4
Linux_Conference_Australia/2014/Tuesday/lca2014_tuesday_keynote.mp4
Neither of these files is in my repo, but judging by the other files they were probably renamed on the LCA side.

while 'git annex find --not --in web' will find files with no web remote, the only thing I can think of that would make sure that the web remote actually contained the file would be something like 'git annex fsck --from web', but currently we're well over 500GB. Is there some way to ping the files to check if there are actually files on the other end that match the filesize (at least for those added with --fast and not --relaxed) without downloading the whole file?

@joeyh
Collaborator Author

joeyh commented Jan 20, 2014

Efraim Flashner wrote:

> while 'git annex find --not --in web' will find files with no web remote, the
> only thing I can think of that would make sure that the web remote actually
> contained the file would be something like 'git annex fsck --from web', but
> currently we're well over 500GB. Is there some way to ping the files to check
> if there are actually files on the other end that match the filesize (at least
> for those added with --fast and not --relaxed) without downloading the whole
> file?

git annex fsck --fast --from web should do that, checking only that
the web has the files and that they're of the expected size.
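
As a sketch only, that command could be wrapped so it can be limited to part of the tree. The wrapper name `fsck_web_fast` is hypothetical; it simply passes any extra arguments (for example a directory) through to git-annex and assumes it is run inside a clone of this repository.

```shell
# Hypothetical wrapper (name is an assumption): check the web remote
# without downloading file content.
fsck_web_fast() {
    # Optional path arguments restrict the check to those files.
    git annex fsck --fast --from web "$@"
}
```

For example, `fsck_web_fast FOSDEM/2008` would check only that year's files.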

Note that it might be a good idea for this repository to always use
git annex addurl --fast; this generates keys that have no defined size,
so if a video is later edited or re-encoded, git-annex won't care.

see shy jo

@RichiH
Owner

RichiH commented Feb 19, 2016

@joeyh Your reply seems to be empty. Can you resend?

Ideally, there would be an "open" get which gets all content and offers to change everything that's changed. I.e. a global web remote update function. Could you reasonably implement that?

@RichiH
Owner

RichiH commented Feb 29, 2016

@joeyh: @clacke sent in two PRs.

@clacke
Contributor

clacke commented Feb 29, 2016

I don't think my two PRs are related to this, but I'll chime in and say that find --not --in web is great for discovering that somebody merged master without also merging (the right) git-annex. That's probably the only situation this catches, but it's an important one.

@clacke
Contributor

clacke commented Feb 29, 2016

I have been trying to use fsck --fast --from web, but ran into more trouble than it's been worth. First of all, never run it on the repo that has the hack that uses the web uuid as uuid. :-)

But I think it has surprising behaviors even when running on a "normal" repo. For FOSDEM it always returns false even though the files are there, not sure why. Maybe it's the redirect? (videos.fosdem.org redirects in a round-robin fashion to one of 8-9 mirrors)

Maybe fsck --from web works better, but then you are looking at gigabytes of downloads.

EDIT: And now I realized I've basically restated what @Millak said two years ago. I'll let it stay, because I think I said it slightly differently, which may help. :-)

@clacke
Contributor

clacke commented Feb 29, 2016

Possibly find --not --in web also catches when somebody has accidentally made git-annex think that all files disappeared from upstream, by running fsck --fast --from web.

@joeyh
Collaborator Author

joeyh commented Feb 29, 2016

Claes Wallin (韋嘉誠) wrote:

> But I think it has surprising behaviors even when running on a "normal" repo.
> For FOSDEM it always returns false even though the files are there, not sure
> why. Maybe it's the redirect? (videos.fosdem.org redirects in a round-robin
> fashion to one of 8-9 mirrors)

fsck --fast --from web notices when urls are no longer accessible (after
following redirects), or when the size of the content at the url differs
from the recorded size of the file. addurl --relaxed avoids the latter
check.
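
To debug a single URL by hand, the same two conditions can be approximated with curl. This is only a rough sketch and not how git-annex implements the check: the function names `final_content_length` and `url_size_matches` are invented here, and it assumes the server answers HEAD requests with a Content-Length header.

```shell
# Hypothetical helper: read HTTP response headers on stdin (possibly
# several responses, one per redirect hop) and print the Content-Length
# of the last response, if it has one.
final_content_length() {
    awk -F': *' '
        /^HTTP\//                        { len = "" }   # new hop: forget earlier sizes
        tolower($1) == "content-length"  { gsub(/\r/, "", $2); len = $2 }
        END                              { if (len != "") print len }
    '
}

# Hypothetical helper: succeed only if the URL is reachable after
# following redirects and reports exactly the expected size in bytes.
url_size_matches() {
    url=$1
    expected=$2
    # -s: quiet, -I: HEAD request only, -L: follow redirects.
    actual=$(curl -sIL "$url" | final_content_length)
    [ -n "$actual" ] && [ "$actual" -eq "$expected" ]
}
```

Running `url_size_matches "$url" "$size"` against one of the FOSDEM URLs, with the size recorded in the key, should show whether the mirror behind the redirect reports a different length.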

see shy jo

@clacke
Contributor

clacke commented Apr 6, 2016

Does that mean I should ask for a feature fsck --relaxed?

(I think those URLs were relaxed, but I can't be sure now)
