How to file ICLAs #125

sebbASF · 2021-08-10T22:24:48Z

ICLAs are currently filed using a filename stem based on the full name.

This approach increasingly suffers from collisions; it also has the potential to expose PII.

We need to find a different identifier where the likelihood of collisions is very small.
Unfortunately humans don't have a unique immutable identifier, at least not one which is likely to be accessible to us.
So some other ID needs to be found.

Some possibilities to consider:

UUID generated at time of filing
hash of the ICLA file
email address (canonicalised if necessary to generate a valid file name stem)

Assuming two people don't share the same email address, all the above should be collision-free for distinct people.
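
For comparison, here is a minimal sketch of how each of the three candidate stems might be generated. The function names, the choice of SHA-256 and the email canonicalisation rule are all assumptions made for illustration; none of them has been agreed.

```python
import hashlib
import re
import uuid


def uuid_stem() -> str:
    """Random identifier generated at filing time (UUID version 4)."""
    return uuid.uuid4().hex


def hash_stem(icla_bytes: bytes) -> str:
    """Hex digest of the ICLA file contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(icla_bytes).hexdigest()


def email_stem(address: str) -> str:
    """Canonicalise an email address into a file-name-safe stem.

    Hypothetical rule: lower-case the address, then replace every character
    outside [a-z0-9._-] with '_'. The result still exposes the address.
    """
    return re.sub(r"[^a-z0-9._-]", "_", address.strip().lower())
```

Note that this simple email rule is not strictly collision-free: `a+b@x.org` and `a_b@x.org` would map to the same stem, so a real canonicalisation would need an escaping scheme.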

Any others?

Note that whilst the above ids will uniquely identify an ICLA, additional ICLAs from the same person will generally have different ids (email may or may not be the same). This is also true of the full name used in ICLAs: apart from ICLAs which are sent to record a change of name, we sometimes get ICLAs with a different spelling of a name, or with changes to the given names.

The current approach for replacement ICLAs is to create a directory and store all the ICLAs in the same directory.
If email address is used, something similar would be needed.
For the other IDs, the list of files in a directory would need to be replaced with a list of IDs in the index.

sebbASF · 2021-08-12T11:42:16Z

The advantage of the file hash is that it automatically detects exact duplicates, and collisions are vanishingly rare with the appropriate choice of hash function.

It also does not expose any PII, though it would be best not to publish the hash more widely than necessary.

I think it would be worth exploring how to proceed on that basis, so here goes:

The ICLA should be filed as hash.pdf, with hash.asc alongside if necessary.
No need for subdirectories.
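
A rough sketch of what filing under that convention could look like, assuming SHA-256 as the hash and a flat directory. The function name `file_icla` and its parameters are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_icla(pdf_bytes: bytes, store: Path,
              sig_bytes: Optional[bytes] = None) -> Path:
    """Store an ICLA as <hash>.pdf, with <hash>.asc alongside if a
    detached signature was supplied. Returns the path of the PDF."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()  # assumed hash function
    target = store / f"{digest}.pdf"
    if target.exists():
        # Same bytes were filed before: an exact duplicate, nothing to write.
        return target
    target.write_bytes(pdf_bytes)
    if sig_bytes is not None:
        (store / f"{digest}.asc").write_bytes(sig_bytes)
    return target
```

Filing the same bytes twice returns the existing path rather than creating a second file, which is the duplicate detection mentioned above.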

There needs to be an index to relate the hash to the person.
This index could be maintained in the same directory as the files.
[If maintained elsewhere, it needs to have the same protections.]

This index needs to contain at least Full Name, Public Name, email, asfId and project.
It would be useful to add signing date to show which ICLA is the most recent for an individual.
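
As an illustration only, an index entry with those fields might be modelled as below, keyed by the file hash. The class name, field names and the `most_recent` helper are assumptions, not an agreed layout.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass
class IndexEntry:
    full_name: str
    public_name: str
    email: str
    asf_id: Optional[str]          # None if no id has been assigned yet
    project: Optional[str]
    signed: Optional[date] = None  # signing date, if recorded


def most_recent(index: Dict[str, IndexEntry], asf_id: str) -> Optional[str]:
    """Return the hash of the latest dated ICLA for the given asf_id."""
    dated = [(entry.signed, file_hash)
             for file_hash, entry in index.items()
             if entry.asf_id == asf_id and entry.signed is not None]
    return max(dated)[1] if dated else None
```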

There may need to be a separate extract containing the Public fields plus the hash for use by appropriately authorized groups.
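
One possible shape for such an extract, assuming index records are plain dictionaries. Which fields count as public is not settled here; the `PUBLIC_FIELDS` tuple below is a guess for illustration.

```python
from typing import Dict

# Assumption: these are the fields considered public.
PUBLIC_FIELDS = ("public_name", "asf_id", "project")


def public_extract(index: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Copy only the public fields of each record, still keyed by hash."""
    return {
        file_hash: {k: record[k] for k in PUBLIC_FIELDS if k in record}
        for file_hash, record in index.items()
    }
```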

The index also needs to have a means of linking related ICLAs together.
(This is currently done by using folders with multiple files)
One way to do this would be to have forward and backward links between the entries.
i.e. when a replacement ICLA is filed, the existing ICLA entry is updated to point to the new one (e.g. Replaced by:) and the replacement entry is updated to point to the previous one (e.g. Replaces:).

There would be no need to update the hash associated with an availid, as the chain could be followed easily if necessary (it won't be long).
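
The linking and chain-following could be sketched roughly as follows. The key names `replaced_by` and `replaces` mirror the labels suggested above; the function names and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional

Index = Dict[str, Dict[str, Optional[str]]]


def record_replacement(index: Index, old_hash: str, new_hash: str) -> None:
    """Add the forward and backward links between two index entries."""
    index[old_hash]["replaced_by"] = new_hash
    index[new_hash]["replaces"] = old_hash


def latest(index: Index, start_hash: str) -> str:
    """Follow 'replaced_by' links from start_hash to the newest ICLA."""
    seen = {start_hash}
    current = start_hash
    while True:
        nxt = index[current].get("replaced_by")
        if nxt is None:
            return current
        if nxt in seen:
            # Guard against a corrupted index that links in a loop.
            raise ValueError("cycle in replacement chain")
        seen.add(nxt)
        current = nxt
```

With this arrangement the hash recorded against an availid can stay pointing at the first ICLA, and `latest` resolves the current one on demand.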

On the face of it, this appears to be more complicated than the current filename-based solution, but that quickly becomes complicated (and not easily automated) when name duplicates occur.

sebbASF · 2021-08-12T13:40:18Z

There are some issues to be sorted out:

how to handle ICLAs consisting of multiple parts (e.g. JPEGs)
how to handle ICLAs which need to be rotated: hash before or after rotation (or even store both?)

The hash naming convention can co-exist if necessary with the current files.
Existing filename stems all contain at least one hyphen, and are very unlikely to have the same format as the hash.
However I would expect all the files to be migrated eventually to the new convention.
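
During a transition, the two conventions could be told apart by the shape of the stem alone. This sketch assumes the hash is a lower-case SHA-256 hex digest (64 characters); the function name is made up for the example.

```python
import re

# Assumption: hashes are lower-case SHA-256 hex digests.
HASH_STEM = re.compile(r"[0-9a-f]{64}")


def is_hash_stem(stem: str) -> bool:
    """True if the stem looks like a hash rather than a name-based stem.

    Name-based stems contain at least one hyphen, which can never
    appear in a hex digest, so the two forms cannot be confused.
    """
    return HASH_STEM.fullmatch(stem) is not None
```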

clr-apache · 2021-08-12T14:25:56Z

> There are some issues to be sorted out:
> how to handle ICLAs consisting of multiple parts (e.g. JPEGs)

Current practice is to convert multiple bitmap formats into a single pdf before filing. Some historic entries can be converted now.

> how to handle ICLAs which need to be rotated: hash before or after rotation (or even store both?)

What's important is the content which is a pdf representation of the document. I don't think it matters what the original format was, e.g. multiple pages of a single document, jpg or gif single page or multiple page. Whatever the format as received in email, the pdf version (rotated to be human readable if necessary) is what we file, and we don't need to keep a historic record in the iclas/ directory of the original.

Craig


sebbASF · 2021-08-12T15:00:22Z

The reason for wanting to keep the original hash is to detect exact duplicates. When content is adjusted, it may not get the same hash.
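
To make the point concrete, here is one way the "store both" option could work: keep a lookup from the hash of the file as received to the hash of the file as filed. This is a sketch under assumed names (`register`, `by_original`) and an assumed SHA-256 hash, not a proposal for the actual layout.

```python
import hashlib
from typing import Dict, Tuple


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def register(original: bytes, filed: bytes,
             by_original: Dict[str, str]) -> Tuple[str, bool]:
    """Return (hash of the filed version, whether it was a duplicate).

    'original' is the attachment as received; 'filed' is the version
    after any rotation or conversion. Duplicates are detected on the
    original bytes, because adjusted content may hash differently.
    """
    original_hash = sha256_hex(original)
    if original_hash in by_original:
        return by_original[original_hash], True
    filed_hash = sha256_hex(filed)
    by_original[original_hash] = filed_hash
    return filed_hash, False
```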

We don't have to keep the original (it should be in the mail archive).
However this is an exploration of the approach, and I want to make sure that the consequences of not doing so are considered.

dave2wave · 2021-08-12T15:24:08Z

This is an interesting topic and approach. One question to consider is whether there should be a way to submit an ICLA other than the contributor emailing secretary@. Perhaps there could be a web form somewhere that handles the submission of the ICLA. This could be followed by the secretary handling any linkage / "deduplication" through assignment of an Apache ID.

The web form could have the same information as the pdf/odt form only it can actually validate proper project names and apache ids. There would be three cases:

Contributor. This would lead to the current workflow minus the filing of the ICLA which would automatically be done by the user's action.
Committer filling out a new ICLA with access to their id. This is essentially like existing id.a.o and whimsy.a.o changes.
Committer filling out a new ICLA to regain access to their id. There would be a special workflow with the secretary.

A public web form would need some type of rate limiting / queueing to prevent abuse.

I agree that naming the files with a hash protects PII. There would need to be some type of index building/validation which would include the metadata for existing/older ICLA files.

sebbASF · 2021-08-12T22:30:09Z

@dave2wave That is an entirely separate issue from how ICLAs are filed. Please start a separate issue for alternative ways to submit ICLAs.

As to metadata for the existing ICLA files, I was thinking of an index file with primary key of the hash, but equally it could contain older data using the filename stem as the key.
Should be simple to convert iclas.txt into this format once the layout is determined.

sebbASF · 2021-08-15T14:06:12Z

It would probably be worth recording the following email meta-data as well:

email-from
message-id
date

This would make it easier to link back to the original email.
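
Extracting those three values from a raw message is straightforward with the Python standard library. The sketch below is illustrative; the function name and the output keys (taken from the list above) are assumptions.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def email_metadata(raw_message: str) -> Dict[str, Optional[str]]:
    """Pull the From, Message-ID and Date headers out of a raw email."""
    msg = message_from_string(raw_message)
    date_header = msg["Date"]
    return {
        "email-from": msg["From"],
        "message-id": msg["Message-ID"],
        # Normalise the date to ISO 8601 so it sorts and compares easily.
        "date": (parsedate_to_datetime(date_header).isoformat()
                 if date_header else None),
    }
```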

Possibly even consider treating the mail archive as the storage, i.e. not committing the ICLA to SVN.

apache deleted a comment from syaifulnizamiphone7 Oct 23, 2024
