How to file ICLAs #125

Open
sebbASF opened this issue Aug 10, 2021 · 7 comments

Comments

@sebbASF
Contributor

sebbASF commented Aug 10, 2021

ICLAs are currently filed using a filename stem based on the full name.

This approach increasingly suffers from collisions; it also has the potential to expose PII.

We need to find a different identifier where the likelihood of collisions is very small.
Unfortunately humans don't have a unique immutable identifier, at least not one which is likely to be accessible to us.
So some other ID needs to be found.

Some possibilities to consider:

  • UUID generated at time of filing
  • hash of the ICLA file
  • email address (canonicalised if necessary to generate a valid file name stem)

Assuming two people don't share the same email address, all the above should be collision-free for distinct people.
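
A minimal sketch of how each candidate could produce a file name stem. Nothing here is decided in this thread: SHA-256 as the hash, percent-encoding as the canonicalisation rule, and all function names are assumptions for illustration.

```python
import hashlib
import uuid
from urllib.parse import quote


def uuid_stem() -> str:
    """Random UUID generated at time of filing (32 hex characters)."""
    return uuid.uuid4().hex


def hash_stem(icla_bytes: bytes) -> str:
    """Hash of the ICLA file contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(icla_bytes).hexdigest()


def email_stem(email: str) -> str:
    """Canonicalise an email address into a valid file name stem.

    Assumed rule: trim, lower-case, then percent-encode everything that
    is not a letter, digit, '_', '.', '-' or '~'. Percent-encoding keeps
    distinct (lower-cased) addresses distinct.
    """
    return quote(email.strip().lower(), safe="")
```

Percent-encoding is used rather than replacing awkward characters with a single placeholder, because a placeholder can map two different addresses to the same stem.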

Any others?

Note that whilst the above IDs will uniquely identify an ICLA, additional ICLAs from the same person will generally have different IDs (email may or may not be the same). This is also true of the full name used in ICLAs: apart from ICLAs which are sent to record a change of name, we sometimes get ICLAs with a different spelling of a name, or with changes to the given names.

The current approach for replacement ICLAs is to create a directory and store all the ICLAs in the same directory.
If email address is used, something similar would be needed.
For the other IDs, the list of files in a directory would need to be replaced with a list of IDs in the index.

@sebbASF
Contributor Author

sebbASF commented Aug 12, 2021

The advantage of the file hash is that it automatically detects exact duplicates, and collisions are vanishingly rare with the appropriate choice of hash function.

It also does not expose any PII, though it would be best not to publish the hash more widely than necessary.

I think it would be worth exploring how to proceed on that basis, so here goes:

The ICLA should be filed as hash.pdf, with hash.asc alongside if necessary.
No need for subdirectories.
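
As a rough sketch of that filing step, assuming SHA-256 and a flat directory (the function name `file_icla` and its parameters are hypothetical):

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_icla(pdf_bytes: bytes, store: Path,
              signature: Optional[bytes] = None) -> str:
    """File an ICLA as <hash>.pdf, with <hash>.asc alongside if supplied.

    Returns the hash used as the file name stem. If <hash>.pdf already
    exists, the same bytes were filed before and nothing is written.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()  # assumed hash function
    target = store / f"{digest}.pdf"
    if target.exists():
        return digest  # exact duplicate of an earlier filing
    target.write_bytes(pdf_bytes)
    if signature is not None:
        (store / f"{digest}.asc").write_bytes(signature)
    return digest
```

Because the name is derived from the content, filing the same bytes twice lands on the same path, which is how exact duplicates show up without any extra lookup.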

There needs to be an index to relate the hash to the person.
This index could be maintained in the same directory as the files.
[If maintained elsewhere, it needs to have the same protections.]

This index needs to contain at least Full Name, Public Name, email, asfId and project.
It would be useful to add signing date to show which ICLA is the most recent for an individual.
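
One possible shape for an index entry, purely as a sketch: the field names, the use of a dataclass, and the helper `most_recent` are assumptions, and the on-disk layout is still to be determined.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class IndexEntry:
    icla_hash: str           # file name stem of the filed ICLA
    full_name: str
    public_name: str
    email: str
    asf_id: Optional[str]    # None if the person has no account (assumption)
    project: Optional[str]
    signed: date             # signing date


def most_recent(entries: Iterable[IndexEntry],
                asf_id: str) -> Optional[IndexEntry]:
    """Return the latest-signed entry for one asfId, or None if there is none."""
    own = [e for e in entries if e.asf_id == asf_id]
    return max(own, key=lambda e: e.signed, default=None)
```

With the signing date stored, picking the current ICLA for an individual is a single `max` over their entries.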

There may need to be a separate extract containing the Public fields plus the hash for use by appropriately authorized groups.

The index also needs to have a means of linking related ICLAs together.
(This is currently done by using folders with multiple files.)
One way to do this would be to have forward and backward links between the entries:
when a replacement ICLA is filed, the existing entry is updated to point to the new one (e.g. Replaced by:) and the replacement entry is updated to point to the previous one (e.g. Replaces:).

There would be no need to update the hash associated with an availid, as the chain could be followed easily if necessary (it won't be long).
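
The linking and chain-following could look roughly like this. It is a sketch only: the index is modelled as a dict keyed by hash, and the key names `replaced_by` and `replaces` are illustrative stand-ins for the Replaced by: and Replaces: fields.

```python
from typing import Dict


def link_replacement(index: Dict[str, dict], old_hash: str, new_hash: str) -> None:
    """Record that new_hash replaces old_hash, in both directions."""
    index[old_hash]["replaced_by"] = new_hash
    index[new_hash]["replaces"] = old_hash


def latest(index: Dict[str, dict], start_hash: str) -> str:
    """Follow replaced_by links from start_hash to the newest ICLA."""
    seen = set()
    current = start_hash
    while index[current].get("replaced_by"):
        if current in seen:
            raise ValueError(f"cycle in replacement chain at {current}")
        seen.add(current)
        current = index[current]["replaced_by"]
    return current
```

The hash recorded against an availid can stay pointing at the first ICLA; `latest` resolves it to the current one on demand.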

On the face of it, this appears to be more complicated than the current filename-based solution, but the filename-based solution itself quickly becomes complicated (and not easily automated) when name duplicates occur.

@sebbASF
Contributor Author

sebbASF commented Aug 12, 2021

There are some issues to be sorted out:

  • how to handle ICLAs consisting of multiple parts (e.g. JPEGs)
  • how to handle ICLAs which need to be rotated: hash before or after rotation (or even store both?)

The hash naming convention can co-exist if necessary with the current files.
Existing filename stems all contain at least one hyphen, and are very unlikely to have the same format as the hash.
However, I would expect all the files to be migrated eventually to the new convention.
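
While both conventions co-exist, a stem could be classified along these lines. This is a sketch that assumes the hash is written as 64 lower-case hex characters (SHA-256); the function name is hypothetical.

```python
import re

# Assumed hash format: hex-encoded SHA-256, which never contains a hyphen.
HASH_STEM = re.compile(r"[0-9a-f]{64}")


def stem_kind(stem: str) -> str:
    """Classify a file name stem as 'hash', 'legacy' or 'unknown'."""
    if HASH_STEM.fullmatch(stem):
        return "hash"
    if "-" in stem:
        return "legacy"  # existing name-based stems all contain a hyphen
    return "unknown"
```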

@clr-apache
Contributor

clr-apache commented Aug 12, 2021 via email

@sebbASF
Contributor Author

sebbASF commented Aug 12, 2021

The reason for wanting to keep the original hash is to detect exact duplicates. When content is adjusted, it may not get the same hash.
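
To illustrate the point, here is a small sketch in which the hash of the submission as received is remembered separately from the hash of the file as filed. The class and method names are hypothetical, and SHA-256 is an assumed choice.

```python
import hashlib
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class DuplicateCheck:
    """Remembers the hash of each submission as it was received."""

    def __init__(self) -> None:
        # hash of received bytes -> hash of the (possibly adjusted) filed bytes
        self._filed_by_original: Dict[str, str] = {}

    def register(self, received: bytes, filed: bytes) -> Optional[str]:
        """Record a submission.

        Returns the filed hash of an earlier byte-identical submission,
        or None if this one has not been seen before.
        """
        original = sha256_hex(received)
        earlier = self._filed_by_original.get(original)
        if earlier is not None:
            return earlier
        self._filed_by_original[original] = sha256_hex(filed)
        return None
```

If only the hash of the adjusted file were kept, a resend of the same attachment would not match anything until it had been adjusted in exactly the same way.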

We don't have to keep the original (it should be in the mail archive).
However, this is an exploration of the approach, and I want to make sure that the consequences of not doing so are considered.

@dave2wave
Member

This is an interesting topic and approach. One question to consider is whether there could be a way to submit an ICLA other than a contributor emailing secretary@. Perhaps there can be a web form somewhere that can handle the submission of the ICLA. This could be followed by the secretary handling any linkage / "deduplication" through assignment of an Apache ID.

The web form could collect the same information as the PDF/ODT form, but it could also validate project names and Apache IDs. There would be three cases:

  1. Contributor. This would lead to the current workflow minus the filing of the ICLA, which the user's submission would do automatically.
  2. Committer filling out a new ICLA with access to their ID. This is essentially like existing id.a.o and whimsy.a.o changes.
  3. Committer filling out a new ICLA to regain access to their ID. There would be a special workflow with the secretary.

A public web form would need some type of rate limiting / queueing to prevent abuse.

I agree that naming the files with a hash protects PII. There would need to be some type of index building/validation which would include the metadata for existing/older ICLA files.

@sebbASF
Contributor Author

sebbASF commented Aug 12, 2021

@dave2wave That is an entirely separate issue from how ICLAs are filed. Please start a separate issue for alternative ways to submit ICLAs.

As to metadata for the existing ICLA files, I was thinking of an index file whose primary key is the hash, but equally it could contain older data using the filename stem as the key.
It should be simple to convert iclas.txt into this format once the layout is determined.
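
A conversion could be sketched as below. The input layout is an assumption, not something stated in this thread: each line is taken to be five colon-separated fields (id, full name, public name, email, form), with the form field ending in `;<filename stem>`. The output key names are likewise illustrative, since the index layout is not yet determined.

```python
from typing import Dict, Iterable


def parse_legacy_line(line: str) -> Dict[str, str]:
    """Turn one assumed-format iclas.txt line into an index entry."""
    asf_id, full_name, public_name, email, form = line.rstrip("\n").split(":", 4)
    # Assumed: the filename stem follows the last ';' in the form field.
    _, sep, stem = form.rpartition(";")
    return {
        "key": stem if sep else "",
        "key_type": "stem",  # older data is keyed by stem, new data by hash
        "asf_id": asf_id,
        "full_name": full_name,
        "public_name": public_name,
        "email": email,
    }


def build_index(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Build an index keyed by filename stem, skipping blanks and comments."""
    index: Dict[str, Dict[str, str]] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        entry = parse_legacy_line(line)
        index[entry["key"]] = entry
    return index
```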

@sebbASF
Contributor Author

sebbASF commented Aug 15, 2021

It would probably be worth recording the following email meta-data as well:

  • email-from
  • message-id
  • date

This would make it easier to link back to the original email.
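
For example, the three fields could be pulled from the raw message with Python's standard `email` package. This is a sketch: the function name is hypothetical, and the dictionary keys simply mirror the list above.

```python
from email import message_from_bytes
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def email_metadata(raw: bytes) -> Dict[str, Optional[str]]:
    """Extract the From, Message-ID and Date headers from a raw email."""
    msg = message_from_bytes(raw)
    date_header = msg.get("Date")
    return {
        "email-from": msg.get("From"),
        "message-id": msg.get("Message-ID"),
        # Normalise the date to ISO 8601 so that entries sort consistently.
        "date": parsedate_to_datetime(date_header).isoformat() if date_header else None,
    }
```

The Message-ID is the most useful of the three for linking back, since it is intended to identify a single message uniquely.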

Possibly even consider treating the mail archive as the storage, i.e. not committing the ICLA to SVN.

@apache apache deleted a comment from syaifulnizamiphone7 Oct 23, 2024