Architecture notes...
1. User identification and authentication
-----------------------------------------
In requests and sessions, users are identified by StoreUser objects
(and, indirectly, StoreGroup and StoreRealm objects).
A session is authenticated if there is a session cookie with key
"ezidAuthenticatedUser", if the key's associated value is the primary
key of a StoreUser object, and if logins are enabled for that user.
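The three conditions above can be sketched as follows. This is a
minimal illustration, not EZID's code: the User class, its
login_enabled field, and the users_by_pk lookup are hypothetical
stand-ins for StoreUser and the store database. Only the session key
name is taken from these notes.

```python
from dataclasses import dataclass
from typing import Optional

SESSION_KEY = "ezidAuthenticatedUser"  # key name as given in the notes


@dataclass
class User:
    """Hypothetical stand-in for a StoreUser object."""
    pk: int
    username: str
    login_enabled: bool


def authenticated_user(session: dict, users_by_pk: dict) -> Optional[User]:
    """Return the session's user, or None if the session is not authenticated."""
    pk = session.get(SESSION_KEY)
    if pk is None:
        return None  # no entry under the session key
    user = users_by_pk.get(pk)
    if user is None:
        return None  # value is not the primary key of a known user
    if not user.login_enabled:
        return None  # logins are disabled for this user
    return user
```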
The Django admin app is inextricably tied to the Django auth app,
which EZID does not use. To unify these two systems, those users (and
only those users) that have access to the Django admin (currently just
the EZID administrator) have parallel user entries inserted in the
auth app, and when such users authenticate with EZID they are
authenticated with the auth app at the same time.
Anonymously-owned identifiers are identified by having an owner and
group, and owner PID and group PID, all equal to "anonymous". Note
that anonymously-owned identifiers are not stored in the search
database.
In requests, anonymous users are identified using the AnonymousUser
and AnonymousGroup classes, which are not Django model classes and are
not stored in the database, but replicate the StoreUser and StoreGroup
functionality to a sufficient fidelity that they can stand in as
replacements in access and policy computations.
2. Caching
----------
Caching is employed in several places. All in-memory caches are
emptied when EZID is reloaded.
ezid.conf settings
  The settings used in modules are cached by those modules. Loaded
  at module load time. Reloading EZID causes all settings to be
  reloaded except Django and logging settings.
shoulder.py
  Caches shoulder and datacenter objects from the store database;
  the database itself caches the content of the external shoulder
  file. Loaded when shoulders are first referenced. Note that
  shoulders and datacenters are never changed within EZID.
store_group.py
  Caches group objects from the store database. Loads objects on
  demand, as they're referenced. Emptied when EZID is reloaded and
  when groups are modified or deleted.
store_user.py
  Caches user objects from the store database. Analogous to the
  above.
store_profile.py
  Caches profile objects from the store database. Analogous to the
  above, except that profile objects are inserted as referenced by
  clients, and they're never modified or deleted by EZID.
search database
  The search database is not a cache, strictly speaking, but as a
  quasi-clone of the store database it engenders the same kinds of
  issues that caches do. In the search database, profiles and
  datacenters are added as they are encountered (and they are never
  deleted, so it is possible for extraneous entries to remain in the
  database). Users, groups, and realms are kept in sync between the
  two databases.
search_identifier.py
  Caches user, group, datacenter, and profile objects from the
  search database. Loads objects on demand, as they're referenced.
user and group account information
  Cached in agent identifiers for the purposes of storage redundancy
  and locality only. Written only, never read.
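The load-on-demand pattern used by the group, user, and profile caches
can be sketched as follows. This is an illustration of the general
technique only; the class and method names are hypothetical and do not
come from the EZID modules.

```python
import threading


class OnDemandCache:
    """Loads objects on first reference; entries can be dropped or emptied."""

    def __init__(self, load):
        self._load = load              # callable: key -> object (e.g., a DB fetch)
        self._items = {}
        self._lock = threading.Lock()  # the server is assumed to be multithreaded

    def get(self, key):
        with self._lock:
            if key not in self._items:
                self._items[key] = self._load(key)
            return self._items[key]

    def invalidate(self, key):
        """Drop one entry, e.g., after the object is modified or deleted."""
        with self._lock:
            self._items.pop(key, None)

    def clear(self):
        """Empty the cache, e.g., when the server is reloaded."""
        with self._lock:
            self._items.clear()
```

Holding the lock while loading keeps the sketch simple at the cost of
serializing loads; a real cache might choose differently.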
3. Identifier metadata
----------------------
Identifiers are stored internally and in the database as
StoreIdentifier objects, but an older "legacy" representation of
identifiers is still in use in several places, particularly where
serialization is required.
The legacy representation is a dictionary { name: value, ... } of
metadata elements. Names are arbitrary and uncontrolled, but those
beginning with an underscore are reserved for internal use by EZID and
other services. Reserved element names have two forms: a short form
and a longer, more readable form used in communicating with clients.
There are two serialization formats: ANVL, used for communicating with
clients; and a one-line "exchange" format (see functions fromExchange
and toExchange in util.py) used in database dumps.
+--------+-------------+-----------------------------------------------+
| stored | transmitted | |
| name | name | meaning |
+--------+-------------+-----------------------------------------------+
| _o | _owner | The identifier's owner. The owner is stored |
| | | as a persistent identifier (e.g., |
| | | "ark:/13030/foo") but returned as a local |
| | | name (e.g., "ryan"). The owner may also be |
| | | "anonymous". |
| _g | _ownergroup | The identifier's owning group, which is by |
| | | policy always the identifier's owner's |
| | | current group. The group is stored as a |
| | | persistent identifier (e.g., |
| | | "ark:/13030/bar") but returned as a local |
| | | name (e.g., "dryad"). The group may also be |
| | | "anonymous". |
| _c | _created | The time the identifier was created expressed |
| | | as a Unix timestamp, e.g., "1280889190". |
| _u | _updated | The time the identifier was last updated |
| | | expressed as a Unix timestamp, e.g., |
| | | "1280889190". |
| _t | _target | The identifier's target URL, e.g., |
| | | "http://foo.com/bar". (See _t1 below for |
| | | qualifications.) |
| | _shadowedby | Non-ARKs only. The identifier's shadow ARK, |
| | | e.g., "ark:/b5060/foo". Returned only. |
| | | Deprecated. |
| _p | _profile | The identifier's preferred metadata profile, |
| | | e.g., "erc". See module 'metadata' for more |
| | | information on profiles. A profile does not |
| | | place any requirements on what metadata |
| | | elements must be present or restrict what |
| | | metadata elements can be present. By |
| | | convention, the element names of a profile |
| | | are prefixed with the profile name, e.g., |
| | | "erc.who". |
| _is | _status | Identifier status. If present, either |
| | | "reserved" or "unavailable"; if not present, |
| | | effectively has the value "public". If |
| | | "unavailable", a reason may follow separated |
| | | by a pipe character, e.g., "unavailable | |
| | | withdrawn by author". Always returned. |
| _t1 | | If the identifier status is "public", not |
| | | present; otherwise, if the identifier status |
| | | is "reserved" or "unavailable", the target |
| | | URL as set by the client. (In these latter |
| | | cases _t is set to an EZID-defined URL.) Not |
| | | returned. |
| _x | _export | Export control. If present, has the value |
| | | "no"; if not present, effectively has the |
| | | value "yes". Determines if the identifier is |
| | | publicized by exporting it to external |
| | | indexing and harvesting services. Always |
| | | returned. |
| _d | _datacenter | DataCite DOIs only. The datacenter at which |
| | | the identifier is registered, e.g., |
| | | "CDL.DRYAD" (or will be registered, in the |
| | | case of a reserved identifier). |
| _cr | _crossref | Crossref DOIs only. If present, indicates |
| | | that the identifier is registered with |
| | | Crossref (or, in the case of a reserved |
| | | identifier, will be registered), and also |
| | | indicates the status of the registration |
| | | process. Syntactically, has the value "yes" |
| | | followed by a pipe character followed by a |
| | | status message, e.g., "yes | successfully |
| | | registered". |
+--------+-------------+-----------------------------------------------+
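As an illustration of the defaulting and pipe-separated value
conventions in the table, here is a minimal sketch that reads _is, _x,
and _cr from a legacy metadata dictionary. The function names are
hypothetical; only the element names, defaults, and value syntax are
taken from the table.

```python
def split_value(raw):
    """Split 'head | message' into (head, message); message is None if absent."""
    head, sep, message = raw.partition("|")
    return head.strip(), (message.strip() if sep else None)


def status_of(metadata):
    """Return (status, reason); an absent _is means 'public'."""
    raw = metadata.get("_is")
    if raw is None:
        return "public", None
    return split_value(raw)


def is_exported(metadata):
    """An absent _x means 'yes'; if present, the value is 'no'."""
    return metadata.get("_x") != "no"


def crossref_of(metadata):
    """Return (registered, status message); an absent _cr means not a Crossref DOI."""
    raw = metadata.get("_cr")
    if raw is None:
        return False, None
    head, message = split_value(raw)
    return head == "yes", message
```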
4. Agent identifiers
--------------------
"Agents" (users and groups) are internally referred to and stored as
ARK identifiers (e.g., "ark:/99166/foo"), but are externally referred
to by local names (e.g., "dryad"). Identifiers that identify agents
are termed "agent identifiers." Because potentially sensitive account
information is cached in agent identifiers (see above), not only are
agent identifiers not revealed to clients, they are owned by the EZID
administrator and can only be viewed by the EZID administrator.
5. Use of DataCite's active flag
--------------------------------
DataCite's 'active' flag (a DataCite-specific attribute of a DOI)
works as follows. It is true by default, and set to false by
performing an HTTP DELETE on the identifier. Note, though, that a
DELETE may be performed only if the identifier has metadata.
Performing a DELETE on an already deactivated identifier has no
effect. An identifier is (and can only be) reactivated by posting
metadata to it.
A deactivated identifier continues to exist in DataCite, but it is in
many ways deleted: an attempt to view the identifier returns 410 Gone,
and the identifier is removed from every DataCite service, including
the Crossref/DataCite content resolver. It is not entirely deleted,
however, as the identifier continues to exist in the Handle System and
therefore continues to resolve.
Note that the above API behavior has no effect on setting a DOI's
target URL: the target URL may be set whether the identifier is active
or not, and whether it has metadata or not. Starting 2013-01-01
DataCite will disallow a new registration if the identifier has no
metadata. Our understanding is that nothing else about the DataCite
API will change, in particular, that the target URL will continue to
be settable if the identifier is not active. It is unclear at the
time of this writing if the target URL for a legacy identifier lacking
metadata may be set without first uploading metadata.
With this background, EZID's manipulation of the active flag can be
summarized as follows:
event                           actions
------------------------------  -------------------------
_status: public -> unavailable  url=tombstone; DEACTIVATE
_status: unavailable -> public  restore url; ACTIVATE
delete                          url=invalid; DEACTIVATE
_export: yes -> no              DEACTIVATE
_export: no -> yes              ACTIVATE
In the above, _status takes precedence over _export.
There are two differences between an unavailable identifier and a
public-but-not-exported identifier. First, an unavailable
identifier's target URL is overridden with a tombstone URL. Second, a
public-but-not-exported identifier's metadata is still uploaded to
DataCite.
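The precedence rule (_status over _export) can be expressed as a small
function that computes the desired value of the active flag. This is a
sketch under stated assumptions: the function name is hypothetical, and
it covers only the public and unavailable statuses discussed in this
section.

```python
def datacite_active(status, export):
    """Desired DataCite 'active' flag for a registered DOI.

    status: "public" or "unavailable"; export: "yes" or "no".
    """
    if status not in ("public", "unavailable"):
        raise ValueError("status not covered by this sketch: %r" % status)
    if export not in ("yes", "no"):
        raise ValueError("unexpected export value: %r" % export)
    if status == "unavailable":
        return False          # _status takes precedence over _export
    return export == "yes"    # public: active only if exported
```

For example, an unavailable identifier stays deactivated even if its
_export value is changed from "no" to "yes".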
6. Offline scripts
------------------
Offline scripts (dump-store, dashboard, expunge, etc.) import EZID
modules and directly call EZID functions. This generally doesn't
cause problems, with two exceptions. The first is logging: to avoid
appending to and possibly corrupting the running server's transaction
log file, offline scripts use the settings/logging.offline.conf
settings to log to standard error instead. Second, script update
actions may conflict with those of the running server, even though
database locking works across processes, because offline scripts don't
participate in the server's locking mechanism and won't necessarily
interact properly with server background processing daemons. This
explains why, for example, the expunge script performs its update
actions through the EZID API.
See .../SITE_ROOT/PROJECT_ROOT/tools/offline.py for more information.
7. Log file formats
-------------------
There are two slightly different log file formats. The transaction
log written by the running server (by module log.py) stores start,
progress, and end records for every transaction, for both read and
write operations, both successful and not, as well as server error and
server status records. But for space efficiency, historical
transaction logs are converted to a more compact form. The striplog
tool retains only records for transactions that successfully created,
updated, or deleted a non-test identifier. Furthermore, the multiple
records comprising a transaction are collapsed into a single record.
For example, the following two transactions (records have been wrapped
here for clarity):
2014-01-06 20:58:11,383 4ec86f4a775811e3bdd610ddb1cf39e7 BEGIN
  mintIdentifier ark:/13030/c7 gjanee ark:/99166/p92z12p14 cdl
  ark:/99166/p9z60c16v
2014-01-06 20:58:11,715 4ec86f4a775811e3bdd610ddb1cf39e7 END SUCCESS
  ark:/13030/c7b56d41k
2014-01-06 20:58:11,715 4efb1a59775811e3a95e10ddb1cf39e7 BEGIN
  createIdentifier ark:/13030/c7b56d41k gjanee ark:/99166/p92z12p14
  cdl ark:/99166/p9z60c16v erc.what An%20example
2014-01-06 20:58:12,338 4efb1a59775811e3a95e10ddb1cf39e7 PROGRESS
  noid_egg.setElements
2014-01-06 20:58:12,342 4efb1a59775811e3a95e10ddb1cf39e7 PROGRESS
  store.insert
2014-01-06 20:58:12,345 4efb1a59775811e3a95e10ddb1cf39e7 END SUCCESS
get compacted into:
2014-01-06 20:58:12,345 createIdentifier ark:/13030/c7b56d41k gjanee
  ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v erc.what An%20example
Note that record arguments in both types of log files are separated by
single spaces, and thus an empty argument will result in adjacent
spaces.
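The compaction described above can be sketched as follows. This is not
the striplog tool itself, just an illustration of collapsing
BEGIN/PROGRESS/END records into one record per successful transaction.
The function name and the keep_ops parameter are hypothetical; the
operation names for updates and deletes do not appear in these notes,
so only createIdentifier is listed by default, and the test-identifier
filter is omitted. Each input record is assumed to be a single
unwrapped line.

```python
def compact(records, keep_ops=frozenset({"createIdentifier"})):
    """Collapse each successful, retained transaction into a single record."""
    begun = {}   # transaction ID -> [operation, arg, ...] from the BEGIN record
    out = []
    for record in records:
        # Split on single spaces so that empty arguments are preserved.
        fields = record.split(" ")
        date, time, txid, kind = fields[:4]
        rest = fields[4:]
        if kind == "BEGIN":
            begun[txid] = rest
        elif kind == "END":
            begin = begun.pop(txid, None)
            succeeded = rest[:1] == ["SUCCESS"]
            if begin and succeeded and begin[0] in keep_ops:
                # The END record's timestamp, then the BEGIN record's arguments.
                out.append(" ".join([date, time] + begin))
        # PROGRESS and any other record types are dropped.
    return out
```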
8. Database dump formats
------------------------
There are two slightly different database dump formats. A "raw" dump
lists identifiers as stored in the bind database: using internal
labels, with all status-related elements included, but with default
values omitted. Here are two example identifier records (wrapped here
for clarity):
ark:/99999/fk4030wkq _is reserved _p erc
  _o ark:/99166/p92z12p14 _g ark:/99166/p9z60c16v
  _c 1389071897 _u 1389071897
  _t1 http://a.target/
  _t https://ezid.cdlib.org/id/ark:/99999/fk4030wkq
ark:/99999/fk40v8s28x _u 1509662539 _t http://b.target/
  _p erc _o anonymous _g anonymous _c 1509662539
A "normal" dump uses a record representation that is more
human-readable and more easily processed: using external labels, with
internal status-related elements omitted but with default values
explicitly listed, and with agent PIDs converted to names. The same
examples in normal form:
ark:/99999/fk4030wkq _status reserved _profile erc
  _owner gjanee _ownergroup cdl
  _created 1389071897 _updated 1389071897
  _target http://a.target/
ark:/99999/fk40v8s28x _updated 1509662539 _target http://b.target/
  _profile erc _export yes _owner anonymous _ownergroup anonymous
  _created 1509662539 _status public
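The raw-to-normal conversion can be approximated as follows. This is a
sketch only, not the dump tools' code: the function name and the
agent_names lookup are hypothetical, and records are assumed to be
already parsed into dictionaries. Note that the reserved example above
lists no _export element, so the exact rule for listing defaults may
differ from this sketch.

```python
# Stored name -> transmitted name, per the table in section 3.
LABELS = {
    "_o": "_owner", "_g": "_ownergroup", "_c": "_created",
    "_u": "_updated", "_t": "_target", "_p": "_profile",
    "_is": "_status", "_x": "_export", "_d": "_datacenter",
    "_cr": "_crossref",
}


def raw_to_normal(raw, agent_names):
    """Convert a raw record (dict) to normal form.

    agent_names maps agent PIDs to local names (hypothetical lookup).
    """
    record = dict(raw)
    # _t1 holds the client's target URL when the status is not public;
    # use it as the target and drop the internal element.
    if "_t1" in record:
        record["_t"] = record.pop("_t1")
    # Convert agent PIDs to names; "anonymous" passes through unchanged.
    for key in ("_o", "_g"):
        if key in record:
            record[key] = agent_names.get(record[key], record[key])
    normal = {LABELS.get(k, k): v for k, v in record.items()}
    # List default values explicitly.
    normal.setdefault("_status", "public")
    normal.setdefault("_export", "yes")
    return normal
```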
9. Crossref
-----------
Crossref does not provide an 'active' flag like DataCite does, and
this limits our ability to implement identifier status changes. Our
next-best-thing approach is as follows:
- The _crossref element is always set.
- The identifier must be exported.
- When the identifier is made public, it is registered with
  Crossref.
- If the identifier's status is set to unavailable, the identifier
  remains registered with Crossref, but its target URL is set to the
  tombstone URL and the resource title is set to "WITHDRAWN". If
  the identifier is deleted (by the EZID administrator), same thing,
  but the target URL is set to http://datacite.org/invalidDOI.
10. Store vs. search database
-----------------------------
EZID continues to use two databases, though admittedly the rationales
for having two databases (initially concurrency, then implementation
freedom, then performance) have changed over time and the need has
largely abated. The databases may reside in the same physical
database or in separate databases; see the SEARCH_STORE_SAME_DATABASE
Django setting. If the physical databases are separate, their
architectures may be the same or different. Both databases store
identifiers
(implication: storage requirements are roughly doubled), but there are
differences in exactly what is stored and in design goals.
The store database is the main database, which Django insists must be
referenced as the "default" database. It stores Django's internal
tables; the principal ezidapp_storeidentifier table; user, group, and
realm information; caches of shoulders and datacenters; processing
queues; identifier statistics; and basically anything and everything
needed for mainline processing. The design goal of the store database
is high-throughput transaction speed.
The search database stores, in the ezidapp_searchidentifier table, a
copy of most identifiers (only anonymously-owned test identifiers are
excluded), albeit with many more columns and indexes to support
searching over identifiers. The table is updated asynchronously by a
daemon thread. The design goal of the search database is to support
searching as well as the various bulk and asynchronous operations:
batch downloads, batch processing, statistics computation, the OAI-PMH
feed, and so forth. The link checker's table is also stored in the
search database.
The two identifier tables have foreign key dependencies on dependent
tables for users, groups, realms, profiles, and datacenters. As the
dependent tables can't be shared across potentially physically
separate databases, they are replicated and synchronized. Generally
the search database copies are stubs of the fuller information stored
in the store database.
With Django, databases are not explicitly selected; instead, the
mapping between models (i.e., tables) and databases is defined in
settings/routers.py.
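For readers unfamiliar with Django database routers, here is a minimal
sketch of the mechanism. It is not the content of settings/routers.py:
the class name, the "search" database alias, the rule that models whose
names begin with "Search" belong to the search database, and the
same_database argument are all assumptions made for illustration. The
method names and signatures are those of Django's router interface.

```python
class StoreSearchRouter:
    """Routes search models to the search database, all others to 'default'."""

    SEARCH_MODEL_PREFIX = "Search"   # assumed naming convention

    def __init__(self, same_database=False):
        # When both logical databases share one physical database,
        # everything is routed to 'default'.
        self.same_database = same_database

    def _db_for(self, model):
        is_search = model.__name__.startswith(self.SEARCH_MODEL_PREFIX)
        if is_search and not self.same_database:
            return "search"          # assumed alias for the search database
        return "default"

    def db_for_read(self, model, **hints):
        return self._db_for(model)

    def db_for_write(self, model, **hints):
        return self._db_for(model)

    def allow_relation(self, obj1, obj2, **hints):
        # Permit relations only between objects in the same database.
        return self._db_for(type(obj1)) == self._db_for(type(obj2))
```

A router of this kind is activated by listing its dotted path in the
DATABASE_ROUTERS Django setting.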