Description
-----------
File Conveyor is designed to discover new, changed and deleted files via the
operating system's built-in file system monitor. After discovering the files,
they can optionally be processed by a chain of processors – you can easily
write new ones yourself. After files have been processed, they can also
optionally be transported to a server.
Discovery happens through inotify on Linux (with kernel >= 2.6.13), through
FSEvents on Mac OS X (>= 10.5) and through polling on other operating systems.
Processors are simple Python scripts that can change the file's base name (it
is impossible to change the path) and apply any sort of processing to the
file's contents. Examples are image optimization and video transcoding.
Transporters are simple threaded abstractions around Django storage systems.
For a detailed description of the innards of file conveyor, see my bachelor
thesis text (find it via http://wimleers.com/tags/bachelor-thesis).
This application was written as part of the bachelor thesis [1] of Wim Leers
at Hasselt University [2].
[1] http://wimleers.com/tags/bachelor-thesis
[2] http://uhasselt.be/
<BLINK>IMPORTANT WARNING</BLINK>
--------------------------------
I've attempted to provide a solid enough README to get you started, but I'm
well aware that it isn't superb. As this is just a bachelor thesis, time
was fairly limited. I've opted to create a solid basis instead of an extremely
rigorously documented piece of software. If you cannot find the answer in
README.txt, INSTALL.txt or API.txt, then please look at my bachelor thesis
text instead. If none of these is sufficient, then please contact me.
Upgrading
---------
If you're upgrading from a previous version of File Conveyor, please run
upgrade.py.
==============================================================================
| The basics |
==============================================================================
Configuring File Conveyor
-------------------------
The sample configuration file (config.sample.xml) should be self-explanatory.
Copy this file to config.xml, which is the file File Conveyor will look for,
and edit it to suit your needs.
For a detailed description, see my bachelor thesis text (look for the
"Configuration file design" section).
Each rule consists of 3 components:
- filter
- processorChain
- destinations
A rule can also be configured to delete source files after they have been
synced to the destination(s).
The filter and processorChain components are optional. You must have at least
one destination.
If you want to use File Conveyor to process files locally, i.e. without
transporting them to a server, then use the Symlink or Copy transporter (see
below).
Starting File Conveyor
----------------------
File Conveyor must be started by starting its arbitrator (which links
everything together; it controls the file system monitor, the processor
chains, the transporters and so on). You can start the arbitrator like this:
python /path/to/fileconveyor/arbitrator.py
Stopping File Conveyor
----------------------
File Conveyor listens to standard signals to know when it should end, just
like the Apache HTTP server does. Send the TERM signal to terminate it:
kill -TERM `cat ~/.fileconveyor.pid`
You can configure File Conveyor to store the PID file in the more typical
/var/run location on *nix:
* You can change the PID_FILE setting in settings.py to
/var/run/fileconveyor.pid. However, this requires File Conveyor to be run with
root permissions (/var/run requires root permissions).
* Alternatively, you can create a new directory in /var/run which then no
longer requires root permissions. This can be achieved through these commands:
1. sudo mkdir /var/run/fileconveyor
2. sudo chown fileconveyor-user /var/run/fileconveyor
3. sudo chmod 700 /var/run/fileconveyor
Then, you can change the PID_FILE setting in settings.py to
/var/run/fileconveyor/fileconveyor.pid, and you won't need to run File
Conveyor with root permissions anymore.
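The kill command from this section can also be issued from Python. The
following is only a minimal sketch: the function name stop_file_conveyor and
its kill parameter are hypothetical and not part of File Conveyor. It assumes
the PID file contains nothing but the process ID.

```python
import os
import signal


def stop_file_conveyor(pid_file, kill=os.kill):
    """Read the PID from pid_file and send SIGTERM to that process.

    `kill` is injectable (hypothetical parameter) so the function can be
    exercised without signalling a real process.
    """
    with open(os.path.expanduser(pid_file)) as f:
        pid = int(f.read().strip())
    kill(pid, signal.SIGTERM)
    return pid
```

For example, stop_file_conveyor("~/.fileconveyor.pid") would do the same as
the shell command above, provided the PID file is at its default location.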
File Conveyor's behavior
------------------------
Upon startup, File Conveyor starts the file system monitor and then performs a
"manual" scan to detect changes since the last time it ran. If you've got a
lot of files, this may take a while.
Just for fun, type the following while File Conveyor is syncing:
killall -9 python
Now File Conveyor is dead. Upon starting it again, you should see something like:
2009-05-17 03:52:13,454 - Arbitrator - WARNING - Setup: initialized 'pipeline' persistent queue, contains 2259 items.
2009-05-17 03:52:13,455 - Arbitrator - WARNING - Setup: initialized 'files_in_pipeline' persistent list, contains 47 items.
2009-05-17 03:52:13,455 - Arbitrator - WARNING - Setup: initialized 'failed_files' persistent list, contains 0 items.
2009-05-17 03:52:13,671 - Arbitrator - WARNING - Setup: moved 47 items from the 'files_in_pipeline' persistent list into the 'pipeline' persistent queue.
2009-05-17 03:52:13,672 - Arbitrator - WARNING - Setup: moved 0 items from the 'failed_files' persistent list into the 'pipeline' persistent queue.
As you can see, 47 items were still in the pipeline when File Conveyor was
killed. They're now simply added to the pipeline queue again and they will be
processed once again.
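The recovery step reported in those log lines can be pictured roughly as
follows. This is a simplified sketch using in-memory containers. The real
data structures are persisted to disk, and the function name recover_pipeline
is an assumption made for illustration only.

```python
from collections import deque


def recover_pipeline(pipeline_queue, files_in_pipeline, failed_files):
    """Move leftover items back into the pipeline queue.

    Returns how many items were moved from each list, mirroring the two
    "moved N items" log lines.
    """
    moved_in_pipeline = len(files_in_pipeline)
    pipeline_queue.extend(files_in_pipeline)
    del files_in_pipeline[:]  # the list is empty after recovery

    moved_failed = len(failed_files)
    pipeline_queue.extend(failed_files)
    del failed_files[:]

    return moved_in_pipeline, moved_failed
```

With the numbers from the log above (2259 queued, 47 in the pipeline, 0
failed), the queue would hold 2306 items after recovery.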
The initial sync
----------------
To get a feel for File Conveyor's speed, you may want to run it in the console
and look at its output.
Verifying the synced files
--------------------------
Running the verify.py script will open the synced files database and verify
that each synced file actually exists.
==============================================================================
| Processors |
==============================================================================
Addressing processors
---------------------
You can address a specific processor by first specifying its processor module
and then the exact processor name (which is its class name):
- unique_filename.MD5
- image_optimizer.KeepMetadata
- yui_compressor.YUICompressor
- link_updater.CSSURLUpdater
But, it works with third-party processors too! Just make sure the third-party
package is in the Python path and then you can just use this in config.xml:
- MyProcessorPackage.SomeProcessorClass
Processor module: filename
--------------------------
Available processors:
1) SpacesToUnderscores
Changes a filename; replaces spaces by underscores. E.g.:
this is a test.txt --> this_is_a_test.txt
2) SpacesToDashes
Changes a filename; replaces spaces by dashes. E.g.:
this is a test.txt --> this-is-a-test.txt
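The two renames amount to a simple string replacement on the base name. The
functions below are a standalone illustration, not the actual processor
classes (see API.txt for the real processor interface).

```python
def spaces_to_underscores(basename):
    """Replace every space in a file's base name with an underscore."""
    return basename.replace(" ", "_")


def spaces_to_dashes(basename):
    """Replace every space in a file's base name with a dash."""
    return basename.replace(" ", "-")
```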
Processor module: unique_filename
---------------------------------
Available processors:
1) Mtime
Changes a filename based on the file's mtime. E.g.:
logo.gif --> logo_1240668971.gif
2) MD5
Changes a filename based on the file's MD5 hash. E.g.:
logo.gif --> logo_2f0342a2b9aaf48f9e75aa7ed1d58c48.gif
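As a rough sketch of how such names can be derived, assuming the suffix is
inserted between the base name and the extension as in the examples above
(the function names are hypothetical and this is not the processors' actual
code):

```python
import hashlib
import os


def mtime_filename(path):
    """Return the base name with the file's mtime (in seconds) appended."""
    base, ext = os.path.splitext(os.path.basename(path))
    return "%s_%d%s" % (base, int(os.stat(path).st_mtime), ext)


def md5_filename(path):
    """Return the base name with the MD5 hash of the contents appended."""
    base, ext = os.path.splitext(os.path.basename(path))
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return "%s_%s%s" % (base, digest, ext)
```

The MD5 variant only changes the name when the contents change, whereas the
mtime variant also changes it when the file is merely touched.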
Processor module: image_optimizer
---------------------------------
It's important to note that all metadata is stripped from JPEG images, as that
is the most effective way to reduce the image size. However, this might also
strip copyright information, i.e. this can also have legal consequences.
Choose one of the "keep metadata" classes if you want to avoid this.
When optimizing GIF images, they are converted to the PNG format, which also
changes their filename.
Available processors:
1) Max
optimizes image files losslessly (GIF, PNG, JPEG, animated GIF)
2) KeepMetadata
same as Max, but keeps JPEG metadata
3) KeepFilename
same as Max, but keeps the original filename (no GIF optimization)
4) KeepMetadataAndFilename
same as Max, but keeps JPEG metadata and the original filename (no GIF
optimization)
Processor module: yui_compressor
--------------------------------
Warning: this processor is CPU-intensive! Since you typically don't get new
CSS and JS files all the time, it's still fine to use this. But the initial
sync may cause a lot of CSS and JS files to be processed and thereby cause a
lot of load!
Available processors:
1) YUICompressor
Compresses .css and .js files with the YUI Compressor
Processor module: google_closure_compiler
-----------------------------------------
Warning: this processor is CPU-intensive! Since you typically don't get new
JS files all the time, it's still fine to use this. But the initial sync may
cause a lot of JS files to be processed and thereby cause a lot of load!
Available processors:
1) GoogleClosureCompiler
Compresses .js files with the Google Closure Compiler
Processor module: link_updater
------------------------------
Warning: this processor is CPU-intensive! Since you typically don't get new
CSS files all the time, it's still fine to use this. But the initial sync may
cause a lot of CSS files to be processed and thereby cause a lot of load! Note
that this processor will skip a CSS file if not all files that are referenced
from it have been synced to the CDN yet. This means the CSS files may need to
be parsed over and over again until the referenced files have been synced.
This processor seems to be a good candidate for optimization. It uses the
cssutils Python module, which validates every CSS property. This is an
enormous slowdown: on a 2.66 GHz Core 2 Duo, it causes 100% CPU usage every
time it runs.
This module also seems to suffer from rather massive memory leaks. Memory
usage can easily top 30 MB on Mac OS X where it would never go over 17 MB
without this processor!
This processor will replace all URLs in CSS files with references to their
counterparts on the CDN. There are a couple of important gotchas when using
this processor module:
- absolute URLs (http://, https://) are ignored, only relative URLs are
processed
- if a referenced file doesn't exist, its URL will remain unchanged
- if one of the referenced images or fonts is changed and therefore resynced,
and if it is configured to have a unique filename, the CDN URL referenced
from the updated CSS file will no longer be valid. Therefore, when you
update an image file or font file that is referenced by CSS files, you
should modify the CSS files as well. Just modifying the mtime (by using the
touch command) is sufficient.
- it requires the referenced files to be synced to the same server the CSS
file is being synced to. This implies that all the referenced files must
also be synced to the same server, or the CSS file will never get synced!
Available processors:
1) CSSURLUpdater
Replaces URLs in .css files with their counterparts on the CDN
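To illustrate the first two gotchas, here is a deliberately simplified
sketch of URL rewriting. It uses a regular expression instead of cssutils,
and the function name rewrite_css_urls and the cdn_urls mapping (relative
URL to CDN URL) are assumptions made for this example. It does not model the
behavior of skipping a CSS file until all its references have been synced.

```python
import re

# Matches url(...) with optional single or double quotes around the URL.
URL_RE = re.compile(r"""url\(\s*(['"]?)([^'")]+)\1\s*\)""")


def rewrite_css_urls(css, cdn_urls):
    """Replace relative url(...) references with their CDN counterparts."""
    def replace(match):
        quote, url = match.group(1), match.group(2).strip()
        # Absolute URLs are ignored.
        if url.startswith(("http://", "https://")):
            return match.group(0)
        # Unknown files keep their original URL.
        if url not in cdn_urls:
            return match.group(0)
        return "url(%s%s%s)" % (quote, cdn_urls[url], quote)

    return URL_RE.sub(replace, css)
```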
==============================================================================
| Transporters |
==============================================================================
Addressing transporters
-----------------------
You can address a specific transporter by only specifying its module:
- cf
- ftp
- cloudfiles
- s3
- sftp
- symlink_or_copy
But, it works with third-party transporters too! Just make sure the
third-party package is in the Python path and then you can just use this in
config.xml:
- MyTransporterPackage
Transporter: FTP (ftp)
----------------------
Value to enter: "ftp".
Available settings:
- host
- username
- password
- url
- port
- path
- key
Transporter: SFTP (sftp)
------------------------
Value to enter: "sftp".
Available settings:
- host
- username
- password
- url
- port
- path
Transporter: Amazon S3
----------------------
Value to enter: "s3".
Available settings:
- access_key_id
- secret_access_key
- bucket_name
- bucket_prefix
Using more than 4 concurrent connections doesn't show a significant speedup.
Transporter: Amazon CloudFront
------------------------------
Value to enter: "cf".
Available settings:
- access_key_id
- secret_access_key
- bucket_name
- bucket_prefix
- distro_domain_name
Transporter: Rackspace Cloud Files
----------------------------------
Value to enter: "cloudfiles".
Available settings:
- username
- api_key
- container
Transporter: Symlink or Copy
----------------------------
Value to enter: "symlink_or_copy".
Available settings:
- location
- url
Transporter: Amazon CloudFront - Creating a CloudFront distribution
-------------------------------------------------------------------
You can either use the S3Fox Firefox add-on to create a distribution or use
the included Python function to do so. In the latter case, do the following:
>>> import sys
>>> sys.path.append('/path/to/fileconveyor/transporters')
>>> sys.path.append('/path/to/fileconveyor/dependencies')
>>> from transporter_cf import create_distribution
>>> create_distribution("access_key_id", "secret_access_key", "bucketname.s3.amazonaws.com")
Created distribution
- domain name: dqz4yxndo4z5z.cloudfront.net
- origin: bucketname.s3.amazonaws.com
- status: InProgress
- comment:
- id: E3FERS845MCNLE
Over the next few minutes, the distribution will become active. This
function will keep running until that happens.
............................
The distribution has been deployed!
==============================================================================
| The advanced stuff |
==============================================================================
Constants in Arbitrator.py
--------------------------
The following constants can be tweaked to change where File Conveyor stores
its files, or to change its behavior.
RESTART_AFTER_UNHANDLED_EXCEPTION = True
Whether File Conveyor should restart itself after it encountered an
unhandled exception (i.e., a bug).
RESTART_INTERVAL = 10
After how much time File Conveyor should restart itself, after it has
encountered an unhandled exception. Thus, this setting only has an effect
when RESTART_AFTER_UNHANDLED_EXCEPTION == True.
LOG_FILE = './fileconveyor.log'
The log file.
PERSISTENT_DATA_DB = './persistent_data.db'
Where to store persistent data (pipeline queue, 'files in pipeline' list and
'failed files' list).
SYNCED_FILES_DB = './synced_files.db'
Where to store the input_file, transported_file_basename, url and server for
each synced file.
WORKING_DIR = '/tmp/fileconveyor'
The working directory.
MAX_FILES_IN_PIPELINE = 50
The maximum number of files in the pipeline. Should be high enough in order
to prevent transporters from idling too long.
MAX_SIMULTANEOUS_PROCESSORCHAINS = 1
The maximum number of processor chains that may be executed simultaneously.
If you've got CPU intensive processors and if you're running File Conveyor
on the web server, you'll want to keep this very low, probably at 1.
MAX_SIMULTANEOUS_TRANSPORTERS = 10
The maximum number of transporters that may be running simultaneously. This
effectively caps the number of simultaneous connections. It can also be used
to have some -- although limited -- control on the throughput consumed by
the transporters.
MAX_TRANSPORTER_QUEUE_SIZE = 1
The maximum number of files queued for each transporter. It's recommended to keep
this low enough to ensure files are not unnecessarily waiting. If you set
this too high, no new transporters will be spawned, because all files will
be queued on the existing transporters. Setting this to 0 can only be
recommended in environments with a continuous stream of files that need
syncing. The default of 1 is to ensure each transporter is idling as little
as possible.
QUEUE_PROCESS_BATCH_SIZE = 20
The number of files that will be processed when processing one of the many
queues. Setting this too low will cause overhead. Setting this too high will
cause delays for files that are ready to be processed or transported. See
the "Pipeline design pattern" section in my bachelor thesis text.
CALLBACKS_CONSOLE_OUTPUT = False
Controls whether output will be generated for each callback. (There are
callbacks for the file system monitor, processor chains and transporters.)
CONSOLE_LOGGER_LEVEL = logging.WARNING
Controls the output level of the logging to the console. For a full list of
possibilities, see http://docs.python.org/release/2.6/library/logging.html#logging-levels.
FILE_LOGGER_LEVEL = logging.DEBUG
Controls the output level of the logging to the log file. For a full list of
possibilities, see http://docs.python.org/release/2.6/library/logging.html#logging-levels.
RETRY_INTERVAL = 30
Sets the interval in which the 'failed files' list is appended to the
pipeline queue, to retry to sync these failed files.
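To show how CONSOLE_LOGGER_LEVEL and FILE_LOGGER_LEVEL can differ, here is a
minimal sketch using Python's standard logging module. The function name
build_logger is hypothetical, and the format string is merely inferred from
the log lines shown earlier in this README. It is not necessarily how File
Conveyor sets up its logging.

```python
import logging


def build_logger(log_file, name="Arbitrator",
                 console_level=logging.WARNING,
                 file_level=logging.DEBUG):
    """Create a logger with separate levels for the console and a log file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    formatter = logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s")

    console = logging.StreamHandler()
    console.setLevel(console_level)
    console.setFormatter(formatter)
    logger.addHandler(console)

    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(file_level)
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    return logger
```

With these defaults, DEBUG messages end up in the log file only, while
WARNING messages and above appear in both places.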
Understanding persistent_data.db
--------------------------------
We'll go through this by using a sample database I created. You should be able
to reproduce similar output on your persistent_data.db file using the exact
same commands.
Access the database, by using the SQLite console application.
$ sqlite3 persistent_data.db
SQLite version 3.6.11
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
As you can see, there are three tables in the database, one for every
persistent data structure:
sqlite> .table
failed_files_list pipeline_list pipeline_queue
Simple count queries show how many items there are in each persistent data
structure. In this case for example, there are 2560 files waiting to enter the
pipeline, 50 were in the pipeline at the time of stopping File Conveyor (these
will be added to the queue again once we restart File Conveyor) and 0 files
are in the list of failed files. Files end up in there when their processor
chain or (one of) their transporters fails.
sqlite> SELECT COUNT(*) FROM pipeline_queue;
2560
sqlite> SELECT COUNT(*) FROM pipeline_list;
50
sqlite> SELECT COUNT(*) FROM failed_files_list;
0
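The same counts can be obtained from Python with the standard sqlite3 module.
This is only a sketch. The function name count_persistent_items is an
assumption, and the table names are taken from the .table output above.

```python
import sqlite3

# Table names as listed by ".table" above.
TABLES = ("pipeline_queue", "pipeline_list", "failed_files_list")


def count_persistent_items(db_path):
    """Return a dict mapping each table name to its number of rows."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(
            (table,
             conn.execute("SELECT COUNT(*) FROM %s" % table).fetchone()[0])
            for table in TABLES
        )
    finally:
        conn.close()
```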
You can also look at the database schemas of these tables:
sqlite> .schema pipeline_queue
CREATE TABLE pipeline_queue(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);
sqlite> .schema pipeline_list
CREATE TABLE pipeline_list(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);
sqlite> .schema failed_files_list
CREATE TABLE failed_files_list(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);
As you can see, the three tables have identical schemas. The type for the
stored item is 'pickle', which means that you can store any Python object in
there as long as it can be "pickled", which roughly means "convertible to
a string representation". "Serialization" is the term PHP developers have
given to this, although pickling is much more advanced.
The Python object stored in there is the same for all three tables: a tuple of
the filename (as a string) and the event (as an integer). The event is one of
FSMonitor.CREATED, FSMonitor.MODIFIED, FSMonitor.DELETED.
This file is what tracks the current state of File Conveyor. Thanks to this file,
it is possible for File Conveyor to crash and not lose any data.
Deleting this file would cause File Conveyor to lose all of its current work.
Only new (as in: after the file was deleted) changes in the file system would
be picked up. Changes that still had to be synced, would be forgotten.
Understanding fsmonitor.db
--------------------------
This database has a single table: pathscanner (which is inherited from the
pathscanner module around which the fsmonitor module is built). Its schema is:
sqlite> .schema pathscanner
CREATE TABLE pathscanner(path text, filename text, mtime integer);
This file is what tracks the current state of the directory tree associated
with each source. When an operating system's file system monitor is used, this
database will be updated through its callbacks. When no such file system
monitor is available, it will be updated through polling.
Deleting this file would cause File Conveyor to have to sync all files again.
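The polling fallback can be understood as comparing a stored snapshot of
(path, filename, mtime) rows against a fresh scan of the directory tree. The
sketch below illustrates that idea with plain dictionaries. The function
names scan and diff are hypothetical, and this is not the pathscanner
module's actual API.

```python
import os


def scan(root):
    """Return {(path, filename): mtime} for every file below root."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            snapshot[(dirpath, name)] = int(os.stat(full).st_mtime)
    return snapshot


def diff(old, new):
    """Compare two snapshots; return (created, modified, deleted) keys."""
    created = sorted(k for k in new if k not in old)
    deleted = sorted(k for k in old if k not in new)
    modified = sorted(k for k in new if k in old and new[k] != old[k])
    return created, modified, deleted
```

If the stored snapshot is empty, as it would be after deleting the database,
every file shows up as created, which matches the full resync described
above.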
Understanding synced_files.db
-----------------------------
We'll go through this by using a sample database I created. You should be able
to reproduce similar output on your synced_files.db file using the exact
same commands.
Access the database, by using the SQLite console application.
$ sqlite3 synced_files.db
SQLite version 3.6.11
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
As you can see, there's only one table: synced_files.
sqlite> .table
synced_files
Let's look at the schema. There are 4 fields: input_file,
transported_file_basename, url and server. input_file is the full path.
transported_file_basename is the base name of the file that was transported to
the server. This is stored because the filename might have been altered by the
processors that have been applied to it, but the path cannot change. I use
this to delete the previous version of a file if a file has been modified. The
url field is of course the URL to retrieve the file from the server. Finally,
the server field contains the name you've assigned to the server in the
configuration file. Each file may be synced to multiple servers and this
allows you to check if a file has been synchronized to a specific server.
sqlite> .schema synced_files
CREATE TABLE synced_files(input_file text, transported_file_basename text, url text, server text);
We can again use simple count queries to learn more about the synced files. As
you can see, 845 files have been synced, of which 602 have been synced to
the server that was named "origin pull cdn" and 243 to the server that was
named "ftp push cdn".
sqlite> SELECT COUNT(*) FROM synced_files;
845
sqlite> SELECT COUNT(*) FROM synced_files WHERE server="origin pull cdn";
602
sqlite> SELECT COUNT(*) FROM synced_files WHERE server="ftp push cdn";
243
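Instead of one query per server, a single GROUP BY query gives the count for
every server at once. The following sketch uses Python's sqlite3 module and
relies only on the schema shown above. The function name
synced_files_per_server is an assumption.

```python
import sqlite3


def synced_files_per_server(db_path):
    """Return a dict mapping each server name to its number of synced files."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT server, COUNT(*) FROM synced_files GROUP BY server"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```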
License
-------
This application is dual-licensed under the GPL and the UNLICENSE.
Due to the dependencies that were initially included within File Conveyor,
which were all subject to GPL-compatible licenses, it made sense to initially
release the source code under the GPL.
Then, it was decided the UNLICENSE was a better fit.
Author
------
Wim Leers ~ http://wimleers.com/
This application was written as part of the bachelor thesis of Wim Leers at
Hasselt University.