CSCI-599 Spring 2016 - Team 1, Assignment 1
# Compute the byte-frequency (BFA) signature of a file
from rw.reader import *
from bfa.bfa import *
reader = FileReader(" PATH TO FILE ")
analyzer = BFAnalyzer()
reader.read(analyzer.compute)
print analyzer.clean()
# COMPANDING FN: x ^ ( 1 / 1.5 )
# http://www.forensicswiki.org/w/images/f/f9/Mcdaniel01.pdf
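For intuition, a minimal sketch of that companding step, assuming raw byte counts normalized by the most frequent byte (names are illustrative, not the actual bfa.bfa internals):

def compand(byte_counts):
    # normalize each of the 256 byte counts by the peak count, then
    # compress with x ** (1 / 1.5) so rarer bytes still register
    peak = max(byte_counts) or 1
    return [(c / float(peak)) ** (1 / 1.5) for c in byte_counts]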
# Correlate a base BFA signature against another (BFC)
from bfc.bfc import *
fs = BFCorrelator(baseSignature)
print fs.correlate(compareWith)
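For intuition, a simplified score in the spirit of the byte-frequency correlation from the McDaniel paper, assuming signatures are 256-entry lists of normalized frequencies (this is not the actual BFCorrelator implementation):

def simple_bfc_score(sig_a, sig_b):
    # mean closeness of two signatures: 1.0 for identical inputs,
    # lower as the per-byte frequencies diverge
    diffs = [abs(a - b) for a, b in zip(sig_a, sig_b)]
    return 1.0 - sum(diffs) / len(diffs)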
# Cross-correlate byte frequencies within a single signature (BFCC)
from bfa.cross import *
c = BFCrossCorrelator(signature)
print c.correlate()
# Compute a file header/trailer (FHT) signature
from fht.fht import *
# OFFSET -> user-defined integer
r = HTFileReader(" PATH TO FILE ", OFFSET)
fht = FHTAnalyzer(OFFSET)
r.read(fht.compute)
print fht.signature()
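To make the header/trailer idea concrete, a toy sketch of reading the first and last OFFSET bytes of a file (an illustrative helper, not part of the fht package; the actual reading is done by HTFileReader):

def head_and_tail(path, offset):
    # grab the regions an FHT signature is built from; assumes the
    # file is at least `offset` bytes long
    with open(path, 'rb') as f:
        head = f.read(offset)
        f.seek(-offset, 2)  # 2 = seek relative to the end of the file
        tail = f.read(offset)
    return head, tail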
# Compare two FHT signatures (fht1 and fht2 are FHTAnalyzer instances prepared as above)
from fht.fht import *
from fht.compare import *
cm = CompareFHT(fht1.signature(), fht2.signature())
print cm.correlate()
print cm.assuranceLevel()
# Note: assuranceLevel() returns the __MEAN__, not the __MAX__ specified in the paper linked above.
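A toy illustration of the difference, with made-up assurance values:

values = [0.9, 0.5, 0.7]         # hypothetical per-comparison assurances
print sum(values) / len(values)  # mean, what assuranceLevel() returns (~0.7)
print max(values)                # max, what the paper specifies (0.9)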
The Trec-Polar dataset contains data in the Common Crawl format and is available on S3.
Use RIOFS to mount the given S3 bucket onto a local folder.
When invoking download.py, pass MOUNT_POINT and END_POINT as command-line arguments.
The script traverses the bucket via DFS, checks the file type of each file with Tika, and then copies it to your local directory.
# Start the download script
python download.py <MOUNT_POINT> <END_POINT>
# To count the number of files of each type
find <END_POINT> -type f | sed 's%/[^/]*$%%' | sort | uniq -c
Progressive file download (Map Phase):
The Tika server is inherently multi-threaded, so parallelizing the download process reduces the download time considerably. Using Python multiprocessing, the download and parsing of files from the S3 bucket can be parallelized. Initial investigation revealed that com/, org/, gov/ and edu/ are the four biggest subfolders in the S3 bucket. Thus five separate Python processes, one for each of these folders and one for all the other folders, make an effective download strategy; each process further parallelizes with 5 threads.
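For intuition, a minimal sketch of that per-process thread pool (download_one and the file list are placeholders, not the actual code; the real logic lives in download/progressive.py):

from multiprocessing.pool import ThreadPool

def download_one(path):
    # placeholder: copy the file from the mount point and detect its
    # type with Tika; see download/progressive.py for the real logic
    pass

files = []  # assumed: the files under one top-level folder of the mount
pool = ThreadPool(5)  # 5 worker threads per process, as described above
pool.map(download_one, files)
pool.close()
pool.join()

The five processes are launched as follows: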
cd ./download
python -u progressive.py <MOUNT_POINT>/org <END_POINT>/org progress-org.log > progress-org-op.log 2> /dev/null &
python -u progressive.py <MOUNT_POINT>/com <END_POINT>/com progress-com.log > progress-com-op.log 2> /dev/null &
python -u progressive.py <MOUNT_POINT>/gov <END_POINT>/gov progress-gov.log > progress-gov-op.log 2> /dev/null &
python -u progressive.py <MOUNT_POINT>/edu <END_POINT>/edu progress-edu.log > progress-edu-op.log 2> /dev/null &
python -u progressive.py <MOUNT_POINT>/other <END_POINT>/other progress-other.log > progress-other-op.log 2> /dev/null &
# To display download errors
cat *-op.log | grep "ERROR"
Accumulating results (Reduce Phase):
Once the 5 individual folders are downloaded completely, they can be grouped into one collection using the group.py script.
python group.py <PATH TO PARENT FOLDER>
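For intuition, a rough sketch of the grouping idea, assuming each per-process folder contains type-named subfolders (the output folder name and the layout are assumptions; the actual logic lives in group.py):

import os, shutil

def group(parent):
    # merge the five per-process folders into one collection,
    # keeping files grouped by their detected type subfolder
    out = os.path.join(parent, 'grouped')  # assumed output folder name
    for part in ('org', 'com', 'gov', 'edu', 'other'):
        src = os.path.join(parent, part)
        for type_dir in os.listdir(src):
            src_dir = os.path.join(src, type_dir)
            if not os.path.isdir(src_dir):
                continue
            dst = os.path.join(out, type_dir)
            if not os.path.isdir(dst):
                os.makedirs(dst)
            for name in os.listdir(src_dir):
                shutil.move(os.path.join(src_dir, name), dst)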
The application/octet-stream folder contained a lot of empty files. The empty.py script bins these empty files separately.
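For intuition, a sketch of that binning step (the function name, arguments, and bin location are illustrative, not empty.py's actual interface):

import os, shutil

def bin_empty_files(root, empty_dir):
    # move zero-byte files out of the collection into a separate bin;
    # assumes empty_dir is outside root so moved files are not re-walked
    if not os.path.isdir(empty_dir):
        os.makedirs(empty_dir)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                shutil.move(path, os.path.join(empty_dir, name))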
We leveraged Python multiprocessing for BFA and FHT signature computation, using a two-phase Map-Reduce approach:
- Map Phase: A pool of n threads computes signatures for the files in a given bin and writes the file size and signatures into a CSV output file. (batch/bfa.py)
- Reduce Phase: A Python program reads the CSV line by line and computes the average of each signature. (batch/bfa_aggreagte.py) Note: this only uses 75% of the signatures to compute the average; for types with fewer than 5 files, all the signatures are used.
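For intuition, a sketch of the reduce step, assuming each CSV row holds the file size followed by the signature values (the 75% selection rule shown is an assumption; see batch/bfa_aggreagte.py for the real one):

import csv

def average_signatures(csv_path):
    with open(csv_path) as f:
        # drop the leading file-size column, keep the signature values
        rows = [[float(v) for v in row[1:]] for row in csv.reader(f)]
    if len(rows) >= 5:
        rows = rows[:int(len(rows) * 0.75)]  # assumed 75% subset rule
    return [sum(col) / len(rows) for col in zip(*rows)]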
The r/size-summary.r script reads the generated signature files and runs k-means clustering based on the file sizes for a given type. It also generates JPEG plots of the file-size variation for each cluster.
The batch/bfa_size_aggreagate.py script reads the output generated from size-summary.r and computes average signatures for each size cluster for a given file type. Note: this only happens for types with no fewer than 5 unique signatures.
- Map Phase: A pool of n threads reads the first and last 4, 8 and 16 bytes of all the files in a given bin and stores them in 3 separate files. (batch/fht.py)
- Reduce Phase: A Python program reads these files 16, 32 and 64 bits at a time and computes the aggregate FHT signature. (batch/fht_aggreagte.py)
Similar approaches are used for the BFC and BFCC calculations, with bfc_aggregate.py and bfcc_aggregate.py.
A separate Grunt/Angular application has been built to visualize these results. The visualize.py script computes BFA, BFC, Cross-Correlation and FHT signatures for given files and produces them in a format readable by the web app. Note: the generated signatures need to be moved into the web app's /data folder after computation.
python visualize.py bfa <FilePath>
python visualize.py bfc <FilePath1> <FilePath2>
python visualize.py bfcc <FilePath>
python visualize.py fht <FilePath> <OFFSET>
python visualize.py fhtc <FilePath1> <FilePath2> <OFFSET>