This file records interesting information about the profiling.
If there is not enough VM for the graph, the process kills itself.
C malloc does not allocate RAM, only VM (physical pages are only committed when the memory is first written).
Graphs with a greater number of nodes might be smaller because they have fewer edges.
We want to profile the memory usage of the graph, but the framework's usage varies as well.
Although it is possible to compute the memory usage of a Python object with pickle.dumps, it is an expensive
operation that takes a long time once graphs have many sub-objects (5 min for a 2000-node graph).
There might be a better way, but in general it is not an easy thing to do.
Therefore it makes more sense to report global memory usage, since some of the RAM allocated to the framework
might be transferred into VM.
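A minimal sketch of both measurements, assuming only the standard-library pickle and resource modules (the graph g and its construction are placeholders):

    import pickle
    import resource

    def object_size_bytes(obj):
        # Expensive for graphs with many sub-objects: it serializes the
        # whole object graph just to measure its size.
        return len(pickle.dumps(obj))

    def peak_rss():
        # Peak resident set size of the whole process; on Linux ru_maxrss
        # is reported in kB (on macOS, in bytes).
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    # g = build_graph(...)  # placeholder for the actual graph construction
    # print(object_size_bytes(g), peak_rss())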
Add file size to the profile metrics.
Zen objects are roughly 3 times bigger when loaded into memory.
For 75 MB, the maximum available VM is ~271 MB.
Using pickle.dumps:
- Empty graph: 4,200 bytes => ~4 kB
- 100-node graph: 15,753 bytes ~ 15 kB => 100 nodes (erdos_renyi) ~= 11 kB
- 1000-node graph: 1,784,598 bytes ~ 1.7 MB
- A 10000-node graph is likely to be 100 MB
- A 20000-node graph would be 400 MB
Available VM:
Taking Zen's memory usage into account, 150 MB of RAM gives us a VM availability of 889 MB.
The point is to constrain the memory usable by the graph, but not too much; otherwise there is not enough
space for virtual memory and the program crashes.
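One way to impose such a constraint, sketched under the assumption of a POSIX system (the 150 MB figure is the RAM budget mentioned above):

    import resource

    # Cap this process's total virtual address space at 150 MB.
    # Allocations beyond the limit then fail cleanly (MemoryError in
    # Python) instead of the process being killed for lack of VM.
    limit_bytes = 150 * 1024 * 1024
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))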
==== At this point we are able to generate graphs, generate partitions, and give an estimated upper bound on how many partitions have to be loaded
in order to answer the queries. We are currently investigating the different partitioning schemes available in the chaco tool, as well as the
efficiency of the technique. We want to compare the results found with Kernighan-Lin (K&L) to those found using a randomized partition.
Observations:
We note a small difference between the 'no edge cost' KL solution and a random partition when k=1, but otherwise the difference is pretty minor. This is with
a total of 16 partitions. We will now attempt to compute results with more and with fewer partitions.
We should add edge weights to our partitioning example.
===== Monday March 25th ===========
Tasks:
- Generate power-law graphs (see the sketch below).
- Analyze the graphs using three techniques: random, ML-KL (no edge costs), ML-KL (with edge costs).
- Render the graphs.
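A minimal generation sketch for the first task; networkx is an assumption here (the project's own generation goes through Zen, whose generators are analogous), and the sizes are placeholders:

    import networkx as nx

    # Power-law degree distribution via preferential attachment
    # (Barabasi-Albert): n nodes, m edges added per new node.
    g_power_law = nx.barabasi_albert_graph(n=1000, m=3)

    # Erdos-Renyi reference graph with a comparable expected edge count.
    g_random = nx.erdos_renyi_graph(n=1000, p=0.006)

    print(g_power_law.number_of_edges(), g_random.number_of_edges())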
Assuming 500 friends on average per person on Facebook and approximating the user base at 1 billion, we have an average edge cover for each node of 0.0000005 (500/10^9), which is very sparse
(much sparser than the sparse graphs used here; we should maybe consider such graphs more), i.e. generate more graphs with fewer edges and more vertices.
Very good news for the Barabasi graphs when fairly sparse. The same is not true when the distributions are denser. We notice that the difference between the KL partitioning and
a fully random partitioning gets larger as k gets bigger. When k=4, the number of partitions visited is reduced by up to 90%.
We are now proceeding to compare partitionings with and without edge costs. Generating the analysis has been successful, and we are sure to compare partitionings that originated from
the same graph. We notice that there is some randomness: depending on the situation, edge costs may or may not bring a benefit.
The results were obtained by averaging over 10 random start nodes.
We will now try a larger sample to see if we can get rid of this randomness, at least partly.
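For reference, a sketch of the metric itself, under the assumption that it counts the distinct partitions touched by a breadth-first exploration of depth k from a random start node (adj and partition are placeholder mappings):

    import random

    def partitions_visited(adj, partition, start, k):
        # adj: node -> list of neighbours; partition: node -> partition id.
        # Explore k hops from start and collect the partitions that would
        # have to be loaded to answer the query.
        frontier, seen = {start}, {start}
        touched = {partition[start]}
        for _ in range(k):
            frontier = {v for u in frontier for v in adj[u] if v not in seen}
            seen |= frontier
            touched |= {partition[v] for v in frontier}
        return len(touched)

    def average_partitions_visited(adj, partition, k, samples=10):
        starts = random.sample(list(adj), samples)
        return sum(partitions_visited(adj, partition, s, k) for s in starts) / samples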
===== Wednesday April 3rd =========
We have now attempted to uncover Facebook graphs. The number of edges is larger than expected, but there are a lot of duplicate edges; we have to manage the duplication.
The duplication was managed using an alternate file for the jar (maybe we should incorporate this mechanism in a separate jar).
The deduplication has to be verified; a sketch of one way to check it follows.
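A minimal verification sketch, assuming the graph is stored as an undirected edge list with one whitespace-separated pair per line (the file name is hypothetical):

    def count_duplicate_edges(path):
        # Normalize each undirected edge to a sorted pair so that
        # "a b" and "b a" count as the same edge.
        seen, duplicates = set(), 0
        with open(path) as f:
            for line in f:
                u, v = line.split()[:2]
                edge = (u, v) if u <= v else (v, u)
                if edge in seen:
                    duplicates += 1
                else:
                    seen.add(edge)
        return duplicates

    print(count_duplicate_edges("facebook_edges.txt"))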
The partitioning was run on the duplicated data (edges specified both ways). The analysis, however, used the duplication-free version of the Facebook
graph. We have to make sure the two are consistent (and understand why the duplication-free graph cannot easily be partitioned by chaco ==> the structure
of the graph might be corrupt). Another bug is the presence of the "no object reference" warnings about the graph.
These messages usually mean we are trying to use an index not assigned to a node.