Commit

reword
freemanzhang committed Jan 14, 2017
1 parent 9b2b2db commit 08c794e
Showing 3 changed files with 25 additions and 14 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -1,4 +1,4 @@
# Fight for 100 commits
# Fight for 200 commits

<!-- MarkdownTOC -->

@@ -35,12 +35,13 @@
- [Special case 20%](#special-case-20%)
- [Analysis 25%](#analysis-25%)
- [Tradeoff 15%](#tradeoff-15%)
- [Knowledge base 15%](#knowledge-base-15%)
- [Tradeoffs](#tradeoffs)
- [Tradeoffs between latency and durability](#tradeoffs-between-latency-and-durability)
- [Tradeoffs between availability and consistency](#tradeoffs-between-availability-and-consistency)
- [Consistency](#consistency)
- [Update consistency](#update-consistency)
- [Read consistency](#read-consistency)
- [Knowledge base 15%](#knowledge-base-15%)
- [Principles of Good Software Design](#principles-of-good-software-design)
- [Simplicity](#simplicity)
- [Loose coupling](#loose-coupling)
9 changes: 9 additions & 0 deletions system-design.iml
@@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="WEB_MODULE" version="4">
<component name="NewModuleRootManager" inherit-compiler-output="true">
<exclude-output />
<content url="file://$MODULE_DIR$" />
<orderEntry type="inheritedJdk" />
<orderEntry type="sourceFolder" forTests="false" />
</component>
</module>
25 changes: 13 additions & 12 deletions topk.md
@@ -5,9 +5,9 @@
- [MapReduce](#mapreduce)
- [Standalone word count program](#standalone-word-count-program)
- [Distributed word count program](#distributed-word-count-program)
- [Common structure](#common-structure)
- [Shuffle](#shuffle)
- [Generic process](#generic-process)
- [Interface structures](#interface-structures)
- [MapReduce steps](#mapreduce-steps)
- [Transmission in detail](#transmission-in-detail)
- [Word count MapReduce program](#word-count-mapreduce-program)
- [Offline TopK](#offline-topk)
- [Algorithm level](#algorithm-level)
@@ -77,7 +77,7 @@ for each wordCount received from first phase
- Need to partition the intermediate data (wordCount) from first phase.
- Shuffle the partitions to the appropriate machines in second phase.

## Common structure
## Interface structures
* In order for mapping, reducing, partitioning, and shuffling to seamlessly work together, we need to agree on a common structure for the data being processed.

| Phase | Input | Output |
@@ -90,7 +90,15 @@ for each wordCount received from first phase
- Map: The list of (key/value) pairs is broken up, and each individual pair &lt;k1, v1&gt; is processed by calling the map function of the mapper. In practice, the key k1 is often ignored by the mapper. The mapper transforms each &lt;k1, v1&gt; pair into a list of &lt;k2, v2&gt; pairs. For word counting, the mapper takes &lt;String filename, String file_content&gt; and promptly ignores filename. It outputs a list of &lt;String word, Integer count&gt;; the counts are emitted as a list of &lt;String word, Integer 1&gt; with repeated entries.
- Reduce: The output of all the mappers is aggregated into one giant list of &lt;k2, v2&gt; pairs. All pairs sharing the same k2 are grouped together into a new (key/value) pair, &lt;k2, list(v2)&gt;. The framework asks the reducer to process each of these aggregated (key/value) pairs individually, as sketched below.
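
As a concrete illustration of these structures, here is a minimal word-count mapper and reducer in plain Python. This is a framework-agnostic sketch; the names `map_fn` and `reduce_fn` are made up for this example.

```python
# Sketch of the word-count mapper and reducer interfaces (plain Python, no framework).
# map_fn and reduce_fn are illustrative names, not part of any real MapReduce API.

def map_fn(filename: str, file_content: str) -> list[tuple[str, int]]:
    """<k1, v1> = <filename, file_content>  ->  list of <k2, v2> = <word, 1>."""
    # k1 (the filename) is ignored, as described above.
    return [(word, 1) for word in file_content.split()]

def reduce_fn(word: str, counts: list[int]) -> list[tuple[str, int]]:
    """<k2, list(v2)> = <word, [1, 1, ...]>  ->  list of <k3, v3> = <word, total>."""
    return [(word, sum(counts))]
```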

### Shuffle
## MapReduce steps
1. Input: The system reads the file from GFS.
2. Split: The data is split across different machines, e.g., by hash value (SHA1, MD5).
3. Map: Each map task works on a split of data. The mapper outputs intermediate data.
4. Transmission: The system-provided shuffle process reorganizes the data so that all {Key, Value} pairs associated with a given key go to the same machine, to be processed by Reduce.
5. Reduce: Intermediate data of the same key goes to the same reducer.
6. Output: Reducer output is stored.
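
As a rough illustration, here is a single-process walk-through of these six steps for word counting. It is an in-memory Python sketch only, with no real GFS, splits, or networking; it just shows how the pieces connect.

```python
import hashlib
from collections import defaultdict

def run_word_count(files: dict[str, str], num_reducers: int = 3) -> dict[str, int]:
    """Single-process simulation of the six steps above for word counting."""
    # 1-2. Input + split: each file acts as one split here; a real system
    #      splits the data by size or hash across machines.
    splits = list(files.items())

    # 3. Map: every split produces intermediate (word, 1) pairs.
    intermediate = [(word, 1) for _, content in splits for word in content.split()]

    # 4. Transmission (shuffle): all pairs with the same key are routed to the
    #    same "reducer", chosen by hashing the key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in intermediate:
        r = int(hashlib.md5(word.encode()).hexdigest(), 16) % num_reducers
        partitions[r][word].append(count)

    # 5-6. Reduce + output: each reducer sums the counts for its keys.
    result = {}
    for grouped in partitions:
        for word, counts in grouped.items():
            result[word] = sum(counts)
    return result

# run_word_count({"a.txt": "the quick brown fox", "b.txt": "the lazy dog the"})
# -> {"the": 3, "quick": 1, "brown": 1, "fox": 1, "lazy": 1, "dog": 1}
```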

### Transmission in detail
* Partition: Partition the sorted output of the map phase according to hash value and write it to local disk.
- Why local disk, not GFS (only the final input/output live inside GFS):
+ GFS can be too slow.
@@ -99,13 +107,6 @@ for each wordCount received from first phase
* Send: Send sorted partitioned data to corresponding reduce machines.
* Merge sort: Merge the sorted partitions received from different machines with a merge sort, as in the sketch below.
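
A small Python sketch of the partition and merge-sort steps. Assumptions: MD5 as the partition hash, and in-memory lists standing in for the on-disk, already-sorted partition files.

```python
import hashlib
import heapq

def partition_id(word: str, num_reducers: int) -> int:
    """Hash-partition a key so the same word always lands on the same reduce machine."""
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % num_reducers

def merge_sorted_runs(runs: list[list[tuple[str, int]]]):
    """Merge already-sorted partition files received from different map machines."""
    return heapq.merge(*runs)  # yields (word, count) pairs in sorted key order

# Example: one reducer merges the sorted runs sent by two mappers.
# list(merge_sorted_runs([[("apple", 2), ("cat", 1)], [("apple", 1), ("dog", 3)]]))
# -> [("apple", 1), ("apple", 2), ("cat", 1), ("dog", 3)]
```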

### Generic process
1. The system reads the file from GFS, splits up the data across different machines, such as by hash value (SHA1, MD5)
2. Each map task works on a split of data.
3. The mapper outputs intermediate data.
4. The system-provided shuffle process reorganizes the data so that all {Key, Value} pairs associated with a given key go to the same machine, to be processed by Reduce.
5. Intermediate data of the same key goes to the same reducer. Get list of topK: {topK1, topK2, topK3, ...} from each machine
6. Reducer output is stored. Merge results from the returned topK list to get final TopK.

## Word count MapReduce program

