This project implements a simplified MapReduce model for parallel word counting, demonstrating efficient data processing using multithreading techniques. The system breaks down text processing into four key phases: Input Processing, Mapping, Shuffling, and Reducing.
Multithreaded-MapReduce-Word-Count-In-CPP/
│
├── src/
│ ├── main.cpp # Main program entry point
│ ├── functions.cpp # Implementation of core MapReduce functions
│ ├── functions.h # Header file with function declarations
│ └── testcases.cpp # Test cases for the MapReduce implementation
├── Makefile # Build configuration and commands
- Parallel Processing: Utilizes POSIX threads (pthreads) for concurrent word counting
- Thread Safety: Implements mutexes and semaphores to protect shared data structures
- Memory Efficiency: Systematic memory management to prevent leaks
- Scalable Design: Modular architecture supporting various input sizes
The word count system follows the classic MapReduce workflow:
-
Input Processing:
- Sanitize input text
- Remove special characters
- Convert to lowercase
- Distribute text chunks to mapper threads
-
Mapping Phase:
- Generate key-value pairs for each word
- Thread-safe intermediate data storage
-
Shuffling Phase:
- Group and organize key-value pairs
- Prepare data for reduction
-
Reducing Phase:
- Aggregate word counts
- Generate final output
Explanation:
- Input Processing: The raw input text is cleaned, converted into words, and distributed to mapper threads.
- Mapping Phase: Parallel threads process text chunks independently and generate intermediate key-value pairs for each word.
- Shuffling Phase: A centralized process sorts and groups the key-value pairs for easier reduction.
- Reducing Phase: Multiple threads work on grouped keys to calculate the final counts.
- Final Output: The aggregated word frequencies are displayed in a user-friendly format.
- C++ Compiler (with pthread support)
- Make utility
- POSIX-compatible operating system
make main
make test
make
# or
make all
make clean
- Efficiently processes large text inputs
- Minimal overhead from thread synchronization
- Scales well with increasing input size
The implementation includes comprehensive test cases:
- Basic word count scenarios
- Edge cases (single words, empty inputs)
- Stress tests with large text volumes
- Special character and mixed-case handling
This project is licensed under the MIT License - see the MIT License file for details.