- Report.docx - Thesis on Equal Pay
- mapper.py - Mapper script for Hadoop
- reducer.py - Reducer script for Hadoop
- salaries.csv - Initial data set on salaries in San Francisco from Kaggle
- salaries.txt - Output from the preprocessing script
- salaries_discrimination.csv - Output from the Hadoop MapReduce job
- salaries_discrimination_analytics.R - Script to perform analytics on salaries_discrimination.csv
- salaries_preprocessing.R - Script to perform preprocessing on salaries.csv
Note: You will have to change the file paths within the R scripts to match your system.
- Open salaries_preprocessing.R in RStudio and run it to produce salaries.txt.
- Follow the instructions to install the virtual machine that contains Hadoop here.
- Transfer salaries.txt, mapper.py, and reducer.py to the virtual machine using the instructions here.
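- Make the input directory in HDFS on the virtual machine using the following command:
hadoop fs -mkdir jobinput
- Do not create joboutput beforehand: Hadoop creates the output directory itself, and the job fails if that directory already exists.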
- Put salaries.txt into HDFS using the following command:
hadoop fs -put salaries.txt jobinput
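- Run the MapReduce job using the following command (hs is a shell alias, defined on the course virtual machine, that wraps the Hadoop Streaming invocation):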
hs mapper.py reducer.py jobinput joboutput
- The job's output will be stored in the file part-00000 located in the joboutput directory. The file can be retrieved from HDFS using the following command:
hadoop fs -get joboutput/part-00000 salaries_discrimination.csv
- Transfer salaries_discrimination.csv to the local machine using the instructions outlined earlier.
- Open salaries_discrimination_analytics.R and run it.
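For readers new to Hadoop Streaming, the contract that mapper.py and reducer.py follow in the MapReduce step can be sketched as below. This is a minimal illustration and not the repository's actual scripts: the two-column tab-separated input (a group label and a salary), the per-group average, and the function names `mapper`, `reducer`, and `run_locally` are all assumptions made for the example. The general pattern is that a streaming mapper emits `key<TAB>value` lines, Hadoop sorts them by key, and the reducer aggregates consecutive lines that share a key.

```python
import itertools


def mapper(lines):
    """Emit 'key<TAB>value' for each well-formed input line.

    Assumed (hypothetical) input layout: group<TAB>salary.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed records
        group, salary = fields
        try:
            float(salary)
        except ValueError:
            continue  # skip non-numeric salaries, e.g. a header row
        yield f"{group}\t{salary}"


def reducer(sorted_lines):
    """Average the values of each run of identical keys.

    Relies on the input being sorted by key, which Hadoop guarantees
    between the map and reduce phases.
    """
    pairs = (line.split("\t") for line in sorted_lines)
    for key, run in itertools.groupby(pairs, key=lambda kv: kv[0]):
        values = [float(value) for _, value in run]
        yield f"{key}\t{sum(values) / len(values):.2f}"


def run_locally(lines):
    """Simulate map -> sort by key -> reduce on one machine."""
    mapped = sorted(mapper(lines), key=lambda kv: kv.split("\t")[0])
    return list(reducer(mapped))


if __name__ == "__main__":
    # Made-up sample records, including a header row that is skipped.
    sample = ["group\tsalary\n", "A\t50000\n", "B\t60000\n", "A\t70000\n"]
    for row in run_locally(sample):
        print(row)
```

A similar local check is often possible with the real scripts before submitting the job, for example with a shell pipeline such as `cat salaries.txt | python mapper.py | sort | python reducer.py`, assuming both scripts read standard input and write standard output as Hadoop Streaming requires.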
Thanks to Cloudera for their free course: Intro. to Hadoop and MapReduce