Big Data - Continuous Assessment

Files

Report.docx - Thesis on Equal Pay
mapper.py - Mapper script for Hadoop
reducer.py - Reducer script for Hadoop
salaries.csv - Initial data set on salaries in San Francisco from Kaggle
salaries.txt - Output from the preprocessing script
salaries_discrimination.csv - Output from the Hadoop MapReduce job
salaries_discrimination_analytics.R - Script to perform analytics on salaries_discrimination.csv
salaries_preprocessing.R - Script to perform preprocessing on salaries.csv

Note: you will have to change the file paths in the R scripts to match your system.

Open salaries_preprocessing.R in RStudio and run it.
Follow the instructions to install the virtual machine that contains Hadoop here.
Transfer salaries.txt, mapper.py, and reducer.py to the virtual machine using the instructions here.
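Make directories in HDFS on the virtual machine using the commands below. Hadoop normally refuses to write to an output directory that already exists, so if the job later fails with an "already exists" error, remove joboutput from HDFS and rerun the job.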

hadoop fs -mkdir jobinput
hadoop fs -mkdir joboutput

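Copy salaries.txt into the jobinput directory in HDFS:

hadoop fs -put salaries.txt jobinput

Run the MapReduce job. Here hs is taken to be the shortcut on the virtual machine for launching a Hadoop Streaming job; its arguments are the mapper script, the reducer script, the input directory, and the output directory:

hs mapper.py reducer.py jobinput joboutput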
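
Before running on the cluster, it can help to rehearse the same map, sort, reduce flow locally. The sketch below is not the repository's mapper.py or reducer.py; it assumes a hypothetical input of tab-separated "group, salary" lines and a reducer that averages salaries per group, purely to illustrate the contract a Hadoop Streaming job follows: the mapper emits key/value lines, the framework sorts them by key, and the reducer aggregates each run of identical keys.

```python
from itertools import groupby

# Hypothetical preprocessed records: "<group>\t<salary>" per line.
# The real salaries.txt layout may differ.
SAMPLE_LINES = [
    "A\t100.0",
    "B\t50.0",
    "A\t300.0",
    "B\t150.0",
]


def key_of(line):
    """Return the text before the first tab, i.e. the streaming key."""
    return line.split("\t", 1)[0]


def mapper(lines):
    """Emit one 'key<TAB>value' line per well-formed input record."""
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip malformed records
        key, value = parts
        yield f"{key}\t{value}"


def reducer(sorted_lines):
    """Average the values in each run of consecutive identical keys."""
    for key, group in groupby(sorted_lines, key=key_of):
        values = [float(line.split("\t", 1)[1]) for line in group]
        yield f"{key}\t{sum(values) / len(values)}"


def run_local(lines):
    """Imitate the streaming pipeline: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines), key=key_of)))


if __name__ == "__main__":
    for out in run_local(SAMPLE_LINES):
        print(out)
```

The sort step matters: the reducer only compares neighbouring lines, so it relies on all records for a key arriving together, which is what Hadoop guarantees between the map and reduce phases.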

The job's output will be stored in the file part-00000 located in the joboutput directory. The file can be retrieved from HDFS using the following command:

hadoop fs -get joboutput/part-00000 salaries_discrimination.csv

Transfer salaries_discrimination.csv to the local machine using the instructions outlined earlier.
Open salaries_discrimination_analytics.R and run it.

Thanks to Cloudera for their free course: Intro. to Hadoop and MapReduce
