

H6.3 - matchIT Hub for Spark - Running on AWS



You will need an AWS account and the AWS CLI (Command Line Interface). For details of how to install and configure the CLI, see:

To create an EMR cluster and submit Spark jobs, you will need to create and download a Key Pair, create the default roles, and configure inbound rules.

Key pair

You will need a Key Pair when submitting jobs. Create and download a named Key Pair from the EC2 console. For more details see:
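
As an alternative to the EC2 console, a Key Pair can also be created with the AWS CLI. The following is a minimal sketch and not part of the matchIT Hub distribution; the function name create_key_pair is hypothetical, and the key name is whatever you choose.

```shell
# Sketch: create a named EC2 key pair and save the private key locally.
# Usage: create_key_pair <key_name>   (writes <key_name>.pem in the current folder)
# create_key_pair is a hypothetical helper name, not a product script.
create_key_pair() {
    key_name="$1"
    # KeyMaterial is the private key. AWS does not keep a copy, so save it now.
    aws ec2 create-key-pair \
        --key-name "$key_name" \
        --query 'KeyMaterial' \
        --output text > "${key_name}.pem" || return 1
    # Make the file readable by its owner only, as ssh expects.
    chmod 400 "${key_name}.pem"
}
```

For example, after defining the function, run: create_key_pair my-key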

Default roles

To create clusters in EMR you need a set of default roles. Use the following AWS CLI command to create them:

$ aws emr create-default-roles
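
To confirm that the command worked, you can look the roles up in IAM. The sketch below assumes the two role names that aws emr create-default-roles normally creates (EMR_DefaultRole and EMR_EC2_DefaultRole); the function name check_default_roles is hypothetical.

```shell
# Sketch: report whether the default EMR roles exist in the account.
# check_default_roles is a hypothetical helper name.
check_default_roles() {
    for role in EMR_DefaultRole EMR_EC2_DefaultRole; do
        # get-role exits with a non-zero status if the role does not exist.
        if aws iam get-role --role-name "$role" > /dev/null 2>&1; then
            echo "$role: found"
        else
            echo "$role: missing"
        fi
    done
}
```

After defining the function, run: check_default_roles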

Inbound rules

To use the Spark UI, add the CIDR range of the IP addresses you will connect from to the master Security Group (SG). In the EC2 console, click Security Groups in the left-hand menu and select the “ElasticMapReduce-master” security group. On the Inbound tab, click Edit and add a rule that allows access from your IP address.
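
The same rule can be added from the command line. This is a sketch under assumptions: the function name allow_master_access is hypothetical, the article does not state which port the Spark UI uses, so the port is left as an argument, and the lookup assumes only one security group in the account and region is named ElasticMapReduce-master.

```shell
# Sketch: allow one IPv4 address to reach a TCP port on the EMR master node.
# Usage: allow_master_access <ip_address> <port>
# allow_master_access is a hypothetical helper name.
allow_master_access() {
    ip="$1"
    port="$2"
    # Look up the ID of the security group named in the article.
    group_id=$(aws ec2 describe-security-groups \
        --filters Name=group-name,Values=ElasticMapReduce-master \
        --query 'SecurityGroups[0].GroupId' \
        --output text) || return 1
    # A /32 CIDR range covers exactly one IPv4 address.
    aws ec2 authorize-security-group-ingress \
        --group-id "$group_id" \
        --protocol tcp \
        --port "$port" \
        --cidr "${ip}/32"
}
```

For example: allow_master_access 203.0.113.7 18080 (203.0.113.7 is a documentation address, and 18080 is the usual Spark history server port on EMR; check which port applies to your cluster).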

Deploying matchIT Hub for Spark to AWS

Create an S3 bucket called matchithub-spark, then create a folder called “log” inside it.
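
Both steps can be done with the AWS CLI instead of the console. The sketch below uses a hypothetical function name, create_hub_bucket. Note that S3 bucket names are globally unique, so matchithub-spark may already be taken; if you pass a different name, adjust the s3:// paths in the rest of this article to match.

```shell
# Sketch: create the bucket and an empty "log" folder inside it.
# Usage: create_hub_bucket [bucket_name]   (defaults to matchithub-spark)
# create_hub_bucket is a hypothetical helper name.
create_hub_bucket() {
    bucket="${1:-matchithub-spark}"
    aws s3 mb "s3://${bucket}" || return 1
    # S3 has no real folders. A zero-byte object whose key ends in "/"
    # is shown as a folder in the console.
    aws s3api put-object --bucket "$bucket" --key "log/"
}
```

After defining the function, run: create_hub_bucket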

Copy your activation code to a file called activation.txt in the root folder of the bucket:

$ aws s3 cp activation.txt s3://matchithub-spark/

From your local matchIT Hub for Spark installation folder, copy the contents of the lib folder to the bucket:

$ aws s3 cp matchithub-spark/lib s3://matchithub-spark/lib/ --recursive

Deploying the sample job

Copy the pre-built DedupeTextFile-jar-with-dependencies.jar, example1.txt, and sampleconfig.xml files from the DedupeTextFile sample app folder:

$ aws s3 cp matchithub-spark/samples/DedupeTextFile/DedupeTextFile-jar-with-dependencies.jar s3://matchithub-spark/samples/DedupeTextFile/
$ aws s3 cp matchithub-spark/samples/DedupeTextFile/example1.txt s3://matchithub-spark/samples/DedupeTextFile/
$ aws s3 cp matchithub-spark/samples/DedupeTextFile/sampleconfig.xml s3://matchithub-spark/samples/DedupeTextFile/
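
Before starting a cluster, it can be worth checking that the uploads arrived. The sketch below checks only the files named in this article (the contents of the lib folder are not listed here, so they are not checked); the function name check_uploads is hypothetical.

```shell
# Sketch: report which of the files uploaded above are present in the bucket.
# Usage: check_uploads [bucket_name]   (defaults to matchithub-spark)
# check_uploads is a hypothetical helper name.
check_uploads() {
    bucket="${1:-matchithub-spark}"
    for key in \
        activation.txt \
        samples/DedupeTextFile/DedupeTextFile-jar-with-dependencies.jar \
        samples/DedupeTextFile/example1.txt \
        samples/DedupeTextFile/sampleconfig.xml
    do
        # head-object exits with a non-zero status if the key does not exist.
        if aws s3api head-object --bucket "$bucket" --key "$key" > /dev/null 2>&1; then
            echo "ok       $key"
        else
            echo "MISSING  $key"
        fi
    done
}
```

After defining the function, run: check_uploads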

Running the sample job

In your matchIT Hub for Spark installation folder there is a sub-folder called ‘emr’. In there, you’ll find a script called

The script uses the aws emr create-cluster command to spin up a cluster, submit and run a job (step), and auto-terminate the cluster. Edit this file to change things like the instance type and availability zone.

Usage: <key_name> <job_name> <steps_file>


  • <key_name> is the name of your Key Pair file.
  • <job_name> is an arbitrary name for the job.
  • <steps_file> is a JSON file containing the steps to run (see below).
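
For orientation, a create-cluster call with the behaviour described above (start a cluster, run the steps, auto-terminate) might look roughly like the sketch below. This is not the script shipped in the emr folder. The function name launch_cluster, the EMR_* environment variables, and all default values (release label, availability zone, instance type, instance count) are assumptions made for illustration; use the values from your own copy of the script.

```shell
# Sketch of an "aws emr create-cluster" call that runs steps and auto-terminates.
# Usage: launch_cluster <key_name> <job_name> <steps_file>
# launch_cluster and the EMR_* variables are hypothetical, and the defaults
# below are illustrative assumptions, not values taken from the product script.
launch_cluster() {
    key_name="$1"
    job_name="$2"
    steps_file="$3"
    aws emr create-cluster \
        --name "$job_name" \
        --release-label "${EMR_RELEASE:-emr-6.15.0}" \
        --applications Name=Spark \
        --use-default-roles \
        --ec2-attributes "KeyName=${key_name},AvailabilityZone=${EMR_ZONE:-us-east-1a}" \
        --instance-type "${EMR_INSTANCE_TYPE:-m5.xlarge}" \
        --instance-count "${EMR_INSTANCE_COUNT:-3}" \
        --log-uri s3://matchithub-spark/log/ \
        --steps "file://${steps_file}" \
        --auto-terminate
}
```

The three arguments correspond to the usage line above. The --use-default-roles option relies on the roles created earlier with aws emr create-default-roles, and --log-uri points at the “log” folder in the bucket.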


Sample steps file. This contains the spark-submit command for running the DedupeTextFile application with the example1 data.
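
The general shape of an EMR steps file that runs spark-submit can be sketched as below. This is an illustration, not the file shipped with the product, which may differ. It assumes the command-runner.jar form of an EMR step; the function name write_steps_file is hypothetical, and the main class and application arguments are deliberately left as parameters because this article does not state them.

```shell
# Sketch: write an EMR steps file containing one spark-submit step.
# Usage: write_steps_file <out_file> <main_class> <jar_uri> [app_arg ...]
# write_steps_file is a hypothetical helper name. Arguments must not contain
# double quotes or backslashes, because no JSON escaping is done.
write_steps_file() {
    out_file="$1"
    main_class="$2"
    jar_uri="$3"
    shift 3
    {
        printf '[\n  {\n'
        printf '    "Type": "CUSTOM_JAR",\n'
        printf '    "Name": "DedupeTextFile",\n'
        printf '    "Jar": "command-runner.jar",\n'
        printf '    "ActionOnFailure": "TERMINATE_CLUSTER",\n'
        printf '    "Args": ["spark-submit", "--class", "%s", "%s"' "$main_class" "$jar_uri"
        # Append each remaining argument as one more JSON string.
        for arg in "$@"; do
            printf ', "%s"' "$arg"
        done
        printf ']\n  }\n]\n'
    } > "$out_file"
}
```

The jar would normally be given as an S3 path, for example s3://matchithub-spark/samples/DedupeTextFile/DedupeTextFile-jar-with-dependencies.jar. Take the main class and the application arguments from the steps file supplied with your installation.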

To submit the sample job, run: sample sample-job.json

In the Amazon EMR console you should see a cluster called “sample” starting up. Once it completes, the output will be written to s3://matchithub-spark/samples/DedupeTextFile/outputPairs.
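
If you would rather wait for the job from the command line and then fetch the results, something like the following sketch could be used. The function name fetch_output is hypothetical; the cluster ID is the value printed by aws emr create-cluster (it is also shown in the EMR console).

```shell
# Sketch: wait for a cluster to terminate, then download the job output.
# Usage: fetch_output <cluster_id> [local_dir]   (local_dir defaults to outputPairs)
# fetch_output is a hypothetical helper name.
fetch_output() {
    cluster_id="$1"
    local_dir="${2:-outputPairs}"
    # Polls until the cluster reaches the TERMINATED state.
    aws emr wait cluster-terminated --cluster-id "$cluster_id" || return 1
    aws s3 cp s3://matchithub-spark/samples/DedupeTextFile/outputPairs \
        "$local_dir" --recursive
}
```

A terminated cluster does not guarantee that the step succeeded, so if the output folder is empty, check the logs under s3://matchithub-spark/log/.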
