H6.4.2 - matchIT Hub for Spark - DedupeHive

The DedupeHive application demonstrates how to use the matchIT Hub for Spark ‘Row’ classes to work with a Hive data source, Rows, and Spark Datasets and RDDs.

DedupeHive.png

Configuration

The command-line argument to DedupeHive is the name of a configuration file. This is an XML file in the following format:

<?xml version="1.0" encoding="utf-8" ?>
<config>
  <dedupeHive>
    <warehouseLocation>/user/hive/warehouse</warehouseLocation>
    <!-- Define one input for single table Matching -->
    <input>
      <table>input</table>
    </input>

    <!-- Define two inputs for Overlap Matching
    <input>
    </input> -->

    <!-- Output database and table name. -->
    <output>
      <table>matchingPairsSpark</table>
    </output>
    <delimiter>\t</delimiter>

    <licenceFile>./activation.txt</licenceFile>
    <logLevel>error</logLevel>
    <groupingAlgorithm>hub</groupingAlgorithm>
    <idField>0</idField>
    <maxIterations>4</maxIterations>
  </dedupeHive>

  <hub>
    <data>
      <input table="0" columns="|UniqueRef|FullName|Company|Address1|Address2|City|State|Zip" />
      <options>...</options>
    </data>
    <matching>
      <outputs>...</outputs>
    </matching>
    <threads>0</threads>
    <advanced>
      <nationality>USA</nationality>
    </advanced>
  </hub>
</config>

The <dedupeHive> section is specific to this application.

warehouseLocation: Location of the Hive warehouse.
input: Details of an input database table.
table: The name of a database table (used inside input and output).
output: Details of an output database table.
delimiter: The delimiter used when converting records to a delimited string so that they can be passed to the underlying matching engine.

See DedupeConfiguration for a description of the configuration options: licenceFile, logLevel, groupingAlgorithm, idField, and maxIterations.
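To make the <dedupeHive> settings described above more concrete, the following is a minimal Python sketch that reads them from a trimmed copy of the sample configuration. It is not part of matchIT Hub for Spark. The function names read_dedupe_hive_settings and to_delimited_string are hypothetical, the single/overlap distinction is inferred only from the comments in the sample file, and treating the literal text \t as a tab character is an assumption about how the application interprets the delimiter.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample configuration; only <dedupeHive> settings
# documented in this article are included.
SAMPLE_CONFIG = r"""<?xml version="1.0" encoding="utf-8" ?>
<config>
  <dedupeHive>
    <warehouseLocation>/user/hive/warehouse</warehouseLocation>
    <input>
      <table>input</table>
    </input>
    <output>
      <table>matchingPairsSpark</table>
    </output>
    <delimiter>\t</delimiter>
  </dedupeHive>
</config>
"""


def read_dedupe_hive_settings(xml_text):
    """Hypothetical helper: collect the <dedupeHive> settings into a dict."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    section = root.find("dedupeHive")
    if section is None:
        raise ValueError("missing <dedupeHive> section")

    # One <input> means single-table matching, two mean overlap matching
    # (inferred from the comments in the sample configuration).
    inputs = [i.findtext("table", default="").strip()
              for i in section.findall("input")]

    # ASSUMPTION: the literal two-character text \t stands for a tab.
    raw_delimiter = section.findtext("delimiter", default="")
    delimiter = "\t" if raw_delimiter == r"\t" else raw_delimiter

    return {
        "warehouseLocation": section.findtext("warehouseLocation"),
        "inputTables": inputs,
        "outputTable": section.findtext("output/table"),
        "delimiter": delimiter,
        "mode": "overlap" if len(inputs) == 2 else "single",
    }


def to_delimited_string(record, delimiter):
    """Hypothetical helper: join one record's fields into a delimited string."""
    return delimiter.join(str(value) for value in record)


settings = read_dedupe_hive_settings(SAMPLE_CONFIG)
print(settings["inputTables"], settings["outputTable"], settings["mode"])
print(repr(to_delimited_string(["1", "Jane Doe", "Acme"], settings["delimiter"])))
```

The sketch only illustrates the shape of the configuration. The real application reads these values itself when it is given the configuration file name on the command line.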

The <hub> section configures the underlying matching engine. Refer to the matchIT Hub documentation for details. The <hub> section must contain the following sub-sections: data, matching, threads, advanced.
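Because the <hub> section must contain all four sub-sections, it can be useful to check a configuration file before submitting a job. The following is a minimal sketch of such a check in Python. It is not a matchIT Hub tool: the function name missing_hub_sections is hypothetical, and the check only looks for the presence of each element, not its contents.

```python
import xml.etree.ElementTree as ET

# Sub-sections that the article says <hub> must contain.
REQUIRED_HUB_SECTIONS = ("data", "matching", "threads", "advanced")


def missing_hub_sections(xml_text):
    """Hypothetical helper: list required <hub> sub-sections that are absent."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    hub = root.find("hub")
    if hub is None:
        # No <hub> element at all, so every required sub-section is missing.
        return list(REQUIRED_HUB_SECTIONS)
    return [name for name in REQUIRED_HUB_SECTIONS if hub.find(name) is None]


COMPLETE = "<config><hub><data/><matching/><threads>0</threads><advanced/></hub></config>"
INCOMPLETE = "<config><hub><data/><threads>0</threads></hub></config>"

print(missing_hub_sections(COMPLETE))    # prints []
print(missing_hub_sections(INCOMPLETE))  # prints ['matching', 'advanced']
```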

Running the sample

The DedupeHive sample cannot be run out of the box like the DedupeTextFile sample because it requires a Hive database as its source. Nevertheless, a sampleconfig.xml and a run.sh are provided to demonstrate how to set it up and run it.
