H6.4.1 - matchIT Hub for Spark - DedupeTextFile

The DedupeTextFile application demonstrates using the matchIT Hub for Spark ‘String’ classes to work with text files, delimited record strings, and Spark RDDs.

[Figure: DedupeTextFile.png]

Configuration

DedupeTextFile takes a single command-line argument: the name of a configuration file. This is an XML file in the following format:

<?xml version="1.0" encoding="utf-8" ?>
<config>
  <dedupeTextFile>
    <mainFile>./example1.txt</mainFile>
    <delimiter>\t</delimiter>
    <outputPath>./matching-pairs</outputPath>
    <licenceFile>./activation.txt</licenceFile>
  </dedupeTextFile>

  <hub>
    <data>
      <input table="0" columns="|UniqueRef|FullName|Company|Address1|Address2|City|State|Zip" />
      <options>...</options>
    </data>
    <matching>
      <outputs>...</outputs>
    </matching>
    <threads>0</threads>
    <advanced>
      <nationality>USA</nationality>
    </advanced>
  </hub>
</config>
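
To make the relationship between the columns attribute and a delimited record concrete, here is a minimal Python sketch. It is for illustration only and does not use the matchIT Hub API. It assumes that the leading character of the columns attribute ("|" in the example) is the separator for the column names, and the sample record is invented; the helper names parse_columns and record_to_dict are hypothetical.

```python
def parse_columns(spec):
    """Split a columns attribute such as "|UniqueRef|FullName" into names.

    Assumption: the first character of the attribute is the separator
    used between the column names.
    """
    if not spec:
        return []
    separator, rest = spec[0], spec[1:]
    return rest.split(separator)


def record_to_dict(line, delimiter, columns):
    """Pair each field of one delimited record string with its column name."""
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(columns):
        raise ValueError(
            "expected %d fields, found %d" % (len(columns), len(values))
        )
    return dict(zip(columns, values))


COLUMNS = parse_columns(
    "|UniqueRef|FullName|Company|Address1|Address2|City|State|Zip"
)

# Invented sample record, tab-delimited to match <delimiter>\t</delimiter>.
SAMPLE = "1\tJane Doe\tAcme Ltd\t1 Main St\t\tSpringfield\tIL\t62701"
RECORD = record_to_dict(SAMPLE, "\t", COLUMNS)
```

An empty field (Address2 in the sample) is kept as an empty string, so the number of fields in each record still matches the number of column names.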

The <dedupeTextFile> section is specific to this application.

mainFile: The file to deduplicate, or the main file for an overlap.
overlapFile: The file to overlap with the main file. Omit this element to perform an internal dedupe of a single file.
delimiter: The field delimiter used in the input file(s) and in the record Strings held in the RDDs.
outputPath: The folder to which the matching output is written.
licenceFile: A file containing the product activation code.
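
The settings above can be read with any XML parser. The following Python sketch is illustrative only and is not how the sample application itself loads its configuration. It reads the <dedupeTextFile> section and derives the run mode from whether overlapFile is present. Treating the two-character sequence \t as a tab is an assumption based on the example file, and the function name read_dedupe_settings and the "mode" key are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example configuration (raw string keeps \t literal).
EXAMPLE = r"""<config>
  <dedupeTextFile>
    <mainFile>./example1.txt</mainFile>
    <delimiter>\t</delimiter>
    <outputPath>./matching-pairs</outputPath>
    <licenceFile>./activation.txt</licenceFile>
  </dedupeTextFile>
</config>"""


def read_dedupe_settings(xml_text):
    """Return the <dedupeTextFile> settings as a dict, plus the implied mode."""
    root = ET.fromstring(xml_text)
    section = root.find("dedupeTextFile")
    if section is None:
        raise ValueError("missing <dedupeTextFile> section")

    def text_of(tag):
        node = section.find(tag)
        if node is None or node.text is None:
            return None
        return node.text.strip() or None

    settings = {
        tag: text_of(tag)
        for tag in ("mainFile", "overlapFile", "outputPath", "licenceFile")
    }

    # The delimiter is not stripped, because it may itself be whitespace.
    node = section.find("delimiter")
    raw = node.text if node is not None else None
    # Assumption: the escape sequence \t in the file stands for a tab.
    settings["delimiter"] = "\t" if raw == "\\t" else raw

    # No overlapFile means an internal dedupe of the main file.
    settings["mode"] = "overlap" if settings["overlapFile"] else "dedupe"
    return settings


SETTINGS = read_dedupe_settings(EXAMPLE)
```

Because the example configuration has no overlapFile element, SETTINGS["mode"] is "dedupe" here; adding an overlapFile element would switch it to "overlap".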

The <hub> section configures the underlying matching engine. Refer to the matchIT Hub documentation for details. The <hub> section must contain the following sub-sections: data, matching, threads, advanced.
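
A quick pre-flight check can catch a missing sub-section before the job is submitted. The sketch below is one possible way to do that in Python, under the assumption that each required sub-section is a direct child element of <hub>, as in the example file. The function name missing_hub_sections is hypothetical and is not part of matchIT Hub.

```python
import xml.etree.ElementTree as ET

# Sub-sections the <hub> section must contain, per the text above.
REQUIRED_HUB_SECTIONS = ("data", "matching", "threads", "advanced")


def missing_hub_sections(xml_text):
    """Return the names of required <hub> sub-sections that are absent."""
    root = ET.fromstring(xml_text)
    hub = root.find("hub")
    if hub is None:
        # Without a <hub> element, every required sub-section is missing.
        return list(REQUIRED_HUB_SECTIONS)
    return [name for name in REQUIRED_HUB_SECTIONS if hub.find(name) is None]


# Deliberately incomplete configuration: <threads> is left out.
INCOMPLETE = """<config>
  <hub>
    <data />
    <matching />
    <advanced><nationality>USA</nationality></advanced>
  </hub>
</config>"""

MISSING = missing_hub_sections(INCOMPLETE)
```

This only checks that the elements exist; it does not validate their contents, which are described in the matchIT Hub documentation.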

Running the sample

cd samples/DedupeTextFile
chmod 777 DedupeTextFile-jar-with-dependencies.jar run.sh
./run.sh
