Support DESK

Follow

H6.3 - matchIT Hub for Spark - Class Reference

Previous Article matchIT Hub Index Next Article 

The jar HubSpark.jar is the matchIT Hub for Spark package (com.matchIT.Hub.spark) and contains the following classes. You will only need this jar if you are building your own applications, the sample apps already include it.

HubStats Used to collect statistics across the various stages and partitions of a job.
HubSchema Provides a schema for each stage of processing, based on configuration settings.
KeyGenerationString Data processing class that performs Key Generation
KeyGenerationRow Data processing class that performs Key Generation
KeyedToKeyValuesString Data processing class that converts Keyed records into {key, value} pairs.
KeyedToKeyValuesRow Data processing class that converts Keyed records into {key, value} pairs.
PairMatchingString Data processing class that performs Pair Matching.
PairMatchingRow Data processing class that performs Pair Matching.
GroupingString Data processing class that performs Grouping of matching pairs.
GroupingRow Data processing class that performs Grouping of matching pairs.

HubStats

Used to collect statistics across the various stages and partitions of a job. Create one instances of this class and pass it to all the data processing tasks. When processing is finished call HubStats::print() to display the total statistics.

HubSchema

Provides a schema for each stage of processing, based on configuration settings. Helper class that populates org.apache.spark.sql.types.StructType schema structures for each data processing classes inputs and outputs.

The constructor is passed an activation code and the same Hub settings xml used by the data processing classes.

HubSchema(String activationCode, String hubSettings)

The class’ public methods are:

StructType getInputSchema(String columns) Generates an input schema given a delimited list of input columns (columns must start with the delimiter used).
StructType getKeyGenerationOutputSchema(int table) Generates Key Generation output schema for the given table.
StructType getPairMatchingOutputSchema() Generates the Pair Matching output schema (i.e. matching pairs).
StructType getGroupingOutputSchema() Generates the Grouping output schema.

Data Processing Classes

The following classes implement Spark functions and have two versions. A “String” version for working with JavaRDDs, where String is a delimited record, and a “Row” version for working with JavaRDD/Dataset.

KeyGeneration

Appends all key values to each record using Hub in a new “Key Generation” mode.

The constructors take: a table number (0, 1, or 2), activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.

KeyGenerationString(int table,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)

KeyGenerationRow(int table,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)

KeyGenerationString

Implements PairFlatMapFunction<String, String, String> for use with JavaRDD::mapPartitions().

Example usage:

JavaRDD keyed = rows.mapPartitions(
new KeyGenerationString(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats));

KeyGenerationRow

Implements MapPartitionsFunction<Row, Row> for use with Dataset::mapPartitions().

Dataset keyed = rowsDF.mapPartitions(
new KeyGenerationRow(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats),
encoder);

Implements FlatMapFunction<Iterator, Row> for use with JavaRDD::mapPartitions().

JavaRDD keyed = rowsDF.javaRDD().mapPartitions(
new KeyGenerationRow(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats));

KeyedToKeyValues

Applied to the output of KeyGeneration, generates {key, value} pairs for each key. The output of KeyGeneration has all the key values appended in new field. This task converts that input {key, value} pairs and converts the JavaRDD into a JavaPairRDD.

KeyedToKeyValuesString

The constructor takes the delimiter used in the String records.

KeyedToKeyValuesString(String delimiter)

Implements PairFlatMapFunction<String, String, String> for use with JavaRDD::flatMapToPair().

JavaPairRDD<String, String> keys = keyed.flatMapToPair(
new KeyedToKeyValuesString(delimiter));

KeyedToKeyValuesRow

The constructor takes the StructType schema used in the Row records.

KeyedToKeyValuesRow(StructType schema)

Implements PairFlatMapFunction<Row, String, Row> for use with JavaRDD::flatMapToPair().

JavaPairRDD<String, Row> keys = keyed.javaRDD().flatMapToPair(
new KeyedToKeyValuesRow(keyGenOutputSchema));

PairMatching

Applied to the output of KeyedToKeyValues grouped by key, compares every record in each group with every other record in the group (whilst avoiding duplicate comparisons). Sends pairs of records to Hub in Pair Matching mode. Outputs matching pairs.

The constructors take: a flag to indicate if overlap matching, activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.

PairMatchingString(boolean overlap,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)

PairMatchingRow(boolean overlap,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)

PairMatchingString

Implements FlatMapFunction<Iterator<Tuple2<String, Iterable>>, String> for use with JavaPairRDD<String, Iterable>::mapPartitions().

JavaRDD pairs = clusters.mapPartitions(
new PairMatchingString(overlap,
activationCode,
hubSettings,
delimiter,
stats));

PairMatchingRow

Implements FlatMapFunction<Iterator<Tuple2<String, Iterable>>, Row> for use with JavaPairRDD<String, Iterable>::mapPartitions().

JavaRDD pairs = clusters.mapPartitions(
new PairMatchingRow(overlap,
activationCode,
hubSettings,
delimiter,
stats));

Grouping

Groups matching pairs.

The constructors take: activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.

GroupingString(String activationCode,
String hubSettings,
String delimiter,
HubStats stats)

GroupingRow(String activationCode,
String hubSettings,
String delimiter,
HubStats stats)

GroupingString

Implements FlatMapFunction<Iterator, String> for use with JavaRDD::mapPartitions().

JavaRDD groups = allPairs.mapPartitions(
new GroupingString(activationCode,
hubSettings,
delimiter,
stats));

GroupingRow

Implements FlatMapFunction<Iterator, Row> for use with JavaRDD::mapPartitions().

JavaRDD groups = allPairs.mapPartitions(
new GroupingRow(activationCode,
hubSettings,
delimiter,
stats));

Previous Article matchIT Hub Index Next Article 
Was this article helpful?
0 out of 0 found this helpful

0 Comments

Please sign in to leave a comment.