Support DESK

Follow

S2.6.2 - SSIS Tutorial - Understanding Matching - Why did this score 97 on Business Level

Previous Article matchIT SQL Index Next Article 

Why did this score 97 on Business

Example 1:

Record A
Company: KEEN MAILING & MARKETING
Add1: MAILROOM
Add2: LOWERCROFT ROAD
Add3: BURY
Add4: LANCASHIRE
Postcode:  BL8 3AP

Example 2: 

Record B
Company: Keen Mailing
Add1: Lowercroft Road
Add2:  
Add3: Bury St Edmunds
Add4: Lancashire
Postcode: BL8 3PA

And we’re trying to match them on name or company.

The first question we should be asking is actually, are they even being compared in the first place? Because remember our matching works as a two step process, we line it up on the keys, then score it.

So we have 3 default keys

 keys

A

B

mkorgname1 + mkpostout

kym + BL8

kym + BL8

mkorgname1 + mkphoneticstreet

kym + lyykryft

kym + lyykryft

mkpostout + mkpostin

BL8 + 3AP

BL8 + 3PA

Because one of the records have the same phonetic first word of the company name and the same postout, we were able to identify the potential match on the 1st key.

The 2nd key would have also found this as a potential match because they also have the same phentic street value. Remember though, because it was found on the 1st key wasn't compared by the 2nd key.

But because we had the same postout but a different postin, this wouldn't have been found on the third key

So we know its being compared, often clients may wonder why something isn’t scoring, but the actual issue is that it wasn’t being compared in the first place, so that should be the first thing you check.

So now that we know its being compared, let's find out what it's going to score.

By default, the score is a cumulative score, we don’t work on percentages like most others, we try to look at it more like a human would, instead of a machine.

There’s three main components to the score

Company
Address
Postcode

1) Organization

With the Company, we want to look at the mknormalizedorganization, as mentioned previously, we look at it left to right, generally limiting ourselves to the first three significant words

So we have

Company A : KEEN MAILING & MARKETING      
Company B : Keen Mailing


which normalise to

KEEN,MAILING,

And
KEEN,MAILING,

There’s a matrix xml that we have with many predetermined decisions, so when we look at this from left to right we see that this is a sure match

When acronym matching, we’ll scan through the matrix and realise there wasn’t a match, and do a final check and see that, for example, PAC and Pamer Air Charters line up, and default it to a likely score.

If we look at the matrix stored by default in 

C:\matchIT SQL\config\matchingMatrices\businessLevel\organizationmatchingmatrix.xml

If we had something like Palmr Air and Palmer Air Charters, 

If we navigate through it, we see the pattern the match follows, and the associated score. So in this case, the likely score is 40 points.  Similar to the score level of the acronym match

Whereas  if we had a Palmer Air, and Palmer Air charters  it would  be  Equal, Equal, one empty, and score  sure, which is 60 points

Whereas if the name were transposed  like Air Charters by Palmer, and  Palmer Air Charters, it would only score possible, which is 25.

Some advanced clients will replace the sure/likely/possible entries in the matrix with decimal values, to get more granularity in their results

2) Address

When we score on the address, we look at the address lines as a whole

Record A:
Add1: MAILROOM
Add2: LOWERCROFT ROAD
Add3: BURY
Add3: LANCASHIRE

Record B:
Add1: Lowercroft Road
Add2:
Add3: Bury St Edmunds
Add4: Lancashire

We use our own proprietary algorithm that looks across these columns as a whole.

In this case, because there is no premise, they will be limited to a maximum likely weight which is 22. The thoroughfare is correctly identified in both addresses but because of the Mailroom element and St Edmunds missing from the town in the first record, the score is reduced to 17. If less than 50% of it was right, we’d end up just throwing it out and not scoring at all.

Address scores range from 15-30 by default in the UK

Or 20-40 by default in the US

 

3) Postcode

In this case both records had a postcode but the letters in the second records postin were transposed. This mean is will have been given a likely weight which is 20.

 BL8 3AP matching

 BL8 3PA

You can access these weights from findmatches or find overlap tasks, each matching level has its own set of weights

 

Sure:

A sure score is when you have two records with the same postcode

 BL8 3AP  matching

 BL8 3AP =  sure   = 30

Likely:

UB7 7PQ matching

UR7 7PQ

Where only postin is the same   = likely  = 20

Possible: 

 BL8 3AP matching

 BL8 3PA = possible  = 15

 

So company names are based on the matrix, address is based on an algorithm that looks at the address lines as a whole, and the postcode has its own separate rules.

Once we go through all 3 we add up the score

Name = 60
Address = 17
Postcode = 15

Total =  97

To get more insight into why your matches scored what they did, ensure you’re breaking out the component scores, its an option in your findmatches/findoverlap task when you’re showing advanced options.

 

Previous Article matchIT SQL Index Next Article
Was this article helpful?
0 out of 0 found this helpful

0 Comments

Please sign in to leave a comment.