|Cortex Index||Next Article|
Cortex offers a drag and drop workflow environment for deduplication. It does this through a range of tools that make accessing, cleaning, and outputting data faster and easier. Data preparation tools that include Matching and Normalization functionality. This allows you to find matches in and across any combination of data sources including SQL Server, Excel, & delimited files. Note that if you would like another database connector, please let our support team know and we will introduce additional connectors according to demand.
Cortex is compatible with Microsoft Windows 2008/7/10 and Microsoft Windows Server operating systems (2008 or more recent). We strongly recommend using a fully up-to-date and patched operating system, as this will benefit Cortex in terms of robustness, stability, and security.
Cortex can run entirely in-memory. As the volume of data increases, memory requirements also increase. It is highly recommended that Cortex is used on a machine with enough memory to sufficiently process the data without requiring disk storage.
As a rough guideline:
- a machine with 8 GB of RAM should comfortably process 15 million rows;
- a machine with 16 GB of RAM should comfortably process 30 million rows;
- a machine with 32 GB of RAM should comfortably process 60 million rows;
- a machine with 48 GB of RAM should comfortably process 80 million rows.
If overlapping two sources of data, then use their summed row counts with these guidelines (for example, 100 million vs. 20 million would require 80 GB of RAM.
Note that these figures are highly dependent on factors such as:
- the average size of each row (these figures assume an average row size of 150 bytes);
- which match keys are used (refer to the Configuration Guide for details on match keys);
- the amount of duplication in the data.
Normalization: Note that when an engine is configured for normalization, a row of data added to the engine is discarded immediately after it's processed and output; it is otherwise not retained in RAM. The above RAM requirements are therefore not applicable, and memory usage is minimal.
Cortex can fall back to storing data on disk, for example if memory usage exceeds a predetermined threshold. This can significantly impact performance, but will allow for processing greater volumes of data. Should disk usage be necessary, then fast disks (such as SSDs) are highly recommended.
The first time you run Cortex, you will be prompted to enter an activation code - simply enter your code in the activation window (see below) and your install will be activated.
Run matchIT Cortex. The main window is divided into 5 main areas.
The menu bar contains various application command buttons and a progress bar.
|Used to clear the workflow canvas, open a previously saved workflow, save the current workflow, and save the current workflow with a different name.|
|Used to enter or update an activation code.|
|Application settings like, log file and log severity.|
|Opens an “About” dialog with version information.|
|Used to edit workflow canvas tools and connections.|
|Start the current workflow running and abort a running job.|
Use the Config button to determine whether messages, warnings, errors should be written to a log file and determine the path of that log file.
The console shows messages. Configuration options let you choose which severity of messages to see, from:
The toolbox contains all the tools you can drag onto the workflow canvas divided into the categories: Input, Process, and Output.
The input category has tools to load data from:
- Databases - currently, Microsoft SQL Server is supported. If you would like another database connector, please let us know by contacting our support team and we will introduce additional connectors according to demand.
- Delimited files (Tab delimited, comma delimited, etc);
The process category has tools for
- Matching – dedupe single table or overlap two tables;
- Normalization – produce normalized version of input data;
- Grouping – group previously matched pairs for multiple runs;
- Union – union two data source with the same layout;
- Select – select a subset of the input columns to pass through to downstream components;
- Sample Groups – filter the Matching groups output to produce a sample of groups for each score.
The output category has tools to output data to:
- Databases (SQL Server)
- Delimited files (Tab delimited, comma delimited, etc)
Tools are dragged from the toolbox onto the workflow canvas, where they can be connected together:
|Cortex Index||Next Article|