Tools → Avro Extractor Tool

About the Avro Extractor Tool

The Apache Kafka connector requires an .avro file for a given Kafka topic. The Avro file describes how to serialize and flatten a semi-structured, schema-less JSON message into structured data consisting of one or more related physical schema tables within Incorta.

The Avro Extractor Tool is a JAR file, avroExtractor.jar, that you can use to generate an Avro file from a single JSON message sample.

The resulting .avro file defines how a specific JSON object maps to one or more physical schema tables, and how specific key-value pairs map to table columns. The Avro file itself is in JSON format. The file contains Avro annotations which define the mapping of JSON key-value pairs to tables, columns, data types, and keys. Annotations for keys describe the relationships between tables.

Deployment

The default location of the Avro Extractor Tool is:

/home/incorta/IncortaAnalytics/IncortaNode/bin/avroExtractor.jar

The avroExtractor.jar file requires a dependent JAR file, javax.json-1.0.4.jar, to be in the same parent directory.

Here are the steps to copy javax.json-1.0.4.jar to the same parent directory as the avroExtractor.jar:

  • On the Incorta Node where you will run the Avro Extractor Tool, as the incorta user, copy the javax.json-1.0.4.jar file to the same directory as avroExtractor.jar, ../IncortaAnalytics/IncortaNode/bin/.
sudo su incorta
cp /home/incorta/IncortaAnalytics/IncortaNode/runtime/lib/javax.json-1.0.4.jar /home/incorta/IncortaAnalytics/IncortaNode/bin/
  • Confirm that you copied the file. The output of the following command should include javax.json-1.0.4.jar:
ls -l /home/incorta/IncortaAnalytics/IncortaNode/bin/ | grep javax
Note

If you have Java installed on your local machine, you can run the Avro Extractor Tool as an independent JAR file on your machine. Simply copy both the avroExtractor.jar and javax.json-1.0.4.jar files to the same parent directory.

How to use the Avro Extractor Tool

The tool requires a JSON file that contains a single JSON message from a Kafka topic. After creating the sample file, you can run the Avro Extractor Tool as deployed on the Incorta Node. You can then copy the .avro file to your local desktop. Creating an external data source that uses the Apache Kafka connector for the related Kafka topic requires this .avro file.

Create a sample JSON file

In order to generate a .avro file, the Avro Extractor Tool requires a .json file that contains a single JSON message. Here is an example of a JSON message for the FruitSales topic in Kafka:

{ "messageId": 143, "entity": "tblFruitSales", "data": { "id": 143, "dt": "2019-09-30 19:28:23", "item": "jackfruit", "quantity": 35, "price": 5.72 } }

The JSON message itself must contain:

  • a message type key-value pair, such as
{ ... "entity": "tblData" ...}
  • a data key-value pair, where the value of the data key contains the actual data you want to store in Incorta, for example:
{... "data": { "name1": "value1" } }

The Avro Extractor Tool will ignore the data key, and you will not see this key in the resulting .avro file.
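The required message structure is easy to check before running the tool. Here is a minimal sketch in Python; the helper name and the default entity key are our own assumptions, matching the example above:

```python
import json

def check_sample_message(path, message_type_key="entity"):
    """Verify that a sample JSON file has the keys the Avro Extractor Tool expects:
    a message type key (here assumed to be 'entity') and an object-valued 'data' key."""
    with open(path) as f:
        message = json.load(f)  # the file must contain a single JSON object
    if message_type_key not in message:
        raise ValueError(f"missing message type key: {message_type_key!r}")
    if "data" not in message or not isinstance(message["data"], dict):
        raise ValueError("missing or non-object 'data' key")
    # Return the root table name the tool would derive from this message.
    return message[message_type_key]
```

For the FruitSales sample message above, this returns tblFruitSales, the name of the root table.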

To create a sample JSON file for the Avro Extractor Tool, follow these steps:

  • Switch to the incorta user on the host where you will run the Avro Extractor Tool.
sudo su incorta
  • In the /tmp directory, create the mySampleMessage.json file using vim.
cd /tmp
vim mySampleMessage.json
  • Using the i keystroke, switch to Insert mode, and copy your message to the file. Here is an example:
{ "messageId": 143, "entity": "tblFruitSales", "data": { "id": 143, "dt": "2019-09-30 19:28:23", "item": "jackfruit", "quantity": 35, "price": 5.72 } }
  • Using the ESC keystroke, switch back to command mode.
  • With the keystroke :wq!, save the file and quit.
  • Verify the message contents in the console output.
cat mySampleMessage.json

Generate an Avro file for the sample JSON message

After deploying the Avro Extractor Tool and creating the sample JSON message file, you can run the Avro Extractor Tool. By default, the tool will generate an .avro file in the same directory as the input .json file. The Avro Extractor has several available input parameters that you can specify.

The resulting file in the following example is mySampleMessage.json.avro. Here are the steps to create the .avro file as the incorta user:

  • Change directories to the location of the avroExtractor.jar file.
cd /home/incorta/IncortaAnalytics/IncortaNode/bin/
  • Run the Avro Extractor Tool, specifying the message type key and the input file.
java -jar avroExtractor.jar -messageType entity -input /tmp/mySampleMessage.json
  • The console output will show the location of the .avro file:
Extracting the schema for: /tmp/mySampleMessage.json
output to: /tmp/mySampleMessage.json.avro
messageType key is: entity
trimAfterDash: false
  • Read the content of the .avro file in the /tmp directory.
cat /tmp/mySampleMessage.json.avro
  • Verify that the file is similar to the following:
{ "avroSchema": [
{ "type": "record", "name": "tblFruitSales", "fields": [
{ "name": "id", "type": "long"},
{ "name": "dt", "type": "string"},
{ "name": "item", "type": "string"},
{ "name": "quantity", "type": "long"},
{ "name": "price", "type": "double"}
]}
], "annotations": []}
  • Secure copy the .avro file to your desktop. Here is an example of using scp to copy a file from the remote host to the desktop:
INCORTA_NODE_HOST=100.101.102.103
scp -i ~/.ssh/host_pemkey.pem incorta@${INCORTA_NODE_HOST}:/tmp/mySampleMessage.json.avro ~/Desktop/mySampleMessage.json.avro
  • Edit the .avro file and add Avro annotations that describe keys and relationships between tables.
  • Save your changes.
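Because the resulting .avro file is itself JSON, you can inspect it programmatically before and after editing. Here is a minimal sketch that summarizes the tables and columns in a generated file, assuming the output format shown above (the helper name is our own):

```python
import json

def summarize_avro(path):
    """Return {table_name: [column names]} from an Avro Extractor output file."""
    with open(path) as f:
        schema = json.load(f)
    # Each entry in avroSchema with type "record" describes one physical schema table.
    return {
        record["name"]: [field["name"] for field in record["fields"]]
        for record in schema["avroSchema"]
        if record.get("type") == "record"
    }
```

For the mySampleMessage.json.avro file above, this yields one table, tblFruitSales, with the columns id, dt, item, quantity, and price.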

Avro Extractor Tool input parameters

Here are the various input parameters available for the Avro Extractor Tool:

-input
Specify the sample JSON message file name and path, if not in the current directory.

-output
Optional. Specify this parameter to designate the name and path of the output file, if not the current directory. If not specified, the tool outputs the .avro file to the same directory as the input file, using the input file name with the .avro file extension appended.

-messageType
Specify the key that holds the root table name for the JSON messages. If the key name is entity, as in the example {"entity" : "Employee"}, then specify -messageType entity. Incorta will create an Employee table as the root table.

-trimAfterDash
Optional. Specify as true or false. If the value of the -messageType key contains a dash, Incorta removes the dash and everything after it. For example, if the key name is entity, as in the example {"entity" : "Employee-USA"}, use -trimAfterDash true so that Incorta trims the value Employee-USA to Employee and creates an Employee table as the root table. Incorta does not allow a dash (-) in a physical schema table name.
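The trimming rule can be sketched in one line. This is an illustration of the documented behavior of -trimAfterDash true, not the tool's actual implementation:

```python
def trim_after_dash(message_type_value):
    """Drop the first dash and everything after it, as -trimAfterDash true does."""
    return message_type_value.split("-", 1)[0]
```

For example, the value Employee-USA trims to Employee, which Incorta then uses as the root table name.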

Avro Annotations

The Avro Extractor Tool discerns the data types for key-value pairs within the JSON sample. In many cases, the Avro Extractor Tool must flatten nested JSON objects into tables.

In order to describe relationships between tables, you must edit the .avro file and add Avro annotations. You can use various annotations to instruct Incorta to ignore certain columns, define columns as keys, and create both child and parent tables.

All annotations start with a path attribute that describes a field or a record using dots to form a fully qualified name. Here is an example of defining a column as a key in the mySampleMessage.json.avro file:

{ "avroSchema": [
{ "type": "record", "name": "tblFruitSales", "fields": [
{ "name": "id", "type": "long"},
{ "name": "dt", "type": "string"},
{ "name": "item", "type": "string"},
{ "name": "quantity", "type": "long"},
{ "name": "price", "type": "double"}
]}
], "annotations": [
{ "path": "tblFruitSales.id", "isKey": "true"}
]}
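Since annotations are plain JSON objects appended to the annotations array, you can also add them with a short script rather than a text editor. Here is a minimal sketch (the helper name is our own):

```python
import json

def add_key_annotation(avro_path, column_path):
    """Append an isKey annotation for column_path (e.g. 'tblFruitSales.id')
    to the annotations array of an Avro Extractor output file."""
    with open(avro_path) as f:
        schema = json.load(f)
    schema.setdefault("annotations", []).append(
        {"path": column_path, "isKey": "true"}
    )
    with open(avro_path, "w") as f:
        json.dump(schema, f, indent=2)
```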

Here are the supported Avro Annotations:

isKey (Primary Key)
Sets the given field or column to be part of the primary key of its parent table. This flag is set on the field level.
Example:
{ "path": "person.id", "isKey": "true" }

isIgnored (Ignored)
Any field marked as "isIgnored": "true" will not appear in schema discovery.
Example:
{ "path": "person.corrections", "isIgnored": "true" }

isTable (Set as table)
By default, Incorta flattens the fields of a nested child JSON object into its parent. To make a nested child a separate table, set the isTable flag to true.
Example:
{ "path": "person.qualifications", "isTable": "true" }

isParent (Parent table)
A nested child object in JSON is by default a child table in Incorta. If the table is a parent table, set the isParent flag to true. Note that isParent must be used in conjunction with "isTable": "true".
Example:
{ "path": "person.organization", "isTable": "true", "isParent": "true" }

isPersistent (Persistent)
If the data for a nested JSON object will actually arrive in another Kafka message, the nested object does not need to be saved as part of the current message. In this case, set isPersistent to false. Note that isPersistent must be used in conjunction with "isTable": "true" and "isParent": "true".
Example:
{ "path": "person.organization", "isTable": "true", "isParent": "true", "isPersistent": "false" }

isMap (Map)
If a nested JSON object maps its keys to a set of child records, use "isMap": "true" to define a child table. Note that isMap cannot be used in conjunction with isTable.
Example:
{ "path": "person.addresses", "isMap": "true" }

<KEY> (Map Key)
In the "path", when referencing fields or records that are children of a map, the path must include the variable element <KEY>.
Example:
{ "path": "person.addresses.<KEY>.typeName", "isKey": "true" }

tableName (Table Name)
An alias table name can be added to any table, map, or array.
Example:
{ "path": "country.lastModification", "isTable": "true", "tableName": "coun_lsat_mod" }

tableName (Array Map)
If a map is actually a set of fields inside a record, the annotation must specify the field names and the corresponding table name. The list of field names inside the "path" must be comma-separated. Note that this use of tableName must be in conjunction with "isMap": "true".
Example:
{ "path": "person.addresses.<KEY>.local1,local2", "isMap": "true", "tableName": "person.addresses.LocalAddresses" }

isOneToOne (One To One)
While annotating a table as "isTable": "true", you can also annotate the table as a one-to-one table. During schema discovery, Incorta will only mark key columns that are defined in and inherited from the parent table.
Example:
{ "path": "person.demographic", "isTable": "true", "isOneToOne": "true" }

isEncrypted (Source encrypted)
Annotating a field as "isEncrypted": "true" means that it is encrypted at the source and must be decrypted by Incorta using a custom class.
Example:
{ "path": "person.basic.firstName", "isEncrypted": "true" }

encryptionName (Encryption name)
This annotation must accompany "isEncrypted": "true". Specify the name of the crypto class as defined in the kafka-crypto.properties file.
Example:
{ "path": "person.basic.firstName", "isEncrypted": "true", "encryptionName": "person" }
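The combination rules above can be checked mechanically before loading the .avro file into Incorta. Here is a minimal sketch of a validator based only on the constraints documented in this table (the function name is our own, and this is not an exhaustive check):

```python
def validate_annotation(annotation):
    """Check the documented combination rules for a single Avro annotation dict."""
    # isParent must be used in conjunction with isTable: true.
    if annotation.get("isParent") == "true" and annotation.get("isTable") != "true":
        raise ValueError("isParent requires \"isTable\": \"true\"")
    # isPersistent: false must be used with isTable: true and isParent: true.
    if annotation.get("isPersistent") == "false" and (
        annotation.get("isTable") != "true" or annotation.get("isParent") != "true"
    ):
        raise ValueError("isPersistent requires isTable and isParent to be true")
    # isMap cannot be used in conjunction with isTable.
    if annotation.get("isMap") == "true" and "isTable" in annotation:
        raise ValueError("isMap cannot be combined with isTable")
    # encryptionName must accompany isEncrypted: true.
    if "encryptionName" in annotation and annotation.get("isEncrypted") != "true":
        raise ValueError("encryptionName requires \"isEncrypted\": \"true\"")
    return True
```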