Talend: Schema compatibility check
Most of the time when talking about Talend jobs, people think of standard ETL (Extract, Transform, Load). But in some cases you need to check the incoming data before loading it into the target, rather than just transforming it. We refer to this process as E-DQ-L (Extract, Data Quality, Load).
One of the things that you might want to check before loading is schema compatibility. For example: you expect to get a String that’s 5 characters long. If you, for any reason, receive a String that is longer than 5 characters, it will generate an error. Or perhaps you expect a percentage (as a BigDecimal, like 0.19), but you receive it as a String (“19%”). This will result in a failing job with an error saying “Type mismatch: cannot convert from dataType to otherDataType”.
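To make the two mismatches above concrete, here is a minimal sketch in plain Java (the language Talend jobs are generated in). The class and method names are made up for illustration and are not part of Talend; the sketch only shows why a 6-character value does not fit a length-5 column and why “19%” cannot become a BigDecimal.

```java
import java.math.BigDecimal;

class SchemaMismatchDemo {

    // True when the value fits a String column with the given maximum length.
    static boolean fitsLength(String value, int maxLength) {
        return value != null && value.length() <= maxLength;
    }

    // True when the text can be parsed as a BigDecimal such as "0.19".
    static boolean isBigDecimal(String value) {
        if (value == null) {
            return false;
        }
        try {
            new BigDecimal(value.trim());
            return true;
        } catch (NumberFormatException e) {
            // "19%" ends up here: the percent sign is not part of a valid number.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsLength("ABCDE", 5));   // true
        System.out.println(fitsLength("ABCDEF", 5));  // false
        System.out.println(isBigDecimal("0.19"));     // true
        System.out.println(isBigDecimal("19%"));      // false
    }
}
```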
Before I continue this blog I would like to emphasize that all the solutions below are possible with the Data Integration version of Talend, except for the last one. The last option requires a Talend Data Quality license.
Let’s create an example case: we want to extract data on a regular basis from a third-party source which we cannot fully trust in terms of schema settings. We know how many columns to expect and we have a rough idea of what they contain, but we do not trust the source to always deliver compatible data. We want to load the records that are valid and we want to separately store the ‘corrupt’ data for logging purposes. I’ve gathered several solutions for this problem:
- Use rejected flow on an input-component
One thing you can do is reject the records as soon as you import them. Disable “Die on error” on the Basic settings tab of your input-component, then right-click it and select “Row-Reject”. The rows will be rejected based on the schema of the file. In the example below we defined the phone number as an Integer, and as you can see 1 record is being rejected. This is because the phone number contains non-numeric characters and therefore cannot be read as an Integer. If you do not disable the “Die on error”-option, this component will make the job fail.
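Conceptually, the reject link splits the input into two flows. The sketch below imitates that idea in plain Java; it is not the code Talend generates, and the `name;phone` layout, the class name and the way the reason is attached are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InputRejectSketch {

    // Hypothetical holder for the two output flows of an input component.
    static class Result {
        final List<String[]> accepted = new ArrayList<>();
        final List<String> rejected = new ArrayList<>();
    }

    // Splits "name;phone" lines: rows whose phone fits an Integer column go to
    // the main flow, all others go to the reject flow with the reason attached.
    static Result read(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            String[] fields = line.split(";", -1);
            try {
                if (fields.length != 2) {
                    throw new IllegalArgumentException(
                            "expected 2 columns, got " + fields.length);
                }
                Integer.parseInt(fields[1].trim());
                result.accepted.add(fields);
            } catch (IllegalArgumentException e) {
                // NumberFormatException is a subclass of IllegalArgumentException,
                // so both the column-count and the parse failure land here.
                result.rejected.add(line + " -> " + e.getMessage());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = read(Arrays.asList("Anna;5551234", "Bob;555-12A4"));
        System.out.println("accepted: " + r.accepted.size());
        System.out.println("rejected: " + r.rejected);
    }
}
```

The point of the design is that a bad row is caught and routed aside instead of letting the exception stop the whole run, which is the difference between using the reject link and leaving “Die on error” enabled.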
- In case of the target being a database: use rejected links
You can also choose to directly input the data into your database, but to reject any rows that would create an error. You can then create a separate flow to determine what to do with these rejected records.
In your database output component (for example tOracleOutput) change the following:
- Basic settings: Uncheck “Die on error”
- Advanced settings: Uncheck “Use batch size”
Now, right-click on your component and select “Row-Reject” and connect it to an output-component. The output you’ll receive will be the rejected rows and what error would have been generated if you tried inserting them, as you can see in the picture below.
- Use a tFilterRow-component
You can make the data go through a filter-component before inserting it into your target. You can (manually) decide what’s allowed to go through. This can be useful when your destination is not a database, in which case the previous option (rejected links on the database output) is not available.
A tFilterRow-component also has the possibility to output the rejected rows, including the reason why they got rejected. You can enable this by right-clicking on your filter and selecting “Row-Reject”. An example of rejected rows by the filter:
Note – You can also use self-defined routines in the tFilterRow-component by checking “Use advanced mode”. This can be useful when you want to check whether a conversion is possible. For example: you could define a routine called “isInteger” that returns true if the conversion is valid and false if it’s impossible.
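A minimal sketch of such a routine could look like the one below. The class name `SchemaChecks` is a made-up example; in Talend Studio a routine is normally a public class in the `routines` package, which is left out here so the snippet compiles on its own.

```java
// Hypothetical routine class. In Talend Studio this would be declared as
// "public class SchemaChecks" inside "package routines;".
class SchemaChecks {

    // Returns true if the text can be converted to an Integer, false otherwise.
    public static boolean isInteger(String value) {
        if (value == null) {
            return false;
        }
        try {
            Integer.parseInt(value.trim());
            return true;
        } catch (NumberFormatException e) {
            // Letters, symbols, empty text or values outside the int range.
            return false;
        }
    }
}
```

In the advanced mode of the filter you could then use an expression along the lines of `SchemaChecks.isInteger(input_row.phone)`, assuming the incoming column is called `phone` and is defined as a String in the schema, so that the value reaches the filter unconverted.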
- Use a trustworthy service: the tSchemaComplianceCheck-component
Another way of making sure that your schema is compatible is by using the tSchemaComplianceCheck-component. Unfortunately, this component is only integrated in the Data Quality version of Talend.
It’s a very easy component to use. The only thing you have to do is connect the incoming data to the tSchemaComplianceCheck-component and then continue its flow to the destination. You can get the rejected rows the same way as before (by right-clicking on it and then selecting “Row-Reject”).
The rejected rows and their error message look like this:
That’s it for now. There are probably many other ways of checking schema compatibility. Feel free to comment if you know any. Thank you for reading!