Hello Everybody,
We have been struggling with data quality issues in our group. We are primarily a database development group, responsible for processing large volumes of data, meeting reporting requirements, and supplying the necessary data to our OLTP systems. We process data from several internal and external sources:
1) Flat file data processing through ETL
2) Connections to internal database systems via OLE DB
3) Live Streaming/APIs from upstream systems
Our database design is built to handle the source data, but as we all know, source system data types sometimes change without our knowledge, or bad data comes in, and that causes batch/streaming processing to fail. We have already implemented logic so that the entire job does not fail when only a few rows in a batch, API call, or stream are bad.
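For context, the row-level handling we have in place is roughly along the lines of this simplified Python sketch (validate_row, the column names, and the quarantine list are placeholders to illustrate the idea, not our actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class BatchResult:
    """Outcome of one batch: counts plus the rows that were set aside."""
    processed: int = 0
    rejected: int = 0
    quarantine: list = field(default_factory=list)

def validate_row(row: dict) -> bool:
    """Placeholder check: required id present and amount is a non-negative number."""
    try:
        return row["id"] is not None and float(row["amount"]) >= 0
    except (KeyError, TypeError, ValueError):
        return False

def process_batch(rows: list) -> BatchResult:
    """Load the good rows and quarantine the bad ones instead of failing the job."""
    result = BatchResult()
    for row in rows:
        if validate_row(row):
            # load_row(row)  # placeholder for the real insert/upsert into the target
            result.processed += 1
        else:
            result.quarantine.append(row)  # keep bad rows for later review/reprocessing
            result.rejected += 1
    return result

# Example: two good rows and one with a bad amount
print(process_batch([{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "abc"}, {"id": 3, "amount": 7}]))
```

Each run then gives us a processed/rejected count per source, which is what I would like to feed into the monitoring described next.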
I am looking forward to implementing a robust framework/best practices/methodology for handling data quality in our processing, and then building some kind of dashboard for data quality monitoring and reporting.
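What I picture for the dashboard is each job writing a small summary row per run into a metrics table that the dashboard queries for trends and alerting. A hypothetical sketch of that logging step (the dq_batch_metrics table, its columns, and the %s placeholder style are assumptions, not an existing schema):

```python
from datetime import datetime, timezone

def record_batch_metrics(conn, source_name: str, rows_processed: int, rows_rejected: int) -> None:
    """Insert one data-quality summary row per batch run.

    conn is any DB-API connection; dq_batch_metrics is a hypothetical table
    (source_name, run_ts, rows_processed, rows_rejected) for the dashboard to read.
    The %s placeholder style assumes a psycopg2-style driver; pyodbc would use ?.
    """
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO dq_batch_metrics (source_name, run_ts, rows_processed, rows_rejected) "
        "VALUES (%s, %s, %s, %s)",
        (source_name, datetime.now(timezone.utc), rows_processed, rows_rejected),
    )
    conn.commit()
    cur.close()
```

The dashboard would then chart rows_rejected over time per source and flag runs where the rejection rate crosses a threshold.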
Please let us know if you have implemented any such framework or have design ideas to share.
Thanks in advance.