I have a project where I am using Python scripts to collect data from various APIs (Google Analytics, Facebook, Instagram, etc.). I write the collected data to a flat file, then use SSIS to extract the data from the file, do some ETL work, and insert it into our Data Warehouse.
The issue I am having is with Unicode values: it looks like they are not being encoded/decoded correctly, and a different character ends up in the database than what was originally collected. Here's the process involved:
I encode the data value and write it to a file using the csv module:
import csv
import logging

with open('{0}{1}.txt'.format(file_path, file_name), 'ab+') as f:
    writer = csv.writer(f, delimiter='\t')
    try:
        # writerow expects a sequence of fields, so the encoded
        # value is wrapped in a list
        writer.writerow([data['name'].encode('utf-8')])
    except Exception as ex:
        logging.exception(ex)
When I open the file in a text editor like Sublime Text, all of the Unicode characters display correctly.
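To rule out the editor silently normalizing anything, I can also check the file programmatically. This is just a diagnostic sketch: it confirms the raw bytes decode as valid UTF-8 and prints the code points of the 'name' field for the first few rows (file_path and file_name as above):

# Confirm the raw bytes are valid UTF-8 and inspect code points
with open('{0}{1}.txt'.format(file_path, file_name), 'rb') as f:
    raw = f.read()

decoded = raw.decode('utf-8')  # raises UnicodeDecodeError if the bytes aren't UTF-8
for line in decoded.splitlines()[:5]:
    name = line.split('\t')[0]
    print repr(name), [hex(ord(c)) for c in name]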
Once all the data is done writing to the file, I start collecting it with SSIS. In SSIS, I have a Flat File Source task that pulls in the data. I've defined the data type for the 'name' column in the connection manager as DT_WSTR with a length of 4000. The code page for the flat file connection is 65001 (UTF-8).
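One thing I was not sure about is whether the flat file connection with code page 65001 cares about a byte-order mark. The csv writer above does not emit one, and this small sketch confirms what the file actually starts with:

# Check whether the flat file starts with a UTF-8 BOM (EF BB BF)
import codecs

with open('{0}{1}.txt'.format(file_path, file_name), 'rb') as f:
    head = f.read(3)

print 'UTF-8 BOM present:', head == codecs.BOM_UTF8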
The destination database I am writing to is a SQL Azure database with collation SQL_Latin1_General_CP1_CI_AS. The destination column is defined as nvarchar(max).
If I try to write to a regular SQL Server database with the same collation, the result is the same.
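To confirm the characters are actually being mangled on the SSIS/database side rather than in the file, I can pull a value back out of the destination table and compare its code points against the ones printed from the source file. This is a sketch assuming pyodbc, with placeholder connection details, and dbo.MyTable standing in for the real destination table:

# Read a value back from the warehouse and print its code points
import pyodbc

# Connection string values are placeholders
conn = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};'
                      'SERVER=myserver.database.windows.net;'
                      'DATABASE=mydb;UID=myuser;PWD=mypassword')
cursor = conn.cursor()
cursor.execute('SELECT TOP 1 name FROM dbo.MyTable')
row = cursor.fetchone()
print repr(row[0]), [hex(ord(c)) for c in row[0]]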
What am I doing wrong here? There are lots of emoji-type characters that I collect and don't care much about; what is important is accented and other non-English characters. If I need to provide any more detail, please let me know.