Is there a simple transform when a csv field is sometimes enclosed in double quotes, and sometimes not?
I am processing an inbound csv file. For the most part, it is well defined and consistent. The exception occurs when a text field has an embedded comma. When it does, the field is enclosed in double quotes. When it does not, the field is
not enclosed in double quotes.
I am using a Flat File Connection Manager with a Delimited format, using the comma as the Column delimiter. The Text qualifier in the General properties sheet is set to <none>.
Under these conditions, when a text field that is enclosed in quotes and contains an embedded comma is encountered, the quote character is treated as a regular character and the embedded comma is treated as a column delimiter.
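A minimal Python sketch of what I am seeing (an illustration only, not SSIS itself; the sample row and the function name split_without_qualifier are made up, and a plain split on commas is assumed to approximate the connection manager with no text qualifier):

```python
def split_without_qualifier(line):
    # Assumption: with no text qualifier, every comma is a column
    # delimiter and quote characters are ordinary data.
    return line.split(",")


row = '1,"Smith, John",NY'  # hypothetical sample row with three fields
print(split_without_qualifier(row))
```

The row has three fields but comes out as four columns, with the quote characters left in the data.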
If I set the Text qualifier to the double quote character, the Data Flow task fails when it encounters a field that is not enclosed in double quotes.
My current workaround is to first process the file in a C# script that uses a regular expression to find a field enclosed in quotes with an embedded comma, strip the quotes, and replace the comma with a tilde. After loading, I update the target table by replacing
each tilde character with a comma. I believe this is sufficient for my needs, but it will not handle the following scenarios:
The field is the first or last field in the row
The field has more than one embedded comma
This is my "find" regular expression: ,"([^"]*),([^"]*)",
This is my "replace" regular expression: ,$1~$2,
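For anyone who wants to reproduce the limitations, here is a minimal sketch of the same find and replace in Python (my actual script is C#; the sample rows and the name strip_quotes are made up for illustration, and Python writes the replacement groups as \1 and \2 instead of $1 and $2):

```python
import re

# Same "find" pattern as above: a quoted field with a comma on each side.
FIND = re.compile(r',"([^"]*),([^"]*)",')
# Same "replace" pattern, using Python's backreference syntax.
REPLACE = r',\1~\2,'


def strip_quotes(line):
    # Drop the quotes and turn one embedded comma into a tilde.
    return FIND.sub(REPLACE, line)


# Hypothetical sample rows, one per scenario.
samples = [
    '1,"Smith, John",NY',  # quoted field in the middle of the row
    '"Smith, John",NY',    # quoted field is the first field
    '1,"Smith, John"',     # quoted field is the last field
    '1,"a, b, c",NY',      # more than one embedded comma
]
for line in samples:
    print(strip_quotes(line))
```

Only the first row comes out the way I want. The second and third rows are left unchanged because the pattern requires a comma on both sides of the quoted field, and in the fourth row only the last embedded comma is replaced.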
Can anyone suggest an alternative approach, or a more robust regular expression pattern?
Thanks,
Ed