Working with CSV files

Working with CSV files

Introduction

CSV is a common file format used regularly in financial services.

Most applications will allow you to import and export data in CSV format.

It is therefore important to get a good understanding of the CSV format to better handle the data you work with on a daily basis.

This article runs through some of the nuances of working with CSV files and provides plenty of examples to tie everything together. Feel free to share your own tips in the comments section at the end.

What is CSV?

CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases.

A CSV file contains a number of rows, each containing a number of columns, usually separated by commas.

Note: With no set standard each application that uses CSV can format it in slightly different ways. While RFC 4180 below provides some guidelines often you'll find yourself viewing a CSV file in a text editor to determine what formatting has been used.

Why use CSV?

The file format .csv is not meant to be one in which people store data long-term; rather, it is meant to be a way to move data from one program to another.

Consider it an international cargo container, if that helps.

For example, if you had a large portfolio of positions stored in an accounting application but wanted to use these positions in Microsoft Excel, you could export the portfolio to .csv and then import them into Microsoft Excel.

What does CRLF stand for?

CR on its own stands for Carriage Return and is a control character used to reset a device's position to the beginning of a line of text. LF is Line Feed, a move to the next line.

In plain English CRLF moves the position to the start of a new line. In the case of CSV this is used to define a new row.

Is there a CSV file format standard?

A general standard for the CSV file format does not exist, but RFC 4180 provides a de facto standard for some aspects.

Below are the RFC 4180 definitions of CSV formats:

(Hint: You don't type in 'CRLF', it merely shows where you move to a new line)

Rule 1

1. Each record is located on a separate line, delimited by a line break (CRLF). For example:
		
		   aaa,bbb,ccc CRLF
		   zzz,yyy,xxx CRLF

Rule 2

2. The last record in the file may or may not have an ending line break. For example:
		
		   aaa,bbb,ccc CRLF
		   zzz,yyy,xxx

Rule 3

3. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines.  This header will contain names corresponding to the fields in the file and should contain the  same number of fields as the records in the rest of the file (the presence or absence of the header line      should be indicated via the optional "header" parameter of this MIME type).  For example:

		   field_name,field_name,field_name CRLF
		   aaa,bbb,ccc CRLF
		   zzz,yyy,xxx CRLF

Rule 4

4. Within the header and each record, there may be one or more fields, separated by commas.  Each line should contain the same number of fields throughout the file.  Spaces are considered part of a field and should not  be ignored.  The last field in the record must not be followed by a comma.  For example:

			aaa,bbb,ccc

Rule 5

5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all).  If fields are not enclosed with double quotes, then double quotes may not     appear inside the fields.  For example:

		   "aaa","bbb","ccc" CRLF
		   zzz,yyy,xxx

Rule 6

6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.  For  example:

		   "aaa","b CRLF
		   bb","ccc" CRLF
		   zzz,yyy,xxx

Rule 7

7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped  by preceding it with another double quote.  For example:

			"aaa","b""bb","ccc"

 

  • Each record is one line - Line separator may be LF (0x0A) or CRLF (0x0D0A), a line separator may also be embedded in the data (making a record more than one line but still acceptable).
  • Fields are separated with commas.
  • Leading and trailing whitespace is ignored - Unless the field is delimited with double-quotes in that case the whitespace is preserved.
  • Embedded commas - Field must be delimited with double-quotes.
  • Embedded double-quotes - Embedded double-quote characters must be doubled, and the field must be delimited with double-quotes.
  • Embedded line-breaks - Fields must be surrounded by double-quotes.
  • Always Delimiting - Fields may always be delimited with double quotes, the delimiters will be parsed and discarded by the reading applications.

Example


Year Make Model Description Price
1997 Ford E350 ac, abs, moon 3000.00
1999 Chevy Venture "Extended Edition"   4900.00
1999 Chevy Venture "Extended Edition, Very Large"   5000.00
1996 Jeep Grand Cherokee MUST SELL!
air, moon roof, loaded
4799.00

The above table of data may be represented in CSV format as follows:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

Note: In some European countries where a comma is used to represent a decimal place for currencies a semicolon may be used to delimit (seperate) the data. Examples below:


Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):

Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38

Example of an analogous German and Dutch CSV/DSV file (where the decimal separator is a comma and the value separator is a semicolon):

Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38

The latter format is not RFC 4180 compliant. Compliance could be achieved by the use of a comma instead of a semicolon as a separator and either the international notation for the representation of the decimal mark or the practice of quoting all numbers that have a decimal mark.

Are there data limits on CSV files when working with Excel?

Yes, you can import or export up to 1,048,576 rows and 16,384 columns.

CSV Vs. XML format

XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human and machine readable.

It's mostly used to transmit data across the internet between applications and is used in many API's.

In advantage over CSV, XML has a cleary defined format.

CSV does have one advantage over XML for specification in new designs. CSV has much lower overhead, thereby using much less bandwidth and storage than XML.

For new designs where overhead issues have high importance or criticality, CSV, or a more robust low-bandwidth alternative, may be the better choice.

Looking to convert CSV to XML?

Here's a free CSV to XML converter from the people at Creativyst Software

Go to free converter
 

Summary

While it might be more common for new applications to use XML or JSON to transmit data the majority of legacy systems will still be using CSV for a very long time.

The lack of a standard can make working with CSV files tricky although that may be about to change in 2014.

The best you can do sometimes is insist on a consistent format for the CSV files received and point any strange setups to RFC 4180.

It is often quicker to have someone fix the format of the file sent to you than building a number of work arounds for poorly formatted files.

Further Reading:

Here are some articles that may be of interest: