Working with CSV files
Introduction
CSV is a common file format used regularly in financial services.
Most applications will allow you to import and export data in CSV format.
It is therefore important to get a good understanding of the CSV format to better handle the data you work with on a daily basis.
This article runs through some of the nuances of working with CSV files and provides plenty of examples to tie everything together. Feel free to share your own tips in the comments section at the end.
What is CSV?
CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases.
A CSV file contains a number of rows, each containing a number of columns, usually separated by commas.
Why use CSV?
The file format .csv is not meant to be one in which people store data long-term; rather, it is meant to be a way to move data from one program to another.
Consider it an international cargo container, if that helps.
For example, if you had a large portfolio of positions stored in an accounting application but wanted to use these positions in Microsoft Excel, you could export the portfolio to .csv and then import them into Microsoft Excel.
What does CRLF stand for?
CR on its own stands for Carriage Return and is a control character used to reset a device's position to the beginning of a line of text. LF is Line Feed, a move to the next line.
In plain English CRLF moves the position to the start of a new line. In the case of CSV this is used to define a new row.
Is there a CSV file format standard?
A general standard for the CSV file format does not exist, but RFC 4180 provides a de facto standard for some aspects.
Below are the RFC 4180 definitions of CSV formats:
(Hint: You don't type in 'CRLF', it merely shows where you move to a new line)
Rule 1
1. Each record is located on a separate line, delimited by a line break (CRLF). For example: aaa,bbb,ccc CRLF zzz,yyy,xxx CRLF
Rule 2
2. The last record in the file may or may not have an ending line break. For example: aaa,bbb,ccc CRLF zzz,yyy,xxx
Rule 3
3. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type). For example: field_name,field_name,field_name CRLF aaa,bbb,ccc CRLF zzz,yyy,xxx CRLF
Rule 4
4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example: aaa,bbb,ccc
Rule 5
5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example: "aaa","bbb","ccc" CRLF zzz,yyy,xxx
Rule 6
6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example: "aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx
Rule 7
7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example: "aaa","b""bb","ccc"
- Each record is one line - Line separator may be LF (0x0A) or CRLF (0x0D0A), a line separator may also be embedded in the data (making a record more than one line but still acceptable).
- Fields are separated with commas.
- Leading and trailing whitespace is ignored - Unless the field is delimited with double-quotes in that case the whitespace is preserved.
- Embedded commas - Field must be delimited with double-quotes.
- Embedded double-quotes - Embedded double-quote characters must be doubled, and the field must be delimited with double-quotes.
- Embedded line-breaks - Fields must be surrounded by double-quotes.
- Always Delimiting - Fields may always be delimited with double quotes, the delimiters will be parsed and discarded by the reading applications.
Example
Year | Make | Model | Description | Price |
---|---|---|---|---|
1997 | Ford | E350 | ac, abs, moon | 3000.00 |
1999 | Chevy | Venture "Extended Edition" | 4900.00 | |
1999 | Chevy | Venture "Extended Edition, Very Large" | 5000.00 | |
1996 | Jeep | Grand Cherokee | MUST SELL! air, moon roof, loaded |
4799.00 |
The above table of data may be represented in CSV format as follows:
Year,Make,Model,Description,Price 1997,Ford,E350,"ac, abs, moon",3000.00 1999,Chevy,"Venture ""Extended Edition""","",4900.00 1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00 1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):
Year,Make,Model,Length 1997,Ford,E350,2.34 2000,Mercury,Cougar,2.38
Example of an analogous German and Dutch CSV/DSV file (where the decimal separator is a comma and the value separator is a semicolon):
Year;Make;Model;Length 1997;Ford;E350;2,34 2000;Mercury;Cougar;2,38
The latter format is not RFC 4180 compliant. Compliance could be achieved by the use of a comma instead of a semicolon as a separator and either the international notation for the representation of the decimal mark or the practice of quoting all numbers that have a decimal mark.
Are there data limits on CSV files when working with Excel?
Yes, you can import or export up to 1,048,576 rows and 16,384 columns.
CSV Vs. XML format
XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human and machine readable.
It's mostly used to transmit data across the internet between applications and is used in many API's.
In advantage over CSV, XML has a cleary defined format.
CSV does have one advantage over XML for specification in new designs. CSV has much lower overhead, thereby using much less bandwidth and storage than XML.
For new designs where overhead issues have high importance or criticality, CSV, or a more robust low-bandwidth alternative, may be the better choice.
Looking to convert CSV to XML?
Here's a free CSV to XML converter from the people at Creativyst Software
Summary
While it might be more common for new applications to use XML or JSON to transmit data the majority of legacy systems will still be using CSV for a very long time.
The lack of a standard can make working with CSV files tricky although that may be about to change in 2014.
The best you can do sometimes is insist on a consistent format for the CSV files received and point any strange setups to RFC 4180.
It is often quicker to have someone fix the format of the file sent to you than building a number of work arounds for poorly formatted files.
Further Reading:
Here are some articles that may be of interest:
- The Wikipedia entry on CSV is quite good
- How to import or export text (.txt or .csv) files from Microsoft Excel
- When and Why is XML preferable to CSV? answered on stackoverflow
- So You Want To Write Your Own CSV code? - a call for not re-inventing the wheel