Blog de Nathan Story

PHP Fgetcsv Is Broken


The PHP built-in function fgetcsv is broken.

Background

If you aren’t familiar with the CSV file format, see RFC 4180. I’ll summarize the relevant bits below.

In the simplest case, a CSV file takes the following form:

state,capital,demonym
Massachusetts,Boston,Bay Stater
New Hampshire,Concord,New Hampshirite
Illinois,Springfield,Illinoisan

This is fine for the data listed above, but what if we want to include values that contain commas? In that case, we quote the value to ensure that commas are interpreted literally:

state,state motto
Massachusetts,Ense petit placidam sub libertate quietem
New Hampshire,Live free or die
Illinois,"State sovereignty, national union"

What then, you may ask, shall we do if we want literal quotation marks in one of these quoted values? Well, in that case, we double the quotes:

language,program
C,"printf(""Hello, World!"");"
PHP,"echo ""Hello, World!"";"
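The quoting rules above fit in a few lines of PHP. This is only a sketch of the RFC 4180 rules as summarized here; csvQuoteField is a hypothetical helper name of mine, not something PHP ships with.

```php
<?php
// Hypothetical helper (not part of PHP): quote a single field per RFC 4180.
function csvQuoteField(string $value): string
{
    // Leave the value alone unless it contains a comma, a quote, or a line break.
    if (strpbrk($value, ",\"\r\n") === false) {
        return $value;
    }
    // Double every embedded quote, then wrap the whole value in quotes.
    return '"' . str_replace('"', '""', $value) . '"';
}

echo csvQuoteField('Live free or die'), "\n";
echo csvQuoteField('State sovereignty, national union'), "\n";
echo csvQuoteField('echo "Hello, World!";'), "\n";
```

The last call produces the same quoted value as the PHP row in the example above.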

The Problem

To see where the problem in PHP’s implementation lies, let’s start by examining the function signature:

array fgetcsv ( resource $handle [, int $length = 0 [, string $delimiter = ','
[, string $enclosure = '"' [, string $escape = '\\' ]]]] )

Let’s ignore $handle and $length as they aren’t relevant to this discussion. Let’s break down the last three parameters:

  • $delimiter: this is what appears between values
  • $enclosure: this is what appears around values that contain the delimiter
  • $escape: WTF is this

To understand WTF that is, look at the following shell session:

$ echo 'a,"b\",c' | php -r 'var_dump(fgetcsv(STDIN));'
array(2) {
  [0]=>
  string(1) "a"
  [1]=>
  string(6) "b\",c
"
}

Do you see what it did there? The backslash character is escaping the closing quote, causing the rest of the line to be interpreted as part of that field. RFC 4180 defines no escape character at all, so that sort of behavior ain’t part of an authentic CSV implementation, and it is entirely redundant given the quote-doubling mechanism discussed above.
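Here is the same thing as a standalone script, in case you would rather not fiddle with shell quoting. It is a minimal sketch: parseLine is a hypothetical helper of mine that feeds one line to fgetcsv through an in-memory stream, and I pass the documented default escape explicitly instead of relying on it.

```php
<?php
// Hypothetical helper: run fgetcsv over one in-memory line with a given escape.
function parseLine(string $line, string $escape): array
{
    $handle = fopen('php://memory', 'r+');
    fwrite($handle, $line);
    rewind($handle);
    $row = fgetcsv($handle, 0, ',', '"', $escape);
    fclose($handle);
    return $row;
}

// Doubled quotes already parse fine, so the backslash escape adds nothing.
$doubled = parseLine('PHP,"echo ""Hello, World!"";"' . "\n", '\\');
var_dump($doubled);

// The backslash swallows the closing quote, so the last two values merge.
$broken = parseLine('a,"b\",c' . "\n", '\\');
var_dump(count($broken)); // 2 fields, as in the shell session
```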

Solution

Now, what we don’t want to do is write our own CSV parser. It’s a pain, you’ll undoubtedly forget some edge case, and it will probably be molasses-slow. Instead, you can use the following parameters to trick the function into working correctly:

$ echo 'a,"b\",c' | php -r 'var_dump(fgetcsv(STDIN, 0, ",", "\"", "\""));'
array(3) {
  [0]=>
  string(1) "a"
  [1]=>
  string(2) "b\"
  [2]=>
  string(1) "c"
}

We set the “escape” character to be the same as the enclosure character. This seems to neutralize the escape behavior entirely – checking whether a character is the enclosure takes precedence within the function.
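If you want this in a script, a small wrapper keeps the odd-looking arguments in one place. This is a sketch under the same assumption as the one-liner, namely that the enclosure-as-escape trick behaves as shown; readCsvRows is a hypothetical name of mine.

```php
<?php
// Hypothetical helper: read every row using the enclosure-as-escape trick.
function readCsvRows($handle): array
{
    $rows = [];
    // Passing '"' as both enclosure and escape is the workaround described above.
    while (($row = fgetcsv($handle, 0, ',', '"', '"')) !== false) {
        $rows[] = $row;
    }
    return $rows;
}

// Two sample lines: the backslash case and a row with doubled quotes.
$handle = fopen('php://memory', 'r+');
fwrite($handle, 'a,"b\",c' . "\n" . 'PHP,"echo ""Hello, World!"";"' . "\n");
rewind($handle);
$rows = readCsvRows($handle);
fclose($handle);

print_r($rows);
```

The second sample line is there to check that doubled quotes still decode normally with these arguments.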

Note

The above solution relies on an undocumented and unsupported implementation detail of this function. There is no guarantee that it will keep working in a future version of PHP (I ran the above code in 5.4.16).
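One way to lean on the trick a little less: the PHP manual's changelog for fgetcsv says that, as of 7.4.0, $escape also accepts an empty string, which disables the escape mechanism. The sketch below picks the documented route when it is available and falls back to the trick otherwise; csvEscapeArgument is a hypothetical helper of mine, and I have not run this on every version.

```php
<?php
// Hypothetical helper: choose an $escape argument that turns the backslash behavior off.
function csvEscapeArgument(): string
{
    // PHP 7.4.0 and later document '' as "no escape character".
    // Older versions fall back to the undocumented enclosure-as-escape trick.
    return PHP_VERSION_ID >= 70400 ? '' : '"';
}

$handle = fopen('php://memory', 'r+');
fwrite($handle, 'a,"b\",c' . "\n");
rewind($handle);
$row = fgetcsv($handle, 0, ',', '"', csvEscapeArgument());
fclose($handle);

var_dump($row);
```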
