Long and wide matrices

A quick lesson on representing data and reshaping matrices

There are two broad ways data can be stored in a matrix. In "wide" format, everything for a given subject / record is on a single line / row. For example:

patientId   control_cells   control_temp   drugA_cells   drugA_temp   drugB_cells   drugB_temp
1           ...
2           ...
...

Whereas the "long" format has multiple rows for a subject, one for each condition. For example:

patientId   treatment   cells   temp
1           control     ...
1           drugA       ...
1           drugB       ...
2           control     ...
...

Wide is generally more useful than long. (It's a good rule of thumb to have one record per line, although what you consider to be a record may vary.) So how do you convert between the two? Easy. Start with a long data frame, called long_df below.
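The code that built long_df is not reproduced here. As a minimal sketch, assuming three patients, the three treatments from the example above and random placeholder values (so the numbers and column order will not match the output printed further down), something like this would do:

```r
# Hypothetical stand-in for the long data frame; values are random
# placeholders, not the data behind the output shown later.
set.seed(1)

long_df <- data.frame(
  patientId = rep(1:3, each = 3),                        # 3 rows per patient
  treatment = rep(c("control", "drugA", "drugB"), times = 3),
  cells     = runif(9),                                  # first measurement column
  temp      = runif(9)                                   # second measurement column
)

print(head(long_df))
```

Here patientId is deliberately left as a plain integer column, which is why the conversion to a factor is needed next.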

Note that the row / subject id has to be a factor, so that recast treats it as an id variable rather than as a measurement:

long_df$patientId <- factor (long_df$patientId)

Then use "recast" from the "reshape2" library. The parameters are:

  • the data frame (I haven't checked whether this works with a matrix; you may have to convert it first)
  • a formula showing how the data is grouped: row variables go to the left of '~', column variables to the right, and '...' means 'everything else'

This assumes that everything else is a measurement column. The new columns are named by joining the treatment and the measurement name with an underscore:

library (reshape2)

wide_df <- recast (long_df, patientId ~ treatment + ...)
## Using patientId, treatment as id variables
print (wide_df)
##   patientId control_temp control_cells drugA_temp drugA_cells drugB_temp
## 1         1    0.9923105     0.8185061  0.4760233   0.3968007  0.2569790
## 2         2    0.1548171     0.9000275  0.7203557   0.7429774  0.9330161
## 3         3    0.5281820     0.8107075  0.4606059   0.6311145  0.5738539
##   drugB_cells
## 1   0.9420978
## 2   0.7072848
## 3   0.1813425
print (colnames (wide_df))
## [1] "patientId"     "control_temp"  "control_cells" "drugA_temp"   
## [5] "drugA_cells"   "drugB_temp"    "drugB_cells"
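For reference, recast is a convenience wrapper that melts the data frame and then casts it. A minimal sketch of the same conversion as two explicit steps follows, using a small hypothetical data frame (the values are made up for illustration). Because the id variables are named explicitly here, patientId does not need to be a factor:

```r
library(reshape2)

# Hypothetical example data: 2 patients x 2 treatments, 2 measurements each
long_df <- data.frame(
  patientId = rep(1:2, each = 2),
  treatment = rep(c("control", "drugA"), times = 2),
  cells     = c(0.1, 0.2, 0.3, 0.4),
  temp      = c(36.5, 37.0, 36.8, 37.2)
)

# Step 1: melt to one row per (patient, treatment, measurement).
# Naming the id variables means melt does not have to guess them.
molten <- melt(long_df, id.vars = c("patientId", "treatment"))

# Step 2: cast to one row per patient, with one column per
# treatment / measurement combination.
wide_df <- dcast(molten, patientId ~ treatment + variable)

print(wide_df)
```

Spelling out the two steps is more typing, but it makes it easy to inspect the molten intermediate when the result is not what you expected.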

This is just the bare bones of recast; there are lots of other options and possibilities. Several other libraries (notably tidyr) can also make this conversion and are arguably more powerful. However, to my eye they're much more complicated. For simple cases, the above works just fine. Note that many of the examples scattered across the web assume there's only one measurement column.
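For comparison, here is a rough sketch of how the same long-to-wide conversion might look in tidyr, with two measurement columns. The data frame is hypothetical, and the names_glue pattern is my assumption about how to get column names in the same treatment_measurement style as above:

```r
library(tidyr)

# Hypothetical example data: 2 patients x 2 treatments, 2 measurements each
long_df <- data.frame(
  patientId = rep(1:2, each = 2),
  treatment = rep(c("control", "drugA"), times = 2),
  cells     = c(0.1, 0.2, 0.3, 0.4),
  temp      = c(36.5, 37.0, 36.8, 37.2)
)

# One row per patient; one column per treatment / measurement pair.
# names_glue builds names such as "control_cells" and "drugA_temp".
wide_df <- pivot_wider(
  long_df,
  id_cols     = patientId,
  names_from  = treatment,
  values_from = c(cells, temp),
  names_glue  = "{treatment}_{.value}"
)

print(wide_df)
```

Whether that is clearer than the recast one-liner is a matter of taste.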

