Matlab tables: loading and joining

A couple years back, Matlab added yet another datatype to its standard kernel. This new datatype is called table, and it’s like a cross between a cell and a structure. It seems quite neat, in that it’s easier to store metadata; apparently this datatype is a spruced up version of something that’s been available in the statistics toolbox for some time. I wouldn’t know; I avoid stats.

For some new work I thought it would be a good opportunity to try to learn table. The very first thing that attracted me to it was its readtable function to import CSV files in a sensible way; if you’re like me you would have lamented at various points that Matlab’s built-in csvread is basically useless since it only reads numbers.

An example is probably worthwhile. Say we have a CSV file that looks like:

,Mass,MassSigma,COM,COMSigma
headneck,7.13,0.59,87.42,1.8
trunk,46.49,1.61,69.31,0.7
pelvis,11.17,0.94,49,5.6
arm,3.73,0.28,43.8,7
forearm,1.33,0.09,42.95,9.6
hand,0.57,0.06,49.67,3.5

In a new-ish version of Matlab, we can now write T = readtable('example.csv','ReadRowNames',true) to receive

T =
                  Mass     MassSigma     COM     COMSigma
                  _____    _________    _____    ________
    headneck       7.13    0.59         87.42     1.8
    trunk         46.49    1.61         69.31     0.7
    pelvis        11.17    0.94            49     5.6
    arm            3.73    0.28          43.8       7
    forearm        1.33    0.09         42.95     9.6
    hand           0.57    0.06         49.67     3.5

Note that the mix of strings and numbers have imported without trouble, and we even have nice row and column headings. (Okay, I haven’t shown a big mix of data types, but trust me that it’s more robust. It does seem like each column must be one data type, however.) Accessing data in a table can be done in various ways, for example: T.Mass gives an array of the mass column; T{:,{'Mass','COM'}} extracts two columns; T{{'arm'}, {'COM'}} (or just T{4,3}) extracts that one value; and so on.

Having in-built metadata like this allows a number of powerful set-like operations to be performed on tables, with new functions such as intersect, setdiff, and so on. In addition, there’s even an Excel-PivotTable-like summary command that takes the data in a table and attempts to do some (simple) automatic stats on it. Nice for data exploration.

Like for any new tool, learning table hasn’t been without a few wrong turns and dead-ends for me. Here’s one of them.

Dissimilar tables don’t join

I have a number of variables that each contains a value for a set of parameters (Rows). The rows are not (necessarily) the same for each variable, however. I want to join the variables together so the results are all displayed in a single table. E.g., I want to join these together: (drawn side by side to save space)

          Var_A                    Var_B
         ________                 _______
    a     0.36744            b    0.88517
    b     0.98798            c    0.91329
    c    0.037739            d    0.79618

Matlab provides three functions for joining tables; join, innerjoin, and outerjoin. The obvious syntax doesn’t work:

A = table(rand(3,1),'VariableNames',{'Var_A'},'RowNames',{'a','b','c'})
B = table(rand(3,1),'VariableNames',{'Var_B'},'RowNames',{'b','c','d'})

try
  C = join(A,B)
catch e
  disp(e.identifier)
  disp(e.message)
end

This results in:

MATLAB:table:join:CantInferKey
Cannot find a common table variable to use as a key variable.

Okay, so maybe join isn’t intended for this – what about outerjoin? Its documentation sounds promising:

The outer join includes the rows that match between A and B, and also unmatched rows from either A or B, all with respect to the key variables. C contains all variables from both A and B, including the key variables.

Well, outerjoin apparently can’t be used with tables with row names! (I just get nonsense because it needs a common variable to match against in each table.) This is the closest I’ve found that does what I want, but seems to be against the idea of the table data structure to some degree:

AA = table({'a';'b';'c'},rand(3,1));
AA.Properties.VariableNames = {'param','Var_A'}

BB = table({'b';'c';'d'},rand(3,1));
BB.Properties.VariableNames = {'param','Var_B'}

CC = outerjoin(AA,BB,'Keys',1,'MergeKeys',true)

This results in

param     Var_A      Var_B
_____    _______    _______

'a'      0.10676        NaN
'b'      0.65376    0.77905
'c'      0.49417    0.71504
'd'          NaN    0.90372

I.e., the row is just stored as a separate variable. This means it can’t be indexed using “logical” notation such as CC{'a',:}. But this can be fixed with:

CCC = CC(:,2:end);
CCC.Properties.RowNames = CC{:,1}

Which finally results in:

CCC =

          Var_A      Var_B
         _______    ________

    a     0.4168         NaN
    b    0.65686     0.29198
    c    0.62797     0.43165
    d        NaN    0.015487

But is this really the best way to go about things? This seems like a pretty big edge case.

The joining of tables by taking their row names and turning them into variables can be automated with the following function:

function C = fakejoin(A,B)

fake = {'FakeRowNames'};

AT = array2table(A.Properties.RowNames,'VariableNames',fake);
BT = array2table(B.Properties.RowNames,'VariableNames',fake);

% these can't be inlined into outerjoin() for some reason!
AV = [AT, A];
BV = [BT, B];

CV = outerjoin(AV,BV,'Keys',1,'MergeKeys',true);

C = CV(:,2:end);
C.Properties.RowNames = CV{:,1};

return

One caveat to all this: I’m currently using Matlab 2015a, so some of this behaviour may well have already changed and will soon.

I quite like the table data structure, but I feel (based on experiences like this) that it’s a little rough around the edges. Or perhaps I simply haven’t internalised its proper use cases yet, and I’m the weird one. Very possible.