Matlab tables: loading and joining
A couple years back, Matlab added yet another datatype to its standard kernel.
This new datatype is called table
, and it’s like a cross between a cell and a
structure. It seems quite neat, in that it’s easier to store metadata; apparently
this datatype is a spruced up version of something that’s been available in the
statistics toolbox for some time. I wouldn’t know; I avoid stats.
For some new work I thought it would be a good opportunity to try to learn
table
. The very first thing that attracted me to it was its readtable
function
to import CSV files in a sensible way; if you’re like me you would have lamented
at various points that Matlab’s built-in csvread
is basically useless since it
only reads numbers.
An example is probably worthwhile. Say we have a CSV file that looks like:
,Mass,MassSigma,COM,COMSigma
headneck,7.13,0.59,87.42,1.8
trunk,46.49,1.61,69.31,0.7
pelvis,11.17,0.94,49,5.6
arm,3.73,0.28,43.8,7
forearm,1.33,0.09,42.95,9.6
hand,0.57,0.06,49.67,3.5
In a new-ish version of Matlab, we can now write T = readtable('example.csv','ReadRowNames',true)
to receive
T =
Mass MassSigma COM COMSigma
_____ _________ _____ ________
headneck 7.13 0.59 87.42 1.8
trunk 46.49 1.61 69.31 0.7
pelvis 11.17 0.94 49 5.6
arm 3.73 0.28 43.8 7
forearm 1.33 0.09 42.95 9.6
hand 0.57 0.06 49.67 3.5
Note that the mix of strings and numbers have imported without trouble, and we
even have nice row and column headings. (Okay, I haven’t shown a big mix of data types,
but trust me that it’s more robust. It does seem like each column must be one data type, however.)
Accessing data in a table
can be done in various
ways, for example: T.Mass
gives an array of the mass column; T{:,{'Mass','COM'}}
extracts two columns; T{
{'arm'}, {'COM'}
}
(or just T{4,3}
) extracts that one value; and so on.
Having in-built metadata like this allows a number of powerful set-like operations
to be performed on tables, with new functions such as intersect
, setdiff
, and
so on. In addition, there’s even an Excel-PivotTable-like summary
command that takes the
data in a table
and attempts to do some (simple) automatic stats on it.
Nice for data exploration.
Like for any new tool, learning table
hasn’t been without a few wrong turns and
dead-ends for me. Here’s one of them.
Dissimilar tables don’t join
I have a number of variables that each contains a value for a set of parameters (Rows
).
The rows are not (necessarily) the same for each variable, however.
I want to join the variables together so the results are all displayed in a single table.
E.g., I want to join these together: (drawn side by side to save space)
Var_A Var_B
________ _______
a 0.36744 b 0.88517
b 0.98798 c 0.91329
c 0.037739 d 0.79618
Matlab provides three functions for joining tables; join
, innerjoin
, and outerjoin
.
The obvious syntax doesn’t work:
A = table(rand(3,1),'VariableNames',{'Var_A'},'RowNames',{'a','b','c'})
B = table(rand(3,1),'VariableNames',{'Var_B'},'RowNames',{'b','c','d'})
try
C = join(A,B)
catch e
disp(e.identifier)
disp(e.message)
end
This results in:
MATLAB:table:join:CantInferKey
Cannot find a common table variable to use as a key variable.
Okay, so maybe join
isn’t intended for this – what about outerjoin
? Its documentation sounds promising:
The outer join includes the rows that match between A and B, and also unmatched rows from either A or B, all with respect to the key variables. C contains all variables from both A and B, including the key variables.
Well, outerjoin
apparently can’t be used with tables with row names!
(I just get nonsense because it needs a common variable to match against in each table.)
This is the closest I’ve found that does what I want, but seems to be against the idea of the table
data structure to some degree:
AA = table({'a';'b';'c'},rand(3,1));
AA.Properties.VariableNames = {'param','Var_A'}
BB = table({'b';'c';'d'},rand(3,1));
BB.Properties.VariableNames = {'param','Var_B'}
CC = outerjoin(AA,BB,'Keys',1,'MergeKeys',true)
This results in
param Var_A Var_B
_____ _______ _______
'a' 0.10676 NaN
'b' 0.65376 0.77905
'c' 0.49417 0.71504
'd' NaN 0.90372
I.e., the row
is just stored as a separate variable. This means it can’t be indexed using “logical” notation such as CC{'a',:}
.
But this can be fixed with:
CCC = CC(:,2:end);
CCC.Properties.RowNames = CC{:,1}
Which finally results in:
CCC =
Var_A Var_B
_______ ________
a 0.4168 NaN
b 0.65686 0.29198
c 0.62797 0.43165
d NaN 0.015487
But is this really the best way to go about things? This seems like a pretty big edge case.
The joining of tables by taking their row names and turning them into variables can be automated with the following function:
function C = fakejoin(A,B)
fake = {'FakeRowNames'};
AT = array2table(A.Properties.RowNames,'VariableNames',fake);
BT = array2table(B.Properties.RowNames,'VariableNames',fake);
% these can't be inlined into outerjoin() for some reason!
AV = [AT, A];
BV = [BT, B];
CV = outerjoin(AV,BV,'Keys',1,'MergeKeys',true);
C = CV(:,2:end);
C.Properties.RowNames = CV{:,1};
return
One caveat to all this: I’m currently using Matlab 2015a, so some of this behaviour may well have already changed and will soon.
I quite like the table
data structure, but I feel (based on experiences like
this) that it’s a little rough around the edges. Or perhaps I simply haven’t
internalised its proper use cases yet, and I’m the weird one. Very possible.