Working with data in MATLAB
Contents
Reading and saving data
Suppose that you have carried out a large computation with MATLAB - one that takes a couple of hours or so to finish. Do you need to run this program every time you want to look at the output? No, you don't - not if you've saved the output to a file.
MATLAB provides many different options for saving data to a file, in many different formats using commands such as
fprintf
or
fwrite
These commands produce data files that can be read into MATLAB or into different programming languages; detailed descriptions of how to work with them can be found using the help command.
The easiest data format to work with in MATLAB -- and the one we'll focus on -- is MATLAB's own native format. Files in this format end with the ".mat" suffix - e.g.
data_out.mat
To save data in the .mat file 'flnm.mat', the command
save flnm.mat
is used. To retrieve data from the .mat file 'flnm.mat', the command
load flnm.mat
is used. Consider the following example, in which we numerically approximate the solution to an ODE
dt = .01; t = 0:dt:10; x(1) = 3; for k=2:length(t) x(k) = x(k-1)+dt*(1-x(k-1)^2); end
We can look at the solution:
plot(t,x)
To save the output to the file data1.mat we type
save data1.mat
We can then clear memory
clear all
If we now try to plot x against t, we'll get an error message - because neither is in memory:
plot(t,x)
Undefined function or variable 't'.
But if we load the data back into memory
load data1.mat
then both of the arrays x and t are back in memory
plot(t,x)
Being able to save to and read from data files is very important when working with programs that take a long time to run (so you don't want to have to do it over and over again), or when working with large datasets.
Working with data
Suppose we've measured the POC concentration in ocean sediments at a number of different depths and we want to estimate the surface sediment flux and refractory POC concentration, knowing that the bioturbation diffusivity is
D = 35 cm^2/kyr
that the remineralization rate coefficient is
k = 1.5x10^-9 s^-1
and that the burial flux rate w is negligible. As we discussed in section 6.5, we can do this using linear regression modelling.
But what are our data? Well, the data file
poc_data.mat
contains measurements of POC (in units mol/m^3) at depths z (measured in m) with the sign convention that z increases with depth. These data are saved in the arrays C and z. This dataset can be downloaded from the course website at
http://web.uvic.ca/~monahana/eos225/poc_data.mat
Download this file to your home directory. Having downloaded the data, let's read them in to MATLAB
load poc_data.mat
and have a look at them
plot(C,z,'o')
Note that because z increases with depth, deeper values of C are at the top of the plot. We can use the following command to flip the orientation of the plot
set(gca,'ydir','reverse')
The set command can be used to change the plot in all sorts of ways - in this case, to set the y direction ('ydir') from 'normal' to 'reverse' (that is, increasing downward). To get more information, you can type
help set
These data show a general decrease of POC concentration with depth (as expected), but one which is noisy.
From our discussion in class, we know that we can write the POC profile predicted from the steady-state POC budget as the linear regression model
y = ax + b
with
y = C x = (1/sqrt(D*K))*exp(-sqrt(k/D)z) a = Phi b = C_r
To estimate a and b, we'll need to compute some statistics. These computations involve adding up lots of numbers. Such sums are done easily in MATLAB when data is stored as 1xN arrays using the 'sum' command:
sum(M)
which simply adds up all of the elements in the 1xN array M. E.g. with
M = [1 3 8]
M = 1 3 8
then
sum(M)
ans = 12
is just the sum 1+3+8 = 12.
To carry out our statistical analysis, type the following in a text editor and save as poc_stats.m
load poc_data.mat N = length(C); D = 35/((100^2)*(1000*365*24*3600)); %% change D from cm^2/kyr to m^2/s k = 1.5e-9; y = C; x = exp(-sqrt(k/D)*z)/sqrt(k*D); mu_x = sum(x)/N; mu_y = sum(y)/N; sig_x = sqrt(sum((x-mu_x).^2)/(N-1)); sig_y = sqrt(sum((y-mu_y).^2)/(N-1)); rho = sum((x-mu_x).*(y-mu_y))/((N-1)*sig_x*sig_y); a = (sig_y/sig_x)*rho; b = mu_y-rho*sig_y*mu_x/sig_x; Cr = b; Phi = a;
Looking at a scatterplot of x with y:
plot(x,y,'o')
we can compare this to the line predicted by the regression model:
y_reg = a*x+b; hold on plot(x,y_reg,'r') hold off
and we see that the regression model does a pretty good job of characterising the scatter of data. The fraction of variance of y "explained" by the regression is
rho^2
ans = 0.8168
and the predicted values of the surface POC flux (in mol m^-2 s^-1) and concentration of refractory POC (in mol m^-3) are
Phi Cr
Phi = 5.7693e-10 Cr = 40.2022
To get Phi in mmol m^-2 day^-1 we have
Phi*1000*24*3600
ans = 0.0498
We can also look at plots not of x against y, but of z against C for both the data and the regression fit. Having computed estimates of Phi and Cr, we have all of the parameters necessary to plot our function C(z).
C_pred = Cr+(Phi/sqrt(k*D))*exp(-sqrt(k/D)*z); plot(C,z*100,'o'),hold on plot(C_pred,z*100,'r'),hold off set(gca,'ydir','reverse') xlabel('C (mol m^{-3})') ylabel('z (cm)')
Note that we were able to get a superscript in the x label by typing '^{-3}'. The caret indicates that everything in the curly brackets is in superscript.
Exercise 1
Consider the dataset:
x y -8.6 -3.7 6.6 -4.3 -6.0 -8.4 10.0 7.4 0.13 -1.3 -4.4 0.80 -3.1 -7.1 -5.2 -7.5 -12.3 -8.2 -10.3 -25.2
Assuming the linear regression model y = ax+b, determine least squares estimates of a and b. Plot both the original data and the regression line on the same figure. What fraction of the variance of y is accounted for by this regression?
Exercise 2
Repeat the estimates of Phi and Cr from the dataset poc_data.mat, using data from only the upper 2 cm. How do your estimates change?
Exercise 3
Suppose that for the dataset poc_data.mat you know that Phi = 0.05 mmol m^-2 day^-1 and Cr = 40 mol m^-3. Use linear regression modelling to estimate D and k. Hint: Have a look at Example 6.8.