Aqua Phoenix
     >>  Lectures >>  Matlab 8  


8.1 Datasets

Finding data sets is not easy, and it is just as difficult to create your own, especially when the data is geographical and/or social in nature.

The U.S. government is comprised of hundreds of agencies that produce a vast amount of statistical data related to all sectors of society. A few selected links follow:

Some of these sites publish ready-made PDF files with tabulated entries. Others have custom-designed interfaces that allow for access to data. Unfortunately, neither the sites, nor the ready-made material, nor the query systems follow any guidelines.

Another source to consider are hard-copy publications, for example:

Statistical Abstract of the United States: 2006, The National Data Book, 125th Edition, US Cencus BureauThis book contains statistics from all sectors of society, including consumption, production, education, disabilities, etc. This book is available in the library.

8.2 Data Files

8.2.1 Tables of Data

Statistical data tends to come in a tabular format, e.g.:

Date        Location    Item    Price in cents 
7/8/1997    New York    Apple   59
8/5/2000    Los Angeles Banana  69 
...         ...         ...     ...
The first row in this example is considered a "header", and the other rows are observations. Some more complicated tables may have several lines of headers, and may include sub-tables. For the purpose of using this data in Matlab, it is recommended that tables be re-formatted as in this example. If necessary, remove undesired entries, move data around, and possibly merge or split data. A good amount can be done through spreadsheets (e.g. Microsoft Excel), if the data set is small enough.

8.2.2 File Format: CSV

Comma Separated Value files (CSV) are text files that contain human-readable data. Special delimiters (commas, tabs, carriage returns, quotes) are put in place for separating columns and rows. Glancing over such a data file may not reveal a sensible structure, but once loaded into a spreadsheet application, columns and rows can be identified more easily. A typical comma-separated file may look as follows:

Date,Location,Item,Price in cents
7/8/1997,New York,Apple,59
8/5/2000,Los Angeles,Banana,69
Note that every field in this example is separated by a comma, hence Comma Separated Value. Sometimes, individual values are enclosed in double quotes:

"Date","Location","Item","Price in cents"
"7/8/1997","New York","Apple","59"
"8/5/2000","Los Angeles","Banana","69"
Matlab is able to import from CSV files, as this is the most universally portable file format.

8.2.3 File Format: XLS

Proprietary spreadsheet file formats that Matlab can import include MS Excel XLS files. Unless errors occur when importing an XLS file, no pre-processing is required. Should errors occur, it is recommended to inspect the file using MS Excel, and re-saving it. If this does not help, the XLS data should be exported to CSV format within Excel and imported in Matlab.

8.2.4 File Format: MAT

MAT is Matlab's own format for storing data. It is possible to save an entire Workspace of data, or selected matrices. It is unlikely that statistics are distributed in this format.

8.2.5 Size issues

MS Excel imposes a size limit on a single data set table. This limit is: 65536 rows and 256 columns (i.e. 216 * 28 = 224 (16,777,216) cells).

Matlab does not impose a pro forma limit on the dimensions and size of matrices. Keep in mind that hardware memory and hard disk size (for swapping memory) are the ultimate deciding factors of how much data can be loaded. For comparison, a reasonable size of data that can be loaded into Matlab exceeds the capabilities of Excel by far. It is thus possible to read in numerical values for a million data rows of five columns.

8.2.6 Pre-processing data with Matlab

If the size of a data set exceeds the limitations of Excel or other applications, it must be pre-processed in Matlab. This may include one or more of the following steps:
  • Cleaning of data by removing rows.
  • Splitting data: if a data set contains data from several types of observations, filtering and splitting of the data may be necessary. This can be done by iterating over the data set and selectively moving data rows to other matrices. For example, if a dataset includes observations for U.S. states, U.S. regions, and U.S. cities, the three types may have to be moved to 3 different matrices.
  • Sorting data: data sets can be sorted using the command sortrows.

An example of processing data with Matlab can be found in the collection of M-files in the beginning of this lecture.

8.2.7 Pre-processing data outside of Matlab

When possible, it is recommended to prepare datasets in a spreadsheet program before importing in Matlab. While Matlab does have a spreadsheet-like editor, it is not meant to replace a spreadsheet program.

To prepare data in a spreadsheet, keep in mind that each column should maintain the same data type (double, int, string, ...). Columns (or row) headers should be distinguishable, and preferrably one per column (or row). Try to refrain from merging cells.

8.3 Importing Data

Matlab has several command-line functions that can be used to import many data types, including CSV files. However, for simplicity we will use the graphical interface. In the "Current Directory" file listing, highlight a data file. At this point, we can either use "File->Import", or open the context menu and choose "Import Data".

Figure 8.1
Click image to enlarge, or click here to open
Depending on the file type, the Import Wizard may start at different points in the import process. When importing from an XLS file, only the last of the Import Wizard screens appears. When importing from a CSV file, the process is slightly longer.

The first page of the wizard displays a portion of the text file, as well as a preview of the matrix-version of the data.

Figure 8.2
Click image to enlarge, or click here to open
The preview is broken into 2 spreadsheets: one for "data" and one for "textdata". "data" refers to numerical data, while "textdata" refers to anything that is not unambiguously numerical. That is, the strings "my house" and "60m" are considered "textdata", while "4" and "624.92746" are considered "data". Matlab distinguishes between the two and does not allow mixing of these data types in matrices. When importing a CSV file with numerical and textual data, Matlab thus splits the data and creates two matrices, one for each data type. More on the differences is discussed below.

Figure 8.3
Click image to enlarge, or click here to open
Several data-specific decisions have to be made on the first page. Under "Select Column Separator(s)", select the delimiter that delimits columns. In many cases, this is the comma. Under "Number of text header lines", select the number of non-data rows that appear in the beginning of your data file. The text header lines will then not appear in the numerical data matrix.

Figure 8.4
Click image to enlarge, or click here to open
Note that the preview on the first page may not accurately depict the final matrices. The second screen of the Import Wizard shows a more realistic version of the final matrices.

The second page of the Import Wizard displays a preview of the parsed matrices from the data file. From here, matrices can be renamed and excluded for the final import. It is recommended to inspect all matrices and their sizes before proceeding with the import. When ready, hit the "Finish" button.

Figure 8.5
Click image to enlarge, or click here to open
After the import process is complete, the imported matrices will appear in the workspace.

It is sometimes desirable to create sub-matrices out of the imported ones, especially if the import process did not successfully interpret all of the data. For example, the file regions.csv clearly contains column headers (region name), row headers (dates), and numerical values (price in cents). During the import process row and column headers were not identified, and instead were cast into one large text matrix. The following expressions disect the text matrix for easy processing later:

The first row of the textdata matrix contains column headers, including the field "Date" and region names. We extract the region names:

The first column of the textdata matrix contains row headers, including the field "Date" and individual dates. We extract the dates:

There is no need to further process the data matrix.

Figure 8.6
Click image to enlarge, or click here to open

8.4 Numerical data versus String data

Matlab uses several data types for differently typed data, and depending on the type, certain operations are allowed and others are not. The predominant type is "Double", which can be used for any numerical data: real, rational, and natural. "Uint8" is another numerical type, which is constrained to natural numbers in the range of 0..255.

"Char" is a data type used for non-numerical data, i.e. textual data. Most imported data sets contain textual data, such as header lines. Manipulation of "Cell" data is somewhat different from numerical data. Below is a comparison of numerical versus textual data:

Operation Numerical Textual
Scalar type Double Char
Assignment a=5 a='hello'
Multidimensional type Double Cell
Vector b=[1,5,2] b={'abc','def','ghi'}
Indexing b(2) b{2}
Matrix c=[1,2,3;4,5,6] c={'abc','def','ghi';'jkl','mno','pqr'}
Addition d=3+4 d=strcat('ab','cd')
Conversion in between str2num('52.23') num2str(6)
(Example) s='52.23'; 5+str2num(s), i=6; strcat('Hello Nr.', num2str(i))
Table 8.1
Figure 8.7: Doubles, Chars, and Cells
Click image to enlarge, or click here to open
Figure 8.8: Addition and Concatenation
Click image to enlarge, or click here to open
There are many other functions by which char and cell can be manipulated. The main application for text manipulation for our purposes is plotting and meshing, especially for assigning x,y,z labels, titles, etc. For example, when column headers have been imported as textual data (cells), we are now able to manipulate and use them for the purpose of building bar graphs, plots, etc.

Figure 8.9: Indexing
Click image to enlarge, or click here to open