Aqua Phoenix

6.5 String Tokenizer

When reading text files line by line, it is usually desirable to tokenize the data. Tokenizing means splitting the data on a common separator (a delimiter), e.g. a comma. For example, the following line of text can be split into separate fields:

10/10/2003,New York,60,55
Tokenizing on the comma results in the fields:
  • 10/10/2003
  • New York
  • 60
  • 55

The first field can be tokenized further on the slash (/) delimiter, resulting in:

  • 10
  • 10
  • 2003

Tokenizing is commonly performed on CSV (comma-separated values) files, e.g. those exported from spreadsheets.
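The two-level split described above (first on commas, then on slashes) can be sketched with the StringTokenizer class that the rest of this section describes. This is a minimal sketch: the class name NestedTokenizeDemo and the helper tokenize are hypothetical names chosen for illustration, not part of the Java library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; the class and method names are illustrative only.
class NestedTokenizeDemo {

    // Splits text on the given delimiter characters and returns the tokens in order.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // First level: split the whole line on commas.
        List<String> fields = tokenize("10/10/2003,New York,60,55", ",");
        System.out.println(fields);    // [10/10/2003, New York, 60, 55]

        // Second level: split the first field on slashes.
        List<String> dateParts = tokenize(fields.get(0), "/");
        System.out.println(dateParts); // [10, 10, 2003]
    }
}
```

Because the delimiter for the first pass is only the comma, the space inside "New York" does not split that field.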

Java includes a string tokenizer class named StringTokenizer. It is part of the package java.util, so to use it we must import that package:

import java.util.*;
To instantiate a StringTokenizer, we pass a line of text to its constructor and also specify the delimiter (each character of the second argument is treated as a delimiter):

StringTokenizer st = new StringTokenizer("Tokenize,me,on,commas,because,I,have,so,many,of,them", ",");
Once the StringTokenizer has been constructed with a String of text and a delimiter, we can iterate over the tokens and extract each one:

while (st.hasMoreTokens()) {
  System.out.println(st.nextToken());
}
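One behavior worth knowing when tokenizing CSV data this way: StringTokenizer treats consecutive delimiters as a single separator, so empty fields are silently skipped. The sketch below compares its token count with String.split on a line that has an empty second field. The class EmptyFieldDemo, its method names, and the sample line are assumptions made for illustration.

```java
import java.util.StringTokenizer;

// Hypothetical demo class; names and sample data are illustrative only.
class EmptyFieldDemo {

    // Number of tokens StringTokenizer reports when splitting on commas.
    static int tokenizerCount(String line) {
        return new StringTokenizer(line, ",").countTokens();
    }

    // Number of fields String.split reports; the limit -1 keeps trailing empty fields.
    static int splitCount(String line) {
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        String line = "1994-11-28,,Region"; // the second field is empty
        System.out.println(tokenizerCount(line)); // 2: the empty field is skipped
        System.out.println(splitCount(line));     // 3: the empty field is kept
    }
}
```

If the position of each field matters and fields may be empty, checking countTokens() against the expected number of columns is one way to detect such lines.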
We can combine reading from a text file with tokenizing each line. Given the following dataset in a file, we can read each line, tokenize it, and store the tokens in some data structure.

Contents of file dataset.csv:

1994-11-28,1P,Region,11,120.8,4,4,1994
1994-12-05,1P,Region,12,118.3,4,1,1994
1994-12-12,1P,Region,12,116.0,4,2,1994
1994-12-19,1P,Region,12,114.1,4,3,1994
1994-12-26,1P,Region,12,113.4,4,4,1994
1994-11-28,1B,Region,11,126.2,4,4,1994
1994-12-05,1B,Region,12,124.9,4,1,1994
1994-12-12,1B,Region,12,123.0,4,2,1994
1994-12-19,1B,Region,12,121.5,4,3,1994
1994-12-26,1B,Region,12,121.1,4,4,1994
1994-11-28,2P,Region,11,112.2,4,4,1994
1994-12-05,2P,Region,12,108.6,4,1,1994
1994-12-12,2P,Region,12,105.7,4,2,1994
Read file, tokenize, and print out:

// Requires: import java.io.*; and import java.util.*;
String line;
// try-with-resources closes the reader even if reading fails
try (BufferedReader bufferedReader = new BufferedReader(new FileReader("dataset.csv"))) {
  while ((line = bufferedReader.readLine()) != null) {
    // Create a fresh tokenizer for each line, splitting on commas
    StringTokenizer st = new StringTokenizer(line, ",");
    while (st.hasMoreTokens()) {
      System.out.print(st.nextToken() + "  ");
      // or put in data structure
    }
    System.out.println();
  }
} catch (IOException e) {
  // Covers both a missing file and a read error
  e.printStackTrace();
}
Output:

1994-11-28  1P  Region  11  120.8  4  4  1994 
1994-12-05  1P  Region  12  118.3  4  1  1994 
1994-12-12  1P  Region  12  116.0  4  2  1994 
1994-12-19  1P  Region  12  114.1  4  3  1994 
1994-12-26  1P  Region  12  113.4  4  4  1994 
1994-11-28  1B  Region  11  126.2  4  4  1994 
1994-12-05  1B  Region  12  124.9  4  1  1994 
1994-12-12  1B  Region  12  123.0  4  2  1994 
1994-12-19  1B  Region  12  121.5  4  3  1994 
1994-12-26  1B  Region  12  121.1  4  4  1994 
1994-11-28  2P  Region  11  112.2  4  4  1994 
1994-12-05  2P  Region  12  108.6  4  1  1994 
1994-12-12  2P  Region  12  105.7  4  2  1994 
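Instead of printing the data to the screen, we could have converted the individual fields to other data types (e.g. int or double). Because each token is a String, this is done by parsing, e.g. with Integer.parseInt or Double.parseDouble, rather than by casting. The converted values can then be stored in a data object for later use.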
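A minimal sketch of that idea follows, parsing one line of dataset.csv into a typed object. The source does not name the columns, so the class Observation and all of its field names (including month and year, which are guesses based on the values) are hypothetical.

```java
import java.util.StringTokenizer;

// Hypothetical data object; the column meanings are not given in the dataset,
// so every field name here is an assumption.
class Observation {
    String date;
    String code;
    String label;
    int month;
    double value;
    int fieldSix;
    int fieldSeven;
    int year;

    // Tokenizes one comma-separated line and converts each token to its target type.
    static Observation parse(String line) {
        StringTokenizer st = new StringTokenizer(line, ",");
        if (st.countTokens() != 8) {
            throw new IllegalArgumentException("Expected 8 fields: " + line);
        }
        Observation o = new Observation();
        o.date = st.nextToken();
        o.code = st.nextToken();
        o.label = st.nextToken();
        o.month = Integer.parseInt(st.nextToken());
        o.value = Double.parseDouble(st.nextToken());
        o.fieldSix = Integer.parseInt(st.nextToken());
        o.fieldSeven = Integer.parseInt(st.nextToken());
        o.year = Integer.parseInt(st.nextToken());
        return o;
    }

    public static void main(String[] args) {
        Observation o = Observation.parse("1994-11-28,1P,Region,11,120.8,4,4,1994");
        System.out.println(o.code + " " + o.value); // 1P 120.8
    }
}
```

Integer.parseInt and Double.parseDouble throw NumberFormatException on malformed input, so a real reader loop would likely catch that exception and report or skip the offending line.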