water.parser
Class DParseTask

java.lang.Object
  extended by jsr166y.ForkJoinTask<java.lang.Void>
      extended by jsr166y.CountedCompleter
          extended by water.H2O.H2OCountedCompleter
              extended by water.DTask<T>
                  extended by water.DRemoteTask<T>
                      extended by water.MRTask<DParseTask>
                          extended by water.parser.DParseTask
All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, java.util.concurrent.Future<java.lang.Void>, Freezable, CustomParser.DataOut
Direct Known Subclasses:
SVMLightDParseTask

public class DParseTask
extends MRTask<DParseTask>
implements CustomParser.DataOut

Class responsible for actual parsing of the datasets. Works in two phases, first phase collects basic statistics and determines the column encodings of the dataset. Second phase the goes over all data again, encodes them and writes them to the result VA. The parser works distributed for CSV parsing, but is one node only for the XLS and XLSX formats (they are not fully our code).

See Also:
Serialized Form

Nested Class Summary
static class DParseTask.AtomicUnion
           
 
Field Summary
protected  boolean _map
           
static long[] powers10i
           
 
Fields inherited from class water.MRTask
_hi, _lo
 
Fields inherited from class water.DRemoteTask
_fs, _is_local, _keys
 
Fields inherited from class water.DTask
_cls, _eFromNode, _exception, _fname, _lineNum, _msg, _mth
 
Constructor Summary
DParseTask()
           
 
Method Summary
 void addColumns(int ncols)
           
 void addInvalidCol(int colIdx)
          Adds invalid value to the column.
 void addNumCol(int colIdx, double value)
          Adds double value to the column.
 void addNumCol(int colIdx, long number, int exp)
           
 void addStrCol(int colIdx, ValueString str)
          Adds string (enum) value to the column.
 DParseTask clone2()
           
protected  void createEnums()
           
 DParseTask createPassOne(Value dataset, ParseDataset job, CustomParser parser)
          Creates a phase one dparse task.
 DParseTask createPassTwo()
          Creates the second pass dparse task from a first phase one.
protected  water.parser.DParseTask.OutputStreamRecord[] createRecords(long firstRow, int rowsToParse)
          Returns the list of streams that should be used to store the given rows.
protected  void createValueArrayHeader()
          Creates the value header based on the calculated columns.
static boolean fitsIntoInt(double d)
           
 void init()
           
 void invalidLine(java.lang.String err)
           
 void invalidValue(int line, int col)
           
 boolean isString(int idx)
           
protected  DParseTask makePhase2Clone(ParseDataset.FileInfo finfo)
           
 void map(Key key)
          Map function for distributed parsing of the CSV files.
 long memOverheadPerChunk()
           
 void newLine()
          Advances to new line.
 void normalizeSigma()
           
 void passOne()
          Executes the phase one of the parser.
 void passTwo()
          Executes the phase two of the parser task.
 void phaseOneInitialize()
          Initialize phase one data structures with the appropriate number of columns.
 void phaseTwoInitialize()
          Initializes the phase two data.
static double pow10(int exp)
           
static long pow10i(int exp)
           
 void reduce(DParseTask dpt)
           
 void rollbackLine()
          Rolls back parsed line.
 void setColumnNames(java.lang.String[] colNames)
          Sets the column names and creates the array of the enums for each column.
 
Methods inherited from class water.MRTask
hi, lcompute, lo, lonCompletion, onExceptionalCompletion
 
Methods inherited from class water.DRemoteTask
alsoBlockFor, alsoBlockFor, clone, compute2, dfork, getFutures, invoke, invokeOnAllNodes, keys, merge, merge, merge, onCompletion, reduceAlsoBlock
 
Methods inherited from class water.DTask
copyOver, dinvoke, frozenType, getDException, hasException, logVerbose, newInstance, onAck, onAckAck, read, setException, toDocField, write, writeJSONFields
 
Methods inherited from class water.H2O.H2OCountedCompleter
compute, priority
 
Methods inherited from class jsr166y.CountedCompleter
addToPendingCount, compareAndSetPendingCount, complete, exec, getCompleter, getPendingCount, getRawResult, setCompleter, setPendingCount, setRawResult, tryComplete
 
Methods inherited from class jsr166y.ForkJoinTask
adapt, adapt, adapt, cancel, compareAndSetForkJoinTaskTag, completeExceptionally, fork, get, get, getException, getForkJoinTaskTag, getPool, getQueuedTaskCount, getSurplusQueuedTaskCount, helpQuiesce, inForkJoinPool, invoke, invokeAll, invokeAll, invokeAll, isCancelled, isCompletedAbnormally, isCompletedNormally, isDone, join, peekNextLocalTask, pollNextLocalTask, pollTask, quietlyComplete, quietlyInvoke, quietlyJoin, reinitialize, setForkJoinTaskTag, tryUnfork
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface water.Freezable
frozenType, newInstance, read, toDocField, write, writeJSONFields
 

Field Detail

_map

protected boolean _map

powers10i

public static long[] powers10i
Constructor Detail

DParseTask

public DParseTask()
Method Detail

addColumns

public void addColumns(int ncols)

memOverheadPerChunk

public long memOverheadPerChunk()
Overrides:
memOverheadPerChunk in class MRTask<DParseTask>

clone2

public DParseTask clone2()

createRecords

protected water.parser.DParseTask.OutputStreamRecord[] createRecords(long firstRow,
                                                                     int rowsToParse)
Returns the list of streams that should be used to store the given rows. None of the returned streams is initialized.


isString

public boolean isString(int idx)
Specified by:
isString in interface CustomParser.DataOut

makePhase2Clone

protected DParseTask makePhase2Clone(ParseDataset.FileInfo finfo)

createPassOne

public DParseTask createPassOne(Value dataset,
                                ParseDataset job,
                                CustomParser parser)
Creates a phase one dparse task.

Parameters:
dataset - Dataset to parse.
resultKey - VA to store results to.
parserType - Parser type to use.
Returns:
Phase one DRemoteTask object.

passOne

public void passOne()
Executes the phase one of the parser. First phase detects the encoding and basic statistics of the parsed dataset. For CSV parsers it detects the parser setup and then launches the distributed computation on per chunk basis. For XLS and XLSX parsers that do not work in distrubuted way parses the whole datasets.

Throws:
java.lang.Exception

createPassTwo

public DParseTask createPassTwo()
Creates the second pass dparse task from a first phase one.


passTwo

public void passTwo()
Executes the phase two of the parser task. In phase two the data is encoded to the final VA, which is then created properly at the end. For CSV launches the distributed computation. For XLS and XLSX parsers computes all the chunks itself as there is no option for their distributed processing.


normalizeSigma

public void normalizeSigma()

createValueArrayHeader

protected void createValueArrayHeader()
Creates the value header based on the calculated columns. Also stores the header to its appropriate key. This will be the VA header of the parsed dataset.


createEnums

protected void createEnums()

init

public void init()
Overrides:
init in class MRTask<DParseTask>

setColumnNames

public void setColumnNames(java.lang.String[] colNames)
Sets the column names and creates the array of the enums for each column.

Specified by:
setColumnNames in interface CustomParser.DataOut

phaseOneInitialize

public void phaseOneInitialize()
Initialize phase one data structures with the appropriate number of columns.


phaseTwoInitialize

public void phaseTwoInitialize()
Initializes the phase two data.


map

public void map(Key key)
Map function for distributed parsing of the CSV files. In first phase it calculates the min, max, means, encodings and other statistics about the dataset, determines the number of columns. The second pass then encodes the parsed dataset to the result key, splitting it into equal sized chunks.

Specified by:
map in class MRTask<DParseTask>

reduce

public void reduce(DParseTask dpt)
Specified by:
reduce in class DRemoteTask<DParseTask>

pow10

public static double pow10(int exp)

pow10i

public static long pow10i(int exp)

fitsIntoInt

public static final boolean fitsIntoInt(double d)

newLine

public void newLine()
Advances to new line. In phase two it also must make sure that the

Specified by:
newLine in interface CustomParser.DataOut

rollbackLine

public void rollbackLine()
Rolls back parsed line. Useful for CsvParser when it parses new line that should not be added. It can easily revert it by this.

Specified by:
rollbackLine in interface CustomParser.DataOut

addNumCol

public void addNumCol(int colIdx,
                      double value)
Adds double value to the column.

Specified by:
addNumCol in interface CustomParser.DataOut
Parameters:
colIdx -
value -

addInvalidCol

public void addInvalidCol(int colIdx)
Adds invalid value to the column.

Specified by:
addInvalidCol in interface CustomParser.DataOut

addStrCol

public void addStrCol(int colIdx,
                      ValueString str)
Adds string (enum) value to the column.

Specified by:
addStrCol in interface CustomParser.DataOut

addNumCol

public void addNumCol(int colIdx,
                      long number,
                      int exp)
Specified by:
addNumCol in interface CustomParser.DataOut

invalidLine

public void invalidLine(java.lang.String err)
Specified by:
invalidLine in interface CustomParser.DataOut

invalidValue

public void invalidValue(int line,
                         int col)
Specified by:
invalidValue in interface CustomParser.DataOut