DParseTask

Overview

Package

Class

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

water.parser
Class DParseTask

java.lang.Object
  jsr166y.ForkJoinTask<java.lang.Void>
      jsr166y.CountedCompleter
          water.H2O.H2OCountedCompleter
              water.DTask<T>
                  water.DRemoteTask<T>
                      water.MRTask<DParseTask>
                          water.parser.DParseTask

All Implemented Interfaces:: java.io.Serializable, java.lang.Cloneable, java.util.concurrent.Future<java.lang.Void>, Freezable, CustomParser.DataOut

Direct Known Subclasses:: SVMLightDParseTask

public class DParseTask
extends MRTask<DParseTask>
implements CustomParser.DataOut
extends MRTask<DParseTask>
implements CustomParser.DataOut

Class responsible for actual parsing of the datasets. Works in two phases, first phase collects basic statistics and determines the column encodings of the dataset. Second phase the goes over all data again, encodes them and writes them to the result VA. The parser works distributed for CSV parsing, but is one node only for the XLS and XLSX formats (they are not fully our code).

See Also:: Serialized Form

Nested Class Summary
`static class`	`DParseTask.AtomicUnion`

Field Summary
`protected boolean`	`_map`
`static long[]`	`powers10i`

Fields inherited from class water.MRTask
`_hi, _lo`

Fields inherited from class water.DRemoteTask
`_fs, _is_local, _keys`

Fields inherited from class water.DTask
`_cls, _eFromNode, _exception, _fname, _lineNum, _msg, _mth`

Constructor Summary
`DParseTask()`

Method Summary
`void`	`addColumns(int ncols)`
`void`	`addInvalidCol(int colIdx)` Adds invalid value to the column.
`void`	`addNumCol(int colIdx, double value)` Adds double value to the column.
`void`	`addNumCol(int colIdx, long number, int exp)`
`void`	`addStrCol(int colIdx, ValueString str)` Adds string (enum) value to the column.
`DParseTask`	`clone2()`
`protected void`	`createEnums()`
`DParseTask`	`createPassOne(Value dataset, ParseDataset job, CustomParser parser)` Creates a phase one dparse task.
`DParseTask`	`createPassTwo()` Creates the second pass dparse task from a first phase one.
`protected water.parser.DParseTask.OutputStreamRecord[]`	`createRecords(long firstRow, int rowsToParse)` Returns the list of streams that should be used to store the given rows.
`protected void`	`createValueArrayHeader()` Creates the value header based on the calculated columns.
`static boolean`	`fitsIntoInt(double d)`
`void`	`init()`
`void`	`invalidLine(java.lang.String err)`
`void`	`invalidValue(int line, int col)`
`boolean`	`isString(int idx)`
`protected DParseTask`	`makePhase2Clone(ParseDataset.FileInfo finfo)`
`void`	`map(Key key)` Map function for distributed parsing of the CSV files.
`long`	`memOverheadPerChunk()`
`void`	`newLine()` Advances to new line.
`void`	`normalizeSigma()`
`void`	`passOne()` Executes the phase one of the parser.
`void`	`passTwo()` Executes the phase two of the parser task.
`void`	`phaseOneInitialize()` Initialize phase one data structures with the appropriate number of columns.
`void`	`phaseTwoInitialize()` Initializes the phase two data.
`static double`	`pow10(int exp)`
`static long`	`pow10i(int exp)`
`void`	`reduce(DParseTask dpt)`
`void`	`rollbackLine()` Rolls back parsed line.
`void`	`setColumnNames(java.lang.String[] colNames)` Sets the column names and creates the array of the enums for each column.

Methods inherited from class water.MRTask
`hi, lcompute, lo, lonCompletion, onExceptionalCompletion`

Methods inherited from class water.DRemoteTask
`alsoBlockFor, alsoBlockFor, clone, compute2, dfork, getFutures, invoke, invokeOnAllNodes, keys, merge, merge, merge, onCompletion, reduceAlsoBlock`

Methods inherited from class water.DTask
`copyOver, dinvoke, frozenType, getDException, hasException, logVerbose, newInstance, onAck, onAckAck, read, setException, toDocField, write, writeJSONFields`

Methods inherited from class water.H2O.H2OCountedCompleter
`compute, priority`

Methods inherited from class jsr166y.CountedCompleter
`addToPendingCount, compareAndSetPendingCount, complete, exec, getCompleter, getPendingCount, getRawResult, setCompleter, setPendingCount, setRawResult, tryComplete`

Methods inherited from class jsr166y.ForkJoinTask
`adapt, adapt, adapt, cancel, compareAndSetForkJoinTaskTag, completeExceptionally, fork, get, get, getException, getForkJoinTaskTag, getPool, getQueuedTaskCount, getSurplusQueuedTaskCount, helpQuiesce, inForkJoinPool, invoke, invokeAll, invokeAll, invokeAll, isCancelled, isCompletedAbnormally, isCompletedNormally, isDone, join, peekNextLocalTask, pollNextLocalTask, pollTask, quietlyComplete, quietlyInvoke, quietlyJoin, reinitialize, setForkJoinTaskTag, tryUnfork`

Methods inherited from class jsr166y.ForkJoinTask

adapt, adapt, adapt, cancel, compareAndSetForkJoinTaskTag, completeExceptionally, fork, get, get, getException, getForkJoinTaskTag, getPool, getQueuedTaskCount, getSurplusQueuedTaskCount, helpQuiesce, inForkJoinPool, invoke, invokeAll, invokeAll, invokeAll, isCancelled, isCompletedAbnormally, isCompletedNormally, isDone, join, peekNextLocalTask, pollNextLocalTask, pollTask, quietlyComplete, quietlyInvoke, quietlyJoin, reinitialize, setForkJoinTaskTag, tryUnfork

Methods inherited from class java.lang.Object
`equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Methods inherited from interface water.Freezable
`frozenType, newInstance, read, toDocField, write, writeJSONFields`

Field Detail

_map

protected boolean _map

powers10i

public static long[] powers10i

Constructor Detail

DParseTask

public DParseTask()

Method Detail

addColumns

public void addColumns(int ncols)

memOverheadPerChunk

public long memOverheadPerChunk()

Overrides:: memOverheadPerChunk in class MRTask<DParseTask>

clone2

public DParseTask clone2()

createRecords

protected water.parser.DParseTask.OutputStreamRecord[] createRecords(long firstRow,
                                                                     int rowsToParse)

Returns the list of streams that should be used to store the given rows. None of the returned streams is initialized.

isString

public boolean isString(int idx)

Specified by:: isString in interface CustomParser.DataOut

makePhase2Clone

protected DParseTask makePhase2Clone(ParseDataset.FileInfo finfo)

createPassOne

public DParseTask createPassOne(Value dataset,
                                ParseDataset job,
                                CustomParser parser)

Creates a phase one dparse task.

Parameters:: dataset - Dataset to parse.; resultKey - VA to store results to.; parserType - Parser type to use.
Returns:: Phase one DRemoteTask object.

passOne

public void passOne()

Executes the phase one of the parser. First phase detects the encoding and basic statistics of the parsed dataset. For CSV parsers it detects the parser setup and then launches the distributed computation on per chunk basis. For XLS and XLSX parsers that do not work in distrubuted way parses the whole datasets.

Throws:: java.lang.Exception

createPassTwo

public DParseTask createPassTwo()

Creates the second pass dparse task from a first phase one.

passTwo

public void passTwo()

Executes the phase two of the parser task. In phase two the data is encoded to the final VA, which is then created properly at the end. For CSV launches the distributed computation. For XLS and XLSX parsers computes all the chunks itself as there is no option for their distributed processing.

normalizeSigma

public void normalizeSigma()

createValueArrayHeader

protected void createValueArrayHeader()

Creates the value header based on the calculated columns. Also stores the header to its appropriate key. This will be the VA header of the parsed dataset.

createEnums

protected void createEnums()

init

public void init()

Overrides:: init in class MRTask<DParseTask>

setColumnNames

public void setColumnNames(java.lang.String[] colNames)

Sets the column names and creates the array of the enums for each column.

Specified by:: setColumnNames in interface CustomParser.DataOut

phaseOneInitialize

public void phaseOneInitialize()

Initialize phase one data structures with the appropriate number of columns.

phaseTwoInitialize

public void phaseTwoInitialize()

Initializes the phase two data.

map

public void map(Key key)

Map function for distributed parsing of the CSV files. In first phase it calculates the min, max, means, encodings and other statistics about the dataset, determines the number of columns. The second pass then encodes the parsed dataset to the result key, splitting it into equal sized chunks.

Specified by:: map in class MRTask<DParseTask>

reduce

public void reduce(DParseTask dpt)

Specified by:: reduce in class DRemoteTask<DParseTask>

pow10

public static double pow10(int exp)

pow10i

public static long pow10i(int exp)

fitsIntoInt

public static final boolean fitsIntoInt(double d)

newLine

public void newLine()

Advances to new line. In phase two it also must make sure that the

Specified by:: newLine in interface CustomParser.DataOut

rollbackLine

public void rollbackLine()

Rolls back parsed line. Useful for CsvParser when it parses new line that should not be added. It can easily revert it by this.

Specified by:: rollbackLine in interface CustomParser.DataOut

addNumCol

public void addNumCol(int colIdx,
                      double value)

Adds double value to the column.

Specified by:: addNumCol in interface CustomParser.DataOut

Parameters:: colIdx -; value -

addInvalidCol

public void addInvalidCol(int colIdx)

Adds invalid value to the column.

Specified by:: addInvalidCol in interface CustomParser.DataOut

addStrCol

public void addStrCol(int colIdx,
                      ValueString str)

Adds string (enum) value to the column.

Specified by:: addStrCol in interface CustomParser.DataOut

addNumCol

public void addNumCol(int colIdx,
                      long number,
                      int exp)

Specified by:: addNumCol in interface CustomParser.DataOut

invalidLine

public void invalidLine(java.lang.String err)

Specified by:: invalidLine in interface CustomParser.DataOut

invalidValue

public void invalidValue(int line,
                         int col)

Specified by:: invalidValue in interface CustomParser.DataOut

Overview

Package

Class

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

water.parser Class DParseTask

_map

powers10i

DParseTask

addColumns

memOverheadPerChunk

clone2

createRecords

isString

makePhase2Clone

createPassOne

passOne

createPassTwo

passTwo

normalizeSigma

createValueArrayHeader

createEnums

init

setColumnNames

phaseOneInitialize

phaseTwoInitialize

map

reduce

pow10

pow10i

fitsIntoInt

newLine

rollbackLine

addNumCol

addInvalidCol

addStrCol

addNumCol

invalidLine

invalidValue

water.parser
Class DParseTask