1. Active Duration

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	No	Comments
Duplicate rows > 10%?	✓	93.38% unique rows
Missing values > 50% for one or more columns?	✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?	✓	Latest date: 29-Jun-2019
Data contains PII (Personally Identifiable Information)?	✓
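The three threshold checks in the scorecard can be illustrated with a minimal sketch. It is not the tooling used to produce this report: the function name `scorecard`, the row layout (a list of dicts with `None` for missing values), the sample column names, and the 183-day approximation of "6 months" are all assumptions made for illustration. The PII check is not covered.

```python
from datetime import date, timedelta

def scorecard(rows, date_column=None, as_of=None):
    """Evaluate the three threshold checks for a list of dict rows (None = missing)."""
    total = len(rows)
    columns = list(rows[0].keys())

    # Uniqueness: share of distinct rows, reported as "% unique rows"
    distinct = {tuple(sorted(row.items())) for row in rows}
    unique_share = len(distinct) / total

    # Completeness: columns where more than half of the values are missing
    sparse = [c for c in columns
              if sum(row[c] is None for row in rows) / total > 0.5]

    # Recency: only evaluated when a date column and a reference date are given
    stale = None
    if date_column is not None and as_of is not None:
        latest = max(row[date_column] for row in rows
                     if row[date_column] is not None)
        # Assumption: "6 months" is approximated as 183 days
        stale = latest < as_of - timedelta(days=183)

    return {
        "unique_row_pct": round(100 * unique_share, 2),
        "duplicates_over_10pct": (1 - unique_share) > 0.10,
        "columns_over_50pct_missing": len(sparse),
        "latest_update_over_6_months_old": stale,
    }

# Hypothetical sample rows; the column names are placeholders
sample = [
    {"date": date(2019, 6, 28), "company_name": "UPS", "active_duration": 12.0},
    {"date": date(2019, 6, 29), "company_name": "Lyft", "active_duration": None},
    {"date": date(2019, 6, 29), "company_name": "Lyft", "active_duration": None},
    {"date": date(2019, 6, 29), "company_name": "Target", "active_duration": 3.5},
]
result = scorecard(sample, date_column="date", as_of=date(2019, 9, 1))
```

The reference date `as_of` has to be supplied by the caller, because the report does not state the date on which the recency check was run.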

1.1. Metadata Summary

1.1.1. Data Dimensions

Number of Columns: 4

Number of Rows: 101,886,928

Date Range: 01-Jan-2012 to 29-Jun-2019

1.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

1.1.3. Data Subset

Sample 6 rows of the dataset.

1.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values
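As a rough illustration of the five completeness measures defined above, the sketch below computes them for a small list of rows. The function names, the row layout (dicts with `None` for missing values), and the rule that a column is continuous when all of its non-missing values are numeric are assumptions, not the report's actual method. Under this rule, numeric identifiers would be counted as continuous, so a real profile would likely need an explicit column-type list.

```python
def is_number(value):
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def completeness_summary(rows):
    """Summarise completeness for a list of dict rows (None = missing)."""
    columns = list(rows[0].keys())
    summary = {
        "discrete_columns": [],
        "continuous_columns": [],
        "all_missing_columns": [],
    }
    missing_observations = 0
    for column in columns:
        present = [row[column] for row in rows if row[column] is not None]
        missing_observations += len(rows) - len(present)
        if not present:
            summary["all_missing_columns"].append(column)
        elif all(is_number(v) for v in present):
            summary["continuous_columns"].append(column)
        else:
            summary["discrete_columns"].append(column)
    # A complete row has no missing value in any column
    summary["complete_rows"] = sum(
        all(v is not None for v in row.values()) for row in rows
    )
    summary["missing_observations"] = missing_observations
    return summary

# Hypothetical sample rows; the column names are placeholders
sample = [
    {"company_name": "UPS", "active_duration": 12.0, "lei": None},
    {"company_name": "Lyft", "active_duration": None, "lei": None},
    {"company_name": "Target", "active_duration": 3.5, "lei": None},
]
summary = completeness_summary(sample)
```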

1.1.5. Missing Data

There are no missing records present in the data.

1.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

1.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the frequency of each bin is plotted on the Y-axis.
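The binning step can be sketched as follows. This is a minimal equal-width version under assumed choices (the function name `histogram`, the bin count, and the rule that the maximum value goes into the last bin); the report does not state which binning rule was used for its figures.

```python
def histogram(values, bins):
    """Return (edges, counts) for equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: every value is identical, so one bin holds them all
        return [lo, hi], [len(values)]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Hypothetical values, not taken from the dataset
edges, counts = histogram([1, 2, 2, 3, 5, 8, 9, 10], bins=3)
```

The `edges` list supplies the X-axis bin boundaries and `counts` supplies the Y-axis frequencies.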

1.2.1.1 Column - active_duration

The active_duration column gives the average number of days each posting was active in a given grouping.

1.2.2. Frequency Counts - Categorical Variable(s)

1.2.2.1 Column - company_names

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 37,264 distinct company names in the data.

Top 10		Bottom 10
Company Name	Frequency	Company Name	Frequency
Cardinal Health	13,684	Greater Seattle Chamber of Commerce	2,716
Lyft	10,968	Hocking Technical College	2,717
Target	10,953	New Naschitti Elementary School	2,717
Cardinal Logistics	10,949	Santa Clara University	2,717
Mesilla Valley Transportation	10,943	Teletronics Technology Corporation	2,717
UPS	10,941	Wheaton College	2,717
AXA	8,227	First Atlantic Health Care	2,718
St. Joseph Health System	8,226	Gwinnett County Public Schools	2,718
Yavapai Regional Medical Center	8,226	Hancock Regional Hospital	2,718
Mosaic	8,225	Tahoe Truckee Unified School District	2,718
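A top 10 / bottom 10 frequency table of this kind can be derived as in the sketch below. The function name `top_and_bottom` and the tie-breaking rule (alphabetical within equal frequencies, which is what the ordering in the table appears to follow) are assumptions.

```python
from collections import Counter

def top_and_bottom(values, n=10):
    """Return (top n, bottom n, number of distinct values) by frequency."""
    counts = Counter(values)
    # Top list: descending frequency, ties broken alphabetically
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    # Bottom list: ascending frequency, ties broken alphabetically
    bottom = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]
    return top, bottom, len(counts)

# Hypothetical column values, not the real frequencies
names = ["UPS"] * 3 + ["Lyft"] * 2 + ["Target"] * 2 + ["AXA"]
top, bottom, distinct = top_and_bottom(names, n=2)
```

The third return value corresponds to the "distinct company names" figure quoted above the table.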

1.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend over the period January 2012 to June 2019.

2. Company Reference

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	Yes	No	Comments
Duplicate rows > 10%?		✓	100% unique rows
Missing values > 50% for one or more columns?	✓		4 columns have missing values > 50%
Most recent update is more than 6 months old?		✓	No date columns in data
Data contains PII (Personally Identifiable Information)?		✓

2.1. Metadata Summary

2.1.1. Data Dimensions

Number of Columns: 10

Number of Rows: 61,209

Date Range: 24-Feb-2016 to 31-Aug-2019

2.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

2.1.3. Data Subset

Sample 6 rows of the dataset.

2.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

2.1.5. Missing Data

There are no missing records present in the data.

2.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

2.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the frequency of each bin is plotted on the Y-axis.

There are no relevant columns to plot as a histogram.

2.2.2. Frequency Counts - Categorical Variable(s)

2.2.2.1 Column - company_name

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 56,638 distinct companies in the data.

Top 10		Bottom 10
Company Name	Frequency	Company Name	Frequency
Mesilla Valley Transportation	10	’ike Group	1
CDW	9	!ndigo	1
Averitt Express	7	[24]7 Inc	1
Pizza Hut	7	[X+1]	1
Ruan	7	01 Communique Laboratory Inc	1
UPS	7	1-800-FLOWERS.COM, Inc	1
Western Express	7	1-800-GOT-JUNK	1
Advantage Solutions	6	1-800-Sweeper/800Sweeper, LLC	1
Celadon	6	1-800 CONTACTS	1
Express Scripts	6	1-800 CONTACTS, INC	1

2.2.2.2 Column - lei

The table below shows the frequency counts of the top 10 and bottom 10 LEIs.

There are about 4,278 distinct LEIs in the data.

Top 10		Bottom 10
lei	Frequency	lei	Frequency
4KF48RN45X1OO8UBLY20	29	01KWVG908KE7RKPTNP46	1
5493000GH5DTFC8LLR93	11	02CBKVAOND0BEIOF0V84	1
4YV9Y5M8S0BRK1RP0397	7	03D0JEWFDFUS0SEEKG89	1
5493008TXYN3II3PU369	7	04Y1L40RYNUCL020XS57	1
549300VHDC555R46LM46	6	05MQKGBWLLX7RPPDO189	1
GTJS1N8S8I28A7L4WG97	6	06BTX5UWZD0GQ5N5Y745	1
225YDZ14ZO8E1TXUSU86	5	08IRJODWFYBI7QWRGS31	1
5493007JDSMX8Z5Z1902	5	0IDE18EMH1CUKQCUYE69	1
54930080C93RVZRSDV26	5	0JRPR7R1EOV4Z2M47L22	1
549300A4FSM5CB5M5D32	5	0M6M8M9HXLW8D3AU0N32	1

2.2.2.3 Column - open_perm_id

The table below shows the frequency counts of the top 10 and bottom 10 open perm IDs.

There are about 23,749 distinct open perm IDs in the data.

Top 10		Bottom 10
perm_id	Frequency	perm_id	Frequency
4295905360	29	-5040258481	1
4297833085	17	-5059003739	1
4297113611	15	102100625	1
4297213968	14	114840374	1
5038021810	13	21521133901	1
5040952276	13	296469794	1
5044197001	13	4294988003	1
4295349307	12	4295001381	1
4297385008	11	4295011269	1
4295904886	9	4295012519	1

2.2.2.4 Column - naics_code

The table below shows the frequency counts of the top 10 and bottom 10 NAICS codes.

There are about 916 distinct NAICS codes in the data.

Top 10		Bottom 10
NAICS Code	Frequency	NAICS Code	Frequency
611110	1031	111211	1
622110	664	111219	1
999990	499	111320	1
611310	405	111421	1
541511	301	112111	1
443142	292	112340	1
621111	273	114111	1
522110	253	115310	1
541512	230	211112	1
921120	212	212230	1

2.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend over the period February 2016 to August 2019.

3. Daily Created

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	No	Comments
Duplicate rows > 10%?	✓	93.2% unique rows
Missing values > 50% for one or more columns?	✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?	✓	Latest date: 29-Jun-2019
Data contains PII (Personally Identifiable Information)?	✓

3.1. Metadata Summary

3.1.1. Data Dimensions

Number of Columns: 4

Number of Rows: 102,340,815

Date Range: 01-Jan-2012 to 29-Jun-2019

3.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

3.1.3. Data Subset

Sample 6 rows of the dataset.

3.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

3.1.5. Missing Data

3.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

3.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the frequency of each bin is plotted on the Y-axis.

3.2.1.1 Column - created_job_count

The created_job_count column gives the number of new job postings created for the given grouping.

Note: Figure 2 is a zoomed-in version of Figure 1 (showing values greater than 2 and less than 20).

3.2.2. Frequency Counts - Categorical Variable(s)

3.2.2.1 Column - company_names

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 36,679 distinct company names in the data.

Top 10		Bottom 10
Company Name	Frequency	Company Name	Frequency
Cardinal Health	13,395	Healthcare Risk Advisors	2,209
UPS	11,163	Milan Laser Hair Removal	2,209
Schneider	11,045	Zurchers	2,211
Lyft	10,782	Acme Brick Tile & More	2,215
Target	10,662	Horizon Connects	2,216
Cardinal Logistics	10,608	Blue Sky Pest Control	2,217
Mesilla Valley Transportation	10,417	Hacker USA	2,218
Averitt Express	9,184	The Third Floor Inc	2,218
Community Medical Centers	9,140	Ministry of Social Development	2,220
St. Joseph Health System	8,914	Stratton Amenities	2,221

3.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend over the period January 2012 to June 2019.

4. Daily Deleted

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	No	Comments
Duplicate rows > 10%?	✓	92.33% unique rows
Missing values > 50% for one or more columns?	✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?	✓	Latest date: 29-Jun-2019
Data contains PII (Personally Identifiable Information)?	✓

4.1. Metadata Summary

4.1.1. Data Dimensions

Number of Columns: 4

Number of Rows: 100,560,809

Date Range: 01-Jan-2012 to 29-Jun-2019

4.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

4.1.3. Data Subset

Sample 6 rows of the dataset.

4.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

4.1.5. Missing Data

4.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

4.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the frequency of each bin is plotted on the Y-axis.

4.2.1.1 Column - deleted_job_count

The deleted_job_count column gives the number of job postings removed for the given grouping.

Note: Figure 2 is a zoomed-in version of Figure 1 (showing values greater than 2 and less than 20).

4.2.2. Frequency Counts - Categorical Variable(s)

4.2.2.1 Column - company_names

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 36,038 distinct company names in the data.

Top 10		Bottom 10
Company Name	Frequency	Company Name	Frequency
Cardinal Health	13,704	Atlanta Braves MLB	2,706
Target	10,995	Barge Waggoner Sumner & Cannon	2,706
Mesilla Valley Transportation	10,937	CGI Group Inc	2,706
UPS	10,925	Fort Miller	2,706
Lyft	10,923	Worley Catastrophe Response	2,707
Schneider	10,913	Zitter Health Insights	2,707
Cardinal Logistics	10,880	Alverno Laboratories	2,708
USA Truck	8,262	Attic Angel	2,708
Ruan	8,259	Cabot Corporation	2,708
XPO Logistics	8,259	CBRE UK	2,708

4.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend over the period January 2012 to June 2019.

5. Ticker Reference

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	Yes	No	Comments
Duplicate rows > 10%?		✓	100% unique rows
Missing values > 50% for one or more columns?	✓		2 columns have missing values > 50%
Most recent update is more than 6 months old?		✓	No date columns in data
Data contains PII (Personally Identifiable Information)?		✓

5.1. Metadata Summary

5.1.1. Data Dimensions

Number of Columns: 9

Number of Rows: 66,402

Date Range: 26-Nov-1968 to 29-Jul-2019

5.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

5.1.3. Data Subset

Sample 6 rows of the dataset.

5.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

5.1.5. Missing Data

There are no missing records present in the data.

5.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

5.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the frequency of each bin is plotted on the Y-axis.

There are no relevant columns to plot as a histogram.

5.2.2. Frequency Counts - Categorical Variable(s)

5.2.2.1 Column - company_id

The table below shows the frequency counts of the top 10 and bottom 10 company IDs.

There are about 0 distinct company IDs in the data.

Top 10		Bottom 10
Company ID	Frequency	Company ID	Frequency
32373	36	10007	1
14034	35	1015	1
3580	35	10291	1
3745	34	10297	1
43173	34	10323	1
54945	34	1047	1
56056	28	10584	1
10736	25	1077	1
19352	25	10797	1
6568	25	10823	1

5.2.2.2 Column - stock_ticker

The table below shows the frequency counts of the top 10 and bottom 10 stock tickers.

There are about 22,124 distinct stock tickers in the data.

Top 10		Bottom 10
Stock Ticker	Frequency	Stock Ticker	Frequency
HCA	422	000046	1
IBM	350	000050	1
WMT	252	000120	1
UNH	244	000150	1
XRX	225	000166	1
BRK	224	000660	1
BRKB	224	000725	1
ORCL	192	000728	1
UTX	168	001	1
JNJ	160	002371	1

5.2.2.3 Column - stock_exchange_country

The table below shows the frequency counts of the top 10 and bottom 10 stock exchange countries.

There are about 78 distinct stock exchange countries in the data.

Top 10		Bottom 10
Stock Exchange Country	Frequency	Stock Exchange Country	Frequency
US	19,203	CY	1
DE	17,495	EG	1
GB	8,175	IS	1
MX	4,382	JM	1
CA	3,082	KW	1
AT	2,801	MU	1
CH	2,664	OM	1
FR	1,123	PK	1
JP	993	RS	1
CL	844	SK	1

5.2.2.4 Column - stock_exchange_name

The table below shows the frequency counts of the top 10 and bottom 10 stock exchange names.

There are about 113 distinct stock exchange names in the data.

Top 10		Bottom 10
Stock Exchange Name	Frequency	Stock Exchange Name	Frequency
FRA	11,969	BEL	1
NYS	9,355	BRA	1
NAS	9,056	BURG	1
LON	8,172	CAI	1
MEX	4,382	CYS	1
BER	2,864	DAR	1
WBO	2,801	DFM	1
SWX	2,613	ICE	1
TSX	1,509	IEXG	1
TSE	1,507	IST	1

5.2.2.5 Column - primary_flag

5.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend over the period January 2012 to August 2019.

6. Unique Active

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	No	Comments
Duplicate rows > 10%?	✓	92.61% unique rows
Missing values > 50% for one or more columns?	✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?	✓	Latest date: 29-Jun-2019
Data contains PII (Personally Identifiable Information)?	✓

6.1. Metadata Summary

6.1.1. Data Dimensions

Number of Columns: 4

Number of Rows: 101,889,662

Date Range: 01-Jan-2012 to 29-Jun-2019

6.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

6.1.3. Data Subset

Sample 6 rows of the dataset.

6.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

6.1.5. Missing Data

6.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

6.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the frequency of each bin is plotted on the Y-axis.

6.2.1.1 Column - unique_active_job_count

The unique_active_job_count column gives the number of unique job postings that were active for the given grouping.

Note: Figure 2 is a zoomed-in version of Figure 1 (showing values greater than 2 and less than 20).

6.2.2. Frequency Counts - Categorical Variable(s)

6.2.2.1 Column - company_names

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 36,540 distinct company names in the data.

Top 10		Bottom 10
Company Name	Frequency	Company Name	Frequency
Cardinal Health	13,691	Casey’s Cupcakes	2,712
Lyft	10,954	Castlight Health	2,712
Mesilla Valley Transportation	10,935	DirectMedica	2,712
Target	10,935	Transaction Network Services	2,712
Cardinal Logistics	10,931	Capita	2,713
UPS	10,916	City of Carlsbad	2,713
Knight Transportation	8,223	Jaunt	2,713
Constellation Brands	8,222	Nutmeg	2,713
Schneider	8,222	Puratos	2,713
St. Joseph Health System	8,217	US Storage Centers	2,713

6.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend over the period January 2012 to June 2019.

Appendix 1: Data Summary

1 Univariate Analysis

Univariate Analysis involves the analysis of one variable at a time.

1.1 Histogram

A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.

1.2 Density Plot

A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
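The kernel smoothing behind a density plot can be sketched with a Gaussian kernel. The function name `gaussian_kde`, the choice of kernel, and the fixed bandwidth are assumptions for illustration; the report does not say which kernel or bandwidth its plots use.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a density function built from one Gaussian bump per observation."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum the kernel contributions of every observation at point x
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density

# Hypothetical sample; a smaller bandwidth gives a spikier curve
density = gaussian_kde([1.0, 2.0, 3.0], bandwidth=0.5)
curve = [(x / 10, density(x / 10)) for x in range(0, 41)]
```

Evaluating `density` on a grid of X values, as in `curve`, gives the points of the smoothed line.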

1.3 QQ Plot

A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
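The pairs of quantiles plotted in a normal QQ plot can be computed as in the sketch below. It assumes a normal reference distribution fitted to the sample's mean and standard deviation, and plotting positions of (i - 0.5) / n; both are common conventions rather than choices confirmed by this report, and the function name `qq_points` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal QQ plot."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n avoid the undefined 0 and 1 quantiles
    return [
        (reference.inv_cdf((i - 0.5) / n), value)
        for i, value in enumerate(ordered, start=1)
    ]

# Hypothetical sample values
points = qq_points([2, 4, 4, 5, 7, 9])
```

Points that lie close to the line y = x suggest the sample is approximately normal; systematic curvature at the ends suggests skew or heavy tails.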

2 Bivariate Analysis

Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.

2.1 Correlation

A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
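A correlation matrix of this kind can be sketched with the Pearson coefficient. The function names and the sample column names are hypothetical, and Pearson is an assumption, since the report does not name the coefficient it uses.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def correlation_matrix(columns):
    """Return a nested dict holding the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric columns
matrix = correlation_matrix({
    "created": [1, 2, 3, 4],
    "deleted": [2, 4, 6, 8],
    "active": [4, 3, 2, 1],
})
```

A constant column has zero spread and would cause a division by zero here, so such columns would need to be excluded beforehand.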

3 Trend Charts

Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
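The aggregation behind monthly and yearly trend charts can be sketched as follows, assuming the input is a list of (date, count) pairs at daily periodicity. The function names and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date

def monthly_totals(daily):
    """Sum daily (date, count) pairs into {(year, month): total}."""
    totals = defaultdict(int)
    for day, count in daily:
        totals[(day.year, day.month)] += count
    return dict(sorted(totals.items()))

def yearly_totals(daily):
    """Sum daily (date, count) pairs into {year: total}."""
    totals = defaultdict(int)
    for day, count in daily:
        totals[day.year] += count
    return dict(sorted(totals.items()))

# Hypothetical daily counts
daily = [
    (date(2012, 1, 1), 5),
    (date(2012, 1, 2), 7),
    (date(2012, 2, 1), 3),
    (date(2013, 1, 1), 4),
]
by_month = monthly_totals(daily)
by_year = yearly_totals(daily)
```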

Appendix 2: Data Quality Scorecard

Tresvista's Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.

Dimensions of Data Quality

A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.

1. Availability - Sufficient availability of data points

Availability means ensuring that enough data is available to end users and applications, when and where they need it for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training, testing, and validating models.

2. Completeness - All required data is captured, and no data is missing

Completeness is measured as the percentage of records with non-NULL values. It can also be termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.

3. Uniqueness - Data redundancy should be avoided

Uniqueness means that no duplicate data should be reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.

4. Recency - Data is recent and not outdated

Recency is the degree to which information is current. It measures how up-to-date the information is, and whether it is still correct despite possible time-related changes.

5. Consistency - Uniform data types across the column

Consistency refers to the data values in one column being uniform across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column-level consistency).
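Two of the dimensions above, key-based uniqueness and column-level consistency, can be illustrated with a short sketch. It treats consistency narrowly as "all non-missing values in a column share one Python type", which is an assumed simplification of the definition given here; the function names, the key columns, and the sample values are hypothetical.

```python
from collections import Counter

def type_profile(values):
    """Count the Python types of the non-missing values in one column."""
    return Counter(type(v).__name__ for v in values if v is not None)

def is_consistent(values):
    # Consistent here means at most one type among the non-missing values
    return len(type_profile(values)) <= 1

def duplicate_keys(rows, key_columns):
    """Return the key tuples that identify more than one row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical rows; the second and third share the same key
rows = [
    {"date": "2019-06-28", "company_name": "UPS", "count": 4},
    {"date": "2019-06-29", "company_name": "Lyft", "count": 2},
    {"date": "2019-06-29", "company_name": "Lyft", "count": 3},
]
repeated = duplicate_keys(rows, ["date", "company_name"])
mixed = is_consistent([611110, "622110", None])
```

A column that mixes integers and strings, as in the `mixed` example, would fail this check even when every value looks like a valid code.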

Product Structure Tree

1. Active Duration

1.1. Metadata Summary

1.1.1. Data Dimensions

1.1.2. Structure/Data Types

1.1.3. Data Subset

1.1.4. Data Completeness

1.1.5. Missing Data

1.2. Univariate Analysis

1.2.1. Histogram and Statistical Summary - Continuous Variable(s)

1.2.1.1 Column - active_duration

1.2.2. Frequency Counts - Categorical Variable(s)

1.2.2.1 Column - company_names

1.3. Time Series Analysis

2. Company Reference

2.1. Metadata Summary

2.1.1. Data Dimensions

2.1.2. Structure/Data Types

2.1.3. Data Subset

2.1.4. Data Completeness

2.1.5. Missing Data

2.2. Univariate Analysis

2.2.1. Histogram and Statistical Summary - Continuous Variable(s)

2.2.2. Frequency Counts - Categorical Variable(s)

2.2.2.1 Column - company_name

2.2.2.2 Column - lei

2.2.2.3 Column - open_perm_id

2.2.2.4 Column - naics_code

2.3. Time Series Analysis

3. Daily Created

3.1. Metadata Summary

3.1.1. Data Dimensions

3.1.2. Structure/Data Types

3.1.3. Data Subset

3.1.4. Data Completeness

3.1.5. Missing Data

3.2. Univariate Analysis

3.2.1. Histogram and Statistical Summary - Continuous Variable(s)

3.2.1.1 Column - created_job_count

3.2.2. Frequency Counts - Categorical Variable(s)

3.2.2.1 Column - company_names

3.3. Time Series Analysis

4. Daily Deleted

4.1. Metadata Summary

4.1.1. Data Dimensions

4.1.2. Structure/Data Types

4.1.3. Data Subset

4.1.4. Data Completeness

4.1.5. Missing Data

4.2. Univariate Analysis

4.2.1. Histogram and Statistical Summary - Continuous Variable(s)

4.2.1.1 Column - deleted_job_count

4.2.2. Frequency Counts - Categorical Variable(s)

4.2.2.1 Column - company_names

4.3. Time Series Analysis

5. Ticker Reference

5.1. Metadata Summary

5.1.1. Data Dimensions

5.1.2. Structure/Data Types

5.1.3. Data Subset

5.1.4. Data Completeness

5.1.5. Missing Data

5.2. Univariate Analysis

5.2.1. Histogram and Statistical Summary - Continuous Variable(s)

5.2.2. Frequency Counts - Categorical Variable(s)

5.2.2.1 Column - company_id

5.2.2.2 Column - stock_ticker

5.2.2.3 Column - stock_exchange_country

5.2.2.4 Column - stock_exchange_name

5.2.2.5 Column - primary_flag

5.3. Time Series Analysis

6. Unique Active

6.1. Metadata Summary

6.1.1. Data Dimensions

6.1.2. Structure/Data Types

6.1.3. Data Subset

6.1.4. Data Completeness

6.1.5. Missing Data

6.2. Univariate Analysis

6.2.1. Histogram and Statistical Summary - Continuous Variable(s)