Market Research


Product Structure Tree

Number of Files Received: 6

1. Active Duration

Data Quality Scorecard

Parameter definitions are available in Appendix

| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | | | 93.38% unique rows |
| Missing values > 50% for one or more columns? | | | 0 columns have missing values > 50% |
| Most recent update more than 6 months old? | | | Latest date: 29-Jun-2019 |
| Data contains PII (Personally Identifiable Information)? | | | |

1.1. Metadata Summary

1.1.1. Data Dimensions

Number of Columns: 4

Number of Rows: 101,886,928

Date Range: 01-Jan-2012 to 29-Jun-2019

1.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

1.1.3. Data Subset

Sample 6 rows of the dataset.

1.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values
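The completeness measures defined above can be sketched in a few lines. This is a minimal illustration, not the tooling used to produce this report: the function name `completeness_summary`, the use of `None` to mark a missing value, and the sample rows are all assumptions made for the example.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def completeness_summary(rows: List[Row], columns: List[str]) -> Dict[str, Any]:
    """Summarize completeness; None stands in for a missing value (assumption)."""
    # Count missing values column by column.
    missing_per_column = {
        c: sum(1 for r in rows if r.get(c) is None) for c in columns
    }
    return {
        # Columns in which every value is missing.
        "all_missing_columns": [
            c for c in columns if rows and missing_per_column[c] == len(rows)
        ],
        # Rows with no missing value in any column.
        "complete_rows": sum(
            1 for r in rows if all(r.get(c) is not None for c in columns)
        ),
        # Total number of missing values across the dataset.
        "missing_observations": sum(missing_per_column.values()),
    }


# Hypothetical rows, not taken from the delivered files.
sample_rows = [
    {"company_name": "A", "lei": None, "notes": None},
    {"company_name": "B", "lei": "X1", "notes": None},
    {"company_name": None, "lei": "X2", "notes": None},
]
summary = completeness_summary(sample_rows, ["company_name", "lei", "notes"])
print(summary)
```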

1.1.5. Missing Data

There are no missing records present in the data.

1.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

1.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The values are binned along the X-axis, and the frequency of each bin is plotted on the Y-axis.
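As a rough illustration of the binning step, the sketch below groups values into equal-width bins and counts the values in each. Equal-width bins, the function name `histogram`, and the sample values are assumptions; the report does not state which binning rule was used.

```python
from typing import List, Tuple


def histogram(values: List[float], bins: int) -> List[Tuple[float, float, int]]:
    """Return (bin_start, bin_end, count) for equal-width bins."""
    if not values or bins < 1:
        return []
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are equal.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(bins)]


# Hypothetical active_duration values, for illustration only.
for start, end, count in histogram([1, 2, 2, 3, 10], bins=3):
    print(f"[{start:.1f}, {end:.1f}): {count}")
```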

1.2.1.1 Column - active_duration

Column active_duration represents the average number of days each posting was active in a given grouping

1.2.2. Frequency Counts - Categorical Variable(s)
1.2.2.1 Column - company_names

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 37,264 distinct company names in the data.

| Top 10: Company Name | Frequency | Bottom 10: Company Name | Frequency |
|---|---|---|---|
| Cardinal Health | 13,684 | Greater Seattle Chamber of Commerce | 2,716 |
| Lyft | 10,968 | Hocking Technical College | 2,717 |
| Target | 10,953 | New Naschitti Elementary School | 2,717 |
| Cardinal Logistics | 10,949 | Santa Clara University | 2,717 |
| Mesilla Valley Transportation | 10,943 | Teletronics Technology Corporation | 2,717 |
| UPS | 10,941 | Wheaton College | 2,717 |
| AXA | 8,227 | First Atlantic Health Care | 2,718 |
| St. Joseph Health System | 8,226 | Gwinnett County Public Schools | 2,718 |
| Yavapai Regional Medical Center | 8,226 | Hancock Regional Hospital | 2,718 |
| Mosaic | 8,225 | Tahoe Truckee Unified School District | 2,718 |
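A top 10 / bottom 10 frequency table of this kind can be derived from a categorical column roughly as follows. This is a sketch under assumptions: the function name `top_and_bottom` is hypothetical, ties are broken alphabetically, and the sample values are made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_and_bottom(
    values: Iterable[str], n: int = 10
) -> Tuple[int, List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Return (distinct count, n most frequent, n least frequent) values."""
    counts = Counter(values)
    # Most frequent first; ties broken alphabetically (an assumption).
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    # Least frequent first; ties broken alphabetically (an assumption).
    bottom = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]
    return len(counts), top, bottom


# Hypothetical column values, for illustration only.
names = ["UPS", "UPS", "Lyft", "Target", "Target", "Target"]
distinct, top, bottom = top_and_bottom(names, n=2)
print(distinct, top, bottom)
```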

1.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend from January 2012 to June 2019.

2. Company Reference

Data Quality Scorecard

Parameter definitions are available in Appendix

| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | | | 100% unique rows |
| Missing values > 50% for one or more columns? | | | 4 columns have missing values > 50% |
| Most recent update more than 6 months old? | | | No date columns in data |
| Data contains PII (Personally Identifiable Information)? | | | |

2.1. Metadata Summary

2.1.1. Data Dimensions

Number of Columns: 10

Number of Rows: 61,209

Date Range: 24-Feb-2016 to 31-Aug-2019

2.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

2.1.3. Data Subset

Sample 6 rows of the dataset.

2.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

2.1.5. Missing Data

There are no missing records present in the data.

2.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

2.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The values are binned along the X-axis, and the frequency of each bin is plotted on the Y-axis.

There are no relevant columns for which to plot a histogram.

2.2.2. Frequency Counts - Categorical Variable(s)
2.2.2.1 Column - company_name

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 56,638 distinct companies in the data.

| Top 10: Company Name | Frequency | Bottom 10: Company Name | Frequency |
|---|---|---|---|
| Mesilla Valley Transportation | 10 | ’ike Group | 1 |
| CDW | 9 | !ndigo | 1 |
| Averitt Express | 7 | [24]7 Inc | 1 |
| Pizza Hut | 7 | [X+1] | 1 |
| Ruan | 7 | 01 Communique Laboratory Inc | 1 |
| UPS | 7 | 1-800-FLOWERS.COM, Inc | 1 |
| Western Express | 7 | 1-800-GOT-JUNK | 1 |
| Advantage Solutions | 6 | 1-800-Sweeper/800Sweeper, LLC | 1 |
| Celadon | 6 | 1-800 CONTACTS | 1 |
| Express Scripts | 6 | 1-800 CONTACTS, INC | 1 |

2.2.2.2 Column - lei

The table below shows the frequency counts of the top 10 and bottom 10 lei values.

There are about 4,278 distinct lei values in the data.

| Top 10: lei | Frequency | Bottom 10: lei | Frequency |
|---|---|---|---|
| 4KF48RN45X1OO8UBLY20 | 29 | 01KWVG908KE7RKPTNP46 | 1 |
| 5493000GH5DTFC8LLR93 | 11 | 02CBKVAOND0BEIOF0V84 | 1 |
| 4YV9Y5M8S0BRK1RP0397 | 7 | 03D0JEWFDFUS0SEEKG89 | 1 |
| 5493008TXYN3II3PU369 | 7 | 04Y1L40RYNUCL020XS57 | 1 |
| 549300VHDC555R46LM46 | 6 | 05MQKGBWLLX7RPPDO189 | 1 |
| GTJS1N8S8I28A7L4WG97 | 6 | 06BTX5UWZD0GQ5N5Y745 | 1 |
| 225YDZ14ZO8E1TXUSU86 | 5 | 08IRJODWFYBI7QWRGS31 | 1 |
| 5493007JDSMX8Z5Z1902 | 5 | 0IDE18EMH1CUKQCUYE69 | 1 |
| 54930080C93RVZRSDV26 | 5 | 0JRPR7R1EOV4Z2M47L22 | 1 |
| 549300A4FSM5CB5M5D32 | 5 | 0M6M8M9HXLW8D3AU0N32 | 1 |

2.2.2.3 Column - open_perm_id

The table below shows the frequency counts of the top 10 and bottom 10 open perm ids.

There are about 23,749 distinct open perm ids in the data.

| Top 10: perm_id | Frequency | Bottom 10: perm_id | Frequency |
|---|---|---|---|
| 4295905360 | 29 | -5040258481 | 1 |
| 4297833085 | 17 | -5059003739 | 1 |
| 4297113611 | 15 | 102100625 | 1 |
| 4297213968 | 14 | 114840374 | 1 |
| 5038021810 | 13 | 21521133901 | 1 |
| 5040952276 | 13 | 296469794 | 1 |
| 5044197001 | 13 | 4294988003 | 1 |
| 4295349307 | 12 | 4295001381 | 1 |
| 4297385008 | 11 | 4295011269 | 1 |
| 4295904886 | 9 | 4295012519 | 1 |

2.2.2.4 Column - naics_code

The table below shows the frequency counts of the top 10 and bottom 10 NAICS codes.

There are about 916 distinct NAICS codes in the data.

| Top 10: NAICS Code | Frequency | Bottom 10: NAICS Code | Frequency |
|---|---|---|---|
| 611110 | 1031 | 111211 | 1 |
| 622110 | 664 | 111219 | 1 |
| 999990 | 499 | 111320 | 1 |
| 611310 | 405 | 111421 | 1 |
| 541511 | 301 | 112111 | 1 |
| 443142 | 292 | 112340 | 1 |
| 621111 | 273 | 114111 | 1 |
| 522110 | 253 | 115310 | 1 |
| 541512 | 230 | 211112 | 1 |
| 921120 | 212 | 212230 | 1 |

2.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend from February 2012 to August 2019.

3. Daily Created

Data Quality Scorecard

Parameter definitions are available in Appendix

| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | | | 93.2% unique rows |
| Missing values > 50% for one or more columns? | | | 0 columns have missing values > 50% |
| Most recent update more than 6 months old? | | | Latest date: 29-Jun-2019 |
| Data contains PII (Personally Identifiable Information)? | | | |

3.1. Metadata Summary

3.1.1. Data Dimensions

Number of Columns: 4

Number of Rows: 102,340,815

Date Range: 01-Jan-2012 to 29-Jun-2019

3.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

3.1.3. Data Subset

Sample 6 rows of the dataset.

3.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

3.1.5. Missing Data

3.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

3.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The values are binned along the X-axis, and the frequency of each bin is plotted on the Y-axis.

3.2.1.1 Column - created_job_count

Column created_job_count represents how many new job postings were created for the given grouping

Note: Figure 2 is a zoomed-in version of Figure 1 (showing values greater than 2 and less than 20).

3.2.2. Frequency Counts - Categorical Variable(s)
3.2.2.1 Column - company_names

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 36,679 distinct company names in the data.

| Top 10: Company Name | Frequency | Bottom 10: Company Name | Frequency |
|---|---|---|---|
| Cardinal Health | 13,395 | Healthcare Risk Advisors | 2,209 |
| UPS | 11,163 | Milan Laser Hair Removal | 2,209 |
| Schneider | 11,045 | Zurchers | 2,211 |
| Lyft | 10,782 | Acme Brick Tile & More | 2,215 |
| Target | 10,662 | Horizon Connects | 2,216 |
| Cardinal Logistics | 10,608 | Blue Sky Pest Control | 2,217 |
| Mesilla Valley Transportation | 10,417 | Hacker USA | 2,218 |
| Averitt Express | 9,184 | The Third Floor Inc | 2,218 |
| Community Medical Centers | 9,140 | Ministry of Social Development | 2,220 |
| St. Joseph Health System | 8,914 | Stratton Amenities | 2,221 |

3.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend from January 2012 to June 2019.

4. Daily Deleted

Data Quality Scorecard

Parameter definitions are available in Appendix

| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | | | 92.33% unique rows |
| Missing values > 50% for one or more columns? | | | 0 columns have missing values > 50% |
| Most recent update more than 6 months old? | | | Latest date: 29-Jun-2019 |
| Data contains PII (Personally Identifiable Information)? | | | |

4.1. Metadata Summary

4.1.1. Data Dimensions

Number of Columns: 4

Number of Rows: 100,560,809

Date Range: 01-Jan-2012 to 29-Jun-2019

4.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

4.1.3. Data Subset

Sample 6 rows of the dataset.

4.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

4.1.5. Missing Data

4.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

4.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The values are binned along the X-axis, and the frequency of each bin is plotted on the Y-axis.

4.2.1.1 Column - deleted_job_count

Column deleted_job_count represents how many job postings were removed for the given grouping

Note: Figure 2 is a zoomed-in version of Figure 1 (showing values greater than 2 and less than 20).

4.2.2. Frequency Counts - Categorical Variable(s)
4.2.2.1 Column - company_names

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 36,038 distinct company names in the data.

| Top 10: Company Name | Frequency | Bottom 10: Company Name | Frequency |
|---|---|---|---|
| Cardinal Health | 13,704 | Atlanta Braves MLB | 2,706 |
| Target | 10,995 | Barge Waggoner Sumner & Cannon | 2,706 |
| Mesilla Valley Transportation | 10,937 | CGI Group Inc | 2,706 |
| UPS | 10,925 | Fort Miller | 2,706 |
| Lyft | 10,923 | Worley Catastrophe Response | 2,707 |
| Schneider | 10,913 | Zitter Health Insights | 2,707 |
| Cardinal Logistics | 10,880 | Alverno Laboratories | 2,708 |
| USA Truck | 8,262 | Attic Angel | 2,708 |
| Ruan | 8,259 | Cabot Corporation | 2,708 |
| XPO Logistics | 8,259 | CBRE UK | 2,708 |

4.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend from January 2012 to June 2019.

5. Ticker Reference

Data Quality Scorecard

Parameter definitions are available in Appendix

| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | | | 100% unique rows |
| Missing values > 50% for one or more columns? | | | 2 columns have missing values > 50% |
| Most recent update more than 6 months old? | | | No date columns in data |
| Data contains PII (Personally Identifiable Information)? | | | |

5.1. Metadata Summary

5.1.1. Data Dimensions

Number of Columns: 9

Number of Rows: 66,402

Date Range: 26-Nov-1968 to 29-Jul-2019

5.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

5.1.3. Data Subset

Sample 6 rows of the dataset.

5.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

5.1.5. Missing Data

There are no missing records present in the data.

5.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

5.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The values are binned along the X-axis, and the frequency of each bin is plotted on the Y-axis.

There are no relevant columns for which to plot a histogram.

5.2.2. Frequency Counts - Categorical Variable(s)
5.2.2.1 Column - company_id

The table below shows the frequency counts of the top 10 and bottom 10 company IDs.

There are about 0 distinct company IDs in the data.

| Top 10: Company ID | Frequency | Bottom 10: Company ID | Frequency |
|---|---|---|---|
| 32373 | 36 | 10007 | 1 |
| 14034 | 35 | 1015 | 1 |
| 3580 | 35 | 10291 | 1 |
| 3745 | 34 | 10297 | 1 |
| 43173 | 34 | 10323 | 1 |
| 54945 | 34 | 1047 | 1 |
| 56056 | 28 | 10584 | 1 |
| 10736 | 25 | 1077 | 1 |
| 19352 | 25 | 10797 | 1 |
| 6568 | 25 | 10823 | 1 |

5.2.2.2 Column - stock_ticker

The table below shows the frequency counts of the top 10 and bottom 10 stock tickers.

There are about 22,124 distinct stock tickers in the data.

| Top 10: Stock Ticker | Frequency | Bottom 10: Stock Ticker | Frequency |
|---|---|---|---|
| HCA | 422 | 000046 | 1 |
| IBM | 350 | 000050 | 1 |
| WMT | 252 | 000120 | 1 |
| UNH | 244 | 000150 | 1 |
| XRX | 225 | 000166 | 1 |
| BRK | 224 | 000660 | 1 |
| BRKB | 224 | 000725 | 1 |
| ORCL | 192 | 000728 | 1 |
| UTX | 168 | 001 | 1 |
| JNJ | 160 | 002371 | 1 |

5.2.2.3 Column - stock_exchange_country

The table below shows the frequency counts of the top 10 and bottom 10 stock exchange countries.

There are about 78 distinct stock exchange countries in the data.

| Top 10: Stock Exchange Country | Frequency | Bottom 10: Stock Exchange Country | Frequency |
|---|---|---|---|
| US | 19,203 | CY | 1 |
| DE | 17,495 | EG | 1 |
| GB | 8,175 | IS | 1 |
| MX | 4,382 | JM | 1 |
| CA | 3,082 | KW | 1 |
| AT | 2,801 | MU | 1 |
| CH | 2,664 | OM | 1 |
| FR | 1,123 | PK | 1 |
| JP | 993 | RS | 1 |
| CL | 844 | SK | 1 |

5.2.2.4 Column - stock_exchange_name

The table below shows the frequency counts of the top 10 and bottom 10 stock exchange names.

There are about 113 distinct stock exchange names in the data.

| Top 10: Stock Exchange Name | Frequency | Bottom 10: Stock Exchange Name | Frequency |
|---|---|---|---|
| FRA | 11,969 | BEL | 1 |
| NYS | 9,355 | BRA | 1 |
| NAS | 9,056 | BURG | 1 |
| LON | 8,172 | CAI | 1 |
| MEX | 4,382 | CYS | 1 |
| BER | 2,864 | DAR | 1 |
| WBO | 2,801 | DFM | 1 |
| SWX | 2,613 | ICE | 1 |
| TSX | 1,509 | IEXG | 1 |
| TSE | 1,507 | IST | 1 |

5.2.2.5 Column - primary_flag

5.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend from January 2012 to August 2019.

6. Unique Active

Data Quality Scorecard

Parameter definitions are available in Appendix

| Parameter | Yes | No | Comments |
|---|---|---|---|
| Duplicate rows > 10%? | | | 92.61% unique rows |
| Missing values > 50% for one or more columns? | | | 0 columns have missing values > 50% |
| Most recent update more than 6 months old? | | | Latest date: 29-Jun-2019 |
| Data contains PII (Personally Identifiable Information)? | | | |

6.1. Metadata Summary

6.1.1. Data Dimensions

Number of Columns: 4

Number of Rows: 101,889,662

Date Range: 01-Jan-2012 to 29-Jun-2019

6.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

6.1.3. Data Subset

Sample 6 rows of the dataset.

6.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

6.1.5. Missing Data

6.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

6.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram represents the distribution of numerical data. The values are binned along the X-axis, and the frequency of each bin is plotted on the Y-axis.

6.2.1.1 Column - unique_active_job_count

Column unique_active_job_count represents how many unique job postings were active for the given grouping

Note: Figure 2 is a zoomed-in version of Figure 1 (showing values greater than 2 and less than 20).

6.2.2. Frequency Counts - Categorical Variable(s)
6.2.2.1 Column - company_names

The table below shows the frequency counts of the top 10 and bottom 10 company names.

There are about 36,540 distinct company names in the data.

| Top 10: Company Name | Frequency | Bottom 10: Company Name | Frequency |
|---|---|---|---|
| Cardinal Health | 13,691 | Casey’s Cupcakes | 2,712 |
| Lyft | 10,954 | Castlight Health | 2,712 |
| Mesilla Valley Transportation | 10,935 | DirectMedica | 2,712 |
| Target | 10,935 | Transaction Network Services | 2,712 |
| Cardinal Logistics | 10,931 | Capita | 2,713 |
| UPS | 10,916 | City of Carlsbad | 2,713 |
| Knight Transportation | 8,223 | Jaunt | 2,713 |
| Constellation Brands | 8,222 | Nutmeg | 2,713 |
| Schneider | 8,222 | Puratos | 2,713 |
| St. Joseph Health System | 8,217 | US Storage Centers | 2,713 |

6.3. Time Series Analysis

Periodicity: Daily

The graph below shows the frequency trend from January 2012 to June 2019.

Appendix 1: Data Summary

1 Univariate Analysis

Univariate Analysis involves the analysis of one variable at a time.

1.1 Histogram

A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.

1.2 Density Plot

A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
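The kernel smoothing behind a density plot can be sketched with a Gaussian kernel. This is an illustrative assumption: the report does not say which kernel or bandwidth its plots use, and the function name `gaussian_kde`, the bandwidth of 0.5, and the sample values are hypothetical.

```python
import math
from typing import List


def gaussian_kde(values: List[float], grid: List[float], bandwidth: float) -> List[float]:
    """Estimate the density at each grid point using a Gaussian kernel."""
    n = len(values)
    # Normalizing constant so that the estimate integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Hypothetical values and bandwidth, for illustration only.
grid = [-10 + 0.1 * i for i in range(201)]
density = gaussian_kde([0.0, 1.0, 2.0], grid, bandwidth=0.5)
peak_x = grid[density.index(max(density))]
print(f"density peaks near x = {peak_x:.1f}")
```

A smaller bandwidth follows the data more closely and shows more noise; a larger one gives a smoother curve with flatter peaks.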

1.3 QQ Plot

A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
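The points of a normal QQ plot can be computed as below. This is a sketch under assumptions: the reference distribution is the standard normal, the plotting positions are (i + 0.5) / n, the sample is standardized by its own mean and standard deviation, and the function name `qq_points` is hypothetical.

```python
from statistics import NormalDist, mean, stdev
from typing import List, Tuple


def qq_points(sample: List[float]) -> List[Tuple[float, float]]:
    """Return (theoretical quantile, standardized sample quantile) pairs."""
    ordered = sorted(sample)
    n = len(ordered)
    mu, sigma = mean(ordered), stdev(ordered)  # requires at least two values
    reference = NormalDist()  # standard normal reference (assumption)
    return [
        # Plotting position (i + 0.5) / n is one common convention (assumption).
        (reference.inv_cdf((i + 0.5) / n), (x - mu) / sigma)
        for i, x in enumerate(ordered)
    ]


# Hypothetical sample, for illustration only.
points = qq_points([1.0, 2.0, 3.0, 4.0, 5.0])
for theoretical, observed in points:
    print(f"{theoretical:+.3f} {observed:+.3f}")
```

Points lying close to the line y = x suggest approximately normal data; systematic curvature at the ends suggests skew or heavy tails.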

2 Bivariate Analysis

Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.

2.1 Correlation

A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
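A correlation matrix can be built by computing a correlation coefficient for every pair of numeric columns. The sketch below assumes the Pearson coefficient; the function names `pearson` and `correlation_matrix` and the sample columns are hypothetical.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length columns.

    Raises ZeroDivisionError if either column is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Return the coefficient for every ordered pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical columns, for illustration only.
matrix = correlation_matrix({"x": [1, 2, 3], "y": [2, 4, 6], "z": [3, 2, 1]})
print(matrix["x"]["y"], matrix["x"]["z"])
```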

3 Trend Charts

Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
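The data behind a monthly trend chart is typically a roll-up of daily observations. A minimal sketch, assuming daily (date, count) pairs as input; the function name `monthly_totals` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def monthly_totals(daily: Iterable[Tuple[date, int]]) -> Dict[str, int]:
    """Sum daily counts into 'YYYY-MM' buckets, in chronological order."""
    totals: Dict[str, int] = defaultdict(int)
    for day, count in daily:
        totals[day.strftime("%Y-%m")] += count
    return dict(sorted(totals.items()))


# Hypothetical daily counts, for illustration only.
daily_counts = [
    (date(2012, 1, 1), 3),
    (date(2012, 1, 2), 4),
    (date(2012, 2, 1), 5),
]
print(monthly_totals(daily_counts))
```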

Appendix 2: Data Quality Scorecard

Tresvista's Data Quality (DQ) framework is designed to assess the data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.

Dimensions of Data Quality

A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.

1. Availability - Sufficient availability of data points

Availability means ensuring that enough data is available to end users and applications, when and where they need it, for further analysis. This is particularly important because many machine learning algorithms require enough data samples to train, test, and validate models.

2. Completeness - All required data is captured, and no data is missing

Completeness is measured as the percentage of records with non-NULL values. It is also called comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.
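The scorecard parameter "Missing values > 50% for one or more columns" can be evaluated along these lines. This is a sketch, not the framework's actual implementation: `None` marks a missing value, the comparison is strictly greater than the threshold, and the function name `columns_mostly_missing` and the sample rows are assumptions.

```python
from typing import Any, Dict, List, Optional, Tuple


def columns_mostly_missing(
    rows: List[Dict[str, Optional[Any]]],
    columns: List[str],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (column, missing share) for columns above the threshold."""
    if not rows:
        return []
    flagged = []
    for column in columns:
        share = sum(1 for r in rows if r.get(column) is None) / len(rows)
        if share > threshold:  # strictly greater than 50% (assumption)
            flagged.append((column, share))
    return flagged


# Hypothetical rows, for illustration only.
example_rows = [
    {"company_name": "A", "lei": None},
    {"company_name": "B", "lei": None},
    {"company_name": None, "lei": None},
    {"company_name": None, "lei": "X1"},
]
print(columns_mostly_missing(example_rows, ["company_name", "lei"]))
```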

3. Uniqueness - Data redundancy should be avoided

Uniqueness means that the data should contain no duplicate records. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.
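One possible reading of the "Duplicate rows > 10%" parameter is sketched below. It assumes that the percentage of unique rows is the number of distinct rows divided by the total number of rows, and that the duplicate share is the remainder; the report does not state its exact formula. Under that reading, 93.38% unique rows corresponds to 6.62% duplicates, which is below the 10% threshold. The function names and sample rows are hypothetical.

```python
from typing import Iterable, Sequence


def unique_row_percentage(rows: Iterable[Sequence]) -> float:
    """Percentage of distinct rows among all rows (assumed definition)."""
    as_tuples = [tuple(r) for r in rows]
    if not as_tuples:
        return 100.0
    return 100.0 * len(set(as_tuples)) / len(as_tuples)


def has_excess_duplicates(rows: Iterable[Sequence], threshold_pct: float = 10.0) -> bool:
    """True when the duplicate share is strictly above the threshold."""
    return 100.0 - unique_row_percentage(rows) > threshold_pct


# Hypothetical rows, for illustration only.
records = [("A", 1), ("A", 1), ("B", 2), ("C", 3)]
print(unique_row_percentage(records), has_excess_duplicates(records))
```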

4. Recency - Data is recent and not outdated

Recency is the degree to which information is current relative to the present period. It measures how up-to-date the information is, and whether it is still correct despite possible time-related changes.
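The recency parameter ("most recent update more than 6 months old") can be checked with simple date arithmetic. In this sketch, six months is approximated as 183 days, and the function name `is_stale` and the two reference dates are assumptions, since the report does not state the date on which it was run.

```python
from datetime import date


def is_stale(latest: date, as_of: date, max_age_days: int = 183) -> bool:
    """True when the latest record is older than the allowed age."""
    # 183 days approximates six months (assumption).
    return (as_of - latest).days > max_age_days


latest_record = date(2019, 6, 29)  # latest date reported for several files

# Hypothetical evaluation dates, for illustration only.
print(is_stale(latest_record, as_of=date(2019, 9, 1)))
print(is_stale(latest_record, as_of=date(2020, 1, 15)))
```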

5. Consistency - Uniform data types across the column

It refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column level consistency).
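Column-level consistency of data types can be checked by counting the types that occur in a column. This is a minimal sketch: missing values (`None`) are ignored, and the function names `column_type_counts` and `is_consistent` are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable, List


def column_type_counts(values: Iterable[Any]) -> Counter:
    """Count the Python type names that occur in a column, ignoring None."""
    return Counter(type(v).__name__ for v in values if v is not None)


def is_consistent(values: List[Any]) -> bool:
    """True when all non-missing values in the column share one type."""
    return len(column_type_counts(values)) <= 1


# Hypothetical column values, for illustration only.
print(is_consistent([611110, 622110, None]))
print(column_type_counts([611110, "622110", 999990]))
```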