LABOR MARKET INTELLIGENCE DATA

Product Structure Tree

Number of Files Received: 5

1. Master

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	Yes	No	Comments
Duplicate rows > 10%?		✓	92.82% Unique Rows
Missing values > 50% for one or more columns?	✓		1 column has missing values > 50%
Most recent update is more than 6 months old?		✓	Latest Date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?		✓
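As a rough illustration of how the first three scorecard parameters above could be evaluated, the sketch below works on a list of row dictionaries using only the Python standard library. The function name `scorecard`, the `post_date` column used for the Master file, the sample values, and the reference date are assumptions made for illustration. "Unique Rows" is assumed to mean the share of distinct rows, and "6 months" is approximated as 183 days. PII detection is left out because it usually needs manual review.

```python
from datetime import date, timedelta

def scorecard(rows, date_column, today):
    """Evaluate three scorecard checks on a list of dict rows (None = missing)."""
    total = len(rows)
    columns = list(rows[0].keys())

    # Uniqueness: share of distinct rows (assumed definition of "Unique Rows").
    distinct = {tuple(row[c] for c in columns) for row in rows}
    unique_pct = 100.0 * len(distinct) / total

    # Completeness: columns where more than 50% of the values are missing.
    sparse_columns = [
        c for c in columns
        if sum(row[c] is None for row in rows) / total > 0.5
    ]

    # Recency: the latest date should fall within roughly six months of `today`.
    latest = max(row[date_column] for row in rows if row[date_column] is not None)
    stale = (today - latest) > timedelta(days=183)

    return {
        "unique_pct": round(unique_pct, 2),
        "duplicates_over_10pct": (100.0 - unique_pct) > 10.0,
        "columns_over_50pct_missing": sparse_columns,
        "latest_date": latest,
        "stale": stale,
    }

# Illustrative rows only; these are not taken from the delivered files.
rows = [
    {"company": "Aramark", "salary": None, "post_date": date(2019, 3, 30)},
    {"company": "Aramark", "salary": None, "post_date": date(2019, 3, 30)},
    {"company": "Express", "salary": 42000, "post_date": date(2019, 3, 1)},
    {"company": "TARGET", "salary": None, "post_date": date(2019, 2, 1)},
]
result = scorecard(rows, "post_date", today=date(2019, 4, 15))
```

A "Yes" in the scorecard would correspond to `duplicates_over_10pct`, a non-empty `columns_over_50pct_missing`, or `stale` being true.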

1.1. Metadata Summary

1.1.1. Data Dimensions

Number of Columns: 21

Number of Rows: 9,948,150

Date Range: 25-Apr-2016 to 30-Mar-2019

1.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

1.1.3. Data Subset

Sample 6 rows of the dataset.

1.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values
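The completeness measures defined above can be sketched in a few lines of standard-library Python. The function name `completeness_summary` and the sample rows are hypothetical, and classifying a column as continuous when all of its non-missing values are numeric is an assumption; the actual report may use a different rule.

```python
def completeness_summary(rows):
    """Summarize completeness for a list of dict rows, treating None as missing."""
    columns = list(rows[0].keys())
    discrete, continuous, all_missing = [], [], []
    for c in columns:
        present = [row[c] for row in rows if row[c] is not None]
        if not present:
            all_missing.append(c)          # every value in the column is missing
        elif all(isinstance(v, (int, float)) for v in present):
            continuous.append(c)           # assumed rule: numeric means continuous
        else:
            discrete.append(c)
    # Rows in which no column is missing.
    complete_rows = sum(all(row[c] is not None for c in columns) for row in rows)
    # Total number of missing cells across the whole table.
    missing_observations = sum(row[c] is None for row in rows for c in columns)
    return {
        "discrete_columns": discrete,
        "continuous_columns": continuous,
        "all_missing_columns": all_missing,
        "complete_rows": complete_rows,
        "missing_observations": missing_observations,
    }

# Illustrative rows only.
sample = [
    {"city": "Houston", "salary": 52000.0, "ticker": None},
    {"city": None, "salary": 61000.0, "ticker": None},
    {"city": "Seattle", "salary": None, "ticker": None},
]
summary = completeness_summary(sample)
```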

1.1.5. Missing Data

The plot shows the percentage of missing values for each column, color-coded on a spectrum from green (0%) to red (100%).

1.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

1.2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram is a representation of the distribution of numerical data. The numerical values are binned and plotted on the X-axis, and the corresponding frequencies are plotted on the Y-axis.
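The binning step can be sketched as follows. This is a minimal illustration with fixed-width bins; the function name `histogram`, the bin width, and the sample salary values are assumptions, not the settings used to produce the figures in this report.

```python
def histogram(values, bin_width, start=0.0):
    """Count values per fixed-width bin; returns {bin_lower_edge: frequency}."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)   # which bin the value falls into
        lower = start + index * bin_width       # lower edge of that bin (X-axis)
        counts[lower] = counts.get(lower, 0) + 1  # frequency (Y-axis)
    return dict(sorted(counts.items()))

# Illustrative values only.
salaries = [18000, 22000, 24000, 31000, 39000, 41000, 74000]
bins = histogram(salaries, bin_width=10000)
```

Each key is the lower edge of a bin and each value is the number of observations that fall in it.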

1.2.1.1. Column - salary

The salary column represents the estimated pay level.

Note: Figure 2 is a zoomed-in version of Figure 1 (restricted to values greater than 0 and less than 75,000).

1.2.2. Frequency Counts - Categorical Variable(s)

1.2.2.1. Column - company

There are 6,193 unique company names.

The table below shows the frequency counts of the top 10 and bottom 10 company names.

Top 10		Bottom 10
Company Name	Frequency	Company Name	Frequency
Care.Com	670,514	3M Company	1
Aramark	152,257	3M Health Information Systems	1
Wells+Fargo	132,242	A f l a c	1
Marriott International Inc	111,047	Acadia Healhcare	1
Wells Fargo	110,970	ACADIA HEALTHCARE	1
Express	106,088	Adams Resources & Energy Inc	1
Southern	105,905	adidas AG	1
Care.com	87,558	Advanced Emission Control Solutions	1
TARGET	82,568	Aflac - Capital Region	1
Lockheed+Martin	79,254	Aflac Careers	1
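A top/bottom frequency table like the one above can be produced with `collections.Counter`. The sketch below is only an illustration: the function name `top_and_bottom` is hypothetical, the sample list is made up, and breaking ties alphabetically (case-insensitively) is an assumption about how the report orders equal counts.

```python
from collections import Counter

def top_and_bottom(values, n=10):
    """Return the n most frequent and n least frequent values with their counts."""
    counts = Counter(values)
    # Highest frequency first; ties broken alphabetically, ignoring case (assumed).
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].casefold()))[:n]
    # Lowest frequency first; same tie-breaking rule.
    bottom = sorted(counts.items(), key=lambda kv: (kv[1], kv[0].casefold()))[:n]
    return top, bottom

# Illustrative values only.
names = ["Aramark"] * 3 + ["Express"] * 2 + ["TARGET"] * 2 + ["adidas AG", "3M Company"]
top, bottom = top_and_bottom(names, n=2)
```

Values are counted exactly as they appear, so spellings that differ only in case or punctuation are treated as separate entries.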

1.2.2.2. Column - location

There are 54,050 unique locations.

The table below shows the frequency counts of the top 10 and bottom 10 locations.

Top 10		Bottom 10
Location	Frequency	Location	Frequency
New York NY	77,567	65th Infantry PR 00923	1
Atlanta GA	66,526	Abbeville LA 70511	1
Seattle WA	65,405	Abbeville MS 38601	1
Charlotte NC	58,432	Abbott TX 76621	1
Houston TX	57,550	Abbottstown PA	1
Chicago IL	51,836	Abbyville KS 67510	1
United States	50,547	Aberdeen IN	1
San Francisco CA	44,994	Abie NE 68001	1
Minneapolis MN	39,806	Abington CT 06230	1
Dallas TX	39,739	Abington NC	1

1.2.2.3. Column - city

There are 17,370 unique cities.

The table below shows the frequency counts of the top 10 and bottom 10 cities.

Top 10		Bottom 10
City	Frequency	City	Frequency
New York	164,137	65th Infantry	1
Houston	108,095	Abbyville	1
Atlanta	103,103	Abie	1
Chicago	102,014	Academy	1
Charlotte	90,206	Adah	1
Seattle	89,611	Adair Village	1
San Francisco	78,337	Adairville	1
Dallas	74,010	Adams County	1
Austin	71,042	Addy	1
Phoenix	66,116	Adelphia	1

1.2.2.4. Column - state_long

There are 52 unique states.

The table below shows the frequency counts of the top 10 and bottom 10 states.

Top 10		Bottom 10
State	Frequency	State	Frequency
California	1,066,787	Wyoming	9,978
Texas	725,249	Vermont	12,031
Florida	481,256	Puerto Rico	12,591
New York	423,525	Alaska	13,675
Illinois	346,562	Montana	16,416
Pennsylvania	326,239	South Dakota	16,886
North Carolina	307,983	North Dakota	19,467
Georgia	306,908	Maine	27,760
Virginia	292,574	Idaho	31,023
Ohio	276,539	Rhode Island	31,751

1.2.2.5. Column - county

There are 1,837 unique counties.

The table below shows the frequency counts of the top 10 and bottom 10 counties.

Top 10		Bottom 10
County	Frequency	County	Frequency
Los Angeles	219,785	Benson	1
King	180,297	Billings	1
Santa Clara	176,393	Broomfield	1
New York	165,407	Camas	1
Cook	161,817	Dundy	1
Orange	149,905	Eureka	1
Maricopa	144,711	Faulk	1
Dallas	141,594	Haakon	1
Harris	130,443	Harding	1
Middlesex	127,478	Hayes	1

1.2.2.6. Column - region_state

There are 416 unique regions.

The table below shows the frequency counts of the top 10 and bottom 10 region_state values.

Top 10		Bottom 10
Region State	Frequency	Region State	Frequency
New York-Northern New Jersey-Long Island NY-NJ-PA MSA	469,139	All other territories and foreign countries	7
Los Angeles-Long Beach-Santa Ana CA MSA	301,634	GU NONMETROPOLITAN AREA	166
Chicago-Naperville-Joliet IL-IN-WI MSA	293,304	MA NONMETROPOLITAN AREA	334
Dallas-Fort Worth-Arlington TX MSA	272,626	Danville IL MSA	737
Washington-Arlington-Alexandria DC-VA-MD-WV MSA	270,099	Madera CA MSA	784
San Francisco-Oakland-Fremont CA MSA	237,849	Palm Coast FL MSA	802
Atlanta-Sandy Springs-Marietta GA MSA	220,896	Pine Bluff AR MSA	860
Boston-Cambridge-Quincy MA-NH MSA	215,914	Bay City MI MSA	967
Philadelphia-Camden-Wilmington PA-NJ-DE-MD MSA	186,660	Hinesville-Fort Stewart GA MSA	1,058
San Jose-Sunnyvale-Santa Clara CA MSA	176,857	Lewiston ID-WA MSA	1,076

1.2.2.7. Column - company_ref

There are 3,192 unique company references.

The table below shows the frequency counts of the top 10 and bottom 10 company references.

Top 10		Bottom 10
Company Reference	Frequency	Company Reference	Frequency
CARE.COM INC	758,072	ADAMS RESOURCES & ENERGY INC	1
WELLS FARGO & COMPANY	243,217	ADIDAS	1
ARAMARK	152,702	ADVANCED EMISSIONS SOLUTIONS INC	1
KINDRED HEALTHCARE INC	147,931	AGNC INVESTMENT CORP	1
TARGET CORPORATION	139,739	AIR PRODUCTS AND CHEMICALS INC	1
MACY’S INC	130,728	AIR TRANSPORT SERVICES GROUP INC	1
MARRIOTT	116,405	ALLIN CORPORATION	1
EXPRESS INC	106,123	AMERICAN REALTY INVESTORS INC	1
THE SOUTHERN COMPANY	105,905	AMERICAS UNITED BANK	1
THE HOME DEPOT INC	100,318	AMKOR TECHNOLOGY INC	1

1.2.2.8. Column - ticker

There are 3,153 unique tickers.

The table below shows the frequency counts of the top 10 and bottom 10 tickers.

Top 10		Bottom 10
Ticker	Frequency	Ticker	Frequency
CRCM	758,072	ADDDF	1
WFC	243,217	ADES	1
MAR	174,368	AE	1
ARMK	152,702	AGNC	1
KND	147,931	AGTC	1
TGT	139,739	ALLN	1
M	130,728	AMGP	1
EXPR	106,123	AMKR	1
SO	105,905	AMPG	1
HD	100,318	APD	1

1.3. Time Series Analysis

Periodicity: Daily

The graph below shows the trend in job postings from April 2016 to March 2019.
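A daily series of this kind could be built by counting postings per calendar day. The sketch below is a minimal illustration; the function name `daily_counts` and the sample dates are assumptions, and days with no postings are filled with zero so that gaps remain visible in a plot.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(post_dates):
    """Return [(day, count)] for every calendar day between the first and last date."""
    counts = Counter(post_dates)
    day, last = min(counts), max(counts)
    series = []
    while day <= last:
        series.append((day, counts.get(day, 0)))  # zero for days with no postings
        day += timedelta(days=1)
    return series

# Illustrative dates only.
dates = [date(2016, 4, 25), date(2016, 4, 25), date(2016, 4, 27)]
series = daily_counts(dates)
```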

2. Role

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	Yes	No	Comments
Duplicate rows > 10%?	✓		25% Unique Rows
Missing values > 50% for one or more columns?		✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?		✓	Latest Date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?		✓

2.1. Metadata Summary

2.1.1. Data Dimensions

Number of Columns: 3

Number of Rows: 6,539,168

Date Range: 04-Mar-2019 to 30-Mar-2019

2.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

2.1.3. Data Subset

Sample 6 rows of the dataset.

2.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

2.1.5. Missing Data

There are no missing records present in the data.

2.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

2.2.1. Histogram and Statistical Summary - Continuous Variable(s)

There are no relevant continuous variables for which to plot histograms.

2.2.2. Frequency Counts - Categorical Variable(s)

2.2.2.1. Column - role

The plot below shows the top 10 and bottom 10 job roles.

There are about 1,667 different job roles in the data.

Top 10		Bottom 10
Role	Frequency	Role	Frequency
Sales/Marketing (all)	593,243	Application Assistant	3
Manager	505,026	Apprentice Plumber	3
Engineer	379,686	Behavioral Health Tech	3
Sales	335,743	Care Team Member	3
Associate	263,894	Carpenter Helper	3
Team Member	142,387	Chief Analytics Officer	3
Driver	140,769	Cutting Technician	3
Assistant	129,017	Desktop Support Administrator	3
salesperson	123,307	Dietetic Technician	3
service tech/mechanic	121,162	Disaster Recovery Manager	3

2.3. Time Series Analysis

The time series shows the number of jobs posted on each day of the month covered by the data.

3. Tag

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	Yes	No	Comments
Duplicate rows > 10%?	✓		33.33% Unique Rows
Missing values > 50% for one or more columns?		✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?		✓	Latest Date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?		✓

3.1. Metadata Summary

3.1.1. Data Dimensions

Number of Columns: 3

Number of Rows: 13,607,028

Date Range: 05-Mar-2019 to 30-Mar-2019

3.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

3.1.3. Data Subset

Sample 6 rows of the dataset.

3.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

3.1.5. Missing Data

There are no missing records present in the data.

3.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

3.2.1. Histogram and Statistical Summary - Continuous Variable(s)

There are no relevant continuous variables for which to plot histograms.

3.2.2. Frequency Counts - Categorical Variable(s)

3.2.2.1. Column - tag

The plot below shows the top 10 and bottom 10 job tags.

There are about 1,846 different tags in the data.

Top 10		Bottom 10
Tag	Frequency	Tag	Frequency
sales	459,128	CouchDB	2
Team	418,784	Dispatch Systems	2
Operations	324,657	Mother Baby	2
Management	295,872	Neuropsychiatry	2
Design	252,709	Organ Donation	2
Engineering	220,825	PeopleCode	2
Hiring	193,161	PeopleSoft HRMS	2
Customer Service	192,642	Pulse Ox	2
Training	161,920	Solving Equations	2
Lead	160,402	appexchange	3

3.3. Time Series Analysis

The time series shows the number of jobs posted on each day of the month covered by the data.

4. Timelog

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	Yes	No	Comments
Duplicate rows > 10%?	✓		20% Unique Rows
Missing values > 50% for one or more columns?		✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?		✓	Latest Date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?		✓

4.1. Metadata Summary

4.1.1. Data Dimensions

Number of Columns: 5

Number of Rows: 148,139,575

Date Range: 25-Apr-2016 to 30-Mar-2019

4.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

4.1.3. Data Subset

Sample 6 rows of the dataset.

4.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

4.1.5. Missing Data

4.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

4.2.1. Histogram and Statistical Summary - Continuous Variable(s)

There are no relevant continuous columns for which to plot histograms.

4.2.2. Frequency Counts - Categorical Variable(s)

4.2.2.1. Column - duration_days

The plot below shows the top 10 values for the number of days a listing was open.

There are about 285 distinct values for the number of days a listing was open.

4.3. Time Series Analysis

4.3.1. Column - post_date

The graph below shows the number of posts listed from 2016 to 2019.

4.3.2. Column - remove_date

The graph below shows the number of posts removed from 2016 to 2019.

5. Title

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	Yes	No	Comments
Duplicate rows > 10%?	✓		50% Unique Rows
Missing values > 50% for one or more columns?		✓	0 columns have missing values > 50%
Most recent update is more than 6 months old?		✓	Latest Date: 30-Mar-2019
Data contains PII (Personally Identifiable Information)?		✓

5.1. Metadata Summary

5.1.1. Data Dimensions

Number of Columns: 3

Number of Rows: 1,294,582

Date Range: 04-Mar-2019 to 30-Mar-2019

5.1.2. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

5.1.3. Data Subset

Sample 6 rows of the dataset.

5.1.4. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values

5.1.5. Missing Data

5.2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.

5.2.1. Histogram and Statistical Summary - Continuous Variable(s)

There are no relevant continuous variables for which to plot histograms.

5.2.2. Frequency Counts - Categorical Variable(s)

5.2.2.1. Column - Title

There are about 298,762 different titles in the data.

The table below shows the top 10 and bottom 10 headline titles that appear in job listings.

Top 10		Bottom 10
Title	Frequency	Title	Frequency
Assistant Manager	4,788	‘Back Up’ Nanny Needed For 1 Child In Brooklyn	2
Sales Associate	4,772	Customer Support Engineer - Lancope Stealthwatch	2
Server	4,212	Experienced With Newborn/infants - CPR Certified - Active..	2
Delivery Driver	3,768	Personal/Office Assistant (1 Day A Week)	2
Store Manager	2,964	Retail Support - Receiving Team Lead, Flex: Augusta Mall	2
Dishwasher	2,682	-Plant Shift Supervisor - 2nd Shift Memphis, TN	2
Cook	2,672	-Senior Software Development Manager	2
Assistant Store Manager	2,644	!! Restaurant positions open !!	2
General Manager	2,626	!!! CAREGIVER PART TIME DAY SHIFT 6AM-2PM Starting Wages $15..	2
Retail Sales Associate	2,536	!!! FULL TIME CAREGIVER PM Shift 2-10pm !!!	2

5.3. Time Series Analysis

The time series shows the number of jobs posted on each day of the month covered by the data.

Appendix 1: Data Summary

1 Univariate Analysis

Univariate Analysis involves the analysis of one variable at a time.

1.1 Histogram

A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.

1.2 Density Plot

A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
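To make the idea of kernel smoothing concrete, the sketch below evaluates a kernel density estimate by hand. It assumes a Gaussian kernel and a fixed, manually chosen bandwidth; the text above does not say which kernel or bandwidth the report uses, and the function name `gaussian_kde` and the sample values are illustrative.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Illustrative values only.
density = gaussian_kde([1.0, 2.0, 2.5, 4.0], bandwidth=0.5)

# Evaluate on a grid from -3 to 9; the area under the curve should be close to 1.
step = 0.05
grid = [i * step for i in range(-60, 181)]
area = sum(density(x) for x in grid) * step
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual observations more closely.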

1.3 QQ Plot

A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
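The points of a normal QQ plot can be computed with the standard library as sketched below. The plotting positions (i - 0.5) / n are one common convention and are an assumption here, as are the function name `qq_points` and the sample values; the reference distribution is a normal distribution fitted with the sample mean and standard deviation.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair theoretical normal quantiles with the sorted sample values."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n keep every probability strictly inside (0, 1).
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Illustrative values only.
points = qq_points([3.1, 2.4, 5.9, 4.2, 3.7, 4.8, 2.9, 4.4])
```

If the sample is approximately normal, the points fall close to the line y = x; systematic curvature suggests skew or heavy tails.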

2 Bivariate Analysis

Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.

2.1 Correlation

A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
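A correlation matrix of this kind can be sketched with the Pearson coefficient, which is assumed here because the text does not name a specific coefficient. The column names `x`, `y`, and `z` and their values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Return {name: {name: coefficient}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric columns.
data = {
    "x": [30000, 45000, 60000, 75000],
    "y": [40, 30, 20, 10],
    "z": [1, 3, 2, 5],
}
matrix = correlation_matrix(data)
```

The diagonal is 1 by construction and the matrix is symmetric, so only the upper or lower triangle needs to be inspected.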

3 Trend Charts

Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
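The monthly and yearly aggregation behind such charts can be sketched as follows. The function names `monthly_counts` and `yearly_counts` and the sample dates are assumptions made for illustration.

```python
from collections import Counter
from datetime import date

def monthly_counts(dates):
    """Return [((year, month), count)] sorted chronologically."""
    return sorted(Counter((d.year, d.month) for d in dates).items())

def yearly_counts(dates):
    """Return [(year, count)] sorted chronologically."""
    return sorted(Counter(d.year for d in dates).items())

# Illustrative dates only.
dates = [date(2018, 12, 30), date(2019, 1, 2), date(2019, 1, 15), date(2019, 3, 30)]
monthly = monthly_counts(dates)
yearly = yearly_counts(dates)
```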

Appendix 2: Data Quality Scorecard

Tresvista's Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.

Dimensions of Data Quality

A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.

1. Availability - Sufficient availability of data points

Availability means ensuring that enough data is available to end users and applications, when and where they need it, for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training, testing, and validating models.

2. Completeness - All required data is captured, and no data is missing

Completeness is measured as the percentage of records with non-NULL values. It can also be termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.

3. Uniqueness - Data redundancy should be avoided

Uniqueness requires that no duplicate records are reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.

4. Recency - Data is recent and not outdated

Recency is the degree to which information is current for the period in question. It measures how up-to-date the information is, and whether it remains correct despite possible time-related changes.

5. Consistency - Uniform data types across the column

It refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column level consistency).
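Two of the dimensions above, uniqueness and consistency, lend themselves to short checks. The sketch below is only an illustration of the definitions: the function names, the `job_id` key, and the sample values are hypothetical, and "consistent" is interpreted narrowly as every non-missing value in a column having the same Python type.

```python
from collections import Counter

def duplicated_keys(rows, key):
    """Return the key values that identify more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def type_consistency(values):
    """Report which value types occur in a column, ignoring missing values."""
    kinds = Counter(type(v).__name__ for v in values if v is not None)
    return {"types": dict(kinds), "consistent": len(kinds) <= 1}

# Illustrative values only.
repeats = duplicated_keys([{"job_id": 1}, {"job_id": 2}, {"job_id": 1}], key="job_id")
mixed = type_consistency([42000, "55k", None, 61000])
clean = type_consistency(["CA", "TX", None])
```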